# How Does Unitarity Conserve Information?

1. May 2, 2015

### Staff: Mentor

Leonard Susskind's video courses on classical mechanics and quantum mechanics often mention conservation of information. Susskind likes to call it "the minus first law."

In classical physics, it is Liouville's theorem that tells us the number of states is conserved under time evolution.

In quantum mechanics, time evolution is unitary, and under unitary evolution the number of possible states is likewise conserved.

Susskind said that unitarity in quantum mechanics is analogous to Liouville's theorem in classical mechanics. I'm having difficulty understanding that analogy. The Wikipedia article Unitarity (physics) didn't help, although it does say that unitarity means that the sum of the probabilities of all possible outcomes equals 1. Thus, if the evolution were not unitary, some possibilities would disappear or new possibilities would be created. Is that the right line of thought?

2. May 2, 2015

### Physics Monkey

The basic statements that flesh out the analogy are as follows.

Classical:

Requiring probabilities to add to one requires that the integral of the phase space distribution over all phase space must be one. Liouville's theorem implies that this integral is independent of time. So if the probabilities add to one at t=0 then they add to one at all later times.
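As a rough illustration of the classical statement, here is a minimal sketch that assumes a one-dimensional harmonic oscillator (my choice of example, as are the helper names `flow` and `polygon_area`). It evolves the corners of a small patch of phase space with the exact flow and checks that the enclosed area does not change. Because the flow is linear, the patch stays a polygon, so the shoelace formula gives its area exactly.

```python
import math

def flow(q, p, t, omega=1.0, m=1.0):
    """Exact flow of H = p^2/(2m) + m*omega^2*q^2/2 over a time t."""
    c, s = math.cos(omega * t), math.sin(omega * t)
    q_t = q * c + p * s / (m * omega)
    p_t = -m * omega * q * s + p * c
    return q_t, p_t

def polygon_area(points):
    """Shoelace formula for the area enclosed by a list of (q, p) vertices."""
    total = 0.0
    n = len(points)
    for i in range(n):
        q1, p1 = points[i]
        q2, p2 = points[(i + 1) % n]
        total += q1 * p2 - q2 * p1
    return abs(total) / 2.0

# Unit square of initial conditions in phase space (arbitrary units).
patch = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]
evolved = [flow(q, p, t=0.7, omega=1.3, m=2.0) for q, p in patch]

area_start = polygon_area(patch)
area_later = polygon_area(evolved)
print(area_start, area_later)
```

The patch gets rotated and sheared, but a uniform distribution on it keeps the same density and the same total probability, which is the content of the statement above for this special case.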

Quantum:

Requiring probabilities to add to one requires that the norm of the state vector (or more generally the trace of the density matrix) must be one. Unitarity implies that this norm is independent of time. So if the probabilities add to one at t=0 then they add to one at all later times.

Further consequences of these statements in both the classical and quantum case are that the entropies of the distributions are independent of time. For example, in the quantum case the von Neumann entropy is $S(\rho) = - \text{tr}(\rho \log \rho)$. Unitary evolution sends $\rho \rightarrow \rho^U = U \rho U^\dagger$ and you can check that $S(\rho) = S(\rho^U)$ (because S depends only on the eigenvalues of the density matrix).
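If you want to check this numerically, the following is a small sketch for a single qubit. The state `rho`, the angles `theta` and `phi`, and all helper names are assumptions chosen for illustration. It uses only the Python standard library and the closed-form eigenvalues of a Hermitian 2x2 matrix.

```python
import cmath
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Hermitian conjugate of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def trace(a):
    return (a[0][0] + a[1][1]).real

def eigenvalues(rho):
    """Eigenvalues of a Hermitian 2x2 matrix from its trace and determinant."""
    tr = trace(rho)
    det = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    root = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return [(tr - root) / 2.0, (tr + root) / 2.0]

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log rho), evaluated on the eigenvalues."""
    return -sum(p * math.log(p) for p in eigenvalues(rho) if p > 1e-15)

# Hypothetical mixed qubit state with eigenvalues 0.2 and 0.8.
rho = [[0.7, 0.2 + 0.1j], [0.2 - 0.1j, 0.3]]

# An arbitrary 2x2 unitary built from two angles.
theta, phi = 0.9, 0.4
c, s = math.cos(theta), math.sin(theta)
U = [[c, -s * cmath.exp(1j * phi)], [s * cmath.exp(-1j * phi), c]]

rho_U = matmul(matmul(U, rho), dagger(U))

print(trace(rho), trace(rho_U))
print(von_neumann_entropy(rho), von_neumann_entropy(rho_U))
```

Both the trace and the entropy should agree before and after the transformation up to rounding, since $U \rho U^\dagger$ has the same eigenvalues as $\rho$.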

Hope this helps.

3. May 2, 2015

### vanhees71

Well, first of all one has to define what is meant by information. I don't know in which sense Susskind uses this word, but from a modern point of view on statistical physics (or, even more generally, on the statistics of any kind of facts), one can analyze a situation from the point of view of "information theory". The central notion is a "measure for missing information". In physics, this is the missing information about the state of whatever system one is considering. From the point of view of statistics, the knowledge about a system is given by the probability distribution for the outcomes of observations on this system, and one can show that the definition of entropy given by Shannon in the context of statistical signal theory (electrical engineers love the corresponding lecture veeeeeeeery much ;-)) and by Jaynes in the context of statistical physics is a useful measure of the missing information about the state of a system.

In the context of quantum theory the Shannon-Jaynes entropy coincides with von Neumann's definition of entropy, which he (heuristically) derived from more conventional approaches, generalizing the idea of entropy (which was introduced by Clausius almost exactly 150 years ago) from classical thermodynamics and statistics to quantum theory. In quantum theory one describes the (incomplete) knowledge about the system's state by a statistical operator $\hat{\rho}$, which is by definition a positive semi-definite self-adjoint operator for which the trace of its products with at least those observable operators relevant to the problem in question exists (a so-called trace-class operator), with $\mathrm{Tr} \hat{\rho}=1$. Then the (von Neumann) entropy is defined by
$$S=-\mathrm{Tr} (\hat{\rho} \ln \hat{\rho}).$$
Evaluating the trace in the eigenbasis $|P_n \rangle$ of $\hat{\rho}$, where $P_n$ are the corresponding eigenvalues, one finds
$$S=-\sum_n P_n \ln P_n.$$
If $P_n=0$, one sets $P_n \ln P_n=0$ in this sum by definition (this is the limit of $x \ln x$ for $x \rightarrow 0^+$).
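As a quick sketch of how this sum behaves, including the convention for vanishing eigenvalues, one can evaluate it for a few hypothetical sets of eigenvalues $P_n$ (the function name `entropy` and the numbers are just illustrative assumptions):

```python
import math

def entropy(probs):
    """S = -sum_n P_n ln P_n, with the convention 0 * ln 0 = 0."""
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    # Skipping the terms with P_n = 0 implements the convention.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

pure = entropy([1.0, 0.0, 0.0])        # one eigenvalue 1, all others 0
mixed = entropy([0.5, 0.25, 0.25])     # partial knowledge
uniform = entropy([1 / 3, 1 / 3, 1 / 3])  # least knowledge for three outcomes

print(pure, mixed, uniform)
```

The first case gives zero missing information, and the uniform case gives the largest value possible for three outcomes, $\ln 3$.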

Now, one should note what "complete knowledge" means in the sense of quantum theory. You have complete knowledge about a quantum system if you know that it is prepared in a pure state, described by a ray in Hilbert space or, equivalently, by the projection operator $\hat{P}_{\psi}=|\psi \rangle \langle \psi|$, where $|\psi \rangle$ is a normalized Hilbert-space vector representing the pure state; this projection operator then is the statistical operator of the system. It has one eigenvalue 1 and all others 0, implying that $S=0$ for a pure state. As is well known, even then one has only probabilistic knowledge concerning the outcome of measurements of observables. Only the values of those observables for which $|\psi \rangle$ is an eigenstate of the corresponding self-adjoint operator are known with certainty; the value is then the eigenvalue of this operator for this eigenvector.

Now, concerning the time evolution. The mathematical time evolution of the states (i.e., the statistical operators describing them) and of the operators representing observables is pretty arbitrary, because it's defined only up to a unitary transformation of states and operators. Only the time evolution of observable quantities, like the probabilities to measure the values of one or more compatible observables, the expectation values of observables, etc., is uniquely defined. On the other hand, no matter how you choose the "picture of time evolution", there's always a unitary operator that describes the time evolution of a closed system, because then the statistical operator is time dependent only through its dependence on operators and not explicitly time dependent. Then it fulfills
$$\hat{\rho}(t)=\hat{C}(t,t_0) \hat{\rho}(t_0) \hat{C}^{\dagger}(t,t_0), \quad \hat{C}^{\dagger}(t,t_0)=\hat{C}^{-1}(t,t_0).$$
Since $\ln(\hat{C} \hat{\rho} \hat{C}^{\dagger})=\hat{C} (\ln \hat{\rho}) \hat{C}^{\dagger}$ for unitary $\hat{C}$ and the trace is cyclic, this immediately implies that
$$S(t)=-\mathrm{Tr}[\hat{\rho}(t) \ln \hat{\rho}(t)]=-\mathrm{Tr}[\hat{\rho}(t_0) \ln \hat{\rho}(t_0)] \; \Rightarrow\; S=\text{const}.$$
This, however, holds true only for an exact description of the time evolution of the quantum system. For macroscopic systems that is impossible in practice, and that's the reason why one uses statistical mechanics in the first place.

E.g., for a many-body system like a macroscopic amount of gas, one considers only "relevant" macroscopic quantities to describe the system's time evolution, and this makes it necessary to throw away a lot of information during the time evolution. This happens at some point in the derivation of the macroscopic equation of motion, e.g., when using a transport description (a la the Boltzmann equation for a dilute gas), where one describes only the single-particle distribution. Its exact equation of motion involves the two-particle distribution, including all correlations. The equation for the two-particle distribution in turn involves the three-particle distribution with all its correlations, and so on. This is known as the BBGKY hierarchy, which consists of a practically infinite tower of equations of motion for all the N-particle distribution functions, for which you cannot even write down the initial conditions, because it's just too much information to store!

Thus you throw out a lot of information: at one point you assume that over macroscopically resolvable time scales (on which the macroscopic quantities change) the rapid oscillations of many microscopic degrees of freedom average out, so that for the purpose of resolving only the macroscopic observables one can neglect the two-particle correlations and approximate the two-particle distribution by the product of two single-particle distribution functions. This leads to the usual collision term for $2 \rightarrow 2$ scattering processes in the Boltzmann equation. You can then define the macroscopic entropy for the corresponding approximation of the one-particle distribution function, which is a coarse-grained macroscopic quantity itself and thus not exactly the same as the microscopic von Neumann entropy. For this macroscopic entropy, you can prove the famous Boltzmann H-theorem, i.e., that this macroscopic entropy is not constant in time but can never decrease. The stationary (equilibrium) state of the macroscopic system is then defined as the maximum of the entropy.

The merit of the information-theoretical approach is that this "maximum entropy principle" is applicable not only to equilibrium states but in a much more general context. In a way, it answers the question of which probability distribution (statistical operator in quantum theory) one should choose when no such quantity is already known from other considerations and only some coarse information is given, e.g., when one can assume that one knows the total average energy of a (perhaps open) system of particles. Information theory gives a way to associate a probability distribution with this coarse information: one should choose the distribution that does not introduce any prejudice not justified by the given information. Thus, one has to choose the probability distribution that maximizes the entropy under the constraints of the given information. For the example with the average energy, you get the canonical distribution
$$\hat{\rho}_{\text{can}}=\frac{1}{Z} \exp(-\beta \hat{H}), \quad Z=\mathrm{Tr} \exp(-\beta \hat{H})$$
where the Lagrange multiplier $\beta$ turns out to be the inverse temperature of the corresponding system in equilibrium. That doesn't mean that the system really is in equilibrium, but the most plausible description, given only the average energy, is the canonical equilibrium distribution in this case.
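To see the principle at work in the simplest setting, here is a minimal sketch for a made-up three-level system (the spectrum `levels`, the target average energy, and the comparison distribution `p_other` are all assumptions for illustration). It fixes $\beta$ by bisection so that the canonical distribution reproduces the prescribed average energy, and then compares its entropy with that of another distribution having the same average energy.

```python
import math

def canonical(energies, beta):
    """Probabilities exp(-beta*E_n)/Z for a discrete spectrum."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, probs):
    return sum(e * p for e, p in zip(energies, probs))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def solve_beta(energies, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on beta; the canonical mean energy decreases as beta grows."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, canonical(energies, mid)) > target:
            lo = mid  # mean energy still too high: need a larger beta
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

levels = [0.0, 1.0, 2.0]  # hypothetical three-level spectrum
target = 0.6              # the only information given: the average energy

beta = solve_beta(levels, target)
p_can = canonical(levels, beta)

# Another normalized distribution with the same average energy of 0.6.
p_other = [0.5, 0.4, 0.1]

print(beta, p_can)
print(entropy(p_can), entropy(p_other))
```

The canonical distribution should come out with the larger entropy; any other distribution satisfying the same constraint encodes additional assumptions beyond the given average energy.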

4. May 2, 2015

### atyy

Jaynes? Wasn't the statistical definition of entropy due to Boltzmann and Gibbs? And the quantum version due to von Neumann?

5. May 2, 2015

### vanhees71

Usually one refers to the information-theoretically defined entropy as the "Shannon-Jaynes entropy". It turns out that it is indeed identical with the entropy due to Boltzmann and Gibbs for classical statistical physics and with that by von Neumann for quantum statistical physics. See also the rest of my previous posting!

6. May 2, 2015

### atyy

Yes, I understood your physics, which is fine. I just am asking about the name - why Jaynes? Surely Boltzmann-Gibbs-Shannon. Putting Jaynes in would be like saying F=ma is Halliday and Resnick's second law of motion.

I think after Shannon, there were others who discussed the relationship between the Shannon entropy and the Boltzmann-Gibbs entropy before Jaynes, e.g., Brillouin in 1956: http://books.google.com/books?id=DWo7lVRVnhcC&source=gbs_navlinks_s. Although I have no proof, I cannot imagine that von Neumann did not know very early that the Shannon and Boltzmann-Gibbs entropies were the same. There is, after all, the famous anecdote that it was von Neumann who told Shannon to call his term the "entropy".

http://www.spatialcomplexity.info/what-von-neumann-said-to-shannon
http://en.wikipedia.org/wiki/History_of_entropy#Information_theory

Last edited: May 2, 2015

7. May 2, 2015

### atyy

Does the information-theoretic approach really do such a thing? One reason I am skeptical is that it is not shown that the "entropy" is unique, in particular the Shannon entropy is only one member of a family of Renyi entropies. What principle picks the Shannon entropy over all other members?
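For concreteness, the family I have in mind can be sketched as follows, using the standard definition $H_\alpha = \ln\left(\sum_n p_n^\alpha\right)/(1-\alpha)$; the example distribution and the function names are just illustrative assumptions. The Shannon entropy is the $\alpha \rightarrow 1$ member, and the other members generally assign different values to the same distribution.

```python
import math

def shannon(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def renyi(probs, alpha):
    """Renyi entropy H_alpha = ln(sum_n p_n^alpha) / (1 - alpha)."""
    if alpha == 1.0:
        return shannon(probs)  # defined by the limit alpha -> 1
    return math.log(sum(p ** alpha for p in probs if p > 0.0)) / (1.0 - alpha)

probs = [0.5, 0.3, 0.2]  # hypothetical distribution

hartley = renyi(probs, 0.0)          # alpha = 0: ln(number of nonzero outcomes)
near_one = renyi(probs, 1.0 + 1e-6)  # numerically close to the Shannon value
collision = renyi(probs, 2.0)        # alpha = 2

print(hartley, shannon(probs), near_one, collision)
```

Since each member is a legitimate measure of spread, maximizing a different one under the same constraints can pick out a different distribution, which is the source of my question.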

8. May 3, 2015

### vanhees71

In some sense, the Boltzmann-Gibbs-von Neumann entropy is "unique" in the information-theoretical sense, but of course this uniqueness depends on the definition of what you understand by "missing information". For an introduction, see my (old) statistics manuscript

http://fias.uni-frankfurt.de/~hees/publ/stat.pdf

It's also clear that this traditional entropy principle does not give the correct equilibrium statistics in cases where long-range interactions are present and non-trivial correlations occur. The most prominent example is the universe, where the (unscreened!) long-range gravitational interaction leads to non-trivial clustering of matter and to structure formation in the form of galaxies, galaxy clusters, filaments, and so on.

For alternative entropy measures and statistics, see

C. Tsallis, Introduction to nonextensive statistical mechanics, Springer (2009)
http://dx.doi.org/10.1007/978-0-387-85359-8

9. May 3, 2015

### vanhees71

You have a point here, but mentioning Jaynes is quite common in the literature (both in original papers and textbooks), dealing with the information-theoretical approach.

10. May 3, 2015

### atyy

Yes, that's why I don't accept the maximum entropy principle as a principle. It is only a rule of thumb that is highly non-unique - even in the information theoretic sense there are the Renyi entropies, of which the Shannon entropy is only one.

Also, accepting maximum entropy as a principle is not consistent with physics: physics requires experiment to tell us what the right distribution is given incomplete information. Also, it is the underlying dynamics from which probability distributions must be derived, e.g., kinetic theory, or the more recent attempts to derive thermal distributions from quantum evolution, e.g., http://arxiv.org/abs/cond-mat/0511091, http://arxiv.org/abs/quant-ph/0511225.

However, despite the non-uniqueness, there is perhaps some physical principle for using one of the maximum entropy distributions as a guess. The idea is that if we have to make a guess, and the guess is to have any chance of being correct despite our lack of information, then the distribution should be "universal" or "stable" in some sense; e.g., the Levy and Cauchy distributions are also stable. The renormalization group of Wilson is closely related, where the convergence to a fixed point is analogous to stability. An interesting question is whether every distribution that is "stable" satisfies a maximum entropy principle. I don't know the full story, but some partial results are found in, e.g., http://www.stat.yale.edu/~arb4/publ...ndTheCentralLimitTheoremAnnalsProbability.pdf and http://www.santafe.edu/media/workingpapers/08-05-020.pdf.

Another interesting point is that the maximum entropy for continuous variables is meaningless, since it is dependent on the choice of coordinates, so one really has to talk about the relative entropy - the Shannon information is a type of relative entropy. Essentially the same thing happens in classical statistical mechanics, where one must state the preferred coordinates in which the entropy is maximized - the coordinates must be canonical - any choice of canonical coordinates is permitted, since the Jacobian for canonical transformations is just unity.
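A simple way to see the coordinate dependence is a sketch with Gaussians, using the standard closed-form expressions for their differential entropy and relative entropy. The particular means, widths, and the rescaling factor are assumptions for illustration, and a rescaling is of course only the simplest kind of coordinate change.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a normal density with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Relative entropy D(p1 || p2) between two normal densities."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

scale = 1000.0  # change of coordinate y = scale * x, e.g. metres to millimetres

# The differential entropy shifts by ln(scale) under the change of coordinate.
h_x = gaussian_entropy(1.0)
h_y = gaussian_entropy(scale * 1.0)

# The relative entropy between two densities is unchanged.
kl_x = gaussian_kl(0.0, 1.0, 0.5, 2.0)
kl_y = gaussian_kl(scale * 0.0, scale * 1.0, scale * 0.5, scale * 2.0)

print(h_y - h_x, math.log(scale))
print(kl_x, kl_y)
```

So "the entropy" of a continuous distribution only means something relative to a reference measure, whereas the relative entropy gives the same number in either coordinate.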

Last edited: May 3, 2015

11. May 4, 2015

### vanhees71

Of course, the entropy measure usually depends on a prior distribution. One should, e.g., know which is the correct phase-space measure for a gas of particles, and this is answered by quantum theory, where you can uniquely count the microstates of a macrostate, and it turns out that the correct measure is $\mathrm{d}^{3N} \vec{x} \mathrm{d}^{3N }\vec{p}/(2 \pi \hbar)^{3N}$.

Then, I have my objections against the Tsallis and Renyi entropies etc. precisely because they cannot be derived from dynamics, while the usual Boltzmann-Gibbs-von Neumann entropy has a firm ground in kinetic theory.

Also, if you measure the distribution, you don't need the maximum-entropy principle anymore to estimate one. The point of the maximum entropy principle is to find a distribution under the constraint of given information which introduces the least possible prejudices.

E.g., if you have a die and you don't know anything about it, your guess for the occurrence of "6" or any other number should be 1/6, according to the maximum entropy principle. Nobody tells you that this is the correct probability function. Now you can measure the distribution, and then you find some corrected probability function if the die is not fair (including an estimate of the uncertainty of the new distribution function, too).
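Here is a rough sketch of that procedure with simulated rolls. The loaded probabilities `true_probs`, the number of rolls, and the random seed are made-up assumptions, and the binomial standard error is just one simple choice for the uncertainty estimate.

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

faces = [1, 2, 3, 4, 5, 6]

# Maximum-entropy guess when nothing is known about the die.
prior = [1.0 / 6.0] * 6

# Hypothetical loaded die; the observer does not know these numbers.
true_probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

rng = random.Random(0)  # fixed seed so that the run is reproducible
n_rolls = 6000
rolls = rng.choices(faces, weights=true_probs, k=n_rolls)

# Measured relative frequencies and their binomial standard errors.
estimate = [rolls.count(face) / n_rolls for face in faces]
std_error = [math.sqrt(p * (1.0 - p) / n_rolls) for p in estimate]

for face, p, err in zip(faces, estimate, std_error):
    print(face, round(p, 4), "+/-", round(err, 4))
print(entropy(prior), entropy(estimate))
```

The measured distribution has lower entropy than the uniform guess, reflecting the information gained from the measurement.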

12. May 4, 2015

### atyy

But your replies show that this is not true. In order to choose the prior measure for the classical ideal gas, you needed to add quantum considerations, and to argue against the Renyi entropies, you needed to add dynamics. So if one is not adding quantum considerations or dynamics as additional constraints, there is nothing to say that the maximum entropy is the distribution of least possible prejudice.

13. May 5, 2015

### vanhees71

Sure, you always have to use some prior knowledge, including generally valid laws of physics like the density of states of an ideal gas for Boltzmann-Gibbs (or, in the semi-classical picture, Boltzmann-Uehling-Uhlenbeck) statistical physics. This becomes invalid when you have long-range forces, because then the asymptotic states are never free particles. This is solved within kinetic theory by the use of mean-field models (Vlasov equation), or you do a microcanonical simulation within (semi-)classical dynamics (as, e.g., in the famous Millennium Simulation of the universe to understand structure formation). The same holds for Renyi, Tsallis, and other statistics, where you choose another entropy measure to account for non-trivial correlations in systems where you cannot ignore them as in the Boltzmann-Gibbs case.

Another thing I have been thinking a lot about is the question of whether you can come up with a convincing derivation of classical statistics without using the quantum argument about the phase-space measure. As far as I know, there's none known, and indeed the phase-space measure contains the "quantum constant" $\hbar$, which also gives everything the correct dimensions. You also need a minimal "indistinguishability assumption" in classical statistics to resolve the Gibbs paradox. Boltzmann found this in an ad-hoc assumption, i.e., he figured out that one needs a factor $1/N!$ in his counting method to evaluate the number (or density) of microstates, where $N$ is the total number of gas particles. He had no explanation for it, except that it worked, leading to an extensive entropy expression (the Sackur-Tetrode formula in the case of an ideal gas). Nowadays it's clear from quantum theory that one has to count differently, depending on the bosonic or fermionic nature of the particles. Then you can take the classical limit for the case where the occupation numbers for the relevant microstates making up the macrostate are small, in the sense that you can neglect the $\pm 1$ in the denominator of the Fermi or Bose distribution, respectively.

Last but not least, any model in many-body physics is subject to empirical tests, i.e., in the end experiment decides whether a model is good or bad in describing a situation, and in this way we can learn about the microscopic workings behind the (semi-)classical behavior of macrosystems. That's all the more true for my field of research, heavy-ion collisions, where one tries to learn about the properties of strongly interacting matter, especially about the QCD phase diagram, from observations in heavy-ion collisions (at LHC, RHIC, GSI and in the future FAIR and NICA).
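The classical limit mentioned last can be illustrated with a few numbers. This is only a sketch: `x` stands for $\beta(E-\mu)$, and the sample values and helper names are arbitrary choices. It compares the Bose and Fermi mean occupation numbers with the Boltzmann factor that results from neglecting the $\pm 1$.

```python
import math

def bose(x):
    """Mean occupation 1/(exp(x) - 1), valid for x = beta*(E - mu) > 0."""
    return 1.0 / (math.exp(x) - 1.0)

def fermi(x):
    """Mean occupation 1/(exp(x) + 1)."""
    return 1.0 / (math.exp(x) + 1.0)

def boltzmann(x):
    """Classical limit exp(-x), obtained by dropping the +-1."""
    return math.exp(-x)

def relative_spread(x):
    """(Bose - Fermi) / Boltzmann: how much quantum statistics still matters."""
    return (bose(x) - fermi(x)) / boltzmann(x)

for x in (0.5, 2.0, 5.0, 10.0):
    print(x, fermi(x), boltzmann(x), bose(x), relative_spread(x))
```

For small occupation numbers (large `x`) the three expressions agree closely, while for `x` of order one the quantum statistics differ substantially from the classical one.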