Well, first of all one has to define what's meant by information. I don't know in which sense Susskind uses this word, but from a modern point of view on statistical physics (or, more generally, on statistics of whatever kind of facts), one can analyze a situation from the point of view of "information theory". The central notion is a "measure for missing information". In physics it is the missing information about the state of whatever system one is considering. From the point of view of statistics, the knowledge about a system is given by the probability distribution for the outcomes of observations on this system, and one can show that the definition of entropy given by Shannon in the context of statistical signal theory (electrical engineers love the corresponding lecture veeeeeeeery much ;-)) and by Jaynes in the context of statistical physics is a useful measure of the missing information about the state of a system.
In the context of quantum theory the Shannon-Jaynes entropy coincides with von Neumann's definition of entropy, which he (heuristically) derived from more conventional approaches, generalizing the idea of entropy (introduced by Clausius quite exactly 150 years ago these days) from classical thermodynamics and statistics to quantum theory. In quantum theory one describes the (incomplete) knowledge about the system's state by a statistical operator ##\hat{\rho}##, which is by definition a positive semi-definite self-adjoint operator that allows one to take the trace of its products with at least those observable operators relevant to the problem in question (a so-called trace-class operator), with ##\mathrm{Tr} \hat{\rho}=1##. Then the (von Neumann) entropy is defined by
$$S=-\mathrm{Tr} (\hat{\rho} \ln \hat{\rho}).$$
Evaluating the trace in the eigenbasis ##|P_n \rangle## of ##\hat{\rho}##, one finds
$$S=-\sum_n P_n \ln P_n.$$
If ##P_n=0##, one sets ##P_n \ln P_n=0## in this sum by definition.
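If you like to play with this numerically, here is a minimal sketch in Python/numpy (the particular density matrix is of course just chosen for illustration), evaluating ##S## from the eigenvalues ##P_n## with exactly this convention:

```python
import numpy as np

def von_neumann_entropy(rho):
    """S = -Tr(rho ln rho), evaluated via the eigenvalues P_n of rho,
    using the convention P_n ln P_n = 0 for P_n = 0."""
    p = np.linalg.eigvalsh(rho)      # eigenvalues of the self-adjoint statistical operator
    p = p[p > 1e-12]                 # drop (numerically) vanishing eigenvalues: 0 ln 0 := 0
    return -np.sum(p * np.log(p))

# equal mixture of two orthogonal states: S = ln 2
rho_mixed = np.diag([0.5, 0.5])
print(von_neumann_entropy(rho_mixed), np.log(2))
```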
Now one should note what "complete knowledge" means in the sense of quantum theory. You have complete knowledge about a quantum system if you know that it is prepared in a pure state, described by a ray in Hilbert space or, equivalently, by the projection operator ##\hat{P}_{\psi}=|\psi \rangle \langle \psi|##, where ##|\psi \rangle## is a normalized Hilbert-space vector representing the pure state, and this projection operator then is the statistical operator of the system. But then there is only one eigenvalue 1 and all others are 0, implying that ##S=0## for a pure state. Indeed, in the sense of quantum theory full knowledge about the state of the system is achieved if it is known to be prepared in a pure state. As is well known, even then only probabilistic knowledge is achieved concerning the outcome of measurements of observables. Only the values of those observables for which ##|\psi \rangle## is an eigenstate of the corresponding self-adjoint operator representing the observable are known, and each such value is the eigenvalue of this operator for this eigenvector.
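Continuing the little sketch from above (reusing the von_neumann_entropy helper defined there; the state vector is picked arbitrarily), one can check that a pure-state projector indeed gives ##S=0##:

```python
import numpy as np

# pure state |psi><psi| for some normalized vector |psi> (chosen arbitrarily here)
psi = np.array([1.0, 1.0j]) / np.sqrt(2)
rho_pure = np.outer(psi, psi.conj())                # projection operator P_psi = |psi><psi|

print(np.round(np.linalg.eigvalsh(rho_pure), 12))   # eigenvalues: one 1, the rest 0
print(von_neumann_entropy(rho_pure))                # 0.0 -> full knowledge in the quantum sense
```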
Now, concerning the time evolution. The mathematical time evolution of the states (i.e., of the statistical operators describing them) and of the operators representing observables is pretty arbitrary, because it's defined only up to a unitary transformation of states and operators. Only the time evolution of observable quantities, like the probabilities to measure the values of one or more compatible observables, the expectation values of observables, etc., is uniquely defined. On the other hand, no matter how you choose the "picture of time evolution", there's always a unitary operator that describes the time evolution of a closed system, because then the statistical operator is time dependent only through its dependence on operators and not explicitly time dependent. Then it fulfills
$$\hat{\rho}(t)=\hat{C}(t,t_0) \hat{\rho}(t_0) \hat{C}^{\dagger}(t,t_0), \quad \hat{C}^{\dagger}(t,t_0)=\hat{C}^{-1}(t,t_0).$$
This immediately implies that
$$S(t)=-\mathrm{Tr}[\hat{\rho}(t) \ln \hat{\rho}(t)]=-\mathrm{Tr}[\hat{\rho}(t_0) \ln \hat{\rho}(t_0)]=S(t_0) \; \Rightarrow\; S=\text{const}.$$
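Again only as an illustrative sketch (a random finite-dimensional Hermitian matrix stands in for the actual Hamiltonian, and the time step is arbitrary), one can verify numerically that a unitary time evolution leaves the von Neumann entropy unchanged:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

# some mixed statistical operator rho(t0): random probabilities on a fixed basis (for illustration)
p = rng.random(4)
rho0 = np.diag(p / p.sum())

# unitary time-evolution operator C(t, t0) = exp(-i H (t - t0)) with a random Hermitian "Hamiltonian"
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
H = (A + A.conj().T) / 2
C = expm(-1j * H * 0.7)                          # t - t0 = 0.7, arbitrary

rho_t = C @ rho0 @ C.conj().T                    # rho(t) = C rho(t0) C^dagger
print(von_neumann_entropy(rho0), von_neumann_entropy(rho_t))   # identical: S = const
```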
This, however, holds true only for an exact description of the time evolution of the quantum system. For macroscopic systems that is impossible in practice, and that's the reason why one uses statistical mechanics in the first place.
E.g., for a many-body system like a macroscopic amount of gas, one considers only "relevant" macroscopic quantities to describe the system's time evolution, and this makes it necessary to throw away a lot of information during the time evolution. Consider, e.g., a transport description (à la the Boltzmann equation for a dilute gas), where one only follows the single-particle distribution: its exact equation of motion involves the two-particle distribution, including all correlations. The two-particle distribution in turn involves the three-particle distribution with all its correlations, and so on. This is known as the BBGKY hierarchy, which practically involves infinitely many equations of motion for all the N-particle distribution functions, for which you cannot even write down the initial conditions, because it's just too much information you'd have to store! Thus you throw out a lot of information and at some place assume that over macroscopically resolvable time scales (on which the macroscopic quantities change) the rapid oscillations of many microscopic degrees of freedom average out, so that for the purpose of resolving only the macroscopic observables one can neglect the two-particle correlations and approximate the two-particle distribution as the product of two single-particle distribution functions. This leads to the usual collision term for ##2 \rightarrow 2## scattering processes in the Boltzmann equation. One then defines the macroscopic entropy in terms of this approximate one-particle distribution function, which is a coarse-grained macroscopic quantity itself and thus not exactly the same as the microscopic von Neumann entropy. For this macroscopic entropy you can prove the famous Boltzmann H-theorem, i.e., that this macroscopic entropy is not constant in time but can never decrease. The stationary (equilibrium) state of the macroscopic system is then defined by the maximum of this entropy.
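The Boltzmann equation itself is of course too much for a few lines, but the essential point, that a coarse-grained probabilistic description of the dynamics makes the entropy grow, can be illustrated by a toy stand-in of my own (not the derivation sketched above): a discrete master equation with a doubly stochastic transition matrix, for which the entropy of the distribution can never decrease.

```python
import numpy as np

def shannon_entropy(p):
    """-sum_n p_n ln p_n with the convention 0 ln 0 = 0."""
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

# coarse-grained "dynamics": stay with probability 1/2, hop to either neighboring cell with 1/4 each;
# the transition matrix T is doubly stochastic, so the entropy is non-decreasing step by step.
n = 6
P = np.roll(np.eye(n), 1, axis=0)                # cyclic shift (a permutation matrix)
T = 0.5 * np.eye(n) + 0.25 * P + 0.25 * P.T

p = np.zeros(n)
p[0] = 1.0                                       # start with complete knowledge: S = 0
for step in range(15):
    print(step, round(shannon_entropy(p), 4))    # grows monotonically toward the maximum ln(n)
    p = T @ p
```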
The merit of the information-theoretical approach is that this "maximum entropy principle" is not only applicable to equilibrium states but in a much more general context. In a way, it answers the question which probability distribution (statistical operator in quantum theory) one should choose when no such quantity is already known from other considerations, but only some coarse information is given; e.g., one may know the total average energy of a (perhaps open) system of particles. Information theory then gives a way to associate a probability distribution with this given coarse information: one should choose the distribution which does not introduce any prejudice that is not justified by the given coarse information. Thus one has to choose the probability distribution that maximizes the entropy under the constraints of the given information. For the example with the average energy, you get the canonical distribution
$$\hat{\rho}_{\text{can}}=\frac{1}{Z} \exp(-\beta \hat{H}), \quad Z=\mathrm{Tr} \exp(-\beta \hat{H})$$
where the Lagrange multiplier ##\beta## turns out to be the inverse temperature of the corresponding system in equilibrium. That doesn't mean that the system really is in equilibrium, but the most plausible description, given only the average energy, is the canonical equilibrium distribution in this case.
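As a last little sketch (with an arbitrary toy spectrum, not any particular physical system), one can determine the Lagrange multiplier ##\beta## numerically from the given average energy and build the corresponding canonical distribution:

```python
import numpy as np
from scipy.optimize import brentq

# toy energy levels (eigenvalues of some Hamiltonian) and a prescribed average energy
E_levels = np.array([0.0, 1.0, 2.0, 3.0])
E_target = 1.2

def mean_energy(beta):
    """Canonical average <H> = Tr(H exp(-beta H)) / Tr(exp(-beta H)) for the toy spectrum."""
    w = np.exp(-beta * E_levels)
    return np.sum(E_levels * w) / np.sum(w)

# solve <H>_beta = E_target for the Lagrange multiplier beta (the inverse temperature)
beta = brentq(lambda b: mean_energy(b) - E_target, 1e-6, 50.0)

p_can = np.exp(-beta * E_levels)
p_can /= p_can.sum()                             # canonical probabilities exp(-beta E_n) / Z
S_can = -np.sum(p_can * np.log(p_can))           # maximal entropy compatible with <H> = E_target
print(beta, p_can, S_can)
```

Any other distribution with the same average energy on this spectrum has a smaller entropy, which is just the maximum-entropy principle at work.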