Thinking about statistical physics becomes much clearer when you have an idea about information theory. In information theory, entropy is the "measure of missing information" under the constraint of what information is given about the system.
Thermal equilibrium is the state of maximum entropy, i.e., you maximize the entropy with the information given just about the additive conserved quantities.
In quantum statistical physics the entropy is given by
$$S=-k_{\text{B}} \mathrm{Tr} (\hat{\rho} \ln \hat{\rho}),$$
where ##\hat{\rho}## is the statistical operator. If you know that your system is in a state of definite energy ##E## ("microcanonical ensemble"). Let now ##|E,\alpha \rangle## be a set of orthonormal energy eigenstates with the energy eigenvalue ##E##. Since now the equilibrium state commutes with ##\hat{H}##, because it's time-independent, you can choose his orthonormal set to be the eigenvectors of ##\hat{\rho}##, i.e., you have
$$\hat{\rho}=\sum_{\alpha} p_{\alpha} |E,\alpha \rangle \langle E,\alpha|,$$
and the entropy is
$$S=-k_{\text{B}} p_{\alpha} \ln p_{\alpha},$$
and this must be maximized under the contraint that ##\mathrm{Tr} \hat{\rho}=\sum_{\alpha} p_{\alpha}=1##. This you do with a Lagrange multiplier in the variation of ##S##:
$$\delta S= -k_{\text{B}} \sum_{\lambda} \delta p_{\alpha} (\ln p_{\alpha} -1-\lambda)=0.$$
Now you can vary the ##p_{\alpha}## independently, and thus the bracket under the sum must vanish for each ##\alpha##. This implies that
$$\ln p_{\alpha} =1+\lambda=\text{const} \; \Rightarrow \; p_{\alpha}=p.$$
Further you must have
$$\sum_{\alpha} p=g p,$$
where ##g=\mathrm{dim} \text{Eig}(E)## is the degneracy of the energy eigenvalue. Thus you get ##p=1/g## and
$$\hat{\rho} = \frac{1}{g} \sum_{\alpha} |E,\alpha \rangle \langle E,\alpha|.$$
From the information-theoretical point of view that's pretty obvious: If you don't know anything else about the system than its energy, then you only know that it must be in an energy eigenstate with the corresponding eigenvalue, ##E##, and thus the choice of probabilities of the "minimal prejudice" (i.e., that of the maximal missing information or entropy) is such that none of these energy eigenstates is in any way distinguished from the others, i.e., they must all be equally probable.
The issue becomes a bit more complicated, if your Hamiltonian has a continuous spectrum. Then the grand-canonical ensemble must be defined such that the energy value is in some "small energy shell", ##E \in [E_0-\delta E/2,E+\delta E/2]##.