Can you give some further elaboration on what this means?
It's a rather long story, but I'll try to give some basic insight into what it means.
The classical measure of information is the Shannon entropy. Suppose a random variable X has a probability distribution given by p_X(j), where j runs over the possible values that X can take. The Shannon entropy is given by
H(X) = - \sum_j p_X(j) \log_2 p_X(j)
It measures the amount of information or "degree of surprise" that you will have when you do an experiment with outcomes distributed according to p_X. For example, suppose you toss a fair coin with the probabilities of getting heads and tails both equal to 1/2. Then, you know nothing about what the outcome of the coin toss will be, so you will be equally surprised if it is heads or tails. The entropy is 1 in this case. On the other hand, if you have a completely biased coin that only ever lands on heads, then you already know the outcome of every future coin toss, so you gain no information by tossing the coin. The entropy is 0 in this case.
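To make this concrete, here is a quick Python sketch (my own illustration, not part of the original answer) that evaluates H(X) directly from the formula above for a fair coin, a completely biased coin, and something in between. The function name shannon_entropy is just a convenient label.

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_j p(j) log2 p(j), with 0 log 0 taken to be 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: maximal surprise, H = 1 bit
print(shannon_entropy([0.5, 0.5]))   # 1.0

# Completely biased coin: the outcome is certain, H = 0 bits
print(shannon_entropy([1.0, 0.0]))   # 0.0

# A coin that lands heads 90% of the time: some surprise, but less than a fair coin
print(shannon_entropy([0.9, 0.1]))   # about 0.469
```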
Shannon entropy has an operational interpretation. If you are given N instances of a random variable X, then you can store the data in N H(X) bits and later recover it without any loss (although strictly this is only true in the limit N \rightarrow \infty). This is essentially what is going on in compression programs like WinZip or gzip that you might use on your computer.
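If you want to see this operational interpretation at work, the following sketch (again my own, with p and N chosen arbitrarily) compares the size achieved by zlib, which uses the same DEFLATE algorithm as gzip, on a long string of biased coin flips against the N H(X) benchmark. DEFLATE is not an optimal entropy coder, so expect it to land somewhat above the Shannon bound, but in the same ballpark.

```python
import math
import zlib
import numpy as np

rng = np.random.default_rng(0)

p = 0.1            # probability of heads (a heavily biased coin)
N = 1_000_000      # number of coin flips

# Generate N biased coin flips and pack them 8 to a byte
bits = (rng.random(N) < p).astype(np.uint8)
raw = np.packbits(bits).tobytes()

# Shannon entropy of a single flip, in bits
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

compressed = zlib.compress(raw, 9)

print(f"raw size:        {len(raw)} bytes")          # N / 8 bytes
print(f"Shannon bound:   {N * H / 8:.0f} bytes")     # N H(X) bits, expressed in bytes
print(f"zlib compressed: {len(compressed)} bytes")   # somewhat above the bound
```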
When you have two random variables X and Y, it is possible that they are correlated, e.g. they might actually be two copies of the same variable, or Y might be obtained by subjecting X to some noise process. In general, they will have a joint probability distribution p_{X,Y}(j,k) and a joint entropy
H(X,Y) = - \sum_{j,k}p_{X,Y}(j,k) \log_2 p_{X,Y}(j,k)
The marginal probability distributions are given by
p_X (j) = \sum_k p_{X,Y} (j,k)
p_Y (k) = \sum_j p_{X,Y} (j,k)
from which we can define the entropies of the individual random variables. The mutual information is given by
I(X:Y) = H(X) + H(Y) - H(X,Y)
and it can be interpreted as the amount of information we learn about Y by looking at the outcomes of X. This is the same as the amount of information we learn about X by looking at the outcomes of Y.
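Here is a small sketch (my own, using the noise-process example from above with an arbitrary flip probability of 0.1) that computes H(X), H(Y), H(X,Y) and I(X:Y) from a joint distribution, following the definitions exactly as written.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability array, in bits (0 log 0 treated as 0)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Joint distribution p_{X,Y}(j,k): X is a fair bit and Y is X sent through
# a noise process that flips it with probability 0.1.
flip = 0.1
p_xy = np.array([[0.5 * (1 - flip), 0.5 * flip],
                 [0.5 * flip,       0.5 * (1 - flip)]])

p_x = p_xy.sum(axis=1)   # marginal p_X(j) = sum_k p_{X,Y}(j,k)
p_y = p_xy.sum(axis=0)   # marginal p_Y(k) = sum_j p_{X,Y}(j,k)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy.flatten())
I_XY = H_X + H_Y - H_XY

print(f"H(X)   = {H_X:.3f} bits")    # 1.000
print(f"H(Y)   = {H_Y:.3f} bits")    # 1.000
print(f"H(X,Y) = {H_XY:.3f} bits")   # about 1.469
print(f"I(X:Y) = {I_XY:.3f} bits")   # about 0.531: X tells us roughly half a bit about Y
```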
For further details and justification of these definitions, look at any textbook on Information Theory (Cover and Thomas is my favourite).
Now, it can be shown that, whatever pair of measurements is performed on an entangled state, the statistics of the outcomes on one particle do not depend on which measurement is chosen for the other particle. In particular, the mutual information between the choice of measurement made on particle A and the outcome obtained on particle B is zero. Thus, no information is transmitted in such an experiment.
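You can check this numerically. The sketch below (my own illustration, with arbitrarily chosen measurement angles) takes the singlet state, lets A choose between two spin measurement directions, and shows that B's outcome statistics are identical for both choices, so the mutual information between A's choice and B's outcome is zero.

```python
import numpy as np

# Singlet state |psi> = (|01> - |10>) / sqrt(2)
psi = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)

def spin_projectors(theta):
    """Projectors for spin up/down along a direction at angle theta in the x-z plane."""
    up = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    down = np.array([-np.sin(theta / 2), np.cos(theta / 2)])
    return [np.outer(up, up), np.outer(down, down)]

def joint_probs(alpha, beta):
    """p(a, b) when A measures along alpha and B measures along beta on the singlet."""
    probs = np.zeros((2, 2))
    for a, Pa in enumerate(spin_projectors(alpha)):
        for b, Pb in enumerate(spin_projectors(beta)):
            probs[a, b] = psi @ np.kron(Pa, Pb) @ psi
    return probs

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

beta = 0.3                      # B's fixed measurement direction
settings = [0.0, np.pi / 3]     # two possible measurement choices for A, equally likely

# B's marginal distribution is (1/2, 1/2) whichever measurement A performs
for alpha in settings:
    print(f"A measures at {alpha:.2f}: B's marginal =", joint_probs(alpha, beta).sum(axis=0))

# Joint distribution of (A's choice, B's outcome), each choice made with probability 1/2
p_sb = np.array([0.5 * joint_probs(alpha, beta).sum(axis=0) for alpha in settings])
I = entropy(p_sb.sum(axis=1)) + entropy(p_sb.sum(axis=0)) - entropy(p_sb.flatten())
print(f"I(A's choice : B's outcome) = {I:.3f} bits")   # 0.000
```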
Note that information theory works at an entirely operational level. All these quantities are related to the amount of data (bits) that has to be transmitted to perform some communication task. Thus, it says nothing about whether the underlying physics of a process is local or non-local. However, note that quantum mechanics (at least the standard formalism) is operational too, in that it only tells us how to calculate probabilities of measurement outcomes and says nothing about what happens between measurements. Therefore, we can't really conclude anything about locality from experiments on entangled particles alone. Doing so requires the extra assumption of realism, i.e. that there are physical parameters that really exist and determine the outcomes of measurements, whether or not we actually make the measurement. Bell's theorem tells us that we have to throw out either locality or realism.
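Bell's theorem itself is not spelled out above, but for concreteness, here is a sketch (my own, reusing the singlet machinery from the previous example) of its best-known instance, the CHSH inequality: any local realistic model satisfies |S| <= 2 for the quantity computed below, whereas the singlet state with suitably chosen angles gives |S| = 2 sqrt(2).

```python
import numpy as np

# Singlet state and spin measurements along angles in the x-z plane, as before
psi = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)

def spin_projectors(theta):
    up = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    down = np.array([-np.sin(theta / 2), np.cos(theta / 2)])
    return [np.outer(up, up), np.outer(down, down)]

def correlator(alpha, beta):
    """E(alpha, beta) = <AB> with the two outcomes labelled +1 and -1."""
    return sum((-1) ** (a + b) * (psi @ np.kron(Pa, Pb) @ psi)
               for a, Pa in enumerate(spin_projectors(alpha))
               for b, Pb in enumerate(spin_projectors(beta)))

# Standard CHSH measurement angles
a1, a2 = 0.0, np.pi / 2
b1, b2 = np.pi / 4, 3 * np.pi / 4

S = correlator(a1, b1) - correlator(a1, b2) + correlator(a2, b1) + correlator(a2, b2)
print(abs(S))   # about 2.828 = 2 sqrt(2), above the local realistic bound of 2
```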
May I ask how the other particle (call it particle B) "picks" the opposite state? Is it just a matter of us assigning the opposite state to it after we measure the spin of particle A?
This is actually a very controversial question in the foundations of quantum theory. In some interpretations of the theory, the wavefunction only represents the knowledge of the experimenter about the quantum system. In those interpretations, it is just a matter of assigning a state. The problem is that the rules for state assignment are very different from those of classical statistics and proponents of these interpretations have to give some justification of why this is the case. They also have to explain how classical probability theory emerges on the scale of everyday objects. Both of these things are very problematic and there is no universal agreement on the best way to approach them.
On the other hand, some interpretations treat the wavefunction as an element of reality, i.e. it is not just a description, but really exists in the physical world. Such interpretations often run into problems when we try to combine them with relativity theory because of the nonlocality.