You are using "did better" in two different senses.
Comparing the z scores tells you that Cindy is ranked higher among those who took her test than Bobby is among those who took his. In that sense, Cindy "did better" than Bobby. She did better compared to her peers on the test she took than Bobby did compared to his peers on the test he took.
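As a minimal sketch of that comparison, with made-up numbers (the raw scores, class means and standard deviations below are purely hypothetical, not taken from the question), the z score is just the distance from the test's mean measured in that test's standard deviations:

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from its test's mean, in standard deviations."""
    return (raw - mean) / sd

# Hypothetical numbers, for illustration only.
bobby_z = z_score(raw=80, mean=75, sd=5)    # Bobby's test: high mean, small spread
cindy_z = z_score(raw=100, mean=70, sd=15)  # Cindy's test: lower mean, large spread

print(bobby_z, cindy_z)
```

With these invented figures Cindy's z score is the larger one, even though the mean on her test is the lower of the two.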
Comparing the raw means tells you that the people who took Bobby's test scored higher, on average, than those who took Cindy's test. However, the scores on Cindy's test are much more spread out, so some of its high performers probably scored better than the high performers on Bobby's test. That much is valid. If you knew the number of students in each class, you could compute the standard error of each mean and form a rough opinion on whether the difference is statistically significant (formally you'd want to do a lot more careful work, but the difference in means divided by the standard error of that difference is a finger-in-the-wind estimate). In that sense, Bobby's class did better on average than Cindy's class, although individuals will have scored higher or lower.
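One way that finger-in-the-wind estimate might look is sketched below. The class means, standard deviations and class sizes are hypothetical, and combining the two standard errors in quadrature assumes the two classes are independent; this is a rough check, not a formal test.

```python
import math

def rough_mean_difference_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference in class means divided by the standard error of that difference."""
    se_a = sd_a / math.sqrt(n_a)  # standard error of class A's mean
    se_b = sd_b / math.sqrt(n_b)  # standard error of class B's mean
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # assumes independent classes
    return (mean_a - mean_b) / se_diff

# Hypothetical numbers: Bobby's class (A) against Cindy's class (B).
ratio = rough_mean_difference_ratio(75, 5, 25, 70, 15, 25)
print(round(ratio, 2))
```

A ratio well above about 2 would hint at a real difference between the class averages; a value nearer 1 suggests the gap could easily be noise.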
Where you went wrong was introducing Bobby's and Cindy's individual scores into that. You can certainly talk about differences in the means of populations ("on average, my class did better than yours") or you can talk about individual differences ("I did better than you"), but you can't really mix them up like you did. You can talk about a population of differences only when the differences are well-defined. For example, consider weight before and after a weight loss treatment. We're interested in the pairs of measurements made on one person, and no other pairing would make sense: my before weight minus your after weight is meaningless.
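To make the weight loss case concrete, here is a small sketch with invented weights (in kg, hypothetical data). Each difference comes from one person's own before and after measurements, which is what makes the collection of differences a well-defined thing to do statistics on:

```python
from statistics import mean

# Hypothetical weights in kg; position i in both lists is the same person.
before = [82.0, 95.0, 70.0, 88.0]
after = [79.0, 93.0, 70.0, 85.0]

# The pairing is fixed by identity: person i's before minus person i's after.
diffs = [b - a for b, a in zip(before, after)]
mean_diff = mean(diffs)

print(diffs, mean_diff)
```

Shuffling one of the lists would still produce numbers, but they would no longer mean anything, because the pairing would be arbitrary.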
But when comparing class scores on a test, why would you pair up Bobby and Cindy and not Bobby and Diana? There might not even be equal numbers in the two classes to pair up. Bobby can certainly choose to compare his raw score and/or his z score to Cindy's. But there's no systematic way to pair up each person who took one test with one person who took the other, so there's no defined population of "difference in score" to do statistics on.