How to read Z-score histogram for variability?

  1. May 22, 2012 #1
    Hi All

    I would be most grateful for some pointers on this question.
    Question: There is a range of different brands of muesli bars with nutritional information, e.g. muesli bar A with variables such as vitamin, fat and potassium values, and so on. I have been asked to plot histograms of the variables. Since the variables have different scales, I normalised them using z-scores. I got a few histograms with outliers, one or two with gaps between the columns, and a few jumbled-up charts. I have tried to reclassify the charts with gaps, to no avail.
    Next, how do I tell from these histograms which variable has the largest variability?

    My understanding is that a data point with a z-score above 0 is above average, and that having outliers in the charts probably means that the standard deviation is large? Besides these guidelines, how do I interpret the graphs correctly for variability?

    Many thanks
  3. May 22, 2012 #2
    Hi Merci,

    I'm not sure if I quite follow you but anyway, if I understand correctly you just normalized some data and you want to study the resulting histograms.

    1. You say you tried to reclassify the gaps in your histogram. Why would you do that? If your histograms have gaps, that's OK; there is nothing wrong with gaps in a histogram. It just means no observations fell in those bins, because those values are not possible or are unlikely.
    2. If you want a rough idea of the variability in a histogram, just check how flat or sharp the "bell" is: the flatter, the more variance.
    3. An outlier is an unusual data point in the set. It does not mean the value is wrong, and you cannot immediately conclude that an outlier means a bigger variance either; you need to analyze outliers case by case.

    But anyway, since you are talking about brands of muesli, and I don't think you have a large number of those, the histogram might not be very informative to begin with.
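
    As a rough illustration of the point about gaps, here is a minimal Python sketch that bins a handful of z-scores and prints a text histogram. The eight z-scores and the bin width of 0.5 are made up for illustration and are not from the original question. With so few observations, empty bins between the bulk of the data and a single outlier are expected.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width=0.5):
    """Count how many values fall into each bin of the given width."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    lo, hi = min(counts), max(counts)
    # Keep the empty bins between the lowest and highest occupied bin,
    # so that gaps show up explicitly.
    return [(k * bin_width, counts.get(k, 0)) for k in range(lo, hi + 1)]

# Hypothetical z-scores for 8 brands (made up for illustration).
z_scores = [-1.2, -0.8, -0.7, -0.3, 0.1, 0.2, 0.4, 2.3]

for left_edge, count in histogram_counts(z_scores):
    print(f"{left_edge:5.1f} | {'#' * count}")
```

    The three empty bins before the last bar are just the distance between the outlier and the rest of the sample; re-binning with a wider `bin_width` would hide them but would not change the data.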
  4. May 22, 2012 #3


    Science Advisor

    Hey merci and welcome to the forums.

    In terms of the raw variance of the sample, you calculate this in the standard way. You can then, if your assumptions are correct, get an estimate for the underlying population variance using either classical methods (the chi-square distribution) or Bayesian methods. To understand the difference you need to consider ideas from hypothesis testing, which deal with getting the right answer as well as with Type I and Type II errors and what they mean qualitatively and quantitatively.
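
    For reference, "the standard way" is usually taken to mean the unbiased sample variance, i.e. the sum of squared deviations from the mean divided by n - 1. A minimal sketch, assuming made-up values for two of the variables (the numbers and variable names below are hypothetical):

```python
# Hypothetical nutrition values per bar (made up for illustration).
fat_g = [4.0, 5.5, 3.0, 6.5, 5.0]
potassium_mg = [120.0, 95.0, 150.0, 110.0, 135.0]

def sample_variance(values):
    """Unbiased sample variance: squared deviations summed, over (n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

for name, values in [("fat (g)", fat_g), ("potassium (mg)", potassium_mg)]:
    var = sample_variance(values)
    # The variance is in squared units; the sd is back in the natural units.
    print(f"{name}: variance = {var:.2f}, sd = {var ** 0.5:.2f}")
```

    The results are in the natural units of each variable (grams, milligrams), which is why they cannot be ranked against each other directly.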

    If all variables are considered independent and can't be related to each other then you will just have to deal in the natural units for Vitamins, Fat and Potassium and there is nothing else you can really do since you really have an apples and oranges comparison.

    The general standardization procedure (subtracting the mean and dividing the result by the standard deviation) will always give your data a mean of zero and a variance of one no matter what, and you can show this using the properties of expectation and variance. It's not just used for normally distributed data. Note that it only shifts and rescales the values; it does not change the shape of the distribution.
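
    This can be checked numerically. The sketch below uses Python's standard `statistics` module on a deliberately skewed, made-up sample (the calorie values are hypothetical). After standardizing, the mean is 0 and the sample standard deviation is 1, while the outlier is still an outlier.

```python
import statistics

def z_scores(values):
    """Standardize: subtract the sample mean, divide by the sample sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical, deliberately skewed data (made up for illustration).
calories = [90.0, 95.0, 100.0, 105.0, 110.0, 250.0]
z = z_scores(calories)

# Mean and sd of the z-scores, up to floating-point rounding.
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
# The largest z-score still sits far from the rest: the shape is unchanged.
print(round(max(z), 2))
```

    One consequence is that every z-scored variable has the same standard deviation by construction, so the widths of z-score histograms cannot by themselves show which original variable varied more.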

    Because you have standardized the new mean to be zero, zero corresponds to 'average', and thus values greater than zero are greater than average, just as you suspected.

    I would be careful though about standardizing your data because you want to interpret your data for what it is and keep the context of your dimensionality and scale in a way that you don't lose this important information. If you transform your data in the wrong way or use transformed data in the wrong way, then your analysis will get screwed up.
  5. May 25, 2012 #4
    Hi Viraltux & Chiro

    Thanks for helping. =)
    Viraltux: I have to use histograms. There is no other choice based on the question. Supposedly, a box plot would be a better choice for the comparison? A flatter bell means the curve is lower in height, correct (lower frequencies, wider spread)?

    Chiro: Yes, that thought had occurred to me before. Am I using the right method by normalising the data first, or should I instead leave the data alone? The problem is that after plotting one or two brands, I noticed that the scales are all different, e.g. the Vitamin 1 histogram and the Calories 1 histogram. From these, I have to analyse the variability. I then read somewhere on the net that if you have to compare different scales, it is better to normalise using the z-score method for easy reference. Am I doing the wrong thing now? Should I revert to the original data and plot them? ~ confused.
  6. May 25, 2012 #5


    Science Advisor

    Standardizing is useful for plotting when you want to get some of the features of your distribution visually.

    In terms of the scaling used, though, you need to put your data into context. For example, instead of the sample's own mean and variance, you might standardize your results using means and variances that correspond to particular characteristics, such as the 'average' intake per day for some class of people (children, adults, athletes, etc.), together with a variance that corresponds to each class.

    Doing this puts your data into perspective relative to something meaningful, in this case the recommended or actual intake for different classes of people.
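
    A minimal sketch of that idea, assuming you have a reference mean and standard deviation for each class of people. The function name `reference_z` and all reference figures below are hypothetical placeholders, not real dietary guidelines.

```python
def reference_z(value, ref_mean, ref_sd):
    """Score a value against an external reference mean and sd."""
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    return (value - ref_mean) / ref_sd

# Hypothetical reference figures, NOT real dietary guidelines:
# class of people -> (mean, sd) of fat intake in grams.
reference = {"adult": (10.0, 2.0), "child": (6.0, 1.5)}

fat_per_bar = 7.0  # hypothetical fat content of one bar, in grams
for group, (ref_mean, ref_sd) in reference.items():
    score = reference_z(fat_per_bar, ref_mean, ref_sd)
    print(f"{group}: {score:+.2f}")
```

    Unlike z-scores computed from the sample itself, these scores do not automatically have a standard deviation of 1, so differences in spread between variables remain visible.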

    Before you actually use this data in further analysis, you need to check with someone with statistical knowledge how to use it in those analyses. You could ask here for some advice on this, but ultimately you will have to make sure that your analyses, your data, and your assumptions and interpretations are sound.