How to Normalize a simulated dataset to fit the actual dataset?

  • Context: Undergrad
  • Thread starter: jothisadhana
  • Tags: Fit, Normalize
SUMMARY

This discussion focuses on normalizing a simulated dataset (Dataset B) to align it with a real-life dataset (Dataset A) containing numerical values from an experiment. The user seeks guidance on normalization techniques, specifically the calculation of z-scores and min-max scaling. The importance of normalization is emphasized for accurate comparison between the two datasets, particularly through the analysis of cumulative frequencies plotted in MS Excel. The conversation highlights the ambiguity of the term "normalize" and suggests that z-scores may be the most defensible approach if the data distributions are bell-shaped.

PREREQUISITES
  • Understanding of cumulative frequency analysis
  • Familiarity with MS Excel for data visualization
  • Knowledge of statistical concepts such as mean and standard deviation
  • Basic understanding of normalization techniques like z-scores and min-max scaling
NEXT STEPS
  • Learn how to compute z-scores for dataset normalization
  • Explore min-max scaling techniques for data transformation
  • Investigate the implications of data distribution shapes on normalization choices
  • Study how to effectively visualize and compare datasets using MS Excel
USEFUL FOR

Data analysts, statisticians, and researchers looking to compare simulated and real-life datasets through effective normalization techniques.

jothisadhana
Can someone tell me how I can 'normalize' my dataset?

My scenario is as follows.

I have two datasets, A (real-life data) and B (simulated data).

Dataset A contains 4 numerical values (from an actual experiment):
-> E.g. 4 leaves from a binary tree, each assigned one of the values 12.5, 13.5, 20.0 and 45.0.

Dataset B contains 40 numerical values (from a simulation done by the computer):
-> E.g. 40 leaves from a total of 10 binary trees where each tree produces 4 leaves with randomly assigned numerical values for each leaf.

For both datasets, I computed the cumulative frequencies and plotted them in MS Excel (cumulative frequency of leaf values vs. leaf value). The aim was to see how similar or different the two datasets are: the smaller the vertical displacement between the two plots, the less the datasets differ.
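The comparison described above can also be sketched outside Excel. The Python snippet below is a minimal illustration, not the poster's actual workflow: it builds the cumulative-frequency (empirical CDF) curve for each dataset and reports the largest vertical distance between the two curves, which is the quantity used in the two-sample Kolmogorov-Smirnov test. The values in `dataset_b` are a made-up stand-in for the simulation output, and the function names `ecdf` and `max_vertical_gap` are chosen here for illustration.

```python
from bisect import bisect_right
import random


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_vertical_gap(sample_a, sample_b):
    """Largest vertical distance between the two cumulative-frequency curves."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    # Both curves are step functions, so the largest gap occurs at a data point.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(f_a(x) - f_b(x)) for x in points)


dataset_a = [12.5, 13.5, 20.0, 45.0]  # the four values given in the question

# Hypothetical stand-in for the 40 simulated leaf values (assumption, not real data).
rng = random.Random(0)
dataset_b = [rng.uniform(0.0, 100.0) for _ in range(40)]

gap = max_vertical_gap(dataset_a, dataset_b)
print(f"largest vertical gap between the two curves: {gap:.3f}")
```

A gap of 0 means the two curves coincide; a gap of 1 means the two datasets do not overlap at all.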

I was instructed to normalize my data from Dataset B and re-plot the chart for a better comparison between set A and set B.

How can I do this, and why is it important?

An example based on the situation described here will help a great deal. Thanks in advance.
 
Unfortunately, "normalize" is an ambiguous instruction. It might mean converting each data value $v$ to its "z-score" by computing $\frac{v - \mu}{\sigma}$, where $\mu$ is the mean of the sample in question (real or simulated) and $\sigma$ is the standard deviation of that sample.
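As a rough sketch of the z-score reading, here is the calculation applied to the four Dataset A values from the question. The function name `z_scores` is made up for this example, and using the sample standard deviation (divisor n - 1) rather than the population version is an assumption; the thread does not say which one is intended.

```python
from statistics import mean, stdev


def z_scores(sample):
    """Convert each value v to (v - mu) / sigma using the sample's own mean and SD."""
    mu = mean(sample)
    # Sample standard deviation (n - 1 divisor); statistics.pstdev is the
    # population version. Which one is wanted is an assumption here.
    sigma = stdev(sample)
    return [(v - mu) / sigma for v in sample]


dataset_a = [12.5, 13.5, 20.0, 45.0]  # values from the question
z_a = z_scores(dataset_a)
print([round(z, 3) for z in z_a])
```

The same function would be applied separately to the 40 simulated values, using their own mean and standard deviation. In Excel, the equivalent per-cell formula would be along the lines of `=STANDARDIZE(v, AVERAGE(range), STDEV.S(range))`.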

It could mean something as simplistic as converting each data value $v$ to a sort of ranking by computing $\frac{v - v_{min}}{v_{max} - v_{min}}$, where $v_{max}$ and $v_{min}$ are, respectively, the max and min values in the sample.
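A minimal sketch of this min-max reading, again on the Dataset A values from the question. The function name `min_max_scale` is illustrative, and raising an error when all values are equal is a choice made here, since the formula would otherwise divide by zero.

```python
def min_max_scale(sample):
    """Map each value v to (v - v_min) / (v_max - v_min), so results lie in [0, 1]."""
    v_min, v_max = min(sample), max(sample)
    span = v_max - v_min
    if span == 0:
        # Every value is identical, so the formula would divide by zero.
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - v_min) / span for v in sample]


dataset_a = [12.5, 13.5, 20.0, 45.0]  # values from the question
scaled_a = min_max_scale(dataset_a)
print([round(s, 3) for s in scaled_a])
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value (such as 45.0 here) squeezes the remaining values toward one end of the scale.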

We'd have to know more about what the data and the simulation represent to know what makes sense (and we'd have to assume the person who told you to do this gave sensible advice!). If you use z-scores, you can probably defend that choice as a common meaning of "normalize". If both your histograms had a roughly bell-shaped appearance, I'd guess that this was what your advisor meant.
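One way to see why z-scores can help the comparison: they remove differences in location and scale, so whatever gap remains between the cumulative-frequency curves reflects the shape of the distributions. The sketch below is an illustration under stated assumptions, not the advisor's method. Its `dataset_b` is deliberately constructed as a shifted and stretched copy of Dataset A rather than real simulation output, both samples are z-scored with their own mean and population standard deviation, and the helper names are made up.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_scores(sample):
    """Standardize a sample with its own mean and population standard deviation."""
    mu, sigma = mean(sample), pstdev(sample)  # population SD is an arbitrary choice here
    return [(v - mu) / sigma for v in sample]


def max_vertical_gap(sample_a, sample_b):
    """Largest vertical distance between the two cumulative-frequency curves."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


dataset_a = [12.5, 13.5, 20.0, 45.0]  # values from the question

# Hypothetical simulated data: same shape as A, but shifted and stretched (40 values).
dataset_b = [2 * v + 10 for v in dataset_a] * 10

gap_before = max_vertical_gap(dataset_a, dataset_b)
gap_after = max_vertical_gap(z_scores(dataset_a), z_scores(dataset_b))
print(f"gap on raw values: {gap_before:.3f}")
print(f"gap after z-scoring both samples: {gap_after:.3f}")
```

On the raw values the curves look very different even though the two samples have the same shape; after standardizing, the gap shrinks. With real simulation output the gap would not vanish, and its size after normalization is what indicates a genuine difference in shape.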
 
