How can I calculate the median of a terabyte dataset efficiently?

  • Thread starter physical101
  • Start date
  • Tags
    Median
In summary, the conversation discusses methods for estimating the median of a large dataset without having to store all the data in memory. One method uses the P² algorithm from a 1985 paper, which stores only five marker values from the data set and uses a parabolic fit to update them as the dataset is processed. Another idea is to use an online clustering algorithm and compute the median of the clusters at the end. There is also a mention of the online variance algorithm from Knuth Vol. 2. Finally, one participant shares their experience with linear data approximation and the usefulness of using quantiles to reduce run time.
  • #1
physical101
Hi all

I am working with a huge dataset that is too large to store in memory all at once.

I have been looking at estimating the median of this dataset and have found things such as the median of medians, but these methods still require me to store a lot of the data.

For efficiency, I wanted to be able to derive something as basic as the median as the data is being processed. I found this paper:

http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

It's from way back, but it seems to be able to do what I want.

Basically this method stores only five values from the data set: the maximum, the minimum, the current median estimate, and a point below and a point above the median.

A parabolic fit is made between the median marker and the two points above and below it, and these values are updated as the dataset gets processed. In the examples the paper gives, they work through a solution to one of the equations they present.

It's on page 1085, equation 5. The example they provide is on page 1081. When using their formula I get 4.46 for the second-row, fourth-column entry of the new height matrix. Does anyone know where this extra 0.1 came from?
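For reference, here is my own rough Python transcription of what I take the parabolic adjustment (equation 5) and its linear fallback to be. The names are mine and I may well have mis-transcribed something, which could be where the 0.1 comes from:

```python
def parabolic_adjust(q, n, i, d):
    # Piecewise-parabolic (P^2) prediction for marker i being moved by d (+1 or -1).
    # q: the 5 marker heights, n: the 5 marker positions (0-indexed lists here).
    # This is my own transcription of the paper's formula, not their code.
    return q[i] + d / (n[i + 1] - n[i - 1]) * (
        (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
        + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
    )

def linear_adjust(q, n, i, d):
    # Linear fallback, used when the parabolic value would overshoot a neighbouring marker.
    return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
```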

Thanks for thinking!
 
  • #2
I've never done anything like this, so this idea might be silly, and it also requires that you know a bit about the data distribution...

Anyway, the idea I had was to use an online clustering algorithm, and then compute the median of the clusters in the end. That is, each cluster would have a "point count". When adding a point (a sample value), if the point is close enough to an existing cluster, it would add 1 to the closest cluster's point count. If it's not close to any existing cluster, a new cluster is added.

When computing the approximated median, the cumulative point counts are used to find the cluster that approximates the median.

The distance function (the distance between a sample and a cluster) needs to be adjusted for your expected data set, such as the effective "cluster radius", and maybe even made non-linear (logarithmic, for example).

From some quick tests here it seems to work quite well as long as you get the "cluster radius" approximately right. It also shouldn't be too hard to perform re-clustering with a larger cluster radius, which would make the algorithm somewhat adaptive; there is a rough sketch of the idea at the end of this post. I tested it on uniformly distributed data, log-normally distributed data and spiky data (90% uniform[0,1] and 10% exponentially distributed data multiplied by 100 to simulate a large spike).

Again, might be silly and might not work with the data distribution.
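Here is a rough one-dimensional sketch of what I mean in Python. The fixed radius and all the names are just placeholders; a real version would tune or re-cluster with a different radius:

```python
def approx_median_clusters(samples, radius=0.05):
    # Rough online 1-D clustering: each cluster is [centre, count].
    # A sample joins the nearest cluster if it is within `radius`,
    # otherwise it starts a new cluster of its own.
    clusters = []
    total = 0
    for x in samples:
        total += 1
        best = min(clusters, key=lambda c: abs(c[0] - x), default=None)
        if best is not None and abs(best[0] - x) <= radius:
            best[1] += 1
        else:
            clusters.append([x, 1])
    # Walk the clusters in value order until half the points have been seen;
    # that cluster's centre is the approximate median.
    clusters.sort(key=lambda c: c[0])
    half, seen = total / 2, 0
    for centre, count in clusters:
        seen += count
        if seen >= half:
            return centre
    return None  # empty input
```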
 
  • #3
Knuth Vol. 2 has algorithms for reading data linearly and calculating the mean, variance, standard deviation, etc., all on the fly. The algorithm reads a data stream, an array, or whatever.

It is called the online variance algorithm. See:

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
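A minimal sketch of the idea in Python (Welford's single-pass update; my own code, not Knuth's pseudocode):

```python
def online_mean_variance(stream):
    # Welford's single-pass algorithm: running mean and variance
    # without storing the data. Returns (count, mean, sample variance).
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance
```

Note that this gives mean and variance, not the median, but the same single-pass style is what the quantile estimators above are aiming for.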
 
  • #4
Hey all,

Turns out linear data approximation is a bad idea! It makes the code take six times as long to run as it otherwise would. I am now using just the quantiles with the P² method to see what I end up with; it severely reduces the run time (and my headache, hopefully).

Thank you for your replies and suggestions.

Duane
 
  • #5


I would first like to congratulate the poster on their efforts to find a way to estimate the median of such a large dataset. It is a common challenge when working with big data, and it is great to see someone actively seeking out different methods to address it.

Regarding the paper they found, it seems like a promising approach to efficiently calculate the median while only storing a small number of variables. However, I would suggest caution in relying solely on this method without thoroughly understanding its limitations and potential biases.

Regarding the question about the discrepancy in the calculation, it would be helpful to provide more context and information about the dataset and the calculations being performed. It is possible that the extra 0.1 is due to rounding errors or other factors. I would recommend reaching out to the authors of the paper or seeking out additional resources for clarification.

Overall, it is important to continue exploring and evaluating different methods for handling large datasets, but also to be critical and thorough in understanding their potential limitations and biases.
 

1. What is the median of a terabyte dataset?

The median of a terabyte dataset is the middle value when all the data points are arranged in ascending or descending order. It is the value that separates the higher half of the data from the lower half.

2. How do you calculate the median of a terabyte dataset?

To calculate the median of a terabyte dataset, first arrange all the data points in ascending or descending order. Then, if the number of data points is odd, the median is the middle value. If the number of data points is even, the median is the average of the two middle values.
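For a dataset that fits in memory, this is a one-liner in most languages; here is a small Python illustration (the numbers are just a toy example):

```python
import statistics

data = [7, 1, 5, 3, 9, 2]            # toy example
ordered = sorted(data)               # [1, 2, 3, 5, 7, 9]
# even number of points: average the two middle values -> (3 + 5) / 2 = 4.0
print(statistics.median(ordered))    # 4.0
```

For a terabyte dataset, this full sort is exactly the step that streaming estimators such as the P² algorithm discussed in the thread avoid.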

3. Why is the median used instead of the mean for a terabyte dataset?

The median is used instead of the mean for a terabyte dataset because it is less affected by extreme values or outliers. Since a terabyte dataset is a large dataset, it is more likely to have outliers that can significantly impact the mean value, making it less representative of the dataset as a whole. The median, on the other hand, is not affected by outliers, making it a more robust measure of central tendency for large datasets.

4. What does the median of a terabyte dataset tell us?

The median of a terabyte dataset tells us the central tendency of the dataset. It gives us an idea of the typical value in the dataset, and it is less affected by extreme values or outliers than the mean. Combined with other quantiles (for example, in the interquartile range), it can also help describe the spread or variability of the dataset.

5. How is the median used in data analysis?

The median is used in data analysis as a measure of central tendency to summarize a large dataset. It is often used in conjunction with other statistical measures, such as the mean and standard deviation, to better understand the distribution of the data. It is also useful for making comparisons between different datasets and identifying potential outliers or unusual data points.
