How can I calculate the median of a terabyte dataset efficiently?

AI Thread Summary
Calculating the median of a terabyte dataset efficiently requires methods that minimize memory usage. The approach discussed involves a parabolic fit based on storing only five values: the maximum, the minimum, the current median estimate, and one point on either side of it. An alternative method proposed is to use an online clustering algorithm that approximates the median by counting points within clusters, adjusting the cluster radius as needed. Initial tests indicate that this clustering method can be effective, though it may require fine-tuning based on the data distribution. Ultimately, the simpler quantile-based P² method was found to significantly reduce runtime compared to linear data approximation.
physical101
Hi all

I am working with a huge dataset that is too large to store in memory all at the same time.

I have been looking at estimating the median of this dataset and have found things such as median-of-medians, but these methods still require me to store a lot of the data.

For efficiency, I wanted to be able to compute something as basic as the median while the data is being processed. I found this paper:

http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

It's from way back, but it seemed to be able to do what I wanted it to do.

Basically this method stores only five values from the data set: the maximum, the minimum, the current median estimate, and one point below and one point above it.

A parabolic fit is made between the median marker and the two points above and below it, and these quantities are updated as the dataset is processed. In the examples the paper gives, they find a solution to one of the equations they present.
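For concreteness, here is my reading of that parabolic adjustment (the notation is mine: ##q_i## are the five marker heights, ##n_i## their current positions, and ##d = \pm 1## the direction the marker needs to move):

$$q_i \leftarrow q_i + \frac{d}{n_{i+1}-n_{i-1}}\left[(n_i-n_{i-1}+d)\,\frac{q_{i+1}-q_i}{n_{i+1}-n_i}+(n_{i+1}-n_i-d)\,\frac{q_i-q_{i-1}}{n_i-n_{i-1}}\right]$$

If this prediction would push ##q_i## past a neighbouring marker, the method falls back to linear interpolation between ##q_i## and that neighbour.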

It's on page 1085, equation 5. The example they provide is on page 1081. When using their formula I get 4.46 for the second-row, fourth-column entry of the new height matrix. Does anyone know where this extra 0.1 came from?
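In case it helps with checking the worked example by hand, below is a minimal Python sketch of the five-marker update as I understand it from the paper (p = 0.5 tracks the median; all class and variable names are mine, and this is meant for hand-checking marker heights, not production use):

```python
def p2_sign(d):
    return 1 if d > 0 else -1

class P2Quantile:
    """Sketch of the five-marker P-square estimator (p = 0.5 for the median)."""

    def __init__(self, p=0.5):
        self.p = p
        self.q = []                                 # marker heights
        self.n = [0, 1, 2, 3, 4]                    # actual positions (0-based)
        self.np = [0, 2 * p, 4 * p, 2 + 2 * p, 4]   # desired positions
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]     # desired-position increments

    def add(self, x):
        if len(self.q) < 5:                  # first five observations seed markers
            self.q.append(x)
            self.q.sort()
            return
        if x < self.q[0]:                    # new minimum
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:                 # new maximum
            self.q[4] = x
            k = 3
        else:                                # find cell with q[k] <= x < q[k+1]
            k = 0
            while x >= self.q[k + 1]:
                k += 1
        for i in range(k + 1, 5):            # markers above x shift right by one
            self.n[i] += 1
        for i in range(5):
            self.np[i] += self.dn[i]
        for i in (1, 2, 3):                  # adjust the three interior markers
            d = self.np[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = p2_sign(d)
                qn = self._parabolic(i, d)
                if not self.q[i - 1] < qn < self.q[i + 1]:
                    qn = self._linear(i, d)  # parabola overshot: go linear
                self.q[i] = qn
                self.n[i] += d

    def _parabolic(self, i, d):
        n, q = self.n, self.q
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def _linear(self, i, d):
        n, q = self.n, self.q
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    def value(self):
        if len(self.q) == 5:
            return self.q[2]                 # the middle marker tracks the median
        return sorted(self.q)[len(self.q) // 2] if self.q else None
```

Stepping through `add(x)` observation by observation and printing `self.q` might help pin down where the 4.46-versus-expected discrepancy comes from.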

Thanks for thinking!
 
I've never done anything like this, so this idea might be silly, and it also requires that you know a bit about the data distribution...

Anyway, the idea I had was to use an online clustering algorithm, and then compute the median of the clusters in the end. That is, each cluster would have a "point count". When adding a point (a sample value), if the point is close enough to an existing cluster, it would add 1 to the closest cluster's point count. If it's not close to any existing cluster, a new cluster is added.

When computing the approximated median, the point counts are used to find the cluster that best approximates the median.

The point distance function (computing the distance between two points) needs to be adjusted to your expected data set - in particular the effective "cluster radius" - and could even be made non-linear (logarithmic, for example).
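To make the idea concrete, here is a rough one-dimensional Python sketch (all names are mine; the radius is fixed, and the re-clustering mentioned below is left out):

```python
class OnlineClusterMedian:
    """Approximate the median by clustering samples on the fly.

    Each cluster keeps a representative center and a point count; a new
    sample joins the nearest cluster if it lies within `radius` of that
    cluster's center, otherwise it starts a new cluster.
    """

    def __init__(self, radius):
        self.radius = radius
        self.clusters = []  # list of [center, count]

    def add(self, x):
        # Find the nearest existing cluster (a linear scan keeps the sketch
        # simple; a sorted structure would be faster if many clusters form).
        nearest = min(self.clusters, key=lambda c: abs(x - c[0]), default=None)
        if nearest is not None and abs(x - nearest[0]) <= self.radius:
            nearest[1] += 1
        else:
            self.clusters.append([x, 1])

    def median(self):
        # Walk the clusters in center order, accumulating counts, and return
        # the center of the cluster that straddles the halfway mark.
        total = sum(count for _, count in self.clusters)
        seen = 0
        for center, count in sorted(self.clusters):
            seen += count
            if seen * 2 >= total:
                return center
        return None
```

Since only one (center, count) pair is kept per cluster, the error of the returned median is bounded by roughly the cluster radius.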

From some quick tests here it seems to work reasonably well as long as you get the "cluster radius" approximately right; it also shouldn't be too hard to re-cluster with a larger cluster radius, making the algorithm somewhat adaptive. I tested it on uniformly distributed data, log-normally distributed data, and spiky data (90% uniform [0, 1] and 10% exponentially distributed data multiplied by 100 to simulate a large spike).

Again, might be silly and might not work with the data distribution.
 
Knuth Vol 2 has algorithms for reading data linearly and calculating a mean, variance, standard deviation, etc., all on the fly. The algorithm reads a data stream, an array, or whatever.

It is called the online variance algorithm. See:

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
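For reference, the single-pass recurrence described on that page (Welford's algorithm) looks like this; a minimal sketch:

```python
def online_mean_variance(stream):
    """Welford's single-pass algorithm: running mean and variance without
    storing the data, numerically stabler than the naive sum-of-squares
    approach."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance
```

Note that these recurrences track moments (mean, variance), not order statistics like the median.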
 
Hey all,

Turns out linear data approximation is a bad idea! It makes the code take six times as long to run as it otherwise would. I am using just the quantiles with the P² method to see what I end up with - it severely reduces the run time (and my headache, hopefully).

Thank you for your replies and suggestions

Duane
 