I am working with a huge dataset that is too large to hold in memory all at once.

I have been looking at ways to estimate the median of this dataset and have found things such as the median of medians, but these methods still require me to store a lot of the data.

For efficiency, I wanted to be able to derive something as basic as the median as the data is being processed. I found this paper:

http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

It is from way back, but it seems to be able to do what I want.

Basically, this method only stores 5 values from the dataset: the minimum, the maximum, the current estimate of the median, a point below the median and a point above it.

A parabolic fit is made through the median and the two points above and below it, and these values are updated as the dataset gets processed. In the example the paper gives, they work through one of the equations they present.

It's on page 1085, equation 5. The example they provide is on page 1081. When using their formula I get 4.46 for the second-row, fourth-column entry of the new height matrix. Does anyone know where this extra 0.1 came from?
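For context, here is a minimal Python sketch of the update as I understand it, so it is clear what I am computing. It is my own reading of the piecewise-parabolic adjustment, not code from the paper. The names `parabolic`, `P2Quantile`, `want` and `step` are my own, and the fallback used before five values have arrived is just a crude assumption on my part.

```python
def parabolic(q_prev, q, q_next, n_prev, n, n_next, d):
    """Height predicted for a marker moved by d (+1 or -1) positions,
    using a parabola through the marker and its two neighbours."""
    return q + d / (n_next - n_prev) * (
        (n - n_prev + d) * (q_next - q) / (n_next - n)
        + (n_next - n - d) * (q - q_prev) / (n - n_prev)
    )


class P2Quantile:
    """Streaming estimate of the p-quantile using five markers (my sketch)."""

    def __init__(self, p=0.5):
        self.p = p
        self.q = []               # marker heights: min, below, target, above, max
        self.n = [1, 2, 3, 4, 5]  # actual marker positions
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]  # growth of desired positions

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:            # the first five observations seed the markers
            q.append(x)
            q.sort()
            return
        # Find the cell k that x falls into, extending min/max if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):  # markers above the cell shift right by one
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Nudge the three middle markers towards their desired positions.
        for i in (1, 2, 3):
            diff = self.want[i] - n[i]
            if (diff >= 1 and n[i + 1] - n[i] > 1) or \
               (diff <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if diff > 0 else -1
                cand = parabolic(q[i - 1], q[i], q[i + 1],
                                 n[i - 1], n[i], n[i + 1], d)
                if not q[i - 1] < cand < q[i + 1]:
                    # Parabola overshoots a neighbour: use a linear step instead.
                    cand = q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
                q[i] = cand
                n[i] += d

    def value(self):
        if not self.q:
            raise ValueError("no observations yet")
        if len(self.q) < 5:       # crude fallback (my assumption)
            return self.q[int(self.p * (len(self.q) - 1) + 0.5)]
        return self.q[2]
```

Usage is just `est = P2Quantile(0.5)`, then `est.add(x)` for each value as it streams past, and `est.value()` whenever the current median estimate is needed. Only the five heights and five positions are kept, which is the whole appeal for data that does not fit in memory.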

Thanks for thinking!