Large datasets - how to handle / alternatives

In summary, you have two large files of time data. You extract the first column of each to get two vectors, and then use a window to find the indices where a time in one vector falls within a certain window of a time in the other.
  • #1
SKT_90
Hello,

I am trying to do the following:

I have two inputs that are very large files - I extract the first column of each so I have two vectors of ~2-5mil by 1. This is timing information. The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.

Originally I was going to work in Matlab to do:

for j = 1:L1
    for i = 1:L2
        timediff = t1(j) - t2(i); % time difference
        td(j,i) = timediff;
    end
end

where L1, L2 are the lengths of the vectors and t1, t2 are the vectors. Then I would find the indices within my window.

This, however, will generate a massive matrix which MATLAB cannot handle. Is there any code which can? It is too big.

Alternatively, how can I solve this problem without the big matrix? Can I compare one element from one vector to all the elements of the other and save only the values within my window, so I don't have to make this big matrix?

Any help is appreciated!

Thanks!
 
  • #2
I am thinking of using ismembertol, but that match is not unique, and I want the values of vector 2 within the coincidence window after vector 1, so it is not an absolute comparison.
 
  • #3
When you say large, how large?

If MATLAB can't handle the array, then you will need to read parts in and process them.

Have you looked at Python and NumPy? And maybe pandas?

An SQL database comes to mind, giving you a lot of flexibility to manipulate and select your data.

A simpler solution would be a random-access binary file. The record index would correspond to one of your array indices. Or you could make a composite index from your indices and the dimensions of your array to find individual cells.

The random-file approach makes it tough to do matrix operations, though, like a matrix multiply. You'd be reading and writing a lot of data.
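Following the Python/NumPy route, a minimal sketch of the read-parts-in idea: process the first time column in chunks against the second, so only a slice is in play at once. The function name and chunk size are illustrative assumptions, and both time arrays are assumed sorted ascending:

```python
import numpy as np

def window_pairs_chunked(t1, t2, window, chunk=100_000):
    """Collect (i, j) index pairs where t2[j] lies within +/- window
    of t1[i]. Both arrays must be sorted ascending."""
    pairs = []
    for start in range(0, len(t1), chunk):
        block = t1[start:start + chunk]
        # binary searches give, per element, the open/close positions in t2
        lo = np.searchsorted(t2, block - window, side='left')
        hi = np.searchsorted(t2, block + window, side='right')
        for k, (l, h) in enumerate(zip(lo, hi)):
            for j in range(l, h):
                pairs.append((start + k, j))
    return pairs

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
print(window_pairs_chunked(t1, t2, 1.0))  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

If the files are too big to load at all, each chunk could be read straight from disk (e.g. with pandas' chunksize option) so memory use stays flat.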
 
  • #5
SKT_90 said:
This is timing information
Does that timing information happen to be monotonically increasing :rolleyes: ?
 
  • #7
BvU said:
Does that timing information happen to be monotonically increasing :rolleyes: ?
Good question. And is the "timing data" consecutive times or are they sets of elapsed times?
Your current method is too "brute force". In all cases, I think that the first step is to make sure that both sets are sorted by time. (Do not use your own "home-grown" sort algorithm. Modern sort utilities are amazingly efficient.) Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.
 
  • #8
FactChecker said:
Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.
Hi,

Thanks all for the replies. In response: the time information is already increasing, but it is not evenly spaced. I am not sure what you mean, since that would only give me indices for the first set, but I need the indices for the second set as well, and the two have different lengths.

Also, I am unsure how to go about what you’re suggesting above. Would I get many matrices this way?
 
  • #9
So for each component of the first vector you need 2 values: an index where the time window opens and a (higher) index where the time window closes.
Idem vice versa.
Four arrays is a lot better than a huge matrix...
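Sketched with NumPy's searchsorted (an illustration, not the thread's MATLAB code; a MATLAB analogue could use a binary search or histc): two small index arrays per vector instead of an L1-by-L2 matrix.

```python
import numpy as np

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
window = 1.0

# for each element of t1: where its window opens and closes in t2
open1  = np.searchsorted(t2, t1 - window, side='left')
close1 = np.searchsorted(t2, t1 + window, side='right')  # one past the last match

# idem vice versa: where each element of t2's window sits in t1
open2  = np.searchsorted(t1, t2 - window, side='left')
close2 = np.searchsorted(t1, t2 + window, side='right')

print(open1, close1)  # [0 1 3] [1 3 4] -> e.g. t1[1]=3.7 matches t2[1:3]
```

Both searches run in O(N log M) on the sorted arrays, with no large intermediate matrix.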
 
  • #10
Hi BvU,

I agree that four arrays are better than one big one. I just do not know how to implement what you mean. I need the indices corresponding to values of the first vector within a window of the second vector. Is what you're suggesting giving me values of the first vector within a window of itself?
 
  • #11
[Attached screenshot: Excel example]

Excel example: window size is ##\pm 1##
  • Column Vector 1: rand() + value of the cell above.
  • Column Vector 2: idem.
  • Column Vlookup -1: each cell looks up "value of the cell to its left - 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
  • Column Vlookup +1: each cell looks up "value of the cell to its left in Vector 1 + 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2.
 

  • #12
SKT_90 said:
Hi BvU,

I agree that 4 is better than one big one. I just do not know how to implement what you mean. I need indices corresponding to values of first vector within window of second vector. is what you're suggesting giving me values of first vector within window of itself?
The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one by one, through the second list, search through the first list to find the indices that meet your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)
 
  • #13
BvU said:
Excel example: window size is ##\pm 1##
  • Column Vector 1: rand() + value of the cell above.
  • Column Vector 2: idem.
  • Column Vlookup -1: each cell looks up "value of the cell to its left - 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
  • Column Vlookup +1: each cell looks up "value of the cell to its left in Vector 1 + 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2.
Hi BvU,

I need unique values, so once a value is matched with one, it doesn't match with another.
 
  • #14
FactChecker said:
The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one by one, through the second list, search through the first list to find the indices that meet your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)
Hi FactChecker,

This is similar to what I was doing, but instead of (i,j) I would get (i,1), j times? That would be too many arrays, wouldn't it?
 
  • #15
SKT_90 said:
I need unique values, so once a value is matched with one, it doesn't match with another.
Asking for unique values contradicts the time-window concept.

Why don't you post a complete specification instead of knitting the bits of info together one after the other?
 
  • #16
SKT_90 said:
The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.
Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector) :biggrin:
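If nearest-value matching is indeed all that's wanted, a possible sketch (Python/NumPy assumed; t2 sorted, with at least two elements):

```python
import numpy as np

def nearest_indices(t1, t2):
    """Index in t2 of the value nearest each element of t1 (t2 sorted)."""
    pos = np.searchsorted(t2, t1)          # insertion points
    pos = np.clip(pos, 1, len(t2) - 1)     # keep both neighbours in range
    left, right = t2[pos - 1], t2[pos]
    # pick whichever neighbour is closer (ties go left)
    return np.where(t1 - left <= right - t1, pos - 1, pos)

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
print(nearest_indices(t1, t2))  # [0 2 3]
```

One array per direction, as promised; run it again with the arguments swapped for the other vector.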
 
  • #17
BvU said:
Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector) :biggrin:
Hi BvU,

Yes, sorry if my initial explanation was not clear. I am trying this in MATLAB, where I have vector 1 and vector 1 + window.

Then I search for the indices of vector 2 that lie between these values, running through each value of vector 1 and vector 1 + window.

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again
 
  • #18
It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.
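A sketch of that start-where-you-left-off scan, combined with the simplest tie-breaking rule (each second-array element is consumed by the first element that matches it; that rule is an assumption, since the post leaves the choice open):

```python
def greedy_unique_pairs(t1, t2, window):
    """Pair each element of t1 with the first unused element of t2
    within +/- window. Both lists sorted ascending; single pass."""
    pairs = []
    j = 0                       # pick up where the last search left off
    for i, t in enumerate(t1):
        while j < len(t2) and t2[j] < t - window:
            j += 1              # too early for this and every later t1 value
        if j < len(t2) and t2[j] <= t + window:
            pairs.append((i, j))
            j += 1              # consume the match so it can't pair again
    return pairs

print(greedy_unique_pairs([0.5, 3.7, 9.0], [0.4, 2.8, 3.9, 10.0], 1.0))
# [(0, 0), (1, 1), (2, 3)]
```

This also gives the uniqueness asked for earlier in the thread, at the price of a fixed first-come tie-break.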
 
  • #19
Ibix said:
It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.
Hi,

Thank you for your response. If I want to save the first index in the second array that is the best match, how would I break out of the search after that? Just to get the code working first; then I can change it afterwards if it is not a good match.

Thanks!
 
  • #20
SKT_90 said:
how would I break after that?
If you are asking about Matlab syntax, what did Google tell you? If you are asking something else I'm afraid I don't understand the question.
 
  • #21
SKT_90 said:
Hi BvU,

Yes, sorry if my initial explanation was not clear. I am trying this in MATLAB, where I have vector 1 and vector 1 + window.

Then I search for the indices of vector 2 that lie between these values, running through each value of vector 1 and vector 1 + window.

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again
MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.
 
  • #22
FactChecker said:
MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.

Hi, yes, they are increasing values, but the sizes of the arrays differ quite a bit. I am trying to do an implicit loop, but I cannot get it to work. Do you mean by setting up the index as a function?
 
  • #23
SKT_90 said:
Hi, yes, they are increasing values, but the sizes of the arrays differ quite a bit. I am trying to do an implicit loop, but I cannot get it to work. Do you mean by setting up the index as a function?
I can't think of a way to do it. But if there is a way, it will probably run much faster.
 

1. What are the challenges of handling large datasets?

The main challenges of handling large datasets include storage, processing speed, and analysis. Large datasets can require a significant amount of storage space, making it difficult to store and access them. Processing speed can also be a challenge, as it can take a long time to perform calculations and analyses on large datasets. Additionally, analyzing and making sense of large datasets can be complex and time-consuming.

2. What are some common techniques for handling large datasets?

Some common techniques for handling large datasets include data compression, parallel processing, and data partitioning. Data compression reduces the size of the data, making it easier to store and access. Parallel processing involves breaking up the data into smaller chunks and processing them simultaneously, which can speed up processing time. Data partitioning involves dividing the data into smaller subsets, making it easier to manage and analyze.

3. What are the benefits of using alternative methods for handling large datasets?

Alternative methods for handling large datasets, such as cloud computing and distributed databases, offer several benefits. These methods often have more storage capacity, faster processing speeds, and better scalability compared to traditional methods. They can also reduce costs and improve data security.

4. What are some potential alternatives to traditional databases for handling large datasets?

Some potential alternatives to traditional databases for handling large datasets include NoSQL databases, Hadoop, and Spark. NoSQL databases are designed specifically for large datasets and can handle unstructured data more efficiently. Hadoop is a framework that allows for distributed processing of large datasets across clusters of computers. Spark is a data processing engine that can handle large datasets in-memory for faster processing.

5. How can data visualization be used to handle large datasets?

Data visualization is an essential tool for handling large datasets. It allows for the representation of complex data in a visual format, making it easier to identify patterns and trends. By using data visualization techniques, scientists can quickly gain insights from large datasets and communicate their findings more effectively.
