# Large datasets - how to handle / alternatives

#### SKT_90

Hello,

I am trying to do the following:

I have two inputs that are very large files - I extract the first column of each so I have two vectors of ~2-5mil by 1. This is timing information. The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.

Originally I was going to work in Matlab to do:

for j = 1:L1
for i = 1:L2
timediff = (1t(j)-1t(i)); %time diff
td(j,i) = timediff;
end
end

where L1, L2 are lengths of the vectors, 1t,2t are the vectors. Then I would find the indices within my window.

This however will generate a massive matrix which Matlab cannot handle. is there even a code which can? it is too big.

alternatively, how can I come to a solution for this problem without the big matrix? Can I compare one element from one vector to all elements of another vector and only save the value in my window so I don't have to make this big matrix?

any help is appreciated!

thanks!

Related Programming and Computer Science News on Phys.org

#### SKT_90

I am thinking to use ismembertol but this is not unique and I want values of vector2 within coincidence window after vector1, so not absolute comparison

#### jedishrfu

Mentor
When you say large how large?

If matlab cant handle the array then you will need to read parts in and process them.

Have you looked at python and numpy? And maybe pandas?

A sql database comes to mind giving you a lot of flexibility to manipule and select your data.

A simpler solution would be a random access binary file. The revord index would correspond to one of your array indices. Or you could make a composite index from your indices and the dimensions of your array to find individual cells.

The random file approach makes it tough to do matrix operations though like a matrix multiply. Youd be reading and writing a lot of data.

Mentor

#### BvU

Homework Helper
This is timing information
Does that timing information happen to be monotonically increasing ?

#### jedishrfu

Mentor
The timing may not be linearly spaced though.

#### FactChecker

Gold Member
2018 Award
Does that timing information happen to be monotonically increasing ?
Good question. And is the "timing data" consecutive times or are they sets of elapsed times?
Your current method is too "brute force". In all cases, I think that the first step is to make sure that both sets are sorted by time. (Do not use your own "home-grown" sort algorithm. Modern sort utilities are amazingly efficient.) Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.

#### SKT_90

Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.
Hi,

Thanks all for the replies. In response, the time information is already increasing but it is not evenly spaced. I am not sure what you mean since that would only give me indices for the first set, but I need the indices for the second set as well and the two have different lengths.

Also, I am unsure how to go about what you’re suggesting above. Would I get many matrices this way?

#### BvU

Homework Helper
So for each component of the first vector you need 2 values: an index where the time window opens and a (higher) index where the time window closes.
Idem vice versa.
Four arrays is a lot better than a huge matrix...

#### SKT_90

Hi BvU,

I agree that 4 is better than one big one. I just do not know how to implement what you mean. I need indices corresponding to values of first vector within window of second vector. is what you're suggesting giving me values of first vector within window of itself?

#### BvU

Homework Helper

Excel example: window size is $\pm 1$
Column Vector 1 is rand() + value of cell above.
Vector2 idem.
Column Vlookup -1 A cell looks up "value of cell to its left - 1" in vector 2
Column Match looks up position of value in cell to its left in vector 2
Column Vlookup +1 A cell looks up "value of cell to its left in Vector 1 + 1" in vector 2
Column Match looks up position of value in cell to its left in vector 2
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2

#### Attachments

• 36.8 KB Views: 104

#### FactChecker

Gold Member
2018 Award
Hi BvU,

I agree that 4 is better than one big one. I just do not know how to implement what you mean. I need indices corresponding to values of first vector within window of second vector. is what you're suggesting giving me values of first vector within window of itself?
The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one-by-one, through the second list, search through the first list to find the indices that meat your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)

#### SKT_90

View attachment 240096
Excel example: window size is $\pm 1$
Column Vector 1 is rand() + value of cell above.
Vector2 idem.
Column Vlookup -1 A cell looks up "value of cell to its left - 1" in vector 2
Column Match looks up position of value in cell to its left in vector 2
Column Vlookup +1 A cell looks up "value of cell to its left in Vector 1 + 1" in vector 2
Column Match looks up position of value in cell to its left in vector 2
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2
Hi BvU,

I need unique values, so once matched with 1 it doesn't match with another.

#### SKT_90

The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one-by-one, through the second list, search through the first list to find the indices that meat your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)
Hi FactChecker,

This is similar to what i was doing but instead of (i,j) i would get (i,1) j times? that would be too many arrays wouldn't it?

#### BvU

Homework Helper
I need unique values, so once matched with 1 it doesn't match with another.

Why don't you post a complete specification instead of knitting the bits of info together one after the other ?

#### BvU

Homework Helper
The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.
Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector)

#### SKT_90

Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector)
Hi BvU,

Yes sorry if my initial explanation was not clear. I am trying this in matlab where I have vector 1; and vector 1+window

Then I search for indices of vector 2 that lie between these values, running through each value for vector 1, vector 1+ window

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again

#### Ibix

It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.

#### SKT_90

It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.
Hi,

Thank you for your response. If I want to save the first index that is the best match in the second array, how would I break after that? Just to get the code working first and then I can change afterwards if it is not a good match.

Thanks!

#### Ibix

how would I break after that?
If you are asking about Matlab syntax, what did Google tell you? If you are asking something else I'm afraid I don't understand the question.

#### FactChecker

Gold Member
2018 Award
Hi BvU,

Yes sorry if my initial explanation was not clear. I am trying this in matlab where I have vector 1; and vector 1+window

Then I search for indices of vector 2 that lie between these values, running through each value for vector 1, vector 1+ window

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again
MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.

Last edited:

#### SKT_90

MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.
Hi, yes they are increasing values, but the size of the arrays differs quite a bit. I am trying to do an implicit loop but I cannot get it to work. Do you mean by setting up the index as a function?

#### FactChecker

Gold Member
2018 Award
Hi, yes they are increasing values, but the size of the arrays differs quite a bit. I am trying to do an implicit loop but I cannot get it to work. Do you mean by setting up the index as a function?
I can't think of a way to do it. But if there is a way, it will probably run much faster.

"Large datasets - how to handle / alternatives"

### Physics Forums Values

We Value Quality
• Topics based on mainstream science
• Proper English grammar and spelling
We Value Civility
• Positive and compassionate attitudes
• Patience while debating
We Value Productivity
• Disciplined to remain on-topic
• Recognition of own weaknesses
• Solo and co-op problem solving