Large datasets - how to handle / alternatives

In summary, you have two large files of time data. You extract the first column of each to get two vectors, and then use a window to find the indices where a time in one vector falls within a certain window of a time in the other.
  • #1
SKT_90
Hello,

I am trying to do the following:

I have two inputs that are very large files - I extract the first column of each so I have two vectors of ~2-5mil by 1. This is timing information. The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.

Originally I was going to work in Matlab to do:

for j = 1:L1
    for i = 1:L2
        timediff = t1(j) - t2(i); % time difference
        td(j,i) = timediff;
    end
end

where L1, L2 are the lengths of the vectors and t1, t2 are the vectors. Then I would find the indices within my window.

This, however, will generate a massive matrix which MATLAB cannot handle. Is there any code which can? It is too big.

Alternatively, how can I solve this problem without the big matrix? Can I compare one element from one vector to all the elements of the other and save only the values within my window, so I don't have to make this big matrix?

Any help is appreciated!

Thanks!
 
  • #2
I am thinking of using ismembertol, but that match is not unique, and I want the values of vector 2 within the coincidence window after vector 1, so it is not an absolute comparison.
 
  • #3
When you say large, how large?

If MATLAB can't handle the array, then you will need to read parts in and process them.

Have you looked at Python and NumPy? And maybe pandas?

An SQL database comes to mind, giving you a lot of flexibility to manipulate and select your data.

A simpler solution would be a random-access binary file. The record index would correspond to one of your array indices. Or you could make a composite index from your indices and the dimensions of your array to find individual cells.

The random-file approach makes it tough to do matrix operations, though, like a matrix multiply. You'd be reading and writing a lot of data.
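Following the Python/NumPy route, a minimal sketch of the read-parts-in idea: process the first time column in chunks against the second, so only a slice is in play at once. The function name and chunk size are illustrative assumptions, and both time arrays are assumed sorted ascending:

```python
import numpy as np

def window_pairs_chunked(t1, t2, window, chunk=100_000):
    """Collect (i, j) index pairs where t2[j] lies within +/- window
    of t1[i]. Both arrays must be sorted ascending."""
    pairs = []
    for start in range(0, len(t1), chunk):
        block = t1[start:start + chunk]
        # binary searches give, per element, the open/close positions in t2
        lo = np.searchsorted(t2, block - window, side='left')
        hi = np.searchsorted(t2, block + window, side='right')
        for k, (l, h) in enumerate(zip(lo, hi)):
            for j in range(l, h):
                pairs.append((start + k, j))
    return pairs

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
print(window_pairs_chunked(t1, t2, 1.0))  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

If the files are too big to load at all, each chunk could be read straight from disk (e.g. with pandas' chunksize option) so memory use stays flat.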
 
  • #5
SKT_90 said:
This is timing information
Does that timing information happen to be monotonically increasing :rolleyes: ?
 
  • #7
BvU said:
Does that timing information happen to be monotonically increasing :rolleyes: ?
Good question. And is the "timing data" consecutive times or are they sets of elapsed times?
Your current method is too "brute force". In all cases, I think that the first step is to make sure that both sets are sorted by time. (Do not use your own "home-grown" sort algorithm. Modern sort utilities are amazingly efficient.) Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.
 
  • #8
FactChecker said:
Then step through the first set and for each one, search through the second set for the list of indices matching the criteria. Save the list of matching indices in a string or structure, indexed by the index of the first list. Use some lower/upper search-limiting numbers to reduce the searching through the second set and update them as appropriate.
Hi,

Thanks all for the replies. In response: the time information is already increasing, but it is not evenly spaced. I am not sure what you mean, since that would only give me indices for the first set, but I need the indices for the second set as well, and the two have different lengths.

Also, I am unsure how to go about what you’re suggesting above. Would I get many matrices this way?
 
  • #9
So for each component of the first vector you need 2 values: an index where the time window opens and a (higher) index where the time window closes.
Idem vice versa.
Four arrays is a lot better than a huge matrix...
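Sketched with NumPy's searchsorted (an illustration, not the thread's MATLAB code; a MATLAB analogue could use a binary search or histc): two small index arrays per vector instead of an L1-by-L2 matrix.

```python
import numpy as np

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
window = 1.0

# for each element of t1: where its window opens and closes in t2
open1  = np.searchsorted(t2, t1 - window, side='left')
close1 = np.searchsorted(t2, t1 + window, side='right')  # one past the last match

# idem vice versa: where each element of t2's window sits in t1
open2  = np.searchsorted(t1, t2 - window, side='left')
close2 = np.searchsorted(t1, t2 + window, side='right')

print(open1, close1)  # [0 1 3] [1 3 4] -> e.g. t1[1]=3.7 matches t2[1:3]
```

Both searches run in O(N log M) on the sorted arrays, with no large intermediate matrix.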
 
  • #10
Hi BvU,

I agree that four arrays are better than one big one. I just do not know how to implement what you mean. I need the indices corresponding to values of the first vector within a window of the second vector. Is what you're suggesting giving me values of the first vector within a window of itself?
 
  • #11
[Attached screenshot: Excel example]

Excel example: window size is ##\pm 1##
  • Column Vector 1: rand() + value of the cell above.
  • Column Vector 2: idem.
  • Column Vlookup -1: each cell looks up "value of the cell to its left - 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
  • Column Vlookup +1: each cell looks up "value of the cell to its left in Vector 1 + 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2.
 

  • #12
SKT_90 said:
Hi BvU,

I agree that 4 is better than one big one. I just do not know how to implement what you mean. I need indices corresponding to values of first vector within window of second vector. is what you're suggesting giving me values of first vector within window of itself?
The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one by one, through the second list, search through the first list to find the indices that meet your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)
 
  • #13
BvU said:
Excel example: window size is ##\pm 1##
  • Column Vector 1: rand() + value of the cell above.
  • Column Vector 2: idem.
  • Column Vlookup -1: each cell looks up "value of the cell to its left - 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
  • Column Vlookup +1: each cell looks up "value of the cell to its left in Vector 1 + 1" in vector 2.
  • Column Match: the position of the value in the cell to its left in vector 2.
Example: 3.7 in vector 1 yields positions 5 and 9 in vector 2.
Hi BvU,

I need unique values, so once a value is matched with one, it doesn't match with another.
 
  • #14
FactChecker said:
The approach I described would do the opposite, but you can just switch the two arrays to get what you want. As you progress, one by one, through the second list, search through the first list to find the indices that meet your criteria and collect a set/structure of them for that second-list item. Then move on to the next item on the second list.

But @BvU 's suggestion is a good one. Just keep track of the smallest and largest index in the first list that meets your criteria. (If none do, set the smallest and largest to a negative number.)
Hi FactChecker,

This is similar to what I was doing, but instead of (i,j) I would get (i,1), j times? That would be too many arrays, wouldn't it?
 
  • #15
SKT_90 said:
I need unique values, so once a value is matched with one, it doesn't match with another.
Asking for unique values contradicts the time-window concept.

Why don't you post a complete specification instead of knitting the bits of info together one after the other?
 
  • #16
SKT_90 said:
The goal is to extract the indices of each vector where the time in one vector is within a certain window of the other.
Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector) :biggrin:
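If nearest-value matching is indeed all that's wanted, a possible sketch (Python/NumPy assumed; t2 sorted, with at least two elements):

```python
import numpy as np

def nearest_indices(t1, t2):
    """Index in t2 of the value nearest each element of t1 (t2 sorted)."""
    pos = np.searchsorted(t2, t1)          # insertion points
    pos = np.clip(pos, 1, len(t2) - 1)     # keep both neighbours in range
    left, right = t2[pos - 1], t2[pos]
    # pick whichever neighbour is closer (ties go left)
    return np.where(t1 - left <= right - t1, pos - 1, pos)

t1 = np.array([0.5, 3.7, 9.0])
t2 = np.array([0.4, 2.8, 3.9, 10.0])
print(nearest_indices(t1, t2))  # [0 2 3]
```

One array per direction, as promised; run it again with the arguments swapped for the other vector.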
 
  • #17
BvU said:
Lemme guess: you simply want the index of the nearest value in the other vector. That's only one array (per vector) :biggrin:
Hi BvU,

Yes, sorry if my initial explanation was not clear. I am trying this in MATLAB, where I have vector 1 and vector 1 + window.

Then I search for the indices of vector 2 that lie between these values, running through each value of vector 1 and vector 1 + window.

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again
 
  • #18
It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.
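A sketch of that start-where-you-left-off scan, combined with the simplest tie-breaking rule (each second-array element is consumed by the first element that matches it; that rule is an assumption, since the post leaves the choice open):

```python
def greedy_unique_pairs(t1, t2, window):
    """Pair each element of t1 with the first unused element of t2
    within +/- window. Both lists sorted ascending; single pass."""
    pairs = []
    j = 0                       # pick up where the last search left off
    for i, t in enumerate(t1):
        while j < len(t2) and t2[j] < t - window:
            j += 1              # too early for this and every later t1 value
        if j < len(t2) and t2[j] <= t + window:
            pairs.append((i, j))
            j += 1              # consume the match so it can't pair again
    return pairs

print(greedy_unique_pairs([0.5, 3.7, 9.0], [0.4, 2.8, 3.9, 10.0], 1.0))
# [(0, 0), (1, 1), (2, 3)]
```

This also gives the uniqueness asked for earlier in the thread, at the price of a fixed first-come tie-break.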
 
  • #19
Ibix said:
It's easy to find the first item in your second array that's after each item in your first array because they are sorted. Find the answer for the first element in your first array, and the answer for the second element must be the same or later. The answer for the third element must be the same or later than that for the second element. So you don't have to search the entire second array each time - just start where you left off last time.

Your major problem is sorting out how to break ties, where the same element in the second array is the best match for several elements in the first one. There are several ways to do this, but you need to decide what you want to happen.
Hi,

Thank you for your response. If I want to save the first index in the second array that is the best match, how would I break out of the search after that? Just to get the code working first; then I can change it afterwards if it is not a good match.

Thanks!
 
  • #20
SKT_90 said:
how would I break after that?
If you are asking about Matlab syntax, what did Google tell you? If you are asking something else I'm afraid I don't understand the question.
 
  • #21
SKT_90 said:
Hi BvU,

Yes, sorry if my initial explanation was not clear. I am trying this in MATLAB, where I have vector 1 and vector 1 + window.

Then I search for the indices of vector 2 that lie between these values, running through each value of vector 1 and vector 1 + window.

This is taking very long and I still have no result. Is there a more efficient way to do this with such large data files?

Thank you again
MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.
 
  • #22
FactChecker said:
MATLAB can be astonishingly slow when you use loops as in your first post. If you find a way to use MATLAB's implicit looping, it can be much faster. The amount of data is not especially large as long as you do not create a two-dimensional matrix as in your first post. It should run fairly fast.

I think that you still have not given an answer to the question of whether the times in each list are strictly increasing. That is very important to know. I think that we are assuming that they are.

Hi, yes, they are increasing values, but the sizes of the arrays differ quite a bit. I am trying to do an implicit loop, but I cannot get it to work. Do you mean by setting up the index as a function?
 
  • #23
SKT_90 said:
Hi, yes, they are increasing values, but the sizes of the arrays differ quite a bit. I am trying to do an implicit loop, but I cannot get it to work. Do you mean by setting up the index as a function?
I can't think of a way to do it. But if there is a way, it will probably run much faster.
 

1. What are the challenges of handling large datasets?

The main challenges of handling large datasets include storage, processing speed, and analysis. Large datasets can require a significant amount of storage space, making it difficult to store and access them. Processing speed can also be a challenge, as it can take a long time to perform calculations and analyses on large datasets. Additionally, analyzing and making sense of large datasets can be complex and time-consuming.

2. What are some common techniques for handling large datasets?

Some common techniques for handling large datasets include data compression, parallel processing, and data partitioning. Data compression reduces the size of the data, making it easier to store and access. Parallel processing involves breaking up the data into smaller chunks and processing them simultaneously, which can speed up processing time. Data partitioning involves dividing the data into smaller subsets, making it easier to manage and analyze.

3. What are the benefits of using alternative methods for handling large datasets?

Alternative methods for handling large datasets, such as cloud computing and distributed databases, offer several benefits. These methods often have more storage capacity, faster processing speeds, and better scalability compared to traditional methods. They can also reduce costs and improve data security.

4. What are some potential alternatives to traditional databases for handling large datasets?

Some potential alternatives to traditional databases for handling large datasets include NoSQL databases, Hadoop, and Spark. NoSQL databases are designed specifically for large datasets and can handle unstructured data more efficiently. Hadoop is a framework that allows for distributed processing of large datasets across clusters of computers. Spark is a data processing engine that can handle large datasets in-memory for faster processing.

5. How can data visualization be used to handle large datasets?

Data visualization is an essential tool for handling large datasets. It allows for the representation of complex data in a visual format, making it easier to identify patterns and trends. By using data visualization techniques, scientists can quickly gain insights from large datasets and communicate their findings more effectively.
