The discussion centers on optimizing a MySQL database by summing data into strings instead of rows, with the goal of reducing query times. It is noted that strings cannot be added together, and using strings for storage may actually increase space requirements due to NULL characters. Participants suggest using arrays and dynamic memory allocation for more efficient data handling, while also considering the processing of data outside the database before storage. The complexity of the dataset, which includes billions of records, raises concerns about performance, leading to suggestions for alternative data processing methods, including MATLAB. Ultimately, the consensus leans towards avoiding raw data storage in a database due to the sheer volume and complexity of the queries required.
I have a MySQL database, and the problem is that queries take far too much time. I want to optimize the database, and one way would be to store data in a string instead of in separate rows (the string would hold the replicates of the same condition).
Each entry in the string is a 4-second interval; to reduce the number of rows further I want to sum the data into 30-minute intervals. How can I sum strings? I'd like to use PHP for that. Should I push the replicates into an array?
NULL values may be present
To illustrate:
time 1: 0,1,0,1,1,1,0
time 2: 1,1,0,,0,1,1
You cannot add strings together; there is no data structure that supports arithmetic on strings, and fixed- and floating-point arithmetic cannot operate on string data directly.
Also, storing data in strings will not save time or space. In fact, it takes additional space for the terminating NULL character. A better way would be to use dynamic memory allocation and pointers.
Also, if a NULL value has the same effect as 0, why not replace it with a 0?
I'd rather write a C program to operate on the arrays.
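For illustration, here is the element-wise idea using the example strings from the first post, sketched in Python for brevity (the same logic applies in C or PHP): split each string into an array, sum position by position, and let an empty field (NULL) make the summed position NULL so the interval can later be excluded.
Code:
# Sum two replicate strings element-wise; empty fields stand for NULL.
time1 = "0,1,0,1,1,1,0"
time2 = "1,1,0,,0,1,1"

def split_values(s):
    return [None if v == "" else int(v) for v in s.split(",")]

summed = []
for a, b in zip(split_values(time1), split_values(time2)):
    summed.append(None if a is None or b is None else a + b)

print(summed)  # [1, 2, 0, None, 1, 2, 1]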
Thanks for your reply! It would save space in that I only have to declare the conditions once, as in:
A B C D E #1
A B C D E #2
A B C D E #3
A B C D E #4
would become
A B C D E #1#2#3#4 (SELECT will be a lot faster)
I've dumped everything in one table, since otherwise I'd end up with at least 350 tables that would slow down queries as well.
NULL won't have the same effect as 0, if there are NULL values in a 30-minute segment I need to exclude that segment due to missing data (I'm monitoring activity over time and sometimes test subjects (fruitflies) die).
Dynamic memory allocation... I did change the MySQL config settings between normal, large and huge database. I think that changes the way memory is handled? It didn't make a difference. Also memory is not limiting in the query, the CPU does go up considerably.
Pointers... I must admit I'm not sure what you mean by that. I've worked with arrays, but I've never really mastered them and I've never used them as a database field.
#4 Kholdstare:
You can never add strings anyway, so put your data into arrays and operate on those. Pointers help you do that efficiently, but I think you can do without them.
However, if the data is meant for permanent storage, why not do all the processing on it before putting it into SQL? You could write a small program with an input and an output file: dump the data into the input file with some predefined formatting (standard ones are CSV or tab-separated), run the program once, get the processed output file, and copy it into the SQL database. If you can write wrappers and have SQL libraries, you can even automate the process of copying the output data into the database.
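For example, a minimal sketch of such a preprocessing step, assuming a CSV input with one 4-second reading per line and an empty field for a NULL (the real input format isn't shown in this thread; Python here, but any language works):
Code:
# Sum 4-second readings into 30-minute bins; mark bins containing a NULL as incomplete.
import csv

READINGS_PER_BIN = 30 * 60 // 4  # 450 four-second readings per 30-minute interval

with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    values = [row[0] for row in csv.reader(fin)]    # assumed: one reading per row
    for i in range(0, len(values), READINGS_PER_BIN):
        chunk = values[i:i + READINGS_PER_BIN]
        if "" in chunk or len(chunk) < READINGS_PER_BIN:
            writer.writerow(["NULL"])               # incomplete 30-minute segment
        else:
            writer.writerow([sum(int(v) for v in chunk)])
The output file can then be loaded into the database with LOAD DATA INFILE or a similar bulk import.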
Dynamic memory allocation has nothing to do with MySQL configuration settings. It is a programming style where you create variables when you need them and delete them from memory when you no longer need them. Every sophisticated program uses it. That's why Microsoft Word does not reserve RAM for 300 pages of data when you start it, but once you write 300 pages, the RAM used by Microsoft Word grows to that size.
And so when you encounter a NULL value you discard the whole set. Right? If you describe the operation you are supposed to do and the format of data thoroughly I can come up with a more robust solution.
And so when you encounter a NULL value you discard the whole set. Right?
Yes, a NULL value means the set is not complete.
If you describe the operation you are supposed to do and the format of data thoroughly I can come up with a more robust solution.
Well, it's complicated...
I have 8 columns that specify the experiment:
1. date of start experiment
2. monitor# (data is collected in separate devices)
3. date/time (data collection time, 4-s interval) <- 302400 time points (two weeks)
4. promotor
5. lesion
6. type
7. replicate#
8. light (on/off)
Column 9 is then "activity": the actual measurement that is taken. There are 302400 measurements for each experiment.
I have 275 distinct experiments (column 1, 4, 5, 6), over 4 monitors = 1100 conditions.
I have 2905 replicates, with 302400 timepoints = 966 billion records.
I'm no expert, but I think that's a lot...
I can't even begin to describe the queries that need to be run on it. After converting to a 30-minute time interval, it involves checking whether lights are on or off, finding experiments where the light has been off for two days, checking from those how many preceding days alternate light on/off, averaging those days over the time period to recreate an "average" day, appending the days of complete darkness, extracting sub-timepoints, and then doing the whole thing again for control experiments, with the condition that the control data must be collected within the same experiment as the mutant.
I've tried that on a database with 9.5 million records (1 monitor file) and it takes 20 minutes to complete. Ack.
I could store the activity in an array and use a separate table that defines what each array position means. For instance, the first array value would correspond to a certain promotor, lesion, type, and replicate#. That would introduce more NULLs, but hopefully doesn't slow things down?
I'm meeting with an information manager next week, hopefully that will help...
For the 4-second time intervals I convert the values to 0s and 1s (I've observed that anything higher is a measurement error). The larger time intervals would then hold integers, and the averages of those intervals would be decimals.
Considering that I'm approaching a trillion rows, I think I'll have to step away from the idea of putting all the raw data in a database.
#7 DavidSnider:
Monique said:
For the 4-second time intervals I convert the values to 0s and 1s (I've observed that anything higher is a measurement error). The larger time intervals would then hold integers, and the averages of those intervals would be decimals.
Considering that I'm approaching a trillion rows, I think I'll have to step away from the idea of putting all the raw data in a database.
Well, once you normalize the condition data, the size won't be that big of a deal. 966 billion bits is about 112 GB, which is totally manageable.
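A quick back-of-the-envelope check of that figure, assuming each activity value is stored as a single bit:
Code:
# 966 billion activity values at 1 bit each, expressed in binary gigabytes (GiB).
bits = 966e9
print(bits / 8 / 2**30)  # ~112.5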
(3) Issue the following command at the Linux bash shell prompt: sh myprog01.sh
See if file out01 now contains what you want.
Note that the above command skips any lines that do not contain the label "time" beginning in column 1. If your data does not really contain "time 1: " labels, as shown in post 1, then let us know what is literally in the input file.
I think the above currently will omit incomplete interval segments containing null. I will not be able to test it until 2013-05-27.
Hi nvn, thank you for your input. I'm not so familiar with your code, but I'll try to digest it.
I had a meeting with an information manager today; his suggestion was to step away from the idea of storing the information in a database for later retrieval, and instead to do all the processing on the raw files (similar to Kholdstare's suggestion), using MATLAB for it. I've never worked with MATLAB before, so I'd first have to teach myself how to handle data with it.
Does any of you have experience with it? My data is in 120 separate files, and the data structure does not allow for the creation of one big file. Can I program MATLAB in such a way that it will search for the relevant data structures in each file (both in the columns and rows) and put all the information together in one table to be analyzed?
#11 Kholdstare:
The first thing you need to do is de-uniqify the way the data is stored, to reduce the size requirement significantly. The trick is to use tagged filenames/foldernames to achieve this.
e.g. if I have two files in /home/Kholdstare/ (BTW, do you use Linux or Windows?)
Will produce 28May2013_monitor1_timecolumn_xyz1_xyz2.txt from exp1.txt
Compare the file sizes and see that the new one is significantly smaller than the original, because you don't have to store redundant information (like xyz1, xyz3, etc.) in every row. This is called de-uniqification: in exp1.txt every row entry was unique and could only be identified by the whole line, whereas in the new file row entries are not unique because they share the common filename/foldername tags. To uniquely identify a row from the new file, one has to use the tags in the filename/foldername.
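A rough sketch of that step (Python for brevity; the layout of exp1.txt and the number of shared leading columns are assumptions, since the example files aren't shown here):
Code:
# Strip the condition columns that repeat on every row and encode them once in the filename.
import csv

SHARED_COLS = 4  # assumed number of leading columns identical in every row (date, monitor, ...)

with open("exp1.txt", newline="") as f:
    rows = list(csv.reader(f))

tags = rows[0][:SHARED_COLS]            # shared condition values
out_name = "_".join(tags) + ".txt"      # e.g. 28May2013_monitor1_xyz1_xyz2.txt

with open(out_name, "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row[SHARED_COLS:])  # keep only the per-row (non-redundant) columns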
Monique: I am not sure I would recommend Matlab for this application. Do all rows and columns of your 120 files have the same format? Would you be able to post example lines from four or six of your files (preferably inside code tags)? And describe a given example of the columns and rows you want to search for in each file, and how you want the data put together into one table to be analyzed? Then we might be able to show you commands you could try, to see if they give you what you want or help you.
Ok, directly working from files would be nice. I have a program that can convert the 4-second file into any other interval. For simplicity let's look at the 30-minute interval now.
Columns A-J are always in the same format, with column B/C containing the date/time and column J containing the light information on(1)/off(0). The 32 columns after that contain different genotypes tested.
The 2nd and 3rd code blocks contain two monitors that collected data on the same days. Incidentally the genotypes look to be the same in both monitors, but this is not usual.
So this is how I'd like to pull out the information:
Does the file contain 2096_8012_0?
(possibly remove information when the activity remains 0 for >2 hours = convert to NULL)
Does the file contain two days where the lights are off? (Best done by looking at the rounded average light in column J from 0:00:00-23:30:30, which should be 0; see the sketch after this list.)
Count how many days come before that; for each time point take the average over 4 days (if there are 4 days, otherwise fewer).
Return the average of the time points of the day, also return the information of the two days of darkness (so that would yield one column representing 3 days)
Do that for each column that matches 2096_8012_0 (the averaging and returning of information)
Do that for all the files, make note of the first date of the experiment.
Now, for each date that contained 2096_8012_0, also search for 2096_2056_0.
Repeat the data extraction; this is the control.
For all the replicates of 2096_8012_0 and 2096_2056_0, perform an analysis
For each column sum the activity from 2:00:00 to 8:00:00 (day 1) and 2:00:00 to 8:00:00 (day 2)
Calculate the T-test statistic between the case and control
Calculate the average sum of activity for all time points of 2096_8012_0 and 2096_2056_0
Calculate the 95% confidence interval -> make graph that represents the three days
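To make the darkness check and the day-averaging concrete, here is a rough Python sketch; the column layout, the 48 thirty-minute rows per day, and the file name are assumptions based on the descriptions in this thread, and MATLAB or another tool would follow the same logic:
Code:
# Find the earliest two consecutive dark days and average the preceding (up to) four days.
import csv

ROWS_PER_DAY = 48
LIGHT_COL = 9        # column J (0-based index)
FIRST_DATA_COL = 10  # column K

with open("monitor1.csv", newline="") as f:
    rows = list(csv.reader(f))

# Split the file into whole days of 48 thirty-minute time points.
days = [rows[i:i + ROWS_PER_DAY] for i in range(0, len(rows), ROWS_PER_DAY)]
days = [d for d in days if len(d) == ROWS_PER_DAY]

def dark(day):
    # A dark day: the rounded average of the light column over the day is 0.
    return round(sum(float(r[LIGHT_COL]) for r in day) / ROWS_PER_DAY) == 0

# Earliest pair of consecutive days of complete darkness.
start = next(i for i in range(len(days) - 1) if dark(days[i]) and dark(days[i + 1]))

# Per time point, average each data column over the (up to) four days before the darkness.
baseline = days[max(0, start - 4):start]
n_cols = len(rows[0]) - FIRST_DATA_COL
avg_day = [[0.0] * n_cols for _ in range(ROWS_PER_DAY)]
for t in range(ROWS_PER_DAY):
    for c in range(n_cols):
        vals = [day[t][FIRST_DATA_COL + c] for day in baseline]
        # In the real analysis a NULL (empty field) should exclude the segment entirely;
        # here the averaged time point is simply marked as missing.
        avg_day[t][c] = sum(map(float, vals)) / len(vals) if "" not in vals else float("nan")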
Monique: What are the file names of the four files in post 15? Do the files have a naming convention? Secondly, I notice all four files in post 15 contain only one date. Is this always true? Or do some files contain more than one date?
Monique: I worked on it awhile, and I have it automatically extracting data for the given genotype. Over the next four days, as time permits, I next plan to implement processing of the lighting data. I plan to get back with you then, in about four days, to show you the code, so far, so you can start testing it, to see if it gives you what you want.
Does the file contain two days where the lights are off?
Monique: (1) Let's take a hypothetical example, and say you have four days of complete darkness, occurring in days 5, 6, 8, and 9. Do you want to find the first two days of darkness (days 5 and 6), or do you want to find the last two days of darkness (days 8 and 9)?
(2) Same question, except the four days of complete darkness are now days 5, 6, 7, and 8?
(3) Same question, except the four days of complete darkness are now days 5, 6, 12, and 13?
Monique said:
Return the average of the time points of the day, also return the information of the two days of darkness (so that would yield one column representing 3 days).
(4) What are "the time points of the day"? I.e., what is "time points" in "the time points of the day"?
(5) In the second and third row of the "one column representing 3 days," why do we return the average lighting for the two days of complete darkness, since we already know the two days of complete darkness have an average of 0 % lights on?
Monique: (1) Let's take a hypothetical example, and say you have four days of complete darkness, occurring in days 5, 6, 8, and 9. Do you want to find the first two days of darkness (days 5 and 6), or do you want to find the last two days of darkness (days 8 and 9)?
The first two days, so 5 and 6.
(2) Same question, except the four days of complete darkness are now days 5, 6, 7, and 8?
The same, 5 and 6.
(3) Same question, except the four days of complete darkness are now days 5, 6, 12, and 13?
The same, 5 and 6. Always the earliest two days.
(4) What are "the time points of the day"? I.e., what is "time points" in "the time points of the day"?
Time point 0:00:00 in the experiment is the moment the lights turn on; at 12:00:00 the lights turn off. This is not done with atomic precision, so if the lights are not off at 12:00:00, they might go off at 12:00:08.
(5) In the second and third row of the "one column representing 3 days," why do we return the average lighting for the two days of complete darkness, since we already know the two days of complete darkness have an average of 0 % lights on?
It's the activity (the actual data) that needs to be returned, column K and higher. So it's the average activity of the first normal days, then the activity of the next two days of darkness (that data can't be averaged, since there is a circadian shift in darkness).
Monique: I now have it computing the genotype 2096_8012_0 and 2096_2056_0 average activity for "day" 1, day 2, and day 3. For day 1, "day" is in quotes, because day 1 consists of four days, or less. I created a test case; and some intermediate results it returned are listed below. The first record gives a label of what the following record contains. The following record gives the average activity for the stated genotype, in each column that contained that genotype. Any column that did not contain that genotype contains -1. Therefore, you have, e.g., column 12, which is one column with three rows, representing three days (day 1, 2, and 3), for genotype 2096_8012_0. Day 2 and 3 are complete darkness, defined as the earliest-occurring 96 consecutive time points having lights off.
The reason it says "1 files generated" is because, in my test case, I used the four files you posted in post 15 (monitor1.csv, monitor2.csv, monitor3.csv, and monitor4.csv). I then fabricated a lot of fictitious records in file monitor1.csv, so it would be long enough for testing. Notice, it found two days of complete darkness in file monitor1.csv, but did not find two days of darkness in the other three files, because the other three files did not contain two days of darkness. It worked correctly, as it should, generating output for 1 file (monitor1.csv).
The control genotype (2096_2056_0) average activity is also listed above, which you can see in the labels.
The averages are computed as follows. For day 1, avg_day1 = summation(activity)/(192 or less). For day 2, avg_day2 = summation(activity)/48. For day 3, avg_day3 = summation(activity)/48. A day of complete darkness consists of 48 thirty-minute time points. The reason avg_day1 says "192 or less" in the denominator is because day 1 consists of a maximum of four days (192 thirty-minute time points), or less (if less than 192 time points occurred before the two days of complete darkness).
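As a minimal illustration of that rule (not nvn's actual code), the three averages could be computed like this, where each list holds the thirty-minute activity values for the corresponding day:
Code:
def day_averages(day1_values, day2_values, day3_values):
    # day1_values: up to 192 thirty-minute activity values (at most four baseline days)
    # day2_values, day3_values: 48 values each (one 24-hour day of complete darkness)
    avg_day1 = sum(day1_values) / len(day1_values)  # the "192 or less" denominator
    avg_day2 = sum(day2_values) / 48
    avg_day3 = sum(day3_values) / 48
    return avg_day1, avg_day2, avg_day3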
Monique said:
Now, for each date that contained 2096_8012_0, also search for 2096_2056_0.
Monique: For each date that contained 2096_8012_0, will 2096_2056_0 always be in the same files where 2096_8012_0 was found? Or can 2096_2056_0 occur (on that same date) in other files that do not contain 2096_8012_0?
Monique said:
For each column, sum the activity from 2:00:00 to 8:00:00 (day 1) and 2:00:00 to 8:00:00 (day 2).
I am working on this now. But could you elaborate on how your "day 1" and "day 2," here, compare to my day 1, 2, and 3, listed above?
Monique said:
For all the replicates of 2096_8012_0 and 2096_2056_0, perform an analysis.
For all occurrences of 2096_8012_0 and 2096_2056_0? Or only those found in my "days 1, 2, and 3," described above in paragraphs 1 and 4?
For each column, sum the activity from 2:00:00 to 8:00:00 (day 1) and 2:00:00 to 8:00:00 (day 2).
For now, I will assume "day 1" and "day 2" in your above sentence have the same meaning as "day 1" and "day 2" in the beginning of posts 15 and 21. I.e., "day 1" is the four days before the two days of complete darkness. "Day 2" is the first day of complete darkness (48 consecutive thirty-minute time points of darkness, 24 hours). Footnote: Day 3, which is the second 24-hour day of complete darkness, is omitted for the summation described in your above sentence.
Monique: Here are some intermediate results my test case automatically returns now, as described in posts 21 and 22. "Mode 1" denotes average activity for all values, for days 1, 2, and 3. "Mode 2" denotes average activity from time 2:00 to 8:00, for days 1 and 2. The summation of activity (even though I did not output it below) is just the average value multiplied by rcount. rcount is the number of thirty-minute time points used to compute the average.
This is just intermediate data I am obtaining. If this looks correct, and looks like what you want, then I guess the next step would be to compute the T-test statistic? If so, how would that be computed?
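For reference, if a standard two-sample T-test on the per-replicate activity sums is what is intended (the thread does not specify the exact variant), it could be computed like this in Python with SciPy; the sums below are hypothetical placeholders:
Code:
# Welch's two-sample t-test on per-replicate activity sums (case vs. control).
from scipy import stats

case_sums = [12.0, 15.5, 9.0, 14.0]      # hypothetical 2:00-8:00 sums for 2096_8012_0 replicates
control_sums = [20.0, 18.5, 22.0, 19.0]  # hypothetical sums for 2096_2056_0 replicates

t_stat, p_value = stats.ttest_ind(case_sums, control_sums, equal_var=False)
print(t_stat, p_value)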