
How to calculate the arithmetic sum of multiple strings

  1. May 22, 2013 #1

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    I have a MySQL database, and the problem is that queries take far too much time. I want to optimize the database, and one way would be to save the data into strings instead of rows (each string would hold the replicates of the same condition).

    Each string holds the measurements for one 4-second interval; to reduce the number of rows further, I want to sum them into 30-minute intervals. How can I sum strings element-wise? I'd like to use PHP for that. Should I push the replicates into an array?

    NULL values may be present

    To illustrate:
    time 1: 0,1,0,1,1,1,0
    time 2: 1,1,0,,0,1,1

    sum time 1+2: 1,2,0,1,1,2,1
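
    A minimal PHP sketch of this element-wise sum (assuming comma-separated strings of equal length, where an empty field is a NULL that contributes nothing to the sum):

    Code (Text):
    <?php
    // Element-wise sum of comma-separated value strings.
    // Assumption: an empty field ("") is a NULL and adds nothing.
    function sumStrings(array $strings) {
        $sums = array();
        foreach ($strings as $s) {
            foreach (explode(',', $s) as $i => $field) {
                if (!isset($sums[$i])) $sums[$i] = 0;
                if ($field !== '') $sums[$i] += (int)$field;
            }
        }
        return implode(',', $sums);
    }

    echo sumStrings(array('0,1,0,1,1,1,0', '1,1,0,,0,1,1')); // 1,2,0,1,1,2,1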
     
    Last edited: May 22, 2013
  3. May 22, 2013 #2
    You cannot add strings together; a string is not a numeric type, and there will never be a data structure that supports arithmetic on strings directly, because fixed- and floating-point arithmetic is defined on numbers, not text.

    Also, storing the data in strings will not save time or space; in fact each string costs the additional space of a NUL terminator character. The better way would be to use dynamic memory allocation and pointers.

    Also, if a NULL value has the same effect as 0, why not replace it with a 0?

    I'd rather write a C program to operate on the arrays.
     
  4. May 22, 2013 #3

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    Thanks for your reply! It would save space in that I only have to declare the conditions once, as in:
    A B C D E #1
    A B C D E #2
    A B C D E #3
    A B C D E #4

    would become
    A B C D E #1#2#3#4 (SELECT will be a lot faster)

    I've dumped everything into one table, since otherwise I'd end up with at least 350 tables, which would slow down queries as well.

    NULL won't have the same effect as 0: if there are NULL values in a 30-minute segment, I need to exclude that segment due to missing data (I'm monitoring activity over time, and sometimes the test subjects (fruit flies) die).

    Dynamic memory allocation... I did change the MySQL config settings between the normal, large, and huge database presets; I think that changes the way memory is handled? It didn't make a difference. Also, memory is not the limiting factor in the query, but CPU usage does go up considerably.

    Pointers... I must admit I'm not sure what you mean by that. I've worked with arrays, but I've never really mastered them, and I've never used one as a database field.
     
    Last edited: May 22, 2013
  5. May 22, 2013 #4
    You can never add strings anyway, so put your data into arrays and operate on those. You would need pointers to do it efficiently, but I think you can do without them.

    However, if the data is meant for permanent storage, why not do all the processing on it before putting it into SQL? You could write a small program with an input and an output file: dump the data into the input file with some predefined formatting (CSV and tab-separated are standard), run the program once, and copy the processed output file into the SQL database. If you can write wrappers and have SQL libraries, you can even automate the step of copying the output into the database.
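
    For example, that last step might look like this in PHP with PDO (a rough sketch only; the table name activity_30min, the column layout, and the file path are placeholders, not something from your setup):

    Code (Text):
    <?php
    // Sketch: bulk-load a processed CSV output file into MySQL.
    // Table name and file path are hypothetical placeholders.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=experiments',
        'user', 'password',
        array(PDO::MYSQL_ATTR_LOCAL_INFILE => true) // allow LOCAL INFILE
    );
    $pdo->exec("LOAD DATA LOCAL INFILE '/path/to/out01'
                INTO TABLE activity_30min
                FIELDS TERMINATED BY ','");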

    Dynamic memory allocation has nothing to do with the MySQL configuration settings. It is a programming technique where you create variables when you need them and delete them from memory when you no longer do. Every sophisticated program uses it; that's why Microsoft Word does not reserve RAM for 300 pages of data when you start it, but once you have written 300 pages, its RAM usage grows to that size.

    And so when you encounter a NULL value you discard the whole set, right? If you describe thoroughly the operation you need and the format of the data, I can come up with a more robust solution.
     
  6. May 23, 2013 #5

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    Yes, a NULL value means the set is not complete.
    Well, it's complicated....

    I have 8 columns that specify the experiment
    1. date of start experiment
    2. monitor# (data is collected in separate devices)
    3. date/time (data collection time, 4-s interval) <- 302400 time points (two weeks)
    4. promotor
    5. lesion
    6. type
    7. replicate#
    8. light (on/off)

    Column 9 is then "activity": the actual measurement that is taken. There are 302400 measurements for each experiment.

    I have 275 distinct experiments (columns 1, 4, 5, 6), over 4 monitors = 1100 conditions.
    I have 2905 replicates, with 302400 timepoints = 966 billion records.

    I'm no expert, but I think that's a lot..

    I can't even begin to describe the queries that need to be run on it. After converting to a 30-min time interval, it involves: checking whether the lights are on or off; finding experiments where the light has been off for two days; from those, checking how many days alternate light on/off; taking the average of those days over the time period to recreate an "average" day; appending the days of complete darkness; extracting sub-timepoints; and then doing the whole thing again for the control experiments, with the condition that their data must be collected within the same experiment as the mutant.

    I've tried that on a database with 9.5 million records (1 monitor file) and it takes 20 minutes to complete. Ack.

    I could use an array to hold the activity, with a separate table that defines what each array position means. For instance, the first array value would correspond to a certain promotor, lesion, type, and replicate#. That would introduce more NULLs, but hopefully doesn't slow things down?

    I'm meeting with an information manager next week; hopefully that will help...
     
  7. May 23, 2013 #6

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    For the 4-second time intervals I convert the values to 0s and 1s (I've observed that anything higher is a measurement error). The larger time intervals would then hold integers, and the averages of those intervals would be decimals.

    Considering that I'm approaching a trillion rows, I think I'll have to abandon the idea of putting all the raw data in a database.
     
  8. May 23, 2013 #7

    DavidSnider

    Gold Member

    Well, once you normalize the condition data the size won't be that big of a deal. 966 billion bits is about 112 GB (divide by 8 for bytes), which is totally manageable.
     
  9. May 25, 2013 #8

    nvn

    Science Advisor
    Homework Helper

    Monique: (1) If your example data in post 1 is in a file named in01, which therefore contains lines such as the following, then go to item 2, below.

    time 1: 0,1,0,1,1,1,0
    time 2: 1,1,0,,0,1,1

    (2) Type the following code into a file named myprog01.sh.

    #!/bin/sh
    # Created: 2013-05-25, nvn.
    # Strip the ':' after the label and turn spaces into commas, so each
    # input line becomes plain CSV, e.g. "time,1,0,1,0,1,1,1,0".
    sed -e 's/://' -e 's/ /,/g' in01 \
    | awk -F, '/^time/ {
      # Accumulate the seven data columns (fields 3 to 9).
      c3+=$3; c4+=$4; c5+=$5; c6+=$6; c7+=$7; c8+=$8; c9+=$9; rcount++
      # Flag the segment if any field is empty (a NULL value).
      if (nullflag=="") {for (ii=1; ii<=NF; ii++) {if ($ii=="") nullflag="1"}}
      # 450 four-second rows = one 30-minute interval.
      if (rcount==450) {
        halfh++
        if (nullflag=="")
          printf("half_hour,%d,%d,%d,%d,%d,%d,%d,%d\n",
            halfh, c3, c4, c5, c6, c7, c8, c9)
        rcount=0; nullflag=""
        c3=0; c4=0; c5=0; c6=0; c7=0; c8=0; c9=0
      }
    }' > out01

    (3) Issue the following command at the Linux bash shell prompt: sh myprog01.sh

    See if file out01 now contains what you want.

    Note that the above script skips any lines that do not begin with the label "time" in column 1. If your data does not really contain "time 1: " labels, as shown in post 1, then let us know what is literally in the input file.

    I think the above currently omits incomplete interval segments that contain a null. I will not be able to test it until 2013-05-27.
     
    Last edited: May 25, 2013
  10. May 27, 2013 #9

    nvn

    Science Advisor
    Homework Helper

    Monique: I now tested the code in post 8, and it worked fine.
     
  11. May 28, 2013 #10

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    Hi nvn, thank you for your input. I'm not so familiar with your code, but I'll try to digest it.

    I had a meeting with an information manager today; his suggestion was to step away from the idea of storing the information in a database for later retrieval. Instead, he suggested doing all the processing on the raw files (similar to Kholdstare's suggestion) and using MATLAB for it. I've never worked with MATLAB before, so I'd first have to teach myself how to handle data with it :cry:

    Do any of you have experience with it? My data is in 120 separate files, and the data structure does not allow for the creation of one big file. Can I program MATLAB in such a way that it searches each file for the relevant data structures (both in the columns and the rows) and puts all the information together in one table to be analyzed?
     
  12. May 28, 2013 #11
    The first thing you need to do is de-uniqify the way the data is stored, to reduce the size requirement significantly. The trick is to move the tags that repeat on every row into the filename/foldername.

    e.g. If I have two files in /home/Kholdstare/ (BTW, do you use Linux or Windows?)

    cat /home/Kholdstare/exp1.txt
    28/May/2013 monitor1 11:01 xyz1 xyz2 1
    28/May/2013 monitor1 12:24 xyz1 xyz2 0
    28/May/2013 monitor1 13:05 xyz1 xyz2 1
    28/May/2013 monitor1 14:55 xyz1 xyz2 2
    28/May/2013 monitor1 17:12 xyz1 xyz2 0
    28/May/2013 monitor1 17:20 xyz1 xyz2 1
    28/May/2013 monitor1 18:03 xyz1 xyz2 1
    cat /home/Kholdstare/exp2.txt
    27/Apr/2013 monitor1 02:14 xyz3 xyz2 2
    27/Apr/2013 monitor1 08:23 xyz3 xyz2 1
    27/Apr/2013 monitor1 09:44 xyz3 xyz2 0
    27/Apr/2013 monitor1 10:25 xyz3 xyz2 2
    27/Apr/2013 monitor1 12:00 xyz3 xyz2 0
    27/Apr/2013 monitor1 13:02 xyz3 xyz2 1
    27/Apr/2013 monitor1 15:32 xyz3 xyz2 3

    Convert them to either 1)

    cat /home/Kholdstare/28May2013_monitor1_timecolumn_xyz1_xyz2.txt
    11:01 1
    12:24 0
    13:05 1
    14:55 2
    17:12 0
    17:20 1
    18:03 1
    cat /home/Kholdstare/27Apr2013_monitor1_timecolumn_xyz3_xyz2.txt
    02:14 2
    08:23 1
    09:44 0
    10:25 2
    12:00 0
    13:02 1
    15:32 3

    Or 2)

    cat /home/Kholdstare/28May2013/monitor1/timecolumn/xyz1/xyz2.txt
    11:01 1
    12:24 0
    13:05 1
    14:55 2
    17:12 0
    17:20 1
    18:03 1
    cat /home/Kholdstare/27Apr2013/monitor1/timecolumn/xyz3/xyz2.txt
    02:14 2
    08:23 1
    09:44 0
    10:25 2
    12:00 0
    13:02 1
    15:32 3
     
  13. May 28, 2013 #12
    Either 1) or 2) can easily be produced from the initial file using the 'gawk' command on Linux.

    Code (Text):
    cat exp1.txt | gawk '{ print $3, $6 }' > 28May2013_monitor1_timecolumn_xyz1_xyz2.txt
    This will produce 28May2013_monitor1_timecolumn_xyz1_xyz2.txt from exp1.txt ($3 is the time column and $6 the measured value).

    Compare the file sizes and you will see the new one is significantly smaller, because you no longer have to store redundant information (the date, monitor, and genotype tags) in every row. This is called de-uniqification: in exp1.txt every row entry was unique and could be identified by the whole line, but in the new file the row entries are not unique, since they share common filename/foldername tags. To uniquely identify a row from the new file, one has to use the tags in the filename/foldername.

    Now delete the exp1.txt file.
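
    Reading the tags back out is just string splitting; a rough PHP sketch (assuming the filename layout from the example above):

    Code (Text):
    <?php
    // Recover a row's full identity from the tagged filename.
    // Assumed layout: date_monitor_timecolumn_tag1_tag2.txt
    $file = '28May2013_monitor1_timecolumn_xyz1_xyz2.txt';
    list($date, $monitor, , $tag1, $tag2) = explode('_', basename($file, '.txt'));
    foreach (file($file, FILE_IGNORE_NEW_LINES) as $line) {
        list($time, $value) = explode(' ', $line);
        // These six fields identify one measurement, exactly like
        // one line of the original exp1.txt.
        echo "$date $monitor $time $tag1 $tag2 $value\n";
    }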
     
  14. May 28, 2013 #13

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    Thanks for your thoughts, I wasn't aware of that system (I use a Mac).
     
  15. May 28, 2013 #14

    nvn

    Science Advisor
    Homework Helper

    Monique: I am not sure I would recommend Matlab for this application. Do all rows and columns of your 120 files have the same format? Would you be able to post example lines from four or six of your files (preferably inside code tags)? And could you describe a specific example of the columns and rows you want to search for in each file, and how you want the data put together into one table to be analyzed? Then we might be able to show you commands you could try, to see if they give you what you want.
     
    Last edited: May 29, 2013
  16. May 29, 2013 #15

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    OK, working directly from the files would be nice. I have a program that can convert the 4-second file into any other interval. For simplicity, let's look at the 30-minute interval now.

    Columns A-J are always in the same format, with columns B/C containing the date/time and column J containing the light information: on(1)/off(0). The 32 columns after that contain the different genotypes tested.

    The 2nd and 3rd code blocks are from two monitors that collected data on the same days. Incidentally the genotypes look to be the same in both monitors, but this is not usual.

    Code (Text):
    A   B       C   D   E   F   G   H   I   J   2096_2056_0 2096_8012_0 2096_8038_0 2096_6040_0 2096_8012_0 2096_8038_1N    2096_6040_1N    2096_8038_1N    2096_6040_1N    2096_2056_0 2096_8012_0 2096_8038_0 2096_6040_0 2096_8012_0 2096_8038_1N    2096_6040_1N    2096_8012_1N    2096_6040_1N    2096_2056_1N    2096_8012_0 2096_8038_0 2096_6040_0 2096_8012_0 2096_8038_1N    2096_8038_1N    2096_8012_1N    2096_6040_1N    2096_2056_1N    2096_8012_0 2096_8038_0 2096_6040_0 2096_8012_0
    36090   19-Apr-13   0:00:00 1   449 0   0   0   0   1   152 25  94  224 48  79  137 54  58  23  18  118 119 73  88  171 89  69  50  162 69  169 58  26  62  89  117 38  38  127 93  75
    36540   19-Apr-13   0:30:00 1   449 0   0   0   0   1   132 15  74  167 41  20  102 64  75  39  14  39  100 0   72  175 44  53  12  88  9   110 18  8   97  18  97  6   23  124 82  0
    36990   19-Apr-13   1:00:00 1   449 0   0   0   0   1   62  13  62  157 14  5   80  54  63  78  0   6   68  0   12  173 24  30  18  97  1   6   2   10  85  6   83  2   12  94  84  0
    Code (Text):
    A   B       C   D   E   F   G   H   I   J   2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_2056_0 2096_8012_1N    2096_8038_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_2056_0 2096_8012_1N    2096_8038_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_6040_0 2096_8012_1N    2096_6040_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_0 2096_8038_0 2096_8012_1N    2096_6040_0
    38340   15-May-13   0:00:00 1   449 0   0   0   0   1   110 92  68  88  102 84  37  31  111 85  86  126 145 116 30  58  89  134 102 54  92  86  128 92  173 88  59  72  99  43  73  129
    38790   15-May-13   0:30:00 1   449 0   0   0   0   1   63  45  59  39  81  75  21  25  95  34  113 112 116 64  8   53  64  97  78  53  56  45  77  62  166 36  41  38  87  62  27  86
    39240   15-May-13   1:00:00 1   449 0   0   0   0   1   44  28  54  24  65  22  23  6   53  24  97  63  96  40  26  1   43  46  48  46  58  26  48  44  150 29  47  27  63  50  20  101
    Code (Text):
    A   B       C   D   E   F   G   H   I   J   2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_2056_0 2096_8012_1N    2096_8038_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_2056_0 2096_8012_1N    2096_8038_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_1N    2096_8038_1N    2096_8012_0 2096_6040_0 2096_8012_1N    2096_6040_0 2096_8012_1N    2096_6040_1N    2096_8012_0 2096_2056_0 2096_8038_0 2096_8012_1N    2096_6040_0
    38340   15-May-13   0:00:00 1   449 0   0   0   0   1   156 103 90  37  77  86  29  74  161 157 93  37  31  107 45  66  76  83  51  106 54  72  91  70  147 62  75  154 81  106 65  80
    38790   15-May-13   0:30:00 1   449 0   0   0   0   1   78  75  93  31  71  41  14  43  143 122 72  49  7   87  14  71  45  86  40  93  29  81  88  56  108 28  52  75  31  72  24  62
    39240   15-May-13   1:00:00 1   449 0   0   0   0   1   44  25  47  44  33  27  2   38  147 18  38  40  0   68  3   44  19  71  12  82  12  42  79  45  107 16  24  49  8   33  9   48
    Code (Text):
    A   B       C   D   E   F   G   H   I   J   2096_8012_0 2096_8013_0 2096_2056_0 2096_8012_1N    2096_8013_1N    2906_2056_1N    2096_8012_0 2096_8013_0 2096_2056_0 2096_8012_1N    2096_8013_1N    2906_2056_1N    2096_8012_0 2096_8013_0 2096_2056_0 2096_8012_1N    2096_8013_1N    2906_2056_1N    2096_8012_0 2096_8013_0 2096_2056_0 2096_8012_1N    2096_8013_1N    2906_2056_1N    2096_8012_0 2096_8013_0 2096_2056_0 2096_8012_1N    2096_8013_1N    2906_2056_1N    2096_8013_1N    2096_8012_1N
    12156   1-Jan-11    0:00:00 1   449 0   0   0   0   1   144 7   55  30  13  92  100 59  59  40  34  56  113 105 48  62  102 73  2   93  42  22  37  83  31  95  68  96  36  83  30  36
    12606   1-Jan-11    0:30:00 1   449 0   0   0   0   1   126 9   38  4   25  24  72  58  26  64  37  32  51  113 26  66  93  48  0   98  8   17  39  119 35  70  50  91  24  49  48  98
    13056   1-Jan-11    1:00:00 1   449 0   0   0   0   1   120 6   2   8   6   10  5   35  5   42  0   0   45  98  14  9   98  0   3   102 22  3   30  110 29  14  15  71  7   1   9   48
    So here is how I'd like to pull out the information (a rough sketch of the first steps follows this list):
    Does the file contain 2096_8012_0?
    (possibly remove information when the activity remains 0 for >2 hours = convert to NULL)
    Does the file contain two days where the lights are off? (best determined by looking at the rounded average light in column J from 0:00:00 to 23:30:00, which should be 0)
    Count how many days come before that; for each time-point, take the average over 4 days (if there are 4 days, otherwise fewer)
    Return the average of the time points of the day, and also return the data of the two days of darkness (so that would yield one column representing 3 days)
    Do that for each column that matches 2096_8012_0 (the averaging and returning of information)
    Do that for all the files, making note of the first date of the experiment.

    Now, for each date that contained 2096_8012_0, also search for 2096_2056_0.
    Repeat the data extraction; this is the control.

    For all the replicates of 2096_8012_0 and 2096_2056_0, perform an analysis
    For each column sum the activity from 2:00:00 to 8:00:00 (day 1) and 2:00:00 to 8:00:00 (day 2)
    Calculate the T-test statistic between the case and control
    Calculate the average sum of activity for all time points of 2096_8012_0 and 2096_2056_0
    Calculate the 95% confidence interval -> make graph that represents the three days
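
    A rough PHP sketch of the first two steps (my assumptions: a whitespace-separated file named Monitor1.csv with one header row, 48 half-hour rows per day, and the light state in column J, i.e. field index 9):

    Code (Text):
    <?php
    // Sketch of the first extraction steps, under the assumptions
    // stated above. Genotype labels start at column K (index 10).
    $rows = array();
    foreach (file('Monitor1.csv', FILE_IGNORE_NEW_LINES) as $line) {
        $rows[] = preg_split('/\s+/', trim($line));
    }
    $header = array_shift($rows);

    // 1. Which columns hold the genotype of interest?
    $cols = array_keys($header, '2096_8012_0');
    if (count($cols) == 0) exit("genotype not found\n");

    // 2. Find the earliest two days where the rounded average of
    //    column J over the day's 48 rows is 0 (lights off all day).
    $darkDays = array();
    for ($d = 0; ($d + 1) * 48 <= count($rows); $d++) {
        $sum = 0;
        foreach (array_slice($rows, $d * 48, 48) as $r) {
            $sum += (int)$r[9];
        }
        if (round($sum / 48) == 0) {
            $darkDays[] = $d;
            if (count($darkDays) == 2) break;
        }
    }
    print_r($cols);     // column indexes holding 2096_8012_0
    print_r($darkDays); // earliest two full-dark days (0-based)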
     
  17. May 29, 2013 #16

    nvn

    Science Advisor
    Homework Helper

    Monique: What are the file names of the four files in post 15? Do the files follow a naming convention? Secondly, I notice all four files in post 15 contain only one date. Is this always true, or do some files contain more than one date?
     
  18. May 29, 2013 #17

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    For each experimental start date there can be eight files: Monitor1.csv through Monitor8.csv.
    The files contain more than one date, usually 7-14 days.
     
  19. May 29, 2013 #18

    nvn

    Science Advisor
    Homework Helper

    Monique: I worked on it a while, and I have it automatically extracting data for the given genotype. Over the next four days, as time permits, I plan to implement processing of the lighting data. I'll then get back to you, in about four days, to show you the code so far, so you can start testing whether it gives you what you want.
     
  20. May 31, 2013 #19

    nvn

    Science Advisor
    Homework Helper

    Monique: (1) Let's take a hypothetical example and say you have four days of complete darkness, occurring on days 5, 6, 8, and 9. Do you want to find the first two days of darkness (days 5 and 6), or the last two days of darkness (days 8 and 9)?

    (2) Same question, except the four days of complete darkness are now days 5, 6, 7, and 8?

    (3) Same question, except the four days of complete darkness are now days 5, 6, 12, and 13?

    (4) What are "the time points of the day"? I.e., what does "time points" mean in "the time points of the day"?

    (5) In the second and third row of the "one column representing 3 days," why do we return the average lighting for the two days of complete darkness, since we already know those two days have an average of 0% lights on?
     
  21. May 31, 2013 #20

    Monique

    Staff Emeritus
    Science Advisor
    Gold Member

    The first two days, so 5 and 6.

    The same, 5 and 6.

    The same, 5 and 6. Always the earliest two days.

    Timepoint 0:00:00 in the experiment is the moment the lights turn on; at 12:00:00 the lights turn off. This is not atomically precise, so if the lights are not off at exactly 12:00:00, they might go off at 12:00:08.

    It's the activity (the actual data) that needs to be returned, column K and higher. So it's the average activity of the first normal days, then the activity of the next two days of darkness (that data can't be averaged, since there is a circadian shift in darkness).
     