DNA sequencing program!

  1. May 31, 2009 #1
    Guys, I need major help with this assignment, which has to be written in C. I'm a beginner at this, so please explain what I should do in detail. Thanks!

    Caveat
    The solution to this assignment should be a C program necessarily structured as a main function.

    The problem

    Suppose a DNA strand consists of a sequence of genes. Each gene is a sequence of four types
    of nucleotides: Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Moreover, a gene has
    two well defined, adjacent regions: a coding region called exon, located in the beginning of the
    gene, and a non-coding region called intron, located in the immediate tail of the gene. If a gene
    contains h nucleotides in the exon region, then the length t of the intron region (i.e., the number
    of nucleotides in it) can be determined as follows:

    t = 3h + 1

    For instance, Figure 1 shows a DNA strand of a fictitious organism. It contains 2 genes and each
    gene has 3 nucleotides in the exon region (shown underlined in the figure).

    GATAD{GAATGCC
    First Gene
    CCTCGTAGTTGAC
    Second Gene




    In view of the above, your C main function should:

    1. scan a data file containing samples of DNA strands of (fictitious) organisms: in this data
    file, for each sample there is a line of data with the following fields separated by blanks -
    sample ID (integer), number of genes in the DNA strand (integer), length of the exon portion
    of each gene (integer), a sequence of characters representing the DNA strand. Note: you
    should NOT open a file to read data from it using C commands fopen and fscanf. Instead,
    you should set up your C development environment (Quincy) to take a data file as the source of input, and use the C function scanf to read the data. If you do not know how to do that, you should consult the Discussion Board, your TA, or your Instructor.

    2. determine the type of the DNA strand: the DNA type depends on the mean number and
    type of nucleotides in the exon regions of the genes present in the DNA strand, as shown in


    Table 1:

    DNA TYPE | CRITERIA
    1 | C<A<T<G
    2 | A<C<,G>0
    3 | otherwise

    * An underline on a letter denotes the mean (average), while an underline on a comparison sign means "or equal to" (in this case, greater than or equal to). (I had to retype the table since it couldn't be copied properly from the page I was viewing it in.)


    Table 1: DNA types and criteria. In the Criteria column, A, T, C, and G denote the mean number of the respective nucleotides present in the exon regions of a DNA strand (i.e., A denotes the mean number of Adenine in the exon regions of the strand).

    3. produce separate printouts for all the samples contained in each data file: each printout
    will present for each strand, on separate lines, the ID number of the strand, the values of
    A, T, G, and C, and the type of the strand. It should also include the name of the data file.
    The data should be labeled and formatted exactly as shown in Figure 2.

    Data File: dataFile01.txt

    ID mA mT mG mC Type
    684 0.27 0.30 0.20 0.24 3
    465 0.26 0.25 0.24 0.24 3
    131 0.21 0.21 0.30 0.27 3

    * I typed up only 3 lines to save time; there were about 7. This is not the data being read; it is an example of how my program should print out the data.

    * The three-digit integers are the sample IDs, the decimal numbers are the mean values of each nucleotide, and the single-digit integers (all 3s) are the type numbers. Sorry guys, the columns are not aligned properly.



    * What I'm having trouble with: how do I know which parts of the DNA sequence are A, T, G, and C? If I don't know this, I can't compute their mean values to assign a type. Also, how do I write a program that knows when it's done reading one line of a DNA sequence and then moves on to the next as a brand new one? How do I print out my values like in Figure 2, and how do I use printf and scanf to read and write my data (is this redirection)? Thanks guys. If I get some idea of this I can start on it; right now I have a big programmer's block preventing me from even starting, and I'm stuck at int main(void). I need help, guys. An algorithm would be nice (not the C code itself). I will be updating you guys on the code I write, and hopefully I'll have it done by Monday night (June 1st). Thanks a lot guys, I appreciate your help.



    *EDIT May 31 2009 11:56pm (EST)

    I found a file that came with the assignment that I had overlooked; maybe you guys can help me better by looking at it. My program basically has to read this file in.


    684 30 8 ATCAGAGGGGATTTCGCCGGTCTATCGAAGCTGAATTCATGTATTAACGATTACAACGAATCAGTAAAGAGTCCTTAGACGGTGATACAGCACGGTGGGTCGTGCTTTAGCCTTTTGCTTTCACTGTTCTAGTGATAATGAGGCTCGAAACTCCTGACCATATAAATGAGTACATAGGGACCCAAGGATAGCTATTCTTATTTACATGTATACGCACTCTCCACCTGCAAAGTCCTTTAGCAGATCCCCCACATGTCTCTATTAACCTAGTATATCCGCTTTTCATGGCTGATCCAACGTAAGGATCTCAGTCGCTCTGGGGTAGAAGTCGCCAATGGGCGTAAACGTAATTTGTTCCGGATTCATATTAACGTAATATAGCAACCTCCGAAACACAATGCGTGAGATTACCTATGTGCTTAACTCTATTTACATCGTGAGAGTTCCGGCAGTTAAGACAGCCCTCTAGTGGAAGGGGCTCCACCACAAATTTGTCTCCGCTTGAGAGAAATTGGATCGACCGTCCGTGAGGACCCGCCGCTGTTCACAGCCAAAGTAAAATGGTATAAAACCGGCGGTATCACTCAAACTTGCCCCATCATCTAAATGAGGCGATAGAATAGGCCTCACTCCTTTTTCGGGCACCCATGAACCCCCACGCCGTACTTACTCGCGAAGTCCCAGTAATTAGACAGCCGTGGGAGATCGTGAGGTCTAAGCGCCCGCACTCAACCATGGGACTGCGAAAGATAGAATAATCTGACATCCGAGAAGTTCCTCGATCCGAAGACGAGAAAGTTCTCAAGCGGCTACCGAACCTTCCTCTGCTGGACAGGTGCTGCGGTCCCGAAGTTAGCCCGTCTCATGAAAAACCAAACGCCTTGCTTTCAGTTATTAACGTCATCTGACAACCCGAAGCATTAGGTGAGAACGCCCCGGCGCCTTGCGCCGGTCCTGATGTTCTGCTCAATCCCCGTAACCTGCGAGGCC
    465 87 7 GTGTAGTTGCTCCGACAGACCCGAAACCCCCAACTTTCTGGAACTTTTCTGTATAGCAGGGCCTAGATATTCCGAAATCTATGCTGCCCTCTCTCCATGATTCGCGTTAGCACTTTACATCTACCCCAATAGAATGACATTGGGCTCTACCCTCACGGGTCACATAGCGGGTAAACATCGAAGCGATAAGGCTAAGGCGTCACCCAGTTCAGCTCAACAACTTAACTTCAAGTCTGCCCATAAATTGGTGGGCCATGACCAGTATACTCGCAAAGGCATTACGCCTCCAGACGCTGACGGTTTCAGGCGCAAACCCATCAGGATCACAGCCCCGCAAGATGAGCGATGCCCTAAACGCCAAAGGTGACTGGGCTTTTCGTGACTCCCGTAATTCCCCTGCAAGGAGCTGTGGACAGACGTCACATGGAGACAGTACATCTTGGTGAGCCCTTGGCTCGCGTGACCGTATGGTCTAGAGAATTCACGCGTGTGCCAGACTCGCACGGTCAGTAGTCAGGTCCGTCAAATATCTGCGACCGAGTTTGGAGCAGAAGTTGGGGCACGCAAAGTGGGCCGGCATGTGAGTAATGAGGAAAGGACACTGATGTTGAGGGTGCAGCTATTTAAGAGATCTCCTTCGGAGTTATTGACCCGCTCTAGAGTACGGCGGGAATACCTCGAAGTCCCTATTGAGCCTTACATTAACGGATCTGTGCCAAAAGTTCGGACATGGGACATCCGGCGCTGGGCGCCCTCGTGTAATGCGCGTTGTAGGAATCCAGTGAAAGTGTATTCCATAGAATGTCGATAACAGTAAACCCCGCCGTCTACTCGTGAAACAGGACTTTATGCAATTTCTGCCATGGATGAGCGCTGGGTTAAGATAGCTATCTCGATAATCGAACGTTAATCCCCTGGTCAGACGGGTAACATCAATTTCTTCCCAAATACGCTACATTCTTCTACGTGTCGGCTTGGAAGGAGTCCTCTCGTTCAAAATGATATATCCAATGAGTTAACATCCATTTCCGCGGGCTGCGAAACCACCCGCTTTAATAAACTTGGTCTAGATATTCACCAAGCTTCTTACCAACACGCAAGTACTGCAATTTCGCGAACCCTGAATCTGATAGGTGGAACATATGGAGCCCTGACAATGCGATGATGGGGGGTCGACTAAGGGGCATATGTGCTTTCCAAGCACTGGGTAGTGCAACAAAGAATGCTAATACTCTGAGTGCGGGGTCGCGGACATCCTCCTGACAGGTACTCCGAGCGCCCGGTATTACTTGAAGACACACTAATCCGAAGAACCTGGCCATCTAATAATTGGCCGCTGTTGGCGGACCTTAGGCACACAGTTTCTCTGCTTCCCGAACGTACGAGAGCTTGCCTGAGACGCTAGTACCAAGGTGGAATATCACACTCTTGGAATGGGAAACTTGGCCTTTTAGCCCTCTCTGATGTCACAGAAGCCGATGAACGGCTACAGCAGATGTGTGATCATGTACCTTCAGCCTAGGATCGTCTAGGGCCTGAATCTAGCAATAGAACTTAGATAGGTGATGACTAAATCCGTACTTAGGTTCTAGGAAACCGTGTGACACTATGGGCCCACAGACAAGGGCACGATCAATAGAACCGGGATTTATCTACTTTGAGTTGCTCAACATCTACCAGATGTTAATCGGTTGTGGGATTCCATATCAGTTGGACTATTAGATGTCATGCAAAGAAATGGCGCCCGCGATAACCAGTTCCTAACACTGTTGACAGAGAACAAACTCTCCTTCGGGTTGCTATATTTCTAGAAAACAAGATTGTGCGAAGAACATGTGTGTATGTTGTGATATCCTGTCTGTAAGCAGACCTTAAATCATGCGCTTGCAGGTGCTCTACATCTTACGATGCGTTATGGACTTTCATTCGTTTAATTGTGCGCTGCCCGCTTTACTGATGGGGATGAAATTTAGTGCTGGCTTTAACA
CCCGAGGCAACTACGTATAGAGTAACATTTTACGAACGATAGGGTAACAACGCGCTGGAACGTTGGATCAATGAGCGCTGATCCGGGGCTAGTACTGGCGTATGAGACTTTACTCGAGGGCACGCGACACCGCTGCATCTACATGGGTCACTGATACATGTGATTACTTTAAAGTAGTGTAAACCGCTGGCATTCCTTCAGACTGGCCGGAATTCGACCCTCGTGGAGATCTGTCCTACAGGTCTCCAAAAATGGGGGTATTCTACCGATCAGAGGCCGCAAGCTATTCATGTATGGCGGCATTGGATATCCTAAATTCGTATCCAGCCGCTAACGAATGAGTCTTTCGCCGTTTTCCGTCTCAGATAATTGTCTTCCTGTAGTTAAATAAGCAATCCTTCTTACTACGGCCGCTCTAGCGATCGATGGGAGCCGGCCCCCCCGCGTGTTCATCACTCAACGTCAGCAGCGTAAGTTAGTAATGTTAGATGAGAGCTCTTCGTGTGATAACTATTATTATTTGCAAGTGCCAGGTC
    131 8 7 GAGCAACCCAGCCTCAGAGAGGCGGCCACGGGAACGACGCCTAAAACTATCTGCGCTCCCGTCGGTCACACATCGAGTTAATAATAAGCTGTTACACTAGGATAGCGCAGTGCGCTTGTTTGGTGGTTTCAAAGAGGTTCCAGTCAGCCCACAGTCTGTCGCCGTCAAATTCTTTCCCACTATTTAAGGAGCTCATCGGGCACATGATGATGTAGCACGGAAGTTCTGTAGT
    295 44 7 TTTGCGGTGTAAGCATGACTGACAGATGAGTTGGTACCGTATACCAAGTACTAGCGACTAAGGTCTCCACGGAACACTGCAAAGACCATAGCGCACGTAAGTGACATGTCCTGGGTGTCGAAGTACGCTACAATCCACGATGTAGTGTGATCCCCATCGATGCATTGGTCTAGCATGGTCATATAAATAGCAGCCCGAACCGGGGTTCCCATGTGACCGCGACTTCGTCTATGTAATGATGACTTGCTTCGAATAGATACGAAGCTGGACCGAGATCAGGTCGTCTGCTGGCAGGGAACCGGGAGCGTCGAGAGGGGATACCGCGCTTTATATAAAAAGCAGCGAGCTAAAGCGGGACGTTTGGGCATGTATGGGGCTGTTGTCGCTGTCTTGTCAAGTTTAAAATGTCGCGGGCTAGCCTGAGTTAAATTTCGACATAGCTGCATGGATTATTTATACTGTCCAATGGCCAAGAGACCATAAACAGTGGGGGCACCCTTCCGATGCCCTGAACGCTGGAAGATTAAACCGCAGTGTTAGAGGCACTGAGAGTTCTAGCTGATGCAGTGCGCACGGCGGCACGCCCCGACTTTGGCTATTGGAGACCCGAACCGGACAGCTCTTATCGCATTTAGCCAGCGTATGAGTGACCCAGAGGCATGATGTTGTTGCTAAAGACCAGTCACCTCCACTAGCCTGCACCGAGCTTATTTCGTCAGGCCCGGTGTGGAGAATATGAGCACCCCCCATCGCAAACTTCTGAGACTCGACCTACGGGCCCATCATTCAAAATCCTGGAGGAAGATCGAGGAGGGATCAACACGGTCGATAGCCGATCCATTAAAAGGATGAATATCAGCGTCACAGCCCTGATCCTGGATAGACACCACTTCGAATTATGTAAGTCGGGAGCGGCCGGAGCTCAGGCCGCCTTGCAGCTGCTCGAGAGCAAAGGCTAAGAGCCACTGTGCGAGCAGTCATTCCTGTAGTCTTCACCACCTTTAGGACCACGTGTTCGACGCTACAGGACTCGTATAACAGGGATCGCCGATACGGATTTGTTCCTGACTCTCAGTTAAACGATATCTCGGGGATATACAGTTAGGTACATCCTGCGTACGCTCTTCGCTTATTGCGGTTTTGACTGAGCACCGTAAGGAGTAAGGCAAAGGTTGCAACCGGCGAGTTAGGGTGGAGTCTTGATATTGTGCGCGGCGTATCACAACCTTAACGTCACGCACGGCGCCTTCGGTCACTCAATCTGTATTTACAGCCC
    216 29 9 ACTAGTATTTAGCCGAGGTGATCATAGGTGATGCACCAGTGTATACATTGGTCGGTTGCTGGCCTCAAAACCAGCCTGCTGCCTGGACCCAAGTTGGTCTAGAAATTTTCGTAGCCGTACAAAGAAAACTATTAGTGTATGTGTCAACTCTAGGTAGGGGCCCCGGTTGCCCTGATTATAGTAATGTCCTTCATCAGAGTCATAACATAGTTCCCGAGGGTTCTGAGTAGACAGACGATGTATTACGTAGCTAAATTTAATTGCACGACGTCTATCATTCCCCCTTCCGTCAGCTTCCGTGAACTCCGGGTACCGAGAATTCGCTGACAAACTTAACGGCCAGGAGAAAGGCCCGTAGCCGGTTGTGGCGCCACAAACAAACCGCCCCCGCTGATTGTCATTAAGTGGCCGAGTTTAAACAGCTCGCCGTCGAATGGTATGCTGGCGCGAAATTTTGGTATTCATGACTCCTGAGTGCTAAGTAACCGGAATTCTTTGAGAAGTGCCGACATCATAAACCGGGATTTACAAGAACTGGACACAGTCCGCAGTAACGGTTACATGTCCTACACCTGAACGGAGGTGATGTCAACTACGCTTCACGATCCCAGAGTCTGAGAGCCCTAGAAGTAATACAGCACTCCGTGTTGATCAGACCTTCGAACAGGTTAATTTAGAGGTAGATCAGTACCCGAATAGGGCAAAACGACTCTAAGAGACTAGCATGACGGAATACTTGCTGAGCCACAATTTGTGGTCGAGGAATCACGTTTGGACCGATGCCTTCGCTCCCAGCGAGTCTAGAGGCTTCGCAGAAGACTTCTCAGCTCGGGGCGACGTAGAGCTCATTGGAAGTTCTGCCTGTGGCGCTTTCGCGTTTAGTACCAGCCGCGTAGGGTGGTCCCTAAGAATACATCCGTGGGTCGACCATGAATTGGGGTCGAATGCTGATACAGCGCACGAGAAAGTTGTTGTCGTCTGATACTTGTCATTCTTGTTTCAGTCTCCTGATCTAAGCAGACCTTCCGCCATTTCGAATCTATTCAGTTAATTAGTTCTCCTATAATGGGTAC
    766 73 6 CGCATGTTTGTCGATTGTTAAAGCGGATGTATGTTGTTAGTTTCCTTAAAGTGGCTCCCGGAAGTACACCGACTGATCTCGATCAATCATAAACTGTAGAACAGCGCAGGCCCCTTCAGCATGTTTGCAGCGGGACGAGGCATCTGTTTCACTACGCCAACTTCCGTCGGTTTTATTCTTCTCACGCTGGCGTCCATCCTGACGCAGTTCTGATAAGAGCTATGCGTTATTACATACCAGCCGGCGAGAGTTGGGCAGTAGAGAGAACCTTCGGAAGCTCCTTCCTGACCGGGGCAATTCTTTGCGCATAGTCATTCCGTTATCAACTTCGCTAACCCAATCAGCTTGCGGGCAGAACGACCGGCAAGCCTTCGCGTTGGTAAGGGTTCTAATGATGTATAATTAAGTCCCAGCCTGTTGGTTACTCAGAATTGTAAACATGTGTCGCGTAGTCAGCTCATGGCTGCACATACGGTTCCGTTCATTTCGGGCTGAAAGACGGGGCCTCTTCTAAGCTTATGCACTTTGAGCTACGACTGTACCGAACGGAATTACCAGTATTCCGGACCCATGCGTAATCTCCACCGGATAATGATCTTGCATGACCGCCTGTGGATTAGGAAACGGCTAAAACAATGCTGTGAGTCGTCCACCCAGTCCGCATAAGGCATCCAGAATTAAGGGCTGTATTGTTGGCATTTACGCAATTCACCTGATACTACGAGTGGGAGACCGGGGCGTACGTCTCCAGGATTATTCTAACTACGTGCATTAAGATAAGTGCTCGCAGATGACCTCGTAGGTGTGGTTTCCGTTGTAAACCGAATGGGATCCCATAATGCGCAATTCGGTTACCAACATGACGGGGATACCCTATGCGAATCAGCCAAGTCGGATATCGCCGCGGCATAACTCCATCGTCGGGATATGTCTCATCCGAACAAGTCAAATCTCTCCGCGCCCTGTATCAATCCGTTCCTAACGAGTCGTTTTTACTCAGCCAATCTTCAAATGACAGGACTCATGTAATTAGCCAACCTACGGGGGGTTTCATATTTCGCTAATTTTGCCAGGGGCAGGAGGATAACTTACAGAACTATCGGTGTAACCAATTACAATTACCTGCGCCCTAAACTGCTGCGACGGACTGTATCTTCGGGGAATTGCTTATGAGAACTCTGTATCGACAGTATTTCAAGCACTAGCTTGCCCCGATACCAGGTGAATAGAACAGAGGTCAATACATTCCTTCAACCTAGTACGCTCAAATTAAATTTCGGAACATCCCTGTGGTATGCTTCGTTTCACCAATGAAAAGGTACAGAATGGTAAACATCGCTGCAGAAACTACCAATGAAACTTTTTATTCATAAAGCGGTGACTGGCCGTGTGAGCGACTGCGGCGCACCGATAGACCAAGCACGATAATACGACAAATTCCAGGCACCGAACACTAACCACCGATAGGATTGCTCCGCGCGCACTTTTGGAAGTGTCTATACATTCTTGATGAATCCGACACTAGCCGCGCTTACTGGAATTTGATCTTGCTCCATCGGGGCCTGTGGTAACTGGTGGATTCTCAGGTACCCCTAGTCAGGCTCCTGAATTTATGAAATGTGTCTGTCGATTTTCGGTGGTTCTAACAGCGAGCAAGACTTGGCCATTGGGCTGGCGGCAATAAAAAAGGGACGTGAGTTACCAGCGGGCTGCGGGCCCTACGCAAAGGCCAGGGTTAGCGAGACTCGTGTTACAAAATGGGGACTGTTCTGTTGGATAGTGCCAGACATAGGCGATCGAGTTAATCATACTTACAGGACCAAAA
    594 94 3 GACGCGACAACTCAGGATCACGACATCCCAAATAATGTCACCTCAAAAAATTTTGACCTGTTGGACGCACATTATAGTGTTTCGCTATCGTCCCGTTTGGCCCTGAAAATCTACACTATATGAATCCCGGAAGCCCGAACGATAGGTCAA
     
    Last edited: May 31, 2009
  3. May 31, 2009 #2
    http://www.cplusplus.com/reference/clibrary/cstdio/scanf/ is how to use scanf. Just read it until you understand it.

    First you want to read the 3 numbers at the beginning of the line. Then malloc a char * large enough to hold the A, T, C, and G characters, based on the numbers you read. Then read the rest of the line into your char *, and process it.

    Using redirection means that your program is getting input from "standard input" and giving output to "standard output" which are similar to files, only they don't have names on disk. All it means in your program is you use printf and scanf instead of fprintf and fscanf.

    If you are being assigned this, you should have already learned how to do things like access characters within your DNA string and use printf; if not, you need to start reading a C book from the beginning.
     
    Last edited by a moderator: Apr 24, 2017
  4. May 31, 2009 #3
    We haven't learned malloc yet. Yes, we learned about printf and scanf, but we didn't spend much time on redirection. I will go over it myself, but what is the format used for redirection, C code wise? I know that without it you have to use FILE *(file name), (filename) = fopen(filepath), and the same with fclose. But what's the syntax for redirection? Can you guys explain another way of reading the first 3 digits of the line as the sample ID, then the next part as a DNA sequence, and how far to go before it knows to move on to the next line? Also, how do I program it so it knows each line is a separate DNA sequence? Thanks a lot guys.
     
  5. May 31, 2009 #4
    Without redirection, you DON'T have to use the FILE * stuff, you just use scanf and printf. As for the rest, it sounds like you need to learn things about C that I can't teach you in a post or two; read a C book and read documentation about the functions you need to use.
     
  6. May 31, 2009 #5
    I know my documentation. I was wondering: after it's done reading one line, it automatically goes to the next, right? Does it know again that it's reading the sample ID?
     
  7. Jun 2, 2009 #6
    I just saw what my edit did, my bad guys. For some reason I can't edit it out :S Can a mod delete this thread?
     