How to Write a C Program for DNA Sequencing?

In summary, a DNA sequencing program is a software tool used to analyze and interpret the genetic code of an organism. Sequencing breaks a DNA sample into smaller fragments and reads them; the software then assembles the reads or compares them to a reference genome. Such programs are used in genetic research and medical diagnosis, and they process data from technologies such as Sanger sequencing, next-generation sequencing (NGS), and third-generation sequencing (TGS). Accuracy is generally high but varies with the technology and the sample quality.
  • #1
DanielT29
Guys I need major help with this assignment to be written in C code. I'm a beginner at this, so please explain what I should do in detail. Thanks!

Caveat
The solution to this assignment should be a C program necessarily structured as a main function.

The problem

Suppose a DNA strand consists of a sequence of genes. Each gene is a sequence of four types
of nucleotides: Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Moreover, a gene has
two well defined, adjacent regions: a coding region called exon, located in the beginning of the
gene, and a non-coding region called intron, located in the immediate tail of the gene. If a gene
contains h nucleotides in the exon region, then the length t of the intron region (i.e., the number
of nucleotides in it) can be determined as follows:

t = 3h + 1

For instance, Figure 1 shows a DNA strand of a fictitious organism. It contains 2 genes and each
gene has 3 nucleotides in the exon region (shown underlined in the figure).

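GATAD{GAATGCC   (First Gene)
CCTCGTAGTTGAC   (Second Gene)

Figure 1: A DNA strand with 2 genes; the first 3 nucleotides of each gene form the exon region.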
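The layout described above can be sketched in C as follows. This is only an illustration of the formula t = 3h + 1, not part of the assignment text; the function names gene_length and is_exon are made up for this example, and positions are assumed to be counted from 0 at the start of the strand.

```c
#include <assert.h>

/* Total nucleotides in one gene: h exon bases followed by
   t = 3h + 1 intron bases. */
int gene_length(int h)
{
    assert(h > 0);
    return h + (3 * h + 1);
}

/* Returns 1 if the base at 0-based position pos of the strand
   lies in an exon region, 0 otherwise. */
int is_exon(long pos, int h)
{
    assert(pos >= 0);
    return pos % gene_length(h) < h;
}
```

For Figure 1 (h = 3) each gene spans 13 positions, so positions 0 to 2 and 13 to 15 are exon bases. In the second gene, CCTCGTAGTTGAC, the exon is CCT. If the solution has to live entirely in main, the same two expressions can be written inline.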




In view of the above, your C main function should:

1. scan a data file containing samples of DNA strands of (fictitious) organisms: in this data
file, for each sample there is a line of data with the following fields separated by blanks -
sample ID (integer), number of genes in the DNA strand (integer), length of the exon portion
of each gene (integer), a sequence of characters representing the DNA strand. Note: you
should NOT open a file to read data from it using C commands fopen and fscanf. Instead,
you should set up your C development environment (Quincy) to take a data file as the source of input, and use the C function scanf to read the data. If you do not know how to do that, you should consult the Discussion Board, your TA, or your Instructor.

2. determine the type of the DNA strand: the DNA type depends on the mean number and
type of nucleotides in the exon regions of the genes present in the DNA strand, as shown in Table 1:

DNA TYPE | CRITERIA
1 | C<A<T<G
2 | A<C<,G>0
3 | otherwise

*The underlines on the letters denote the mean (average), and an underline on < means "or equal to", i.e. less than or equal to. (I had to retype the table since it wouldn't copy properly from the page I was viewing it in.)


Table 1: DNA types and criteria. In the Criteria column, A, T, C, and G denote the mean number of the respective nucleotides present in the exon regions of a DNA strand (i.e., A denotes the mean number of Adenine in the exon regions of the strand).
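A minimal sketch of the type 1 test, assuming (as the poster's note says) that the underlined < in the handout means "less than or equal to". The function name is_type1 and its parameter order are made up for this example. The type 2 criterion is not legible in the table as posted, so it is left out here; it would follow the same pattern once taken from the original handout.

```c
/* Type 1 test from Table 1: C <= A <= T <= G on the exon means.
   C has no chained comparisons, so each link is written out
   separately and the links are joined with &&. */
int is_type1(double mA, double mT, double mG, double mC)
{
    return mC <= mA && mA <= mT && mT <= mG;
}
```

Writing `mC <= mA <= mT <= mG` would compile, but it compares the 0-or-1 result of the first comparison with mT, which is not what the table means.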

3. produce separate printouts for all the samples contained in each data file: each printout
will present for each strand, on separate lines, the ID number of the strand, the values of
A, T, G, and C, and the type of the strand. It should also include the name of the data file.
The data should be labeled and formatted exactly as shown in Figure 2.

Data File: dataFile01.txt

ID mA mT mG mC Type
684 0.27 0.30 0.20 0.24 3
465 0.26 0.25 0.24 0.24 3
131 0.21 0.21 0.30 0.27 3

* I typed up only 3 lines to save time; there were about 7. This is not the data being read; it is an example of how my program should print out the data.

* The 3-digit integers are the sample IDs, the decimal numbers are the mean values of each nucleotide, and the single-digit integers (all 3s) are the type numbers. Sorry guys, the columns are not aligned properly.
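One way to produce a row like those in Figure 2 is a format string with %.2f for the means. This is a sketch only: the helper name format_row is made up, and the single-space column separation is an assumption, since the exact column widths of the handout are not preserved in the post. The same format string works directly with printf; snprintf is used here only so that the result can be inspected as a string.

```c
#include <stdio.h>

/* Writes one result line into buf: the ID, then mA mT mG mC with
   two decimals each, then the type. Returns the value from snprintf
   (the number of characters the full line needs). */
int format_row(char *buf, size_t size, int id,
               double mA, double mT, double mG, double mC, int type)
{
    return snprintf(buf, size, "%d %.2f %.2f %.2f %.2f %d",
                    id, mA, mT, mG, mC, type);
}
```

With the values from the first row of Figure 2, the buffer holds the text 684 0.27 0.30 0.20 0.24 3. To line the columns up, a field width can be added to each conversion, for example %5d or %6.2f.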



* What I'm having trouble with is: how do I know which parts of the DNA sequence are A, T, G, and C? If I don't know this, I can't compute their mean values to assign a type. Also, how do I write a program that knows when it's done reading one line of a DNA sequence and then moves on to the next as a brand new one? How do I print out my values like in Figure 2, and how do I use printf and scanf to read and write my data (is this redirection)?

Thanks guys; if I have some idea of this I can start on it. Right now I have a big programmer's block preventing me from even starting, and I'm stuck on int main(void). An algorithm would be nice (not the C code itself). I will be updating you guys on the code I write, and I hope to have it done by Monday night (June 1st). Thanks a lot guys, I appreciate your help.



*EDIT May 31 2009 11:56pm (EST)

I found a file that came with the assignment that I had overlooked; maybe you guys can help me better by looking at it. My program basically has to read this file in.


684 30 8 ATCAGAGGGGATTTCGCCGGTCTATCGAAGCTGAATTCATGTATTAACGATTACAACGAATCAGTAAAGAGTCCTTAGACGGTGATACAGCACGGTGGGTCGTGCTTTAGCCTTTTGCTTTCACTGTTCTAGTGATAATGAGGCTCGAAACTCCTGACCATATAAATGAGTACATAGGGACCCAAGGATAGCTATTCTTATTTACATGTATACGCACTCTCCACCTGCAAAGTCCTTTAGCAGATCCCCCACATGTCTCTATTAACCTAGTATATCCGCTTTTCATGGCTGATCCAACGTAAGGATCTCAGTCGCTCTGGGGTAGAAGTCGCCAATGGGCGTAAACGTAATTTGTTCCGGATTCATATTAACGTAATATAGCAACCTCCGAAACACAATGCGTGAGATTACCTATGTGCTTAACTCTATTTACATCGTGAGAGTTCCGGCAGTTAAGACAGCCCTCTAGTGGAAGGGGCTCCACCACAAATTTGTCTCCGCTTGAGAGAAATTGGATCGACCGTCCGTGAGGACCCGCCGCTGTTCACAGCCAAAGTAAAATGGTATAAAACCGGCGGTATCACTCAAACTTGCCCCATCATCTAAATGAGGCGATAGAATAGGCCTCACTCCTTTTTCGGGCACCCATGAACCCCCACGCCGTACTTACTCGCGAAGTCCCAGTAATTAGACAGCCGTGGGAGATCGTGAGGTCTAAGCGCCCGCACTCAACCATGGGACTGCGAAAGATAGAATAATCTGACATCCGAGAAGTTCCTCGATCCGAAGACGAGAAAGTTCTCAAGCGGCTACCGAACCTTCCTCTGCTGGACAGGTGCTGCGGTCCCGAAGTTAGCCCGTCTCATGAAAAACCAAACGCCTTGCTTTCAGTTATTAACGTCATCTGACAACCCGAAGCATTAGGTGAGAACGCCCCGGCGCCTTGCGCCGGTCCTGATGTTCTGCTCAATCCCCGTAACCTGCGAGGCC
465 87 7 GTGTAGTTGCTCCGACAGACCCGAAACCCCCAACTTTCTGGAACTTTTCTGTATAGCAGGGCCTAGATATTCCGAAATCTATGCTGCCCTCTCTCCATGATTCGCGTTAGCACTTTACATCTACCCCAATAGAATGACATTGGGCTCTACCCTCACGGGTCACATAGCGGGTAAACATCGAAGCGATAAGGCTAAGGCGTCACCCAGTTCAGCTCAACAACTTAACTTCAAGTCTGCCCATAAATTGGTGGGCCATGACCAGTATACTCGCAAAGGCATTACGCCTCCAGACGCTGACGGTTTCAGGCGCAAACCCATCAGGATCACAGCCCCGCAAGATGAGCGATGCCCTAAACGCCAAAGGTGACTGGGCTTTTCGTGACTCCCGTAATTCCCCTGCAAGGAGCTGTGGACAGACGTCACATGGAGACAGTACATCTTGGTGAGCCCTTGGCTCGCGTGACCGTATGGTCTAGAGAATTCACGCGTGTGCCAGACTCGCACGGTCAGTAGTCAGGTCCGTCAAATATCTGCGACCGAGTTTGGAGCAGAAGTTGGGGCACGCAAAGTGGGCCGGCATGTGAGTAATGAGGAAAGGACACTGATGTTGAGGGTGCAGCTATTTAAGAGATCTCCTTCGGAGTTATTGACCCGCTCTAGAGTACGGCGGGAATACCTCGAAGTCCCTATTGAGCCTTACATTAACGGATCTGTGCCAAAAGTTCGGACATGGGACATCCGGCGCTGGGCGCCCTCGTGTAATGCGCGTTGTAGGAATCCAGTGAAAGTGTATTCCATAGAATGTCGATAACAGTAAACCCCGCCGTCTACTCGTGAAACAGGACTTTATGCAATTTCTGCCATGGATGAGCGCTGGGTTAAGATAGCTATCTCGATAATCGAACGTTAATCCCCTGGTCAGACGGGTAACATCAATTTCTTCCCAAATACGCTACATTCTTCTACGTGTCGGCTTGGAAGGAGTCCTCTCGTTCAAAATGATATATCCAATGAGTTAACATCCATTTCCGCGGGCTGCGAAACCACCCGCTTTAATAAACTTGGTCTAGATATTCACCAAGCTTCTTACCAACACGCAAGTACTGCAATTTCGCGAACCCTGAATCTGATAGGTGGAACATATGGAGCCCTGACAATGCGATGATGGGGGGTCGACTAAGGGGCATATGTGCTTTCCAAGCACTGGGTAGTGCAACAAAGAATGCTAATACTCTGAGTGCGGGGTCGCGGACATCCTCCTGACAGGTACTCCGAGCGCCCGGTATTACTTGAAGACACACTAATCCGAAGAACCTGGCCATCTAATAATTGGCCGCTGTTGGCGGACCTTAGGCACACAGTTTCTCTGCTTCCCGAACGTACGAGAGCTTGCCTGAGACGCTAGTACCAAGGTGGAATATCACACTCTTGGAATGGGAAACTTGGCCTTTTAGCCCTCTCTGATGTCACAGAAGCCGATGAACGGCTACAGCAGATGTGTGATCATGTACCTTCAGCCTAGGATCGTCTAGGGCCTGAATCTAGCAATAGAACTTAGATAGGTGATGACTAAATCCGTACTTAGGTTCTAGGAAACCGTGTGACACTATGGGCCCACAGACAAGGGCACGATCAATAGAACCGGGATTTATCTACTTTGAGTTGCTCAACATCTACCAGATGTTAATCGGTTGTGGGATTCCATATCAGTTGGACTATTAGATGTCATGCAAAGAAATGGCGCCCGCGATAACCAGTTCCTAACACTGTTGACAGAGAACAAACTCTCCTTCGGGTTGCTATATTTCTAGAAAACAAGATTGTGCGAAGAACATGTGTGTATGTTGTGATATCCTGTCTGTAAGCAGACCTTAAATCATGCGCTTGCAGGTGCTCTACATCTTACGATGCGTTATGGACTTTCATTCGTTTAATTGTGCGCTGCCCGCTTTACTGATGGGGATGAAATTTAGTGCTGGCTTTAACACCCG
AGGCAACTACGTATAGAGTAACATTTTACGAACGATAGGGTAACAACGCGCTGGAACGTTGGATCAATGAGCGCTGATCCGGGGCTAGTACTGGCGTATGAGACTTTACTCGAGGGCACGCGACACCGCTGCATCTACATGGGTCACTGATACATGTGATTACTTTAAAGTAGTGTAAACCGCTGGCATTCCTTCAGACTGGCCGGAATTCGACCCTCGTGGAGATCTGTCCTACAGGTCTCCAAAAATGGGGGTATTCTACCGATCAGAGGCCGCAAGCTATTCATGTATGGCGGCATTGGATATCCTAAATTCGTATCCAGCCGCTAACGAATGAGTCTTTCGCCGTTTTCCGTCTCAGATAATTGTCTTCCTGTAGTTAAATAAGCAATCCTTCTTACTACGGCCGCTCTAGCGATCGATGGGAGCCGGCCCCCCCGCGTGTTCATCACTCAACGTCAGCAGCGTAAGTTAGTAATGTTAGATGAGAGCTCTTCGTGTGATAACTATTATTATTTGCAAGTGCCAGGTC
131 8 7 GAGCAACCCAGCCTCAGAGAGGCGGCCACGGGAACGACGCCTAAAACTATCTGCGCTCCCGTCGGTCACACATCGAGTTAATAATAAGCTGTTACACTAGGATAGCGCAGTGCGCTTGTTTGGTGGTTTCAAAGAGGTTCCAGTCAGCCCACAGTCTGTCGCCGTCAAATTCTTTCCCACTATTTAAGGAGCTCATCGGGCACATGATGATGTAGCACGGAAGTTCTGTAGT
295 44 7 TTTGCGGTGTAAGCATGACTGACAGATGAGTTGGTACCGTATACCAAGTACTAGCGACTAAGGTCTCCACGGAACACTGCAAAGACCATAGCGCACGTAAGTGACATGTCCTGGGTGTCGAAGTACGCTACAATCCACGATGTAGTGTGATCCCCATCGATGCATTGGTCTAGCATGGTCATATAAATAGCAGCCCGAACCGGGGTTCCCATGTGACCGCGACTTCGTCTATGTAATGATGACTTGCTTCGAATAGATACGAAGCTGGACCGAGATCAGGTCGTCTGCTGGCAGGGAACCGGGAGCGTCGAGAGGGGATACCGCGCTTTATATAAAAAGCAGCGAGCTAAAGCGGGACGTTTGGGCATGTATGGGGCTGTTGTCGCTGTCTTGTCAAGTTTAAAATGTCGCGGGCTAGCCTGAGTTAAATTTCGACATAGCTGCATGGATTATTTATACTGTCCAATGGCCAAGAGACCATAAACAGTGGGGGCACCCTTCCGATGCCCTGAACGCTGGAAGATTAAACCGCAGTGTTAGAGGCACTGAGAGTTCTAGCTGATGCAGTGCGCACGGCGGCACGCCCCGACTTTGGCTATTGGAGACCCGAACCGGACAGCTCTTATCGCATTTAGCCAGCGTATGAGTGACCCAGAGGCATGATGTTGTTGCTAAAGACCAGTCACCTCCACTAGCCTGCACCGAGCTTATTTCGTCAGGCCCGGTGTGGAGAATATGAGCACCCCCCATCGCAAACTTCTGAGACTCGACCTACGGGCCCATCATTCAAAATCCTGGAGGAAGATCGAGGAGGGATCAACACGGTCGATAGCCGATCCATTAAAAGGATGAATATCAGCGTCACAGCCCTGATCCTGGATAGACACCACTTCGAATTATGTAAGTCGGGAGCGGCCGGAGCTCAGGCCGCCTTGCAGCTGCTCGAGAGCAAAGGCTAAGAGCCACTGTGCGAGCAGTCATTCCTGTAGTCTTCACCACCTTTAGGACCACGTGTTCGACGCTACAGGACTCGTATAACAGGGATCGCCGATACGGATTTGTTCCTGACTCTCAGTTAAACGATATCTCGGGGATATACAGTTAGGTACATCCTGCGTACGCTCTTCGCTTATTGCGGTTTTGACTGAGCACCGTAAGGAGTAAGGCAAAGGTTGCAACCGGCGAGTTAGGGTGGAGTCTTGATATTGTGCGCGGCGTATCACAACCTTAACGTCACGCACGGCGCCTTCGGTCACTCAATCTGTATTTACAGCCC
216 29 9 ACTAGTATTTAGCCGAGGTGATCATAGGTGATGCACCAGTGTATACATTGGTCGGTTGCTGGCCTCAAAACCAGCCTGCTGCCTGGACCCAAGTTGGTCTAGAAATTTTCGTAGCCGTACAAAGAAAACTATTAGTGTATGTGTCAACTCTAGGTAGGGGCCCCGGTTGCCCTGATTATAGTAATGTCCTTCATCAGAGTCATAACATAGTTCCCGAGGGTTCTGAGTAGACAGACGATGTATTACGTAGCTAAATTTAATTGCACGACGTCTATCATTCCCCCTTCCGTCAGCTTCCGTGAACTCCGGGTACCGAGAATTCGCTGACAAACTTAACGGCCAGGAGAAAGGCCCGTAGCCGGTTGTGGCGCCACAAACAAACCGCCCCCGCTGATTGTCATTAAGTGGCCGAGTTTAAACAGCTCGCCGTCGAATGGTATGCTGGCGCGAAATTTTGGTATTCATGACTCCTGAGTGCTAAGTAACCGGAATTCTTTGAGAAGTGCCGACATCATAAACCGGGATTTACAAGAACTGGACACAGTCCGCAGTAACGGTTACATGTCCTACACCTGAACGGAGGTGATGTCAACTACGCTTCACGATCCCAGAGTCTGAGAGCCCTAGAAGTAATACAGCACTCCGTGTTGATCAGACCTTCGAACAGGTTAATTTAGAGGTAGATCAGTACCCGAATAGGGCAAAACGACTCTAAGAGACTAGCATGACGGAATACTTGCTGAGCCACAATTTGTGGTCGAGGAATCACGTTTGGACCGATGCCTTCGCTCCCAGCGAGTCTAGAGGCTTCGCAGAAGACTTCTCAGCTCGGGGCGACGTAGAGCTCATTGGAAGTTCTGCCTGTGGCGCTTTCGCGTTTAGTACCAGCCGCGTAGGGTGGTCCCTAAGAATACATCCGTGGGTCGACCATGAATTGGGGTCGAATGCTGATACAGCGCACGAGAAAGTTGTTGTCGTCTGATACTTGTCATTCTTGTTTCAGTCTCCTGATCTAAGCAGACCTTCCGCCATTTCGAATCTATTCAGTTAATTAGTTCTCCTATAATGGGTAC
766 73 6 CGCATGTTTGTCGATTGTTAAAGCGGATGTATGTTGTTAGTTTCCTTAAAGTGGCTCCCGGAAGTACACCGACTGATCTCGATCAATCATAAACTGTAGAACAGCGCAGGCCCCTTCAGCATGTTTGCAGCGGGACGAGGCATCTGTTTCACTACGCCAACTTCCGTCGGTTTTATTCTTCTCACGCTGGCGTCCATCCTGACGCAGTTCTGATAAGAGCTATGCGTTATTACATACCAGCCGGCGAGAGTTGGGCAGTAGAGAGAACCTTCGGAAGCTCCTTCCTGACCGGGGCAATTCTTTGCGCATAGTCATTCCGTTATCAACTTCGCTAACCCAATCAGCTTGCGGGCAGAACGACCGGCAAGCCTTCGCGTTGGTAAGGGTTCTAATGATGTATAATTAAGTCCCAGCCTGTTGGTTACTCAGAATTGTAAACATGTGTCGCGTAGTCAGCTCATGGCTGCACATACGGTTCCGTTCATTTCGGGCTGAAAGACGGGGCCTCTTCTAAGCTTATGCACTTTGAGCTACGACTGTACCGAACGGAATTACCAGTATTCCGGACCCATGCGTAATCTCCACCGGATAATGATCTTGCATGACCGCCTGTGGATTAGGAAACGGCTAAAACAATGCTGTGAGTCGTCCACCCAGTCCGCATAAGGCATCCAGAATTAAGGGCTGTATTGTTGGCATTTACGCAATTCACCTGATACTACGAGTGGGAGACCGGGGCGTACGTCTCCAGGATTATTCTAACTACGTGCATTAAGATAAGTGCTCGCAGATGACCTCGTAGGTGTGGTTTCCGTTGTAAACCGAATGGGATCCCATAATGCGCAATTCGGTTACCAACATGACGGGGATACCCTATGCGAATCAGCCAAGTCGGATATCGCCGCGGCATAACTCCATCGTCGGGATATGTCTCATCCGAACAAGTCAAATCTCTCCGCGCCCTGTATCAATCCGTTCCTAACGAGTCGTTTTTACTCAGCCAATCTTCAAATGACAGGACTCATGTAATTAGCCAACCTACGGGGGGTTTCATATTTCGCTAATTTTGCCAGGGGCAGGAGGATAACTTACAGAACTATCGGTGTAACCAATTACAATTACCTGCGCCCTAAACTGCTGCGACGGACTGTATCTTCGGGGAATTGCTTATGAGAACTCTGTATCGACAGTATTTCAAGCACTAGCTTGCCCCGATACCAGGTGAATAGAACAGAGGTCAATACATTCCTTCAACCTAGTACGCTCAAATTAAATTTCGGAACATCCCTGTGGTATGCTTCGTTTCACCAATGAAAAGGTACAGAATGGTAAACATCGCTGCAGAAACTACCAATGAAACTTTTTATTCATAAAGCGGTGACTGGCCGTGTGAGCGACTGCGGCGCACCGATAGACCAAGCACGATAATACGACAAATTCCAGGCACCGAACACTAACCACCGATAGGATTGCTCCGCGCGCACTTTTGGAAGTGTCTATACATTCTTGATGAATCCGACACTAGCCGCGCTTACTGGAATTTGATCTTGCTCCATCGGGGCCTGTGGTAACTGGTGGATTCTCAGGTACCCCTAGTCAGGCTCCTGAATTTATGAAATGTGTCTGTCGATTTTCGGTGGTTCTAACAGCGAGCAAGACTTGGCCATTGGGCTGGCGGCAATAAAAAAGGGACGTGAGTTACCAGCGGGCTGCGGGCCCTACGCAAAGGCCAGGGTTAGCGAGACTCGTGTTACAAAATGGGGACTGTTCTGTTGGATAGTGCCAGACATAGGCGATCGAGTTAATCATACTTACAGGACCAAAA
594 94 3 GACGCGACAACTCAGGATCACGACATCCCAAATAATGTCACCTCAAAAAATTTTGACCTGTTGGACGCACATTATAGTGTTTCGCTATCGTCCCGTTTGGCCCTGAAAATCTACACTATATGAATCCCGGAAGCCCGAACGATAGGTCAA
 
  • #2
http://www.cplusplus.com/reference/clibrary/cstdio/scanf/ is how to use scanf. Just read it until you understand it.

First you want to read the 3 numbers at the beginning of the line. Then malloc a char * large enough to hold the A, T, C, and G characters, based on the numbers you read. Then read the rest of the line into your char *, and process it.
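A variation on this approach needs no malloc: since only counts are required, each base can be processed as soon as it is read and then forgotten. The sketch below keeps a running tally; the names Tally, tally_add, and tally_mean are made up for this example. The mean is computed here as count / (genes * h), which is an assumption: the values in Figure 2 add up to roughly 1 for each sample, which suggests fractions of exon bases, but the handout may define the mean differently.

```c
#include <assert.h>

/* Running counts of exon nucleotides for one DNA strand. */
typedef struct {
    long a, t, g, c;   /* exon counts per nucleotide */
    long pos;          /* number of bases of the strand seen so far */
} Tally;

/* Feed the next base of the strand. Only bases that fall in an exon
   are counted. A gene is h exon bases followed by 3h + 1 intron
   bases, so it is 4h + 1 bases long in total. */
void tally_add(Tally *k, char base, int h)
{
    assert(h > 0);
    if (k->pos % (4L * h + 1) < h) {
        switch (base) {
        case 'A': k->a++; break;
        case 'T': k->t++; break;
        case 'G': k->g++; break;
        case 'C': k->c++; break;
        default: break;   /* unexpected characters are not counted */
        }
    }
    k->pos++;
}

/* ASSUMPTION: mean = exon count divided by the total number of
   exon positions in the strand (genes * h). */
double tally_mean(long count, int genes, int h)
{
    return (double)count / ((double)genes * h);
}
```

In main, after `scanf("%d %d %d", &id, &genes, &h)` succeeds, `scanf(" %c", &base)` can be called exactly genes * (4 * h + 1) times, passing each base to tally_add with a Tally that was set to all zeros for that sample. The space before %c makes scanf skip blanks and line breaks, so the program always knows where one strand ends: it ends when the expected number of bases has been read.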

Using redirection means that your program is getting input from "standard input" and giving output to "standard output" which are similar to files, only they don't have names on disk. All it means in your program is you use printf and scanf instead of fprintf and fscanf.
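To illustrate what "standard input" looks like to the program: it is a single stream of characters, and line breaks are just whitespace that %d skips. The sketch below uses sscanf on a string as a stand-in for a redirected file, because sscanf applies the same conversion rules as scanf. The function name count_triples is made up, and the input is simplified to integers only (no DNA sequence between the triples).

```c
#include <stdio.h>

/* Counts how many "ID genes h" integer triples can be read from text.
   The string stands in for a redirected data file. */
int count_triples(const char *text)
{
    int id, genes, h, used, n = 0;

    /* %d skips leading whitespace, including '\n'. %n stores how many
       characters have been consumed so the next call can resume there;
       it does not count toward the return value. */
    while (sscanf(text, "%d %d %d%n", &id, &genes, &h, &used) == 3) {
        text += used;
        n++;
    }
    return n;
}
```

Given the text "684 30 8\n465 87 7\n131 8 7\n", the function finds 3 triples, one per line, without any code that looks for the line breaks. With real redirected input the loop would be `while (scanf("%d %d %d", &id, &genes, &h) == 3) { ... }`, which stops when the input runs out. On a typical command line, redirection is written as `myprog < dataFile01.txt` (myprog is a placeholder name); how to set the same thing up inside Quincy is not covered in this thread.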

If you are being assigned this, you should have already learned how to do things like access characters within your DNA string and use printf; if not, you need to start reading a C book from the beginning.
 
  • #3
We haven't learned malloc yet. Yes, we learned about printf and scanf, but we didn't spend much time on redirection. I will go over it myself, but what is the format used for redirection, C code-wise? I know that without it you have to use FILE *(file name), (file name) = fopen(file path), and the same with fclose. But what's the syntax for redirection? Can you guys explain another way of reading the first 3 digits of the line as the sample ID, then the next part as a DNA sequence, and how far to go before it knows to go on to the next line? Also, how do I program it so it knows each line is a separate DNA sequence? Thanks a lot guys.
 
  • #4
Without redirection, you DON'T have to use the FILE * stuff, you just use scanf and printf. As for the rest, it sounds like you need to learn things about C that I can't teach you in a post or two; read a C book and read documentation about the functions you need to use.
 
  • #5
I know my documentation. I was wondering: after it's done reading one line, it automatically goes to the next, right? Does it know again that it's reading the sample ID?
 
  • #6
I just saw what my edit did, my bad guys. For some reason I can't edit it out :S Can a mod delete this thread?
 

1. What is a DNA sequencing program?

A DNA sequencing program is a software tool used to analyze and interpret the genetic code of an organism by determining the order of nucleotides in a DNA molecule.

2. How does a DNA sequencing program work?

In a typical workflow, a DNA sample is broken into smaller fragments and the fragments are sequenced. The program then assembles the resulting reads or aligns them to a reference genome to determine the order of nucleotides.

3. What are the applications of DNA sequencing programs?

DNA sequencing programs have a wide range of applications, including genetic research, medical diagnosis, forensic analysis, and evolutionary studies.

4. What are the different types of DNA sequencing programs?

DNA sequencing programs are usually tailored to the technology that produced the data. The main families are Sanger sequencing, Next Generation Sequencing (NGS), and Third Generation Sequencing (TGS). Each has its own advantages and limitations.

5. Are DNA sequencing programs accurate?

Generally yes, but accuracy depends on the sequencing technology and the quality of the DNA sample. Sanger sequencing and short-read NGS platforms typically have per-base error rates well under 1%, while raw reads from long-read (third-generation) platforms have historically had higher error rates.
