How to Write a C Program for DNA Sequencing?

  • Thread starter: DanielT29
  • Tags: Dna, Program
The discussion revolves around writing a C program for DNA sequencing, specifically focusing on the structure of the main function. The program must read a data file containing DNA strand samples, where each line includes a sample ID, the number of genes, the length of the exon, and the DNA sequence. Participants are seeking guidance on how to identify nucleotide types (A, T, C, G) within the DNA sequence and calculate their mean values to classify the DNA type based on specified criteria. Additionally, they need help with using scanf and printf for input and output, as well as determining when to move to the next line of data. The conversation highlights the challenges faced by beginners in programming and the need for clear algorithms to facilitate coding.
DanielT29
Guys I need major help with this assignment to be written in C code. I'm a beginner at this, so please explain what I should do in detail. Thanks!

Caveat
The solution to this assignment should be a C program necessarily structured as a main function.

The problem

Suppose a DNA strand consists of a sequence of genes. Each gene is a sequence of four types
of nucleotides: Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Moreover, a gene has
two well defined, adjacent regions: a coding region called exon, located in the beginning of the
gene, and a non-coding region called intron, located in the immediate tail of the gene. If a gene
contains h nucleotides in the exon region, then the length t of the intron region (i.e., the number
of nucleotides in it) can be determined as follows:

t = 3h + 1

For instance, Figure 1 shows a DNA strand of a fictitious organism. It contains 2 genes and each
gene has 3 nucleotides in the exon region (shown underlined in the figure).

Figure 1:
First Gene:  GATAD{GAATGCC
Second Gene: CCTCGTAGTTGAC
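
One way to tell exon positions from intron positions follows directly from t = 3h + 1: every gene is h + t = 4h + 1 nucleotides long, so a position in the strand is in an exon when its offset within its gene is less than h. Below is a minimal sketch, assuming 0-based positions and a strand already stored as a C string. The names gene_length, in_exon and count_exon_bases are made up for illustration and are not part of the assignment.

```c
/* Length of one gene: h exon nucleotides followed by t = 3h + 1 intron ones. */
int gene_length(int h)
{
    return h + (3 * h + 1);
}

/* Returns 1 if 0-based position i of the strand falls inside an exon. */
int in_exon(int i, int h)
{
    return i % gene_length(h) < h;
}

/* Counts A, T, G, C over all exon positions; counts[] order is A, T, G, C. */
void count_exon_bases(const char *strand, int h, int counts[4])
{
    int i;

    counts[0] = counts[1] = counts[2] = counts[3] = 0;
    for (i = 0; strand[i] != '\0'; i++) {
        if (!in_exon(i, h))
            continue;               /* intron position: skip it */
        switch (strand[i]) {
        case 'A': counts[0]++; break;
        case 'T': counts[1]++; break;
        case 'G': counts[2]++; break;
        case 'C': counts[3]++; break;
        }
    }
}
```

For the second gene of Figure 1 (CCTCGTAGTTGAC with h = 3) only the first three letters, CCT, are counted, giving C = 2 and T = 1. How the counts become means is left open here: the sample rows in Figure 2 each add up to roughly 1, which suggests (but does not confirm) dividing each count by the total number of exon nucleotides.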




In view of the above, your C main function should:

1. scan a data file containing samples of DNA strands of (fictitious) organisms: in this data
file, for each sample there is a line of data with the following fields separated by blanks -
sample ID (integer), number of genes in the DNA strand (integer), length of the exon portion
of each gene (integer), a sequence of characters representing the DNA strand. Note: you
should NOT open a file to read data from it using C commands fopen and fscanf. Instead,
you should set up your C development environment (Quincy) to take a data file as the source of input, and use the C function scanf to read the data. If you do not know how to do that, you should consult the Discussion Board, your TA, or your Instructor.

2. determine the type of the DNA strand: the DNA type depends on the mean number and
type of nucleotides in the exon regions of the genes present in the DNA strand, as shown in


Table 1:

DNA TYPE | CRITERIA
1 | C<A<T<G
2 | A<C<,G>0
3 | otherwise

*An underline on a letter means the mean (average), while an underline on the < means "or equal to" (in this case, greater than or equal to). I had to retype the table since it wouldn't copy properly from the page I was viewing it in.


Table 1: DNA types and criteria. In the Criteria column, A, T, C, and G denote the mean number of the respective nucleotides present in the exon regions of a DNA strand (i.e., A denotes the mean number of Adenine in the exon regions of the strand).

3. produce separate printouts for all the samples contained in each data file: each printout
will present for each strand, on separate lines, the ID number of the strand, the values of
A, T, G, and C, and the type of the strand. It should also include the name of the data file.
The data should be labeled and formatted exactly as shown in Figure 2.

Data File: dataFile01.txt

ID mA mT mG mC Type
684 0.27 0.30 0.20 0.24 3
465 0.26 0.25 0.24 0.24 3
131 0.21 0.21 0.30 0.27 3

* I typed up only 3 lines to save time (there were about 7). This is not the data being read; it is an example of how my program should print out the data.

* The 3-digit integers are the sample numbers, the decimal numbers are the mean values of each nucleotide, and the single integers (all 3s) are the type numbers. Sorry guys, the columns are not aligned properly.
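
For a printout shaped like Figure 2, printf field widths can keep the columns lined up, and %.2f gives two decimals. The sketch below is only a guess at the layout: the column widths are assumptions, and format_row and print_header are hypothetical helper names, so the exact spacing should be checked against the real Figure 2.

```c
#include <stdio.h>

/* Formats one result row into buf and returns the number of characters
   written (as snprintf does). The widths are a guess, not taken from
   Figure 2. */
int format_row(char *buf, size_t size, int id,
               double mA, double mT, double mG, double mC, int type)
{
    return snprintf(buf, size, "%-5d %-5.2f %-5.2f %-5.2f %-5.2f %d",
                    id, mA, mT, mG, mC, type);
}

/* Prints the file name line, a blank line, and the column header. */
void print_header(const char *file_name)
{
    printf("Data File: %s\n\n", file_name);
    printf("%-5s %-5s %-5s %-5s %-5s %s\n",
           "ID", "mA", "mT", "mG", "mC", "Type");
}
```

After a call to format_row, printf("%s\n", buf) prints the row. The same format string can also be passed straight to printf if no buffer is wanted.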



* What I'm having trouble with: how do I know which parts of the DNA sequence are A, T, G and C? If I don't know this, I can't perform an operation to get their mean values and assign a type. Also, how do I write a program that knows when it's done reading one line of a DNA sequence and then moves on to the next as a brand new one? How do I print out my values like in Figure 2, and how do I use printf and scanf to read and write my data (is this redirection?)? Thanks guys. If I get some idea of this I can start on it; right now I have a big programmer's block in my head preventing me from even starting, and I'm stuck on int main(void). I need help guys, thanks. An algorithm would be nice (not the C code itself). I will be updating you guys on the code I write, and I hope to have it done by Monday night (June 1st). Thanks a lot guys, I appreciate your help.



*EDIT May 31 2009 11:56pm (EST)

I found a file that came with the assignment that I had overlooked; maybe you guys can help me better by looking at it. My program basically has to read this file in.


684 30 8 ATCAGAGGGGATTTCGCCGGTCTATCGAAGCTGAATTCATGTATTAACGATTACAACGAATCAGTAAAGAGTCCTTAGACGGTGATACAGCACGGTGGGTCGTGCTTTAGCCTTTTGCTTTCACTGTTCTAGTGATAATGAGGCTCGAAACTCCTGACCATATAAATGAGTACATAGGGACCCAAGGATAGCTATTCTTATTTACATGTATACGCACTCTCCACCTGCAAAGTCCTTTAGCAGATCCCCCACATGTCTCTATTAACCTAGTATATCCGCTTTTCATGGCTGATCCAACGTAAGGATCTCAGTCGCTCTGGGGTAGAAGTCGCCAATGGGCGTAAACGTAATTTGTTCCGGATTCATATTAACGTAATATAGCAACCTCCGAAACACAATGCGTGAGATTACCTATGTGCTTAACTCTATTTACATCGTGAGAGTTCCGGCAGTTAAGACAGCCCTCTAGTGGAAGGGGCTCCACCACAAATTTGTCTCCGCTTGAGAGAAATTGGATCGACCGTCCGTGAGGACCCGCCGCTGTTCACAGCCAAAGTAAAATGGTATAAAACCGGCGGTATCACTCAAACTTGCCCCATCATCTAAATGAGGCGATAGAATAGGCCTCACTCCTTTTTCGGGCACCCATGAACCCCCACGCCGTACTTACTCGCGAAGTCCCAGTAATTAGACAGCCGTGGGAGATCGTGAGGTCTAAGCGCCCGCACTCAACCATGGGACTGCGAAAGATAGAATAATCTGACATCCGAGAAGTTCCTCGATCCGAAGACGAGAAAGTTCTCAAGCGGCTACCGAACCTTCCTCTGCTGGACAGGTGCTGCGGTCCCGAAGTTAGCCCGTCTCATGAAAAACCAAACGCCTTGCTTTCAGTTATTAACGTCATCTGACAACCCGAAGCATTAGGTGAGAACGCCCCGGCGCCTTGCGCCGGTCCTGATGTTCTGCTCAATCCCCGTAACCTGCGAGGCC
465 87 7 GTGTAGTTGCTCCGACAGACCCGAAACCCCCAACTTTCTGGAACTTTTCTGTATAGCAGGGCCTAGATATTCCGAAATCTATGCTGCCCTCTCTCCATGATTCGCGTTAGCACTTTACATCTACCCCAATAGAATGACATTGGGCTCTACCCTCACGGGTCACATAGCGGGTAAACATCGAAGCGATAAGGCTAAGGCGTCACCCAGTTCAGCTCAACAACTTAACTTCAAGTCTGCCCATAAATTGGTGGGCCATGACCAGTATACTCGCAAAGGCATTACGCCTCCAGACGCTGACGGTTTCAGGCGCAAACCCATCAGGATCACAGCCCCGCAAGATGAGCGATGCCCTAAACGCCAAAGGTGACTGGGCTTTTCGTGACTCCCGTAATTCCCCTGCAAGGAGCTGTGGACAGACGTCACATGGAGACAGTACATCTTGGTGAGCCCTTGGCTCGCGTGACCGTATGGTCTAGAGAATTCACGCGTGTGCCAGACTCGCACGGTCAGTAGTCAGGTCCGTCAAATATCTGCGACCGAGTTTGGAGCAGAAGTTGGGGCACGCAAAGTGGGCCGGCATGTGAGTAATGAGGAAAGGACACTGATGTTGAGGGTGCAGCTATTTAAGAGATCTCCTTCGGAGTTATTGACCCGCTCTAGAGTACGGCGGGAATACCTCGAAGTCCCTATTGAGCCTTACATTAACGGATCTGTGCCAAAAGTTCGGACATGGGACATCCGGCGCTGGGCGCCCTCGTGTAATGCGCGTTGTAGGAATCCAGTGAAAGTGTATTCCATAGAATGTCGATAACAGTAAACCCCGCCGTCTACTCGTGAAACAGGACTTTATGCAATTTCTGCCATGGATGAGCGCTGGGTTAAGATAGCTATCTCGATAATCGAACGTTAATCCCCTGGTCAGACGGGTAACATCAATTTCTTCCCAAATACGCTACATTCTTCTACGTGTCGGCTTGGAAGGAGTCCTCTCGTTCAAAATGATATATCCAATGAGTTAACATCCATTTCCGCGGGCTGCGAAACCACCCGCTTTAATAAACTTGGTCTAGATATTCACCAAGCTTCTTACCAACACGCAAGTACTGCAATTTCGCGAACCCTGAATCTGATAGGTGGAACATATGGAGCCCTGACAATGCGATGATGGGGGGTCGACTAAGGGGCATATGTGCTTTCCAAGCACTGGGTAGTGCAACAAAGAATGCTAATACTCTGAGTGCGGGGTCGCGGACATCCTCCTGACAGGTACTCCGAGCGCCCGGTATTACTTGAAGACACACTAATCCGAAGAACCTGGCCATCTAATAATTGGCCGCTGTTGGCGGACCTTAGGCACACAGTTTCTCTGCTTCCCGAACGTACGAGAGCTTGCCTGAGACGCTAGTACCAAGGTGGAATATCACACTCTTGGAATGGGAAACTTGGCCTTTTAGCCCTCTCTGATGTCACAGAAGCCGATGAACGGCTACAGCAGATGTGTGATCATGTACCTTCAGCCTAGGATCGTCTAGGGCCTGAATCTAGCAATAGAACTTAGATAGGTGATGACTAAATCCGTACTTAGGTTCTAGGAAACCGTGTGACACTATGGGCCCACAGACAAGGGCACGATCAATAGAACCGGGATTTATCTACTTTGAGTTGCTCAACATCTACCAGATGTTAATCGGTTGTGGGATTCCATATCAGTTGGACTATTAGATGTCATGCAAAGAAATGGCGCCCGCGATAACCAGTTCCTAACACTGTTGACAGAGAACAAACTCTCCTTCGGGTTGCTATATTTCTAGAAAACAAGATTGTGCGAAGAACATGTGTGTATGTTGTGATATCCTGTCTGTAAGCAGACCTTAAATCATGCGCTTGCAGGTGCTCTACATCTTACGATGCGTTATGGACTTTCATTCGTTTAATTGTGCGCTGCCCGCTTTACTGATGGGGATGAAATTTAGTGCTGGCTTTAACACCCG
AGGCAACTACGTATAGAGTAACATTTTACGAACGATAGGGTAACAACGCGCTGGAACGTTGGATCAATGAGCGCTGATCCGGGGCTAGTACTGGCGTATGAGACTTTACTCGAGGGCACGCGACACCGCTGCATCTACATGGGTCACTGATACATGTGATTACTTTAAAGTAGTGTAAACCGCTGGCATTCCTTCAGACTGGCCGGAATTCGACCCTCGTGGAGATCTGTCCTACAGGTCTCCAAAAATGGGGGTATTCTACCGATCAGAGGCCGCAAGCTATTCATGTATGGCGGCATTGGATATCCTAAATTCGTATCCAGCCGCTAACGAATGAGTCTTTCGCCGTTTTCCGTCTCAGATAATTGTCTTCCTGTAGTTAAATAAGCAATCCTTCTTACTACGGCCGCTCTAGCGATCGATGGGAGCCGGCCCCCCCGCGTGTTCATCACTCAACGTCAGCAGCGTAAGTTAGTAATGTTAGATGAGAGCTCTTCGTGTGATAACTATTATTATTTGCAAGTGCCAGGTC
131 8 7 GAGCAACCCAGCCTCAGAGAGGCGGCCACGGGAACGACGCCTAAAACTATCTGCGCTCCCGTCGGTCACACATCGAGTTAATAATAAGCTGTTACACTAGGATAGCGCAGTGCGCTTGTTTGGTGGTTTCAAAGAGGTTCCAGTCAGCCCACAGTCTGTCGCCGTCAAATTCTTTCCCACTATTTAAGGAGCTCATCGGGCACATGATGATGTAGCACGGAAGTTCTGTAGT
295 44 7 TTTGCGGTGTAAGCATGACTGACAGATGAGTTGGTACCGTATACCAAGTACTAGCGACTAAGGTCTCCACGGAACACTGCAAAGACCATAGCGCACGTAAGTGACATGTCCTGGGTGTCGAAGTACGCTACAATCCACGATGTAGTGTGATCCCCATCGATGCATTGGTCTAGCATGGTCATATAAATAGCAGCCCGAACCGGGGTTCCCATGTGACCGCGACTTCGTCTATGTAATGATGACTTGCTTCGAATAGATACGAAGCTGGACCGAGATCAGGTCGTCTGCTGGCAGGGAACCGGGAGCGTCGAGAGGGGATACCGCGCTTTATATAAAAAGCAGCGAGCTAAAGCGGGACGTTTGGGCATGTATGGGGCTGTTGTCGCTGTCTTGTCAAGTTTAAAATGTCGCGGGCTAGCCTGAGTTAAATTTCGACATAGCTGCATGGATTATTTATACTGTCCAATGGCCAAGAGACCATAAACAGTGGGGGCACCCTTCCGATGCCCTGAACGCTGGAAGATTAAACCGCAGTGTTAGAGGCACTGAGAGTTCTAGCTGATGCAGTGCGCACGGCGGCACGCCCCGACTTTGGCTATTGGAGACCCGAACCGGACAGCTCTTATCGCATTTAGCCAGCGTATGAGTGACCCAGAGGCATGATGTTGTTGCTAAAGACCAGTCACCTCCACTAGCCTGCACCGAGCTTATTTCGTCAGGCCCGGTGTGGAGAATATGAGCACCCCCCATCGCAAACTTCTGAGACTCGACCTACGGGCCCATCATTCAAAATCCTGGAGGAAGATCGAGGAGGGATCAACACGGTCGATAGCCGATCCATTAAAAGGATGAATATCAGCGTCACAGCCCTGATCCTGGATAGACACCACTTCGAATTATGTAAGTCGGGAGCGGCCGGAGCTCAGGCCGCCTTGCAGCTGCTCGAGAGCAAAGGCTAAGAGCCACTGTGCGAGCAGTCATTCCTGTAGTCTTCACCACCTTTAGGACCACGTGTTCGACGCTACAGGACTCGTATAACAGGGATCGCCGATACGGATTTGTTCCTGACTCTCAGTTAAACGATATCTCGGGGATATACAGTTAGGTACATCCTGCGTACGCTCTTCGCTTATTGCGGTTTTGACTGAGCACCGTAAGGAGTAAGGCAAAGGTTGCAACCGGCGAGTTAGGGTGGAGTCTTGATATTGTGCGCGGCGTATCACAACCTTAACGTCACGCACGGCGCCTTCGGTCACTCAATCTGTATTTACAGCCC
216 29 9 ACTAGTATTTAGCCGAGGTGATCATAGGTGATGCACCAGTGTATACATTGGTCGGTTGCTGGCCTCAAAACCAGCCTGCTGCCTGGACCCAAGTTGGTCTAGAAATTTTCGTAGCCGTACAAAGAAAACTATTAGTGTATGTGTCAACTCTAGGTAGGGGCCCCGGTTGCCCTGATTATAGTAATGTCCTTCATCAGAGTCATAACATAGTTCCCGAGGGTTCTGAGTAGACAGACGATGTATTACGTAGCTAAATTTAATTGCACGACGTCTATCATTCCCCCTTCCGTCAGCTTCCGTGAACTCCGGGTACCGAGAATTCGCTGACAAACTTAACGGCCAGGAGAAAGGCCCGTAGCCGGTTGTGGCGCCACAAACAAACCGCCCCCGCTGATTGTCATTAAGTGGCCGAGTTTAAACAGCTCGCCGTCGAATGGTATGCTGGCGCGAAATTTTGGTATTCATGACTCCTGAGTGCTAAGTAACCGGAATTCTTTGAGAAGTGCCGACATCATAAACCGGGATTTACAAGAACTGGACACAGTCCGCAGTAACGGTTACATGTCCTACACCTGAACGGAGGTGATGTCAACTACGCTTCACGATCCCAGAGTCTGAGAGCCCTAGAAGTAATACAGCACTCCGTGTTGATCAGACCTTCGAACAGGTTAATTTAGAGGTAGATCAGTACCCGAATAGGGCAAAACGACTCTAAGAGACTAGCATGACGGAATACTTGCTGAGCCACAATTTGTGGTCGAGGAATCACGTTTGGACCGATGCCTTCGCTCCCAGCGAGTCTAGAGGCTTCGCAGAAGACTTCTCAGCTCGGGGCGACGTAGAGCTCATTGGAAGTTCTGCCTGTGGCGCTTTCGCGTTTAGTACCAGCCGCGTAGGGTGGTCCCTAAGAATACATCCGTGGGTCGACCATGAATTGGGGTCGAATGCTGATACAGCGCACGAGAAAGTTGTTGTCGTCTGATACTTGTCATTCTTGTTTCAGTCTCCTGATCTAAGCAGACCTTCCGCCATTTCGAATCTATTCAGTTAATTAGTTCTCCTATAATGGGTAC
766 73 6 CGCATGTTTGTCGATTGTTAAAGCGGATGTATGTTGTTAGTTTCCTTAAAGTGGCTCCCGGAAGTACACCGACTGATCTCGATCAATCATAAACTGTAGAACAGCGCAGGCCCCTTCAGCATGTTTGCAGCGGGACGAGGCATCTGTTTCACTACGCCAACTTCCGTCGGTTTTATTCTTCTCACGCTGGCGTCCATCCTGACGCAGTTCTGATAAGAGCTATGCGTTATTACATACCAGCCGGCGAGAGTTGGGCAGTAGAGAGAACCTTCGGAAGCTCCTTCCTGACCGGGGCAATTCTTTGCGCATAGTCATTCCGTTATCAACTTCGCTAACCCAATCAGCTTGCGGGCAGAACGACCGGCAAGCCTTCGCGTTGGTAAGGGTTCTAATGATGTATAATTAAGTCCCAGCCTGTTGGTTACTCAGAATTGTAAACATGTGTCGCGTAGTCAGCTCATGGCTGCACATACGGTTCCGTTCATTTCGGGCTGAAAGACGGGGCCTCTTCTAAGCTTATGCACTTTGAGCTACGACTGTACCGAACGGAATTACCAGTATTCCGGACCCATGCGTAATCTCCACCGGATAATGATCTTGCATGACCGCCTGTGGATTAGGAAACGGCTAAAACAATGCTGTGAGTCGTCCACCCAGTCCGCATAAGGCATCCAGAATTAAGGGCTGTATTGTTGGCATTTACGCAATTCACCTGATACTACGAGTGGGAGACCGGGGCGTACGTCTCCAGGATTATTCTAACTACGTGCATTAAGATAAGTGCTCGCAGATGACCTCGTAGGTGTGGTTTCCGTTGTAAACCGAATGGGATCCCATAATGCGCAATTCGGTTACCAACATGACGGGGATACCCTATGCGAATCAGCCAAGTCGGATATCGCCGCGGCATAACTCCATCGTCGGGATATGTCTCATCCGAACAAGTCAAATCTCTCCGCGCCCTGTATCAATCCGTTCCTAACGAGTCGTTTTTACTCAGCCAATCTTCAAATGACAGGACTCATGTAATTAGCCAACCTACGGGGGGTTTCATATTTCGCTAATTTTGCCAGGGGCAGGAGGATAACTTACAGAACTATCGGTGTAACCAATTACAATTACCTGCGCCCTAAACTGCTGCGACGGACTGTATCTTCGGGGAATTGCTTATGAGAACTCTGTATCGACAGTATTTCAAGCACTAGCTTGCCCCGATACCAGGTGAATAGAACAGAGGTCAATACATTCCTTCAACCTAGTACGCTCAAATTAAATTTCGGAACATCCCTGTGGTATGCTTCGTTTCACCAATGAAAAGGTACAGAATGGTAAACATCGCTGCAGAAACTACCAATGAAACTTTTTATTCATAAAGCGGTGACTGGCCGTGTGAGCGACTGCGGCGCACCGATAGACCAAGCACGATAATACGACAAATTCCAGGCACCGAACACTAACCACCGATAGGATTGCTCCGCGCGCACTTTTGGAAGTGTCTATACATTCTTGATGAATCCGACACTAGCCGCGCTTACTGGAATTTGATCTTGCTCCATCGGGGCCTGTGGTAACTGGTGGATTCTCAGGTACCCCTAGTCAGGCTCCTGAATTTATGAAATGTGTCTGTCGATTTTCGGTGGTTCTAACAGCGAGCAAGACTTGGCCATTGGGCTGGCGGCAATAAAAAAGGGACGTGAGTTACCAGCGGGCTGCGGGCCCTACGCAAAGGCCAGGGTTAGCGAGACTCGTGTTACAAAATGGGGACTGTTCTGTTGGATAGTGCCAGACATAGGCGATCGAGTTAATCATACTTACAGGACCAAAA
594 94 3 GACGCGACAACTCAGGATCACGACATCCCAAATAATGTCACCTCAAAAAATTTTGACCTGTTGGACGCACATTATAGTGTTTCGCTATCGTCCCGTTTGGCCCTGAAAATCTACACTATATGAATCCCGGAAGCCCGAACGATAGGTCAA
 
http://www.cplusplus.com/reference/clibrary/cstdio/scanf/ is how to use scanf. Just read it until you understand it.

First you want to read the 3 numbers at the beginning of the line. Then malloc a char * large enough to hold the ATCG's, based on the numbers you read. Then read the rest of the line into your char *, and process it.
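
A rough sketch of that idea, under the assumption that each line holds exactly genes × (4h + 1) nucleotides (h exon plus 3h + 1 intron per gene). The helper names strand_length and read_strand are hypothetical. Reading one character at a time with " %c" is a choice made here so the buffer can never overflow; it is not something the assignment prescribes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Expected number of nucleotides: each gene has h exon + (3h + 1) intron. */
size_t strand_length(int genes, int h)
{
    return (size_t)genes * (size_t)(4 * h + 1);
}

/* Reads exactly len non-blank characters from standard input into a
   freshly allocated, NUL-terminated string. Returns NULL on failure;
   the caller must free() the result. */
char *read_strand(size_t len)
{
    size_t i;
    char *buf = malloc(len + 1);

    if (buf == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        /* The leading space in " %c" skips blanks and newlines. */
        if (scanf(" %c", &buf[i]) != 1) {
            free(buf);
            return NULL;
        }
    }
    buf[len] = '\0';
    return buf;
}
```

Typical use would be to read id, genes and h with scanf("%d %d %d", ...), call read_strand(strand_length(genes, h)), process the string, and then free it before moving on to the next sample.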

Using redirection means that your program is getting input from "standard input" and giving output to "standard output" which are similar to files, only they don't have names on disk. All it means in your program is you use printf and scanf instead of fprintf and fscanf.
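
To make that concrete, here is a small sketch that never mentions a file: it only calls scanf and printf. The function name list_samples is made up for illustration, and the code assumes each strand is one unbroken run of letters on its line.

```c
#include <stdio.h>

/* Prints the header fields of every sample found on standard input and
   returns how many samples were seen. No FILE *, fopen or fscanf needed. */
int list_samples(void)
{
    int id, genes, h, n = 0;

    /* %*s reads the strand and throws it away; it is not counted in the
       return value, so a fully read sample makes scanf return 3. */
    while (scanf("%d %d %d %*s", &id, &genes, &h) == 3) {
        printf("sample %d: %d genes, exon length %d\n", id, genes, h);
        n++;
    }
    return n;
}
```

If main simply calls list_samples() and the compiled program is named dna (a hypothetical name), then at a command prompt dna < dataFile01.txt makes the file the program's standard input. An IDE such as Quincy has its own setting for this, which is not shown here.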

If you are being assigned this, you should have already learned how to do things like access characters within your DNA string and use printf; if not, you need to start reading a C book from the beginning.
 
We haven't learned malloc yet, and yes, we learned about printf and scanf, but we didn't spend much time on redirection. I will go over it myself, but what is the format used for redirection, C code wise? I know that without it you have to use FILE *(file name), (filename) = fopen(filepath), and the same with fclose. But what's the syntax for redirection? Can you guys explain another way of reading the first 3 digits of the line as the sample ID, then the next part as a DNA sequence, and how far to go before it knows to go on to the next line? Also, how do I program it so it knows each line is a separate DNA sequence? Thanks a lot guys.
 
Without redirection, you DON'T have to use the FILE * stuff, you just use scanf and printf. As for the rest, it sounds like you need to learn things about C that I can't teach you in a post or two; read a C book and read documentation about the functions you need to use.
 
I know my documentation. I was wondering: after it's done reading one line, it automatically goes to the next, right? Does it know again that it's reading the sample ID?
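
On the question of moving to the next line: scanf has no notion of lines. The %d conversion skips any leading whitespace, newlines included, so once one sample has been fully consumed the next call starts at the next sample ID. Below is a sketch without malloc that handles one nucleotide at a time. It assumes every line holds exactly genes × (4h + 1) letters (if a line is shorter, the loop would run into the next sample's numbers), and process_samples is a hypothetical name.

```c
#include <stdio.h>

/* Reads every sample from standard input and prints its exon base counts.
   Returns the number of complete samples processed. */
int process_samples(void)
{
    int id, genes, h, samples = 0;

    /* A return value of 3 means all three integers were read. */
    while (scanf("%d %d %d", &id, &genes, &h) == 3) {
        long gene_len = 4L * h + 1;          /* h exon + (3h + 1) intron */
        long total = (long)genes * gene_len; /* nucleotides expected */
        long i;
        int a = 0, t = 0, g = 0, c = 0;
        char base;

        for (i = 0; i < total; i++) {
            if (scanf(" %c", &base) != 1)
                return samples;              /* input ended early */
            if (i % gene_len >= h)
                continue;                    /* intron position: ignore */
            switch (base) {
            case 'A': a++; break;
            case 'T': t++; break;
            case 'G': g++; break;
            case 'C': c++; break;
            }
        }
        printf("%d A=%d T=%d G=%d C=%d\n", id, a, t, g, c);
        samples++;
    }
    return samples;
}
```

In a program that must consist of main only, the body of process_samples could sit directly inside main, with the counts turned into means and a type before printing.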
 
I just saw what my edit did, my bad guys. For some reason I can't edit it out :S Can a mod delete this thread?
 