How to Write a C Program for DNA Sequencing?

  • Thread starter: DanielT29
  • Tags: DNA, Program

Discussion Overview

The discussion revolves around writing a C program for DNA sequencing, specifically focusing on the structure and requirements of the program as part of an assignment. Participants are exploring how to read DNA data from a file, calculate nucleotide means, and classify DNA types based on given criteria.

Discussion Character

  • Homework-related
  • Exploratory
  • Technical explanation

Main Points Raised

  • One participant requests detailed guidance on how to structure a C program, emphasizing their beginner status and the need for clarity.
  • The assignment specifies that the program must read from a data file without using fopen and fscanf, instead utilizing scanf for input.
  • There is a requirement to determine the type of DNA strand based on the mean number of nucleotides in the exon regions, with specific criteria outlined for classification.
  • Participants express uncertainty about how to identify the nucleotide types (A, T, G, C) within the DNA sequence and how to calculate their mean values.
  • Concerns are raised about how to manage the reading of multiple lines of DNA sequences and how to format the output correctly as specified in the assignment.
  • A later reply mentions finding a sample data file that could assist in understanding the assignment better, suggesting that it may clarify how to implement the program.

Areas of Agreement / Disagreement

Participants generally agree on the need for clarity in the program structure and the challenges involved in reading and processing the DNA data. However, there is no consensus on specific methods for identifying nucleotides or calculating means, as these remain points of uncertainty.

Contextual Notes

Limitations include the lack of clarity on how to implement the reading of nucleotide sequences and the calculation of their means, as well as the specific formatting required for output. Participants have not resolved these issues, and the discussion reflects ongoing exploration of these topics.

DanielT29
Guys, I need major help with this assignment, which has to be written in C. I'm a beginner at this, so please explain what I should do in detail. Thanks!

Caveat
The solution to this assignment should be a C program necessarily structured as a main function.

The problem

Suppose a DNA strand consists of a sequence of genes. Each gene is a sequence of four types
of nucleotides: Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Moreover, a gene has
two well defined, adjacent regions: a coding region called exon, located in the beginning of the
gene, and a non-coding region called intron, located in the immediate tail of the gene. If a gene
contains h nucleotides in the exon region, then the length t of the intron region (i.e., the number
of nucleotides in it) can be determined as follows:

t = 3h + 1

For instance, Figure 1 shows a DNA strand of a fictitious organism. It contains 2 genes and each
gene has 3 nucleotides in the exon region (shown underlined in the figure).

GATAD{GAATGCC (first gene)
CCTCGTAGTTGAC (second gene)

Figure 1: A DNA strand with 2 genes, each with 3 nucleotides in the exon region.




In view of the above, your C main function should:

1. scan a data file containing samples of DNA strands of (fictitious) organisms: in this data
file, for each sample there is a line of data with the following fields separated by blanks -
sample ID (integer), number of genes in the DNA strand (integer), length of the exon portion
of each gene (integer), a sequence of characters representing the DNA strand. Note: you
should NOT open a file to read data from it using C commands fopen and fscanf. Instead,
you should set up your C development environment (Quincy) to take a data file as the source of input, and use the C function scanf to read the data. If you do not know how to do that, you should consult the Discussion Board, your TA, or your Instructor.

2. determine the type of the DNA strand: the DNA type depends on the mean number and
type of nucleotides in the exon regions of the genes present in the DNA strand, as shown in Table 1:

DNA TYPE | CRITERIA
1 | C<A<T<G
2 | A<C<,G>0
3 | otherwise

* The underlines on the letters denote the mean (average), while an underline on a < means "or equal to" (i.e., ≤). I had to retype the table since it couldn't be copied properly from the page I was viewing it in.


Table 1: DNA types and criteria. In the Criteria column, A, T, C, and G denote the mean number of the respective nucleotides present in the exon regions of a DNA strand (i.e., A denotes the mean number of Adenine in the exon regions of the strand).

3. produce separate printouts for all the samples contained in each data file: each printout
will present for each strand, on separate lines, the ID number of the strand, the values of
A, T, G, and C, and the type of the strand. It should also include the name of the data file.
The data should be labeled and formatted exactly as shown in Figure 2.

Data File: dataFile01.txt

ID mA mT mG mC Type
684 0.27 0.30 0.20 0.24 3
465 0.26 0.25 0.24 0.24 3
131 0.21 0.21 0.30 0.27 3

* I typed up 3 lines for time's sake; there were about 7. This is not the data being read; it is an example of how my program should print out the data.

* The 3-digit integers are the sample numbers, the decimal numbers are the mean values of each nucleotide, and the single integers (all 3s) are the type numbers. Sorry guys, the columns are not aligned properly.



* What I'm having trouble with: how do I know which parts of the DNA sequence are A, T, G and C? If I don't know this, I can't compute their mean values to assign a type. Also, how do I write a program that knows when it's done reading one line of a DNA sequence and then moves on to the next as a brand new one? How do I print out my values like in Figure 2, and how do I use printf and scanf to read and write my data (is this redirection)? Thanks guys. If I have some idea of this I can start on it; right now I have a big programmer's block preventing me from even starting, and I'm stuck on int main(void). An algorithm would be nice (not the C code itself). I will be updating you guys on the code I write, and hope to have it done by Monday night (June 1st). Thanks a lot guys, I appreciate your help.



*EDIT May 31 2009 11:56pm (EST)

I found a file that came with the assignment that I had overlooked; maybe you guys can help me better by looking at it. My program basically has to read this file in.


684 30 8 ATCAGAGGGGATTTCGCCGGTCTATCGAAGCTGAATTCATGTATTAACGATTACAACGAATCAGTAAAGAGTCCTTAGACGGTGATACAGCACGGTGGGTCGTGCTTTAGCCTTTTGCTTTCACTGTTCTAGTGATAATGAGGCTCGAAACTCCTGACCATATAAATGAGTACATAGGGACCCAAGGATAGCTATTCTTATTTACATGTATACGCACTCTCCACCTGCAAAGTCCTTTAGCAGATCCCCCACATGTCTCTATTAACCTAGTATATCCGCTTTTCATGGCTGATCCAACGTAAGGATCTCAGTCGCTCTGGGGTAGAAGTCGCCAATGGGCGTAAACGTAATTTGTTCCGGATTCATATTAACGTAATATAGCAACCTCCGAAACACAATGCGTGAGATTACCTATGTGCTTAACTCTATTTACATCGTGAGAGTTCCGGCAGTTAAGACAGCCCTCTAGTGGAAGGGGCTCCACCACAAATTTGTCTCCGCTTGAGAGAAATTGGATCGACCGTCCGTGAGGACCCGCCGCTGTTCACAGCCAAAGTAAAATGGTATAAAACCGGCGGTATCACTCAAACTTGCCCCATCATCTAAATGAGGCGATAGAATAGGCCTCACTCCTTTTTCGGGCACCCATGAACCCCCACGCCGTACTTACTCGCGAAGTCCCAGTAATTAGACAGCCGTGGGAGATCGTGAGGTCTAAGCGCCCGCACTCAACCATGGGACTGCGAAAGATAGAATAATCTGACATCCGAGAAGTTCCTCGATCCGAAGACGAGAAAGTTCTCAAGCGGCTACCGAACCTTCCTCTGCTGGACAGGTGCTGCGGTCCCGAAGTTAGCCCGTCTCATGAAAAACCAAACGCCTTGCTTTCAGTTATTAACGTCATCTGACAACCCGAAGCATTAGGTGAGAACGCCCCGGCGCCTTGCGCCGGTCCTGATGTTCTGCTCAATCCCCGTAACCTGCGAGGCC
465 87 7 GTGTAGTTGCTCCGACAGACCCGAAACCCCCAACTTTCTGGAACTTTTCTGTATAGCAGGGCCTAGATATTCCGAAATCTATGCTGCCCTCTCTCCATGATTCGCGTTAGCACTTTACATCTACCCCAATAGAATGACATTGGGCTCTACCCTCACGGGTCACATAGCGGGTAAACATCGAAGCGATAAGGCTAAGGCGTCACCCAGTTCAGCTCAACAACTTAACTTCAAGTCTGCCCATAAATTGGTGGGCCATGACCAGTATACTCGCAAAGGCATTACGCCTCCAGACGCTGACGGTTTCAGGCGCAAACCCATCAGGATCACAGCCCCGCAAGATGAGCGATGCCCTAAACGCCAAAGGTGACTGGGCTTTTCGTGACTCCCGTAATTCCCCTGCAAGGAGCTGTGGACAGACGTCACATGGAGACAGTACATCTTGGTGAGCCCTTGGCTCGCGTGACCGTATGGTCTAGAGAATTCACGCGTGTGCCAGACTCGCACGGTCAGTAGTCAGGTCCGTCAAATATCTGCGACCGAGTTTGGAGCAGAAGTTGGGGCACGCAAAGTGGGCCGGCATGTGAGTAATGAGGAAAGGACACTGATGTTGAGGGTGCAGCTATTTAAGAGATCTCCTTCGGAGTTATTGACCCGCTCTAGAGTACGGCGGGAATACCTCGAAGTCCCTATTGAGCCTTACATTAACGGATCTGTGCCAAAAGTTCGGACATGGGACATCCGGCGCTGGGCGCCCTCGTGTAATGCGCGTTGTAGGAATCCAGTGAAAGTGTATTCCATAGAATGTCGATAACAGTAAACCCCGCCGTCTACTCGTGAAACAGGACTTTATGCAATTTCTGCCATGGATGAGCGCTGGGTTAAGATAGCTATCTCGATAATCGAACGTTAATCCCCTGGTCAGACGGGTAACATCAATTTCTTCCCAAATACGCTACATTCTTCTACGTGTCGGCTTGGAAGGAGTCCTCTCGTTCAAAATGATATATCCAATGAGTTAACATCCATTTCCGCGGGCTGCGAAACCACCCGCTTTAATAAACTTGGTCTAGATATTCACCAAGCTTCTTACCAACACGCAAGTACTGCAATTTCGCGAACCCTGAATCTGATAGGTGGAACATATGGAGCCCTGACAATGCGATGATGGGGGGTCGACTAAGGGGCATATGTGCTTTCCAAGCACTGGGTAGTGCAACAAAGAATGCTAATACTCTGAGTGCGGGGTCGCGGACATCCTCCTGACAGGTACTCCGAGCGCCCGGTATTACTTGAAGACACACTAATCCGAAGAACCTGGCCATCTAATAATTGGCCGCTGTTGGCGGACCTTAGGCACACAGTTTCTCTGCTTCCCGAACGTACGAGAGCTTGCCTGAGACGCTAGTACCAAGGTGGAATATCACACTCTTGGAATGGGAAACTTGGCCTTTTAGCCCTCTCTGATGTCACAGAAGCCGATGAACGGCTACAGCAGATGTGTGATCATGTACCTTCAGCCTAGGATCGTCTAGGGCCTGAATCTAGCAATAGAACTTAGATAGGTGATGACTAAATCCGTACTTAGGTTCTAGGAAACCGTGTGACACTATGGGCCCACAGACAAGGGCACGATCAATAGAACCGGGATTTATCTACTTTGAGTTGCTCAACATCTACCAGATGTTAATCGGTTGTGGGATTCCATATCAGTTGGACTATTAGATGTCATGCAAAGAAATGGCGCCCGCGATAACCAGTTCCTAACACTGTTGACAGAGAACAAACTCTCCTTCGGGTTGCTATATTTCTAGAAAACAAGATTGTGCGAAGAACATGTGTGTATGTTGTGATATCCTGTCTGTAAGCAGACCTTAAATCATGCGCTTGCAGGTGCTCTACATCTTACGATGCGTTATGGACTTTCATTCGTTTAATTGTGCGCTGCCCGCTTTACTGATGGGGATGAAATTTAGTGCTGGCTTTAACACCCG
AGGCAACTACGTATAGAGTAACATTTTACGAACGATAGGGTAACAACGCGCTGGAACGTTGGATCAATGAGCGCTGATCCGGGGCTAGTACTGGCGTATGAGACTTTACTCGAGGGCACGCGACACCGCTGCATCTACATGGGTCACTGATACATGTGATTACTTTAAAGTAGTGTAAACCGCTGGCATTCCTTCAGACTGGCCGGAATTCGACCCTCGTGGAGATCTGTCCTACAGGTCTCCAAAAATGGGGGTATTCTACCGATCAGAGGCCGCAAGCTATTCATGTATGGCGGCATTGGATATCCTAAATTCGTATCCAGCCGCTAACGAATGAGTCTTTCGCCGTTTTCCGTCTCAGATAATTGTCTTCCTGTAGTTAAATAAGCAATCCTTCTTACTACGGCCGCTCTAGCGATCGATGGGAGCCGGCCCCCCCGCGTGTTCATCACTCAACGTCAGCAGCGTAAGTTAGTAATGTTAGATGAGAGCTCTTCGTGTGATAACTATTATTATTTGCAAGTGCCAGGTC
131 8 7 GAGCAACCCAGCCTCAGAGAGGCGGCCACGGGAACGACGCCTAAAACTATCTGCGCTCCCGTCGGTCACACATCGAGTTAATAATAAGCTGTTACACTAGGATAGCGCAGTGCGCTTGTTTGGTGGTTTCAAAGAGGTTCCAGTCAGCCCACAGTCTGTCGCCGTCAAATTCTTTCCCACTATTTAAGGAGCTCATCGGGCACATGATGATGTAGCACGGAAGTTCTGTAGT
295 44 7 TTTGCGGTGTAAGCATGACTGACAGATGAGTTGGTACCGTATACCAAGTACTAGCGACTAAGGTCTCCACGGAACACTGCAAAGACCATAGCGCACGTAAGTGACATGTCCTGGGTGTCGAAGTACGCTACAATCCACGATGTAGTGTGATCCCCATCGATGCATTGGTCTAGCATGGTCATATAAATAGCAGCCCGAACCGGGGTTCCCATGTGACCGCGACTTCGTCTATGTAATGATGACTTGCTTCGAATAGATACGAAGCTGGACCGAGATCAGGTCGTCTGCTGGCAGGGAACCGGGAGCGTCGAGAGGGGATACCGCGCTTTATATAAAAAGCAGCGAGCTAAAGCGGGACGTTTGGGCATGTATGGGGCTGTTGTCGCTGTCTTGTCAAGTTTAAAATGTCGCGGGCTAGCCTGAGTTAAATTTCGACATAGCTGCATGGATTATTTATACTGTCCAATGGCCAAGAGACCATAAACAGTGGGGGCACCCTTCCGATGCCCTGAACGCTGGAAGATTAAACCGCAGTGTTAGAGGCACTGAGAGTTCTAGCTGATGCAGTGCGCACGGCGGCACGCCCCGACTTTGGCTATTGGAGACCCGAACCGGACAGCTCTTATCGCATTTAGCCAGCGTATGAGTGACCCAGAGGCATGATGTTGTTGCTAAAGACCAGTCACCTCCACTAGCCTGCACCGAGCTTATTTCGTCAGGCCCGGTGTGGAGAATATGAGCACCCCCCATCGCAAACTTCTGAGACTCGACCTACGGGCCCATCATTCAAAATCCTGGAGGAAGATCGAGGAGGGATCAACACGGTCGATAGCCGATCCATTAAAAGGATGAATATCAGCGTCACAGCCCTGATCCTGGATAGACACCACTTCGAATTATGTAAGTCGGGAGCGGCCGGAGCTCAGGCCGCCTTGCAGCTGCTCGAGAGCAAAGGCTAAGAGCCACTGTGCGAGCAGTCATTCCTGTAGTCTTCACCACCTTTAGGACCACGTGTTCGACGCTACAGGACTCGTATAACAGGGATCGCCGATACGGATTTGTTCCTGACTCTCAGTTAAACGATATCTCGGGGATATACAGTTAGGTACATCCTGCGTACGCTCTTCGCTTATTGCGGTTTTGACTGAGCACCGTAAGGAGTAAGGCAAAGGTTGCAACCGGCGAGTTAGGGTGGAGTCTTGATATTGTGCGCGGCGTATCACAACCTTAACGTCACGCACGGCGCCTTCGGTCACTCAATCTGTATTTACAGCCC
216 29 9 ACTAGTATTTAGCCGAGGTGATCATAGGTGATGCACCAGTGTATACATTGGTCGGTTGCTGGCCTCAAAACCAGCCTGCTGCCTGGACCCAAGTTGGTCTAGAAATTTTCGTAGCCGTACAAAGAAAACTATTAGTGTATGTGTCAACTCTAGGTAGGGGCCCCGGTTGCCCTGATTATAGTAATGTCCTTCATCAGAGTCATAACATAGTTCCCGAGGGTTCTGAGTAGACAGACGATGTATTACGTAGCTAAATTTAATTGCACGACGTCTATCATTCCCCCTTCCGTCAGCTTCCGTGAACTCCGGGTACCGAGAATTCGCTGACAAACTTAACGGCCAGGAGAAAGGCCCGTAGCCGGTTGTGGCGCCACAAACAAACCGCCCCCGCTGATTGTCATTAAGTGGCCGAGTTTAAACAGCTCGCCGTCGAATGGTATGCTGGCGCGAAATTTTGGTATTCATGACTCCTGAGTGCTAAGTAACCGGAATTCTTTGAGAAGTGCCGACATCATAAACCGGGATTTACAAGAACTGGACACAGTCCGCAGTAACGGTTACATGTCCTACACCTGAACGGAGGTGATGTCAACTACGCTTCACGATCCCAGAGTCTGAGAGCCCTAGAAGTAATACAGCACTCCGTGTTGATCAGACCTTCGAACAGGTTAATTTAGAGGTAGATCAGTACCCGAATAGGGCAAAACGACTCTAAGAGACTAGCATGACGGAATACTTGCTGAGCCACAATTTGTGGTCGAGGAATCACGTTTGGACCGATGCCTTCGCTCCCAGCGAGTCTAGAGGCTTCGCAGAAGACTTCTCAGCTCGGGGCGACGTAGAGCTCATTGGAAGTTCTGCCTGTGGCGCTTTCGCGTTTAGTACCAGCCGCGTAGGGTGGTCCCTAAGAATACATCCGTGGGTCGACCATGAATTGGGGTCGAATGCTGATACAGCGCACGAGAAAGTTGTTGTCGTCTGATACTTGTCATTCTTGTTTCAGTCTCCTGATCTAAGCAGACCTTCCGCCATTTCGAATCTATTCAGTTAATTAGTTCTCCTATAATGGGTAC
766 73 6 CGCATGTTTGTCGATTGTTAAAGCGGATGTATGTTGTTAGTTTCCTTAAAGTGGCTCCCGGAAGTACACCGACTGATCTCGATCAATCATAAACTGTAGAACAGCGCAGGCCCCTTCAGCATGTTTGCAGCGGGACGAGGCATCTGTTTCACTACGCCAACTTCCGTCGGTTTTATTCTTCTCACGCTGGCGTCCATCCTGACGCAGTTCTGATAAGAGCTATGCGTTATTACATACCAGCCGGCGAGAGTTGGGCAGTAGAGAGAACCTTCGGAAGCTCCTTCCTGACCGGGGCAATTCTTTGCGCATAGTCATTCCGTTATCAACTTCGCTAACCCAATCAGCTTGCGGGCAGAACGACCGGCAAGCCTTCGCGTTGGTAAGGGTTCTAATGATGTATAATTAAGTCCCAGCCTGTTGGTTACTCAGAATTGTAAACATGTGTCGCGTAGTCAGCTCATGGCTGCACATACGGTTCCGTTCATTTCGGGCTGAAAGACGGGGCCTCTTCTAAGCTTATGCACTTTGAGCTACGACTGTACCGAACGGAATTACCAGTATTCCGGACCCATGCGTAATCTCCACCGGATAATGATCTTGCATGACCGCCTGTGGATTAGGAAACGGCTAAAACAATGCTGTGAGTCGTCCACCCAGTCCGCATAAGGCATCCAGAATTAAGGGCTGTATTGTTGGCATTTACGCAATTCACCTGATACTACGAGTGGGAGACCGGGGCGTACGTCTCCAGGATTATTCTAACTACGTGCATTAAGATAAGTGCTCGCAGATGACCTCGTAGGTGTGGTTTCCGTTGTAAACCGAATGGGATCCCATAATGCGCAATTCGGTTACCAACATGACGGGGATACCCTATGCGAATCAGCCAAGTCGGATATCGCCGCGGCATAACTCCATCGTCGGGATATGTCTCATCCGAACAAGTCAAATCTCTCCGCGCCCTGTATCAATCCGTTCCTAACGAGTCGTTTTTACTCAGCCAATCTTCAAATGACAGGACTCATGTAATTAGCCAACCTACGGGGGGTTTCATATTTCGCTAATTTTGCCAGGGGCAGGAGGATAACTTACAGAACTATCGGTGTAACCAATTACAATTACCTGCGCCCTAAACTGCTGCGACGGACTGTATCTTCGGGGAATTGCTTATGAGAACTCTGTATCGACAGTATTTCAAGCACTAGCTTGCCCCGATACCAGGTGAATAGAACAGAGGTCAATACATTCCTTCAACCTAGTACGCTCAAATTAAATTTCGGAACATCCCTGTGGTATGCTTCGTTTCACCAATGAAAAGGTACAGAATGGTAAACATCGCTGCAGAAACTACCAATGAAACTTTTTATTCATAAAGCGGTGACTGGCCGTGTGAGCGACTGCGGCGCACCGATAGACCAAGCACGATAATACGACAAATTCCAGGCACCGAACACTAACCACCGATAGGATTGCTCCGCGCGCACTTTTGGAAGTGTCTATACATTCTTGATGAATCCGACACTAGCCGCGCTTACTGGAATTTGATCTTGCTCCATCGGGGCCTGTGGTAACTGGTGGATTCTCAGGTACCCCTAGTCAGGCTCCTGAATTTATGAAATGTGTCTGTCGATTTTCGGTGGTTCTAACAGCGAGCAAGACTTGGCCATTGGGCTGGCGGCAATAAAAAAGGGACGTGAGTTACCAGCGGGCTGCGGGCCCTACGCAAAGGCCAGGGTTAGCGAGACTCGTGTTACAAAATGGGGACTGTTCTGTTGGATAGTGCCAGACATAGGCGATCGAGTTAATCATACTTACAGGACCAAAA
594 94 3 GACGCGACAACTCAGGATCACGACATCCCAAATAATGTCACCTCAAAAAATTTTGACCTGTTGGACGCACATTATAGTGTTTCGCTATCGTCCCGTTTGGCCCTGAAAATCTACACTATATGAATCCCGGAAGCCCGAACGATAGGTCAA
 
http://www.cplusplus.com/reference/clibrary/cstdio/scanf/ is how to use scanf. Just read it until you understand it.

First you want to read the 3 numbers at the beginning of the line. Then malloc a char * large enough to hold the ATCG's, based on the numbers you read. Then read the rest of the line into your char *, and process it.

Using redirection means that your program is getting input from "standard input" and giving output to "standard output" which are similar to files, only they don't have names on disk. All it means in your program is you use printf and scanf instead of fprintf and fscanf.

If you are being assigned this, you should have already learned how to do things like access characters within your DNA string and use printf; if not, you need to start reading a C book from the beginning.
 
We haven't learned malloc yet. Yes, we learned about printf and scanf, but we didn't spend much time on redirection. I will go over it myself, but what is the format used for redirection, C code wise? I know that without it you have to use FILE *(file name), (filename) = fopen(filepath), and the same with fclose. But what's the syntax for redirection? Can you guys explain another way of reading the first 3 digits of the line as the sample ID, then the next part as a DNA sequence, and how far to go before it knows to go on to the next line? Also, how do I program it so it knows each line is a separate DNA sequence? Thanks a lot guys.
 
Without redirection, you DON'T have to use the FILE * stuff, you just use scanf and printf. As for the rest, it sounds like you need to learn things about C that I can't teach you in a post or two; read a C book and read documentation about the functions you need to use.
 
I know my documentation. I was wondering: after it's done reading one line, it automatically goes to the next, right? Does it know again that it's reading the sample ID?
 
I just saw what my edit did, my bad guys. For some reason I can't edit it out :S Can a mod delete this thread?
 
