Parsing data records I
Two sequence records in FASTA format:
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
Challenge #1
Download the file from here. Open the file
SingleSeq.fasta
, read its content line by line and print it
See the Solution to challenge #1
Challenge #2
Download the
SingleSeq.fasta
file from here. Open the file, read its content line by line and write it to another file.
See the Solution to challenge #2
Writing different things depending on a condition
Read a sequence in FASTA format and print only the header of the sequence
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
Making choices: The if/elif/else
statements
if condition1 #if expression in condition1is TRUE
statements1 #execute statements1
elif condition2 #else if expression in condition2is TRUE
statements2 #execute statements2...
elif condition3 #etc...
...
…
else:
statementN
Check these conditions:
'ACTC'[0] == 'C'
is True or false?'ACTC'[0] == 'A'
is True or false?
Operators:
== != => <= > <
The if/elif/else
construct produces different effects compared with the use of a series of if
conditions
nucl = ['A','C','T','G']
if 'A' in nucl: print 'A'
elif 'C' in nucl: print 'C'
elif 'T' in nucl: print 'T'
else: print 'G'
nucl = ['A','C','T','G']
if 'A' in nucl: print 'A'
if 'C' in nucl: print 'C'
if 'T' in nucl: print 'T'
if 'G' in nucl: print 'G'
Challenge #3
Download the file
SingleSeq.fasta
from here. Read a sequence in FASTA format and print only the header of the sequence>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
See the Solution to challenge #3
Challenge #4
Download the file
SingleSeq.fasta
from here. Read the file in FASTA format and write to a new file only the header of the record.
See the Solution to challenge #4
Challenge #5
Download the file
SingleSeq.fasta
from here. Read a file in FASTA format and write to a new file only the sequence (without the header).
See the Solution to challenge #5
Challenge: Merge programs #4 and #5
Download the file
SingleSeq.fasta
from here. Read a file in FASTA format and write the header to a file and the sequence to a different one.
See the Solution to merge challenge #4 and #5
Challenge #6
Let’s increase the difficulty just a bit… Download the file
SingleSeq.fasta
from here.
- Read a FASTA file line by line
- Save the header in a variable and the sequence in a different one
- Print header and sequence separately
See the Solution to challenge #6
Challenge #7
- Implement challenge #6 by counting the number of cysteine (“C”) residues in the sequence
- Print separately header, sequence and the number of cysteine residues
See the Solution to challenge #7
Challenge #8
Download the file
SingleSeq.fasta
from here.
- Read a file in FASTA format line-by-line.
- Print or write the record to a file only if the sequence is from Homo sapiens.
See the Solution to challenge #8
Challenge #9
Very often in reality you will need to analyze several sequences….
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here. Consider the content of the file:
SwissProt-Human.fasta
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GNYWHAHMGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
...
...
...
Write the record headers to a new file.
See the Solution to challenge #9
Challenge #10
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here. Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line
See the Solution to challenge #10
Challenge #11
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here. Read a file in FASTA format and copy to a new file the records’ Accession Numbers (AC).
See the Solution to challenge #11
Challenge #12
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here.
- Read FASTA records from the file
- Count (and print) the cysteine residues in each sequence.
See the Solution to challenge #12
Challenge #13
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here. Read the file and write in a new file only the records from Homo sapiens.
See the Solution to challenge #13
Challenge #14 homework
Download the Uniprot multiple sequence FASTA file
SwissProt-Human.fasta
here.
- Read a multiple sequence file in FASTA format and only write to a new file the records the sequences of which start with a methionine (‘M’) and have at least two tryptophan residues (‘W’)
First:
- Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine (‘M’)
Then
- Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues (‘W’)
Finally merge the two steps
See the Solution to challenge #14
Challenge #15 homework
Download the
ap006852.gbk
file here Read a Genbank record and write to a file the nucleotide sequence in FASTA format. Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk)Try to write it in FASTA format:
AP006852 ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaa agtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatcc atctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaa cacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa Gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga ... ...
See the Solution to challenge #15
Recap: parsing data records
-
Start by visually inspecting the file you want to parse
-
Identify the information you want to extract
-
Identify separators to select your information using if conditions
-
Use lists if you have to compare data from different files
Back
Back to main page.