Two sequence records in FASTA format:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN

Challenge #1

Download the file SingleSeq.fasta from here. Open it, read its content line by line, and print each line.
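As a starting point, reading a text file line by line could look like the sketch below. It is one possible approach written for Python 3, not the linked solution. So that it runs on its own, it first creates a small stand-in file; the name `example.fasta` and the shortened sequence are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical stand-in for SingleSeq.fasta: one header and a shortened sequence.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as handle:
    handle.write(">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n")
    handle.write("MTMDKSELVQKAKLAEQAERYDD\n")

def read_lines(filename):
    """Return the lines of a text file without their trailing newline."""
    lines = []
    with open(filename) as handle:      # the file is closed automatically
        for line in handle:             # a file object yields one line at a time
            lines.append(line.rstrip("\n"))
    return lines

for line in read_lines(path):
    print(line)
```

Stripping the newline before printing avoids the empty line that `print` would otherwise add after each line of the file.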


See the Solution to challenge #1


Challenge #2

Download the SingleSeq.fasta file from here. Open the file, read its content line by line and write it to another file.
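Writing works much like reading, with the output file opened in "w" mode. The following is a minimal Python 3 sketch under the same assumptions as before: the file names are hypothetical and a small stand-in input file is created first so the example is self-contained.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "example.fasta")        # hypothetical input name
target = os.path.join(workdir, "example_copy.fasta")   # hypothetical output name

# Create a small stand-in input file (shortened sequence).
with open(source, "w") as handle:
    handle.write(">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n")
    handle.write("MTMDKSELVQKAKLAEQAERYDD\n")

def copy_lines(in_name, out_name):
    """Copy a text file line by line and return the number of lines written."""
    written = 0
    with open(in_name) as in_handle, open(out_name, "w") as out_handle:
        for line in in_handle:
            out_handle.write(line)      # each line still carries its own newline
            written += 1
    return written

print(copy_lines(source, target))
```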


See the Solution to challenge #2


Writing different things depending on a condition

Read a sequence in FASTA format and print only the header of the sequence:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN

Making choices: The if/elif/else statements

if condition1:      # if the expression in condition1 is True
    statements1     # execute statements1
elif condition2:    # else, if the expression in condition2 is True
    statements2     # execute statements2
elif condition3:    # and so on for any further conditions
    ...
else:               # if none of the conditions above is True
    statementN      # execute statementN
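A concrete instance of this template, applied to the lines of a FASTA file, might look as follows. This is only an illustrative sketch; the function name `classify` and the three labels are made up for the example.

```python
def classify(line):
    """Label one line of a FASTA file using if/elif/else."""
    if line.startswith(">"):        # header lines begin with '>'
        return "header"
    elif line.strip() == "":        # nothing left after removing whitespace
        return "blank"
    else:                           # anything else is treated as sequence
        return "sequence"

for example in (">sp|P31946|1433B_HUMAN", "MTMDKSELVQ", "\n"):
    print(classify(example))
```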

Check these conditions:

  • Is 'ACTC'[0] == 'C' True or False?
  • Is 'ACTC'[0] == 'A' True or False?

Operators:

==    !=    >=    <=    >    <
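Each of these operators returns either True or False. A short Python 3 sketch, using the first ten residues of the sequence above as an example value:

```python
seq = "MTMDKSELVQ"
length = len(seq)

print(length == 10)    # equal to                  -> True
print(length != 10)    # not equal to              -> False
print(length >= 10)    # greater than or equal to  -> True
print(length <= 5)     # less than or equal to     -> False
print(seq[0] > "A")    # strings compare in alphabetical order -> True
print(seq[0] < "A")    # -> False
```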

In an if/elif/else construct only the first branch whose condition is True is executed. In a series of separate if statements every condition is tested, so several branches can run.

With if/elif/else (prints only A):

nucl = ['A','C','T','G']
if 'A' in nucl: print('A')
elif 'C' in nucl: print('C')
elif 'T' in nucl: print('T')
else: print('G')

With a series of if statements (prints A, C, T and G):

nucl = ['A','C','T','G']
if 'A' in nucl: print('A')
if 'C' in nucl: print('C')
if 'T' in nucl: print('T')
if 'G' in nucl: print('G')
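To compare the two snippets above side by side, they can be wrapped in functions that collect what would be printed. The function names `first_match` and `all_matches` are made up for this sketch.

```python
nucl = ['A', 'C', 'T', 'G']

def first_match(bases):
    """if/elif/else: evaluation stops at the first condition that is True."""
    found = []
    if 'A' in bases:
        found.append('A')
    elif 'C' in bases:
        found.append('C')
    elif 'T' in bases:
        found.append('T')
    else:
        found.append('G')
    return found

def all_matches(bases):
    """Separate if statements: every condition is tested independently."""
    found = []
    if 'A' in bases:
        found.append('A')
    if 'C' in bases:
        found.append('C')
    if 'T' in bases:
        found.append('T')
    if 'G' in bases:
        found.append('G')
    return found

print(first_match(nucl))   # ['A']
print(all_matches(nucl))   # ['A', 'C', 'T', 'G']
```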

Challenge #3

Download the file SingleSeq.fasta from here. Read a sequence in FASTA format and print only the header of the sequence

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN

See the Solution to challenge #3


Challenge #4

Download the file SingleSeq.fasta from here. Read the file in FASTA format and write to a new file only the header of the record.


See the Solution to challenge #4


Challenge #5

Download the file SingleSeq.fasta from here. Read a file in FASTA format and write to a new file only the sequence (without the header).


See the Solution to challenge #5


Challenge: Merge programs #4 and #5

Download the file SingleSeq.fasta from here. Read a file in FASTA format and write the header to a file and the sequence to a different one.


See the Solution to merge challenge #4 and #5


Challenge #6

Let’s increase the difficulty just a bit… Download the file SingleSeq.fasta from here.

  • Read a FASTA file line by line
  • Save the header in a variable and the sequence in a different one
  • Print header and sequence separately
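The steps above could be sketched as follows (Python 3). This is one possible approach, not the linked solution. An `io.StringIO` object stands in for the opened file, and the sequence is shortened and split over two lines to show why sequence lines are joined.

```python
import io

# Hypothetical stand-in for an open FASTA file.
handle = io.StringIO(
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens\n"
    "MTMDKSELVQKAKLAEQAERYDD\n"
    "MAAAMKAVTEQGHELSNEERNLL\n"
)

header = ""
sequence = ""
for line in handle:
    line = line.strip()               # remove the newline at the end
    if line.startswith(">"):
        header = line                 # the header is the line starting with '>'
    elif line:
        sequence = sequence + line    # join the sequence lines into one string

print(header)
print(sequence)
```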

See the Solution to challenge #6


Challenge #7

  • Extend challenge #6 by counting the number of cysteine (“C”) residues in the sequence
  • Print separately header, sequence and the number of cysteine residues
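Counting a residue can be done with a loop and an if condition, or with the built-in string method `count`. The sketch below shows both on a short stretch of the 14-3-3 protein epsilon sequence; the function name `count_residue` is made up for the example.

```python
fragment = "ETELKLICCDILDVLDKHLIP"   # short stretch of a longer sequence

def count_residue(sequence, residue):
    """Count one residue with an explicit loop and an if condition."""
    total = 0
    for letter in sequence:
        if letter == residue:
            total += 1
    return total

print(count_residue(fragment, "C"))   # 2
print(fragment.count("C"))            # 2, same result with the string method
```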

See the Solution to challenge #7


Challenge #8

Download the file SingleSeq.fasta from here.

  • Read a file in FASTA format line-by-line.
  • Print or write the record to a file only if the sequence is from Homo sapiens.
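In the headers shown on this page the organism follows the `OS=` tag, so the condition can be a substring test on the header line. A minimal sketch, assuming that header layout; the second header is a made-up non-human example.

```python
def is_human(header):
    """Return True if the header names Homo sapiens after the OS= tag."""
    return "OS=Homo sapiens" in header

human = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens"
other = ">xx|X00000|EXAMPLE_MOUSE made-up protein OS=Mus musculus"   # hypothetical

for header in (human, other):
    if is_human(header):
        print(header)   # only the human header is printed
```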

See the Solution to challenge #8


Challenge #9

In practice you will often need to analyze several sequences.

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here. Consider the content of the file:

SwissProt-Human.fasta

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
...
...
...

Write the record headers to a new file.


See the Solution to challenge #9


Challenge #10

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here. Read a multiple sequence FASTA file and write the sequences to a new file, separated by a blank line.


See the Solution to challenge #10


Challenge #11

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here. Read a file in FASTA format and copy the records’ Accession Numbers (AC) to a new file.
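In the headers shown above the accession number is the second field when the line is split on the "|" character. A small sketch of that idea, assuming every header follows this layout; the function name `accession` is made up for the example.

```python
def accession(header):
    """Return the second '|'-separated field of a header line."""
    fields = header.split("|")   # e.g. ['>sp', 'P31946', '1433B_HUMAN ...']
    return fields[1]

headers = [
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens",
    ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens",
]
for header in headers:
    print(accession(header))
```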


See the Solution to challenge #11


Challenge #12

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here.

  • Read FASTA records from the file
  • Count (and print) the cysteine residues in each sequence.

See the Solution to challenge #12


Challenge #13

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here. Read the file and write to a new file only the records from Homo sapiens.


See the Solution to challenge #13


Challenge #14 homework

Download the UniProt multiple sequence FASTA file SwissProt-Human.fasta here.

  • Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine (‘M’) and contain at least two tryptophan residues (‘W’)

First:

  • Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine (‘M’)

Then:

  • Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences contain at least two tryptophan residues (‘W’)

Finally, merge the two steps.
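The two conditions can be tested separately and then combined with `and`. The sketch below only illustrates the conditions on short, made-up sequences; reading and writing the records is left to the exercise, and all function names are hypothetical.

```python
def starts_with_met(sequence):
    """First step: does the sequence start with a methionine?"""
    return sequence.startswith("M")

def has_two_trp(sequence):
    """Second step: does the sequence contain at least two tryptophans?"""
    return sequence.count("W") >= 2

def keep_record(sequence):
    """Merged step: both conditions must hold."""
    return starts_with_met(sequence) and has_two_trp(sequence)

# Made-up test sequences, not taken from SwissProt-Human.fasta.
for sequence in ("MKWLLWA", "MKWLLA", "AKWLLWA"):
    print(sequence, keep_record(sequence))
```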


See the Solution to challenge #14


Challenge #15 homework

Download the ap006852.gbk file here. Read a GenBank record and write its nucleotide sequence to a file in FASTA format. Extract the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk) and write it to a file.

Try to write it in FASTA format:

>AP006852
ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaa
agtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatcc
atctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaa
cacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa
gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga
...
...
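In a GenBank flat file the sequence follows the ORIGIN line, each sequence line starts with a position number, and the record ends with "//". A rough sketch of the conversion, under the assumption that the name is taken from the second field of the LOCUS line. The record below is a heavily simplified stand-in (a real file has many more fields), and the function name is made up for the example.

```python
# Simplified, hypothetical stand-in for a GenBank record.
genbank_text = """LOCUS       AP006852
ORIGIN
        1 ccactgtcca atggctcaac acgccaatca tcatacaata cccccaacag gaatcaccaa
//
"""

def genbank_to_fasta(text):
    """Return a FASTA-formatted string built from a GenBank record."""
    name = ""
    sequence = ""
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]          # second field of the LOCUS line
        elif line.startswith("ORIGIN"):
            in_sequence = True              # sequence lines start after ORIGIN
        elif line.startswith("//"):
            in_sequence = False             # end of the record
        elif in_sequence:
            # drop the leading position number and the spaces between blocks
            sequence += "".join(line.split()[1:])
    return ">" + name + "\n" + sequence + "\n"

print(genbank_to_fasta(genbank_text))
```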

See the Solution to challenge #15


Recap: parsing data records

  • Start by visually inspecting the file you want to parse

  • Identify the information you want to extract

  • Identify separators to select your information using if conditions

  • Use lists if you have to compare data from different files
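These points can be combined into a small reusable parser. The sketch below uses the ">" separator and if conditions to group lines into records, and stores them in a list so they can later be compared with data from another file. It is an illustration with made-up records, not the course's reference code.

```python
def parse_fasta(lines):
    """Group FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:            # close the previous record
                records.append((header, "".join(chunks)))
            header = line
            chunks = []
        elif line and header is not None:     # sequence line of the current record
            chunks.append(line)
    if header is not None:                    # do not forget the last record
        records.append((header, "".join(chunks)))
    return records

# Made-up input; in practice the lines would come from an open file.
lines = [
    ">record1 made-up example",
    "MKWL",
    "LWA",
    "",
    ">record2 made-up example",
    "ACDE",
]
for header, sequence in parse_fasta(lines):
    print(header, len(sequence))
```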

