Running BLAST from Python

Five ways to run BLAST

  • locally from the shell command line
  • locally from a Python script or interactive Python session
  • locally using Biopython
  • through the NCBI web server using Biopython
  • using your browser and the BLAST web page


What are the advantages of running BLAST locally?

  • you can search a query sequence in a customised database, e.g. in a newly sequenced genome you are studying, or a set of protein sequences of your interest (e.g. only protein kinases).
  • you may want to insert the program in a pipeline
  • only by running BLAST locally you have full control over the sequence database and by that, reproducibility of your search


Running Blast locally

  • Download and install the BLAST+ package from here
  • BLAST manual can be find here
  • The downloaded files are unpacked into a BLAST directory, and you have to add the path of this directory to the PATH environment variable of your computer (i.e. tell your system where to look for the installed BLAST programs)
  • Otherwise, you have to change to the BLAST directory on the shell and run BLAST from there
  • Inform the BLAST programs which directory to search for the databases
  • In other words you have to modify two environment variables: PATH and BLASTDB


Modify the PATH environment variable

  • When you install the program from source, you will have to place the downloaded package under a desired directory, e.g. /home/john
  • When you unpack the package, a BLAST directory will appear in /home/john (e.g. /home/john/ncbi-blast-2.2.23+)
  • You have to add to the PATH environment variable the bin directory under this BLAST directory

Under the bash shell (.bash_profile or .bashrc)

PATH=$PATH:/home/john/blast-2.2.23+/bin
export PATH

Under the tcsh shell (.cshrc ):

setenv PATH ${PATH}:/home/john/ncbi-blast-2.2.23+/bin

Notice that when you use the dmg disk to install BLAST on Mac OS X (10.4 or higher), all BLAST+ programs will be installed under /usr/local/ncbi/blast/bin


Modify the BLASTDB environment variable

  • Create the BLAST database directory /blast/db in your home directory
    mkdir /home/john/blast/db
    
  • This is the directory where you will put all the databases (either downloaded from the BLAST website or your custom ones) that you will use with BLAST.
  • Save at least a database in /home/john/blast/db • If you want to download a database from NCBI, go here


Create a .ncbirc text file in your home directory having the following path specification

; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/john/blast/db

The semicolon at the beginning of the first and third lines indicates a comment.

  • Unless you use a pre-formatted database downloaded from the NCBI ftp site, you will need to format your custom sequence file.
  • makeblastdb produces BLAST databases from FASTA files:
    makeblastdb –in genome.fasta -parse_seqids -dbtype prot
    

    -in is the option for the input file
    -parse_seqids enables parsing of sequence ids
    -dbtype type of input molecules (nucl or prot)

The query sequence can be in FASTA format and this is the structure of the command line:

blastProgram -query InSeq.fasta -db Database -out OutFile

For example:

blastp -query P05480.fasta -out blast_output -db nr.00

blastp aligns protein sequences, nr.00 is the name of the BLAST-formatted database
blast_output is the name you have chosen for the BLAST output
P05480.fasta is a file that contains your query sequence in FASTA format

import os
S = "blastp -query P05480.fasta -out\
    blast_output -db nr.00"
os.system(S)
import subprocess
command_line = ['blastp','-query',
                'P05480.fasta','-out',
                'blout','-db','nr.00']
subprocess.call(command_line)


Passing the input to a Python program from the command line

python gbk_to_fasta.py ap006852.gbk
import sys

print sys.argv

% python sys_argv.py
['sys_argv.py']

% python sys_argv.py ap006852.gbk
['sys_argv.py', 'ap006852.gbk']

% python sys_argv.py ap006852.gbk ap006852.fasta
['sys_argv.py', 'ap006852.gbk', 'ap006852.fasta']

Challenge #1

  • Read the ap006852.gbk file into a program (read_gbk.py) from the command line;
  • Read the file line-by-line and print the line corresponding to the ‘LOCUS’ identifier;
  • Run the program.

See the Solution to challenge #1


Challenge #2

Run Blastp from a script for the three sequences

seqs = ['P00519', 'P05480', 'P12931']

Use subprocess and a tabular format for the output

The option for the output format is -outfmt
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = XML Blast output
6 = tabular
7 = tabular with comment lines
8 = Text ASN.1
9 = Binary ASN.1
10 = Comma-separated values


See the Solution to challenge #1


Challenge #3

Read the three blast output files and write to a new file file the values in the first and the third columns just for the first line of each file.


See the Solution to challenge #2


Back

Back to main page.