Wednesday, 6 July 2016

Retreiving GenBank FASTA records with Python

I recently needed to download a relatively large number of GenBank FASTA records from GenBank (as I wanted to have a play with the paired alpha-beta TCR chains sequenced from single cells in this lovely paper.

Unfortunately what is supposedly probably the easiest way - using NCBI's Batch Entrez function - did not work for me. Instead I just got lots of "the following records can't be retrieved/might be obsolete" and "No items found" warnings.

However there are a number of other ways, but I especially liked Pierre Lindenbaum's answer on https://www.biostars.org/p/3297/, which basically uses bash to pull the record directly from the NCBI website. Nice and simple, and it also gives the URL format required.

As I want to do some post-download processing in Python, I thought I may as well download the records directly using the urllib2 module. In case it may be of use to others (and for the next time I need to do this myself!) I thought I'd throw the basic code up here. The only requirement that this code has is to already have a file containing a list of GenBank accession IDs: in this case I just generated this in Excel for the ranges needed, but it could be easily coded directly in Python.

import urllib2

with open('GenbankAccessions') as infile, open('Sequences.fasta', "w") as outfile:
  for accession in infile:
    url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=" + accession.rstrip() + "&rettype=fasta"
    entry = urllib2.urlopen(url)
    for line in entry:
      outfile.write(line)

This will write all of the FASTA records of the accessions listed in the 'GenbankAccessions' file to the new out file 'Sequences.fasta'.