Unfortunately what is supposedly probably the easiest way - using NCBI's Batch Entrez function - did not work for me. Instead I just got lots of "the following records can't be retrieved/might be obsolete" and "No items found" warnings.
However there are a number of other ways, but I especially liked Pierre Lindenbaum's answer on https://www.biostars.org/p/3297/, which basically uses bash to pull the record directly from the NCBI website. Nice and simple, and it also gives the URL format required.
As I want to do some post-download processing in Python, I thought I may as well download the records directly using the urllib2 module. In case it may be of use to others (and for the next time I need to do this myself!) I thought I'd throw the basic code up here. The only requirement that this code has is to already have a file containing a list of GenBank accession IDs: in this case I just generated this in Excel for the ranges needed, but it could be easily coded directly in Python.
import urllib2
with open('GenbankAccessions') as infile, open('Sequences.fasta', "w") as outfile:
for accession in infile:
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=" + accession.rstrip() + "&rettype=fasta"
entry = urllib2.urlopen(url)
for line in entry:
outfile.write(line)
This will write all of the FASTA records of the accessions listed in the 'GenbankAccessions' file to the new out file 'Sequences.fasta'.
No comments:
Post a Comment