- I want to do some niche bioinformatics related thing
- I cobble together a quick script to do said thing
- I throw that script up on the internet on the off chance it will save someone else the time of doing step 2
It's all explained in the docstring, but the basic idea is that you go on UniProt, search for the proteomes you want, and use their export tool to download tsv files containing the unique accession numbers which identify the data you're after. Then you simply run this script in the same directory: it takes those accessions, turns them into URLs, downloads the FASTA data at each address, and writes it out to new FASTA files on your computer, one file named after each of the tsv files.
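For the curious, the URL construction really is as simple as it sounds: each proteome accession gets sandwiched between a fixed query prefix and a FASTA format suffix (the same ones used in the script below). A minimal sketch, where the accession is just a made-up example:

# Sketch of the URL the script builds for each accession
# (the accession below is a hypothetical example, not taken from the script)
url_prefix = 'https://www.uniprot.org/uniprot/?query=proteome:'
url_suffix = '&format=fasta'
accession = 'UP000000625'
print(url_prefix + accession + url_suffix)
# https://www.uniprot.org/uniprot/?query=proteome:UP000000625&format=fasta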
The best thing about this is that you can download multiple different lists of accessions and have them output to separate files. Say you have a range of pathogens you're interested in, each with multiple proteomes banked; this way you end up with one FASTA file for each pathogen, containing as many of its proteomes as you felt like including in your search.
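To be concrete about the naming, the script just takes everything before the first period of each tsv filename and reuses it for the gzipped FASTA output. A quick sketch, with hypothetical filenames:

# Sketch of the output naming convention (filenames here are made up)
for f in ['Yersinia_pestis.tsv', 'Vibrio_cholerae.tsv.gz']:
    base_name = f.split('.')[0]
    print(base_name + '.fasta.gz')
# Yersinia_pestis.fasta.gz
# Vibrio_cholerae.fasta.gz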
""" | |
download-proteome-fastas.py | |
Used to download whole proteomes from UniProt. | |
First you need to go to http://www.uniprot.org/proteomes/ and download lists of proteome accessions | |
- Can be compressed or uncompressed, as many files as desired | |
- Run this script in the same directory | |
- Files should be named in format [unique-identifier].tsv(.gz - if compressed) | |
- 'tsv' HAS to be present, separated from the identifier by a single period | |
The actual trick of the URL to use was found here: https://www.biostars.org/p/292993/ (top answer as of 2018-01-29) | |
""" | |
__version__ = '0.1.0' | |
__author__ = 'Jamie Heather' | |
import gzip | |
import os | |
import urllib2 | |
if __name__ == '__main__': | |
# Get files | |
all_files = [x for x in os.listdir(os.getcwd()) if 'tsv' in x] | |
all_files.sort() | |
# Define the URL parameters | |
url_prefix = 'https://www.uniprot.org/uniprot/?query=proteome:' | |
url_suffix= '&format=fasta' | |
print "Reading in accessions, downloading fasta data..." | |
# Loop through all accession tsv files | |
for f in all_files: | |
# Determine opener | |
if f.endswith(".gz"): | |
opener = gzip.open | |
else: | |
opener = open | |
accessions = [] | |
base_name = f.split('.')[0] | |
out_name = base_name + '.fasta.gz' | |
print '\t' + base_name | |
# Open file, read in accessions | |
with opener(f, 'rU') as in_file, gzip.open(out_name, 'w') as out_file: | |
line_count = 0 | |
for line in in_file: | |
if line_count == 0: | |
headers = line.rstrip().split('\t') | |
else: | |
bits = line.rstrip().split('\t') | |
accession = bits[0] | |
# Determine full URL, pull the data and write to output file | |
url = urllib2.urlopen(url_prefix + accession + url_suffix) | |
for url_line in url: | |
out_file.write(url_line) | |
line_count += 1 |
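Note that the script above is written for Python 2 (urllib2, print statements). If you're on Python 3, the core download step would look roughly like this instead; this is just a sketch on the assumption that the same UniProt query URL format keeps working:

# Rough Python 3 equivalent of the download step (a sketch, not part of the
# original script; assumes the same UniProt query URL format still works)
import gzip
import urllib.request

url_prefix = 'https://www.uniprot.org/uniprot/?query=proteome:'
url_suffix = '&format=fasta'

def download_proteome(accession, out_path):
    # Fetch the FASTA for one proteome accession and append it to a gzipped file
    with urllib.request.urlopen(url_prefix + accession + url_suffix) as response:
        with gzip.open(out_path, 'ab') as out_file:
            out_file.write(response.read())

# Hypothetical usage:
# download_proteome('UP000000625', 'escherichia_coli.fasta.gz')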