Showing posts with label gist. Show all posts

Thursday, 8 February 2018

Bulk downloading proteome files from UniProt using Python

It's that time again, where the following has happened:
  1. I want to do some niche bioinformatics-related thing
  2. I cobble together a quick script to do said thing
  3. I throw that script up on the internet on the off chance it will save someone else the time of doing step 2
It's a slight shift of target and scale from a similar previous post (in which I used Python to extract specific DNA sequences from UCSC). This time I've been downloading a large number of proteome files from UniProt.

It's all explained in the docstring, but the basic idea is that you go on UniProt, search for the proteomes you want, and use their export tool to download tsv files containing the unique accession numbers which identify the data you're after. Then you simply run this script in the same directory; it takes those accessions, turns them into URLs, downloads the FASTA data at each address and outputs it to new FASTA files on your computer, with one output file per tsv file, named after it.

The best thing about this is you can download multiple different lists of accessions, and have them output to separate files. Say you have a range of pathogens you're interested in, each with multiple proteomes banked; this way you end up with one FASTA file for each, containing as many of their proteomes as you felt like including in your search.
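
For anyone who just wants the gist of the approach without opening the script, here is a minimal sketch of the same idea. It is not the original script: the endpoint (`rest.uniprot.org/uniprotkb/stream`) is my assumption about where UniProt currently serves this data, the function names are made up for illustration, and it assumes each tsv has a header row with the proteome accession in the first column.

```python
import csv
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: UniProt's current REST endpoint; the address may well change.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def read_accessions(tsv_path):
    """Return the first-column values of a tsv export, skipping the header row."""
    with open(tsv_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    # Assumption: row 0 is a header and column 0 holds the accession.
    return [row[0].strip() for row in rows[1:] if row and row[0].strip()]


def proteome_url(accession):
    """Turn one proteome accession (e.g. UP000005640) into a FASTA download URL."""
    query = urllib.parse.urlencode(
        {"query": f"proteome:{accession}", "format": "fasta"}
    )
    return f"{BASE_URL}?{query}"


def download_all(directory="."):
    """Write one FASTA file per tsv file found in `directory`."""
    for tsv_path in sorted(Path(directory).glob("*.tsv")):
        out_path = tsv_path.with_suffix(".fasta")
        with open(out_path, "w") as out:
            for accession in read_accessions(tsv_path):
                # Each accession's proteome is appended to the same output file.
                with urllib.request.urlopen(proteome_url(accession)) as response:
                    out.write(response.read().decode("utf-8"))
```

Calling `download_all()` in a folder containing, say, a (hypothetical) `pathogens.tsv` should leave you with a `pathogens.fasta` alongside it, holding every proteome listed in that file.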


Saturday, 25 February 2017

Download specific DNA sequences from hg19 using Python

I've been working on a little side-project recently that involved needing to grab lots of different human DNA sequences based on their position, which led me to discover the wonderful UCSC DAS server (from this informative Biostars thread).

Seeing as the rest of the project was written in Python, I knocked together a quick function to do just that. It's all nice and easy: just give it the chromosome number/letter*, and a numerical start and stop position, and the function returns the hg19 DNA sequence in that range.

I'm also trying to make a bit more use of GitHub (including knocking together a place for my publications), so I thought this was the perfect thing to make a gist from:

* Currently this function won't be able to grab anything from the unassigned chromosome contigs - just chromosomes 1-22, X, Y and mitochondrial (M) sequences.
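
In case the gist isn't showing, a function along these lines can be sketched roughly as follows. This is an illustration rather than the gist itself: the URL pattern is the one commonly quoted for the UCSC DAS server (an assumption on my part, and UCSC may change or retire it), the function names are invented here, and it assumes the server replies with XML holding the sequence inside a `DNA` element.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the commonly quoted UCSC DAS address for hg19 DNA segments.
DAS_URL = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr{chrom}:{start},{stop}"

# Only the assembled chromosomes are supported, not the unassigned contigs.
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "M"}


def das_url(chrom, start, stop):
    """Build the DAS request URL for a chromosome and an inclusive range."""
    chrom = str(chrom).upper()
    if chrom not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome: {chrom}")
    if int(start) > int(stop):
        raise ValueError("start must not be greater than stop")
    return DAS_URL.format(chrom=chrom, start=int(start), stop=int(stop))


def parse_das_dna(xml_text):
    """Pull the sequence out of a DAS XML reply, dropping line breaks."""
    dna = ET.fromstring(xml_text).find("./SEQUENCE/DNA")
    if dna is None or dna.text is None:
        raise ValueError("no DNA element found in the DAS response")
    return "".join(dna.text.split())


def get_hg19_sequence(chrom, start, stop):
    """Fetch the hg19 sequence for the given range (needs network access)."""
    with urllib.request.urlopen(das_url(chrom, start, stop)) as response:
        return parse_das_dna(response.read())
```

Something like `get_hg19_sequence("X", 100000, 100100)` would then return the bases in that range as a single string, provided the server is reachable and still answers in this format.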