Seeing as the rest of the project was written in Python, I knocked together a quick function to do just that. It's all nice and easy: just give it the chromosome number/letter*, and a numerical start and stop position, and the function returns the hg19 DNA sequence in that range.
I'm also trying to make a bit more use of GitHub (including knocking together a place for my publications), so I thought this was the perfect thing to make a gist from:
""" | |
get_hg19_sequence.py | |
Jamie Heather, February 2017 | |
For use on Python 2.7, requires urllib2 module | |
""" | |
import urllib2 | |
def get_hg19_seq(chrm, seq_from, seq_to): | |
""" | |
Takes a chromosome number or name (1-22, X/Y/M) and two coordinates (from/to) | |
Returns the corresponding hg19 nucleotide sequence via the UCSC DAS server. | |
""" | |
base_url = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr" | |
page = urllib2.urlopen(base_url + str(chrm) + ":" + str(seq_from) + "," + str(seq_to)) | |
contents = [] | |
for line in page: | |
if "<" not in line: | |
contents.append(line.rstrip()) | |
full_seq = "".join(contents).upper() | |
return full_seq |
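For anyone on Python 3, where urllib2 no longer exists (its contents moved into urllib.request), here is a rough sketch of how the same approach might look. The helper names (build_das_url, parse_das_lines, get_hg19_seq_py3) are my own inventions for illustration, and it assumes the UCSC DAS server still answers at the same URL with the same XML-wrapped output, which I haven't verified. Splitting out the URL building and the parsing means both can be checked without touching the network.

```python
# Python 3 sketch of the same idea. All helper names here are hypothetical,
# and the DAS URL/response format is assumed to match the Python 2 version.
from urllib.request import urlopen

BASE_URL = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr"


def build_das_url(chrm, seq_from, seq_to):
    """Build the DAS query URL for one chromosome and coordinate range."""
    return "{}{}:{},{}".format(BASE_URL, chrm, seq_from, seq_to)


def parse_das_lines(lines):
    """Join every line without XML markup into one upper-case sequence."""
    return "".join(line.strip() for line in lines if "<" not in line).upper()


def get_hg19_seq_py3(chrm, seq_from, seq_to):
    """Fetch and parse the sequence (needs network access and a live server)."""
    with urlopen(build_das_url(chrm, seq_from, seq_to)) as page:
        # Python 3 returns bytes, so decode before doing string tests
        text = page.read().decode("utf-8")
    return parse_das_lines(text.splitlines())
```

Calling get_hg19_seq_py3("X", 100000, 100050) should then behave like the original function, provided the server is reachable.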