Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Saturday, 10 November 2018

Making coding T cell receptor sequences from V-J-CDR3

If, like me, you work on T cell receptors, occasionally you’re probably going to want to express a particular TCR in cells. However you're not always going to have the sequence, at either nucleotide or protein sequence level.

Not a problem, you can sort this out. You can look up all the relevant germline sequences from IMGT, trim away all the non-used bits, add in the non-templated stuff, then manually stitch it all together and have a look to see if it still makes what you were expecting. You can do all that... or you can just use the code I wrote.

StiTChR does it all: give it a V gene, a J gene, and a CDR3 amino acid sequence and it'll look up, trim and stitch together all the relevant TCR nucleotide sequences for you, back-translating the non-templated region using the most frequent codon per residue. It also translates it all, and will run a quick rudimentary alignment against a known partial protein sequence if you have one for a visual confirmation that it's made the right thing.

You can then take the alpha/beta TCR sequences it generates, bang them into an expression vector (typically split by a T2A sequence or something) and transduce your cells of interest.

I wrote this code to save me a bit of time in future, but hopefully it can do the same for some of you!

Thursday, 8 February 2018

Bulk downloading proteome files from UniProt using Python

It's that time again, where the following has happened:
  1. I want to do some niche bioinformatics related thing
  2. I cobble together a quick script to do said thing
  3. I throw that script up on the internet on the offchance it will save someone else the time of doing 2
It's a little shift of target and scale from a similar previous post (in which I used Python to extract specific DNA sequences from UCSC. This time I've been downloading a large number of proteome files from UniProt.

It's all explained in the docstring, but the basic idea is that you go on UniProt, search for the proteomes you want, and use their export tool to download tsv files containing the unique accession numbers with identify the data you're after. Then you simply run this script in the same directory; it takes those accessions, turns them in to URLs, downloads the FASTA data at that address and outputs it to new FASTA files on your computer, with separate files named after whatever the tsv files were named.

The best thing about this is you can download multiple different lists of accessions, and have them output to separate files. Say maybe you have a range of pathogens you're interesting in, each with multiple proteomes banked; this way you end up with one FASTA file for each, containing as many of their proteomes as you felt like including in your search.


Saturday, 25 February 2017

Download specific DNA sequences from hg19 using Python

I've been working on a little side-project recently that involved needing to grab lots of different human DNA sequences based on their position, which lead me to discover the wonderful UCSC DAS server (from this informative Biostars thread).

Seeing as the rest of the project was written in Python, I knocked together a quick function to do just that. It's all nice and easy: just give it the chromosome number/letter*, and a numerical start and stop position, and the function returns the hg19 DNA sequence in that range.

I'm also trying to make a bit more use of GitHub (including knocking together a place for my publications), so I thought this was the perfect thing to make a gist from:

* Currently this function won't be able to grab anything from the unassigned chromosome contigs - just chromosomes 1-22, X, Y and mitochondrial (M) sequences.

Wednesday, 6 July 2016

Retreiving GenBank FASTA records with Python

I recently needed to download a relatively large number of GenBank FASTA records from GenBank (as I wanted to have a play with the paired alpha-beta TCR chains sequenced from single cells in this lovely paper.

Unfortunately what is supposedly probably the easiest way - using NCBI's Batch Entrez function - did not work for me. Instead I just got lots of "the following records can't be retrieved/might be obsolete" and "No items found" warnings.

However there are a number of other ways, but I especially liked Pierre Lindenbaum's answer on https://www.biostars.org/p/3297/, which basically uses bash to pull the record directly from the NCBI website. Nice and simple, and it also gives the URL format required.

As I want to do some post-download processing in Python, I thought I may as well download the records directly using the urllib2 module. In case it may be of use to others (and for the next time I need to do this myself!) I thought I'd throw the basic code up here. The only requirement that this code has is to already have a file containing a list of GenBank accession IDs: in this case I just generated this in Excel for the ranges needed, but it could be easily coded directly in Python.

import urllib2

with open('GenbankAccessions') as infile, open('Sequences.fasta', "w") as outfile:
  for accession in infile:
    url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=" + accession.rstrip() + "&rettype=fasta"
    entry = urllib2.urlopen(url)
    for line in entry:
      outfile.write(line)

This will write all of the FASTA records of the accessions listed in the 'GenbankAccessions' file to the new out file 'Sequences.fasta'.

Thursday, 24 October 2013

Biopython SeqIO tips

The Biopython libraries are of tremendous use in analysing biological data. I use them in the vast majority of my bespoke fastq analysis, as I'm sure many do.
However, there's a couple of tasks I regularly find myself wanting to do, but that I could not easily find solutions for. In case anyone finds themselves in similar need, here are the solutions I found, maybe save you a bit of time.
Output quality scores
This is the main one; it's always annoyed me that there doesn't seem to be an easy way to output the quality score in SeqIO. Sure, I know with the letter_annotations option I can output the actual score in numbers, but sometimes I want to output the actual ASCII characters (such as if I want to take a subsection of a fastq record, both of nucleotide and quality strings).
Here's how I get around it; turn the whole record into a string, split that on its new lines, and just take the fourth:
 str(record.format("fastq")).split("\n")[3]  
Generate new fastq records in situ
Sometimes it's necessary to generate complete fastq records on the fly, as opposed to reading them in from an existing file.
I've found a couple of ways of doing this. The first comes buried in amongst some unrelated Biopython features, and looks like this:
 from Bio.Alphabet import generic_dna  
 from Bio.Seq import Seq  
 from Bio.SeqRecord import SeqRecord  
 new_record = SeqRecord(Seq("AAAAAA", generic_dna), id="New Record", description="")  
 new_record.letter_annotations["phred_quality"] = [40,40,40,40,40,40]   
The second way came from an answer on Biostars, and works like this:
 from Bio import SeqIO  
 from StringIO import StringIO  
 fastq_string = "@%s\n%s\n+\n%s\n" % ("New Record", "AAAAAA", "IIIIII")  
 new_record = SeqIO.read(StringIO(fastq_string), "fastq-illumina")  
They both work, with a couple of differences.
According to a quick test, the latter is faster, and obviously requires loading fewer modules.
Which to use might also depend on what format you have your qualities in; if you only have integers, then the first might be more tempting, whereas if you're making your new fastq from pieces of other existing records then the second is probably the way to go.

Sunday, 20 October 2013

Installing python dependencies for Decombinator

I've written previously about Decombinator, the TCR repertoire analysis program developed in our laboratory.

Having just formatted my computer and reinstalled all my packages, I've been reminded of what a faff this can be.

This time around I did things a lot more efficiently than I had before. Seeing as how my colleague responsible for updating the current readme is busy writing up his thesis, here's my quick guide to installing all the python modules required by Decombinator in Linux - at least in Ubuntu.

 sudo apt-get install python-numpy python-biopython python-matplotlib python-levenshtein   
 sudo apt-get install python-pip  
 sudo pip install acora  

Most of the modules are available in the repositories, which has the added bonus of filling in all their required dependencies. Acora (the module that enacts the Aho-Corasick finite-state machine) however is not, but is easily installed from the command line using pip.

Easy. Time to get decombining.

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version (and installation details) can be found on the Innate2Adaptive lab's Github repo.

Tuesday, 10 September 2013

Python fastq analysis for beginners


It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I though I'd continue my vague hopscotch bioinformatics for beginners series, Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storage next generation DNA sequencing data. Each sequence gets four lines in the file: one for sequence identifier, nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be equal length).

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practise, but often when I want to code something I can't immediately think of a solution for, I just get combing tutorials, manuals and forums finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq", "rU")  
      
for record in SeqIO.parse(fastqfile,"fastq"):  
   
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE  
   
  countseqs += 1  
   
print "Reading from", fastqfile  
print countseqs, "records analysed."  

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input   
import sys   # module needed for command line argument to get fastq name   
     
if (len(sys.argv) > 1):   
  fastqfile = open(sys.argv[1], "rU")   
           
elif (len(sys.argv) == 1):   
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"   
  sys.exit()   
          
for record in SeqIO.parse(fastqfile,"fastq"):   
     
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE   
   
  countseqs += 1   
     
print "Reading from", fastqfile   
print countseqs, "records analysed."   
NB - this script is easily converted into a fasta file, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
   
###################  
   
if (len(sys.argv) > 1):  
  fastqfile1 = open(sys.argv[1], "rU")  
   
   
elif (len(sys.argv) == 1):  
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"  
  sys.exit()  
   
######################  
   
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ1_outfile.close()  
   
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
SEQ2_outfile.close()  
   
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
   
####################  
      
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  
  countseqs += 1 # counts number of sequences measured  
   
  seq1 = regex.search('(ATATATATAT){e<=1}', str(record.seq))  
   
  if seq1:  
   
    countSEQ_1 +=1  
   
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    output_handle.close()  
   
  else:  
                  
    seq2 = regex.search('(CGCGCGCGCG){e<=1}', str(record.seq))  
   
    if seq2:   
   
      countSEQ_2 +=1  
   
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
      output_handle.close()  
   
print "Reading from", fastqfile1  
print countseqs, "records analysed."  
print countSEQ_1, "Sequence 1s detected."  
print countSEQ_2, "Sequence 2s sequences detected."  

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.

----

Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.

Wednesday, 21 November 2012

Generating random amino acid/DNA/RNA sequences in Python

This is my first post in what I hope to be a recurrent theme, that of bioinformatic practice.

I am by training a wet-lab scientist; my degrees were in molecular genetics, and I'm now doing my PhD in immunology. I've always had a keen interest in bioinformatics, and always tried to do as much as I can, but I've not been formally trained (and I've never claimed to be the most natural mathmatician/statistician).

Accordingly, whenever I'm trying to do some new bit of coding or analysis, I usually like to look around in manuals, papers, blogs and forums to see what and how people are doing things, and see if I can prevent some arduous wheel-reinventing.

I've also noticed that a lot of the time some little snippet of info or code that would have proved very useful to me is omitted by more practised programmers, to whom it would have been so obvious it goes unsaid or forgotten.

In that vein, I plan to post some of the bioinformatic tricks and codes I find and use as I go, in the hopes that I can pay it forward to others who find myself in my position, that is (usually), knowing exactly what I want to do, just not knowing off-hand the names of the functions to do it.

So, kicking off, here's something that I've been meaning to code for a little while; a way to generate random biological sequences (be they protein or nucleic acid sequences), written in Python, based on this biostars post.

This example produces random peptide sequences. For my current purposes, I also wanted these peptides to be of variable length, around a given normal distribution (that I'd measured empirically from the data set I was making these random sequences to act as a control for).

Note that I also wanted to plot the distribution of the newly created peptides; if you didn't want that functionality you can just delete every line with a "# plot" comment.

import random
import matplotlib.pyplot as plt # plot

def random_AA_seq(length):
    return ''.join(random.choice('ACDEFGHIKLMNPQRSTVWY') for i in range(length))

list_size = 3000

lengths = []

for j in range(list_size):

    a = int(random.normalvariate(17, 2))
    lengths.append(a)  # plot
    print random_AA_seq(a)

plt.hist(lengths, bins=14) # plot
plt.show() # plot

You can adjust the distribution of lengths by changing the values fed to normalvariate; the former is the mean, the latter the standard deviation (I deduced the values of 17 and 2 here from the sample I was trying to match).

Then of course these is easily adapted to DNA or RNA by swapping 'ACDEFGHIKLMNPQRSTVWY' for 'ACGT' or 'ACGU'.

Also, thanks go to David Craft, whose blog post on Syntax Highlighter came in useful in this post!