jamimmunology: programming

Showing posts with label programming. Show all posts

Wednesday, 6 July 2016

Retreiving GenBank FASTA records with Python

I recently needed to download a relatively large number of GenBank FASTA records from GenBank (as I wanted to have a play with the paired alpha-beta TCR chains sequenced from single cells in this lovely paper.

Unfortunately what is supposedly probably the easiest way - using NCBI's Batch Entrez function - did not work for me. Instead I just got lots of "the following records can't be retrieved/might be obsolete" and "No items found" warnings.

However there are a number of other ways, but I especially liked Pierre Lindenbaum's answer on https://www.biostars.org/p/3297/, which basically uses bash to pull the record directly from the NCBI website. Nice and simple, and it also gives the URL format required.

As I want to do some post-download processing in Python, I thought I may as well download the records directly using the urllib2 module. In case it may be of use to others (and for the next time I need to do this myself!) I thought I'd throw the basic code up here. The only requirement that this code has is to already have a file containing a list of GenBank accession IDs: in this case I just generated this in Excel for the ranges needed, but it could be easily coded directly in Python.

import urllib2

with open('GenbankAccessions') as infile, open('Sequences.fasta', "w") as outfile:
  for accession in infile:
    url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=" + accession.rstrip() + "&rettype=fasta"
    entry = urllib2.urlopen(url)
    for line in entry:
      outfile.write(line)

This will write all of the FASTA records of the accessions listed in the 'GenbankAccessions' file to the new out file 'Sequences.fasta'.

Thursday, 24 October 2013

Biopython SeqIO tips

The Biopython libraries are of tremendous use in analysing biological data. I use them in the vast majority of my bespoke fastq analysis, as I'm sure many do.

However, there's a couple of tasks I regularly find myself wanting to do, but that I could not easily find solutions for. In case anyone finds themselves in similar need, here are the solutions I found, maybe save you a bit of time.

Output quality scores

This is the main one; it's always annoyed me that there doesn't seem to be an easy way to output the quality score in SeqIO. Sure, I know with the letter_annotations option I can output the actual score in numbers, but sometimes I want to output the actual ASCII characters (such as if I want to take a subsection of a fastq record, both of nucleotide and quality strings).

Here's how I get around it; turn the whole record into a string, split that on its new lines, and just take the fourth:

 str(record.format("fastq")).split("\n")[3]

Generate new fastq records in situ

Sometimes it's necessary to generate complete fastq records on the fly, as opposed to reading them in from an existing file.

I've found a couple of ways of doing this. The first comes buried in amongst some unrelated Biopython features, and looks like this:

 from Bio.Alphabet import generic_dna  
 from Bio.Seq import Seq  
 from Bio.SeqRecord import SeqRecord  
 new_record = SeqRecord(Seq("AAAAAA", generic_dna), id="New Record", description="")  
 new_record.letter_annotations["phred_quality"] = [40,40,40,40,40,40]

The second way came from an answer on Biostars, and works like this:

 from Bio import SeqIO  
 from StringIO import StringIO  
 fastq_string = "@%s\n%s\n+\n%s\n" % ("New Record", "AAAAAA", "IIIIII")  
 new_record = SeqIO.read(StringIO(fastq_string), "fastq-illumina")

They both work, with a couple of differences.

According to a quick test, the latter is faster, and obviously requires loading fewer modules.

Which to use might also depend on what format you have your qualities in; if you only have integers, then the first might be more tempting, whereas if you're making your new fastq from pieces of other existing records then the second is probably the way to go.

Monday, 30 September 2013

5 minute bash tutorial - how to win the lottery!

Apologies for the slightly misleading title. To expand...

In my office, a number of us are involved in analysis of large datasets, often sequence or text files. Each of us has our own little habits and idiosyncrasies; given the same task, we each find different ways to go about it.

A running joke with one of my friends is that I'll persevere with getting a long, bafflingly piped bash command long after she'll have resorted to scripting something in a higher level language.

(In all fairness we both do both, but I do find something deeply satisfying about chaining everything together in one line, no matter how human-unreadable it may be.)

Last Friday, this topic came up during a conversation about the upcoming Euromillions lottery. As a bit of a joke, I whipped up a quick line of bash to generate compatible random numbers (NB - I actually wrote it wrong in the tweet!).

Today I realised that if I explained the steps in the process, it might help someone new to bash understand how best to get some more functionality of out the command line.

So, here it is (with the error removed):

 for i in {1..50}; do echo $i >> win1; done; shuf win1 | head -5; for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2; rm win*

This should produce output that fits the Euromillions format; five numbers between one and fifty, with two 'star' numbers between one and eleven, e.g.:
22
4
2
15
6
7
2

Let's break this down.

 for i in {1..50};

This sets up a for loop, which will iterate through every integer in the range 1 to 50. For loops in bash take the format:

 for VARIABLE in THING  
 do   
 SOMETHING  
 SOMETHING_ELSE  
 done

Here, the VARIABLE is the thing that changes each iteration, and THING is the range or list of items. SOMETHING and SOMETHING_ELSE are then commands you want to run within the loop, which have to be sandwiched between 'do' and 'done'. More bash loop examples/explanation here.

However, all those carriage-returns take up far too much time (and make it far too comprehensible) - I tend to substitute those for semi-colons, which allow you to concatenate multiple commands into one-liners.

So, our one-liner first loops from 1 to 50, then what?

 do echo $i >> win1; done;

Here we have the contents of that first loop, sandwiched between 'do' and 'done' as per requirements, so that bash knows what to do at each turn of that loop.

Echo is the unix equivalent of 'print'; it just outputs something you tell it to the screen (or somewhere else).

In this case, we echo $i, where i denotes the variable we set up in the loop; we just need the dollar sign to let bash know we're asking for the contents of a variable. So, in the first cycle of the loop $i = 1, then in the next $i = 2, all the way up to $i = 50, which is when the loop stops.

Now if we just had 'do echo $i' in our loop, we'd just get the numbers 1 to 50 printed out in the terminal. However, we've used '>>' to redirect the numbers away from the terminal, instead putting them in a new file called 'win1'.

By using '>>' instead of '>', each cycle of the loop appends the new number to the end of the file win1 - if we used just one > each iteration of the loop would overwrite the whole file, leaving win1 at the end containing just the number 50.

So now we have the file win1 containing the numbers 1-50, each on its own line.

 shuf win1 | head -5;

This is the bit of the code that picks the numbers.

Shuf simply shuffles all of the lines in a file, into a random(ish) order. It's an incredible useful command, which I kicked myself when I found out about it, for having taken so long. In this instance, it mixes up all the lines in win1, so they will no longer be in numerical order.

Here's where we first encounter pipes - that long character between the shuf and the next command is said to 'pipe' (or carry) the results of the first command into the second.

We take the randomly shuffled win1 file, and pipe that into head (one of the commands you'll probably end up using most!) to take off the top five numbers, giving you your first five lottery numbers.

The next bit just repeats the same methodology for the lucky star balls (two numbers between one and eleven), outputting them after the previous five:

 for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2;

Finally, a bit of tidying up:

 rm win*

Rm removes files - here I've used the '*' wildcard character to make it remove all files which begin 'win', thus removing the win1 and win2 files created earlier. This is also vital if you're going to run the script more than once, as otherwise you might end up drawing the same number twice (as re-running the loop would add yet more numbers to the files).

Note that rm can be dangerous - by default, it won't double check if you're sure you want to delete something, and there's basically no going back, so be careful! For instance, by not explicitly stating the files I wanted to remove, this line would also remove any other files which fit the pattern - e.g. if I had a file called 'winning_lottery_numbers.txt' in the same directory, it would have been deleted too!

This can be overcome by being explicit in the files you're removing, like so:

 rm win1 win2

However if you have more than a few files, you're probably going to want to use the wildcards sensibly.

There you have it - a silly bit of code, but hopefully a useful one to a bash novice!

In case you're wondering, I did a small bit of rigourous, carefully controlled science with my script - I bought a ticket using numbers generated with either my code, or with the lottery's own 'lucky dip'. My code won nothing, but I got £3 from the lucky dip! Pretty conclusive we can all agree.

Tuesday, 10 September 2013

Python fastq analysis for beginners

It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I though I'd continue my vague hopscotch bioinformatics for beginners series, Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storage next generation DNA sequencing data. Each sequence gets four lines in the file: one for sequence identifier, nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be equal length).

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practise, but often when I want to code something I can't immediately think of a solution for, I just get combing tutorials, manuals and forums finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq", "rU")  
      
for record in SeqIO.parse(fastqfile,"fastq"):  
   
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE  
   
  countseqs += 1  
   
print "Reading from", fastqfile  
print countseqs, "records analysed."

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input   
import sys   # module needed for command line argument to get fastq name   
     
if (len(sys.argv) &gt; 1):   
  fastqfile = open(sys.argv[1], "rU")   
           
elif (len(sys.argv) == 1):   
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"   
  sys.exit()   
          
for record in SeqIO.parse(fastqfile,"fastq"):   
     
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE   
   
  countseqs += 1   
     
print "Reading from", fastqfile   
print countseqs, "records analysed."

NB - this script is easily converted into a fasta file, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
   
###################  
   
if (len(sys.argv) > 1):  
  fastqfile1 = open(sys.argv[1], "rU")  
   
   
elif (len(sys.argv) == 1):  
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"  
  sys.exit()  
   
######################  
   
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ1_outfile.close()  
   
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
SEQ2_outfile.close()  
   
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
   
####################  
      
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  
  countseqs += 1 # counts number of sequences measured  
   
  seq1 = regex.search('(ATATATATAT){e<=1}', str(record.seq))  
   
  if seq1:  
   
    countSEQ_1 +=1  
   
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    output_handle.close()  
   
  else:  
                  
    seq2 = regex.search('(CGCGCGCGCG){e<=1}', str(record.seq))  
   
    if seq2:   
   
      countSEQ_2 +=1  
   
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
      output_handle.close()  
   
print "Reading from", fastqfile1  
print countseqs, "records analysed."  
print countSEQ_1, "Sequence 1s detected."  
print countSEQ_2, "Sequence 2s sequences detected."

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.

----

Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.

Thursday, 29 November 2012

Quick and dirty coding: count DNA base composition in bash

Another installation in my (completely non-ordered, non-structured) quick and dirty code series; here's a way to easily measure the base composition of all of the reads in a fastq file (if you want to check your GC content say) in bash.

I was going to say a quick way, but objectively, it's probably not. Bash is really not an appropriate language to be doing this kind of thing in, but if you're working a lot in the terminal for other things it's sometimes convenient to just have a stock of bash-only scripts for common tasks.

 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l

As I'm hoping for these to possibly provide some assistance to people who know less about bioinformatics and programming than me (perhaps the smaller end of the wedge, but still significant), I'm going to try and provide some comments and context, so that people can learn from these as well as just use them.

Of note in this example is the first sed term used here (between the echo and the pipe ("|")), which is one that I find myself using very frequently for preliminary analysis of NGS data. All it does it look through a text file and send every 2nd line of four (N-4d just ignores the Nth line of every four lines) and send that to standard out (so either to the terminal or the next process in the pipe)

In a fastq file this setup outputs the sequence line, but obviously it's very easily adapted to output any of the lines you're interested in, from any repetitively structured text file.

In case people might want a marginally more detailed readout (with a bit more of a time penalty), I threw together the following bash script, which I saved as BaseComposition.sh, which I've commented to give a little flavour of some of the things one can do in bash (however inappropriately). Note that bash by default can only do integer arithmetic; you need to use bc or awk or something else to use floats.

 # Jamie Heather, UCL, November 2012  
   
 # Short bash script to give a quick read-out of the DNA base composition of a fastq file  
 # Run with command:  
 # $ sh BaseComposition.sh FILENAME.fastq  
   
 # Takes an input b, then runs through following pipe, whereby:  
 # every second line of 4 line repeat (i.e. sequence line) taken | sed adds tildes between each character | tr replaces tildes for new lines | grep counts each character | wc counts each hit  
 # Then feeds each base into it's own variable  
 ACOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l)  
 CCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l)  
 GCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l)  
 TCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l)  
   
 # Adds up all the individual bases   
 TOTALBASES=$(($ACOUNT + $CCOUNT + $GCOUNT + $TCOUNT))  
   
 # Prints out both the number of hits for each base, and the percentage of that each base's contribution to the whole  
 echo "\n"  
 echo A bases: "\n"$ACOUNT "\n"Percentage A:  
 echo "scale=2;$ACOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo C bases: "\n"$CCOUNT "\n"Percentage C:   
 echo "scale=2;$CCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo G bases: "\n"$GCOUNT "\n"Percentage G:   
 echo "scale=2;$GCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo T bases: "\n"$TCOUNT "\n"Percentage T:   
 echo "scale=2;$TCOUNT/$TOTALBASES"*100 | bc

Wednesday, 21 November 2012

Generating random amino acid/DNA/RNA sequences in Python

This is my first post in what I hope to be a recurrent theme, that of bioinformatic practice.

I am by training a wet-lab scientist; my degrees were in molecular genetics, and I'm now doing my PhD in immunology. I've always had a keen interest in bioinformatics, and always tried to do as much as I can, but I've not been formally trained (and I've never claimed to be the most natural mathmatician/statistician).

Accordingly, whenever I'm trying to do some new bit of coding or analysis, I usually like to look around in manuals, papers, blogs and forums to see what and how people are doing things, and see if I can prevent some arduous wheel-reinventing.

I've also noticed that a lot of the time some little snippet of info or code that would have proved very useful to me is omitted by more practised programmers, to whom it would have been so obvious it goes unsaid or forgotten.

In that vein, I plan to post some of the bioinformatic tricks and codes I find and use as I go, in the hopes that I can pay it forward to others who find myself in my position, that is (usually), knowing exactly what I want to do, just not knowing off-hand the names of the functions to do it.

So, kicking off, here's something that I've been meaning to code for a little while; a way to generate random biological sequences (be they protein or nucleic acid sequences), written in Python, based on this biostars post.

This example produces random peptide sequences. For my current purposes, I also wanted these peptides to be of variable length, around a given normal distribution (that I'd measured empirically from the data set I was making these random sequences to act as a control for).

Note that I also wanted to plot the distribution of the newly created peptides; if you didn't want that functionality you can just delete every line with a "# plot" comment.

import random
import matplotlib.pyplot as plt # plot

def random_AA_seq(length):
    return ''.join(random.choice('ACDEFGHIKLMNPQRSTVWY') for i in range(length))

list_size = 3000

lengths = []

for j in range(list_size):

    a = int(random.normalvariate(17, 2))
    lengths.append(a)  # plot
    print random_AA_seq(a)

plt.hist(lengths, bins=14) # plot
plt.show() # plot

You can adjust the distribution of lengths by changing the values fed to normalvariate; the former is the mean, the latter the standard deviation (I deduced the values of 17 and 2 here from the sample I was trying to match).

Then of course these is easily adapted to DNA or RNA by swapping 'ACDEFGHIKLMNPQRSTVWY' for 'ACGT' or 'ACGU'.

Also, thanks go to David Craft, whose blog post on Syntax Highlighter came in useful in this post!