Thursday, 29 November 2012

Quick and dirty coding: count DNA base composition in bash

Another installation in my (completely non-ordered, non-structured) quick and dirty code series; here's a way to easily measure the base composition of all of the reads in a fastq file (if you want to check your GC content say) in bash.

I was going to say a quick way, but objectively, it's probably not. Bash is really not an appropriate language to be doing this kind of thing in, but if you're working a lot in the terminal for other things it's sometimes convenient to just have a stock of bash-only scripts for common tasks.

 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l  

As I'm hoping for these to possibly provide some assistance to people who know less about bioinformatics and programming than me (perhaps the smaller end of the wedge, but still significant), I'm going to try and provide some comments and context, so that people can learn from these as well as just use them.

Of note in this example is the first sed term used here (between the echo and the pipe ("|")), which is one that I find myself using very frequently for preliminary analysis of NGS data. All it does it look through a text file and send every 2nd line of four (N-4d just ignores the Nth line of every four lines) and send that to standard out (so either to the terminal or the next process in the pipe)

In a fastq file this setup outputs the sequence line, but obviously it's very easily adapted to output any of the lines you're interested in, from any repetitively structured text file.

In case people might want a marginally more detailed readout (with a bit more of a time penalty), I threw together the following bash script, which I saved as, which I've commented to give a little flavour of some of the things one can do in bash (however inappropriately). Note that bash by default can only do integer arithmetic; you need to use bc or awk or something else to use floats.

 # Jamie Heather, UCL, November 2012  
 # Short bash script to give a quick read-out of the DNA base composition of a fastq file  
 # Run with command:  
 # $ sh FILENAME.fastq  
 # Takes an input b, then runs through following pipe, whereby:  
 # every second line of 4 line repeat (i.e. sequence line) taken | sed adds tildes between each character | tr replaces tildes for new lines | grep counts each character | wc counts each hit  
 # Then feeds each base into it's own variable  
 ACOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l)  
 CCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l)  
 GCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l)  
 TCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l)  
 # Adds up all the individual bases   
 # Prints out both the number of hits for each base, and the percentage of that each base's contribution to the whole  
 echo "\n"  
 echo A bases: "\n"$ACOUNT "\n"Percentage A:  
 echo "scale=2;$ACOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo C bases: "\n"$CCOUNT "\n"Percentage C:   
 echo "scale=2;$CCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo G bases: "\n"$GCOUNT "\n"Percentage G:   
 echo "scale=2;$GCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo T bases: "\n"$TCOUNT "\n"Percentage T:   
 echo "scale=2;$TCOUNT/$TOTALBASES"*100 | bc  

Tuesday, 27 November 2012

Don't count your clusters before they hatch...

I've dropped a few veiled hints about my previous disastrous runs on the MiSeq (which I do plan to write up eventually, but the anguish is still too fresh), but today we got samples from our revised strategy running on the machine.

This is the first of our samples we've run ourselves (well, that we set up at the genomics expert's direction), and as such we were pretty excited to watch the data start to roll out in real time, particularly in light of our previous failures.

Then it happened. The first data to pop up in BaseSpace looked like this:

There's a couple of things that put the fear in me here.

The first is the upper panel; not only are the intensities quite low, some even below 200, they are going up.

The second is the more worrying; from our lower panel we can see that not only do we have relatively low cluster density (300k/mm2 when we aim for 800k/mm2, see blue box), but it looks like none of them have passed filter (indicated by the green box smooshed into zero on the axis).

Like I say, this is only the first run I've watched from the very beginning, so I was hoping that this was just a feature of it counting the clusters in earlier cycles before finishing the filter checks at cycle 25, as I'd read in some Illumina documentation.

However, I wasn't sure, so I spent a paranoid five minutes trying to double check whether this was the case, to no avail at all (and obviously I couldn't find the Illumina reference where I'd read it in the first place).

So it was with much relief that, when cycle 25 ticked over, this happened:

Phew. We might not have many clusters, but at least most of them are passing filter.

To anyone who's done more than one run this post probably seems like a lot of fuss about nothing, but I certainly spent those five minutes googling with my stomach turning over. Hopefully if anyone else finds themselves in a similar scenario, they might find their way here and heed this advice; just wait for cycle 25.

As an aside, the run is currently at cycle 150, and the intensities are still rising. Is this perhaps a feature of the v2 kits, or something more anomalous? Until I have a play with the sequence tomorrow, I'm reluctant to speculate on what this might mean.


The machine's still running, but I've got some feedback about the funny error profile, after I posted it on twitter. It turns out that this is to be expected in v2 kits. Good to know!

Sunday, 25 November 2012


I covered an interesting (if slightly intricate) paper for journal club last week, so I thought I'd go over some of the major techniques and findings here. Note that while I'm going to try to keep it simple, (for the purposes of this post) I'm going to assume a rudimentary knowledge of immunology.

First off, here is the paper, that came out in last month's edition of Immunity, entitled "Subsets of Nonclonal Neighboring CD4+ T Cells Specifically Regulate the Frequency of Individual Antigen-Reactive T Cells"1.

Coming from the group of Ronald Schwartz at the NIH, this body of work is interested in the mechanisms of T cell homeostatis.

Let's get fundamental. Our bodies are of a finite size, so we can only fit so many cells in our circulatory systems and tissues. In terms of our T cells, we want to keep as wide a repertoire of different T cell receptor (TCR) bearing clones as possible, to maximise the number of potential pathogens and challenges we can recognise.

However, after a T cell gets activated by its cognate pMHC target (and has expanded and differentiated into effector cells to clear the challenge), how does our body get rid of all those expanded clones, contracting that population and only that population?

This work builds on previous work from the same lead author, Nevil Singh, where they noticed something interesting2.

In order to understand what's gone on, I just need to explain a few of the basics of their model systems.

Both papers make heavy use of adoptively transferring transgenic-TCR bearing T cells, into mice which either have their own endogenous T cell machinery in place, or who are T cell deficient (CD3ε knock outs).

The main cell they transfer are 5C.C7 T cells, which bear a TCR that recognises a peptide of PCC (pidgeon cytochrome C). This group's previous paper found that when transferred into mice that express PCC (under a MHC-I promoter), one of two things can happen.

If the recipient mouse lacks its own, functioning T cell system (PCC+/Cd3e-/-) then the transferred 5C.C7 T cells persist in that mouse, causing an autoimmune arthritis. 

If the recipient mouse has it's own T cells however (PCC+/Cd3e+/+), then the 5C.C7 T cells not only fail to cause any disease, but they actually disappear from that mouse over time. Or, as they would have it, the 5C.C7 cells get 'deleted'.

At the start of this recent paper, they recapitulate this finding in a neat experiment where they transfer 5C.C7 cells (from transgenic, Rag2-/- mice3) into PCC+/Cd3e-/- mice, under one of three conditions; either alone, or having received a million CD4+ or CD8+ polyclonal T cells the day before (figure 1A). 

It seems that the deletor activity they'd noticed previously is entirely due to some component of a polyclonal CD4+ response.

Figure 1A (from the paper1): Survival of 5C.C7 T cells transferred into a T cell deficient mouse (blue line), compared against mice who received one million CD8+ (green line) or CD4+ (red line), measured in pooled lymph node and spleen. 5C.C7 cells measured by double staining for CD4 and Vβ3, the beta variable gene used in the 5C.C7 TCR.

However, when they go on to try and narrow down what CD4+ T cell might be responsible, the deletor phenotype doesn't appear to easily fractionate to any obvious subset.

Without a clear smoking gun, they adopt an alternate tack. They knew that transferring a million CD4 T cells into the target mouse protected them from the persistence and pathology of the 5C.C7 cells. So, they thought to try and divide this million up, and see which fractions (if any) remain deletors.

To do this they flow sort polyclonal CD4+ T cells into random pools of 100 cells, which they call 'centipools'. Each centipool is then expanded in vitro, and then a million of each expanded population is transferred into a mouse to assay for 5C.C7 deletor activity.

They found that of the 54 centipools tested, there were three that seemed to recapitulate the entire deletor phenotype.

They probed these pools all manor of ways to see if they could pin down what was responsible. They ran microarrays and sequenced TCR genes; nothing seemed to correlate with deletor ability, and nor was any one TCR clone shared across all three centipools.

With no hook to hang a hypothesis on, they go on the attack, and simply take all the TCR sequences from the strongest-deletor centipool and generate retrogenic TCR T cells for each of them.

Transferring these retrogenic T cells into the PCC+, T cell deficient mice reveals that within a given centipool, there's only one clone that's reponsible for all of the deletor phenotype. Moreover, this clone deletes 5C.C7 even in PCC- mice; there doesn't need to be target antigen present!

The authors then turn back to the literature; 5C.C7 is a well-studied TCR, perhaps there's something in the back catalogue that might be pertinent. Lo and behold, it turns out there are other peptides to which 5C.C7 is known to react, including one from the endogenous retrovirus GP, which is implicated in positively selecting 5C.C7 in the thymus4.

There follows some fairly fiddly experiments (which probably contain the least convincing data in the paper), where they reportedly show that this GP-derived peptide is actually a shared ligand for both the 5C.C7 TCR and the retrogenic TCR cloned from the deletor centipool.

You have to pump the mice full of GP peptide, but it seems that eventually you do see a slight proliferation of both 5C.C7 and deletor T cell, a feature which interestingly could not be repeated in vitro. Apparently, all other assays failed to assign the GP peptide as a stimulating ligand for either cell type.

So, show me the model.

The idea is that different, unrelated T cell clones can be categorised into colonies, based on shared, sub-threshold ligands (e.g. the 5C.C7 and deletor cells described in the paper).

If a member of a colony gets expanded by its (above threshold) ligand, it will expand. However, the contraction phase will then be mediated at the colony-level, enforced by other clonotypes within that colony, preventing collateral damage to non-colonial T cells and thus maintaining a rich, diverse TCR repertoire.

I really quite like this paper; it was a fiddly finding to unravel, and they make use of a number of techniques to do so. I'm not 100% convinced about the GP peptide, but should it be true it's an incredibly intriguing finding. It also would have been nice if they'd made inroads into finding out the mechanism of this colony-regulation, but I suppose there's always the next paper.

My PhD project is all about sequencing TCR repertoires, and day-to-day we tend to operate one the naive assumption that by-and-large the information required to maintain a given T cell clone's population is wrapped up in the genotype and phenotype of that clone. The idea that every T cell clonotype could be modulated by several others within its colony (and in turn be modulating that colony's other members) is fascinating, and could well represent another level of intricate regulation to a marvellously complex biological system.

For an alternative write up of this paper, this edition of the journal also featured a preview of the piece.

If you're interested, you can also check out the prezi presentation I gave in my journal club presentation of this paper.

1. Singh, N. J., Bando, J. K., & Schwartz, R. H. (2012). Subsets of Nonclonal Neighboring CD4(+) T Cells Specifically Regulate the Frequency of Individual Antigen-Reactive T Cells. Immunity, 37(4), 735-46. Elsevier Inc. doi:10.1016/j.immuni.2012.08.008 ($)

2. Singh, N. J., Chen, C., & Schwartz, R. H. (2006). The impact of T cell intrinsic antigen adaptation on peripheral immune tolerance. PLoS biology, 4(11), e340. doi:10.1371/journal.pbio.0040340

3. As these are Rag2 knock outs they would be unable to undergo endogenous V(D)J rearrangement, and thus will only produce mature T cells for the transgenic 5C.C7 receptor

4. Ebert, P.J., Jiang, S., Xie, J., Li, Q.J., and Davis, M.M. (2009). An endogenous positively selecting peptide enhances mature T cell responses and becomes an autoantigen in the absence of microRNA miR-181a. Nat. Immunol. 10, 1162–1169. doi:10.1038/ni.1797 ($)

Wednesday, 21 November 2012

Generating random amino acid/DNA/RNA sequences in Python

This is my first post in what I hope to be a recurrent theme, that of bioinformatic practice.

I am by training a wet-lab scientist; my degrees were in molecular genetics, and I'm now doing my PhD in immunology. I've always had a keen interest in bioinformatics, and always tried to do as much as I can, but I've not been formally trained (and I've never claimed to be the most natural mathmatician/statistician).

Accordingly, whenever I'm trying to do some new bit of coding or analysis, I usually like to look around in manuals, papers, blogs and forums to see what and how people are doing things, and see if I can prevent some arduous wheel-reinventing.

I've also noticed that a lot of the time some little snippet of info or code that would have proved very useful to me is omitted by more practised programmers, to whom it would have been so obvious it goes unsaid or forgotten.

In that vein, I plan to post some of the bioinformatic tricks and codes I find and use as I go, in the hopes that I can pay it forward to others who find myself in my position, that is (usually), knowing exactly what I want to do, just not knowing off-hand the names of the functions to do it.

So, kicking off, here's something that I've been meaning to code for a little while; a way to generate random biological sequences (be they protein or nucleic acid sequences), written in Python, based on this biostars post.

This example produces random peptide sequences. For my current purposes, I also wanted these peptides to be of variable length, around a given normal distribution (that I'd measured empirically from the data set I was making these random sequences to act as a control for).

Note that I also wanted to plot the distribution of the newly created peptides; if you didn't want that functionality you can just delete every line with a "# plot" comment.

import random
import matplotlib.pyplot as plt # plot

def random_AA_seq(length):
    return ''.join(random.choice('ACDEFGHIKLMNPQRSTVWY') for i in range(length))

list_size = 3000

lengths = []

for j in range(list_size):

    a = int(random.normalvariate(17, 2))
    lengths.append(a)  # plot
    print random_AA_seq(a)

plt.hist(lengths, bins=14) # plot # plot

You can adjust the distribution of lengths by changing the values fed to normalvariate; the former is the mean, the latter the standard deviation (I deduced the values of 17 and 2 here from the sample I was trying to match).

Then of course these is easily adapted to DNA or RNA by swapping 'ACDEFGHIKLMNPQRSTVWY' for 'ACGT' or 'ACGU'.

Also, thanks go to David Craft, whose blog post on Syntax Highlighter came in useful in this post!

Tuesday, 20 November 2012

3D print lab combs

Ever since I read Cory Doctorow's Makers, I've wanted a 3D printer (or at least access to one). Completely unjustifiably, I might add; I freely admit I just want to play, maybe make some custom lego bricks or miniatures of myself or something frivolous.

So it is with great excitement that I read stories like this, about people generating standard, useful lab equipment in house using 3D printers (Russel's blog is particularly relevant, as I could do with some custom mixed-width gel combs that aren't available commercially!).

Every story I read is one step closer to 3D printers becoming standard lab items, making me able to make my own unnecessary bits of plastic fun practical pieces of lab equipment cheaply and conveniently.

Update: I just did a quick search on Thingiverse (a repository of files and instructions to construct objects with tools like 3D printers and laser-cutters), and there already is a selection of electrophoresis parts available.

I have a feeling the DIYbio movement is going to run riot with this technology.

Monday, 5 November 2012

Custom MiSeq sequencing primers

Custom sequencing primers opens the door to a lot of possibilities, particularly when attempting amplicon sequencing, allowing you to sequence off your amplification primers and not waste high-quality read data on the one bit of sequence you already knew.

Here follows a list of considerations one might want to take into account when generating custom sequencing oligonucleotides for use on the Illumina MiSeq machine (ignoring all the low diversity issues inherent in Illumina sequencing).

I write this with full disclosure from the outset; I'm relatively new to NGS, new to the MiSeq, and the only runs I've tried using custom primers have failed abysmally. However, in troubleshooting why these might have failed (to feature in a later, longer post) I've assembled a fair bit of information on custom primer design, which might be useful for others (or others might be able to point out where I've gone wrong!).

First, the basics. The MiSeq is the latest offering from sequencing powerhouse Illumina, providing a benchtop alternative to it's larger previous models. It uses the same Solexa-style chemistry as the Genome Analyzer and the HiSeq, but in a smaller box, aimed at lower-throughput applications.

The important thing for this post, is that the MiSeq does support the use of custom primers. These can either be used independently, or spiked into the primer reservoirs for read one or two (see p73 of the MiSeq System User Guide, access probably requires free sign up).

The question then becomes how to design these primers.

The first consideration practically goes without saying; the primer needs to prime, and only where you want it to. Were there to be more than one complementary site per cluster there would be multiple bases seen per cycle, and the cluster would be thrown out.

Within that constraint, the basic idea typically is then to match the characteristics of the Illumina sequencing primers, the sequences of which can be found in the Illumina Customer Sequence Letter (obligatory copyright notice; oligonucleotide sequences © 2007-2012 Illumina, Inc, all rights reserved, yada yada yada).

(By the way: In assessing the Tm of your oligos, apparently Illumina tech suggest using the IDT calculator)

The Illumina sequencing primer 1 (SP1) is 33bp long, 51.5% GC, with a Tm of 65.5°C, while SP2 is a bit longer/tighter binder, being 37bp long, 59.5% GC and having a Tm of 70.1°C. Note that the MiSeq runs hotter than the HiSeq during the deblocking and extension stages, meaning that the Tm of your oligo has to exceed 65°C in order to prevent dissociation from the target.

Logistically speaking, the most important of these criteria to match (or exceed) is probably the Tm; the length and GC content will largely be dictated by your sequences*.

If the primers used to generate your amplicon have a Tm lower than 65°C, there are several workarounds you can use to bring the melting point up to scratch.

One option is to extend your sequencing oligo into the P5 or P7 element that it borders, akin to how the Illumina SP1 oligo overlaps P5 by 5 bases. A note of caution; Illumina tech have warned me that too long an overlap could result in non-specific annealing of primers to the lawn of adapters. I can't see that this would ever result in mis-calling, as the complex would have a huge 3' overhang (therefore not extend, thus contributing no fluorescence), but it could certainly reduce the effective concentration of sequencing primer in the mix.

Another similar sensible option suggested to me recently by Tony Brooks (of UCL Genomics) is to insert a few dummy nucleotides to the 5' of your PCR primer, and then include these in the sequencing oligo, bringing the Tm up without entering the adapter elements.

Lastly, a powerful but potentially tricky option; use modified bases in your sequencing primer to increase the stability. Particularly, the use of Locked Nucleic Acid (LNA) containing oligonucleotides (the subject of a future post I think) has received a lot of talk as a potential way to boost the Tm/specificity without increasing the length.

Custom oligos open doors to a lot of useful and novel sequencing strategies. However, as I'm finding out - to my continued frustration - they do present a wealth of problems, which by their very nature are unique, and of course, unsupported.

If anyone has any pointers, corrections or other advice, please do share in the comments.

TL:DR: Check this nice explanation out. It has pictures.

* As a little aside, it's worth noting that binding of a primer to its cognate site isn't necessarily the key determining factor of its ability to prime; certain motifs may be preferentially favoured by polymerases. However, this is hard to measure; the only hints I'm aware of are a tantalising blog post and a patent application from prolific professor Jian Han (from the HudsonAlpha Institute).