Thursday, 8 August 2013

Re-assigning incorrect ladder peaks on the bioanalyzer


Today I ran a high sensitivity DNA bioanalyzer chip to quality check some amplicon libraries I was setting up for the MiSeq. In so doing, I noticed that my peaks were coming out at the wrong sizes, for a simple reason; the 2100 Expert software hadn't correctly detected one of the ladder peaks (the second, 50bp band). 
Note second peak from left has a peak concentration of zero!

It hasn't been recorded as the 50bp ladder band, instead mistakenly recorded as 39bp

I had a look through the user guide and the troubleshooting guide (see page 41) to try to re-assign the ladder peaks as you can set the upper and lower marker using manual integration, to no avail. A quick google and check of SeqAnswers didn't help either.

In case anyone's struggling with a similar problem/lack of immediately searchable answer, here's how I solved it.

The problem was really that the software had called the two biggest peaks at a location that really was just the biggest; presumably it looks for the upper and lower markers first, then assigns the remaining peaks from largest to smallest.

This is easily solved - I just right clicked the peak (on the Peak Table tab when looking at the electropherogram) and exclude it. After that the program re-calculates the proper sizes, assigning the 50bp peak correctly and outputting the correct sizes for my samples. Job done!
The slight shoulder off the upper marker mistakenly got detected as the next peak down

Much better!

Now the band is correctly labelled
This is probably an obvious fix, but if (like me) you were looking for the wrong search terms (re-assigning peaks when really I needed to exclude) hopefully this might help you shave a little bit of time off for you!

Wednesday, 17 July 2013

Merging paired-end reads (or not!)


Note: post updated, see below!

This post is a bit all over the place, but bear with me. I've had a very trying day trying to combine paired end reads into longer single end reads. While currently not having the most success myself, I gained a few insights along the way that might help out some other people a little, if only to see how someone did it badly.

Let's set the scene; I recently ran a small sample of four pooled amplicons on the same index on a paired-end, 2x251bp MiSeq run (I luckily managed to squeeze into a friend's run, as I just wanted to test the protocol was working).

These amplicons should be roughly different lengths: two amplicons should be around 430 bp, while the other two will be a bit shorter, around 320 bp, so they should have overlaps around 70 and 180 bp long respectively.

Stitching these overlapped reads would make a lot of my downstream processing easier, and seems like it should be fairly easy, right? There's loads of programs and scripts designed to do just that (summed up nicely in this nice post on The Genome Factory).

If only life were that easy.

I first tried what seemed to be the big three: FLASH, COPEread and PANDASeq. Getting the things up and running proved to be no mean feat - the classic of games of find-the-dependency, excavating old binaries and correcting the typos in the installation commands of course only fuelling my analytical mindset.

I tried COPEread first. It's a bit tricker than it first might seem, as in order to work fully you first need to generate k-mer frequency tables from the files you wish to overlap, using the bundled program kmerfreq on a text file containing the names of the fastq files, like so:

 kmerfreq -k 16 -t 3 -q 33 -p outputfile filenames.lst  

Here -k is the length of the kmers, -t the number of threads to use, and -q 33 indicates that I'm using phred scores starting from ASCII value 33, hence (newer) Illumina values - COPE really seem to like making the old value of 64 as the default. Also note that I had to use kmers of 16, below the default of 17, as 17 or above is seemingly beyond the abilities of the memory of my 8 GB machine.

Running COPEread on its most basic settings looks like this.

 cope -a R1.fq -b R2.fq -o merged.fq -2 leftovers_R1.fq -3 leftovers_R2.fq -s 33 -m 0  

(Note that it seems a bit particular about certain parameters being filled, and might not give a sensible error message)

However, this manages to join less than a percent for my amplicons. Let's try beefing it up to the full mode, using the kmer frequency table made above:

 cope -a R1.fq -b R2.fq -o merged.fq -2 leftovers_R1.fq -3 leftovers_R2.fq -s 33 -m 3 -k 16 -t freqtable.freq.cz -f freqtable.freq.cz.len   

Still we're getting effectively no reads joining. Maybe COPEread isn't the one.

Let's try PANDAseq. Let's please try PANDAseq. Why oh why did this installation take me so long?

I tried so many ways to get this working, that I can now no longer seem to even run it properly, as there are libraries and binaries strewn across directories all over my drive. However, I ran up against the seemingly well known problem that PANDAseq is a fussy eater - https://github.com/neufeld/pandaseq/issues/7, meaning it wouldn't accept my data, even using the -B flag.

No problem, I can just dress my IDs up with fake barcodes comme ça, before running PANDAseq on them:

 sed -i '1~4s/$/\ 1:N:0:TAGACA/' R1.fq  
 sed -i '1~4s/$/\ 2:N:0:TAGACA/' R2.fq  
 pandaseq -f R1.fq -r R2.fq -u unpaired.fq  

Now it seems to recognise the data as valid, but doesn't do anything afterwards. It just finds no pairs, and outputs nothing into the unpaired file. According to the bundled pandaseq-checkid, my IDs are still 'BAD', but as far as I can tell they shouldn't be. However, by this point I'd spent far too long plugging away at just one program when I still had more potentials, so maybe the next wouldn't be so pernickity.

(In fairness to the PANDAseq devs, they do seem to be interacting with the users and fixing the bugs - I'm sure if I chased this I could get some help).

On to FLASH, which seems to be a bit of a favourite over on SEQanswers. Nice and simple default options:

 flash R1.fq R2.fq  

This combined 30% of the reads! Admittedly not really as high as we'd want, but certainly the best so far.

I spent a long time flag-fiddling with FLASH, with no real improvement. Here's my most redundantly parameterised command, stipulating the minimum/maximum overlap sizes, length of reads and amplicons, with standard deviation:

 flash -m 30 -M 220 -r 502 -f 376 -s 56 s1.fq s2.fq   


Still wobbling around the 30% mark. Getting pretty frustrated by now, so I asked the good people over at SEQanswers, and received the great suggestion to demultiplex my sample by amplicons.

Knocked up a quick python script to scroll through the R2 fastq (which contains the primers that differentiate the amplicons) and bin reads into different files based on which sequence it finds (in a post to follow!). Once I've split the R2 into four, I can harvest the ID line of each with sed, remove the @ and then use fastqselect.tcl to pull the appropriate matched reads out of R1, e.g.:

 sed '2~4d;3~4d;4~4d' R2.fq > idR2  
 ...  
 for i in id*; do echo $i; sed -i 's/@//g' $i; done  

Now I get something really interesting. The two amplicons in my pool that were longer (~430 bp, ~70 bp overlaps) combine great, over 90% assembling. The shorter amplicons (~320 bp, ~180 bp overlaps) however resolutely refuse to combine at all, even though if I pick a few mate reads and BLAST them against each other I can see ~200 bp overlaps with 99-100% identity.

I don't know why I couldn't get COPEread to work, I don't know what ID format PANDAseq was looking for, and I don't know why FLASH wont combine rather extreme long-ish reads where the overlap makes up a rather large percentage of each read.

Sadly my time is tight; I don't anticipate this to be a regular feature of my pipeline (as my usual amplicons are too long to overlap) and I only needed a couple of joined files to test the analysis I wanted to try out, so I'm yet to pursue this. Truthfully I probably spent more time to do this than I should have, because I'm stubborn and it seems like something that should be simple.

Because it should be simple, right?

If you can see something obvious I'm doing wrong, please let me know in the comments.

Post-script:

I did also try a few other methods.

Pear said that the number of reads in each file did not match (when clearly they did).

mergePairs.py looked like it was working, but inspection of the merged files it was producing shows a number of clearly junk reads stuck together. Also obviously slower than the C based approaches above.

SeqPrep never installed, but as it trims and assembles in one I'm not sure if it's appropriate for me (as I have random sequence at both ends of the amplicons).

Bearing in mind that all of these presumably work for at least some (if not most) people, I'm aware that the fault probably lies with my data, which I expect is longer and more overlapped than most of these programs were designed for. That said, if I can align some of these myself just by eye, I don't see why the leading software in the field can't.




UPDATE (19th July 2013)

After throwing this post up on Twitter, I got a bit of feedback from my local sequencing guru Tony Brooks and doyen of bioinformatics Nick Loman, which prompted me to have a second crack at my difficult overlaps.


First off, a correction: earlier in this post I mentioned that I'd tried FLASH using both the -M and -rfs flags all at once. I had tried them individually (either -M or -r -f -s), with equal success.


The main thing to come out of the discussion was the obvious, which is that I don't really need to combine the reads of the shorter length amplicons, as the read length already spans the entire insert (fig. 1).


This post was sorely in need of a schematic
Figure 1: Schematic of short versus long amplicons

However, I'd still like the possibility to combine these reads. For one thing, it'd be nice to have just one pipeline which I can roll out across all my amplicons, without my having to demultiplex them as I did here. More importantly, the quality of the reads starts to dip the further through the read we get, so combinining the reads would allow me to use the higher confidence base calls from the other direction.


The major advice given was that I should trim the adaptors off my reads, which should help with the combinining. Out comes Trimmomatic, an excellent bit of kit which seems to be a bit of a trimming-favourite.


Frustratingly, the success I get from Trimmomatic mirrors that which I see with FLASH:
java -jar ~/PATH_TO_FOLDER/trimmomatic-0.30.jar PE -phred33 -trimlog trimlog R1.fq R2.fq trimpairR1.fq trimUNpairR1.fq trimpairR2.fq trimUNpairR2.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:80


When applied to all four of my demultiplexed samples, almost all of the reads in R2 fail to pass muster for two of the amplicons - any guesses which? The shorter amplicons not only refuse to merge, they refuse to trim to assist merging! I initially put this down to a slightly worse quality profile in the R2 reads, but the difference between the quality of R1 and R2 is about the same among both longer and shorter amplicons.


It was around this time that Eric Biggers, one of the FLASH software developers, replied to an email I'd sent him earlier detailing my problem, with a couple example reads which I thought should align. Here's his reply:

"The problem is that FLASH is looking for overlaps where read 2 ends at some position in read 1, whereas in the two example pairs you gave, read 2 ends before read 1 even begins.
So your data looks something like:
read_1      --------------------->
read_2
<--------------------
which FLASH doesn't expect, since those are, in effect, outward-facing pairs rather than the inward-facing pairs it expects. In both cases the sequence at the end of read 2 (at the head of the arrow) is similar, so I assume it's an adapter sequence. In short, FLASH wasn't originally designed to handle cases like this, but after seeing this I'm considering modifying it to look for such overlaps, especially since I recall someone else having a similar problem."
There you have it; it makes so much sense, I feel silly for not having thought of it sooner (hey, I'd had a long day) - it can't find the overlap, because the sequence at the end of the read doesn't appear in the other read!
Combined with the trimming advice I'd received, there was then a simple way to rectify this problem; just reduce the read length to make my data into inward-facing pairs. The FASTX-Toolkit is great for this kind of thing (just remember to use the -Q33 flag to make sure it's looking for the right phred scale - I still can't fathom why that doesn't feature in their manual!).
 fastx_trimmer -Q33 -f 1 -l 175 -i R1.fq -o shortR1.fq   
 fastx_trimmer -Q33 -f 1 -l 175 -i R2.fq -o shortR2.fq   
 flash shortR1.fq shortR2.fq   
Eureka! I now get as good merging of pairs for the shorter amplicons as I saw with the longer! So, were I to roll this out to my un-demultiplexed amplicon pool data I imagine I would run FLASH on the original files, then trim any reads that don't merge before re-running FLASH on those.

Big thanks to Eric Biggers, Tony Brooks and Nick Loman and all the people on SEQanswers for their help.

Monday, 15 July 2013

Delays in communication

I know I know, I have been entirely slovenly with my upkeep of this blog. At first it was justified - I undertook my PhD upgrade at the end of May, before a wealth of conferences, colloquia and symposia in June.

I've also taken on a great role as a volunteer writer for BioNews, a London based charity involved in disseminating accurate and balanced genetics/stem cell/IVF related news (my stories here), which has been soaking up a lot of the writing brain cells that ordinarily might be aimed here.

However, excuses are excuses, and I promise to try harder. My BioNews commitment should be slightly less regular, and for a change I don't have any upcoming posters or presentations to prepare, so I can get back on the blog-horse! Little and often is my goal - although I do have a couple of long posts in the works with which I hope to redeem myself - watch this space!

Saturday, 18 May 2013

Up-goer five PhD description

Someone recently pointed out the Up-Goer Five text editor to me, which only lets you type using the 1,000 top most common words (inspired by this xkcd). The challenge is then to describe your PhD project using just these words. Here's my try:

I study a type of blood cell that helps keep us safe and well, called 'T_cells'. These cells search the body for signs of problems or things that shouldn't be there, and fix them if they find any. They can look for many different kinds of problems; taken as a group, they can look for far more problem signs than almost any other cell type. They feel the face of other cells as they move around the body, each one looking for different signs that something isn't right. I'm working on checking to see what 'T_cells' people have, to see if we can learn why some people get sick more easily than others.

Monday, 29 April 2013

Happy Day of Immunology!

I know I haven't posted in a while (mostly due to my impending PhD upgrade, which I have tomorrow!), but I had to note this most august of internal days, the International* Day of Immunology!

I was hoping that this day meant that all of my experiments would go perfectly to plan, but going on the plates I left incubating on the bench over the weekend, I was sadly mistaken.

In other news, as a result of doing my upgrade, and having to do a poster imminently, I think I might have a play with putting some stuff on figshare, so I'll be sure to link to it here.

* Oddly apart from Norway, who went their own way and celebrate theirs on the 26th.

Sunday, 24 February 2013

T-Cell Receptors, Part Five: A Receptor For All Cells?


This last installment in this week of blogging covers the more recent, and potentially more controversial stories I've found of wandering TCRs.

The claims made in the following papers are big, and could have far reaching connotations if they bear out. All of these papers report whole, functional TCRs, being expressed and used by white blood cells. Just the wrong ones.

It all starts in 2006, with a sentence. A tantalising, if near-painfully vague sentence:

"A series of control experiments prompted us to test the hypothesis that human neutrophils express components of the TCR machinery."

A series of control experiments ey, that old chestnut.Wait, what, neutrophils?

These cells are one of the major components of the innate immune system, and account for the majority of circulating white blood cells at any one given time, despite their rapid turnover.

Their job is to turn up early to places where the body is in trouble (such as inflamed or infected areas) and do some damage control, which largely consists of eating any pathogens they can find, and making it harder for any they can't find to survive or spread.

Previously they'd been thought to act completely via innate immune receptors. The presence of one of the hallmark receptors of adaptive immunity on them is a little surprising to say the least.

Using antibodies directed against the alpha and beta constant regions, they report that around five to eight percent of freshly isolated neutrophils express αβ-TCR, seemingly in a manner comparable to typical TCR expression.

They seem to express a varied repertoire, when looking for different Vα and Vβ transcription by RT-PCR, that are revealed to be rearranged as per normal. There's CD3 components, CD28, all seemingly upregulated by TCR agonists, as well as a number of proteins required for TCR signalling.

So what are these TCRs supposed to be doing in these neutrophils? Upon TCR stimulation (with anti-CD3 and anti-CD28 antibodies, which is thought to represent a fairly physiological level of stimulation), and watched what the neutrophils did next.

What they did next was live long and prosper; it seemed stimulating the TCRs inhibited neutrophil apoptosis and increased secretion of IL-8, the chemokine responsible for recruiting more neutrophils to the danger zone.

So, the theory goes that some neutrophils express TCR, which presumably helps them recognise either specific pathogens, or a broader swathe of pathogens, which can then find bugs quicker, helping to recruit more neutrophils to the threat.

The same group (Wolfgang's Kaminski's group from Heidelberg) has had a couple of follow up papers on this, during which time it gets re-dubbed the TCRLn, as it's TCR-'like', and in neutrophils, which somewhat solves the quandry of the 'TC' in TCR.

One of these papers is really just a long observation that the repertoire of different TCRs expressed in neutrophils starts off broad, and shrinks with increasing age.

The next is just a case study of a patient with a nasty autoimmune condition (where their red blood cells get targetted and destroyed by their own antibodies), whose number of TCR positive neutrophils had jumped from 5% to 80%.

There's also one final neutrophil paper from a dental group, who noted that oral neutrophils also seem to have a higher expression of TCR than do their circulating cousins, which ties in with their previous work showing that oral neutrophils have a different phenotype.

It takes three years after the start of the neutrophil story before the next cell joins the party. Eosinophils, a kind of granulocyte (mostly responsible for killing parasitic worms) enter the fray, wielding not αβ, but γδ-TCRs.

It's another frustrating start; this time they were on the hunt for γδ TCRs in eosinophils because of the "surprising similarities" between them. Either this is fantastically lucky fishing, or there's a few experiments they're not telling us about.

Either way, they find the γδ-TCR on the eosinophils by flow, along with CD3. Interestingly, they don't find much γδ in lymphocytes (1.4%, which is a few percent shy of usual), nor do they find αβ expressed on neutrophils, putting them at odds with the previously discussed papers.

It seems that while possessing all the bits and bobs needed for TCR recognition, these eosinophils don't produce nearly as much TCR message as T-cells, nor as diverse a range of TCRs. However, activation with TCR agonists caused eosinophils to do exactly what they're supposed to do when activated normally, with degranulation and the release of cytotoxic proteins and reactive oxygen species (ROS).

There's even a couple of figures showing exploring possible functions for the γδ receptor. The presence of γδ-blocking antibodies inhibits the ability of eosinophils to produce ROS in response to mycobacterium, or to induce apoptosis in a colorectal cancer cell line.

The final additions to the TCR club is reported by the same Heidelberg group, giving the impression they probably went on a TCR-testing rampage.

Macrophages are key cells responsible for maintaining immunity in the tissues, by phagocytosing and killing pathogens and presenting their antigens to T-cells. They differentiate from monocytes, which circulate in the body looking for signs for infection or inflammation.

These are classical examples of innate immune cells that bridge the gap to adaptive immunity, through their antigen presentation. These recent papers suggest that they might go one further, with some subpopulations expressing either the αβ or γδ TCRs.

The story unfolds much like the others. There's a small percentage of monocytes and macrophages expressing a limited repertoire of 'TCRLm' (5% for αβ, 3% for γδ), along with other TCR signalling components.

The obligatory search for function touches on some big topics.

For the TCRLmαβ, they investigate the intereaction with mycobacteria, which it seems upregulate the expression of the TCR. They go further, and stain lung sections from tuberculous patients, showing that the contacting edge of cells around the granulomas are enriched with TCR expressing macrophages.

TCRLmγδ on the other hand was investigated during (murine) bacterial meningitis, during which the presence of γδ-macrophages was enriched in the CSF. They also found γδ-MΦs in atherosclerotic plaques, indicating a pretty broad range of possible interactions.

There's a lot to muddy the water in these papers. There's a couple of not-completely-convincing figures, and the obvious matter of the contradiction between the papers. They do all however go a long way to refute the possibility they're just seeing T-cell contamination (either by immunohistochemistry or FISH).

These papers seem to ask more questions than they answer. How does this happen. Do they undergo any selection, and if not, how do they avoid auto-reactivity? If so many other cells can make use of the TCR, why do we even need T-cells in the first place?

If true, these are hugely interesting findings. Here we have terminally differentiated myeloid cells, seemingly expressing one of the classical lymphoid, adaptive immune receptors. Having presented some of these at journal clubs I've seen first hand a bit of resistance to accepting these possibilities outright.

Personally, my view is that biology is a massively confusing thing, and the closer we look at it the more wierd, unexpected stuff we're going to find. Once biology has evolved a system such as the TCR, there's no reason why other cells within the same organism shouldn't make use of it. As technology increases the throughput and sensitivity at  which we can operate, more and more of our accepted models are going to gather inconvenient aberrations like these.

There we have it; the case of the wandering TCR. Whether or not you believe these stories (or more importantly, whether or not you think they have any biological signnificance), I hope you found them as interesting as I did.

While this week of blogging ballooned into a much bigger project than intended, I've enjoyed doing it. Over the coming months I hope to do some other key aspects of TCR biology, just maybe not all at once next time.