Showing posts with label fastq. Show all posts

Wednesday, 10 August 2016

Extracting sample names from Illumina FASTQ filenames

Another post in the vein of "minor analysis task and my solution": how to extract the sample name or identifier (i.e. what you called a given file in the samplesheet) from a list of Illumina FASTQ file names.

So you might have something that looks like this (which I've cobbled together from a couple of runs from different machines that I pulled off Basespace):

322_S18_R2_001.fastq.gz
1c8_P6_act_S11_R2_001.fastq.gz
5c4_P8_res_S1_R1_001.fastq.gz

In this exercise, I wanted the first part of the IDs, which are what we labelled the actual samples with. In an ideal world this should be very easy: we could just use sed or cut to delete everything after the first instance of a character used in the text that Illumina appends to the sample name. However different people (or the same people at different times) use different naming conventions, which can often include underscores, the character that separates the string fields that get added.

Therefore what we need to do is not delete from say the first pattern, but from the nth pattern from the end, which turned out to be a slightly trickier problem (at least to one such as myself that doesn't find sed to be the most intuitive program in the world). Here's what I came up with:

sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt

What this does is find the third underscore to last (just before S[XX]_R[X].fastq.gz string) and replace it with a pipe character ('|'), which should never appear in a filename, before finding this pipe character and deleting everything after it. This produces the following output:

322
1c8_P6_act
5c4_P8_res

This feels a bit hacky to me, as really I feel like I should be able to do this in one substitution, but the first part of the command actually selects the latter half (so if anyone knows how to tidy this up I'd love to hear it!).

It's also worth pointing out that the output of different machines or demultiplexing settings will differ in the number of underscores, and so the command will need to change slightly. For instance, the above example was run on NextSeq data demultiplexed using bcl2fastq --no-lane-splitting. For an example of a file containing lane information ('L001' etc), here's the equivalent on some MiSeq filenames I pulled up.

# This 
324bAB_S8_L001_R1_001.fastq.gz 
076-PBMCs-alpha_S5_L001_R1_001.fastq.gz 

# After this
sed 's/\(.*\)_\(.*_\)\(.*\)_\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt   

# Becomes this 
324bAB
076-PBMCs-alpha

The reason I was actually doing this was because I had a number of different NextSeq runs, some of which contained samples sequenced from the same original libraries (to increase the depth), so I needed to get the names of all samples sequenced in each run. I had each run in separate directories, and then ran this in the folder above:

for i in */
do echo $i
ls $i | grep R1 | grep fastq.gz | sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\
2/;s/|.*//' > ${i%?}.fl
done

This goes through all the directories, finds the R1 FASTQ files (as there are other files in there, and this gives me only one entry per sample) and strips their sample names out as above, and writes these to a textfile with the suffix '.fl'.

Wednesday, 15 January 2014

Installing CASAVA/bcl2fastq on Ubuntu

I've been playing around with working some of my own demultiplexing scripts into my current analysis pipeline, so I thought I'd best get to grips with CASAVA, or bcl2fastq, so I can de-demultiplex (multiplex?) my MiSeq data.

The trouble is, CASAVA is not supported for my OS, Ubuntu (for reference, I'm running 13.04). Still, I thought I'd give it a try, can't take too long right? Wrong. Most painful installation ever.

I won't bore you with all the details, but here are the major stops and errors I got along the way, to help people find their way if they find themselves similarly stumped.

I started off trying to install CASAVA v1.8.2, the most up to date version I could find (before I realised that CASAVA has since turned into bcl2fastq).

I tried to build from source as per the instructions, but was unable to make:

make: *** No targets specified and no makefile found. Stop.

Checking out the log, revealed that boost was failing; it couldn't find the make file as the configure hadn't made it.

So, then I tried installing boost via apt, which let configure run fine, but everything ran aground during the make due to incompatibility issues. I should have expected this; the version on boost available through apt was several versions newer than that which comes bundled with CASAVA (1_44) (I know right, the apt version being too new? I didn't see it coming either).

Removing the version of boost I just installed seemed to now allow the bundled version to take over with the make, but now it encountered another issue:

/usr/local/CASAVA_v1.8.2/src/c++/lib/applications/AlignContig.cpp:35:32: fatal error: boost/filesystem.hpp: No such file or directory
compilation terminated.
make[2]: *** [c++/lib/applications/CMakeFiles/casava_applications.dir/AlignContig.cpp.o] Error 1
make[1]: *** [c++/lib/applications/CMakeFiles/casava_applications.dir/all] Error 2
make: *** [all] Error 2

So far so frustrated. Plan B, try and install the rpm version of the more up-to-date bcl2fastq. Now as an Ubuntu user I have no experience with rpm, but let's give it a go.

Rpm couldn't find any dependencies. I mean, any dependencies. At all. Including this:

/bin/sh is needed by bcl2fastq-1.8.4-1.x86_64

I'm not sure about a lot of the packages installed, but I know I've got sh.

OK, let's try something else; googling around the subject suggests that yum is the way to go. I tried this, it still couldn't find any of the required dependencies.

So I tried one last thing, and installed alien, and used that to install the bcl2fastq rpm...

alien -i bcl2fastq-1.8.4-Linux-x86_64.rpm

...and it worked! Huzzah!

Although I really shouldn't have been surprised. The rpm installation even told me I should:

rpm: RPM should not be used directly install RPM packages, use Alien instead!

TL,DR: Trying to install CASAVA/bcl2fastq on Ubuntu? Try using alien to install the rpm!

Update for Ubuntu 14.04 (Trusty Tahr):

After recently updating to the newest version of Ubuntu, I've tried to follow my own advice regarding the installation of bcl2fastq, only to find that it no longer works (see below for horrendous stream of errors)!

Luckily the problem - that the newer version of Perl used in Trusty isn't compatible with bcl2fastq - had already been identified and solved by the good people over at SeqAnswers (thanks to Tony Brooks and Hiro Mishima).

"my" variable $value masks earlier declaration in same statement at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 760.
syntax error at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 747, near "$variable qw(ELAND_FASTQ_FILES_PER_PROCESS)"
Global symbol "$variable" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 749.
syntax error at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 751, near "$directory qw(ELAND_GENOME)"
Global symbol "$self" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$directory" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$project" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$sample" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$lane" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$barcode" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
Global symbol "$reference" requires explicit package name at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 753.
syntax error at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 761, near "}"
/usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm has too many errors.
Compilation failed in require at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment.pm line 61.
BEGIN failed--compilation aborted at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment.pm line 61.
Compilation failed in require at /usr/local/bin/configureBclToFastq.pl line 250.
BEGIN failed--compilation aborted at /usr/local/bin/configureBclToFastq.pl line 250.

Thursday, 24 October 2013

Biopython SeqIO tips

The Biopython libraries are of tremendous use in analysing biological data. I use them in the vast majority of my bespoke fastq analysis, as I'm sure many do.

However, there's a couple of tasks I regularly find myself wanting to do, but that I could not easily find solutions for. In case anyone finds themselves in similar need, here are the solutions I found, maybe save you a bit of time.

Output quality scores

This is the main one; it's always annoyed me that there doesn't seem to be an easy way to output the quality score in SeqIO. Sure, I know with the letter_annotations option I can output the actual score in numbers, but sometimes I want to output the actual ASCII characters (such as if I want to take a subsection of a fastq record, both of nucleotide and quality strings).

Here's how I get around it; turn the whole record into a string, split that on its new lines, and just take the fourth:

 str(record.format("fastq")).split("\n")[3]

Generate new fastq records in situ

Sometimes it's necessary to generate complete fastq records on the fly, as opposed to reading them in from an existing file.

I've found a couple of ways of doing this. The first comes buried in amongst some unrelated Biopython features, and looks like this:

 from Bio.Alphabet import generic_dna  
 from Bio.Seq import Seq  
 from Bio.SeqRecord import SeqRecord  
 new_record = SeqRecord(Seq("AAAAAA", generic_dna), id="New Record", description="")  
 new_record.letter_annotations["phred_quality"] = [40,40,40,40,40,40]

The second way came from an answer on Biostars, and works like this:

 from Bio import SeqIO  
 from StringIO import StringIO  
 fastq_string = "@%s\n%s\n+\n%s\n" % ("New Record", "AAAAAA", "IIIIII")  
 new_record = SeqIO.read(StringIO(fastq_string), "fastq-illumina")

They both work, with a couple of differences.

According to a quick test, the latter is faster, and obviously requires loading fewer modules.

Which to use might also depend on what format you have your qualities in; if you only have integers, then the first might be more tempting, whereas if you're making your new fastq from pieces of other existing records then the second is probably the way to go.

Tuesday, 10 September 2013

Python fastq analysis for beginners

It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I though I'd continue my vague hopscotch bioinformatics for beginners series, Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storage next generation DNA sequencing data. Each sequence gets four lines in the file: one for sequence identifier, nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be equal length).

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practise, but often when I want to code something I can't immediately think of a solution for, I just get combing tutorials, manuals and forums finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq", "rU")  
      
for record in SeqIO.parse(fastqfile,"fastq"):  
   
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE  
   
  countseqs += 1  
   
print "Reading from", fastqfile  
print countseqs, "records analysed."

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input   
import sys   # module needed for command line argument to get fastq name   
     
if (len(sys.argv) &gt; 1):   
  fastqfile = open(sys.argv[1], "rU")   
           
elif (len(sys.argv) == 1):   
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"   
  sys.exit()   
          
for record in SeqIO.parse(fastqfile,"fastq"):   
     
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE   
   
  countseqs += 1   
     
print "Reading from", fastqfile   
print countseqs, "records analysed."

NB - this script is easily converted into a fasta file, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
   
###################  
   
if (len(sys.argv) > 1):  
  fastqfile1 = open(sys.argv[1], "rU")  
   
   
elif (len(sys.argv) == 1):  
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"  
  sys.exit()  
   
######################  
   
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ1_outfile.close()  
   
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
SEQ2_outfile.close()  
   
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
   
####################  
      
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  
  countseqs += 1 # counts number of sequences measured  
   
  seq1 = regex.search('(ATATATATAT){e<=1}', str(record.seq))  
   
  if seq1:  
   
    countSEQ_1 +=1  
   
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    output_handle.close()  
   
  else:  
                  
    seq2 = regex.search('(CGCGCGCGCG){e<=1}', str(record.seq))  
   
    if seq2:   
   
      countSEQ_2 +=1  
   
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
      output_handle.close()  
   
print "Reading from", fastqfile1  
print countseqs, "records analysed."  
print countSEQ_1, "Sequence 1s detected."  
print countSEQ_2, "Sequence 2s sequences detected."

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.

----

Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.

Saturday, 10 August 2013

Decombining the Recombined: High-Throughput TCR Repertoire Analysis

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version can be found on the Innate2Adaptive lab's Github repo.

The background

The history

The problem

The solution?

The (odds and) end(s)

The papers

The background

As I've mentioned in the past, my PhD project is involved with using high-throughput sequencing (HTS) to investigate T-cell receptor (TCR) repertoires.

Next-generation sequencing (NGS) technologies allow us to investigate the workings and dynamics of the adaptive immune system with greater breadth and resolution than ever before. It’s possible to extract or amplify the coding sequence for thousands to millions of different clones, and throw them all into one run of a sequencer. The trick is extracting the TCR information from the torrent of data that pours out.

Our lab’s paper for the high-throughput analysis of such repertoire data has been out for a while now, so here’s my attempt to put in to a little wider context than a paper allows, while hopefully translating it for the less computationally inclined.

In order to understand best how the brains in our lab tackled this problem, it's probably worth looking at how others before us have.

The history

There's a beaten path to follow. The germline V, D and J gene sequences are easily downloaded from IMGT GENE-DB; then you take your pick of a short read aligner to match up the V and J sequences used (we’ll come to the Ds later).

Most of the popular alignment algorithms get covered by some group or other: SWA (Ndifon et al., 2012; Wang et al., 2010), BLAT (Klarenbeek et al., 2010) and BLAST (Mamedov et al., 2011) all feature. IMGT’s HighV-QUEST software uses DNAPLOT, a bespoke aligner written for their previous low-throughput version (Alamyar et al, 2012; Giudicelli et al, 2004; Lefranc et al, 1999).

Sadly, some of the big hitters in the field don’t see fit to mention what they use to look for TCR genes (or I’ve just missed it). Robert Holt’s group produced the first NGS TCR sequencing paper I’m aware of (Freeman, Warren, Webb, Nelson, & Holt, 2009), but don’t mention how they assign their genes (admittedly they’re more focused on explaining how iSSAKE, their short-read assembler designed for TCR data, works).

The most prolific author in the TCR ‘RepSeq’ field is Harlen Robins, who has contributed to a wealth of wonderful repertoire papers (Emerson et al., 2013; Robins et al., 2009, 2010, 2012; Sherwood et al., 2013; Srivastava & Robins, 2012; Wu et al., 2012), yet all remain equally vague on TCR assignation methods (probably related to the fact that he and several other early colleagues set up a company, Adaptive Biotech, that offers TCR repertoire sequencing and analysis).

So we see a strong tendency towards alignment (and a disappointing prevalence of ‘in house scripts’). I can understand why: you've sequenced your TCRs, you've got folder of fastqs burning a hole in your hard drive, you’re itching to get analysing. What else does your average common house or garden biologist do when confronted with sequences, but align?

However, this doesn't really exploit the nature of the problem.

The problem

When trying to fill out a genome, alignment and assembly make sense; you take a read, and you either see where it matches to a reference, or see where it overlaps with the other reads to make a contig.

For TCR analysis however, we're not trying to map reads to or make a genome; arguably we’re dealing with some of the few sequences that might not be covered under the regular remit of 'the genome'. Nor are we trying to identify known amplicons from a pool, where they should all the same, give or take the odd SNP.

TCR amplicons instead should have one of a limited number of sequences at each end (corresponding to the V and J/C gene present, depending on your TCR amplification/capture strategy), with a potentially (and indeed probably) completely novel sequence in between.

In this scenario, pairwise alignment isn’t necessarily the best option. Why waste time trying to match every bit of each read to a reference (in this case, a list of germ-line V(D)J gene segments), when only the bits you’re least interested in – the germline sequences – stand a hope of aligning anywhere?

The solution?

Enter our approach: Decombinator (Thomas, Heather, Ndifon, Shawe-Taylor, & Chain, 2013).

Decombinator rapidly scans through sequence files looking for rearranged TCRs, classifying any it finds by these criteria; which V and J genes were used, how many nucleotides have been deleted from each, and the string of nucleotides between the end of the V and the start of the J. This simple five-part index condenses all of the information contained in any given detected rearrangement into a compact format, convenient for downstream processing.

All TCR analysis probably does something similar; the clever thing about Decombinator is how it gets there. At its core lies a finite state automaton that passes through each read looking for a panel short sequence ‘tags’, each one uniquely identifying the presence of a germline V or J gene. It if finds tags for both a V and a J it can populate the five fields of the identifier, thus calling the re-arrangement.

Example of the comma-delimited output from Decombinator. From left to right: (optional sixth-field) clonal frequency; V gene used; J gene used; number of V deletions; number of J deletions, and insert string

The algorithm used was developed by Aho and Corasick in the ‘70s, for bibliographic text searching, and the computer science people tell me that it’s really never been topped – when it comes to searching for multiple strings at once, the Aho-Corasick (AC) algorithm is the one (Aho & Corasick, 1975).

Its strength lies in its speed – it’s simply the most effective way to search one target string for multiple substrings. By using the AC algorithm Decombinator runs orders of magnitude faster than alignment based techniques. It does this by generating a special trie of the substrings, which it uses to search the target string exceedingly efficiently.

Essentially, the algorithm uses the trie to look for every substring or tag simultaneously, in just one pass through the sequence to be searched. It passes through the string, using the character at the current position to decide how to navigate; by knowing where it is on the trie at any one given point, it’s able to use the longest matched suffix to find the longest available matched prefix.

Figshare seems to have scotched up the resolution somewhat, but you get the idea.

Decombinator uses a slight modification of the AC approach, in order to cope with mismatches between the tag sequences and the germline gene sequenced (perhaps from SNPs, PCR/sequencing error, or use of non-prototypic alleles). If no complete tags are found, the code breaks each of the tags into two halves, making new half-tags which are then used to make new tries and search the sequence.

If a half-tag is found, Decombinator then compares the sequence in the read to all of the whole-tags that contain that half-tag*; if there’s only a one nucleotide mismatch (a Hamming distance of one) then that germline region is assigned to the read. In simulated data, we find this to work pretty well, correctly assigning ~99% of artificial reads with a 0.1% error (i.e. one random mismatch every 1,000 nucleotides, on average), dropping to ~88% for 1% error **.

It’s simple, fast, effective and open source; if you do have high-throughput human or mouse alpha-beta or gamma-delta TCR data set to analyse, it’s probably worth giving it a try. ~~The only real available other option is HighV-QUEST, which (due to submission and speed constraints) might not be too appealing an option if you really have a serious amount of data~~.

(CORRECTION – in the course of writing this entry (which typically has gone on about five times longer than I intended, in both time taken and words written), some new rather exciting looking software has just been published (Bolotin et al., 2013). MiTCR makes a number of bold claims which if true, would make it a very tempting bit of software. However, given that I haven’t been able to get it working, I think I’ll save any discussion of this for a later blog entry.)

The (odds and) end(s)

D segments

If you’ve made it this far through the post, chances are good you’re either thinking about doing some TCR sequencing or have done already, in which case you’re probably wondering, ‘but what about the Ds – when do they get assigned’?

In the beta locus (of both mice and men), there are but two D regions, which are very short, and very similar. As they can get exonucleolytically nibbled from both ends, the vast majority of the time you simply can’t tell which has been used (one early paper managed in around a quarter of reads (Freeman et al., 2009)). Maybe you could do some kind of probabilistic inference for the rest, based on the fact that there is a correlation between which Ds pair with which Js, likely due to the chromatin conformation at this relative small part of the locus (Murugan, Mora, Walczak, & Callan, 2012; Ndifon et al., 2012), but that’s a lot of work for very little reward.

Hence Decombinator does not assign TRBDs; they just get included in the fifth, string based component of the five-part identifier (which explains the longer inserts you see for beta compared to alpha). If you want to go TRBD mining you’re very welcome, just take the relevant column of the output and get aligning. However, for our purposes (and I suspect many others’), knowing the exact TRBD isn’t that important, where indeed even possible at all.

Errors

There’s also the question of how to deal with errors, which can accrue during amplification or sequencing of samples. While Decombinator does mitigate error somewhat through use of the half-tags and omission of reads containing ambiguous N calls, it doesn’t have any other specific error-filtration. As with any pipeline, garbage in begets garbage out; there’s plenty of software to trim or filter HTS data, so we don’t really need to reinvent the wheel and put some in here.

Similarly, Decombinator doesn’t currently offer sequence clustering, whereby very similar sequences get amalgamated into one, as some published pipelines do. Personally, I have reservations about applying standard clustering techniques to variable immunoreceptor sequence data.

Sequences can legitimately differ by only one nucleotide, and production of very similar clonotypes is inbuilt into the recombination machinery (Greenaway et al., 2013; Venturi et al., 2011); it is very easy to imagine bona fide but low frequency clones being absorbed into their more prevalent cousins, which could obscure some genuine biology. The counter argument is of course that by not clustering, one allows through a greater proportion of errors, thus artificially inflating diversity. Again, if desired other tools for sequence clustering exist.

Disclaimer

My contribution to Decombinator was relatively minor - the real brainwork was done before I'd even joined the lab, by my labmate, mathematician Niclas Thomas, our shared immunologist supervisor Benny Chain, and Nic's mathematical supervisor, John Shawe-Taylor. They came up with the idea and implemented the first few versions. I came in later, designing tags for the human alpha chain, testing and debugging, and bringing another biologist’s view to the table for feature development***. The other author, Wilfred Ndifon, at the time was a postdoc in a group with close collaborative ties, who I believe gave additional advice on development, and provided pair-wise alignment scripts against which to test Decombinator.

* Due to the short length of the half-tags and the high levels of homology between germline regions, not all half tags are unique

** In our Illumina data, by comparing the V sequence upstream of the rearrangement – which should be invariant – to the reference, we typically get error rates below 0.5%, some of which could be explained by allelic variation or SNPs relative to the reference

*** At first we had a system that Nic would buy a drink for every bug found. For a long time I was the only person really using Decombinator (and probably remain the person who’s used it most), often tweaking it for whatever I happened to need to make it do that day, putting me in prime place to be Bug-Finder Extraordinaire. I think I let him off from the offer after the first dozen or so bugs found.

The papers

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855

Alamyar, A., Giudicelli, V., Li, S., Duroux, P., & Lefranc, M. (2012). IMGT/HIGHV-QUEST: THE IMGT® WEB PORTAL FOR IMMUNOGLOBULIN (IG) OR ANTIBODY AND T CELL RECEPTOR (TR) ANALYSIS FROM NGS HIGH THROUGHPUT AND DEEP SEQUENCING. Immunome Research, 8(1), 2. doi:10.1038/nmat3328

Bolotin, D. a, Shugay, M., Mamedov, I. Z., Putintseva, E. V, Turchaninova, M. a, Zvyagin, I. V, Britanova, O. V, et al. (2013). MiTCR: software for T-cell receptor sequencing data analysis. Nature methods, (8). doi:10.1038/nmeth.2555

Emerson, R., Sherwood, A., Desmarais, C., Malhotra, S., Phippard, D., & Robins, H. (2013). Estimating the ratio of CD4+ to CD8+ T cells using high-throughput sequence data. Journal of immunological methods. doi:10.1016/j.jim.2013.02.002

Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H., & Holt, R. a. (2009). Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome research, 19(10), 1817–24. doi:10.1101/gr.092924.109

Giudicelli, V., Chaume, D., & Lefranc, M.-P. (2004). IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V-J and V-D-J rearrangement analysis. Nucleic acids research, 32(Web Server issue), W435–40. doi:10.1093/nar/gkh412

Greenaway, H. Y., Ng, B., Price, D. a, Douek, D. C., Davenport, M. P., & Venturi, V. (2013). NKT and MAIT invariant TCRα sequences can be produced efficiently by VJ gene recombination. Immunobiology, 218(2), 213–24. doi:10.1016/j.imbio.2012.04.003

Klarenbeek, P. L., Tak, P. P., Van Schaik, B. D. C., Zwinderman, A. H., Jakobs, M. E., Zhang, Z., Van Kampen, A. H. C., et al. (2010). Human T-cell memory consists mainly of unexpanded clones. Immunology letters, 133(1), 42–8. doi:10.1016/j.imlet.2010.06.011

Lefranc, M.-P. (1999). IMGT, the international ImMunoGeneTics database. Nucleic acids research, 27(1), 209–212. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=165532&tool=pmcentrez&rendertype=abstract

Mamedov, I. Z., Britanova, O. V, Bolotin, D., Chkalina, A. V, Staroverov, D. B., Zvyagin, I. V, Kotlobay, A. A., et al. (2011). Quantitative tracking of T cell clones after hematopoietic stem cell transplantation. EMBO molecular medicine, 1–8.

Murugan, a., Mora, T., Walczak, a. M., & Callan, C. G. (2012). Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1212755109

Ndifon, W., Gal, H., Shifrut, E., Aharoni, R., Yissachar, N., Waysbort, N., Reich-Zeliger, S., et al. (2012). Chromatin conformation governs T-cell receptor J gene segment usage. Proceedings of the National Academy of Sciences, 1–6. doi:10.1073/pnas.1203916109

Robins, H. S., Campregher, P. V, Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., et al. (2009). Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood, 114(19), 4099–107. doi:10.1182/blood-2009-04-217604

Robins, H. S., Desmarais, C., Matthis, J., Livingston, R., Andriesen, J., Reijonen, H., Carlson, C., et al. (2012). Ultra-sensitive detection of rare T cell clones. Journal of Immunological Methods, 375(1-2), 14–19. doi:10.1016/j.jim.2011.09.001

Robins, H. S., Srivastava, S. K., Campregher, P. V, Turtle, C. J., Andriesen, J., Riddell, S. R., Carlson, C. S., et al. (2010). Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine, 2(47), 47ra64. doi:10.1126/scitranslmed.3001442

Sherwood, A. M., Emerson, R. O., Scherer, D., Habermann, N., Buck, K., Staffa, J., Desmarais, C., et al. (2013). Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer immunology, immunotherapy: CII. doi:10.1007/s00262-013-1446-2

Srivastava, S. K., & Robins, H. S. (2012). Palindromic nucleotide analysis in human T cell receptor rearrangements. PloS one, 7(12), e52250. doi:10.1371/journal.pone.0052250

Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J., & Chain, B. (2013). Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics, 29(5), 542–50. doi:10.1093/bioinformatics/btt004

Venturi, V., Quigley, M. F., Greenaway, H. Y., Ng, P. C., Ende, Z. S., McIntosh, T., Asher, T. E., et al. (2011). A Mechanism for TCR Sharing between T Cell Subsets and Individuals Revealed by Pyrosequencing. Journal of immunology. doi:10.4049/jimmunol.1003898

Wang, C., Sanders, C. M., Yang, Q., Schroeder, H. W., Wang, E., Babrzadeh, F., Gharizadeh, B., et al. (2010). High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proceedings of the National Academy of Sciences of the United States of America, 107(4), 1518–23. doi:10.1073/pnas.0913939107

Wu, D., Sherwood, A., Fromm, J. R., Winter, S. S., Dunsmore, K. P., Loh, M. L., Greisman, H. a., et al. (2012). High-Throughput Sequencing Detects Minimal Residual Disease in Acute T Lymphoblastic Leukemia. Science Translational Medicine, 4(134), 134ra63–134ra63. doi:10.1126/scitranslmed.3003656