
Saturday, 10 November 2018

Making coding T cell receptor sequences from V-J-CDR3

If, like me, you work on T cell receptors, occasionally you're probably going to want to express a particular TCR in cells. However, you're not always going to have the full sequence, at either the nucleotide or the protein level.

Not a problem, you can sort this out. You can look up all the relevant germline sequences from IMGT, trim away all the non-used bits, add in the non-templated stuff, then manually stitch it all together and have a look to see if it still makes what you were expecting. You can do all that... or you can just use the code I wrote.

StiTChR does it all: give it a V gene, a J gene, and a CDR3 amino acid sequence, and it'll look up, trim and stitch together all the relevant TCR nucleotide sequences for you, back-translating the non-templated region using the most frequent codon per residue. It also translates the whole thing, and will run a quick rudimentary alignment against a known partial protein sequence if you have one, for visual confirmation that it's made the right thing.
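For the curious, the core of that back-translation step is tiny. Here's a minimal sketch in Python (illustrative only, not StiTChR's actual code; the codon table is truncated, and the real thing needs an entry per amino acid drawn from a human codon-usage table):

```python
# A minimal sketch of most-frequent-codon back-translation.
# MOST_FREQUENT_CODON is illustrative and truncated, not a real usage table.
MOST_FREQUENT_CODON = {
    'C': 'TGC', 'A': 'GCC', 'S': 'AGC', 'F': 'TTC',  # ...one entry per residue
}

def back_translate(aa_seq):
    """Pick the commonest codon for each residue of a protein sequence."""
    return ''.join(MOST_FREQUENT_CODON[aa] for aa in aa_seq)

print(back_translate('CASS'))  # TGCGCCAGCAGC
```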

You can then take the alpha/beta TCR sequences it generates, bang them into an expression vector (typically split by a T2A sequence or something) and transduce your cells of interest.

I wrote this code to save me a bit of time in future, but hopefully it can do the same for some of you!

Tuesday, 4 September 2018

The problem with Adaptive TCR data

I'm a big proponent of DIY TCR (and BCR) sequencing. It's the best way to be able to vouch that every step in the process has been done correctly; you can QC and query whatever steps you wish; it's typically more customisable to your specific hypotheses and research questions; and it's invariably cheaper. What's more, there are lots of great labs making and publishing such pipelines (including the one I helped develop back in London), so you don't even need to go to the effort of making one yourself.
However, there are a number of situations in which you might instead choose to outsource this task to a commercial supplier. The greater cost and loss of flexibility buy you scalability, reduced hands-on time and third-party guarantees, and spare you the need to build capacity for sequencing and data processing in house, which brings its own savings in time and money.
Without even needing to check I can confidently say that Adaptive Biotech are foremost among the companies offering this as a service. As part of a few different projects I've recently been getting my feet wet analysing some large datasets produced by Adaptive, including both publicly available projects of theirs (accessed via their immunoSEQ portal) and data from samples that we've sent to them.
Generally speaking, I'm pretty happy with both the service and the data we've received. I love how they make a lot of their own data publicly accessible, and the frequency with which they publish cool and important papers. I like how they are making RepSeq available to labs that might otherwise not be able to leverage this powerful technology (at least those that can afford it). In almost every sense, it's a company that I am generally pretty in favour of.
However, in designing their analyses Adaptive have taken one massive liberty, which (while I'm sure was undertaken with the best of intentions) stands to cause any number of problems, frustrations, and potential disasters - both to their customers and the field at large.
What is this heinous crime, this terrible sin they've committed? Could they be harvesting private data, releasing CDR3 sequences with coded messages, pooling all of our adaptive repertoire data in some bizarre arcane ritual? No. Instead they tried to make the TCR gene naming system make a little bit more sense (cue dramatic thunder sound effects).
It's a crime as old as biology, one particularly prevalent in immunology: you don't like the current gene naming system, so what do you do? Start a new one! A better, shinier one, with new features and definitely no downsides - it'll be so good it could even become the new standard!*
I know exactly why they did it, too; when I worked on our own TCR analysis software and results during my PhD, I encountered the same problems. TCR names are bothersome from a computing perspective. They don't sort right, either alphabetically or chromosomally. They don't all contain the same number of characters, so they don't line up nicely on an axis. They're generally just a bit disordered, which can be confusing. They're precisely what a software engineer would never have designed.
Adaptive's solution is however a classic engineering one. Here's a problem, let's fix it. 'TR' is almost 'TCR' but not quite – that's confusing, so let's just chuck a 'C' in there and make it explicit. Some V/J genes have extra hyphenated numbers – so let's give all of them hyphenated numbers. And hey, some gene groups have more than ten members – let's add leading zeros so they all sort nicely alphabetically. We'll take those annoying, seemingly arbitrary special cases, and bring them all into a nice consistent system. Bing bang bosh, problem solved.
This is all very well and good until you realise that this isn't about making something perfect, neat and orderly; we're talking about describing biology here, where complexity, redundancy and just plain messiness are par for the course. Having a bunch of edge cases that don't fit the rule basically is the rule!
Let's look at some examples, maybe starting at the beginning of the beta locus with the V gene that the rest of us knows as TRBV1. If you go looking for this in your Adaptive data (at least if you export it from their website as I did) then you might not find it straight away; instead, it goes by the name TCRBV01-01. Similarly TRBV15 becomes TCRBV15-01, TRBV27 → TCRBV27-01, and so on.
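As an aside, if you ever need to map such names back to the standard programmatically, something along these lines works. This is only a sketch: the function name is mine, and it assumes you maintain a curated set of genes that genuinely lack hyphenated subgroup siblings (the set below is truncated for illustration):

```python
import re

# Genes whose subgroup has a single member, so a trailing '-01' can be
# dropped. Truncated; a real converter needs the full list from IMGT.
LONE_SUBGROUP_GENES = {'TRBV1', 'TRBV15', 'TRBV27'}

def adaptive_to_imgt(name):
    """Convert an Adaptive-style beta gene name to its IMGT equivalent."""
    name = name.replace('TCR', 'TR')               # TCRBV01-01 -> TRBV01-01
    name = re.sub(r'(?<=[VDJ])0(?=\d)', '', name)  # strip the leading zero
    stem, _, suffix = name.partition('-')
    suffix = suffix.lstrip('0')
    if suffix == '1' and stem in LONE_SUBGROUP_GENES:
        return stem                                # TRBV1-01 -> TRBV1
    return f'{stem}-{suffix}' if suffix else stem

print(adaptive_to_imgt('TCRBV01-01'))  # TRBV1
print(adaptive_to_imgt('TCRBV06-05'))  # TRBV6-5
```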
Sure, the names all look prettier now, but this approach is deeply problematic for a bunch of reasons. With respect to these specific examples, hyphenated numbers aren't just applied to genes at random; they denote genes that are part of a subgroup containing more than one member (meaning they share more than 75% nucleotide identity in the germline). You can argue that this is an arbitrary threshold, but it is nevertheless useful; it allows a quick shorthand to roughly infer both evolutionary divergence times and current similarity, within that threshold. Adding hyphenated numbers to all genes washes out one of the few bits of information you can actually glean about a TCR or BCR gene just by looking at its name (along with approximate chromosomal position and potential degree of polymorphism, going off the allele number when present). Which genes fall in subgroups with multiple members also differs between species, which adds another level of usefulness to the current setup; appending '-XX' to all genes as Adaptive does makes it easier to become confused or make mistakes when comparing repertoires or loci of different organisms.
The more important reason however has nothing to do with what incidental utility is lost or gained; the fact of the matter is that these genes have already been named! When it comes to asking what the corresponding gene symbol for a particular V, D or J sequence is, there is a correct answer. It has been agreed upon for years, internationally recognised and codified. People sat around in a committee and decided it.  
Whether you like it or not, HUGO and IMGT between them have got this covered, and we should all be using the agreed upon names. To do otherwise is to invite confusion, ambiguity and inaccuracies, weakening the utility of published reports and shared data. Gene name standardisation is hardly sexy, but it is important.
Admittedly Adaptive are not the only people guilty of ignoring the standardised gene names IMGT has gone to the trouble to lay out. Even now I still come across new papers where authors use old TCR gene nomenclatures (I'm looking at you, flow cytometrists!). I would however argue that it's especially troubling when Adaptive does it, as they produce the data for large numbers of customers, and are quite possibly the first entry point into RepSeq for many of them. This means a large body of data is being generated in the field with the wrong IDs, which in turn risks a whole host of errors during the necessary conversion to the correct format for publication or comparison with other datasets. Worse, it means that a considerable fraction of new participants in the field are potentially being taught the wrong conventions, which will feed forward, further dilute the standard and pour more oil on the fire of confusion – as if immunology wasn't already plagued with enough nomenclature woes!
While I'm on the subject, it's also interesting to note that in 2011 (a couple of years after their formation) Adaptive did state that “one of the community standards that we try to adhere to is IMGT nomenclature and definitions”. More interesting perhaps is a poster from 2015 in which they claim to actually be using IMGT nomenclature, despite clearly showing their edited version of it. In a way this is both reassuring and a little upsetting. They clearly know that the standard exists, and that it should be adhered to, but presumably don't consider the problems caused by adding characters to externally regulated gene symbols serious enough to stop. So close, yet so far!
Adaptive is clearly full of clever people who know the field very well. I'm certain that they've had exactly this discussion in the past, and – I hope – revisit it occasionally, perhaps when they get feedback. Because of that hope, I'd encourage other Adaptive customers, immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties to put a word in with your Adaptive representatives when you can. Let's see if we can convince them to take up the actual standard, instead of their well-meaning but ultimately frustrating derivative.

* Writing this section reminds me of a lecturer I had back in my undergrad, who was fond of quoting Keith Yamamoto's famous refrain: “scientists would rather share each other's underwear than use each other's nomenclature”. Much like she did, I tend to want to share it whenever any remotely related topic comes up, just because it's so good.


Wednesday, 26 March 2014

TCR trivia part one - translating T cell receptor sequences


Here's the first in what's likely to be a pretty niche series of posts – but hopefully useful to some – on various bits of TCR trivia I've gleaned during my PhD. This installment: translating TCR sequences.
Whether you're doing high- or low-throughput DNA TCR sequencing, it's highly likely that at some point you'll want to translate your DNA into amino acid sequences.
There are usually a few main goals to bear in mind when translating: checking the frame and the (potential) functionality of the sequence (i.e. that it lacks premature stop codons), before defining the hypervariable CDR3 region.
Assuming your reads are indel free, sequencing variable antigen receptors will always involve a bit more thought about checking the reading frame than regular amplicon sequencing, due to the non-templated addition and deletion of nucleotides that occurs during V(D)J recombination.
So how can we tell whether our final recombined sequence is in frame or not?
This will mostly depend on how your sequencing protocols work.
If, say, you've somehow managed to sequence an entire TCR mRNA, then it's easy; just take the sequence from the start codon of the leader region to the stop codon of the constant region, and if it divides exactly by three then it's in frame. Assuming there are no stop codons in between and the CDR3 checks out, chances are good your sequence encodes a productively rearranged TCR chain.
Given today's technology, that's unlikely to be the case; whether you're cloning into plasmids and Sanger sequencing or throwing libraries into some next-gen machine, chances are good that the sequence you'll be working with is actually just a small window around the recombination site.
What's likely is that you have amplified from a constant or J region primer at one end, to a V or RACE/template-switching primer at the other. Sorting out the frame is then just a matter of finding sequences on either side of the rearrangement that you know should be in frame, and seeing whether they are.
So far so self-explanatory. Here's the first of the fiddlier bits of TCR trivia I want to impart, the thing I need to remind myself of every time I tweak my TCR translating scripts: in fully-spliced TCR message the last nucleotide of the J region makes up the first nucleotide of the first codon of the constant region.
This means that if you go from the first nucleotide of the V (or the leader sequence, all functional leader sequences being divisible by three) to either the second to last nucleotide of the J, or the second nucleotide of the C (if you've started with mRNA instead of gDNA) then presto, you'll be in frame!
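Expressed as code, that check is a single line of modular arithmetic. A minimal sketch (the function and argument names are mine, not from any particular pipeline):

```python
def is_in_frame(v_part, insert, j_part):
    """Frame check for a rearrangement spanning the first nucleotide of the
    V (or leader) through the whole J. The J's final nucleotide belongs to
    the first constant-region codon, so it's excluded from the count."""
    return (len(v_part) + len(insert) + len(j_part) - 1) % 3 == 0
```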
The same quirk also produces another noteworthy feature, at least in one chain; while every TRBJ ends in the same nucleotide, there are functional TRAJ genes ending in each of the four bases. This means that the first residue of the alpha chain constant region can be one of four different amino acids – which can be a bit off-putting if you don't know this and you're looking at what's supposed to be constant.
Once you've translated your TCR the next step is to find the CDR3, which should surely be the easiest bit; just run from the second conserved cysteine to the phenylalanine in the FGXG motif, right, just as IMGT says? Simple, or so it might seem.
Both sides of the CDR3 offer their own complications. First off, finding the second conserved cysteine isn't as easy as just picking the second C from the 5', or the 5'-most one upstream of the rearrangement, as some Vs have more than two germline cysteines, and it's feasible that new ones might be generated in the rearrangement. At some point you're just going to have to do an alignment of all the different Vs, and record the position of all the conserved cysteines.
The FGXG is also trickier than it might seem, by merit of the fact that there are J genes in both the alpha and gamma loci* that (while supposedly functional) lack the motif, being either FXXG or XGXG. If you're only looking for CDR3s matching the C to FGXG pattern, then there's going to be whole J genes which never seem to get used by your analysis!
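Here's a sketch of how you might allow for those awkward J genes. The conserved-cysteine index is assumed to be known already (from a V gene alignment, as described above), and the looser motifs are only tried as fallbacks, since a pattern like XGXG can easily match inside the CDR3 itself:

```python
import re

# Canonical FGXG first, then the FXXG/XGXG variants some Js carry.
MOTIFS = [re.compile(p) for p in (r'FG.G', r'F..G', r'.G.G')]

def extract_cdr3(aa_seq, cys_index):
    """Return the CDR3 from the conserved cysteine to the motif's
    phenylalanine (or whatever stands in for it), inclusive."""
    for motif in MOTIFS:
        match = motif.search(aa_seq, cys_index + 1)
        if match:
            return aa_seq[cys_index:match.start() + 1]
    return None

print(extract_cdr3('CASSLGQGNEQFFGPGTRLTVL', 0))  # CASSLGQGNEQFF
```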
There you have it, a couple of little tips that are worth bearing in mind when translating TCR sequence**.

Addendum - it's actually a little bit more complicated. Naturally.
* If you're interested (or don't believe me), they are: TRAJ16, TRAJ33, TRAJ38, TRGJP1 and TRGJP2.
** Note that everything written here is based on human TCR sequence as that's what I work with, but most of it will probably apply to other species as well.

Wednesday, 29 January 2014

Immunological 3D printing, the how-to


Here's a quick overview of the different stages of the process I went through to make the 3D printed T-cell receptors detailed in part 1.

Part 2: the process

Now there are a couple of nice tutorials kicking around on how to 3D print your favourite protein. One option which clearly has beautiful results is to use UCSF's Chimera software, which was also the approach taken by the first protein print I saw. However this seemed a little full-on for my first attempt, so I opted for a relatively easy approach based on PyMol, modelling software I'm already familiar with.

The technique I used was covered very well in a blog post by Jessica Polka, which was then covered again in a very well written instructable. As these are both very nice resources I won't spend too long going over the same ground, but I thought it would be nice to fill in some of the gaps, maybe add a few pictures.

1: Find a protein

This should be the easy bit for most researchers (although if you work on a popular molecule finding the best crystal structure might take a little longer). Have a browse of the Protein Data Bank, download the PDB and open the molecule in PyMol.

All you need to do here is hide everything, show the surface, and export as a .wrl file (by saving as VRML 2). I mean that's all you need to do. If you want to colour it in, that's totally fine too.

PyMol keeps me entertained for hours.
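If you'd rather script those clicks, the same steps look something like this through PyMol's Python API (a sketch, assuming a PyMol build that supports VRML export via save, as recent ones do; swap in your own PDB ID):

```python
# Run headless with: pymol -cq make_wrl.py
from pymol import cmd

cmd.fetch('3pwp')       # pull the structure straight from the PDB
cmd.hide('everything')  # clear cartoons, sticks and the rest
cmd.show('surface')     # the surface is what actually gets printed
cmd.save('3pwp.wrl')    # a .wrl extension gives VRML 2 output
```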

2: Convert to .stl

Nice and easy; open your .wrl file in MeshLab ('Import Mesh'), and then export as a .stl, which is a very common filetype for 3D structures.

Say goodbye to your lovely colours.

3: Make it printable

Now we have a format that the proprietary 3D printing software can handle. As I only really have access to MakerBots, I next open my .stl files in MakerWare.

Starting to feel real now, right?
There's a couple of things to take into consideration here. First is placement; you need to decide which side of your molecule will be the 'top'. Remember, most protein structures are going to require scaffolds in order to be printed, which might cause some damage when removed.

Next is the quality of the print. One factor is scale; the bigger you make your molecule the better it will look (and likely print), at the cost of speed and plastic. Additionally you can alter the thickness of each print layer, the percentage infill (how solid the inside of the print is, up to a completely solid print) and the number of 'shells' that the print has.

Remember to tick 'Preview before printing' in order to get a time estimate.

4: The print!

Both of my molecules so far have been printed on MakerBot Replicator 2Xs, using ABS plastic, taking between 10 and 14 hours per print due to the complexity and size of the models. This part is also nice and simple; just warm up your printer, load your file and click go.

A side view of the printer as it runs, detailing the raft and scaffolds that will support the various overhangs.
The TCR begins to emerge, with hexagonal infill also visible.

5: The tidy-up

The prints come out looking a little something like this:

Note the colour change where one roll of ABS ran out, and someone thoughtfully swapped it over, if sadly not for the same colour
This green really does not photograph well. I like to pretend I was going for the nearest I could to the IMGT-approved shade of variable-region green, but really I was just picking the roll with the most ABS left to be sure it wouldn't run out again.
Then comes the wonderfully satisfying job of ripping off the raft and the scaffolds. Words can't describe just how enjoyable (if incredibly fiddly and pokey) this is.

My thanks go to Katharine and Mattia for de-scaffolding services rendered.
Seriously, this stuff is better than bubble wrap.

Prepare for mess and you will not be disappointed; otherwise, you will be picking tiny bits of plastic out of your hair.
I found a sliding scale of using needle-nose pliers, then tweezers, then very fine forceps seemed to work best. At this point make sure you keep some of your scrap ABS in reserve, as it can be useful later.

Once you've gotten all the scaffolding off, your protein should look the right shape, if a little scraggy around the edges. I've read that 3D printing folk sometimes use fine sandpaper here to neaten up these edges, which I will consider in future, but generally the surface area to cover is fairly large and inaccessible, so it's not an option I've spent long dwelling on.

The nasty underbelly of the print, after scaffold removal
In an effort to minimise such unsightly bottoms in the second print, I went for a higher quality print than before (see the MakerWare dialog screenshot above); however it still produced both scaffold bobbles and misprint holes – you can see two in the photo below, one just above and one just below centre, slightly off to the right.

The other side is much nicer, I promise.
Mmm, radioactive.

However, increasing the infill percentage and the number of shells made one major noticeable difference: the side chains that stick out are much less fragile than they were on the first print*.

Note that this is also when the spare ABS can come in handy; dissolved in a little acetone, it readily becomes a sloppy paint, which can be slathered on to fill in any glaring flaws in the model.

I should point out that at this point the rest of the model tends to look pretty good (if I do say so myself).

I got a lot of questions asking about the significance of the other colour.

6: Smoothing

In addition to physical removal of lumps, it's also possible to smooth out the printing layers themselves by exposing the print to acetone vapour for a time, as discussed in many nice tutorials.

I personally like the contour effect of the printing somewhat, and due to the delicate nature of the protrusion-heavy proteins I didn't want to go for an extreme acetone shower, but I think a light application has smoothed off some of the rougher imperfections.

This is a particularly easy thing to achieve for the average wet-lab biologist, as we have ready access to heat blocks and (usually) large beakers. Unfortunately the CD4-containing 3T0E print was too large for any of the beakers in the lab, so I had to make do.

Nothing pleases me more than a satisfactory bodge-job.
I'm not sure if it was the larger volume or the denser (or perhaps different) plastic, but this setup took a lot longer to achieve lesser results than the previous model did. However, it did still smooth the layers somewhat.


Still shiny - and tacky - from the acetone.
It's worth doing this in a well-ventilated area, as the combination of acetone vapour and melted-plastic smell isn't the nicest. Bear in mind the print itself will smell for a little while after smoothing, which will upset your office mates.

The finished result; the model that made the immunologists of twitter all want rice pudding. What a shame the nice side had the colour change. That the acetone-smoothing appears to have affected the two colours differently suggests that different rolls of ABS do indeed have different dissolving properties.
There you have it, the simple method to easily 3D print the structure of proteins. Honestly, the hardest bit is finding a 3D printer to use.

Or in my case, finding the time to get over there to use it.

* I tell people that I was experimenting with tactile mutational analysis, when really I just dropped the print and a couple of aromatic side chains fell off. Note that they do readily stick back on with superglue.

Immunological 3D printing



Part 1: the pictures

As a good little geek, I’ve been itching to have a play with 3D printers for a while now. At first I’d mostly contemplated what bits and bobs I might produce for use in the lab, but then I started to see a number of fantastic 3D printed protein models.

Form is so important to function in biology, yet a lot of the time we biologists forget or ignore the shape and make-up of the molecules we study. As Richard Feynman once said, “it is very easy to answer many of these fundamental biological questions; you just look at the thing”.

3D printing protein structures represents a great opportunity to (literally) get to grips with proteins, cells, microbes and macromolecules. While I still recommend playing around with PDBs to get a feel for a molecule, having an actual physical, tactile item to hold appeals to me greatly.

So when I got access to the UCL Institute of Making, I got straight to work printing out examples of the immune molecules I study, T-cell receptors. You can read about how I made them here. Or, if you're just here for some pretty pictures of 3D prints, continue; here are the two I've printed so far.

Here are the two finished products! I apologise for the quality: a combination of my garish fluorescent office lighting and shonky camera-phones does not a happy photo make.
My first try: 3PWP. This is the A6 TCR, recognising the self-peptide HuD bound in the groove of the class I HLA-A2. HLA-A2 is coloured in dark pink, with β2 microglobulin in light pink, while the alpha and beta TCR chains are shown in light and dark blue respectively.
I particularly love the holes, crevices and caves pitted throughout the molecules. Having spent a goodly deal of time painstakingly pulling the scaffolding material out of these holes, I can confirm that you do indeed get a feel for the intricate surfaces of these structures.

You can imagine the antigen presenting cell on the left, with the T-cell on the right, as if we were watching from within the plane of the immunological synapse.

As a brief aside, in playing around with the 3PWP structure in PyMol (as detailed in an earlier blogpost) I was surprised to see the following: despite this being a class I MHC (the binding grooves of which should be closed in at both ends), we can see the green of the peptide peeking out and contributing to the surface mesh.

There's that HuD peptide!
The new addition: 3T0E. This brilliant ternary structure shows another autoimmune TCR, this time recognising a class II MHC, HLA-DR4, with an additional coreceptor interaction; enter CD4! Here we have the TCR chains coloured as above, while the HLA-DR alpha and beta chains are red and magenta respectively. Leaning in to touch the (membrane-proximal) foot of the MHC is the yellow CD4. Note that I took feedback, and this time went for a colour that didn't look so rice-puddingy.
The structure that became my second print was a particularly lucky find, as it contains not only a TCR-pMHC interaction, but also the CD4 coreceptor. This shot is angled as if we're inside the T-cell looking out across the synapse. If you imagine the various components of CD3 clustering around the constant region of the TCR you can really start to visualise the molecular complexity of even a single TCR-pMHC ligation event.

It's also quite nice to see that despite the differences in HLA composition between classes (one MHC-encoded chain plus B2M in class I versus two MHC-encoded chains in class II), they structurally seem quite similar by eye - at least at a surface level scale.

There you have it, my first forays into 3D printing immunological molecules. Let me know what you think, if you have any ideas for future prints - I'm thinking probably some immunoglobulins for the next run - or if you're going to do any printing yourself.

Saturday, 10 August 2013

Decombining the Recombined: High-Throughput TCR Repertoire Analysis

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version can be found on the Innate2Adaptive lab's Github repo.

The background


As I've mentioned in the past, my PhD project is involved with using high-throughput sequencing (HTS) to investigate T-cell receptor (TCR) repertoires.
Next-generation sequencing (NGS) technologies allow us to investigate the workings and dynamics of the adaptive immune system with greater breadth and resolution than ever before. It’s possible to extract or amplify the coding sequence for thousands to millions of different clones, and throw them all into one run of a sequencer. The trick is extracting the TCR information from the torrent of data that pours out.
Our lab's paper for the high-throughput analysis of such repertoire data has been out for a while now, so here's my attempt to put it into a little wider context than a paper allows, while hopefully translating it for the less computationally inclined.
In order to understand how the brains in our lab tackled this problem, it's probably worth looking at how others before us have done so.

The history

There's a beaten path to follow. The germline V, D and J gene sequences are easily downloaded from IMGT GENE-DB; then you take your pick of a short read aligner to match up the V and J sequences used (we’ll come to the Ds later).
Most of the popular alignment algorithms get covered by some group or other: SWA (Ndifon et al., 2012; Wang et al., 2010), BLAT (Klarenbeek et al., 2010) and BLAST (Mamedov et al., 2011) all feature. IMGT’s HighV-QUEST software uses DNAPLOT, a bespoke aligner written for their previous low-throughput version (Alamyar et al, 2012; Giudicelli et al, 2004; Lefranc et al, 1999).
Sadly, some of the big hitters in the field don’t see fit to mention what they use to look for TCR genes (or I’ve just missed it). Robert Holt’s group produced the first NGS TCR sequencing paper I’m aware of (Freeman, Warren, Webb, Nelson, & Holt, 2009), but don’t mention how they assign their genes (admittedly they’re more focused on explaining how iSSAKE, their short-read assembler designed for TCR data, works).
The most prolific author in the TCR ‘RepSeq’ field is Harlan Robins, who has contributed to a wealth of wonderful repertoire papers (Emerson et al., 2013; Robins et al., 2009, 2010, 2012; Sherwood et al., 2013; Srivastava & Robins, 2012; Wu et al., 2012), yet all remain equally vague on TCR assignation methods (probably related to the fact that he and several other early colleagues set up a company, Adaptive Biotech, that offers TCR repertoire sequencing and analysis).
So we see a strong tendency towards alignment (and a disappointing prevalence of ‘in house scripts’). I can understand why: you've sequenced your TCRs, you've got a folder of fastqs burning a hole in your hard drive, and you're itching to get analysing. What else does your average common house or garden biologist do when confronted with sequences, but align?
However, this doesn't really exploit the nature of the problem.

The problem

When trying to fill out a genome, alignment and assembly make sense; you take a read, and you either see where it matches to a reference, or see where it overlaps with the other reads to make a contig.
For TCR analysis however, we're not trying to map reads to or make a genome; arguably we're dealing with some of the few sequences that might not be covered under the regular remit of 'the genome'. Nor are we trying to identify known amplicons from a pool, where they should all be the same, give or take the odd SNP.
TCR amplicons instead should have one of a limited number of sequences at each end (corresponding to the V and J/C gene present, depending on your TCR amplification/capture strategy), with a potentially (and indeed probably) completely novel sequence in between.
In this scenario, pairwise alignment isn’t necessarily the best option. Why waste time trying to match every bit of each read to a reference (in this case, a list of germ-line V(D)J gene segments), when only the bits you’re least interested in – the germline sequences – stand a hope of aligning anywhere?

The solution?

Enter our approach: Decombinator (Thomas, Heather, Ndifon, Shawe-Taylor, & Chain, 2013).
Decombinator rapidly scans through sequence files looking for rearranged TCRs, classifying any it finds by these criteria: which V and J genes were used, how many nucleotides have been deleted from each, and the string of nucleotides between the end of the V and the start of the J. This simple five-part index condenses all of the information contained in any given detected rearrangement into a compact format, convenient for downstream processing.
All TCR analysis probably does something similar; the clever thing about Decombinator is how it gets there. At its core lies a finite state automaton that passes through each read looking for a panel of short sequence ‘tags’, each one uniquely identifying the presence of a germline V or J gene. If it finds tags for both a V and a J it can populate the five fields of the identifier, thus calling the rearrangement.
Example of the comma-delimited output from Decombinator. From left to right: (optional sixth field) clonal frequency; V gene used; J gene used; number of V deletions; number of J deletions; and insert string
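If you want to do anything downstream with that output, reading it back in is trivial. An illustrative snippet (the field names are my own labels, not Decombinator's):

```python
from collections import namedtuple

DCR = namedtuple('DCR', 'frequency v_gene j_gene v_deletions j_deletions insert')

def read_decombined(path):
    """Yield one DCR per line of comma-delimited Decombinator output."""
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = [f.strip() for f in line.rstrip('\n').split(',')]
            if len(fields) == 5:        # no optional frequency column
                fields = ['1'] + fields
            freq, v, j, vdel, jdel, insert = fields
            yield DCR(int(freq), v, j, int(vdel), int(jdel), insert)
```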
The algorithm used was developed by Aho and Corasick in the ‘70s, for bibliographic text searching, and the computer science people tell me that it’s really never been topped – when it comes to searching for multiple strings at once, the Aho-Corasick (AC) algorithm is the one (Aho & Corasick, 1975).
Its strength lies in its speed – it’s simply the most effective way to search one target string for multiple substrings. By using the AC algorithm Decombinator runs orders of magnitude faster than alignment based techniques. It does this by generating a special trie of the substrings, which it uses to search the target string exceedingly efficiently.
Essentially, the algorithm uses the trie to look for every substring or tag simultaneously, in just one pass through the sequence to be searched. It passes through the string, using the character at the current position to decide how to navigate; by knowing where it is on the trie at any one given point, it’s able to use the longest matched suffix to find the longest available matched prefix.
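To make that concrete, here's a minimal Aho-Corasick implementation (purely for illustration; Decombinator's real code differs in its details). Build a trie of the tags, wire up the failure links by breadth-first search, then walk the read once:

```python
from collections import deque

def build_automaton(tags):
    """Build the trie, failure links and output sets for a set of tags."""
    trie, fail, out = [{}], [0], [set()]
    for tag in tags:
        node = 0
        for base in tag:
            if base not in trie[node]:
                trie.append({})
                fail.append(0)
                out.append(set())
                trie[node][base] = len(trie) - 1
            node = trie[node][base]
        out[node].add(tag)
    queue = deque(trie[0].values())         # the root's children keep fail = 0
    while queue:
        node = queue.popleft()
        for base, child in trie[node].items():
            f = fail[node]
            while f and base not in trie[f]:
                f = fail[f]                 # walk back along failure links
            fail[child] = trie[f].get(base, 0)
            out[child] |= out[fail[child]]  # inherit tags ending at that node
            queue.append(child)
    return trie, fail, out

def find_tags(read, automaton):
    """Report (start, tag) for every tag occurrence, in one pass."""
    trie, fail, out = automaton
    node, hits = 0, []
    for i, base in enumerate(read):
        while node and base not in trie[node]:
            node = fail[node]               # longest matched suffix
        node = trie[node].get(base, 0)
        hits.extend((i - len(tag) + 1, tag) for tag in out[node])
    return hits

automaton = build_automaton(['TGTGCC', 'GCCAGCAGC'])
print(find_tags('AATGTGCCAGCAGCTT', automaton))  # [(2, 'TGTGCC'), (5, 'GCCAGCAGC')]
```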


Figshare seems to have scotched up the resolution somewhat, but you get the idea.

Decombinator uses a slight modification of the AC approach, in order to cope with mismatches between the tag sequences and the germline gene as sequenced (perhaps due to SNPs, PCR/sequencing error, or use of non-prototypic alleles). If no complete tags are found, the code breaks each of the tags into two halves, making new half-tags which are then used to make new tries and search the sequence.
If a half-tag is found, Decombinator then compares the sequence in the read to all of the whole-tags that contain that half-tag*; if there’s only a one nucleotide mismatch (a Hamming distance of one) then that germline region is assigned to the read. In simulated data, we find this to work pretty well, correctly assigning ~99% of artificial reads with a 0.1% error (i.e. one random mismatch every 1,000 nucleotides, on average), dropping to ~88% for 1% error **.
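Schematically, that rescue step looks something like this (a sketch with made-up names, not the actual implementation):

```python
def hamming(a, b):
    """Count mismatches between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def rescue_by_half_tag(read, half_tag, hit_position, whole_tags):
    """Given where a half-tag matched in the read, test every whole tag
    containing that half, accepting a gene call at Hamming distance <= 1."""
    for gene, tag in whole_tags.items():
        offset = tag.find(half_tag)
        if offset == -1:
            continue                   # this tag doesn't contain the half
        start = hit_position - offset  # where the whole tag would begin
        if start < 0:
            continue
        window = read[start:start + len(tag)]
        if len(window) == len(tag) and hamming(window, tag) <= 1:
            return gene
    return None
```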
It's simple, fast, effective and open source; if you have a high-throughput human or mouse alpha-beta or gamma-delta TCR dataset to analyse, it's probably worth giving it a try. The only other real option available is HighV-QUEST, which (due to submission and speed constraints) might not be too appealing if you really have a serious amount of data.
(CORRECTION – in the course of writing this entry (which typically has gone on about five times longer than I intended, in both time taken and words written), some new rather exciting looking software has just been published (Bolotin et al., 2013). MiTCR makes a number of bold claims which if true, would make it a very tempting bit of software. However, given that I haven’t been able to get it working, I think I’ll save any discussion of this for a later blog entry.)

The (odds and) end(s)

D segments
If you’ve made it this far through the post, chances are good you’re either thinking about doing some TCR sequencing or have done already, in which case you’re probably wondering, ‘but what about the Ds – when do they get assigned’?
In the beta locus (of both mice and men), there are but two D regions, which are very short, and very similar. As they can get exonucleolytically nibbled from both ends, the vast majority of the time you simply can't tell which has been used (one early paper managed in around a quarter of reads (Freeman et al., 2009)). Maybe you could do some kind of probabilistic inference for the rest, based on the fact that there is a correlation between which Ds pair with which Js, likely due to the chromatin conformation at this relatively small part of the locus (Murugan, Mora, Walczak, & Callan, 2012; Ndifon et al., 2012), but that's a lot of work for very little reward.
Hence Decombinator does not assign TRBDs; they just get included in the fifth, string-based component of the five-part identifier (which explains the longer inserts you see for beta compared to alpha). If you want to go TRBD mining you're very welcome; just take the relevant column of the output and get aligning. However, for our purposes (and I suspect many others'), knowing the exact TRBD isn't that important, even where it's possible at all.
Errors
There’s also the question of how to deal with errors, which can accrue during amplification or sequencing of samples. While Decombinator does mitigate error somewhat through use of the half-tags and omission of reads containing ambiguous N calls, it doesn’t have any other specific error-filtration. As with any pipeline, garbage in begets garbage out; there’s plenty of software to trim or filter HTS data, so we don’t really need to reinvent the wheel and put some in here.
Similarly, Decombinator doesn’t currently offer sequence clustering, whereby very similar sequences get amalgamated into one, as some published pipelines do. Personally, I have reservations about applying standard clustering techniques to variable immunoreceptor sequence data.
Sequences can legitimately differ by only one nucleotide, and production of very similar clonotypes is inbuilt into the recombination machinery (Greenaway et al., 2013; Venturi et al., 2011); it is very easy to imagine bona fide but low frequency clones being absorbed into their more prevalent cousins, which could obscure some genuine biology. The counter argument is of course that by not clustering, one allows through a greater proportion of errors, thus artificially inflating diversity. Again, if desired other tools for sequence clustering exist.
Disclaimer
My contribution to Decombinator was relatively minor - the real brainwork was done before I'd even joined the lab, by my labmate, mathematician Niclas Thomas, our shared immunologist supervisor Benny Chain, and Nic's mathematical supervisor, John Shawe-Taylor. They came up with the idea and implemented the first few versions. I came in later, designing tags for the human alpha chain, testing and debugging, and bringing another biologist’s view to the table for feature development***. The other author, Wilfred Ndifon, at the time was a postdoc in a group with close collaborative ties, who I believe gave additional advice on development, and provided pair-wise alignment scripts against which to test Decombinator.

* Due to the short length of the half-tags and the high levels of homology between germline regions, not all half tags are unique
** In our Illumina data, by comparing the V sequence upstream of the rearrangement – which should be invariant – to the reference, we typically get error rates below 0.5%, some of which could be explained by allelic variation or SNPs relative to the reference
*** At first we had a system that Nic would buy a drink for every bug found. For a long time I was the only person really using Decombinator (and probably remain the person who’s used it most), often tweaking it for whatever I happened to need to make it do that day, putting me in prime place to be Bug-Finder Extraordinaire. I think I let him off from the offer after the first dozen or so bugs found.

The papers

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855
Alamyar, A., Giudicelli, V., Li, S., Duroux, P., & Lefranc, M. (2012). IMGT/HighV-QUEST: the IMGT® web portal for immunoglobulin (IG) or antibody and T cell receptor (TR) analysis from NGS high throughput and deep sequencing. Immunome Research, 8(1), 2.
Bolotin, D. A., Shugay, M., Mamedov, I. Z., Putintseva, E. V., Turchaninova, M. A., Zvyagin, I. V., Britanova, O. V., et al. (2013). MiTCR: software for T-cell receptor sequencing data analysis. Nature Methods, (8). doi:10.1038/nmeth.2555
Emerson, R., Sherwood, A., Desmarais, C., Malhotra, S., Phippard, D., & Robins, H. (2013). Estimating the ratio of CD4+ to CD8+ T cells using high-throughput sequence data. Journal of Immunological Methods. doi:10.1016/j.jim.2013.02.002
Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H., & Holt, R. A. (2009). Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Research, 19(10), 1817–24. doi:10.1101/gr.092924.109
Giudicelli, V., Chaume, D., & Lefranc, M.-P. (2004). IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V-J and V-D-J rearrangement analysis. Nucleic Acids Research, 32(Web Server issue), W435–40. doi:10.1093/nar/gkh412
Greenaway, H. Y., Ng, B., Price, D. A., Douek, D. C., Davenport, M. P., & Venturi, V. (2013). NKT and MAIT invariant TCRα sequences can be produced efficiently by VJ gene recombination. Immunobiology, 218(2), 213–24. doi:10.1016/j.imbio.2012.04.003
Klarenbeek, P. L., Tak, P. P., Van Schaik, B. D. C., Zwinderman, A. H., Jakobs, M. E., Zhang, Z., Van Kampen, A. H. C., et al. (2010). Human T-cell memory consists mainly of unexpanded clones. Immunology Letters, 133(1), 42–8. doi:10.1016/j.imlet.2010.06.011
Lefranc, M.-P. (1999). IMGT, the international ImMunoGeneTics database. Nucleic Acids Research, 27(1), 209–212. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=165532&tool=pmcentrez&rendertype=abstract
Mamedov, I. Z., Britanova, O. V., Bolotin, D., Chkalina, A. V., Staroverov, D. B., Zvyagin, I. V., Kotlobay, A. A., et al. (2011). Quantitative tracking of T cell clones after hematopoietic stem cell transplantation. EMBO Molecular Medicine, 1–8.
Murugan, A., Mora, T., Walczak, A. M., & Callan, C. G. (2012). Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1212755109
Ndifon, W., Gal, H., Shifrut, E., Aharoni, R., Yissachar, N., Waysbort, N., Reich-Zeliger, S., et al. (2012). Chromatin conformation governs T-cell receptor J gene segment usage. Proceedings of the National Academy of Sciences, 1–6. doi:10.1073/pnas.1203916109
Robins, H. S., Campregher, P. V., Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., et al. (2009). Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood, 114(19), 4099–107. doi:10.1182/blood-2009-04-217604
Robins, H. S., Desmarais, C., Matthis, J., Livingston, R., Andriesen, J., Reijonen, H., Carlson, C., et al. (2012). Ultra-sensitive detection of rare T cell clones. Journal of Immunological Methods, 375(1-2), 14–19. doi:10.1016/j.jim.2011.09.001
Robins, H. S., Srivastava, S. K., Campregher, P. V., Turtle, C. J., Andriesen, J., Riddell, S. R., Carlson, C. S., et al. (2010). Overlap and effective size of the human CD8+ T cell receptor repertoire. Science Translational Medicine, 2(47), 47ra64. doi:10.1126/scitranslmed.3001442
Sherwood, A. M., Emerson, R. O., Scherer, D., Habermann, N., Buck, K., Staffa, J., Desmarais, C., et al. (2013). Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer Immunology, Immunotherapy: CII. doi:10.1007/s00262-013-1446-2
Srivastava, S. K., & Robins, H. S. (2012). Palindromic nucleotide analysis in human T cell receptor rearrangements. PLoS ONE, 7(12), e52250. doi:10.1371/journal.pone.0052250
Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J., & Chain, B. (2013). Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics, 29(5), 542–50. doi:10.1093/bioinformatics/btt004
Venturi, V., Quigley, M. F., Greenaway, H. Y., Ng, P. C., Ende, Z. S., McIntosh, T., Asher, T. E., et al. (2011). A mechanism for TCR sharing between T cell subsets and individuals revealed by pyrosequencing. Journal of Immunology. doi:10.4049/jimmunol.1003898
Wang, C., Sanders, C. M., Yang, Q., Schroeder, H. W., Wang, E., Babrzadeh, F., Gharizadeh, B., et al. (2010). High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proceedings of the National Academy of Sciences of the United States of America, 107(4), 1518–23. doi:10.1073/pnas.0913939107
Wu, D., Sherwood, A., Fromm, J. R., Winter, S. S., Dunsmore, K. P., Loh, M. L., Greisman, H. A., et al. (2012). High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Science Translational Medicine, 4(134), 134ra63. doi:10.1126/scitranslmed.3003656