
Tuesday, 4 September 2018

The problem with Adaptive TCR data

I'm a big proponent of DIY TCR (and BCR) sequencing. It's the best way to be able to vouch that every step in the process has been done correctly; you are able to QC and query whatever steps you wish; it's typically more customisable to your specific hypotheses and research questions; and it's invariably cheaper. What's more, there are lots of great labs making and publishing such pipelines (including the one I helped develop back in London), so you don't even need to go to the effort of making one yourself.
However, there are a number of situations in which you might instead choose to outsource this task to a commercial supplier. The greater cost and loss of flexibility can be offset by scalability, reduced hands-on time and third-party guarantees, and you avoid the need to build capacity for sequencing and data processing in house, which brings its own savings in money and time.
Without even needing to check, I can confidently say that Adaptive Biotech are foremost among the companies offering this as a service. As part of a few different projects I've recently been getting my feet wet analysing some large datasets produced by Adaptive, including both publicly available projects of theirs (accessed via their immunoSEQ portal) and data from samples that we've sent to them.
Generally speaking, I'm pretty happy with both the service and the data we've received. I love how they make a lot of their own data publicly accessible, and the frequency with which they publish cool and important papers. I like how they are making RepSeq available to labs that might otherwise not be able to leverage this powerful technology (at least those that can afford it). In almost every sense, it's a company that I am generally pretty in favour of.
However, in designing their analyses Adaptive have taken one massive liberty, which (while I'm sure it was undertaken with the best of intentions) stands to cause any number of problems, frustrations, and potential disasters - both to their customers and the field at large.
What is this heinous crime, this terrible sin they've committed? Could they be harvesting private data, releasing CDR3 sequences with coded messages, pooling all of our adaptive repertoire data in some bizarre arcane ritual? No. Instead they tried to make the TCR gene naming system make a little bit more sense (cue dramatic thunder sound effects).
It's a crime as old as biology, one particularly prevalent in immunology: you don't like the current gene naming system, so what do you do? Start a new one! A better, shinier one, with new features and definitely no downsides - it'll be so good it could even become the new standard!*
I know exactly why they did it too; when I worked on our own TCR analysis software and results during my PhD, I encountered the same problems. The TCR names are bothersome from a computing perspective. They don't sort right - either alphabetically or chromosomally. They don't contain the same number of characters as each other, so they don't line up nicely on an axis. They're generally just a bit disordered, which can be confusing. They're precisely what a software engineer would never design.
Adaptive's solution is however a classic engineering one. Here's a problem, let's fix it. 'TR' is almost 'TCR' but not quite – that's confusing, so let's just chuck a 'C' in there and make it explicit. Some V/J genes have extra hyphenated numbers – so let's give all of them hyphenated numbers. And hey, some gene groups have more than ten members – let's add leading zeros so they all sort nicely. We'll take those annoying, seemingly arbitrary special cases, and bring them all into a nice consistent system. Bing bang bosh, problem solved.
This is all very well and good until you realise that this isn't about making something perfect, neat and orderly; we're talking about describing biology here, where complexity, redundancy and just plain messiness are par for the course. Having a bunch of edge cases that don't fit the rule basically is the rule!
Let's look at some examples, maybe starting at the beginning of the beta locus with the V gene that the rest of us know as TRBV1. If you go looking for this in your Adaptive data (at least if you export it from their website as I did) then you might not find it straight away; instead, it goes by the name TCRBV01-01. Similarly TRBV15 becomes TCRBV15-01, TRBV27 → TCRBV27-01, and so on.
Sure, the names all look prettier now, but this approach is deeply problematic for a bunch of reasons. With respect to these specific examples, the hyphenated numbers aren't just applied to genes randomly; they denote genes that are part of a subgroup containing more than one gene (meaning they share more than 75% nucleotide identity in the germline). You can argue this is an arbitrary threshold, but it is still useful; it allows a quick shorthand to roughly infer both evolutionary divergence times and current similarity, within that threshold. Adding hyphenated numbers to all genes washes out one of the few bits of information you could actually glean about a TCR or BCR gene just by looking at the name (along with approximate chromosomal position and potential degree of polymorphism, going off the allele number when present). Which genes fall in subgroups with multiple members also differs between species, which adds another level of usefulness to the current setup; appending '-XX' to all genes as Adaptive does makes it easier to become confused or make mistakes when comparing repertoires or loci of different organisms.
The more important reason, however, has nothing to do with what incidental utility is lost or gained; the fact of the matter is that these genes have already been named! When it comes to asking what the corresponding gene symbol for a particular V, D or J sequence is, there is a correct answer. It has been agreed upon for years, internationally recognised and codified. People sat around in a committee and decided it.
Whether you like it or not, HUGO and IMGT between them have got this covered, and we should all be using the agreed upon names. To do otherwise is to invite confusion, ambiguity and inaccuracies, weakening the utility of published reports and shared data. Gene name standardisation is hardly sexy, but it is important.
Admittedly Adaptive are not the only people guilty of ignoring the standardised gene names IMGT has gone to the trouble to lay out. Even now I still come across new papers where authors use old TCR gene nomenclatures (I'm looking at you, flow cytometrists!). I would however argue that it's especially troubling when Adaptive does it, as they are the data producers for large numbers of customers, and are quite possibly the first entry point into RepSeq for many of them. This means that a large body of data is being generated in the field with the wrong IDs, which in turn risks a whole host of errors during the necessary conversion to the correct format for publication or comparison with other datasets. Worse, it means that a considerable fraction of new participants in the field are potentially being taught the wrong conventions, which will feed forward, further dilute the standard, and pour more oil on the fire of confusion – as if immunology wasn't already plagued with enough nomenclature woes!
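For anyone stuck doing that conversion, the renaming is at least mostly mechanical to reverse. Here's a minimal Python sketch of the sort of thing I mean – to be clear, this is an illustrative function of my own, not Adaptive's or IMGT's tooling; the rules are just those inferred from the examples above, allele suffixes aren't handled, and the set of multi-member subgroups shown is nowhere near exhaustive (you'd want to build it properly from IMGT/GENE-DB):

 import re
 
 # Subgroups whose members genuinely carry hyphenated designations in IMGT
 # nomenclature. Illustrative only - in practice, derive this from IMGT/GENE-DB.
 MULTI_MEMBER_SUBGROUPS = {"TRBV6", "TRBV7"}
 
 def adaptive_to_imgt(name):
     """Convert an Adaptive-style symbol (e.g. 'TCRBV01-01') back towards
     its IMGT symbol (e.g. 'TRBV1'). Allele suffixes (e.g. '*01') ignored."""
     name = re.sub(r"^TCR", "TR", name)                # 'TCR' prefix back to 'TR'
     name = re.sub(r"(?<=[VDJ])0+(\d)", r"\1", name)   # strip leading zeros (subgroup number)
     name = re.sub(r"-0+(\d)", r"-\1", name)           # strip leading zeros (gene number)
     subgroup = name.split("-")[0]
     # drop the '-1' appended to single-member subgroups
     if name.endswith("-1") and subgroup not in MULTI_MEMBER_SUBGROUPS:
         name = subgroup
     return name
 
 for adaptive in ["TCRBV01-01", "TCRBV15-01", "TCRBV27-01", "TCRBV06-04"]:
     print(adaptive, "->", adaptive_to_imgt(adaptive))
 # TCRBV01-01 -> TRBV1
 # TCRBV15-01 -> TRBV15
 # TCRBV27-01 -> TRBV27
 # TCRBV06-04 -> TRBV6-4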
While I'm on the subject, it's also interesting to note that in 2011 (a couple of years after their formation) Adaptive did state that “one of the community standards that we try to adhere to is IMGT nomenclature and definitions”. More interesting perhaps is a poster from 2015 where they claim to actually be using IMGT nomenclature, despite clearly showing their edited version of it. In a way this is both reassuring and a little upsetting. They clearly know that the standard exists, and that it should be adhered to, but they presumably don't consider the problems caused by adding characters to externally regulated gene symbols serious enough to stop doing it. So close, yet so far!
Adaptive is clearly full of lots of clever people who know the field very well. I'm certain that they've had exactly this discussion in the past, and – I hope – revisit it occasionally, perhaps when they get feedback. Because of that hope, I'd encourage other Adaptive customers, immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties to put a word in with your Adaptive representatives when you can. Let's see if we can convince them to take up the actual standard, instead of their well-meaning but ultimately frustrating derivative.

* Writing this section reminds me of a lecturer I had back in my undergrad, who was fond of quoting Keith Yamamoto's famous refrain: “scientists would rather share each other's underwear than use each other's nomenclature”. Much like she did, I tend to want to share it whenever any remotely related topic comes up, just because it's so good.


Monday, 7 December 2015

The key to finding TCR sequences in RNA-seq data

I had previously written a short blog post touching on how I'd tried to mine some PBMC RNA-seq data (from the ENCODE project) for rearranged T-cell receptor genes, to try and open up this huge resource for TCR repertoire analysis. However, I hadn't gotten very far, on account of finding very few TCR sequences per file.

That sets the background for an extremely pleasant surprise this morning, when I found that Scott Brown, Lisa Raeburn and Robert Holt from Vancouver (the last of whom is notable for producing one of the very earliest high-throughput sequencing TCR repertoire papers) had published a very nice paper doing just that!

This is a lovely example of different groups seeing the same problem and coming up with different takes. I saw an extremely low rate of return when TCR-mining in RNA-seq data from heterogeneous cell types, and gave up on it as a search for needles in a haystack. The Holt group saw the same problem, and simply searched more haystacks!

This paper tidily exemplifies the re-purposing of biological datasets to allow us to ask new biological questions (something that I consider a practical and moral necessity, given the complexity of such data and the time and costs involved in their generation).

Moreover, they pull off some really nice tricks: estimating TCR transcript proportions in other data sets based on constant region usage, investigating TCR diversity relative to CD3 expression, testing on simulated RNA-seq data sets as a control, looking for public or known-specificity receptors, and inferring possible alpha-beta pairs by checking all of each sample's possible combinations for their presence in at least one other sample (somewhat akin to Harlan Robins' pairSEQ approach).

All in all, a very nice paper indeed, and I hope we see more of this kind of data re-purposing in the field at large. Such approaches could certainly be adapted for immunoglobulin genes. I also wonder if, given whole-genome sequencing data from mixed blood cell populations, we might even be able to do a similar analysis on rearranged variable antigen receptors from gDNA.

Friday, 7 August 2015

PhD thesis writing advice


When I sat down to write this post, I had an opening line in mind that was going to bemoan me being a bit remiss with my blogging of late. That was before I checked my last post and realised it had been ten months, and 'remiss' feels a bit inadequate: I basically stopped.
I had a pretty good reason, in that I needed to finish my PhD. Those ten months have basically been filled with write paper, submit paper, thesis thesis thesis, do a talk, thesis thesis thesis, get paper rejections, rewrite, resubmit, thesis thesis thesis … then finally viva (and a couple more paper submissions). It is exhausting, and frankly having either a thesis or a paper to be writing at any one given time takes the shine off blogging somewhat.
Now that it's done and out of the way (no corrections, wahey!), I begin to turn my mind to getting the blog up and running again. What have I been up to lately that I could write about, I asked myself. Well, there's always this big blue beast.

Thesis writing having been the major task of the last year of my life, I've spent more than a little time thinking about the process, so I thought I'd share some pointers and thoughts, even if they're mostly the usual and the obvious.
1) Everything takes longer than expected.
This is the best advice I got going in, and is always the first thing I say to anyone who asks. In a perfect example of Hofstadter's Law, no matter how much time you think a given task should take, in reality it will take longer.
2) Be consistent.
The longer a given document is, the more chance there is that inconsistencies will creep in. Theses are long documents that are invariably only read by accomplished academics (or those in training), many of whom have a keen eye for spotting such inconsistencies. 'n=5' on one line and 'n = 4' on another. 'Fig1: Blah', 'Figure 2 – Blah blah' and 'Fig. 3 Even more blah'. These might not seem like big deals, but added up they can make the document feel less professional. Choice of spelling (e.g. UK or US English), where you put your spaces, how you refer to and describe your figures and references, all of these types of things: it doesn't matter so much how you choose to do them, as long as you do them the same way each time.
On a related note, many technical notations – such as writing gene, protein or chemical names – are covered by exhaustive committee-approved nomenclatures. Use them, or justify why you're not using them, but again, be consistent.
3) Make it easy for yourself: get into good habits.
Writing long documents is difficult, so don't make it harder on yourself. Start as you mean to go on, and go on like that – I made a rather foolish decision to change how I formatted my figures one chapter into my writing, and going back to re-format the old stuff was time I could have spent writing*. Doing something right the first time is much more efficient than re-doing it two or three times.
Make sure you make full use of reference manager software. This will sound obvious to people who use them, but I am consistently surprised by the number of PhD students I meet who write their bibliographies and citations manually. I personally use Mendeley, which works perfectly well and has a nice reading interface, although there are plenty of others. You are probably going to have a lot (hundreds) of references, and even more citations: doing them manually is a recipe for disaster.
Similarly, don't do any manual cross-referencing if you can avoid it – the document as you write it will likely be entirely fluid and subject to change for months, so any 'hard' references you put in could well end up needing to be changed, which not only takes time but increases the risk of you missing something and carrying an error along to your finished PhD.
If you have the time, I would recommend trying to get into LaTeX (with the 'X' pronounced as a 'K'), which is a free, open-source code-based type-setting program. It's a bit of a steep learning curve, but there are plenty of good templates, and once you've got a grasp of the basic commands it's incredibly powerful. Crucially, as your file is just a text document (which effectively just 'links' to pictures when you compile your PDF) it remains small in size, and therefore easy to load, back up and play with. It also makes referencing, cross-referencing, and generally producing beautiful-looking text a lot easier than most word processors.
Theses are often full of technical words and abbreviations, and it's entirely likely your examiners won't know them all beforehand – therefore they need to be defined the first time they're used. However, if you're moving chunks of text around (sometimes whole chapters), how do you know which one is the first time? My tactic was to not define anything while writing, but whenever I did use a new phrase I added it to a spreadsheet, along with its definition. Then once everything was set in place I worked through that spreadsheet and used 'find' to add the appropriate definition to the first instance of every term. What's more, that spreadsheet was then easily alphabetised and converted straight into a convenient glossary!
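If it helps, the 'find the first use of each term' step is easy to script rather than doing by eye. Here's a rough Python sketch of the idea, assuming a hypothetical abbreviations.csv of 'term,definition' rows and the thesis exported as plain text – the file names and layout are placeholders, not a prescription:

 import csv
 
 # Hypothetical inputs: the spreadsheet of abbreviations collected while writing,
 # and the thesis as plain text (or the .tex source itself).
 with open("abbreviations.csv") as f:
     glossary = {row[0]: row[1] for row in csv.reader(f)}
 
 with open("thesis.txt") as f:
     lines = f.readlines()
 
 # Report where each term first appears, so its definition can be added there.
 for term in sorted(glossary):            # sorted = ready-made glossary ordering
     first = next((i + 1 for i, line in enumerate(lines) if term in line), None)
     if first:
         print(f"{term}: first used on line {first} ({glossary[term]})")
     else:
         print(f"{term}: never used - consider dropping it from the glossary")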
4) Be prepared to get sick of it ...
You will spend an unhealthy amount of time doing one thing: working on this one document. It will bore you, and it will make you boring, as it will take over your time, thoughts and life**. It is basically guaranteed that you will grow bone-weary of sitting down at your computer and working on it yet again. It's relentless; it just keeps going on and on and on, to the point where you forget that your life hasn't always been, and will not always be, just thesis-writing.
5) ... but remember it will end.
It might not seem like it at the time, but it will. You will finish writing, you will finish checking, you will hand it in. You'll then find some errors, but that's OK, your examiners are never going to read it as closely as you do when you check it. Remember that your supervisor(s)/thesis committees/whoever shouldn't let you submit unless they think you're ready, so the fact you're submitting means you're probably going to be fine!
* For what it's worth, the final way I did my figures was better; I should just have thought about it and done it that way first. Basically I outputted my plots from Python and R as SVG files, compiled them into whole figures in Inkscape (which is also great for making schematics), and saved these as PDFs. A word of warning though – certain lines/boxes in occasional Python-saved SVGs failed to print (apparently something to do with the way fancy printers flatten the layers), so it is probably worth keeping backup EPS or non-vector versions of your Inkscape files on hand.
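(For the record, the SVG-exporting step itself is a one-liner in matplotlib, with R's svg() device doing the equivalent job; this is just a generic illustration rather than any of my actual figure code.)

 import matplotlib.pyplot as plt
 
 fig, ax = plt.subplots()
 ax.plot([0, 1, 2, 3], [1, 4, 2, 8])
 ax.set_xlabel("x")
 ax.set_ylabel("y")
 fig.savefig("panel_a.svg")            # vector output, ready to assemble in Inkscape
 fig.savefig("panel_a.png", dpi=300)   # a raster copy to fall back on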
** Look at me, I've just closed my file for the last time (before uploading to university servers) and the first thing I do is go and write a thousand words about it!

Sunday, 4 May 2014

Translating TCR sequences addendum: not as easy as FGXG

I recently wrote a blog post about the strategies used to translate T-cell receptor nucleotides en masse and extract (what can arguably be considered) the useful bit: the CDR3.

In that post I touched on the IMGT definition of the CDR3: it runs from the second conserved cysteine in the V region to the conserved FGXG motif in the J. Nice and easy, but we have to remember that it's the conserved bit that's key here: there are other cysteines to factor in, and there are a few germline J genes that don't use the typical FGXG motif.
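To show just how easy the naive version looks, here's a quick Python sketch of motif-based CDR3 spotting – a deliberately simplistic stand-in of my own (grab the last cysteine before an FGXG), not the logic of any particular published pipeline, and with a made-up junction as the example:

 import re
 
 def naive_cdr3(aa_seq):
     """Very naive CDR3 spotting: the stretch from the last cysteine before an
     FGXG motif up to and including that motif's F. Real pipelines anchor the
     conserved C and F via the germline V and J references instead."""
     motif = re.search(r"FG.G", aa_seq)
     if not motif:
         return None
     cys = aa_seq.rfind("C", 0, motif.start())
     if cys == -1:
         return None
     return aa_seq[cys:motif.start() + 1]
 
 # a made-up translated beta junction, just for illustration
 print(naive_cdr3("TLYFCASSIRSSYEQYFGPGTRLTVT"))   # -> CASSIRSSYEQYF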

However even that paints too simple a picture, so here's a quick follow up point:

These are human-imposed definitions, based more on convenience for human understanding than biological necessity. The fact is that we might well produce a number of TCRs that don't make use of these motifs at all, but that are still able to function perfectly well; assuming the C/FGXG motifs have a function, it's possible that alternative motifs might compensate for their absence.

I have examples in my own sequence data that appear to clearly show these motifs having been deleted into, and then replaced with different nucleotides encoding the same residues. Alternative residues must certainly be introduced on occasion, and I'd be surprised if none of these make it through selection; we just don't see them because we aren't able to generate rules to look for them computationally.

I actually found such an example with verified biological activity just recently: this paper sequenced tetramer-sorted HIV-reactive T-cells, revealing one that contained an alpha chain using the CDR3 'CAVNIGFGNVLHCGSG'.*

For the majority of analyses, looking for rare exceptions to rules probably won't make much difference. However, as we increase the resolution and throughput of our experiments, we're going to find more and more examples of things which don't fit the tidy rules we made up when we weren't looking so deeply. If we're going to get the most out of our 'big data', we need to be ready for them.

* I was looking through the literature harvesting CDR3s, which reminds me of another point I want to make. Can I just ask, from the bottom of my heart, for people to put their CDR3s in sensible formats so that others can make use of them? Ideally, give me the nucleotide sequence. Bare minimum,  give me the CDR3 sequence as well as which V and J were used (and while I stick to IMGT standards, I won't judge you if you don't - but do say which standards you are using!). Most of all, and I can't stress this enough, please please PLEASE make all DNA/amino acid sequences copyable.**


** Although spending valuable time copying out or removing PDF-copying errors from hundreds of sequences drives me ever so slightly nearer to a breakdown, it does allow me to play that excellent game of "what's the longest actual word I can find in biological sequences". For CDR3s, I'm up to a sixer with 'CASSIS'.

Sunday, 23 March 2014

TCR diversity in health and in HIV

Before I forget, here's a record of the poster I presented recently at the Quantitative Immunology workshop in Les Houches (the first day of which I blogged about last week*).

In a move I was pretty pleased with, I hosted my poster on figshare, with an additional pdf of supplementary information. (I even included a QR code on the poster linking to both, which I thought pretty cunning, but this plan was sadly scuppered by a complete lack of wifi signal.)

You can access the poster here, and the supplementary information here.

The poster gives a quick glance at some of my recent work, where I've been using random barcodes to error-correct deep-sequenced TCR repertoires, a technique I'm applying to comparing healthy individuals to HIV patients, both before and after three months of antiretroviral therapy.

* For those interested in a larger overview of the conference, my supervisor Benny Chain wrote a longer piece at his blog, Immunology with numbers.

Sunday, 20 October 2013

Installing python dependencies for Decombinator

I've written previously about Decombinator, the TCR repertoire analysis program developed in our laboratory.

Having just formatted my computer and reinstalled all my packages, I've been reminded of what a faff this can be.

This time around I did things a lot more efficiently than I had before. Seeing as how my colleague responsible for updating the current readme is busy writing up his thesis, here's my quick guide to installing all the python modules required by Decombinator in Linux - at least in Ubuntu.

 sudo apt-get install python-numpy python-biopython python-matplotlib python-levenshtein   
 sudo apt-get install python-pip  
 sudo pip install acora  

Most of the modules are available in the repositories, which has the added bonus of filling in all their required dependencies. Acora (the module that implements the Aho-Corasick finite-state machine), however, is not, but it is easily installed from the command line using pip.

Easy. Time to get decombining.

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version (and installation details) can be found on the Innate2Adaptive lab's Github repo.

Saturday, 10 August 2013

Decombining the Recombined: High-Throughput TCR Repertoire Analysis

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version can be found on the Innate2Adaptive lab's Github repo.

The background

As I've mentioned in the past, my PhD project is involved with using high-throughput sequencing (HTS) to investigate T-cell receptor (TCR) repertoires.
Next-generation sequencing (NGS) technologies allow us to investigate the workings and dynamics of the adaptive immune system with greater breadth and resolution than ever before. It’s possible to extract or amplify the coding sequence for thousands to millions of different clones, and throw them all into one run of a sequencer. The trick is extracting the TCR information from the torrent of data that pours out.
Our lab’s paper for the high-throughput analysis of such repertoire data has been out for a while now, so here’s my attempt to put it into a little wider context than a paper allows, while hopefully translating it for the less computationally inclined.
In order to understand best how the brains in our lab tackled this problem, it's probably worth looking at how others before us have approached it.

The history

There's a beaten path to follow. The germline V, D and J gene sequences are easily downloaded from IMGT GENE-DB; then you take your pick of a short read aligner to match up the V and J sequences used (we’ll come to the Ds later).
Most of the popular alignment algorithms get covered by some group or other: SWA (Ndifon et al., 2012; Wang et al., 2010), BLAT (Klarenbeek et al., 2010) and BLAST (Mamedov et al., 2011) all feature. IMGT’s HighV-QUEST software uses DNAPLOT, a bespoke aligner written for their previous low-throughput version (Alamyar et al, 2012; Giudicelli et al, 2004; Lefranc et al, 1999).
Sadly, some of the big hitters in the field don’t see fit to mention what they use to look for TCR genes (or I’ve just missed it). Robert Holt’s group produced the first NGS TCR sequencing paper I’m aware of (Freeman, Warren, Webb, Nelson, & Holt, 2009), but don’t mention how they assign their genes (admittedly they’re more focused on explaining how iSSAKE, their short-read assembler designed for TCR data, works).
The most prolific author in the TCR ‘RepSeq’ field is Harlan Robins, who has contributed to a wealth of wonderful repertoire papers (Emerson et al., 2013; Robins et al., 2009, 2010, 2012; Sherwood et al., 2013; Srivastava & Robins, 2012; Wu et al., 2012), yet all remain equally vague on TCR assignment methods (probably related to the fact that he and several other early colleagues set up a company, Adaptive Biotech, that offers TCR repertoire sequencing and analysis).
So we see a strong tendency towards alignment (and a disappointing prevalence of ‘in house scripts’). I can understand why: you've sequenced your TCRs, you've got a folder of fastqs burning a hole in your hard drive, and you're itching to get analysing. What else does your average common or garden biologist do when confronted with sequences, but align?
However, this doesn't really exploit the nature of the problem.

The problem

When trying to fill out a genome, alignment and assembly make sense; you take a read, and you either see where it matches to a reference, or see where it overlaps with the other reads to make a contig.
For TCR analysis however, we're not trying to map reads to or make a genome; arguably we’re dealing with some of the few sequences that might not be covered under the regular remit of 'the genome'. Nor are we trying to identify known amplicons from a pool, where they should all be the same, give or take the odd SNP.
TCR amplicons instead should have one of a limited number of sequences at each end (corresponding to the V and J/C gene present, depending on your TCR amplification/capture strategy), with a potentially (and indeed probably) completely novel sequence in between.
In this scenario, pairwise alignment isn’t necessarily the best option. Why waste time trying to match every bit of each read to a reference (in this case, a list of germ-line V(D)J gene segments), when only the bits you’re least interested in – the germline sequences – stand a hope of aligning anywhere?

The solution?

Enter our approach: Decombinator (Thomas, Heather, Ndifon, Shawe-Taylor, & Chain, 2013).
Decombinator rapidly scans through sequence files looking for rearranged TCRs, classifying any it finds by these criteria: which V and J genes were used, how many nucleotides have been deleted from each, and the string of nucleotides between the end of the V and the start of the J. This simple five-part index condenses all of the information contained in any given detected rearrangement into a compact format, convenient for downstream processing.
All TCR analysis probably does something similar; the clever thing about Decombinator is how it gets there. At its core lies a finite state automaton that passes through each read looking for a panel of short sequence ‘tags’, each one uniquely identifying the presence of a germline V or J gene. If it finds tags for both a V and a J it can populate the five fields of the identifier, thus calling the rearrangement.
Example of the comma-delimited output from Decombinator. From left to right: (optional sixth-field) clonal frequency; V gene used; J gene used; number of V deletions; number of J deletions, and insert string
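To make the five-part idea concrete, here's a toy Python sketch – emphatically not Decombinator's actual code – in which the tag positions, gene names and 'germline' sequences are all invented for illustration:

 def classify(read, v_name, v_germ, j_name, j_germ, tag_len=8):
     """Toy five-part classification: (V, J, V deletions, J deletions, insert)."""
     v_tag = v_germ[-20:-20 + tag_len]    # a short tag sitting inside the V 3' region
     j_tag = j_germ[12:12 + tag_len]      # a short tag sitting inside the J 5' region
     v_hit, j_hit = read.find(v_tag), read.find(j_tag)
     if v_hit == -1 or j_hit == -1:
         return None                      # no recognisable rearrangement
     # walk right from the V tag while the read still matches germline V
     v_end, germ_pos = v_hit + tag_len, len(v_germ) - 20 + tag_len
     while germ_pos < len(v_germ) and v_end < len(read) and read[v_end] == v_germ[germ_pos]:
         v_end, germ_pos = v_end + 1, germ_pos + 1
     v_dels = len(v_germ) - germ_pos      # how much of the germline V 3' end is missing
     # walk left from the J tag while the read still matches germline J
     j_start, germ_pos = j_hit, 12
     while germ_pos > 0 and j_start > 0 and read[j_start - 1] == j_germ[germ_pos - 1]:
         j_start, germ_pos = j_start - 1, germ_pos - 1
     j_dels = germ_pos                    # how much of the germline J 5' end is missing
     return (v_name, j_name, v_dels, j_dels, read[v_end:j_start])
 
 # invented 'germline' genes, and a read with 4 V deletions, an 'AG' insert and 3 J deletions
 v_germ = "CATCATCATCAT" + "GATTACAG" + "CCGGTTAACCGG"
 j_germ = "ATTCGATTCGAT" + "TGCATGCA" + "GGAAGGAAGGAA"
 read = v_germ[:-4] + "AG" + j_germ[3:]
 print(classify(read, "TRBVx", v_germ, "TRBJx", j_germ))   # ('TRBVx', 'TRBJx', 4, 3, 'AG')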
The algorithm used was developed by Aho and Corasick in the ‘70s, for bibliographic text searching, and the computer science people tell me that it’s really never been topped – when it comes to searching for multiple strings at once, the Aho-Corasick (AC) algorithm is the one (Aho & Corasick, 1975).
Its strength lies in its speed – it’s simply the most effective way to search one target string for multiple substrings. By using the AC algorithm Decombinator runs orders of magnitude faster than alignment based techniques. It does this by generating a special trie of the substrings, which it uses to search the target string exceedingly efficiently.
Essentially, the algorithm uses the trie to look for every substring or tag simultaneously, in just one pass through the sequence to be searched. It passes through the string, using the character at the current position to decide how to navigate; by knowing where it is on the trie at any one given point, it’s able to use the longest matched suffix to find the longest available matched prefix.


Figshare seems to have scotched up the resolution somewhat, but you get the idea.
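For anyone curious what that looks like under the hood, here's a bare-bones Python sketch of the Aho-Corasick construction and search – a generic textbook-style implementation with invented tag sequences, not the code that Decombinator (or the acora module it relies on) actually runs:

 from collections import deque
 
 def build_automaton(tags):
     """Build the trie plus 'failure' links (the longest proper suffix of the
     current match that is also a prefix of some tag)."""
     trie = [{"next": {}, "fail": 0, "out": []}]             # node 0 is the root
     for tag in tags:                                        # 1) build a plain trie
         node = 0
         for ch in tag:
             if ch not in trie[node]["next"]:
                 trie[node]["next"][ch] = len(trie)
                 trie.append({"next": {}, "fail": 0, "out": []})
             node = trie[node]["next"][ch]
         trie[node]["out"].append(tag)
     queue = deque(trie[0]["next"].values())                 # 2) failure links, by BFS
     while queue:
         node = queue.popleft()
         for ch, child in trie[node]["next"].items():
             queue.append(child)
             fail = trie[node]["fail"]
             while fail and ch not in trie[fail]["next"]:
                 fail = trie[fail]["fail"]
             trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
             trie[child]["out"] += trie[trie[child]["fail"]]["out"]
     return trie
 
 def search(trie, text):
     """One pass through 'text', reporting (start position, tag) for every hit."""
     node, hits = 0, []
     for i, ch in enumerate(text):
         while node and ch not in trie[node]["next"]:
             node = trie[node]["fail"]                       # fall back on mismatch
         node = trie[node]["next"].get(ch, 0)
         for tag in trie[node]["out"]:
             hits.append((i - len(tag) + 1, tag))
     return hits
 
 tags = ["TGTGCCAGCA", "TTCGGAACTG"]       # invented stand-ins for a V and a J tag
 trie = build_automaton(tags)
 print(search(trie, "NNNTGTGCCAGCAAGGGGGTTCGGAACTGNNN"))
 # [(3, 'TGTGCCAGCA'), (19, 'TTCGGAACTG')]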

Decombinator uses a slight modification of the AC approach, in order to cope with mismatches between the tag sequences and the germline gene sequenced (perhaps from SNPs, PCR/sequencing error, or use of non-prototypic alleles). If no complete tags are found, the code breaks each of the tags into two halves, making new half-tags which are then used to make new tries and search the sequence.
If a half-tag is found, Decombinator then compares the sequence in the read to all of the whole-tags that contain that half-tag*; if there’s only a one nucleotide mismatch (a Hamming distance of one) then that germline region is assigned to the read. In simulated data, we find this to work pretty well, correctly assigning ~99% of artificial reads with a 0.1% error (i.e. one random mismatch every 1,000 nucleotides, on average), dropping to ~88% for 1% error **.
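In sketch form (again a toy of my own with an invented tag and read, rather than Decombinator's implementation), that rescue step looks something like this:

 def hamming(a, b):
     """Number of mismatched positions between two equal-length strings."""
     return sum(x != y for x, y in zip(a, b))
 
 def rescue_with_half_tags(read, tags, max_mismatches=1):
     """If no whole tag was found, look for either half of each tag, then check
     whether the full-length window it implies is within one mismatch of the
     whole tag."""
     for tag in tags:
         half = len(tag) // 2
         for offset, half_tag in ((0, tag[:half]), (half, tag[half:])):
             pos = read.find(half_tag)
             if pos == -1:
                 continue
             start = pos - offset              # where the whole tag would begin
             window = read[start:start + len(tag)]
             if start >= 0 and len(window) == len(tag) and hamming(window, tag) <= max_mismatches:
                 return tag, start
     return None
 
 # a read carrying the tag with a single error in its second half (sequences invented)
 print(rescue_with_half_tags("AAATGTGCCTGCAGGG", ["TGTGCCAGCA"]))   # ('TGTGCCAGCA', 3)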
It’s simple, fast, effective and open source; if you have a high-throughput human or mouse alpha-beta or gamma-delta TCR data set to analyse, it’s probably worth giving it a try. The only other real option available is HighV-QUEST, which (due to submission and speed constraints) might not be too appealing if you really have a serious amount of data.
(CORRECTION – in the course of writing this entry (which typically has gone on about five times longer than I intended, in both time taken and words written), some new rather exciting looking software has just been published (Bolotin et al., 2013). MiTCR makes a number of bold claims which if true, would make it a very tempting bit of software. However, given that I haven’t been able to get it working, I think I’ll save any discussion of this for a later blog entry.)

The (odds and) end(s)

D segments
If you’ve made it this far through the post, chances are good you’re either thinking about doing some TCR sequencing or have done already, in which case you’re probably wondering, ‘but what about the Ds – when do they get assigned’?
In the beta locus (of both mice and men), there are but two D regions, which are very short, and very similar. As they can get exonucleolytically nibbled from both ends, the vast majority of the time you simply can't tell which has been used (one early paper managed it in around a quarter of reads (Freeman et al., 2009)). Maybe you could do some kind of probabilistic inference for the rest, based on the fact that there is a correlation between which Ds pair with which Js, likely due to the chromatin conformation at this relatively small part of the locus (Murugan, Mora, Walczak, & Callan, 2012; Ndifon et al., 2012), but that's a lot of work for very little reward.
Hence Decombinator does not assign TRBDs; they just get included in the fifth, string-based component of the five-part identifier (which explains the longer inserts you see for beta compared to alpha). If you want to go TRBD mining you're very welcome; just take the relevant column of the output and get aligning. However, for our purposes (and I suspect many others'), knowing the exact TRBD isn't that important, where it's even possible to determine at all.
Errors
There’s also the question of how to deal with errors, which can accrue during amplification or sequencing of samples. While Decombinator does mitigate error somewhat through use of the half-tags and omission of reads containing ambiguous N calls, it doesn’t have any other specific error-filtration. As with any pipeline, garbage in begets garbage out; there’s plenty of software to trim or filter HTS data, so we don’t really need to reinvent the wheel and put some in here.
Similarly, Decombinator doesn’t currently offer sequence clustering, whereby very similar sequences get amalgamated into one, as some published pipelines do. Personally, I have reservations about applying standard clustering techniques to variable immunoreceptor sequence data.
Sequences can legitimately differ by only one nucleotide, and production of very similar clonotypes is built into the recombination machinery (Greenaway et al., 2013; Venturi et al., 2011); it is very easy to imagine bona fide but low-frequency clones being absorbed into their more prevalent cousins, which could obscure some genuine biology. The counter argument is of course that by not clustering, one allows through a greater proportion of errors, thus artificially inflating diversity. Again, if desired, other tools for sequence clustering exist.
Disclaimer
My contribution to Decombinator was relatively minor - the real brainwork was done before I'd even joined the lab, by my labmate, mathematician Niclas Thomas, our shared immunologist supervisor Benny Chain, and Nic's mathematical supervisor, John Shawe-Taylor. They came up with the idea and implemented the first few versions. I came in later, designing tags for the human alpha chain, testing and debugging, and bringing another biologist’s view to the table for feature development***. The other author, Wilfred Ndifon, at the time was a postdoc in a group with close collaborative ties, who I believe gave additional advice on development, and provided pair-wise alignment scripts against which to test Decombinator.

* Due to the short length of the half-tags and the high levels of homology between germline regions, not all half tags are unique
** In our Illumina data, by comparing the V sequence upstream of the rearrangement – which should be invariant – to the reference, we typically get error rates below 0.5%, some of which could be explained by allelic variation or SNPs relative to the reference
*** At first we had a system that Nic would buy a drink for every bug found. For a long time I was the only person really using Decombinator (and probably remain the person who’s used it most), often tweaking it for whatever I happened to need to make it do that day, putting me in prime place to be Bug-Finder Extraordinaire. I think I let him off from the offer after the first dozen or so bugs found.

The papers

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855
Alamyar, A., Giudicelli, V., Li, S., Duroux, P., & Lefranc, M. (2012). IMGT/HighV-QUEST: the IMGT web portal for immunoglobulin (IG) or antibody and T cell receptor (TR) analysis from NGS high throughput and deep sequencing. Immunome Research, 8(1), 2. doi:10.1038/nmat3328
Bolotin, D. A., Shugay, M., Mamedov, I. Z., Putintseva, E. V., Turchaninova, M. A., Zvyagin, I. V., Britanova, O. V., et al. (2013). MiTCR: software for T-cell receptor sequencing data analysis. Nature methods, (8). doi:10.1038/nmeth.2555
Emerson, R., Sherwood, A., Desmarais, C., Malhotra, S., Phippard, D., & Robins, H. (2013). Estimating the ratio of CD4+ to CD8+ T cells using high-throughput sequence data. Journal of immunological methods. doi:10.1016/j.jim.2013.02.002
Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H., & Holt, R. A. (2009). Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome research, 19(10), 1817–24. doi:10.1101/gr.092924.109
Giudicelli, V., Chaume, D., & Lefranc, M.-P. (2004). IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V-J and V-D-J rearrangement analysis. Nucleic acids research, 32(Web Server issue), W435–40. doi:10.1093/nar/gkh412
Greenaway, H. Y., Ng, B., Price, D. A., Douek, D. C., Davenport, M. P., & Venturi, V. (2013). NKT and MAIT invariant TCRα sequences can be produced efficiently by VJ gene recombination. Immunobiology, 218(2), 213–24. doi:10.1016/j.imbio.2012.04.003
Klarenbeek, P. L., Tak, P. P., Van Schaik, B. D. C., Zwinderman, A. H., Jakobs, M. E., Zhang, Z., Van Kampen, A. H. C., et al. (2010). Human T-cell memory consists mainly of unexpanded clones. Immunology letters, 133(1), 42–8. doi:10.1016/j.imlet.2010.06.011
Lefranc, M.-P. (1999). IMGT, the international ImMunoGeneTics database. Nucleic acids research, 27(1), 209–212. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=165532&tool=pmcentrez&rendertype=abstract
Mamedov, I. Z., Britanova, O. V, Bolotin, D., Chkalina, A. V, Staroverov, D. B., Zvyagin, I. V, Kotlobay, A. A., et al. (2011). Quantitative tracking of T cell clones after hematopoietic stem cell transplantation. EMBO molecular medicine, 1–8.
Murugan, A., Mora, T., Walczak, A. M., & Callan, C. G. (2012). Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1212755109
Ndifon, W., Gal, H., Shifrut, E., Aharoni, R., Yissachar, N., Waysbort, N., Reich-Zeliger, S., et al. (2012). Chromatin conformation governs T-cell receptor J gene segment usage. Proceedings of the National Academy of Sciences, 1–6. doi:10.1073/pnas.1203916109
Robins, H. S., Campregher, P. V, Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., et al. (2009). Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood, 114(19), 4099–107. doi:10.1182/blood-2009-04-217604
Robins, H. S., Desmarais, C., Matthis, J., Livingston, R., Andriesen, J., Reijonen, H., Carlson, C., et al. (2012). Ultra-sensitive detection of rare T cell clones. Journal of Immunological Methods, 375(1-2), 14–19. doi:10.1016/j.jim.2011.09.001
Robins, H. S., Srivastava, S. K., Campregher, P. V, Turtle, C. J., Andriesen, J., Riddell, S. R., Carlson, C. S., et al. (2010). Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine, 2(47), 47ra64. doi:10.1126/scitranslmed.3001442
Sherwood, A. M., Emerson, R. O., Scherer, D., Habermann, N., Buck, K., Staffa, J., Desmarais, C., et al. (2013). Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer immunology, immunotherapy: CII. doi:10.1007/s00262-013-1446-2
Srivastava, S. K., & Robins, H. S. (2012). Palindromic nucleotide analysis in human T cell receptor rearrangements. PloS one, 7(12), e52250. doi:10.1371/journal.pone.0052250
Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J., & Chain, B. (2013). Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics, 29(5), 542–50. doi:10.1093/bioinformatics/btt004
Venturi, V., Quigley, M. F., Greenaway, H. Y., Ng, P. C., Ende, Z. S., McIntosh, T., Asher, T. E., et al. (2011). A Mechanism for TCR Sharing between T Cell Subsets and Individuals Revealed by Pyrosequencing. Journal of immunology. doi:10.4049/jimmunol.1003898
Wang, C., Sanders, C. M., Yang, Q., Schroeder, H. W., Wang, E., Babrzadeh, F., Gharizadeh, B., et al. (2010). High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proceedings of the National Academy of Sciences of the United States of America, 107(4), 1518–23. doi:10.1073/pnas.0913939107
Wu, D., Sherwood, A., Fromm, J. R., Winter, S. S., Dunsmore, K. P., Loh, M. L., Greisman, H. A., et al. (2012). High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Science Translational Medicine, 4(134), 134ra63. doi:10.1126/scitranslmed.3003656