Showing posts with label IMGT. Show all posts

Tuesday, 4 September 2018

The problem with Adaptive TCR data

I'm a big proponent of DIY TCR (and BCR) sequencing. It's the best way to be able to vouch that every step in the process has been done correctly; you are able to QC and query whatever steps you wish; it's typically more customisable to your specific hypotheses and research questions; and it's invariably cheaper. What's more, there are lots of great labs making and publishing such pipelines (including the one I helped develop back in London), so you don't even need to go to the effort of making one yourself.
However, there are a number of situations in which you might instead choose to outsource this task to a commercial supplier. The greater cost and loss of flexibility can be offset by scalability, reduced hands-on time and third-party guarantees, and by avoiding the need to build capacity for sequencing and data processing in house, which brings its own savings and time benefits.
Without even needing to check I can confidently say that Adaptive Biotech are foremost among the companies offering this as a service. As part of a few different projects I've recently been getting my feet wet analysing some large datasets produced by Adaptive, including both publicly available projects of theirs (accessed via their immunoSEQ portal) and data from samples that we've sent to them.
Generally speaking, I'm pretty happy with both the service and the data we've received. I love how they make a lot of their own data publicly accessible, and the frequency with which they publish cool and important papers. I like how they are making RepSeq available to labs that might otherwise not be able to leverage this powerful technology (at least those as can afford it). In almost every sense, it's a company that I am generally pretty in favour of.
However, in designing their analyses Adaptive have taken one massive liberty, which (while I'm sure it was undertaken with the best of intentions) stands to cause any number of problems, frustrations, and potential disasters - both to their customers and the field at large.
What is this heinous crime, this terrible sin they've committed? Could they be harvesting private data, releasing CDR3 sequences with coded messages, pooling all of our adaptive repertoire data in some bizarre arcane ritual? No. Instead they tried to make the TCR gene naming system make a little bit more sense (cue dramatic thunder sound effects).
It's a crime as old as biology, one particularly prevalent in immunology: you don't like the current gene naming system, so what do you do? Start a new one! A better, shinier one, with new features and definitely no downsides - it'll be so good it could even become the new standard!*
I know exactly why they did it too; when I worked on our own TCR analysis software and results in my PhD, I encountered the same problems. The TCR names are bothersome from a computing perspective. They don't sort right - either alphabetically or chromosomally. They don't contain the same number of characters as each other, so they don't line up nice on an axis. They're generally just a bit disordered, which can be confusing. They're precisely not what a software engineer would design.
Adaptive's solution is however a classic engineering one. Here's a problem, let's fix it. 'TR' is almost 'TCR' but not quite – that's confusing, so let's just chuck a 'C' in there and make it explicit. Some V/J genes have extra hyphenated numbers – so let's give all of them hyphenated numbers. And hey, some gene groups have more than ten members – let's add leading zeros so they all sort nice and alphabetically. We'll take those annoying seemingly arbitrary special cases, and bring them all into a nice consistent system. Bing bang bosh, problem solved.
This is all very well and good until you realise that this isn't about making something perfect, neat and orderly; we're talking about describing biology here, where complexity, redundancy and just plain messiness are par for the course. Having a bunch of edge cases that don't fit the rule basically is the rule!
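For what it's worth, the sorting complaint at least can be dealt with in software without touching the names at all. Here's a minimal sketch of a 'natural sort' key in Python, which splits each symbol into text and number chunks so that the numbers compare as numbers. The function name and the handful of TRBV genes in the list are just my picks for illustration, and this only gets you numeric order, which is not guaranteed to match chromosomal order.

```python
import re

def natural_key(gene):
    """Split a gene symbol into text and integer chunks.

    'TRBV10-1' becomes ['TRBV', 10, '-', 1, ''], so 10 sorts after 2
    instead of before it, with no leading zeros needed in the name.
    """
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", gene)]

# A few human TRBV symbols, chosen purely for illustration.
genes = ["TRBV10-1", "TRBV2", "TRBV1", "TRBV27", "TRBV6-5", "TRBV6-1", "TRBV15"]

alphabetical = sorted(genes)               # TRBV10-1 lands before TRBV2
natural = sorted(genes, key=natural_key)   # numeric-aware ordering

print(alphabetical)
print(natural)
```

The point being that the display problem can live in the plotting or sorting code, leaving the gene symbols themselves alone.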
Let's look at some examples, maybe starting at the beginning of the beta locus with the V gene that the rest of us know as TRBV1. If you go looking for this in your Adaptive data (at least if you export it from their website as I did) then you might not find it straight away; instead, it goes by the name TCRBV01-01. Similarly TRBV15 becomes TCRBV15-01, TRBV27 → TCRBV27-01, and so on.
Sure, the names all look prettier now, but this approach is deeply problematic for a bunch of reasons. With respect to these specific examples, the hyphenated numbers aren't just applied to genes randomly; they denote those genes which are part of a subgroup containing more than one gene (meaning they share more than 75% nucleotide identity in the germline). You can argue this is an arbitrary threshold, but it is nevertheless useful; it allows a quick shorthand to roughly infer both evolutionary divergence times and current similarity, within that threshold. Adding hyphenated numbers to all genes washes out one of the few bits of information you could actually glean about a TCR or BCR gene just by looking at the name (along with approximate chromosomal position and potential degree of polymorphism, going off the allele number when present). Which genes fall in subgroups with multiple members also differs between species, which adds another level of usefulness to the current setup; appending '-XX' to all genes as Adaptive does makes it easier to become confused or make mistakes when comparing repertoires or loci of different organisms.
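To make the information loss concrete, here's a rough sketch of what converting back to IMGT symbols might look like in Python. Everything in it is an assumption on my part rather than anything Adaptive or IMGT provide: the regular expression is inferred only from the three examples above, the function and variable names are made up, alleles and odd cases are ignored, and the lookup set contains just those three genes, so it is deliberately incomplete.

```python
import re

# Illustrative and deliberately incomplete: only the three single-member
# subgroups named in this post. A real table would have to be built from
# IMGT's gene lists, per locus and per species.
SINGLE_MEMBER_SUBGROUPS = {"TRBV1", "TRBV15", "TRBV27"}

# Assumed shape of an Adaptive-style name, e.g. TCRBV01-01 (no allele handling).
ADAPTIVE_PATTERN = re.compile(r"^TCR([ABGD])([VDJ])(\d+)(?:-(\d+))?$")

def adaptive_to_imgt(name, single_member=SINGLE_MEMBER_SUBGROUPS):
    """Convert an Adaptive-style gene name to an IMGT-style symbol."""
    match = ADAPTIVE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised gene name: {name!r}")
    locus, segment, subgroup, member = match.groups()
    # Drop the extra 'C' and the leading zeros.
    symbol = f"TR{locus}{segment}{int(subgroup)}"
    # Keep the hyphenated number only for multi-member subgroups.
    if member is not None and symbol not in single_member:
        symbol = f"{symbol}-{int(member)}"
    return symbol

print(adaptive_to_imgt("TCRBV01-01"))  # TRBV1
print(adaptive_to_imgt("TCRBV27-01"))  # TRBV27
print(adaptive_to_imgt("TCRBV06-05"))  # TRBV6-5
```

Note that the lookup table is doing the real work: once every gene has a hyphenated number, the name alone can't tell you whether the '-01' was ever meant to be there.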
The more important reason however has nothing to do with what incidental utility is lost or gained; the fact of the matter is that these genes have already been named! When it comes to asking what the corresponding gene symbol for a particular V, D or J sequence is, there is a correct answer. It has been agreed upon for years, internationally recognised and codified. People sat around in a committee and decided it.  
Whether you like it or not, HUGO and IMGT between them have got this covered, and we should all be using the agreed upon names. To do otherwise is to invite confusion, ambiguity and inaccuracies, weakening the utility of published reports and shared data. Gene name standardisation is hardly sexy, but it is important.
Admittedly Adaptive are not the only people guilty of ignoring the standardised gene names IMGT has gone to the trouble to lay out. Even now I still come across new papers where authors use old TCR gene nomenclatures (I'm looking at you flow cytometrists!). I would however argue that it's especially troubling when Adaptive does it, as they are the data producers for large numbers of customers, and are quite possibly the first entry point into RepSeq for many of those. This means that a large body of data is being generated in the field with the wrong IDs. This in turn risks a whole host of errors during the necessary conversion to the correct format for publication or comparison with other datasets. Worse, it means that potentially a considerable fraction of new participants in the field are being taught the wrong conventions, which will feed forward and further dilute out the standard and pour more oil on the fire of confusion – as if immunology wasn't already plagued with enough nomenclature woes!
While I'm on the subject, it's also interesting to note that in 2011 (a couple of years after their formation) Adaptive did state that “one of the community standards that we try to adhere to is IMGT nomenclature and definitions”. More interesting perhaps is a poster from 2015 where they claim to actually be using IMGT nomenclature, despite clearly showing their edited version of it. In a way this is both reassuring and a little upsetting. They clearly know that the standard exists, and that it should be adhered to, but they presumably don't think the problems generated by adding characters into externally regulated gene symbols are serious enough to stop doing it. So close yet so far!
Adaptive is clearly full of lots of clever people who know the field very well. I'm certain that they've had exactly this discussion in the past, and – I hope – revisit it occasionally, perhaps when they get feedback. Because of that hope, I'd encourage other Adaptive customers, immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties to put the word in with your Adaptive representatives when you can. Let's see if we can convince them to take up the actual standard, instead of their well-meaning but ultimately frustrating derivative.

* Writing this section reminds me of a lecturer I had back in my undergrad, who was fond of quoting Keith Yamamoto's famous refrain: “scientists would rather share each other's underwear than use each other's nomenclature”. Much like she did, I tend to want to share it whenever any remotely related topic comes up, just because it's so good.


Saturday, 26 July 2014

See an error in a database? Let someone know!


Anyone who does any T cell receptor analysis will know IMGT (the ImmunoGenetics DataBase), the repository for all things TCR and Ig. You either use it, or you're one of those annoying people that makes me have to drag up all the tables of outdated nomenclatures.
Much like any resource, IMGT has its good points (simple and highly useful features like GENE-DB and LocusView in particular) and its bad (the less said about LIGM-DB the better).
However, again like any resource, it's only as good as the data stored in it. The data in it, as far as I can tell, is pretty damn good (and I use it a lot). I guess that's why they got to be in charge of all the data in the first place.
As such, when I recently found an error in a sequence*, I made sure to let them know: I certainly get a lot of mileage out of their data, it's only fair that I pay them back (and pay it forward to others) by ensuring the data that is there is good.
It's always a little nerve-inducing, being a PhD student emailing senior doctors and professors to let them know of a mistake you've discovered, but as hoped the information was very warmly received, and I'm told that the error will be corrected.
Science has to be self-correcting to stop errors lingering and spreading; firing a quick email off to correct an annotation might not seem like much, but if it stops one person going through the same short time of confusion that you went through unravelling the mistake then you've done a net service to the world.
* For the people that found their way here suffering from this particular error, here's what I found. I was looking at the TCR leader regions (the mono-spliced section of the transcript between the start of translation and the beginning of the V region which encodes the localisation signal peptide), when I noticed that one gene never seemed to produce functional transcripts. It turned out that while some of the entries for the human alpha gene TRAV29/DV5 were correct, if you downloaded the L-PART2 region alone the sequence produced actually contains a section of the start of the V gene. So, instead of reading 'GGGTAAAC', it reads 'GGGTAAACAGTCAACAGAAGAATGAT'. I just checked and it still gives the old sequence, but I assume there's a lag time for databases to update.

Sunday, 4 May 2014

Translating TCR sequences addendum: not as easy as FGXG

I recently wrote a blog post about the strategies used to translate T-cell receptor nucleotides en masse and extract (what can arguably be considered) the useful bit: the CDR3.

In that talk I touched on the IMGT-definition of the CDR3: it runs from the second conserved cysteine in the V region to the conserved FGXG motif in the J. Nice and easy, but we have to remember that it's the conserved bit that's key here: there are other cysteines to factor in, and there are a few germline J genes that don't use the typical FGXG motif.

However even that paints too simple a picture, so here's a quick follow up point:

These are human-imposed definitions, based more on convenience for human-understanding than biological necessity. The fact is that we might well produce a number of TCRs that don't make use of these motifs at all, but that are still able to function perfectly well; assuming the C/FGXG motifs have function, it's possible alternative motifs might compensate for these.

I have examples in my own sequence data that appear to clearly show these motifs having been deleted into, and then replaced with different nucleotides encoding the same residues. Alternative residues must certainly be introduced on occasion, and I'd be surprised if none of these make it through selection; we just don't see these because we aren't able to generate rules to computationally look for these.

I actually even recently found such an example with verified biological activity: this paper sequenced tetramer-sorted HIV-reactive T-cells, revealing one that contained an alpha chain using the CDR3 'CAVNIGFGNVLHCGSG'.*
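To show how such a sequence slips through the net, here's a minimal Python sketch of the sort of rule-based search described above. It's a crude stand-in rather than anyone's actual pipeline: it takes the nearest cysteine upstream of the first FGXG motif instead of properly identifying the conserved one, the function name is made up, and the first test sequence is a toy I've invented for illustration.

```python
import re

# The textbook J motif: phenylalanine, glycine, any residue, glycine.
FGXG = re.compile(r"FG.G")

def find_cdr3(protein, motif=FGXG):
    """Return the stretch from the nearest upstream cysteine through the
    end of the first motif match, or None if either landmark is missing."""
    hit = motif.search(protein)
    if hit is None:
        return None
    start = protein.rfind("C", 0, hit.start())
    if start == -1:
        return None
    return protein[start:hit.end()]

# Invented toy sequence with a conventional FGXG ending.
print(find_cdr3("YLCASSLGQGNTEAFFGQGTRLTVV"))  # CASSLGQGNTEAFFGQG

# The HIV-reactive alpha chain CDR3 quoted above: no FGXG, so nothing is found.
print(find_cdr3("CAVNIGFGNVLHCGSG"))  # None
```

The second sequence ends in 'CGSG' rather than 'FGXG', so a strict search like this one throws it away without a second glance.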

For the majority of analyses, looking for rare exceptions to rules probably won't make much difference. However as we increase the resolution and throughput of our experiments, we're going to find more and more examples of things which don't fit the tidy rules we made up when we weren't looking so deeply. If we're going to get the most out of our 'big data', we need to be ready for them.

* I was looking through the literature harvesting CDR3s, which reminds me of another point I want to make. Can I just ask, from the bottom of my heart, for people to put their CDR3s in sensible formats so that others can make use of them? Ideally, give me the nucleotide sequence. Bare minimum, give me the CDR3 sequence as well as which V and J were used (and while I stick to IMGT standards, I won't judge you if you don't - but do say which standards you are using!). Most of all, and I can't stress this enough, please please PLEASE make all DNA/amino acid sequences copyable.**


** Although spending valuable time copying out or removing PDF-copying errors from hundreds of sequences drives me ever so slightly nearer to a breakdown, it does allow me to play that excellent game of "what's the longest actual word I can find in biological sequences". For CDR3s, I'm up to a sixer with 'CASSIS'.