I'm
a big proponent
of DIY TCR (and BCR) sequencing. It's the best way to be able to
vouch that every step in the process has been done correctly; you are
able to QC and query whatever steps you wish; it's typically more
customisable to your specific hypotheses and research questions; and
it's invariably cheaper. What's more, there are lots of great labs
making and publishing such pipelines (including the one I
helped develop back in London),
so you don't even need to go to the effort of making one yourself.
However, there are a
number of situations in which you might instead choose to outsource
this task to a commercial supplier. The greater cost and loss of
flexibility can be offset by scalability, reduced hands-on time,
third-party guarantees, and no need to build capacity for
sequencing and data processing in-house, which brings its own savings
and time benefits.
Without even needing
to check I can confidently say that Adaptive Biotech are foremost
among the companies offering this as a service. As part of a few
different projects I've recently been getting my feet wet analysing
some large datasets produced by Adaptive, including both publicly
available projects of theirs (accessed via their immunoSEQ portal)
and data from samples that we've sent to them.
Generally speaking,
I'm pretty happy with both the service and the data we've received. I
love how they make a lot of their own data publicly accessible, and
the frequency with which they publish cool and important papers. I
like how they are making RepSeq available to labs that might
otherwise not be able to leverage this powerful technology (at least
those that can afford it). In almost every sense, it's a company that I
am generally pretty in favour of.
However, in designing
their analyses Adaptive have taken one massive liberty, which (while
I'm sure it was undertaken with the best of intentions) stands to cause
any number of problems, frustrations, and potential disasters - both
to their customers and the field at large.
What is this heinous
crime, this terrible sin they've committed? Could they be harvesting
private data, releasing CDR3 sequences with coded messages, pooling
all of our adaptive repertoire data in some bizarre arcane ritual?
No. Instead they tried to make the TCR gene naming system make a
little bit more sense (cue dramatic thunder sound effects).
It's
a crime as old as biology, one particularly prevalent in immunology:
you don't like the current gene naming system, so what do you do?
Start a new one! A better, shinier one,
with new features and definitely no downsides - it'll be so good it
could even become the new standard!*
I know exactly why
they did it too; when I worked on our own TCR analysis software and
results in my PhD, I encountered the same problems. The TCR names are
bothersome from a computing perspective. They don't sort right -
either alphabetically or chromosomally. They don't contain the same
number of characters as each other, so they don't line up nice on an
axis. They're generally just a bit disordered, which can be
confusing. They're precisely not what a software engineer would
design.
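To make the sorting complaint concrete, here is a minimal sketch in Python. The `natural_key` helper is a hypothetical name of my own, not part of any TCR package. It shows how a plain alphabetical sort misplaces TRBV10-1, and how a "natural" sort key puts the standard names in numerical order without renaming anything. Note that it only fixes numerical ordering, and says nothing about chromosomal order.

```python
import re


def natural_key(gene):
    """Split a gene name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", gene)]


# A handful of human TRBV gene names, in no particular order.
genes = ["TRBV10-1", "TRBV2", "TRBV1", "TRBV9", "TRBV6-5", "TRBV6-1"]

# Plain string sorting puts TRBV10-1 straight after TRBV1.
print(sorted(genes))

# Sorting with the natural key puts TRBV10-1 after TRBV9.
print(sorted(genes, key=natural_key))
```

The ordering problem can be handled at the point of sorting or plotting, leaving the names themselves alone.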
Adaptive's solution
is however a classic engineering one. Here's a problem, let's fix it.
'TR' is almost 'TCR' but not quite – that's confusing, so let's
just chuck a 'C' in there and make it explicit. Some V/J genes have
extra hyphenated numbers – so let's give all of them hyphenated
numbers. And hey, some gene groups have more than ten members –
let's add leading zeros so they all sort nice and alphabetically.
We'll take those annoying seemingly arbitrary special cases, and
bring them all into a nice consistent system. Bing bang bosh, problem
solved.
This is all very well
and good until you realise that this isn't about making something
perfect, neat and orderly; we're talking about describing biology
here, where complexity, redundancy and just plain messiness are par
for the course. Having a bunch of edge cases that don't fit the rule
basically is the rule!
Let's
look at some examples, maybe starting at the beginning of the beta
locus with the V gene that the rest of the field
knows as TRBV1.
If you go looking for this in your Adaptive data (at least if you
export it from their website as I did) then you might not find it
straight away; instead, it goes by the name TCRBV01-01.
Similarly TRBV15 becomes TCRBV15-01, TRBV27 →
TCRBV27-01, and so on.
Sure,
the names all look prettier now,
but this approach is deeply problematic for a bunch of reasons. With
respect to these specific examples, the hyphenated numbers aren't
applied to genes at random; they denote genes that are part
of a subgroup containing more than one gene (meaning they share
more than 75% nucleotide identity in the germline).
You can argue this
is an arbitrary threshold, but it is
nevertheless useful; it allows a quick
shorthand to roughly infer both evolutionary divergence times and
current similarity, within that threshold.
Adding hyphenated numbers to all genes washes out one of the few bits
of information you could actually glean
about a TCR or BCR gene just by looking at the name (along with
approximate chromosomal position and potential degree of
polymorphism, going off the allele number when present). Which
genes fall in subgroups with multiple members also differs between
species, which adds another level of usefulness to the current
setup; appending '-XX' to all genes as Adaptive does makes it easier to
become confused or make mistakes when comparing repertoires or loci
of different organisms.
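One practical consequence is that converting Adaptive-style names back to the standard ones can't be done by pattern matching alone: you also need to know which genes sit in single-member subgroups. The following is a minimal sketch in Python, assuming beta-chain V and J names of the form shown above. The function name, the regular expression and the `SINGLE_MEMBER_GENES` set are all hypothetical, and the set deliberately contains only the three genes mentioned in this post, so any other single-member gene would be converted wrongly. A real conversion would need the full list for the relevant locus and species, plus handling for alleles and any other name formats that appear in the export.

```python
import re

# Illustrative only: the single-member subgroup genes named in this post.
# A real lookup needs every such gene for the locus and species in question.
SINGLE_MEMBER_GENES = {"TRBV1", "TRBV15", "TRBV27"}

# Assumed shape of an Adaptive-style beta-chain name, e.g. TCRBV06-05.
ADAPTIVE_PATTERN = re.compile(r"^TCRB([VJ])(\d+)-(\d+)$")


def adaptive_to_imgt(name):
    """Convert an Adaptive-style TCR beta gene name to an IMGT-style one."""
    match = ADAPTIVE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised gene name format: {name}")
    region, subgroup, member = match.groups()
    # Drop the inserted 'C' and the leading zeros.
    base = f"TRB{region}{int(subgroup)}"
    # Single-member subgroups carry no hyphenated number in the standard name.
    if base in SINGLE_MEMBER_GENES:
        return base
    return f"{base}-{int(member)}"


for adaptive_name in ["TCRBV01-01", "TCRBV15-01", "TCRBV27-01", "TCRBV06-05"]:
    print(adaptive_name, "->", adaptive_to_imgt(adaptive_name))
```

The lookup table is the telling part: it is exactly the information that the blanket '-XX' suffix throws away.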
The
more important reason however has nothing to do with what incidental
utility is lost or gained; the fact of the
matter is that these genes have already
been named! When it comes to asking
what the corresponding gene symbol for a particular V, D or J
sequence is, there is a correct answer. It has been agreed upon for
years, internationally recognised and codified. People sat around in
a committee and decided it.
Whether you like it
or not, HUGO and IMGT between them have got this covered, and we
should all be using the agreed upon names. To do otherwise is to
invite confusion, ambiguity and inaccuracies, weakening the utility
of published reports and shared data. Gene name standardisation is
hardly sexy, but it is important.
Admittedly Adaptive
are not the only people guilty of ignoring the standardised gene
names IMGT has gone to the trouble to lay out. Even now I still come
across new papers where authors use old TCR gene nomenclatures (I'm
looking at you, flow cytometrists!). I would however argue that
it's especially troubling when Adaptive does it, as they are the data
producers for large numbers of customers, and are quite possibly the
first entry point into RepSeq for many of those. This means that
a large body of data is being generated in the field with the wrong
IDs. This in turn risks a whole host of errors during the necessary
conversion to the correct format for publication or comparison with
other datasets. Worse, it means that potentially a considerable
fraction of new participants in the field are being taught the wrong
conventions, which will feed forward and further dilute out the
standard and pour more oil on the fire of confusion – as if
immunology wasn't already plagued with enough nomenclature woes!
While
I'm on the subject, it's also interesting to note that in 2011 (a
couple years after their formation) Adaptive did state that “one
of the community standards that we try to adhere to is IMGT
nomenclature and definitions”. More interesting perhaps
is a poster
from 2015 where they claim to actually be
using IMGT nomenclature, despite clearly showing their edited version
of it. In a way this is both reassuring and a little upsetting.
They clearly know that the standard exists, and that it should be
adhered to, but they presumably don't think the problems generated by
adding characters into externally regulated gene symbols are
serious enough to stop doing it. So close yet so far!
Adaptive is clearly
full of lots of clever people who know the field very well. I'm
certain that they've had exactly this discussion in the past, and –
I hope – revisit it occasionally, perhaps when they get feedback.
Because of that hope, I'd encourage other Adaptive customers,
immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties
to put the word in with your Adaptive representatives when you can. Let's see if we
can convince them to take up the actual standard, instead of their
well-meaning but ultimately frustrating derivative.
*
Writing this section reminds me of a lecturer I had back in my
undergrad, who was fond of quoting Keith Yamamoto's famous refrain:
“scientists would
rather share each other's underwear than use each other's
nomenclature”.
Much like she did, I
tend to want to share it whenever any remotely related topic comes
up, just because it's so good.