Showing posts with label TCR. Show all posts

Saturday, 10 November 2018

Making coding T cell receptor sequences from V-J-CDR3

If, like me, you work on T cell receptors, occasionally you're probably going to want to express a particular TCR in cells. However, you won't always have the sequence to hand, at either the nucleotide or the protein level.

Not a problem, you can sort this out. You can look up all the relevant germline sequences from IMGT, trim away all the non-used bits, add in the non-templated stuff, then manually stitch it all together and have a look to see if it still makes what you were expecting. You can do all that... or you can just use the code I wrote.

StiTChR does it all: give it a V gene, a J gene, and a CDR3 amino acid sequence and it'll look up, trim and stitch together all the relevant TCR nucleotide sequences for you, back-translating the non-templated region using the most frequent codon per residue. It also translates it all, and will run a quick rudimentary alignment against a known partial protein sequence if you have one, as a visual confirmation that it's made the right thing.
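To illustrate that back-translation step: the idea is just a lookup of each residue's most common codon. A minimal sketch, using a hand-picked subset of codon choices for illustration (stitchr's actual codon usage table will differ):

```python
# Back-translate a CDR3 amino acid sequence using one codon per residue.
# This codon table is a small illustrative subset, chosen by hand - it is
# not the codon usage data the real script works from.
MOST_FREQUENT_CODON = {
    'C': 'TGC', 'A': 'GCC', 'S': 'AGC', 'G': 'GGC',
    'F': 'TTC', 'L': 'CTG', 'Q': 'CAG', 'R': 'CGG',
}

def back_translate(aa_seq):
    """Return a nucleotide sequence encoding aa_seq, one codon per residue."""
    return ''.join(MOST_FREQUENT_CODON[aa] for aa in aa_seq)

print(back_translate('CASSF'))  # TGCGCCAGCAGCTTC
```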

You can then take the alpha/beta TCR sequences it generates, bang them into an expression vector (typically split by a T2A sequence or something) and transduce your cells of interest.

I wrote this code to save me a bit of time in future, but hopefully it can do the same for some of you!

Tuesday, 4 September 2018

The problem with Adaptive TCR data

I'm a big proponent of DIY TCR (and BCR) sequencing. It's the best way to be able to vouch that every step in the process has been done correctly; you can QC and query whatever steps you wish; it's typically more customisable to your specific hypotheses and research questions; and it's invariably cheaper. What's more, there are lots of great labs making and publishing such pipelines (including the one I helped develop back in London), so you don't even need to go to the effort of making one yourself.
However, there are a number of situations in which you might instead choose to outsource this task to a commercial supplier. The greater cost and loss of flexibility can be offset by scalability, reduced hands-on time and third-party guarantees, and by avoiding the need to build in-house capacity for sequencing and data processing, which brings its own savings in both money and time.
Without even needing to check, I can confidently say that Adaptive Biotech are foremost among the companies offering this as a service. As part of a few different projects I've recently been getting my feet wet analysing some large datasets produced by Adaptive, including both publicly available projects of theirs (accessed via their immunoSEQ portal) and data from samples that we've sent to them.
Generally speaking, I'm pretty happy with both the service and the data we've received. I love how they make a lot of their own data publicly accessible, and the frequency with which they publish cool and important papers. I like how they are making RepSeq available to labs that might otherwise not be able to leverage this powerful technology (at least those that can afford it). In almost every sense, it's a company that I am generally pretty in favour of.
However, in designing their analyses Adaptive have taken one massive liberty, which (while I'm sure it was undertaken with the best of intentions) stands to cause any number of problems, frustrations, and potential disasters - both to their customers and the field at large.
What is this heinous crime, this terrible sin they've committed? Could they be harvesting private data, releasing CDR3 sequences with coded messages, pooling all of our adaptive repertoire data in some bizarre arcane ritual? No. Instead they tried to make the TCR gene naming system make a little bit more sense (cue dramatic thunder sound effects).
It's a crime as old as biology, one particularly prevalent in immunology: you don't like the current gene naming system, so what do you do? Start a new one! A better, shinier one, with new features and definitely no downsides - it'll be so good it could even become the new standard!*
I know exactly why they did it too; when I worked on our own TCR analysis software and results in my PhD, I encountered the same problems. The TCR names are bothersome from a computing perspective. They don't sort right - either alphabetically or chromosomally. They don't all contain the same number of characters, so they don't line up nicely on an axis. They're generally just a bit disordered, which can be confusing. They're precisely the kind of thing no software engineer would design.
Adaptive's solution is however a classic engineering one. Here's a problem, let's fix it. 'TR' is almost 'TCR' but not quite – that's confusing, so let's just chuck a 'C' in there and make it explicit. Some V/J genes have extra hyphenated numbers – so let's give all of them hyphenated numbers. And hey, some gene groups have more than ten members – let's add leading zeros so they all sort nicely alphabetically. We'll take those annoying, seemingly arbitrary special cases, and bring them all into a nice consistent system. Bing bang bosh, problem solved.
This is all very well and good until you realise that this isn't about making something perfect, neat and orderly; we're talking about describing biology here, where complexity, redundancy and just plain messiness are par for the course. Having a bunch of edge cases that don't fit the rule basically is the rule!
Let's look at some examples, maybe starting at the beginning of the beta locus with the V gene that the rest of us knows as TRBV1. If you go looking for this in your Adaptive data (at least if you export it from their website as I did) then you might not find it straight away; instead, it goes by the name TCRBV01-01. Similarly TRBV15 becomes TCRBV15-01, TRBV27 → TCRBV27-01, and so on.
Sure, the names all look prettier now, but this approach is deeply problematic for a bunch of reasons. With respect to these specific examples, the hyphenated numbers aren't just applied to genes randomly; they denote those genes that are part of a subgroup containing more than one gene (meaning they share more than 75% nucleotide identity in the germline). You can argue this is an arbitrary threshold, but it is nevertheless useful; it allows a quick shorthand to roughly infer both evolutionary divergence times and current similarity, within that threshold. Adding hyphenated numbers to all genes washes out one of the few bits of information you could actually glean about a TCR or BCR gene just by looking at the name (along with approximate chromosomal position and potential degree of polymorphism, going off the allele number when present). Which genes fall in subgroups with multiple members also differs between species, which adds another level of usefulness to the current setup; appending '-XX' to all genes like Adaptive does makes it easier to become confused or make mistakes when comparing repertoires or loci of different organisms.
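To make the cleanup burden concrete, here's a rough sketch of converting Adaptive-style names back to IMGT symbols. The rules (strip the extra 'C', drop leading zeros, and remove the invented '-01' from genes that aren't in multi-member subgroups) are reverse-engineered from the examples above, and the single-member gene list covers only those mentioned in this post - a real converter would need the full IMGT gene table:

```python
import re

# Genes from the examples above that sit in single-member subgroups, so
# IMGT gives them no hyphenated number. Illustrative subset, not complete.
SINGLE_MEMBER = {'TRBV1', 'TRBV15', 'TRBV27'}

def adaptive_to_imgt(name):
    """Convert e.g. 'TCRBV01-01' back to the IMGT symbol 'TRBV1'."""
    name = name.replace('TCR', 'TR', 1)                 # TCRBV -> TRBV
    prefix, numbers = re.match(r'([A-Z]+)([\d-]+)', name).groups()
    parts = [str(int(p)) for p in numbers.split('-')]   # strip leading zeros
    gene = prefix + parts[0]
    if parts[1:] == ['1'] and gene in SINGLE_MEMBER:
        return gene                                     # drop the invented '-01'
    return gene + '-' + '-'.join(parts[1:]) if parts[1:] else gene

print(adaptive_to_imgt('TCRBV01-01'))  # TRBV1
print(adaptive_to_imgt('TCRBV06-01'))  # TRBV6-1
```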
The more important reason however has nothing to do with what incidental utility is lost or gained; the fact of the matter is that these genes have already been named! When it comes to asking what the corresponding gene symbol for a particular V, D or J sequence is, there is a correct answer. It has been agreed upon for years, internationally recognised and codified. People sat around in a committee and decided it.  
Whether you like it or not, HUGO and IMGT between them have got this covered, and we should all be using the agreed upon names. To do otherwise is to invite confusion, ambiguity and inaccuracies, weakening the utility of published reports and shared data. Gene name standardisation is hardly sexy, but it is important.
Admittedly Adaptive are not the only people guilty of ignoring the standardised gene names IMGT has gone to the trouble to lay out. Even now I still come across new papers where authors use old TCR gene nomenclatures (I'm looking at you, flow cytometrists!). I would however argue that it's especially troubling when Adaptive does it, as they are the data producers for large numbers of customers, and are quite possibly the first entry point into RepSeq for many of those. This means that a large body of data is being generated in the field with the wrong IDs, which in turn risks a whole host of errors during the necessary conversion to the correct format for publication or comparison with other datasets. Worse, it means that potentially a considerable fraction of new participants in the field are being taught the wrong conventions, which will feed forward, further dilute the standard and pour more oil on the fire of confusion – as if immunology wasn't already plagued with enough nomenclature woes!
While I'm on the subject, it's also interesting to note that in 2011 (a couple of years after their formation) Adaptive did state that “one of the community standards that we try to adhere to is IMGT nomenclature and definitions”. More interesting perhaps is a poster from 2015 where they claim to actually be using IMGT nomenclature, despite clearly showing their edited version of it. In a way this is both reassuring and a little upsetting. They clearly know that the standard exists, and that it should be adhered to, but they presumably don't think that adding characters to externally regulated gene symbols causes enough problems to be worth avoiding. So close yet so far!
Adaptive is clearly full of lots of clever people who know the field very well. I'm certain that they've had exactly this discussion in the past, and – I hope – revisit it occasionally, perhaps when they get feedback. Because of that hope, I'd encourage other Adaptive customers, immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties to put a word in with your Adaptive representatives when you can. Let's see if we can convince them to take up the actual standard, instead of their well-meaning but ultimately frustrating derivative.

* Writing this section reminds me of a lecturer I had back in my undergrad, who was fond of quoting Keith Yamamoto's famous refrain: “scientists would rather share each other's underwear than use each other's nomenclature”. Much like she did, I tend to want to share it whenever any remotely related topic comes up, just because it's so good.


Wednesday, 24 May 2017

Getting CDR3 nucleotide sequence out of Decombinator

I was recently asked whether CDR3translator (the CDR3 extraction component of the Decombinator suite of T-cell receptor analysis scripts) has an option to output the CDR3 nucleotide, rather than amino acid sequence.

There currently is not, and I don't think there's much call to institute it as an in-built feature, but I knocked together a bodge for them, and as it's an easy one-line change I'm posting it here in case anyone else has use for it:


This just exploits the way that the script currently finds CDR3s: it reconstructs the whole V/J nucleotide sequence from the five part Decombinator index (storing it in the 'nt' variable), translates this into its amino acid sequence ('aa') and then looks for the conserved motifs at the appropriate positions ('start_cdr3' and 'end_cdr3').

If you multiply each of these by 3 then you get the correct positions to extract the CDR3 nucleotide sequence, running from the conserved V-gene cysteine to the J-gene phenylalanine. If you want the GXG portion of the FGXG motif then you’ll need to add '+12' before the final closing square bracket.
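To make that concrete, the change boils down to a slice scaled by three. A toy sketch using the variable names described above (the sequence and positions are invented for illustration; in CDR3translator they come from the reconstructed receptor):

```python
# 'nt' is the reconstructed V/J nucleotide sequence; 'start_cdr3' and
# 'end_cdr3' are the amino acid positions of the conserved motifs.
# Toy values for illustration - not real TCR data.
nt = "TGTGCCAGCAGCTTC" + "GGCGAAGGC"   # pretend full reconstruction
start_cdr3 = 0                          # index of the conserved Cys (in aa)
end_cdr3 = 5                            # index just past the Phe (in aa)

cdr3_nt = nt[start_cdr3 * 3 : end_cdr3 * 3]   # the one-line change
print(cdr3_nt)  # TGTGCCAGCAGCTTC
```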

Wednesday, 1 March 2017

Extracting antigen-specific TCRs from vdjdb on the command line

My previous lab and I recently published a review on the techniques and possibilities of analysing TCR repertoire data produced from high-throughput sequencing.

By and large I'm exceedingly happy with it: apart from a couple of missed references and one very unfortunate mix-up regarding the accessibility, I'm very pleased with how it came out (and hope it will prove useful!).

One thing that I'm particularly pleased we included (in spite of neither yet having a published description) is the pair of manually curated TCR databases that have recently emerged: VDJdb and McPAS-TCR, in which you can find a small (but growing) host of TCRs of known specificity and/or disease association. We thought it was important to get these out there as soon as possible, as this is a rapidly changing field that is sorely in need of such efforts at standardisation and resource development.

With that in mind, I've been playing around with both of these, and thought I'd share some of the bare bones of the bash code I've been using to pull out sequences related to epitopes I'm interested in. Here is my quick vignette using VDJdb to pull out HIV-reactive TCR sequences - and even then just the fields of the database I'm interested in - using basically just default terminal commands.
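As a sketch of the same kind of filtering (in Python here rather than the bash of the vignette itself), assuming the tab-separated VDJdb flat-file export - the column names below ('antigen.species', 'gene', 'cdr3', 'v.segm', 'j.segm') are from memory, so check them against the header of your own download:

```python
import csv

def hiv_tcrs(path, fields=('gene', 'cdr3', 'v.segm', 'j.segm')):
    """Yield the chosen fields for every HIV-associated row in a VDJdb TSV."""
    with open(path, newline='') as handle:
        for row in csv.DictReader(handle, delimiter='\t'):
            # keep rows whose annotated antigen comes from HIV
            if 'HIV' in row.get('antigen.species', ''):
                yield tuple(row[field] for field in fields)

# for record in hiv_tcrs('vdjdb.txt'):
#     print('\t'.join(record))
```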

Sunday, 28 February 2016

Early 2016 immunology conferences


I'm very lucky to have attended two fantastic immunology conferences this year: the Keystone Systems Immunology conference, hosted at Big Sky in Montana, USA, and the Next Gen Immunology conference at the Weizmann Institute in Israel. I don't want to go into the details of either conference in any great depth (as frankly the standard of both was so high any thorough recounting would shortly turn into just a complete retelling of each event), but I thought it might be nice to recount a few of my impressions and recollections.

The Systems Immunology meeting was my first Keystone, fitting what I understand to be the regular format; talks in the morning and late afternoon, with a gap over lunch to allow people to sneak off to the slopes. This definitely seemed to be the strategy of a healthy proportion of the attendees - even a ski-novice like me managed to make it up there once or twice!

The Next Gen Immunology conference was a day shorter, but felt like a longer event by merit of absolutely jam-packing the talks and events in! I'm not sure that I've ever been to a busier conference, with talks running from before nine in the morning to past seven in the evening (albeit with a number of breaks to allow consumption of excessive amounts of coffee, even by general science standards). It also had the slickest branding of any conference I've ever been to, complete with its own celebrity-filled introductory video (which was originally emailed round but has sadly now been made private!).

There were some interesting thematic differences between the conferences, despite having somewhat similar stated scopes. The Keystone talks were seemingly linked mostly in their approaches, displaying a fine array of highly-quantitative, systems level experiments. There was a definite emphasis on the single-cell technologies; it seemed that rare was the talk that wasn't employing at least either single-cell RNA-seq or 40-odd colour mass cytometry. The NGI talks on the other hand were much more diverse in terms of techniques employed, but converged more on the general area of research, which was the microbiome and the mucosal immune response.

Another observation that tickled me was the difference in plotting standards between the two conferences. At the Systems meeting, with (I think it's fair to say) a greater influence or prevalence of mathematical and computational backgrounds, there was a definite enrichment of LaTeX-produced slides and ggplot-produced plots, whereas Prism and Excel were far more common in Israel. There was even a fairly large smattering of talks using Comic Sans, which came as a bit of a surprise.

All together, I've kicked off the year with some fantastic science. I've ticked off a number of speakers that I've been wanting to hear for years – including Ron Germain, Rick Flavell and David Baltimore – and added a few more to the list of stories I need to hear more from, like Aviv Regev's incredible bulk tumour scRNA-seq data, or the wonderful things that Ton Schumacher can do with T-cells.

Crucially, I also heard a little bit about the growing story of assembling T-cell receptor sequences out of scRNA-seq data (to be the topic of a future post I feel), which brings us to the shallows of one of the major goals of repertoire research; getting paired clonotype and phenotype in one fell swoop.

Monday, 7 December 2015

The key to finding TCR sequences in RNA-seq data

I had previously written a short blog post touching on how I'd tried to mine some PBMC RNA-seq data (from the ENCODE project) for rearranged T-cell receptor genes, to try and open up this huge resource for TCR repertoire analysis. However, I hadn't gotten very far, on account of finding very few TCR sequences per file.

That sets the background for an extremely pleasant surprise this morning, when I found that Scott Brown, Lisa Raeburn and Robert Holt from Vancouver (the latter of whom is notable for having produced one of the very earliest high-throughput sequencing TCR repertoire papers) had published a very nice paper doing just that!

This is a lovely example of different groups seeing the same problem and coming up with different takes. I saw an extremely low rate of return when TCR-mining in RNA-seq data from heterogeneous cell types, and gave up on it as a search for needles in a haystack. The Holt group saw the same problem, and simply searched more haystacks!

This paper tidily exemplifies the re-purposing of biological datasets to allow us to ask new biological questions (something that I consider a practical and moral necessity, given the complexity of such data and the time and costs involved in their generation).

Moreover, they do some really nice tricks, like estimating TCR transcript proportions in other data sets based on constant region usage, investigating TCR diversity relative to CD3 expression, testing on simulated RNA-seq data sets as a control, looking for public or known-specificity receptors, and inferring possible alpha-beta pairs by checking each sample's possible combinations for their presence in at least one other sample (somewhat akin to Harlan Robins' pairSEQ approach).

All in all, a very nice paper indeed, and I hope we see more of this kind of data re-purposing in the field at large. Such approaches could certainly be adapted for immunoglobulin genes. I also wonder if, given whole-genome sequencing data from mixed blood cell populations, we might even be able to do a similar analysis on rearranged variable antigen receptors from gDNA.

Tuesday, 17 November 2015

Heterogeneity in the polymerase chain reaction


I've touched briefly on some of the insights I made writing my thesis in a previous blog post. The other thing I've been doing a lot of over the last year or so is writing and contributing to papers. I've been thinking that it might be nice to write a few little blog posts on these, to give a little background information on the papers themselves, and maybe (in the theme of this blog) share a little insight into the processes that went into making them.
The paper I'll cover in this piece was published in Scientific Reports in October. I won't go into great detail on this one, not least because I'm only a (actually, the) middle author on it: this was primarily the excellent work of my friends and colleagues Katharine Best and Theres Oakes, who performed the bulk of the analysis and wet-lab work respectively (although I also did a little of both). Also, our supervisor Benny Chain summarised the findings of the article itself on his own blog, which covers the principles very succinctly.
Instead, I thought I'd write this blog to share that piece of information that I always wonder about when I read a paper: what made them look at this, what put them on this path? This is where I think I made my major contribution to this paper, as (based on my recollections) it began with observations made during my PhD.
My PhD primarily dealt with the development and application of deep-sequencing protocols for measuring T-cell receptor (TCR) repertoires (which, when I started, there were not many published protocols for). As a part of optimising our library preparation strategies, I thought that we might use random nucleotide sequences in our PCR products – which were originally added to increase diversity, overcoming a limitation in the Illumina sequencing technology – to act as unique molecular barcodes. Basically, adding random sequences to our target DNA before amplification uniquely labels each molecule. Then, in the final data we can infer that any matching sequences that share the same barcode are probably just PCR duplicates, if we have enough random barcodes*, meaning that sequence was less prevalent in the original sample than one might think based on raw read counts. Not only does this provide better quantitative data, but by looking to see whether different sequences share a barcode we can find likely erroneous sequences produced during PCR or sequencing, improving the qualitative aspects of the data as well. Therefore we thought (and still do!) that we were on to a good thing.
(Please note that we are not saying that we invented this, just that we have done it: it has of course been done before, both in RNA-seq (e.g. Fu et al, 2011 and Shiroguchi et al, 2012) at large and in variable antigen receptor sequencing (Weinstein et al, 2009), but it certainly wasn't widespread at the time; indeed there's really only one other lab I know of even now that's doing it (Shugay et al, 2014).)
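The collapsing logic at the heart of this is simple, even though the real pipeline also has to error-correct barcodes and sequences. A minimal sketch of the counting step, under the assumption that each read arrives as a (barcode, TCR sequence) pair:

```python
from collections import defaultdict

def collapse(reads):
    """Collapse (barcode, sequence) reads: the number of unique barcodes per
    sequence approximates the true molecule count, and raw reads divided by
    barcodes gives a per-sequence 'duplication rate'."""
    barcodes = defaultdict(set)
    raw_reads = defaultdict(int)
    for barcode, seq in reads:
        barcodes[seq].add(barcode)   # PCR duplicates collapse into one entry
        raw_reads[seq] += 1
    return {seq: (len(bcs), raw_reads[seq] / len(bcs))
            for seq, bcs in barcodes.items()}

reads = [('AAAT', 'CASSF'), ('AAAT', 'CASSF'),   # same molecule, amplified
         ('CGGA', 'CASSF'),                      # a second CASSF molecule
         ('TTAC', 'CASSG')]
print(collapse(reads))  # {'CASSF': (2, 1.5), 'CASSG': (1, 1.0)}
```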
However, in writing the scripts to 'collapse' the data (i.e. remove artificial sequence inflation due to PCR amplification, and throw out erroneous sequences) I noticed that the degree to which different TCR sequences were amplified followed an interesting distribution:


Here I've plotted the raw, uncollapsed frequency of a given TCR sequence (i.e. the number of reads containing that TCR, here slightly inaccurately labelled 'clonal frequency') against that value divided by the number of random barcodes it was associated with, giving a 'duplication rate' (not great axis labels I agree, but this is pulling the plots straight out of a lab meeting I gave three years ago). The two plots show the same data, with a shortened X axis on the right to show the bulk of the spread better.
We can see that above a given frequency – in this case about 500, although it varies – we observe a 'duplication rate' of around 70. This means that above a certain size, sequences are generally amplified at a rate proportional to their prevalence (give or take the odd outlier): every input molecule of cDNA gets amplified and observed seventy times. This is the scenario we'd generally imagine for PCR. However, below that variable threshold there is a very different, very noisy picture, where the amount to which a sequence is found to be amplified and observed is not related to the collapsed prevalence. This was the bait on the hook that led our lab down this path.
Now, everyone knows PCR doesn't behave like the diagrams say it should. That's what everyone always says (usually as they stick another gel picture containing mysteriously sized bands into their lab books). However, people have rarely looked at what's actually going on. There's a bit of special PCR magic that goes on, and a million different target and reaction variables that might affect things: you just optimise your reaction until your output looks like what you'd expect. It's only with the relatively recent advances in DNA sequencing technology, which let us look at exactly what molecules are being made in the reaction, that we can start to get actual data showing just how unlike the schematics the reaction can in fact behave.
This is exactly what Katharine's paper chases up, applying the same unique molecular barcoding strategy to TCR sequences from both polyclonal and monoclonal** T-cells. I won't go into the details, because hey, you can just read the paper (which says it much better), but the important thing is that this variability is not due to the standard things you might expect like GC content, DNA motifs or amplicon length, because it happens even for identical sequences. It doesn't matter how well you control your reactions, the noise in the system breeds variability. This makes unique molecular barcoding hugely important, at least if you want accurate relative quantitation of DNA species across the entire dynamic range of your data.
* Theoretically about 16.7 million in our case, or 4^12, as we use twelve random nucleotides in our barcodes.

** Although it's worth saying that while the line used, KT-2, is monoclonal, that doesn't mean the TCR repertoire is exactly as clean as you'd expect. T-cell receptor expression in T-cell lines is another thing that isn't as simple as the textbook diagrams pretend.