Saturday, 10 November 2018

Making coding T cell receptor sequences from V-J-CDR3

If, like me, you work on T cell receptors, occasionally you’re probably going to want to express a particular TCR in cells. However you're not always going to have the sequence, at either nucleotide or protein sequence level.

Not a problem, you can sort this out. You can look up all the relevant germline sequences from IMGT, trim away all the non-used bits, add in the non-templated stuff, then manually stitch it all together and have a look to see if it still makes what you were expecting. You can do all that... or you can just use the code I wrote.

StiTChR does it all: give it a V gene, a J gene, and a CDR3 amino acid sequence and it'll look up, trim and stitch together all the relevant TCR nucleotide sequences for you, back-translating the non-templated region using the most frequent codon per residue. It also translates it all, and will run a quick rudimentary alignment against a known partial protein sequence if you have one for a visual confirmation that it's made the right thing.

You can then take the alpha/beta TCR sequences it generates, bang them into an expression vector (typically split by a T2A sequence or something) and transduce your cells of interest.

I wrote this code to save me a bit of time in future, but hopefully it can do the same for some of you!

Tuesday, 4 September 2018

The problem with Adaptive TCR data

I'm a big proponent of DIY TCR (and BCR) sequencing. It's the best way to be able to vouch that every step in the process has been done correctly; you are able to QC and query whatever steps you wish; it's typically more customisable to your specific hypotheses and research questions, and; it's invariably cheaper. What's more, there's lots of great labs making and publishing such pipelines (including the one I helped develop back in London), so you don't even need to go to the effort of making one yourself.
However there are a number of situations in which you might instead choose to outsource this task to a commercial supplier. The greater cost and loss of flexibility can be replaced with scalability, reduced hands on time, third party guarantees, and avoid the need to build capacity for sequencing and data processing in house, which brings its own savings and time benefits.
Without even needing to check I can confidently say that Adaptive Biotech are foremost among the companies offering this as a service. As part of a few different projects I've recently been getting my feet wet analysing some large datasets produced from Adaptive, including both publicly available projects of theirs (accessed via their immunoSEQ portal) and data from samples that we've sent to them.
Generally speaking, I'm pretty happy with both the service and the data we've received. I love how they make a lot of their own data publicly accessible, and the frequency with which they publish cool and important papers. I like how they are making RepSeq available to labs that might otherwise not be able to leverage this powerful technology (at least those as can afford it). In almost every sense, it's a company that I am generally pretty in favour of.
However, in designing their analyses Adaptive have taken one massive liberty, which (while I'm sure was undertaken with the best of intentions) stands to cause any number of problems, frustrations, and potential disasters - both to their customers and the field at large.
What is this heinous crime, this terrible sin they've committed? Could they be harvesting private data, releasing CDR3 sequences with coded messages, pooling all of our adaptive repertoire data in some bizarre arcane ritual? No. Instead they tried to make the TCR gene naming system make a little bit more sense (cue dramatic thunder sound effects).
It's a crime as old as biology, one particularly prevalent in immunology: you don't like the current gene naming system, so what do you do? Start a new one! A better, shinier one, with new features and definitely no downsides - it'll be so good it could even become the new standard!*
I know exactly why they did it too; when I worked on our own TCR analysis software and results in my PhD, I encountered the same problems. The TCR names are bothersome from a computing perspective. They don't sort right - either alphabetically or chromosomally. They don't contain the same number of characters as each other, so they don't line up nice on an axis. They're generally just a bit disordered, which can be confusing. They're precisely not what a software engineer would design.
Adaptive's solution is however a classic engineering one. Here's a problem, let's fix it. 'TR' is almost 'TCR' but not quite – that's confusing, so let's just chuck a 'C' in there and make it explicit. Some V/J genes have extra hyphenated numbers – so let's give all of them hyphenated numbers. And hey, some gene groups have more then ten members – let's add leading zeros so they all sort nice and alphabetically. We'll take those annoying seemingly arbitrary special cases, and bring them all into a nice consistent system. Bing bang bosh, problem solved.
This is all very well and good until you realise that this isn't about making something perfect, neat and orderly; we're talking about describing biology here, where complexity, redundancy and just plain messiness are par for the course. Having a bunch of edge cases that don't fit the rule basically is the rule!
Let's look at some examples, maybe starting at the beginning of the beta locus with the V gene that the rest of knows as TRBV1. If you go looking for this in your Adaptive data (at least if you export it from their website as I did) then you might not find it straight away; instead, it goes by the name TCRBV01-01. Similarly TRBV15 becomes TCRBV15-01, TRBV27 → TCRBV27-01, and so on.
Sure, the names all look prettier now, but this approach is deeply problematic for a bunch of reasons. With respect to these specific examples, the hyphenated numbers aren't just applied to genes randomly, it denotes those genes who are part of a subgroup containing more than one gene (meaning they share more than 75% nucleotide identity in the germline). You can argue this is an arbitrary threshold, but it is still nevertheless useful; it allows a quick shorthand to roughly infer both evolutionary divergence times and current similarity, within that threshold. Adding hypenated numbers to all genes washes out one of the few bits of information you could actually glean about a TCR or BCR gene just by looking at the name (along with approximate chromosomal position and potential degree of polymorphism, going off the allele number when present). Which genes fall in subgroups with multiple members also differs between species, which adds another extra level of usefulness to the current setup; appending '-XX' to all genes like Adaptive makes it easier to become confused or make mistakes when comparing repertoires or loci of different organisms.
The more important reason however has nothing to do with what incidental utility is lost or gained; the fact of the matter is that these genes have already been named! When it comes to asking what the corresponding gene symbol for a particular V, D or J sequence is, there is a correct answer. It has been agreed upon for years, internationally recognised and codified. People sat around in a committee and decided it.  
Whether you like it or not, HUGO and IMGT between them have got this covered, and we should all be using the agreed upon names. To do otherwise is to invite confusion, ambiguity and inaccuracies, weakening the utility of published reports and shared data. Gene name standardisation is hardly sexy, but it is important.
Admittedly Adaptive are not the only people guilty of ignoring the standardised gene names IMGT has gone to the trouble to lay out. Even now I still come across new papers where authors use old TCR gene nomenclatures (I'm looking at you flow cytometrists!). I would however argue that it's especially troubling when Adaptive does it, as they are the data producers for large numbers of customers, and are quite possible the first entry point into RepSeq for many of those. This means that mean a large body of data is being generated in the field with the wrong IDs. This in turns risks a whole host of errors during the necessary conversion to the correct format for publication or comparison with other datasets. Worse, it means that potentially a considerable fraction of new participants in the field are being taught the wrong conventions, which will feed forward and further dilute out the standard and pour more oil on the fire of confusion – as if immunology wasn't already plagued with enough nomenclature woes!
While I'm on the subject, it's also interesting to note that in 2011 (a couple years after their formation) Adaptive did state that “one of the community standards that we try to adhere to is IMGT nomenclature and definitions”. More interestingly perhaps is a poster from 2015 where they claim to actually be using IMGT nomenclature, despite clearly showing their edited version of it. In a way this is both reassurring, and a little upsetting. They clearly know that the standard exists, and that it should be adhered to, but they presumably don't think the problems generated by adding characters into externally regulated gene symbols is problematic enough to not do. So close yet so far!
Adaptive is clearly full of lots of clever people who know the field very well. I'm certain that they've had exactly this discussion in the past, and – I hope – revisit it occasionally, perhaps when they get feedback. Because of that hope, I'm encourage other Adaptive customers, immunoSEQ users, and generally any RepSeq/AIRR-seq interested parties to put the word in with your Adaptive representatives when you can. Let's see if we can convince them to take up the actual standard, instead of their well-meaning but ultimately frustrating derivative.

* Writing this section reminds me of a lecturer I had back in my undergrad, who was fond of quoting Keith Yamamoto's famous refrain: “scientists would rather share each other's underwear than use each other's nomenclature”. Much like she did, I tend to want to share it whenever any remotely related topic comes up, just because it's so good.


Sunday, 29 July 2018

3D printed 15 to 50 ml tube rotator converter

Sometimes you just need to leave something in the lab, but you might not always have the right sized rotator brackets. This is the situation that pops up in my lab, where we have a rotator in the cold room - one of those old classics which is probably older than me and will outlive us all - but which only fits 15 ml tubes. I decided to solve this problem over

Enter the 3D printer. I knocked together a couple of quite prototype models in Tinkercad, then one quick test and a re-tweak later I've got a working adapter, letting you rotate 50 ml conicals in 15 ml brackets. I've put it up on Thingiverse so anyone can download the STL and make it themselves.

This is the joy of 3D printers; I went from a problem to a solution after an hour's work. There's probably a whole host of other little problems or inefficiencies that could be solved in the lab with the addition of a custom bit of kit - we just need to be clever about thinking what those are and how to build them!


Sunday, 11 February 2018

High-throughput immunopeptidomics

In my PhD I focused on studying the complexity of the immune system at the level of the T cell repeptor. Recently I’ve been getting in to what happens on the other side of the conversation as well; in addition to looking at TCR repertoires I’m increasingly playing with MHC-bound peptide repertoires too.

Immunopeptidomics is a super interesting field, with a great deal of promise, but it’s got a much higher barrier to entry for research groups relative to something like AIRR-seq. Nearly every lab can do PCR, and access to deep-sequencing machines or cores becomes ever cheaper and more commonplace. However not every lab has expertise with fiddly pull downs, while only a tiny fraction can do highly sensitive mass spec. This is why efforts to make immunopeptide data generation and sharing easier should be suitably welcomed.

One of the groups whose work commendably contributes to both of these efforts is that of Michal Bassani-Sternberg. For sharing, she consistently makes all of her data available (and is seemingly a senior founder and major contributor to the recent SysteMHC Atlas Project), while for generation her papers give clear and thorough technical notes, which aid in reproducibility.

However from the generation perspective this paper (which came out at the end of last year in Mol. Cell Proteomics) describes a protocol which – through application of sensible experimental design – should result in the easier production of immunopeptidomic data, even from more limited samples.

The idea is to basically increase the throughput of the methods by hugely reducing the number of handling steps and time required to do the protocol. Samples are mushed up, lysed, spun, and then run through a variety of stacked plates. The first (if required) catches irrelevant, endogenous antibodies in the lysates; the next catches MHC class I (MHC-I) peptide complexes via bead-cross-linked antibodies; the next similarly catches pMHC-II, while the final well catches everything else (giving you lovely sample-matched gDNA and proteomes to play with, should you choose). Each plate of pMHC can then be taken and treated with acid to elute the peptides from their grooves, before purification and mass spec. It’s a nice neat solution, which supposedly can all be done with readily commercially available goodies (although how much all these bits and bobs cost I have no idea).

Crucially it means that you get everything you might want (peptides from MHC-I/-II, plus the rest of the lysates) in separate fractions, from a single input sample, in a protocol that spans hours rather then days. Having it all done in one pass helps boost recovery from limited samples, which is always nice for say clinical material. Although I should say, ‘limited’ is a relative term. For people used to dealing with nice, conveniently amplifiable nucleic acids, tens to thousands of cells may be limiting. Here, they managed to go down as low as 10 million. (Which is not to knock it, as this is still much much better then hundreds of millions to billions of cells which these experiments can sometimes require. I don’t want everyone to go away thinking about repurposing their collection of banked Super Rare But Sadly Impractically Tiny tissue samples here.)

So on technical merit alone, it’s already a pretty interesting paper. However, there’s also a nice angle where they test out their new protocol on an ovarian carcinoma cell line with or without IFNg treatment, which tacks on a nice bit of biology to the paper too.

You see the things you might expect – like a shift in peptides seemingly produced by degradation from the standard proteasome to more of those produced by the immunoproteasome – and some you might not. Another nice little observation which follows on perfectly from this is that you also see an alteration in the abundance of peptides presented by different HLA alleles: for instance the increased  chemotryptic-like degradation of the immunoproteasome favours the loading of HLA-B*07:02 molecules, due to making more peptides with the appropriate motif.

My favourite observation however relates to the fact that there’s a consistent quantitative and qualitative shift in peptidomes between IFNg treated cells and mock. This raises an interesting possibility to me, about what should be possible in the near future, as we iron out the remaining wrinkles in the methodologies. Not only should we learn about what proteins are being expressed, based on which proteins those peptides are derived from, but we should be able to infer something about what cytokines those cells have been expressed to, based on how those peptides have been processed and presented.

Thursday, 8 February 2018

Bulk downloading proteome files from UniProt using Python

It's that time again, where the following has happened:
  1. I want to do some niche bioinformatics related thing
  2. I cobble together a quick script to do said thing
  3. I throw that script up on the internet on the offchance it will save someone else the time of doing 2
It's a little shift of target and scale from a similar previous post (in which I used Python to extract specific DNA sequences from UCSC. This time I've been downloading a large number of proteome files from UniProt.

It's all explained in the docstring, but the basic idea is that you go on UniProt, search for the proteomes you want, and use their export tool to download tsv files containing the unique accession numbers with identify the data you're after. Then you simply run this script in the same directory; it takes those accessions, turns them in to URLs, downloads the FASTA data at that address and outputs it to new FASTA files on your computer, with separate files named after whatever the tsv files were named.

The best thing about this is you can download multiple different lists of accessions, and have them output to separate files. Say maybe you have a range of pathogens you're interesting in, each with multiple proteomes banked; this way you end up with one FASTA file for each, containing as many of their proteomes as you felt like including in your search.