
Sunday, 11 February 2018

High-throughput immunopeptidomics

In my PhD I focused on studying the complexity of the immune system at the level of the T cell receptor. Recently I’ve been getting into what happens on the other side of the conversation as well: in addition to looking at TCR repertoires, I’m increasingly playing with MHC-bound peptide repertoires too.

Immunopeptidomics is a super interesting field with a great deal of promise, but it has a much higher barrier to entry for research groups relative to something like AIRR-seq. Nearly every lab can do PCR, and access to deep-sequencing machines or cores becomes ever cheaper and more commonplace. However, not every lab has expertise with fiddly pull-downs, and only a tiny fraction can do highly sensitive mass spec. This is why efforts to make immunopeptidomic data generation and sharing easier should be suitably welcomed.

One of the groups whose work commendably contributes to both of these efforts is that of Michal Bassani-Sternberg. For sharing, she consistently makes all of her data available (and is seemingly a senior founder and major contributor to the recent SysteMHC Atlas Project), while for generation her papers give clear and thorough technical notes, which aid in reproducibility.

From the generation perspective, this paper (which came out at the end of last year in Mol. Cell Proteomics) describes a protocol which – through the application of sensible experimental design – should make it easier to produce immunopeptidomic data, even from more limited samples.

The idea is basically to increase the throughput of the method by hugely reducing the number of handling steps and the time required to do the protocol. Samples are mushed up, lysed, spun, and then run through a stack of plates. The first (if required) catches irrelevant, endogenous antibodies in the lysates; the next catches MHC class I (MHC-I) peptide complexes via bead-cross-linked antibodies; the next similarly catches pMHC-II, while the final plate catches everything else (giving you lovely sample-matched gDNA and proteomes to play with, should you choose). Each plate of pMHC can then be treated with acid to elute the peptides from their grooves, before purification and mass spec. It’s a nice neat solution, which supposedly can all be done with readily available commercial goodies (although how much all these bits and bobs cost I have no idea).
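Purely to illustrate the flow of material (this is my own toy sketch in Python, not anything from the paper, and all the names in it are invented), you can think of the stack as a chain of capture steps, each binding one class of molecule and passing the rest straight through to the next plate:

# Illustrative toy model of the stacked-plate idea: each 'plate' binds one
# class of molecule and passes the remaining lysate downstream.
def capture(lysate, target):
    """Split a lysate into (bound fraction, flow-through)."""
    bound = [m for m in lysate if m["class"] == target]
    flow_through = [m for m in lysate if m["class"] != target]
    return bound, flow_through

lysate = [
    {"class": "endogenous_antibody", "id": "IgG"},
    {"class": "pMHC-I", "peptide": "SIINFEKL"},
    {"class": "pMHC-II", "peptide": "PKYVKQNTLKLAT"},
    {"class": "other", "id": "gDNA"},
]

antibodies, lysate = capture(lysate, "endogenous_antibody")  # plate 1: pre-clearing
pmhc_i, lysate = capture(lysate, "pMHC-I")                   # plate 2: anti-MHC-I beads
pmhc_ii, rest = capture(lysate, "pMHC-II")                   # plate 3: anti-MHC-II beads
# 'rest' holds everything else (gDNA, proteome) as a matched fraction of the same sample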

Crucially it means that you get everything you might want (peptides from MHC-I/-II, plus the rest of the lysates) in separate fractions, from a single input sample, in a protocol that spans hours rather than days. Having it all done in one pass helps boost recovery from limited samples, which is always nice for, say, clinical material. Although I should say, ‘limited’ is a relative term. For people used to dealing with nice, conveniently amplifiable nucleic acids, tens to thousands of cells may be limiting. Here, they managed to go down as low as 10 million. (Which is not to knock it, as this is still much, much better than the hundreds of millions to billions of cells which these experiments can sometimes require. I just don’t want everyone to go away thinking about repurposing their collections of banked Super Rare But Sadly Impractically Tiny tissue samples here.)

So on technical merit alone, it’s already a pretty interesting paper. However, there’s also a nice angle where they test out their new protocol on an ovarian carcinoma cell line with or without IFNg treatment, which tacks on a nice bit of biology to the paper too.

You see the things you might expect – like a shift from peptides seemingly produced by degradation via the standard proteasome towards more of those produced by the immunoproteasome – and some you might not. Another nice little observation, which follows on perfectly from this, is that you also see an alteration in the abundance of peptides presented by different HLA alleles: for instance, the increased chymotrypsin-like degradation of the immunoproteasome favours the loading of HLA-B*07:02 molecules, by making more peptides with the appropriate motif.

My favourite observation, however, relates to the fact that there’s a consistent quantitative and qualitative shift in peptidomes between IFNg-treated cells and mock. This raises an interesting possibility to me about what should be possible in the near future, as we iron out the remaining wrinkles in the methodologies. Not only should we learn what proteins are being expressed, based on which proteins the peptides are derived from, but we should also be able to infer something about what cytokines those cells have been exposed to, based on how those peptides have been processed and presented.

Tuesday, 17 November 2015

Heterogeneity in the polymerase chain reaction


I've touched briefly on some of the insights I gained while writing my thesis in a previous blog post. The other thing I've been doing a lot of over the last year or so is writing and contributing to papers. I've been thinking that it might be nice to write a few little blog posts on these, to give some background on the papers themselves, and maybe (in the theme of this blog) share a little insight into the processes that went into making them.
The paper I'll cover in this piece was published in Scientific Reports in October. I won't go into great detail on this one, not least because I'm only a (actually, the) middle author on it: this was primarily the excellent work of my friends and colleagues Katharine Best and Theres Oakes, who performed the bulk of the analysis and wet-lab work respectively (although I also did a little of both). Also, our supervisor Benny Chain summarised the findings of the article itself on his own blog, which covers the principles very succinctly.
Instead, I thought I'd use this post to share the piece of information I always wonder about when I read a paper: what made them look at this, what put them on this path? This is where I think I made my major contribution to this paper, as (based on my recollections) it began with observations made during my PhD.
My PhD primarily dealt with the development and application of deep-sequencing protocols for measuring T-cell receptor (TCR) repertoires (for which, when I started, there were not many published protocols). As a part of optimising our library preparation strategies, I thought that we might use the random nucleotide sequences in our PCR products – originally added to increase diversity, overcoming a limitation of the Illumina sequencing technology – as unique molecular barcodes. Basically, adding random sequences to our target DNA before amplification uniquely labels each molecule. Then, in the final data, we can infer that any matching sequences sharing the same barcode are probably just PCR duplicates (provided we have enough random barcodes*), meaning that sequence was less prevalent in the original sample than raw read counts would suggest. Not only does this provide better quantitative data, but by looking to see whether different sequences share a barcode we can also find likely erroneous sequences produced during PCR or sequencing, improving the qualitative aspects of the data as well. Therefore we thought (and still do!) that we were on to a good thing.
(Please note that we are not saying that we invented this, just that we have done it: it has of course been done before, both in RNA-seq at large (e.g. Fu et al., 2011 and Shiroguchi et al., 2012) and in variable antigen receptor sequencing (Weinstein et al., 2009), but it certainly wasn't widespread at the time; indeed, even now there's really only one other lab I know of that's doing it (Shugay et al., 2014).)
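To make the collapsing idea concrete, here is a minimal sketch (in Python; my own toy version, not the actual scripts we used): reads sharing both a TCR sequence and a barcode are counted once, so the number of distinct barcodes seen with a TCR approximates the number of input molecules, while the raw read count reflects how much it was amplified.

from collections import defaultdict

# Toy example, not the lab's actual collapsing script.
# Each read is (TCR sequence, random barcode); reads sharing both are treated as PCR duplicates.
reads = [
    ("CASSLAPGATNEKLFF", "AACGTTAGCGTA"),
    ("CASSLAPGATNEKLFF", "AACGTTAGCGTA"),  # duplicate of the read above: same input molecule
    ("CASSLAPGATNEKLFF", "TTGCAGGTACCA"),  # same TCR, different input molecule
    ("CASSIRSSYEQYF", "GGATCCATTGCA"),
]

raw_reads = defaultdict(int)   # uncollapsed read counts per TCR
barcodes = defaultdict(set)    # distinct barcodes per TCR

for tcr, barcode in reads:
    raw_reads[tcr] += 1
    barcodes[tcr].add(barcode)

for tcr in raw_reads:
    print(tcr, "raw reads:", raw_reads[tcr], "collapsed:", len(barcodes[tcr]))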
However, in writing the scripts to 'collapse' the data (i.e. remove artificial sequence inflation due to PCR amplification, and throw out erroneous sequences) I noticed that the degree to which different TCR sequences were amplified followed an interesting distribution:


Here I've plotted the raw, uncollapsed frequency of a given TCR sequence (i.e. the number of reads containing that TCR, here slightly inaccurately labelled ‘clonal frequency’) against that value divided by the number of random barcodes it was associated with, giving a ‘duplication rate’ (not great axis labels, I agree, but these plots are pulled straight out of a lab meeting I gave three years ago). The two plots show the same data, with a shortened X axis on the right to show the bulk of the spread better.
We can see that above a given raw frequency – in this case about 500, although it varies – we observe a ‘duplication rate’ of around 70. This means that above a certain size, sequences are generally amplified at a rate proportional to their prevalence (give or take the odd outlier): every input molecule of cDNA gets amplified and observed roughly seventy times. This is the scenario we'd generally imagine for PCR. However, below that variable threshold there is a very different, very noisy picture, where the amount by which a sequence is amplified and observed is not related to its collapsed prevalence. This was the bait on the hook that led our lab down this path.
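In code terms, the ‘duplication rate’ on the y-axis is just the raw read count for a TCR divided by the number of distinct barcodes it was seen with. Another toy sketch (my own numbers, picked to mirror the shape of the plots rather than taken from them):

# Toy numbers illustrating the 'duplication rate': raw reads per TCR divided by
# the number of distinct random barcodes that TCR was seen with.
raw_reads = {"CASSLAPGATNEKLFF": 350, "CASSIRSSYEQYF": 7000}
distinct_barcodes = {"CASSLAPGATNEKLFF": 12, "CASSIRSSYEQYF": 100}

for tcr, reads in raw_reads.items():
    duplication_rate = reads / distinct_barcodes[tcr]
    print(tcr, "duplication rate:", round(duplication_rate, 1))

# The abundant TCR sits near the ~70 plateau; the rare one falls in the noisy
# region below the threshold, where the ratio is essentially unpredictable.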
Now, everyone knows PCR doesn't behave the way the diagrams say it should. That's what everyone always says (usually as they stick another gel picture containing mysteriously sized bands into their lab books). However, people have rarely looked at what's actually going on. There's a bit of special PCR magic, and a million different target and reaction variables that might affect things: you just optimise your reaction until the output looks like what you'd expect. It's only with relatively recent advances in DNA sequencing technology, which let us see exactly what molecules are being made in the reaction, that we can start to get actual data showing just how unlike the schematics the reaction can in fact behave.
This is exactly what Katharine's paper chases up, applying the same unique molecular barcoding strategy to TCR sequences from both polyclonal and monoclonal** T-cells. I won't go into the details, because hey, you can just read the paper (which says it much better), but the important thing is that this variability is not due to the standard things you might expect, like GC content, DNA motifs or amplicon length, because it happens even for identical sequences. It doesn't matter how well you control your reactions: the noise in the system breeds variability. This makes unique molecular barcoding hugely important, at least if you want accurate relative quantitation of DNA species across the entire dynamic range of your data.
* Theoretically about 16.8 million in our case (4^12), as we use twelve random nucleotides in our barcodes.

** Although it's worth saying that while the line used, KT-2, is monoclonal, that doesn't mean the TCR repertoire is exactly as clean as you'd expect: T-cell receptor expression in T-cell lines is another thing that isn't as simple as the textbook diagrams pretend.