Wednesday 26 March 2014

TCR trivia part one - translating T cell receptor sequences


Here's the first in what's likely to be a pretty niche series of posts – but hopefully useful to some – on various bits of TCR trivia I've gleaned during my PhD. This installment: translating TCR sequences.
Whether you're doing high- or low-throughput DNA TCR sequencing, it's highly likely that at some point you'll want to translate your DNA into amino acid sequences.
There are usually a few main goals to bear in mind when translating: to check the frame and (potential) functionality of the sequence (i.e. that it lacks premature stop codons), before defining the hypervariable CDR3 region.
Even assuming your reads are indel-free, sequencing variable antigen receptors will always involve a bit more thought about checking the reading frame than regular amplicon sequencing, due to the non-templated addition and deletion of nucleotides that occurs during V(D)J recombination.
So how can we tell whether our final recombined sequence is in frame or not?
This will mostly depend on how your sequencing protocols work.
If, say, you've somehow managed to sequence an entire TCR mRNA sequence, then it's easy; you can just take the sequence from the start codon of the leader region to the stop codon of the constant region, and if its length divides exactly by three then it's in frame. Assuming there are no stop codons in between and the CDR3 checks out, chances are good your sequence encodes a productively rearranged TCR chain.
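As a rough illustration of that full-length check, here's a minimal Python sketch. The function names and the toy sequences are mine (they aren't real TCR genes), and it only covers the frame and stop codon tests, not the CDR3 check.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, ignoring any trailing partial codon."""
    dna = dna.upper()
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def looks_productive(orf):
    """Check a full coding sequence running from start codon to stop codon."""
    if len(orf) % 3 != 0:
        return False                      # out of frame
    protein = translate(orf)
    return (protein.startswith("M")       # begins with ATG
            and protein.endswith("*")     # ends with a stop codon
            and "*" not in protein[:-1])  # no premature stops


# Toy sequences, not real TCR genes.
print(looks_productive("ATGTGTGCCAGCTTCGGCTGA"))   # True: in frame, single terminal stop
print(looks_productive("ATGTGTGCCAGCTTTCGGCTGA"))  # False: one extra base breaks the frame
```

Codons containing anything other than A, C, G or T come out as "X" here, which is just a convenience so that ambiguous base calls don't crash the translation.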
Given today's technology, that's unlikely to be the case; whether you're cloning into plasmids and Sanger sequencing or throwing libraries into some next-gen machine, chances are good that the sequence you'll be working with is actually just a small window around the recombination site.
What's likely is that you have amplified from a constant or J region primer at one end, to a V or RACE/template-switching primer at the other. Sorting out the frame is then just a matter of finding sequences on either side of the rearrangement that you know should be in frame, and seeing whether they are.
So far so self-explanatory. Here's the first of the fiddlier bits of TCR trivia I want to impart, the thing I need to remind myself of every time I tweak my TCR translating scripts: in fully-spliced TCR message the last nucleotide of the J region makes up the first nucleotide of the first codon of the constant region.
This means that if you go from the first nucleotide of the V (or the leader sequence, all functional leader sequences being divisible by three) to either the second to last nucleotide of the J, or the second nucleotide of the C (if you've started with mRNA instead of gDNA) then presto, you'll be in frame!
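In script form the rule boils down to a bit of modular arithmetic. This is only a sketch: it assumes you already know (say, from aligning to germline genes) the index of an in-frame V or leader codon and the index of the last J nucleotide, and the function names and toy reads are made up for illustration.

```python
def rearrangement_in_frame(v_start, j_last):
    """True if the V frame and the J/C frame agree.

    v_start: 0-based index of the first base of a V (or leader) codon.
    j_last:  0-based index of the last base of the J region, which is also
             the first base of the first constant-region codon.
    """
    # Bases from v_start up to, but not including, j_last must form whole codons.
    return (j_last - v_start) % 3 == 0


def junction_codon(read, j_last):
    """The codon made of the last J base plus the first two C bases (mRNA only)."""
    return read[j_last:j_last + 3]


# Toy read: 15 in-frame bases, then the last J base, then the start of the C.
read = "TGTGCCAGCTTCGGC" + "A" + "ATATC"
print(rearrangement_in_frame(0, 15))   # True
print(junction_codon(read, 15))        # AAT

# The same read with one extra base in the junction: the last J base shifts to 16.
print(rearrangement_in_frame(0, 16))   # False
```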
This arrangement also has another noteworthy consequence, at least in one chain: while every TRBJ ends in the same nucleotide, there are functional TRAJ genes ending in each base. This means that the first residue of the constant region of the alpha chain can be one of four different amino acids – which can be a bit off-putting if you don't know this and you're looking at what's supposed to be constant.
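To see why the last J base matters so much, here's a small sketch that builds that first constant-region codon for each possible J ending. Note the assumption: I've used "AT" as the first two bases of the alpha constant region, which is my recollection of the human TRAC reference and worth checking against IMGT before you rely on it. The function name is made up.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def first_constant_residue(j_last_base, c_first_two):
    """Translate the codon spanning the J/C splice junction."""
    return CODONS[(j_last_base + c_first_two).upper()]


# ASSUMPTION: "AT" stands in for the first two bases of the constant region.
for base in "ACGT":
    print(base, first_constant_residue(base, "AT"))
# A N
# C H
# G D
# T Y
```

Under that assumption the four J endings give four different residues, which fits with the 'constant' region not looking quite so constant at its first position.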
Once you've translated your TCR the next step is to find the CDR3, which should surely be the easiest bit; just run from the second conserved cysteine to the phenylalanine in the FGXG motif, right, just as IMGT says? Simple, or so it might seem.
Both sides of the CDR3 offer their own complications. First off, finding the second conserved cysteine isn't as easy as just picking the second C from the 5' end, or the 5'-most one upstream of the rearrangement, as some Vs have more than two germline cysteines, and it's feasible that new ones might be generated in the rearrangement. At some point you're just going to have to do an alignment of all the different Vs, and record the position of all the conserved cysteines.
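Once you have that alignment, pulling out the cysteine positions is simple enough. Here's a sketch using a tiny invented alignment; the V names and sequences are hypothetical, and a real one would come from the germline references.

```python
def conserved_cys_columns(aligned):
    """Alignment columns where every sequence has a cysteine."""
    return [i for i, col in enumerate(zip(*aligned.values())) if set(col) == {"C"}]


def ungapped_position(aligned_seq, column):
    """Convert an alignment column into a 0-based index in the ungapped sequence."""
    return len(aligned_seq[:column].replace("-", ""))


# Hypothetical aligned V fragments ('-' marks a gap); not real germline genes.
alignment = {
    "V_a": "MC--QCTLYFCAS",
    "V_b": "MCSAQ-TLYLCAS",
    "V_c": "MCA-QCC-YFCAV",
}

columns = conserved_cys_columns(alignment)
print(columns)  # [1, 10]

# Record where the second conserved cysteine sits in each ungapped V.
second_cys = {name: ungapped_position(seq, columns[1]) for name, seq in alignment.items()}
print(second_cys)  # {'V_a': 8, 'V_b': 9, 'V_c': 8}
```

In this toy example V_a and V_c carry extra cysteines upstream, so simply taking the second C from the 5' end would pick the wrong residue in both.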
The FGXG is also trickier than it might seem, by virtue of the fact that there are J genes in both the alpha and gamma loci* that (while supposedly functional) lack the motif, being either FXXG or XGXG. If you're only looking for CDR3s matching the C to FGXG pattern, then there are going to be whole J genes which never seem to get used by your analysis!
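One way to cope with this is to try the canonical motif first and only fall back to the looser patterns when it's missing. The sketch below does that, running from a known conserved cysteine through the first residue of the motif; the function name and the example sequences are illustrative only.

```python
import re

# Canonical FGXG first, then the two variants (FXXG and XGXG) as fallbacks.
MOTIFS = [re.compile(p) for p in (r"FG.G", r"F..G", r".G.G")]


def extract_cdr3(protein, cys_index):
    """Return the CDR3 from the conserved Cys through the motif's first residue.

    cys_index: 0-based position of the second conserved cysteine.
    Returns None if no motif is found downstream of the cysteine.
    """
    for motif in MOTIFS:
        match = motif.search(protein, cys_index + 1)
        if match:
            return protein[cys_index:match.start() + 1]
    return None


# Illustrative amino acid sequences, with the conserved Cys index given.
print(extract_cdr3("LYFCASSLGQAYEQYFGPGTRLTV", 3))  # CASSLGQAYEQYF
print(extract_cdr3("YFCAMRSNYQLIWGAGTKLII", 2))     # CAMRSNYQLIW
```

The looser patterns can easily match by chance inside a CDR3, so if you know which J gene a read uses it's safer to pick the motif from that assignment than to rely on fallbacks like this.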
There you have it, a couple of little tips that are worth bearing in mind when translating TCR sequence**.

Addendum - it's actually a little bit more complicated. Naturally.
* If you're interested (or don't believe me), they are: TRAJ16, TRAJ33, TRAJ38, TRGJP1 and TRGJP2.
** Note that everything written here is based on human TCR sequence as that's what I work with, but most of it will probably apply to other species as well.

Sunday 23 March 2014

TCR diversity in health and in HIV

Before I forget, here's a record of the poster I presented recently at the Quantitative Immunology workshop in Les Houches (I blogged about the first day of it last week*).

In a move I was pretty pleased with, I hosted my poster on figshare, with an additional pdf of supplementary information. (I even included a QR code on the poster linking to both, which I thought pretty cunning, but this plan was sadly scuppered by a complete lack of wifi signal.)

You can access the poster here, and the supplementary information here.

The poster gives a quick glance at some of my recent work where I've been using random barcodes to error-correct deep-sequenced TCR repertoires, a technique I'm applying to comparing healthy individuals to HIV patients, both before and after three months of antiretroviral therapy.
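For anyone unfamiliar with the general idea of barcode-based correction, here's a deliberately simplified sketch: reads sharing a barcode are assumed to derive from the same original molecule, so each barcode is collapsed to its most common sequence and clones are counted in barcodes rather than reads. This is not the pipeline from the poster, just an illustration, and the barcodes and sequences are invented.

```python
from collections import Counter, defaultdict


def collapse_by_barcode(reads):
    """Collapse (barcode, sequence) pairs to one consensus sequence per barcode.

    Returns a Counter mapping each consensus sequence to the number of
    distinct barcodes (i.e. inferred original molecules) supporting it.
    """
    by_barcode = defaultdict(Counter)
    for barcode, seq in reads:
        by_barcode[barcode][seq] += 1

    clone_counts = Counter()
    for seqs in by_barcode.values():
        consensus, _ = seqs.most_common(1)[0]  # majority vote within the barcode
        clone_counts[consensus] += 1
    return clone_counts


# Invented example: the CASTF read is treated as an error within barcode AAAC.
reads = [
    ("AAAC", "CASSF"), ("AAAC", "CASSF"), ("AAAC", "CASTF"),
    ("GGTT", "CASSF"),
    ("TTAG", "CAWSF"), ("TTAG", "CAWSF"),
]
print(collapse_by_barcode(reads))  # Counter({'CASSF': 2, 'CAWSF': 1})
```

A real implementation would also need to handle errors in the barcodes themselves and ties within a barcode, both of which this sketch ignores.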

* For those interested in a larger overview of the conference, my supervisor Benny Chain wrote a longer piece at his blog, Immunology with numbers.

Monday 10 March 2014

First day of qImmunology workshop: diversity and error


It's the first day of the quantitative Immunology workshop in Les Houches (#qImmLH), and there's been a definite theme: TCR repertoire sequencing. In fact it seems to be the main theme of the conference, with around half the people here seemingly working on it in one sense or other – my accommodation alone seems to be populated exclusively with us*! Seeing as I have a bit of time today, here's a quick post about it.
There have been a couple of recurrent points which have been coming up in the talks, and are being particularly dwelt on in the discussions from the group (being based in a physics retreat apparently demands that we must do as the physicists do, and ask questions throughout a talk). Well, there have been many, but these are the two I picked up on most, though that might just be because they're what my poster is about, so I'm biased.
The first is that of error. So far, all of our pipelines involve some amount of PCR amplification, which adds a great deal of error on top of whatever the sequencing technique itself will introduce. My own supervisor Benny Chain probably dwelt on this the longest, going over some evidence to suggest that within a given PCR there's some variability in the efficiency of amplification, so for lower frequency clones there's less of a relationship between the number of reads coming out and the number of original molecules of DNA that went in. However, a number of the talks touched on this, and I'm sure some of the later ones will as well.
The second theme is that of diversity, and how to measure it. Based on the fact that all of the speakers used a different metric (and the number of questions it raised from the audience) there's clearly scope for discussion. In brief, Encarnita Mariotti-Ferrandiz used a species richness index to describe the number of different unique clonotypes in mouse Treg and Teff cells, Thierry Mora used Shannon entropy to compare diversities of zebrafish Ig CDR3s, and Eric Shifrut looked at Gini indexes of aging mouse repertoires.
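For reference, here's roughly what those three kinds of metric look like when computed from a list of clonotype counts. These are the textbook formulations (with Shannon entropy in nats), not necessarily the exact variants each speaker used, and the counts are invented.

```python
import math


def richness(counts):
    """Number of distinct clonotypes observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_entropy(counts):
    """Shannon entropy of the clonotype frequency distribution, in nats."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index: 0 for perfectly even clone sizes, near 1 when one clone dominates."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    # Standard formula over the values sorted in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


even = [5, 5, 5, 5]      # four equally sized clones
skewed = [17, 1, 1, 1]   # one dominant clone

for name, counts in (("even", even), ("skewed", skewed)):
    print(name, richness(counts), shannon_entropy(counts), gini(counts))
```

Both toy repertoires have the same richness, but the skewed one has lower entropy and a higher Gini index, which is part of why the choice of metric prompts so much discussion.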
This of course all ties in to the error of the system, as any additional error will likely be artificially inflating the diversity while at the same time distorting the frequency distribution.
The day's full talks ended with Aleksandra Walczak talking through the generation of diversity in TCR repertoires, mostly going through the figures from the excellent Murugan et al. paper.
So far this is shaping up to be an ideal workshop for me, what with so many people working on and talking about the exact problems I find myself faced with on a daily basis. That it's all taking place among some achingly beautiful scenery is just icing on the cake.

* I know it's an awful thing to think about, but I can't help feeling that if an avalanche hit the resort we'd be setting the relatively young field of adaptive repertoire sequencing back a decent way!