I recently wrote a blog post about the strategies used to translate T-cell receptor nucleotides en masse and extract (what can arguably be considered) the useful bit: the CDR3.
In that post I touched on the IMGT definition of the CDR3: it runs from the second conserved cysteine in the V region to the conserved FGXG motif in the J. Nice and easy, but we have to remember that it's the conserved bit that's key here: there are other cysteines to factor in, and there are a few germline J genes that don't use the typical FGXG motif.
However, even that paints too simple a picture, so here's a quick follow-up point:
These are human-imposed definitions, based more on convenience for human-understanding than biological necessity. The fact is that we might well produce a number of TCRs that don't make use of these motifs at all, but that are still able to function perfectly well; assuming the C/FGXG motifs have function, it's possible alternative motifs might compensate for these.
I have examples in my own sequence data that appear to clearly show these motifs having been deleted into, and then replaced with different nucleotides encoding the same residues. Alternative residues must certainly be introduced on occasion, and I'd be surprised if none of these make it through selection; we just don't see them because we aren't able to generate rules to computationally look for them.
I even recently found such an example with verified biological activity: this paper sequenced tetramer-sorted HIV-reactive T-cells, revealing one that contained an alpha chain using the CDR3 'CAVNIGFGNVLHCGSG'.*
For the majority of analyses, looking for rare exceptions to rules probably won't make much difference. However, as we increase the resolution and throughput of our experiments, we're going to find more and more examples of things which don't fit the tidy rules we made up when we weren't looking so deeply. If we're going to get the most out of our 'big data', we need to be ready for them.
* I was looking through the literature harvesting CDR3s, which reminds me of another point I want to make. Can I just ask, from the bottom of my heart, for people to put their CDR3s in sensible formats so that others can make use of them? Ideally, give me the nucleotide sequence. Bare minimum, give me the CDR3 sequence as well as which V and J were used (and while I stick to IMGT standards, I won't judge you if you don't - but do say which standards you are using!). Most of all, and I can't stress this enough, please please PLEASE make all DNA/amino acid sequences copyable.**
** Although spending valuable time copying out or removing PDF-copying errors from hundreds of sequences drives me ever so slightly nearer to a breakdown, it does allow me to play that excellent game of "what's the longest actual word I can find in biological sequences". For CDR3s, I'm up to a sixer with 'CASSIS'.
My thoughts on immunology, T-cell receptors, next-generation sequencing, molecular biology, and anything else that takes my fancy.
Showing posts with label FGXG. Show all posts
Sunday, 4 May 2014
Wednesday, 26 March 2014
TCR trivia part one - translating T cell receptor sequences
Here's the first in what's likely to
be a pretty niche series of posts – but hopefully useful to some –
on various bits of TCR trivia I've gleaned during my PhD. This
installment: translating TCR sequences.
Whether you're doing high- or
low-throughput DNA TCR sequencing, it's highly likely that at some
point you'll want to translate your DNA into amino acid sequences.
There are usually a few main goals to
bear in mind when translating: to check the frame and (potential)
functionality of the sequence (i.e. that it lacks premature stop codons),
before defining the hypervariable CDR3 region.
Assuming your reads are indel free,
sequencing variable antigen receptors will always involve a bit more
thought about checking the reading frame than regular amplicon
sequencing, due to the non-templated addition and deletion of
nucleotides that occurs during V(D)J
recombination.
So how can we tell whether our
final recombined sequence is in frame or not?
This will mostly depend on how your
sequencing protocols work.
If, say, you've somehow managed to
sequence an entire TCR mRNA sequence, then it's easy; you can just
take the sequence from the start codon of the leader region to the
stop codon of the constant region and if it divides exactly by three
then it's in frame. Assuming there are no stop codons in between and
the CDR3 checks out, chances are good your sequence encodes a
productively rearranged TCR chain.
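
As a rough illustration of that full-length check, here's a minimal Python sketch. The function names (translate, looks_productive) are made up for this example, it assumes the standard genetic code and an input that runs from the start codon to the stop codon inclusive, and it doesn't attempt the "CDR3 checks out" part at all.

```python
# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate from the first base, ignoring any trailing partial codon.

    Codons containing anything other than A/C/G/T become 'X'.
    """
    nt = nt.upper()
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def looks_productive(cds):
    """cds is assumed to run from the start codon to the stop codon inclusive."""
    if len(cds) % 3 != 0:
        return False  # length not divisible by three: out of frame
    protein = translate(cds)
    # Must start with Met, end with a stop, and have no stop in between.
    return (
        protein.startswith("M")
        and protein.endswith("*")
        and "*" not in protein[:-1]
    )
```

A sequence that passes this is only potentially productive; the CDR3 still needs checking separately.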
Given today's technology, that's
unlikely to be the case; whether you're cloning into plasmids and
Sanger sequencing or throwing libraries into some next-gen machine,
chances are good that the sequence you'll be working with is actually
just a small window around the recombination site.
What's likely is that you have
amplified from a constant or J region primer at one end, to a V or
RACE/template-switching primer at the other. Sorting out the frame is
then just a matter of finding sequences on either side of the
rearrangement that you know should be in frame, and seeing whether
they are.
So far so self-explanatory. Here's
the first of the fiddlier bits of TCR trivia I want to impart, the
thing I need to remind myself of every time I tweak my TCR
translating scripts: in fully-spliced TCR message the last
nucleotide of the J region makes up the first
nucleotide of the first codon of the constant region.
This means that if you go from the
first nucleotide of the V (or the leader sequence, all functional
leader sequences being divisible by three) to either the second to
last nucleotide of the J, or the second nucleotide of the C (if
you've started with mRNA instead of gDNA) then presto, you'll be in
frame!
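
The arithmetic behind that rule can be sketched in a few lines of Python. This is just an illustration: the function names are hypothetical, positions are 0-based indexes into your read, and it assumes you already know where the V (or leader) starts and where the J ends, and that the read is indel free.

```python
def in_frame_to_j(v_start, j_last):
    """Frame check for a read ending in the J (e.g. from gDNA).

    v_start: index of the first nucleotide of the V (or leader).
    j_last:  index of the LAST nucleotide of the J.
    The stretch from v_start up to the second-to-last J nucleotide
    (index j_last - 1) holds j_last - v_start bases, which must be
    a whole number of codons.
    """
    return (j_last - v_start) % 3 == 0


def in_frame_to_c(v_start, c_first):
    """Frame check for spliced message running into the constant region.

    c_first: index of the first nucleotide of the C.
    The stretch from v_start up to the second C nucleotide
    (index c_first + 1) holds c_first + 2 - v_start bases.
    """
    return (c_first + 2 - v_start) % 3 == 0


def coding_slice(read, v_start, j_last):
    """Return V start to second-to-last J nucleotide, ready for translating."""
    return read[v_start:j_last]
```

For a toy read where the V starts at index 0 and the last J nucleotide sits at index 9 (so the C starts at index 10), both in_frame_to_j(0, 9) and in_frame_to_c(0, 10) come out true, as they should: the two checks describe the same frame.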
This arrangement also has another
noteworthy consequence, at least in one chain; while every TRBJ ends in
the same nucleotide, there are functional TRAJ genes ending in each
base. This means that the first residue of the constant region of the
alpha chain can be one of four different amino acids – which can be
a bit off-putting if you don't know this and you're looking at what's
supposed to be constant.
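
To see why there are four possibilities, here's a tiny Python sketch of that first constant region codon. Note the assumption: I've hard-coded the first two nucleotides of the alpha constant region as "AT", which is what I'd expect for human TRAC, but do check that against the IMGT reference sequence before relying on it. The names are made up for the example.

```python
# ASSUMPTION: the first two nucleotides contributed by the alpha constant
# region. Verify against the reference sequence for your species.
TRAC_FIRST_TWO = "AT"

# Only the four codons of the form N-A-T are needed here (standard code).
NAT_CODONS = {"AAT": "N", "CAT": "H", "GAT": "D", "TAT": "Y"}


def first_constant_residue(j_last_nt):
    """Residue encoded by the last J nucleotide plus the first two C ones."""
    codon = j_last_nt.upper() + TRAC_FIRST_TWO
    return NAT_CODONS[codon]


# One possible first residue per possible final J nucleotide.
residues = {nt: first_constant_residue(nt) for nt in "ACGT"}
```

Under that assumption the alpha constant region can open with N, H, D or Y, depending on nothing more than which J was used.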
Once you've translated your TCR the
next step is to find the CDR3, which should surely be the easiest
bit; just run from the second conserved cysteine to the phenylalanine
in the FGXG motif, right, just as IMGT says? Simple, or so it might
seem.
Both sides of the CDR3 offer their
own complications. First off, finding the second conserved cysteine
isn't as easy as just picking the second C from the 5' end, or the
5'-most one upstream of the rearrangement, as some Vs have more than
two germline cysteines, and it's feasible that new ones might be
generated in the rearrangement. At some point you're just going to
have to do an alignment of all the different Vs, and record the
position of all the conserved cysteines.
The FGXG is also trickier than it
might seem, because there are J genes in both the
alpha and gamma loci* that (while supposedly functional) lack the
motif, being either FXXG or XGXG. If you're only looking for CDR3s
matching the C to FGXG pattern, then there are going to be whole J
genes which never seem to get used by your analysis!
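
Here's one way the motif search might be sketched in Python, using the standard re module. Everything named here (find_cdr3, STRICT, RELAXED) is hypothetical, the position of the conserved cysteine is assumed to come from your own V alignment, and the two example sequences are toys rather than real data. Be warned that the relaxed patterns are very permissive and can easily match inside the CDR3 itself, so a real implementation would want to restrict them to reads using the J genes that actually need them.

```python
import re

# Canonical motif first, then the two variants described above.
STRICT = re.compile(r"FG.G")
RELAXED = re.compile(r"F..G|.G.G")


def find_cdr3(protein, cys_index, allow_relaxed=True):
    """Return the CDR3 from the conserved C to the F (or its stand-in).

    protein:   translated, in-frame amino acid sequence.
    cys_index: 0-based position of the second conserved cysteine,
               assumed to be known from an alignment of the V genes.
    """
    if protein[cys_index] != "C":
        return None
    patterns = [STRICT, RELAXED] if allow_relaxed else [STRICT]
    for pattern in patterns:
        # Only look downstream of the conserved cysteine.
        match = pattern.search(protein, cys_index + 1)
        if match:
            # The motif's first residue is the last residue of the CDR3.
            return protein[cys_index:match.start() + 1]
    return None


# Toy sequences for illustration only.
beta_like = "LQPSDSAVYFCASSLAPGATNEKLFFGSGTQLSVL"
alpha_like = "YFCAMRDSNYQLIWGAGTKLIIKP"
```

With these toys, find_cdr3(beta_like, 10) gives 'CASSLAPGATNEKLFF', while find_cdr3(alpha_like, 2, allow_relaxed=False) finds nothing at all; only with the relaxed patterns switched on does it return 'CAMRDSNYQLIW'.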
There you have it, a couple of little
tips that are worth bearing in mind when translating TCR sequence**.
Addendum - it's actually a little bit more complicated. Naturally.
* If you're interested (or don't believe me), they are: TRAJ16, TRAJ33,
TRAJ38, TRGJP1 and TRGJP2.
** Note that everything written here is based on human TCR sequence as
that's what I work with, but most of it will probably apply to other
species as well.