jamimmunology: V

Saturday, 10 November 2018

Making coding T cell receptor sequences from V-J-CDR3

If, like me, you work on T cell receptors, occasionally you’re probably going to want to express a particular TCR in cells. However, you're not always going to have the full sequence, at either the nucleotide or the protein level.

Not a problem, you can sort this out. You can look up all the relevant germline sequences from IMGT, trim away all the unused bits, add in the non-templated stuff, then manually stitch it all together and have a look to see if it still makes what you were expecting. You can do all that... or you can just use the code I wrote.

StiTChR does it all: give it a V gene, a J gene, and a CDR3 amino acid sequence and it'll look up, trim and stitch together all the relevant TCR nucleotide sequences for you, back-translating the non-templated region using the most frequent codon per residue. It also translates the result and, if you have a known partial protein sequence, will run a quick, rudimentary alignment against it, giving you a visual confirmation that it's made the right thing.
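To give a flavour of the back-translation step, here's a minimal Python sketch. It is not StiTChR's actual code: the function names and the `MOST_FREQUENT_CODON` table are my own illustrative assumptions (the table is meant to approximate common human codon preferences and should be checked against a proper codon usage table before you rely on it). It only covers turning an amino acid string into nucleotides and translating it back as a sanity check, not the germline lookup or trimming.

```python
# Hypothetical table: one "preferred" codon per residue (an assumption,
# intended to approximate human usage; verify before real use).
MOST_FREQUENT_CODON = {
    "A": "GCC", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
}

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def back_translate(protein):
    """Return a nucleotide sequence using one fixed codon per residue."""
    try:
        return "".join(MOST_FREQUENT_CODON[aa] for aa in protein.upper())
    except KeyError as err:
        raise ValueError(f"Unrecognised amino acid: {err.args[0]}") from None


def translate(nucleotides):
    """Translate an in-frame nucleotide sequence into amino acids."""
    seq = nucleotides.upper()
    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of three")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


if __name__ == "__main__":
    cdr3 = "CASSLGTDTQYF"  # made-up example CDR3, purely for illustration
    nt = back_translate(cdr3)
    print(nt)
    print(translate(nt) == cdr3)  # round trip recovers the input
```

Translating the back-translated sequence and comparing it with the input is the same kind of "did it make what I expected" check described above, just at a much smaller scale.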

You can then take the alpha/beta TCR sequences it generates, bang them into an expression vector (typically split by a T2A sequence or something) and transduce your cells of interest.

I wrote this code to save me a bit of time in future, but hopefully it can do the same for some of you!

Saturday, 26 July 2014

See an error in a database? Let someone know!

Anyone who does any T cell receptor analysis will know IMGT (the international ImMunoGeneTics information system), the repository for all things TCR and Ig. You either use it, or you're one of those annoying people who make me have to drag up all the tables of outdated nomenclatures.

Much like any resource, IMGT has its good points (simple and highly useful features like GENE-DB and LocusView in particular) and its bad (the less said about LIGM-DB the better).

However, again like any resource, it's only as good as the data stored in it. The data in it, as far as I can tell, is pretty damn good (and I use it a lot). I guess that's why they got to be in charge of all the data in the first place.

As such, when I recently found an error in a sequence*, I made sure to let them know: I certainly get a lot of mileage out of their data, it's only fair that I pay them back (and pay it forward to others) by ensuring the data that is there is good.

It's always a little nerve-racking, being a PhD student emailing senior doctors and professors to let them know of a mistake you've discovered, but as hoped the information was very warmly received, and I'm told that the error will be corrected.

Science has to be self-correcting to stop errors lingering and spreading; firing off a quick email to correct an annotation might not seem like much, but if it saves one person from the same spell of confusion that you went through unravelling the mistake, then you've done a net service to the world.

* For the people who found their way here suffering from this particular error, here's what I found. I was looking at the TCR leader regions (the mono-spliced section of the transcript between the start of translation and the beginning of the V region, which encodes the localisation signal peptide), when I noticed that one gene never seemed to produce functional transcripts. It turned out that while some of the entries for the human alpha gene TRAV29/DV5 were correct, if you downloaded the L-PART2 region alone the sequence produced actually contained a section of the start of the V gene. So, instead of reading 'GGGTAAAC', it reads 'GGGTAAACAGTCAACAGAAGAATGAT'. I just checked and it still gives the old sequence, but I assume there's a lag time for databases to update.
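If you want to check a downloaded entry for this sort of overrun yourself, a few lines of Python will do. This is only a sketch using the two sequences quoted above; the variable and function names are mine, and it assumes you already have a sequence you trust to compare against.

```python
# Sequences as quoted above; the names are illustrative only.
expected_l_part2 = "GGGTAAAC"
downloaded_l_part2 = "GGGTAAACAGTCAACAGAAGAATGAT"


def find_overrun(expected, observed):
    """Return any bases in `observed` that follow a full copy of `expected`.

    Returns None when `observed` does not start with `expected`, in which
    case the two sequences differ in some other way.
    """
    if not observed.startswith(expected):
        return None
    return observed[len(expected):]


extra = find_overrun(expected_l_part2, downloaded_l_part2)
print(extra)       # AGTCAACAGAAGAATGAT
print(len(extra))  # 18
```

An empty string back means the two match exactly; anything else is the stretch that has run on past the end of the region.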