Wednesday, 1 March 2017

Extracting antigen-specific TCRs from vdjdb on the command line

My previous lab and I recently published a review on the techniques and possibilities of analysing TCR repertoire data produced from high-throughput sequencing.

By and large I'm exceedingly happy with it - apart from a couple of missed references and one very unfortunate mix up regarding the accessibility, I'm very pleased with how it came out (and hope it will prove useful!).

One thing that I'm particularly pleased that we included (in spite of the lack of published descriptions yet) is the pair of manually curated TCR databases that have recently emerged: VDJdb and McPAS-TCR, in which you can find a small (but growing) host of TCRs of known specificity and/or disease association. We thought it was important to get these out there as soon as possible, as this is a rapidly changing field which is currently sorely needing for such efforts at standardisation and resource development.

With that in mind, I've been playing around with both of these, and thought I'd share some of the bare bones of the bash code I've been using to pull out sequences related to epitopes I'm interested in. Here is my quick vignette using VDJdb to pull out HIV-reactive TCR sequences - and even then just the fields of the database I'm interested in - using basically just default terminal commands.

# Download repo, build database then navigate to it
brew install groovy # OSX groovy installation, replace depending on your setup
git clone https://github.com/antigenomics/vdjdb-db.git
cd vdjdb-db/src/
groovy -cp . BuildDatabase.groovy
cd ../database/
# See fields
head -1 vdjdb_full.txt
# fields 1 to 3 are alpha CDR3, V and J, fields 4 to 7 are beta CDR3, V, D and J, field 9 is MHC, 12 is peptide and 14 is antigen source
# Get just the alpha CDR3 sequences for any TCRs recognising HIV
# Awk checks that CDR3 is present, and that antigen source is HIV
# Then tidy it up/ get unique tab-delimited data using sort/uniq/sed
awk -F '\t' '(length($1) > 1) && ($14 == "HIV"){print $1}' vdjdb_full.txt | sort | uniq | sort | sed 's/ /\t/g' > all_alpha_hiv_cdr3s.tsv
# Get CDR3, V, J and recognised peptide of all beta chain recognising HIV on an HLA-A*02 background
# This time we're using grep to catch the HLA type, as it's used more variably (e.g. HLA-A*02/02:01/02:01:XX, depending on source)
awk -F '\t' '(length($4) > 1) && ($14 == "CMV"){print $4, $5, $7, $9, $12}' vdjdb_full.txt | grep HLA-A\\*02 | sort | uniq | sort | sed 's/ /\t/g' > all_a2_beta_hiv_cdr3s.tsv
# Note that I've only successfully tested this code in the terminal on a Mac - quick test on Linux proves that awk behaves differently on each
# Frankly in all but a few edge cases you're probably just OK grepping across the whole lines