I've touched briefly on some of the insights I gained while writing my
thesis in a previous blog post. The other thing I've been doing a lot of over
the last year or so is writing and contributing to papers. I've been
thinking that it might be nice to write a few little blog posts on
these, to give a little background information on the papers
themselves, and maybe (in the theme of this blog) share a little
insight into the processes that went into making them.
The paper I'll cover in this piece was published in Scientific
Reports in October. I won't go into great detail on this one, not
least because I'm only a (actually, the) middle author on it: this
was primarily the excellent work of my friends and colleagues
Katharine Best and Theres Oakes, who performed the bulk of the
analysis and wet-lab work respectively (although I also did a little
of both). Also, our supervisor Benny Chain summarised the findings of
the article itself on his own blog, which covers the principles very succinctly.
Instead, I thought I'd write this post to share that piece of information that I
always wonder about when I read a paper: what made them look at
this, what put them on this path? This is where I think
I made my major contribution to this paper, as (based
on my recollections) it began with observations made during my PhD.
My PhD primarily dealt with the development and application of deep-sequencing protocols for
measuring T-cell receptor (TCR) repertoires (for which there were not many published protocols
when I started). As a part of optimising
our library preparation strategies, I thought that we might use
random nucleotide sequences in our PCR products – which were
originally added to increase diversity, overcoming a limitation in
the Illumina sequencing technology – to act as unique molecular
barcodes. Basically, adding random sequences to our target DNA before
amplification uniquely labels each molecule. Then, in the final data, we can infer that
any matching sequences that share the same barcode are probably just PCR duplicates
(provided we have enough random barcodes*), meaning that the sequence was less
prevalent in the original sample than raw read counts would suggest. Not
only does this provide better quantitative data, but by
looking to see whether different sequences share a barcode we can find
likely erroneous sequences produced during PCR or sequencing,
improving the qualitative aspects of the data as well. Therefore
we thought (and still do!) that we were on to a good thing.
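To make the idea concrete, here is a minimal sketch of barcode collapsing in Python. It is not the actual scripts from the lab: the function names, the toy sequences and the four-letter barcodes are all made up for illustration, and it assumes each read has already been parsed into a (TCR sequence, barcode) pair.

```python
from collections import defaultdict


def collapse_by_barcode(reads):
    """Return {tcr: (raw_read_count, unique_barcode_count)}.

    `reads` is an iterable of (tcr_sequence, barcode) pairs, one per read.
    """
    raw_counts = defaultdict(int)
    barcodes_seen = defaultdict(set)
    for tcr, barcode in reads:
        raw_counts[tcr] += 1
        barcodes_seen[tcr].add(barcode)
    return {tcr: (raw_counts[tcr], len(barcodes_seen[tcr])) for tcr in raw_counts}


def shared_barcodes(reads):
    """Return {barcode: set of sequences} for barcodes seen with more than one sequence."""
    seqs_by_barcode = defaultdict(set)
    for tcr, barcode in reads:
        seqs_by_barcode[barcode].add(tcr)
    return {bc: seqs for bc, seqs in seqs_by_barcode.items() if len(seqs) > 1}


# Toy data (hypothetical sequences and barcodes, shortened for readability).
reads = (
    [("CASSLGQ", "AAAC")] * 3
    + [("CASSLGQ", "GGTT")]
    + [("CASSLGR", "AAAC")]  # differs by one residue but shares a barcode
    + [("CATSDRG", "TTTG")] * 2
)

collapsed = collapse_by_barcode(reads)
suspect = shared_barcodes(reads)
```

In this toy example CASSLGQ has four reads but only two barcodes, so it collapses to two inferred input molecules, while barcode AAAC turns up with two different sequences, flagging the rarer one as a likely PCR or sequencing error.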
(Please note that we are not saying that we invented this, just that we have
done it: it has of course been done before, both in RNA-seq at large (e.g.
Fu et al., 2011 and Shiroguchi et al., 2012) and in
variable antigen receptor sequencing (Weinstein et al., 2009), but it
certainly wasn't widespread at the time; indeed there's really only
one other lab I know of even now that's doing it (Shugay et al., 2014).)
However, in writing the scripts to 'collapse' the data (i.e. remove artificial
sequence inflation due to PCR amplification, and throw out erroneous
sequences) I noticed that the degree to which different
TCR sequences were amplified followed an interesting distribution:
Here I've plotted the raw, uncollapsed frequency of a given TCR sequence (i.e. the number of reads containing that TCR, here slightly inaccurately labelled 'clonal frequency') against that value divided by the number of random barcodes it was associated with, giving a 'duplication rate' (not great axis labels, I agree, but this is pulling the plots straight out of a lab meeting I gave three years ago). The two plots show the same data, with a shortened X axis on the right to show the bulk of the spread better.
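For clarity, the 'duplication rate' on the Y axis is just reads divided by barcodes. A tiny sketch, with invented clone names and counts rather than real data:

```python
def duplication_rate(raw_reads, unique_barcodes):
    """Raw read count for a TCR divided by the number of distinct barcodes it was seen with."""
    if unique_barcodes <= 0:
        raise ValueError("a sequence must be associated with at least one barcode")
    return raw_reads / unique_barcodes


# Hypothetical (raw_reads, unique_barcodes) pairs, for illustration only.
examples = {"large_clone": (7000, 100), "small_clone": (12, 1)}
rates = {name: duplication_rate(*counts) for name, counts in examples.items()}
```

With these made-up numbers the large clone has a duplication rate of 70 (each inferred input molecule seen seventy times), while the single-barcode clone has a rate of 12.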
We can see that above a given frequency (in this case about 500, although
it varies) we observe a 'duplication rate' around 70. This means
that above a certain size, sequences are generally amplified at a
rate proportional to their prevalence (give or take the odd outlier); that is,
each input molecule of cDNA gets amplified and observed about seventy times.
This is the scenario we'd generally imagine for PCR. However, below
that variable threshold there is a very different, very noisy picture, where
the extent to which a sequence is found to be amplified and observed is not
related to the collapsed prevalence. This was the bait on the hook
that led our lab down this path.
Now, everyone knows PCR doesn't behave like it does in the diagrams, like it should.
That's what everyone always says (usually as they stick another gel picture containing mysteriously sized bands into their lab books). However, people have rarely looked
at what's actually going on. There's a bit of special PCR magic that goes on, and a
million different target and reaction variables that might affect
things: you just optimise your reaction until your output looks like
what you'd expect. It's only with the relatively recent advances in
DNA sequencing technology, which let us look at exactly what
molecules are being made in the reaction, that we can start to get
actual data showing just how unlike the schematics the reaction can
in fact behave.
This is exactly what Katharine's paper chases up, applying the same unique
molecular barcoding strategy to TCR sequences from both
polyclonal and monoclonal** T-cells.
I won't go into the details, because hey, you can just read the paper
(which says it much better),
but the important thing is that this variability is not due to the
standard things you might expect, like GC content, DNA motifs or
amplicon length, because it happens even for identical sequences.
It doesn't matter how well you control your reactions: the noise in
the system breeds variability. This makes
unique molecular barcoding hugely important,
at least if you want accurate
relative quantitation of DNA species across the entire
dynamic range of your data.
* Theoretically about 16.8 million in our case, or 4^12 (16,777,216),
as we use twelve random nucleotides in our barcodes.
** Although it's worth saying that while the line used, KT-2, is monoclonal, that doesn't mean the TCR repertoire is exactly as clean as you'd expect. T-cell receptor expression in T-cell lines is another thing that isn't as simple as the textbook diagrams pretend.
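As a quick check on the barcode count in the first footnote, and a rough feel for what 'enough random barcodes' means, here is a small sketch. It assumes barcodes are drawn uniformly at random, which real random-nucleotide synthesis only approximates, and the helper name is made up for the example.

```python
import math

BARCODE_LENGTH = 12                # twelve random nucleotides per barcode
N_BARCODES = 4 ** BARCODE_LENGTH   # four possible bases at each position


def prob_any_shared_barcode(n_molecules, n_barcodes=N_BARCODES):
    """Approximate chance that at least two of n molecules get the same barcode.

    Uses the standard birthday-problem approximation
    1 - exp(-n(n-1) / (2N)), assuming uniformly random barcodes.
    """
    exponent = -n_molecules * (n_molecules - 1) / (2 * n_barcodes)
    return -math.expm1(exponent)


# Chance of any clash among 100 input molecules of one sequence (hypothetical number).
p_100 = prob_any_shared_barcode(100)
```

Under that uniformity assumption, a clash among a hundred molecules of the same sequence is a well under one-in-a-thousand event, though the chance grows roughly with the square of the number of molecules.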