I had previously written a short blog post touching on how I'd tried to mine some PBMC RNA-seq data (from the ENCODE project) for rearranged T-cell receptor genes, to try and open up this huge resource for TCR repertoire analysis. However, I hadn't gotten very far, on account of finding very few TCR sequences per file.
That sets the background for an extremely pleasant surprise this morning, when I found that Scott Brown, Lisa Raeburn and Robert Holt from Vancouver (the latter of whom is notable for producing one of the very earliest high-throughput sequencing TCR repertoire papers) had published a very nice paper doing just that!
This is a lovely example of different groups seeing the same problem and coming up with different takes. I saw an extremely low rate of return when TCR-mining in RNA-seq data from heterogeneous cell types, and gave up on it as a search for needles in a haystack. The Holt group saw the same problem, and simply searched more haystacks!
This paper tidily exemplifies the re-purposing of biological datasets to allow us to ask new biological questions (something that I consider a practical and moral necessity, given the complexity of such data and the time and costs involved in their generation).
Moreover, they do some really nice tricks, like estimating TCR transcript proportions in other data sets based on constant region usage, investigating TCR diversity relative to CD3 expression, testing on simulated RNA-seq data sets as a control, looking for public or known-specificity receptors, and inferring possible alpha-beta pairs by checking each sample's possible combinations for their presence in at least one other sample (somewhat akin to Harlan Robins' pairSEQ approach).
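As a rough illustration of that last pairing idea – just the gist as I understand it, not the paper's actual implementation – here's a minimal Python sketch; the sample names and chain labels are entirely made up:

from itertools import product
from collections import Counter

def infer_pairs(samples):
    """Guess alpha-beta pairings shared across samples.

    'samples' maps a sample name to (list of alpha chains, list of beta chains).
    Every alpha-beta combination within a sample is a candidate pair; candidates
    seen in two or more samples are reported as putative genuine pairings.
    """
    candidate_counts = Counter()
    for alphas, betas in samples.values():
        # Each sample contributes each of its possible combinations once.
        candidate_counts.update(set(product(alphas, betas)))
    return {pair: n for pair, n in candidate_counts.items() if n >= 2}

# Toy example with hypothetical chains: only the alpha-beta combination that
# occurs in both samples is reported.
samples = {
    "donor_A": (["TRAV1-CAVRDN", "TRAV8-CAVSDL"], ["TRBV9-CASSLG"]),
    "donor_B": (["TRAV1-CAVRDN"], ["TRBV9-CASSLG", "TRBV28-CASSIR"]),
}
print(infer_pairs(samples))   # {('TRAV1-CAVRDN', 'TRBV9-CASSLG'): 2}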
All in all, a very nice paper indeed, and I hope we see more of this kind of data re-purposing in the field at large. Such approaches could certainly be adapted for immunoglobulin genes. I also wonder if, given whole-genome sequencing data from mixed blood cell populations, we might even be able to do a similar analysis on rearranged variable antigen receptors from gDNA.
Tuesday, 17 November 2015
Heterogeneity in the polymerase chain reaction
I've touched briefly on some of the insights I gained while writing my thesis in a previous blog post. The other thing I've been doing a lot of over the last year or so is writing and contributing to papers. I've been thinking that it might be nice to write a few short blog posts on these, to give some background on the papers themselves, and maybe (in the theme of this blog) share a little insight into the processes that went into making them.
The paper I'll cover in this piece was published in Scientific Reports in October. I won't go into great detail on this one, not least because I'm only a (actually, the) middle author on it: this was primarily the excellent work of my friends and colleagues Katharine Best and Theres Oakes, who performed the bulk of the analysis and wet-lab work respectively (although I also did a little of both). Also, our supervisor Benny Chain summarised the findings of the article itself on his own blog, which covers the principles very succinctly.
Instead, I thought I'd write this post to share the piece of information that I always wonder about when I read a paper: what made them look at this, what put them on this path? This is where I think I made my major contribution to this paper, as (based on my recollections) it began with observations made during my PhD.
My PhD primarily dealt with the development and application of deep-sequencing protocols for measuring T-cell receptor (TCR) repertoires (for which, when I started, there were not many published protocols). As part of optimising our library preparation strategies, I thought that we might use the random nucleotide sequences in our PCR products – originally added to increase diversity and overcome a limitation of the Illumina sequencing technology – as unique molecular barcodes. Basically, adding random sequences to our target DNA before amplification uniquely labels each molecule. Then, provided we have enough random barcodes*, we can infer that any matching sequences in the final data that share the same barcode are probably just PCR duplicates, meaning that sequence was less prevalent in the original sample than one might think from the raw read counts. Not only does this provide better quantitative data, but by looking to see whether different sequences share a barcode we can also find likely erroneous sequences produced during PCR or sequencing, improving the qualitative aspects of the data as well. Therefore we thought (and still do!) that we were on to a good thing.
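To make the collapsing idea concrete, here's a minimal sketch of the principle (not our actual pipeline code), assuming the reads have already been reduced to (TCR sequence, random barcode) pairs; the sequences and barcodes in the toy example are made up:

from collections import defaultdict

def collapse_by_barcode(reads):
    """Collapse (tcr_sequence, barcode) pairs into per-sequence counts.

    Raw read counts are inflated by PCR duplication; counting the number of
    distinct random barcodes attached to each sequence estimates how many
    input molecules were actually present.
    """
    raw_reads = defaultdict(int)    # sequence -> total reads observed
    barcodes = defaultdict(set)     # sequence -> distinct barcodes seen

    for tcr_seq, barcode in reads:
        raw_reads[tcr_seq] += 1
        barcodes[tcr_seq].add(barcode)

    # Collapsed prevalence: one count per unique barcode, not per read.
    return {seq: (raw_reads[seq], len(barcodes[seq])) for seq in raw_reads}

# Toy example: three reads sharing a barcode collapse to a single molecule.
reads = [("CASSLGTDTQYF", "ACGTACGTACGT"),
         ("CASSLGTDTQYF", "ACGTACGTACGT"),
         ("CASSLGTDTQYF", "ACGTACGTACGT"),
         ("CASSIRSSYEQYF", "TTGCATTGCAGG")]
print(collapse_by_barcode(reads))
# {'CASSLGTDTQYF': (3, 1), 'CASSIRSSYEQYF': (1, 1)}

The error-spotting step described above would then look for different sequences sharing the same barcode (most likely PCR or sequencing errors), which I've left out of the sketch for brevity.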
(Please note that we are not saying that we invented this, just that we have done it: it has of course been done before, both in RNA-seq at large (e.g. Fu et al., 2011 and Shiroguchi et al., 2012) and in variable antigen receptor sequencing (Weinstein et al., 2009), but it certainly wasn't widespread at the time; indeed, there's really only one other lab I know of even now that's doing it (Shugay et al., 2014).)
However, in writing the scripts to 'collapse' the data (i.e. remove artificial sequence inflation due to PCR amplification, and throw out erroneous sequences) I noticed that the degree to which different TCR sequences were amplified followed an interesting distribution:
Here I've plotted the raw, uncollapsed frequency of a given TCR sequence (i.e. the number of reads containing that TCR, here slightly inaccurately labelled 'clonal frequency') against that value divided by the number of random barcodes it was associated with, giving a 'duplication rate'. (Not great axis labels, I agree, but these plots are pulled straight out of a lab meeting I gave three years ago.) The two plots show the same data, with a shortened x-axis on the right to show the bulk of the spread better.
We can see that above a given frequency – in this case about 500, although it varies – we observe a 'duplication rate' of around 70. This means that above a certain size, sequences are generally amplified at a rate proportional to their prevalence (give or take the odd outlier): every input molecule of cDNA gets amplified and observed roughly seventy times. This is the scenario we'd generally imagine for PCR. However, below that variable threshold there is a very different, very noisy picture, where the amount to which a sequence is amplified and observed is not related to its collapsed prevalence. This was the bait on the hook that led our lab down this path.
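Incidentally, the 'duplication rate' plotted above is just raw reads divided by unique barcodes; continuing the illustrative sketch from earlier (the 500-read threshold is only an example – as noted, it varies):

def duplication_rates(collapsed, min_raw=500):
    """Raw reads per unique barcode, per sequence.

    'collapsed' maps sequence -> (raw_read_count, unique_barcode_count), as
    produced by collapse_by_barcode() above. Above some raw-frequency
    threshold the rate should plateau at roughly the per-molecule
    amplification factor; below it, it is dominated by noise.
    """
    rates = {seq: raw / barcodes for seq, (raw, barcodes) in collapsed.items()}
    plateau = [rate for seq, rate in rates.items() if collapsed[seq][0] >= min_raw]
    mean_plateau = sum(plateau) / len(plateau) if plateau else float("nan")
    return rates, mean_plateau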
Now, everyone knows PCR doesn't behave the way it does in the diagrams, the way it should. That's what everyone always says (usually as they stick another gel picture containing mysteriously sized bands into their lab books). However, people have rarely looked at what's actually going on. There's a bit of special PCR magic that goes on, and a million different target and reaction variables that might affect things: you just optimise your reaction until your output looks like what you'd expect. It's only with the relatively recent advances in DNA sequencing technology, which let us look at exactly what molecules are being made in the reaction, that we can start to get actual data showing just how un-like the schematics the reaction can in fact behave.
This is exactly what Katharine's paper chases up, applying the same unique molecular barcoding strategy to TCR sequences from both polyclonal and monoclonal** T-cells. I won't go into the details, because hey, you can just read the paper (which says it much better), but the important thing is that this variability is not due to the standard things you might expect, like GC content, DNA motifs or amplicon length, because it happens even for identical sequences.
It doesn't matter how well you control your reactions: the noise in the system breeds variability. This makes unique molecular barcoding hugely important, at least if you want accurate relative quantitation of DNA species across the entire dynamic range of your data.
* Theoretically about 16.7 million in our case, or 4^12, as we use twelve random nucleotides in our barcodes.
** Although it's worth saying that while the line used, KT-2, is monoclonal, that doesn't mean the TCR repertoire is exactly as clean as you'd expect. T-cell receptor expression in T-cell lines is another thing that isn't as simple as the textbook diagrams pretend.
Thursday, 1 October 2015
ProFlex Problems
Our lab is lucky enough to have just purchased a few new bits of kit, among which was an AB ProFlex thermal cycler, which replaces the increasingly erratic G-STORM unit we'd previously been working on.
As the nerdiest of the current lab crew, it fell to me to set it up, which involved surmounting a few small errors that popped up along the way. As it's a relatively new machine, I thought that I would relay my experiences here to help anyone similarly inconvenienced.
The first problem came when I tried to get it online using the Wi-Fi dongle that came with the order: it wouldn't accept that it was plugged in! Instead, I got the following error message:
Error
⚠
Please make sure the USB WiFi card is inserted and restart the instrument
No amount of restarting or re-seating the dongle made a difference: instead a bit of searching revealed that I needed to install a firmware update, available from the Thermo website (v 1.1.5). It's pretty simple to install: throw it on a USB stick (that contains no other *.update files), insert it into the PCR machine, go into Settings and update the system.
However, I now got another error message popping up on boot:
⚠
The instrument is due for servicing for the following test(s):
Temperature Non-Uniformity Test,
Temperature Verification Test,
Cycle Performance Test,
Heated Cover Test
This was a brand spanking new machine fresh out of the box, so I was pretty sure this couldn't be the case. However this time the googling came up empty, so I had to get help from tech support. It turns out that there was some issue with the default time stamp, so that when the machine got networked (and updated its clock) it thought it was much older than it actually was. They sent me another patch (a file named ProFlex-ServiceDateReset-1.0.0.update) which did the trick (although it's a bit annoying having to apply two patches within a week of opening a brand new cycler!). I can't find this online, but if you are suffering from a similar problem at least you'll know what to ask your rep for.
My final problem is with the app, which lets you interface remotely with the machine, letting you check availability and so on. Sounds like a fun idea, if a little gimmicky, but nice if you have a long protocol with many heat steps, or a busy lab (of which I have both), so I thought it worth playing with.
The problem is that the app refuses to accept my Thermo Fisher account login details, saying that the user name or password are incorrect (which they most certainly were not). Tech help to the rescue again, and it turns out that it's a known bug (potentially just with the Android version), which is due to be fixed in an update at the end of the month.
Apart from these minor inconveniences, I must say that it's actually quite a lovely machine! There are a couple of adjustments I had to make – like using the plastic inserts to prevent tube-squashing, and turning off the constant heated lid to speed up my RNA-seq protocol, which requires about three different lid temperatures – but otherwise it's all very intuitive!
Labels:
AB,
applied bioscience,
error,
fisher,
fix,
patch,
PCR,
problem,
test,
thermal cycler,
thermo,
update,
wifi
Monday, 10 August 2015
Get single pages from a PDF with pdftk
Another little trick for file manipulation in Linux, which I find useful when making/reading papers, is how to convert a multi-page PDF document into multiple single-page files.
Here is how I do it in Ubuntu, using the lovely program pdftk (here for a seven-page document – adjust the {1..7} range to match your page count):
for i in {1..7}; do echo Working on page $i; pdftk A=INFILE.pdf cat A$i output OUTFILEpage$i.pdf; done
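If you'd rather not hard-code the page range, here's a minimal Python alternative that does the same job, assuming you have the pypdf package installed (the file names are just placeholders, as in the pdftk example):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("INFILE.pdf")
for i, page in enumerate(reader.pages, start=1):
    print(f"Working on page {i}")
    writer = PdfWriter()
    writer.add_page(page)                      # one page per output file
    with open(f"OUTFILEpage{i}.pdf", "wb") as handle:
        writer.write(handle)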
Friday, 7 August 2015
PhD thesis writing advice
When I sat down to write this post, I had an opening line in mind that was going to bemoan me being a bit remiss with my blogging of late. That was before I checked my last post and realised it had been ten months, and 'remiss' feels a bit inadequate: I basically stopped.
I had a pretty good reason, in that I needed to finish my PhD. Those ten months have basically been filled with write paper, submit paper, thesis thesis thesis, do a talk, thesis thesis thesis, get paper rejections, rewrite, resubmit, thesis thesis thesis … then finally the viva (and a couple more paper submissions). It is exhausting, and frankly having either a thesis or a paper to be writing at any given time takes the shine off blogging somewhat.
Now that it's done and out of the way (no corrections, wahey!), I begin to turn my mind to getting the blog up and running again. What have I been up to lately that I could write about, I asked myself. Well, there's always this big blue beast.
Thesis writing having been the major task of the last year of my life, I've spent more than a little time thinking about the process, so I thought I'd share some pointers and thoughts, even if some of them are the usual and the obvious.
1) Everything takes longer than expected.
This is the best advice I got going in, and is always the first thing I say to anyone who asks. In a perfect example of Hofstadter's Law, no matter how much time you think a given task should take, in reality it will take longer.
2) Be consistent.
The longer a given document is, the more chance there is that inconsistencies will creep in. Theses are long documents that are invariably only read by accomplished academics (or those in training), many of whom have a keen eye for spotting such inconsistencies: 'n=5' on one line and 'n = 4' on another; 'Fig1: Blah', 'Figure 2 – Blah blah' and 'Fig. 3 Even more blah'. These might not seem like big deals, but added up they can make the document feel less professional. Choice of spelling (e.g. UK or US English), where you put your spaces, how you refer to and describe your figures and references, all of these types of things: it doesn't matter so much how you choose to do them, as long as you do them the same way each time.
On a related note, many technical notations – such as writing gene, protein or chemical names – are covered by exhaustive committee-approved nomenclatures. Use them, or justify why you're not using them, but again, be consistent.
3) Make it easy for yourself: get into good habits.
Writing long documents is difficult, so don't make it harder on yourself. Start as you mean to go on, and go on like that – I made a rather foolish decision to change how I formatted my figures one chapter into my writing, and going back to re-format the old material was time I could have spent writing*. Doing something right the first time is much more efficient than re-doing it two or three times.
Make sure you make full use of reference manager software. This will sound obvious to people who use them, but I am consistently surprised by the number of PhD students I meet who write their bibliographies and citations manually. I personally use Mendeley, which operates perfectly well and has a nice reading interface as well, although there are plenty of others. You are probably going to have a lot (hundreds) of references, and even more citations: doing them manually is a recipe for disaster.
Similarly, don't do any manual cross-referencing if you can avoid it – the document as you write it will likely be entirely fluid and subject to change for months, so any 'hard' references you put in could well end up needing to be changed, which not only takes time but increases the risk of you missing something and carrying an error along to your finished PhD.
If you have the time, I would recommend trying to get into LaTeX (with the 'X' pronounced as a 'K'), which is a free, open-source, code-based type-setting program. It's a bit of a steep learning curve, but there are plenty of good templates, and once you've got a grasp of the basic commands it's incredibly powerful. Crucially, as your file is just a text document (which effectively just 'links' to pictures when you compile your PDF) it remains small in size, and therefore easy to load, back up and play with. It also makes referencing, cross-referencing, and generally producing beautiful-looking text a lot easier than most word processors.
Theses are often full of technical words and abbreviations, and it's entirely likely your examiners won't know them all beforehand – therefore they need to be defined the first time they're used. However, if you're moving chunks of text around (sometimes whole chapters), how do you know which instance is the first? My tactic was to not define anything while writing, but whenever I did use a new phrase I added it to a spreadsheet, along with its definition. Then, once everything was set in place, I worked through that spreadsheet and used 'find' to add the appropriate definition to the first instance of every term. What's more, that spreadsheet was then easily alphabetised and converted straight into a convenient glossary!
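For what it's worth, that final look-up step is easy to script; here's a minimal sketch of the idea, assuming the terms and definitions live in a two-column CSV and the thesis is available as plain text (the file names are just placeholders):

import csv

def first_occurrences(thesis_path, glossary_path):
    """Report where each abbreviation first appears in the thesis text."""
    with open(thesis_path) as f:
        text = f.read()

    with open(glossary_path) as f:
        terms = {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}

    for term, definition in sorted(terms.items()):   # alphabetised, glossary-style
        position = text.find(term)
        if position == -1:
            print(f"{term}: never used - consider dropping it from the glossary")
        else:
            print(f"{term} ({definition}): first used at character {position}")

first_occurrences("thesis.txt", "abbreviations.csv")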
4) Be prepared to get sick of it ...
You will spend an unhealthy amount of time doing one thing: working on this one document. It will bore you, and it will make you boring, as it will take over your time, thoughts and life**. It is basically guaranteed that you will grow bone-weary of sitting down at your computer and working on it again. It's relentless; it just keeps going on and on and on, to the point where you forget that your life hasn't always been, and will not always be, just thesis-writing.
5) ... but remember it will end.
It might not seem like it at the time, but it will. You will finish writing, you will finish checking, you will hand it in. You'll then find some errors, but that's OK: your examiners are never going to read it as closely as you do when you check it. Remember that your supervisor(s)/thesis committee/whoever shouldn't let you submit unless they think you're ready, so the fact that you're submitting means you're probably going to be fine!
* For what it's worth, the final way I did my figures was better; I should just have thought about it and done it that way first. Basically I outputted my plots from Python and R as SVG files, compiled them into whole figures in Inkscape (which is also great for making schematics), and saved these as PDFs. A word of warning though – certain lines/boxes in occasional Python-saved SVGs failed to print (apparently something to do with the way fancy printers flatten the layers), so it is probably worth keeping backup EPS or non-vector versions of your Inkscape files on hand.
** Look at me, I've just closed my file for the last time (before uploading to university servers) and the first thing I do is go and write a thousand words about it!
Labels:
bibliography,
cross-references,
inkscape,
latex,
mendeley,
my PhD,
phd,
references,
repertoire,
TCR,
theses,
thesis,
tips,
tricks,
writing