Wednesday, 10 August 2016

Extracting sample names from Illumina FASTQ filenames

Another post in the vein of "minor analysis task and my solution": how to extract the sample name or identifier (i.e. what you called a given file in the samplesheet) from a list of Illumina FASTQ file names.

So you might have something that looks like this (which I've cobbled together from a couple of runs from different machines that I pulled off Basespace):
 
322_S18_R2_001.fastq.gz 
1c8_P6_act_S11_R2_001.fastq.gz
5c4_P8_res_S1_R1_001.fastq.gz


In this exercise, I wanted the first part of the IDs, which are what we labelled the actual samples with. In an ideal world this should be very easy: we could just use sed or cut to delete everything after the first instance of a character used in the text that Illumina appends to the sample name. However different people (or the same people at different times) use different naming conventions, which can often include underscores, the character that separates the string fields that get added.

Therefore what we need to do is not delete from say the first pattern, but from the nth pattern from the end, which turned out to be a slightly trickier problem (at least to one such as myself that doesn't find sed to be the most intuitive program in the world). Here's what I came up with:
sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt

What this does is find the third-from-last underscore (just before the S[XX]_R[X]_001.fastq.gz string) and replace it with a pipe character ('|'), which should never appear in a filename, before finding this pipe character and deleting everything after it. This produces the following output:
 
322
1c8_P6_act
5c4_P8_res

This feels a bit hacky to me, as really I feel like I should be able to do this in one substitution, but the first part of the command actually selects the latter half (so if anyone knows how to tidy this up I'd love to hear it!).
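For what it's worth, the same logic is easy to express outside of sed too: if the text Illumina appends always contributes a fixed number of underscore-delimited fields, you can just strip that many fields from the right. A minimal Python sketch of that idea (using the example filenames above; the function name is just mine):

```python
# Strip the last n underscore-delimited fields that Illumina appends
# (here n = 3: the S number, the read number, and '001.fastq.gz')
def sample_name(filename, n_appended_fields=3):
    return filename.rsplit('_', n_appended_fields)[0]

for f in ['322_S18_R2_001.fastq.gz',
          '1c8_P6_act_S11_R2_001.fastq.gz',
          '5c4_P8_res_S1_R1_001.fastq.gz']:
    print(sample_name(f))
```

Because rsplit counts from the right, underscores inside the sample names themselves are left alone, which is exactly the property the sed command needed the pipe trick for.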

It's also worth pointing out that the output of different machines or demultiplexing settings will differ in the number of underscores, and so the command will need to change slightly. For instance, the above example was run on NextSeq data demultiplexed using bcl2fastq --no-lane-splitting. For an example of a file containing lane information ('L001' etc), here's the equivalent on some MiSeq filenames I pulled up.
# This 
324bAB_S8_L001_R1_001.fastq.gz 
076-PBMCs-alpha_S5_L001_R1_001.fastq.gz 

# After this
sed 's/\(.*\)_\(.*_\)\(.*\)_\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt   

# Becomes this 
324bAB
076-PBMCs-alpha
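An alternative to adjusting the command for each machine is to match the appended suffix itself, with the lane field optional. A hedged Python sketch (the regex assumes bcl2fastq-style names ending in `_S<n>[_L<nnn>]_R<n>_001.fastq.gz`, which covers the two examples above but may need tweaking for other demultiplexing settings):

```python
import re

# Matches the fields bcl2fastq appends: sample number, optional lane,
# read number, and the trailing '001.fastq.gz'
SUFFIX = re.compile(r'_S\d+(_L\d{3})?_R[12]_001\.fastq\.gz$')

def strip_illumina_suffix(filename):
    return SUFFIX.sub('', filename)

print(strip_illumina_suffix('322_S18_R2_001.fastq.gz'))         # NextSeq, no lane
print(strip_illumina_suffix('324bAB_S8_L001_R1_001.fastq.gz'))  # MiSeq, with lane
```

The same pattern then works on mixed lists of filenames from different runs without needing to count underscores first.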
The reason I was actually doing this was that I had a number of different NextSeq runs, some of which contained samples sequenced from the same original libraries (to increase the depth), so I needed to get the names of all the samples sequenced in each run. I had each run in a separate directory, and ran this in the folder above:
for i in */
do
  echo $i
  ls $i | grep R1 | grep fastq.gz | sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\2/;s/|.*//' > ${i%?}.fl
done
This goes through all the directories, finds the R1 FASTQ files (as there are other files in there, and this gives me only one entry per sample), strips their sample names out as above, and writes these to a text file with the suffix '.fl'.

Thursday, 4 August 2016

Blog fail, but life successes

While I doubt there are many regular readers of this blog (not least because there isn't a regular output of posts!) I thought I'd make an update here, to explain the sparsity of content and continue the general narrative.

Essentially, I've not written nearly as often or as much here as I might wish. There are a variety of reasons for this, but it basically comes down to the twin demons of being too busy and falling out of the habit.

It's hard for me to feel too bad about it, because it has indeed been a pretty eventful year or two for me. I finished my PhD and went to a couple of very exciting international conferences. I also got out my first crop of PhD papers*, which were slightly longer in the making than I might have wished, and have been working on a number of other papers since with various labmates.

I've also still been doing the odd bit of writing for BioNews every now and then, and responding to lab and scientific questions, problems and discussions on Reddit (which I think makes for a great discussion platform for scientists, given its large user base and easy interface). This often scratches the itch that blogging does but in a more schedule-friendly (and less inspiration-necessary) manner, so I can see why it might be syphoning off some of what's left of my writing enthusiasm after working on whatever the current paper is.

But life is more than work, and I have had a number of life events this year that have kept the rest of my free time pretty well booked up. In addition to some of the bigger, more time-consuming life events that exist (including the biggest!) I'm also very happy to report that I've accepted an invitation to go to Boston for a post-doc, where I'll be exploring TCR sequences in the context of cancer rather than infection. While I am hugely excited for all of these developments, they take time and effort to happen, and so other pursuits (e.g. blogging) can suffer.

Hopefully, once I've got the next big paper finished and have successfully made it to the other side of the pond, I can start getting the ball rolling here again. Little and often is the plan; I've got a few particular posts in mind, but I think things like a TCRseq publication alert roundup might be a nice place to start. So if you do take something from these posts (and I'm very surprised but happy to find that there are people that claim to!) take heart, as more is definitely on the cards.


*An application of the TCRseq tech we'd worked on during my thesis characterising HIV+ patient repertoires and a review of the history of DNA sequencing methods.

Wednesday, 6 July 2016

Retrieving GenBank FASTA records with Python

I recently needed to download a relatively large number of FASTA records from GenBank (as I wanted to have a play with the paired alpha-beta TCR chains sequenced from single cells in this lovely paper).

Unfortunately what is probably the easiest way - using NCBI's Batch Entrez function - did not work for me. Instead I just got lots of "the following records can't be retrieved/might be obsolete" and "No items found" warnings.

There are a number of other ways, however, and I especially liked Pierre Lindenbaum's answer on https://www.biostars.org/p/3297/, which basically uses bash to pull the record directly from the NCBI website. Nice and simple, and it also gives the URL format required.

As I want to do some post-download processing in Python, I thought I may as well download the records directly using the urllib2 module. In case it may be of use to others (and for the next time I need to do this myself!) I thought I'd throw the basic code up here. The only requirement that this code has is to already have a file containing a list of GenBank accession IDs: in this case I just generated this in Excel for the ranges needed, but it could be easily coded directly in Python.

import urllib2

with open('GenbankAccessions') as infile, open('Sequences.fasta', "w") as outfile:
  for accession in infile:
    # Build the efetch URL for this accession (rstrip removes the trailing newline)
    url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=" + accession.rstrip() + "&rettype=fasta"
    # Stream the returned FASTA record straight into the output file
    entry = urllib2.urlopen(url)
    for line in entry:
      outfile.write(line)

This will write all of the FASTA records of the accessions listed in the 'GenbankAccessions' file to the new out file 'Sequences.fasta'.
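The above is Python 2; in Python 3, urllib2 was folded into urllib.request. The E-utilities also accept several comma-separated accession IDs per efetch request, which cuts down the number of hits on NCBI's servers. A sketch along those lines (the function names and batch size are my own choices, not anything from the original script):

```python
from urllib.request import urlopen

EFETCH = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
          "?db=nucleotide&rettype=fasta&retmode=text&id=")

def efetch_url(accessions):
    # One efetch request can carry many comma-separated accession IDs
    return EFETCH + ",".join(accessions)

def download_records(accession_file, fasta_out, batch_size=100):
    # Read accession IDs (one per line) and write their FASTA records out,
    # requesting batch_size IDs at a time rather than one per request
    with open(accession_file) as infile:
        ids = [line.strip() for line in infile if line.strip()]
    with open(fasta_out, 'w') as outfile:
        for i in range(0, len(ids), batch_size):
            batch = ids[i:i + batch_size]
            outfile.write(urlopen(efetch_url(batch)).read().decode())
```

Used as, say, `download_records('GenbankAccessions', 'Sequences.fasta')`, it should produce the same output file as the loop above.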

Sunday, 28 February 2016

Early 2016 immunology conferences


I'm very lucky to have attended two fantastic immunology conferences this year: the Keystone Systems Immunology conference, hosted at Big Sky in Montana, USA, and the Next Gen Immunology conference at the Weizmann Institute in Israel. I don't want to go into the details of either conference in any great depth (as frankly the standard of both was so high any thorough recounting would shortly turn into just a complete retelling of each event), but I thought it might be nice to recount a few of my impressions and recollections.

The Systems Immunology meeting was my first Keystone, fitting what I understand to be the regular format; talks in the morning and late afternoon, with a gap over lunch to allow people to sneak off to the slopes. This definitely seemed to be the strategy of a healthy proportion of the attendees - even a ski-novice like me managed to make it up there once or twice!

The Next Gen Immunology conference was a day shorter, but felt like a longer event by merit of absolutely jam-packing the talks and events in! I'm not sure that I've ever been to a busier conference, with talks running from before nine in the morning to past seven in the evening (albeit with a number of breaks to allow consumption of excessive amounts of coffee, even by general science standards). It also had the slickest branding of any conference I've ever been to, complete with its own celebrity-filled introductory video (which was originally emailed round but has sadly now been made private!).

There were some interesting thematic differences between the conferences, despite having somewhat similar stated scopes. The Keystone talks were seemingly linked mostly in their approaches, displaying a fine array of highly-quantitative, systems level experiments. There was a definite emphasis on the single-cell technologies; it seemed that rare was the talk that wasn't employing at least either single-cell RNA-seq or 40-odd colour mass cytometry. The NGI talks on the other hand were much more diverse in terms of techniques employed, but converged more on the general area of research, which was the microbiome and the mucosal immune response.

Another observation that tickled me was the difference in plotting standards between the two conferences. At the Systems meeting, with (I think it's fair to say) a greater influence or prevalence of mathematical and computational backgrounds, there was a definite enrichment of LaTeX-produced slides and ggplot-produced plots, whereas Prism and Excel were far more common in Israel. There was even a fairly large smattering of talks using Comic Sans, which came as a bit of a surprise.

Altogether, I've kicked off the year with some fantastic science. I've ticked off a number of speakers that I've been wanting to hear for years – including Ron Germain, Rick Flavell and David Baltimore – and added a few more to the list of stories I need to hear more from, like Aviv Regev's incredible bulk tumour scRNA-seq data, or the wonderful things that Ton Schumacher can do with T-cells.

Crucially, I also heard a little bit about the growing story of assembling T-cell receptor sequences out of scRNA-seq data (to be the topic of a future post I feel), which brings us to the shallows of one of the major goals of repertoire research; getting paired clonotype and phenotype in one fell swoop.

Monday, 7 December 2015

The key to finding TCR sequences in RNA-seq data

I had previously written a short blog post touching on how I'd tried to mine some PBMC RNA-seq data (from the ENCODE project) for rearranged T-cell receptor genes, to try and open up this huge resource for TCR repertoire analysis. However, I hadn't gotten very far, on account of finding very few TCR sequences per file.

That sets the background for an extremely pleasant surprise this morning, when I found that Scott Brown, Lisa Raeburn and Robert Holt from Vancouver (the latter of whom is notable for producing one of the very earliest high-throughput sequencing TCR repertoire papers) had published a very nice paper doing just that!

This is a lovely example of different groups seeing the same problem and coming up with different takes. I saw an extremely low rate of return when TCR-mining in RNA-seq data from heterogeneous cell types, and gave up on it as a search for needles in a haystack. The Holt group saw the same problem, and simply searched more haystacks!

This paper tidily exemplifies the re-purposing of biological datasets to allow us to ask new biological questions (something that I consider a practical and moral necessity, given the complexity of such data and the time and costs involved in their generation).

Moreover, they do some really nice tricks: estimating TCR transcript proportions in other data sets based on constant region usage, investigating TCR diversity relative to CD3 expression, testing on simulated RNA-seq data sets as a control, looking for public or known-specificity receptors, and inferring possible alpha-beta pairs by checking all of each sample's possible combinations for their presence in at least one other sample (somewhat akin to Harlan Robins' pairSEQ approach).
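That pairing logic is simple enough to sketch: enumerate every possible alpha-beta combination within each sample, and keep the combinations that recur in at least one other sample. A toy illustration of the idea (this is not the paper's actual code, the data are invented, and a real analysis would need proper statistics on top):

```python
from itertools import product
from collections import Counter

# Each sample contributes the alpha and beta chains detected in it
# (toy data; real samples would carry many more chains)
samples = [
    {'alpha': ['a1', 'a2'], 'beta': ['b1']},
    {'alpha': ['a1', 'a3'], 'beta': ['b1', 'b2']},
    {'alpha': ['a2'],       'beta': ['b3']},
]

# Count how many samples each alpha-beta combination appears in
pair_counts = Counter()
for s in samples:
    pair_counts.update(product(s['alpha'], s['beta']))

# Combinations seen in more than one sample are candidate true pairs
candidate_pairs = [p for p, n in pair_counts.items() if n > 1]
print(candidate_pairs)  # ('a1', 'b1') co-occurs in the first two samples
```

Chance co-occurrence obviously becomes more likely as repertoires and sample numbers grow, which is why the real approaches attach probabilities rather than a simple threshold.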

All in all, a very nice paper indeed, and I hope we see more of this kind of data re-purposing in the field at large. Such approaches could certainly be adapted for immunoglobulin genes. I also wonder if, given whole-genome sequencing data from mixed blood cell populations, we might even be able to do a similar analysis on rearranged variable antigen receptors from gDNA.

Tuesday, 17 November 2015

Heterogeneity in the polymerase chain reaction


I've touched briefly on some of the insights I made writing my thesis in a previous blog post. The other thing I've been doing a lot of over the last year or so is writing and contributing to papers. I've been thinking that it might be nice to write a few little blog posts on these, to give a little background information on the papers themselves, and maybe (in the theme of this blog) share a little insight into the processes that went into making them.
The paper I'll cover in this piece was published in Scientific Reports in October. I won't go into great detail on this one, not least because I'm only a (actually, the) middle author on it: this was primarily the excellent work of my friends and colleagues Katharine Best and Theres Oakes, who performed the bulk of the analysis and wet-lab work respectively (although I also did a little of both). Also, our supervisor Benny Chain summarised the findings of the article itself on his own blog, which covers the principles very succinctly.
Instead, I thought I'd write this blog to share that piece of information that I always wonder about when I read a paper: what made them look at this, what put them on this path? This is where I think I made my major contribution to this paper, as (based on my recollections) it began with observations made during my PhD.
My PhD primarily dealt with the development and application of deep-sequencing protocols for measuring T-cell receptor (TCR) repertoires (which, when I started, there were not many published protocols for). As a part of optimising our library preparation strategies, I thought that we might use random nucleotide sequences in our PCR products – which were originally added to increase diversity, overcoming a limitation in the Illumina sequencing technology – to act as unique molecular barcodes. Basically, adding random sequences to our target DNA before amplification uniquely labels each molecule. Then, in the final data we can infer that any matching sequences that share the same barcode are probably just PCR duplicates, if we have enough random barcodes*, meaning that sequence was less prevalent in the original sample than one might think based on raw read counts. Not only does this provide better quantitative data, but by looking to see whether different sequences share a barcode we can find likely erroneous sequences produced during PCR or sequencing, improving the qualitative aspects of the data as well. Therefore we thought (and still do!) that we were on to a good thing.
(Please note that we are not saying that we invented this, just that we have done it: it has of course been done before, both in RNA-seq (e.g. Fu et al, 2011 and Shiroguchi et al, 2012) at large and in variable antigen receptor sequencing (Weinstein et al, 2009), but it certainly wasn't widespread at the time; indeed there's really only one other lab I know of even now that's doing it (Shugay et al, 2014).)
However, in writing the scripts to 'collapse' the data (i.e. remove artificial sequence inflation due to PCR amplification, and throw out erroneous sequences) I noticed that the degree to which different TCR sequences were amplified followed an interesting distribution:


Here I've plotted the raw, uncollapsed frequency of a given TCR sequence (i.e. the number of reads containing that TCR, here slightly inaccurately labelled 'clonal frequency') against that value divided by the number of random barcodes it associated with, giving a 'duplication rate' (not great axis labels I agree, but this is pulling the plots straight out of a lab meeting I gave three years ago). The two plots show the same data, with a shortened X axis on the right to show the bulk of the spread better.
We can see that above a given frequency – in this case about 500, although it varies – we observe a 'duplication rate' around 70. This means that above a certain size, sequences are generally amplified at a rate proportional to their prevalence (give or take the odd outlier), or that for every input molecule of cDNA it gets amplified and observed seventy times. This is the scenario we'd generally imagine for PCR. However, below that variable threshold there is a very different, very noisy picture, where the amount to which a sequence is found to be amplified and observed is not related to the collapsed prevalence. This was the bait on the hook that led our lab down this path.
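As a toy illustration of what's being computed here (this is not our actual pipeline, and the reads are made up): a sequence's collapsed prevalence is the number of distinct barcodes it was seen with, and its 'duplication rate' is the raw read count divided by that.

```python
from collections import defaultdict

# Toy post-sequencing reads as (random barcode, TCR sequence) pairs;
# reads sharing both barcode and sequence are presumed PCR duplicates
reads = [
    ('AACG', 'TCR1'), ('AACG', 'TCR1'), ('AACG', 'TCR1'),
    ('GGTT', 'TCR1'),
    ('CTCT', 'TCR2'), ('CTCT', 'TCR2'),
]

barcodes_per_seq = defaultdict(set)
raw_reads = defaultdict(int)
for barcode, seq in reads:
    barcodes_per_seq[seq].add(barcode)
    raw_reads[seq] += 1

for seq in sorted(barcodes_per_seq):
    collapsed = len(barcodes_per_seq[seq])    # inferred input molecules
    rate = raw_reads[seq] / collapsed         # the 'duplication rate'
    print(seq, collapsed, rate)
```

In real data it's the spread of that rate at low collapsed frequencies, rather than the rate itself, that reveals the noise described below.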
Now, everyone knows PCR doesn't behave like it does in the diagrams, like it should. That's what everyone always says (usually as they stick another gel picture containing mysteriously sized bands into their lab books). However, people have rarely looked at what's actually going on. There's a bit of special PCR magic that goes on, and a million different target and reaction variables that might affect things: you just optimise your reaction until the output looks like what you'd expect. It's only with the relatively recent advances in DNA sequencing technology, which let us look at exactly what molecules are being made in the reaction, that we can start to get actual data showing just how unlike the schematics the reaction can in fact behave.
This is exactly what Katharine's paper chases up, applying the same unique molecular barcoding strategy to TCR sequences from both polyclonal and monoclonal** T-cells. I won't go into the details, because hey, you can just read the paper (which says it much better), but the important thing is that this variability is not due to the standard things you might expect like GC content, DNA motifs or amplicon length, because it happens even for identical sequences. It doesn't matter how well you control your reactions; the noise in the system breeds variability. This makes unique molecular barcoding hugely important, at least if you want accurate relative quantitation of DNA species across the entire dynamic range of your data.
* Theoretically about 16.7 million in our case, or 4^12, as we use twelve random nucleotides in our barcodes.

** Although it's worth saying that while the line used, KT-2, is monoclonal, that doesn't mean the TCR repertoire is exactly as clean as you'd expect. T-cell receptor expression in T-cell lines is another thing that isn't as simple as the textbook diagrams pretend.