Friday, 26 August 2016

Count how many MiSeq reads derived from each surface of the flowcell

I recently had cause to perform one of those tasks that I think others might need to do, yet not be entirely sure how to go about.

Specifically, in troubleshooting a MiSeq run's poor yield, I wanted to see whether significantly more reads derived from one of the flow cell surfaces (top or bottom) than from the other. I checked because the FWHM (full width at half maximum of the clusters, a measure of the focus during imaging) was noticeably higher for one surface than for the other.

I mean, I have no idea if ~3 is that much worse than ~2.8-2.9, but there's no harm in checking, right?

This is very easily achieved, as all of the information required to work it out is contained within the FASTQ reads themselves, in the tile section of the identifier line of each read.
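To make that concrete, here is a small sketch of where the surface sits in an identifier line. The header below is invented for illustration (the instrument, run and flowcell IDs are hypothetical), and it assumes the colon-separated Illumina header layout in which the fifth field is the tile number. The first digit of that tile number is the surface: 1 for top, 2 for bottom.

```shell
# Hypothetical identifier line (instrument, run and flowcell IDs are invented)
header='@M01234:56:000000000-ABCDE:1:2104:15343:1337 1:N:0:1'

# Assuming the usual colon-separated layout, field 5 is the tile number
tile=$(printf '%s\n' "$header" | cut -d: -f5)

# First digit of the tile number: 1 = top surface, 2 = bottom surface
surface=$(printf '%s\n' "$tile" | cut -c1)

echo "tile=$tile surface=$surface"
```

Because the surface digit is found by field rather than by counting characters, this does not depend on how long the instrument name or run number happens to be.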

Therefore, with a quick bit of basic bash, we can find out exactly how many reads derived from each surface.

# Concatenate all index reads (the shortest reads, so the quickest to process) into one file
zcat *I1*z > I1.fq

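# Keep only the identifier lines with sed (the N~4 address form needs GNU sed)
# and count those with a '1' at the right position,
# which indicates they derived from the top surface.
# The dots skip the header fields before the tile number, so the count of dots
# may need adjusting if your instrument or run IDs differ in length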
sed '2~4d;3~4d;4~4d' I1.fq | grep ^.............................1 -c

# Do the same for '2', i.e. the bottom surface
sed '2~4d;3~4d;4~4d' I1.fq | grep ^.............................2 -c
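If your headers are a different length, so the fixed-position match does not line up, a field-based variant may be easier to adapt. The following is a sketch rather than a drop-in replacement: `count_surfaces` is a hypothetical helper name, and it assumes four-line FASTQ records with the colon-separated header layout where field 5 is the tile number. It tallies both surfaces in a single pass and reads from a pipe, so no intermediate file is needed.

```shell
# count_surfaces: hypothetical helper; reads FASTQ on stdin and prints
# "<surface digit> <read count>" for each surface seen
count_surfaces() {
  # NR % 4 == 1 selects the identifier line of each four-line record;
  # $5 is the tile number, and its first character is the surface
  awk -F: 'NR % 4 == 1 { n[substr($5, 1, 1)]++ }
           END { for (s in n) print s, n[s] }' | sort
}

# Example usage on the index reads:
#   zcat *I1*z | count_surfaces
```

The output has one line per surface, with `1` being the top and `2` the bottom, followed by the number of reads.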

And there you have it. Simple, quick and effective.

(As it turned out, I had almost equal numbers of reads derived from both surfaces, so that wasn't to blame in my case, but this might be useful in other situations!)

