Tuesday 10 September 2013

Python fastq analysis for beginners


It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I though I'd continue my vague hopscotch bioinformatics for beginners series, Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storage next generation DNA sequencing data. Each sequence gets four lines in the file: one for sequence identifier, nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be equal length).

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practise, but often when I want to code something I can't immediately think of a solution for, I just get combing tutorials, manuals and forums finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq", "rU")  
      
for record in SeqIO.parse(fastqfile,"fastq"):  
   
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE  
   
  countseqs += 1  
   
print "Reading from", fastqfile  
print countseqs, "records analysed."  

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input   
import sys   # module needed for command line argument to get fastq name   
     
if (len(sys.argv) > 1):   
  fastqfile = open(sys.argv[1], "rU")   
           
elif (len(sys.argv) == 1):   
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"   
  sys.exit()   
          
for record in SeqIO.parse(fastqfile,"fastq"):   
     
  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE   
   
  countseqs += 1   
     
print "Reading from", fastqfile   
print countseqs, "records analysed."   
NB - this script is easily converted into a fasta file, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
   
###################  
   
if (len(sys.argv) > 1):  
  fastqfile1 = open(sys.argv[1], "rU")  
   
   
elif (len(sys.argv) == 1):  
  print "Please supply a fastq filename (i.e. python script.py file.fastq)"  
  sys.exit()  
   
######################  
   
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ1_outfile.close()  
   
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
SEQ2_outfile.close()  
   
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
   
####################  
      
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  
  countseqs += 1 # counts number of sequences measured  
   
  seq1 = regex.search('(ATATATATAT){e<=1}', str(record.seq))  
   
  if seq1:  
   
    countSEQ_1 +=1  
   
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    output_handle.close()  
   
  else:  
                  
    seq2 = regex.search('(CGCGCGCGCG){e<=1}', str(record.seq))  
   
    if seq2:   
   
      countSEQ_2 +=1  
   
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
      output_handle.close()  
   
print "Reading from", fastqfile1  
print countseqs, "records analysed."  
print countSEQ_1, "Sequence 1s detected."  
print countSEQ_2, "Sequence 2s sequences detected."  

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.

----

Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.

1 comment: