jamimmunology: sed

Wednesday, 10 August 2016

Extracting sample names from Illumina FASTQ filenames

Another post in the vein of "minor analysis task and my solution": how to extract the sample name or identifier (i.e. what you called a given file in the samplesheet) from a list of Illumina FASTQ file names.

So you might have something that looks like this (which I've cobbled together from a couple of runs from different machines that I pulled off Basespace):

322_S18_R2_001.fastq.gz
1c8_P6_act_S11_R2_001.fastq.gz
5c4_P8_res_S1_R1_001.fastq.gz

In this exercise, I wanted the first part of the IDs, which are what we labelled the actual samples with. In an ideal world this should be very easy: we could just use sed or cut to delete everything after the first instance of a character used in the text that Illumina appends to the sample name. However different people (or the same people at different times) use different naming conventions, which can often include underscores, the character that separates the string fields that get added.

Therefore what we need to do is not delete from say the first pattern, but from the nth pattern from the end, which turned out to be a slightly trickier problem (at least to one such as myself that doesn't find sed to be the most intuitive program in the world). Here's what I came up with:

sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt

What this does is find the third underscore to last (just before S[XX]_R[X].fastq.gz string) and replace it with a pipe character ('|'), which should never appear in a filename, before finding this pipe character and deleting everything after it. This produces the following output:

322
1c8_P6_act
5c4_P8_res

This feels a bit hacky to me, as really I feel like I should be able to do this in one substitution, but the first part of the command actually selects the latter half (so if anyone knows how to tidy this up I'd love to hear it!).

It's also worth pointing out that the output of different machines or demultiplexing settings will differ in the number of underscores, and so the command will need to change slightly. For instance, the above example was run on NextSeq data demultiplexed using bcl2fastq --no-lane-splitting. For an example of a file containing lane information ('L001' etc), here's the equivalent on some MiSeq filenames I pulled up.

# This 
324bAB_S8_L001_R1_001.fastq.gz 
076-PBMCs-alpha_S5_L001_R1_001.fastq.gz 

# After this
sed 's/\(.*\)_\(.*_\)\(.*\)_\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt   

# Becomes this 
324bAB
076-PBMCs-alpha

The reason I was actually doing this was because I had a number of different NextSeq runs, some of which contained samples sequenced from the same original libraries (to increase the depth), so I needed to get the names of all samples sequenced in each run. I had each run in separate directories, and then ran this in the folder above:

for i in */
do echo $i
ls $i | grep R1 | grep fastq.gz | sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\
2/;s/|.*//' > ${i%?}.fl
done

This goes through all the directories, finds the R1 FASTQ files (as there are other files in there, and this gives me only one entry per sample) and strips their sample names out as above, and writes these to a textfile with the suffix '.fl'.

Thursday, 29 November 2012

Quick and dirty coding: count DNA base composition in bash

Another installation in my (completely non-ordered, non-structured) quick and dirty code series; here's a way to easily measure the base composition of all of the reads in a fastq file (if you want to check your GC content say) in bash.

I was going to say a quick way, but objectively, it's probably not. Bash is really not an appropriate language to be doing this kind of thing in, but if you're working a lot in the terminal for other things it's sometimes convenient to just have a stock of bash-only scripts for common tasks.

 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l

As I'm hoping for these to possibly provide some assistance to people who know less about bioinformatics and programming than me (perhaps the smaller end of the wedge, but still significant), I'm going to try and provide some comments and context, so that people can learn from these as well as just use them.

Of note in this example is the first sed term used here (between the echo and the pipe ("|")), which is one that I find myself using very frequently for preliminary analysis of NGS data. All it does it look through a text file and send every 2nd line of four (N-4d just ignores the Nth line of every four lines) and send that to standard out (so either to the terminal or the next process in the pipe)

In a fastq file this setup outputs the sequence line, but obviously it's very easily adapted to output any of the lines you're interested in, from any repetitively structured text file.

In case people might want a marginally more detailed readout (with a bit more of a time penalty), I threw together the following bash script, which I saved as BaseComposition.sh, which I've commented to give a little flavour of some of the things one can do in bash (however inappropriately). Note that bash by default can only do integer arithmetic; you need to use bc or awk or something else to use floats.

 # Jamie Heather, UCL, November 2012  
   
 # Short bash script to give a quick read-out of the DNA base composition of a fastq file  
 # Run with command:  
 # $ sh BaseComposition.sh FILENAME.fastq  
   
 # Takes an input b, then runs through following pipe, whereby:  
 # every second line of 4 line repeat (i.e. sequence line) taken | sed adds tildes between each character | tr replaces tildes for new lines | grep counts each character | wc counts each hit  
 # Then feeds each base into it's own variable  
 ACOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l)  
 CCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l)  
 GCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l)  
 TCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l)  
   
 # Adds up all the individual bases   
 TOTALBASES=$(($ACOUNT + $CCOUNT + $GCOUNT + $TCOUNT))  
   
 # Prints out both the number of hits for each base, and the percentage of that each base's contribution to the whole  
 echo "\n"  
 echo A bases: "\n"$ACOUNT "\n"Percentage A:  
 echo "scale=2;$ACOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo C bases: "\n"$CCOUNT "\n"Percentage C:   
 echo "scale=2;$CCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo G bases: "\n"$GCOUNT "\n"Percentage G:   
 echo "scale=2;$GCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo T bases: "\n"$TCOUNT "\n"Percentage T:   
 echo "scale=2;$TCOUNT/$TOTALBASES"*100 | bc