
Thursday, 17 October 2013

Good pipetting rules of thumb


Accurate micro-pipetting is probably the single most important physical skill that a molecular or cell biologist needs to learn. It is the foundation upon which the vast majority of our experiments are built.

Pipettes are to us as brushes are to painters. They are the tools by which we achieve our aims; any mistake in pipetting at any stage stands to affect the outcome of our experiment.

Understandably, there are many resources available to help inform best pipetting practice, which I recommend budding biologists make use of. This is a nice one, and there are plenty of (cheesy) videos around too. People should also know a bit (if not a lot!) about how they work, how they can go wrong, and the different ways in which they can be used – there’s an unwieldy if thorough guide covering these here.

However, all this advice tends to focus on what’s best for the experiment – which is important – but overlooks another concern: your body.

Experiments can often involve highly repetitive movements, for extended periods of time. This isn’t that great on your mechanisms, particularly those of your thumb and wrist (as well I know!), and can cause repetitive strain injuries. There are a number of ways you can cut down on these risks, but here’s a quick rundown of the ones I think are worth bearing in mind:

- Don’t rush. Be fast, by all means – if you’re able – but hurrying usually means you do things the wrong way (in terms of both your data’s safety and your own)

- Set up your bench. My bench might look a mess when I’m working, but it’s organised so that I can access everything I need without stretching or moving lots of stuff around. Which leads onto…

- Minimise. The injuries you’re at risk of are cumulative in nature, so you want to get away with the minimum force and number of repetitions you can. Maximise use of master mixes, reverse-pipette when needed, use multi-channel/step pipettes where possible, and for the love of all that is good, don’t bang your pipette tips on as hard as you can! (This is a bit of a bugbear of mine; the harder you push a tip on, the more force it takes to get it off! It’s also damaging to the pipette in the long run. Gentle but firm pressure with a slight twist should be more than enough to form a seal strong enough for aspiration)

- Mix it up. If you’re able, swap hands every now and then to split the load. Admittedly, most people’s weaker hand is probably less reliable than their stronger one, but accuracy’s sometimes less important – for instance, a lot of your larger-volume wash steps can probably afford to take the hit

- Take breaks. The most effective, but least practical option (and one I am guilty of ignoring most of the time!). Try to break up your experiments, either with pauses or with activities that don’t put so much pressure on your wrists and hands (so you can still be working, fear not PIs!).

The best part is that most of these tips help improve both the accuracy and speed of your work too.

Remember, you only get one pair of hands (if you’re lucky), so don’t put them at risk for your work, no matter how important you think it is at the time. Speaking from experience, it can also impact directly on your research, as my current bout of tenosynovitis (damn you qPCR!) put me out of commission lab-wise for a fortnight*. It’s in everyone’s best interest for you to take the time and effort to pipette safely, and that includes your data.

OK now I’m anthropomorphising results. Time to end blog post.

* I know it seems rich me writing this article after admitting I have the injuries I claim knowledge of preventing, but it's a bit of a 'do as I say and not as I do' situation. I happen to have always had dodgy wrists, no doubt from a childhood spent on computer games, an adolescence spent typing, and an adulthood spent pipetting. I think the addition of thumb-swipey smart phone ownership was the last straw.

Monday, 30 September 2013

5 minute bash tutorial - how to win the lottery!

Apologies for the slightly misleading title. To expand...

In my office, a number of us are involved in analysis of large datasets, often sequence or text files. Each of us has our own little habits and idiosyncrasies; given the same task, we each find different ways to go about it.

A running joke with one of my friends is that I'll persevere with getting a long, bafflingly piped bash command working long after she'll have resorted to scripting something in a higher level language.

(In all fairness we both do both, but I do find something deeply satisfying about chaining everything together in one line, no matter how human-unreadable it may be.)

Last Friday, this topic came up during a conversation about the upcoming Euromillions lottery. As a bit of a joke, I whipped up a quick line of bash to generate compatible random numbers (NB - I actually wrote it wrong in the tweet!).

Today I realised that if I explained the steps in the process, it might help someone new to bash understand how best to get some more functionality out of the command line.

So, here it is (with the error removed):
 for i in {1..50}; do echo $i >> win1; done; shuf win1 | head -5; for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2; rm win*  

This should produce output that fits the Euromillions format; five numbers between one and fifty, with two 'star' numbers between one and eleven, e.g.:
22
4
2
15
6
7
2

Let's break this down.

 for i in {1..50};  

This sets up a for loop, which will iterate through every integer in the range 1 to 50. For loops in bash take the format:

 for VARIABLE in THING  
 do   
 SOMETHING  
 SOMETHING_ELSE  
 done  

Here, the VARIABLE is the thing that changes each iteration, and THING is the range or list of items. SOMETHING and SOMETHING_ELSE are then commands you want to run within the loop, which have to be sandwiched between 'do' and 'done'. More bash loop examples/explanation here.
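To make that concrete before we get back to the lottery line, here's a tiny stand-alone loop (the 'number' message is just an example of mine):

```shell
# print a short message for each integer from 1 to 3
for i in {1..3}
do
  echo "number $i"
done
```

Run that and you'll get "number 1" through "number 3", one per line.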

However, all those carriage-returns take up far too much time (and make it far too comprehensible) - I tend to substitute those for semi-colons, which allow you to concatenate multiple commands into one-liners.

So, our one-liner first loops from 1 to 50, then what?

 do echo $i >> win1; done;  

Here we have the contents of that first loop, sandwiched between 'do' and 'done' as per requirements, so that bash knows what to do at each turn of that loop.

Echo is the unix equivalent of 'print'; it just outputs something you tell it to the screen (or somewhere else).

In this case, we echo $i, where i denotes the variable we set up in the loop; we just need the dollar sign to let bash know we're asking for the contents of a variable. So, in the first cycle of the loop $i = 1, then in the next $i = 2, all the way up to $i = 50, which is when the loop stops.

Now if we just had 'do echo $i' in our loop, we'd just get the numbers 1 to 50 printed out in the terminal. However, we've used '>>' to redirect the numbers away from the terminal, instead putting them in a new file called 'win1'.

By using '>>' instead of '>', each cycle of the loop appends the new number to the end of the file win1 - if we used just one > each iteration of the loop would overwrite the whole file, leaving win1 at the end containing just the number 50.
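You can prove this to yourself with a quick throwaway test (demo.txt is just a scratch file name of my choosing):

```shell
echo 1 > demo.txt    # overwrites: file now contains just "1"
echo 2 > demo.txt    # overwrites again: file now contains just "2"
echo 3 >> demo.txt   # appends: file now contains "2" then "3"
cat demo.txt
rm demo.txt          # tidy up the scratch file
```

The cat at the end prints 2 and then 3 – the first echo's output was clobbered by the second.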

So now we have the file win1 containing the numbers 1-50, each on its own line.

 shuf win1 | head -5;  

This is the bit of the code that picks the numbers.

Shuf simply shuffles all of the lines in a file into a random(ish) order. It's an incredibly useful command – one I kicked myself for having taken so long to discover. In this instance, it mixes up all the lines in win1, so they will no longer be in numerical order.

Here's where we first encounter pipes - that vertical bar between shuf and the next command is said to 'pipe' (or carry) the output of the first command into the second.

We take the randomly shuffled win1 output, and pipe it into head (one of the commands you'll probably end up using most!) to take the top five lines, giving you your first five lottery numbers.

The next bit just repeats the same methodology for the lucky star balls (two numbers between one and eleven), outputting them after the previous five:

 for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2;

Finally, a bit of tidying up:

 rm win*  

Rm removes files - here I've used the '*' wildcard character to make it remove all files which begin 'win', thus removing the win1 and win2 files created earlier. This is also vital if you're going to run the script more than once, as otherwise you might end up drawing the same number twice (as re-running the loop would add yet more numbers to the files).

Note that rm can be dangerous - by default, it won't double check if you're sure you want to delete something, and there's basically no going back, so be careful! For instance, by not explicitly stating the files I wanted to remove, this line would also remove any other files which fit the pattern - e.g. if I had a file called 'winning_lottery_numbers.txt' in the same directory, it would have been deleted too!

This can be overcome by being explicit in the files you're removing, like so:

 rm win1 win2   

However if you have more than a few files, you're probably going to want to use the wildcards sensibly.
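As an aside, GNU shuf can actually generate and sample number ranges itself (via its -i and -n options), which does away with the temporary files - and therefore the rm - entirely:

```shell
# five main numbers between 1 and 50
shuf -i 1-50 -n 5
# two 'lucky star' numbers between 1 and 11
shuf -i 1-11 -n 2
```

I've stuck with the loop version above because it demonstrates more of the building blocks, but this is the tidier way to do it.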

There you have it - a silly bit of code, but hopefully a useful one to a bash novice!

In case you're wondering, I did a small bit of rigorous, carefully controlled science with my script - I bought a ticket using numbers generated with either my code, or with the lottery's own 'lucky dip'. My code won nothing, but I got £3 from the lucky dip! Pretty conclusive, we can all agree.

Tuesday, 10 September 2013

Python fastq analysis for beginners


It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I thought I'd continue my vague hopscotch bioinformatics for beginners series. Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storing next generation DNA sequencing data. Each sequence gets four lines in the file: one for the sequence identifier, one for the nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be of equal length).
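For reference, a single (entirely made-up) record looks something like this – note the sequence and quality lines are the same length:

```
@READ_1 example identifier
GATTACAGATTACA
+
IIIIIHHGGFFEED
```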

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practice, but often when I want to code something I can't immediately think of a solution for, I just go combing tutorials, manuals and forums, finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq")

for record in SeqIO.parse(fastqfile, "fastq"):

  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE

  countseqs += 1

print("Reading from", fastqfile.name)
print(countseqs, "records analysed.")

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input
import sys # module needed for command line argument to get fastq name

if len(sys.argv) > 1:
  fastqfile = open(sys.argv[1])

elif len(sys.argv) == 1:
  print("Please supply a fastq filename (i.e. python script.py file.fastq)")
  sys.exit()

countseqs = 0

for record in SeqIO.parse(fastqfile, "fastq"):

  ######## PUT WHATEVER YOU WANT TO HAPPEN IN HERE

  countseqs += 1

print("Reading from", sys.argv[1])
print(countseqs, "records analysed.")

NB - this script is easily converted to read fasta files, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
   
###################  
   
if len(sys.argv) > 1:
  fastqfile1 = open(sys.argv[1])

elif len(sys.argv) == 1:
  print("Please supply a fastq filename (i.e. python script.py file.fastq)")
  sys.exit()
   
######################  
   
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ1_outfile.close()  
   
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
SEQ2_outfile.close()  
   
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
   
####################  
      
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  
  countseqs += 1 # counts number of sequences measured  
   
  seq1 = regex.search('(ATATATATAT){e<=1}', str(record.seq))  
   
  if seq1:  
   
    countSEQ_1 +=1  
   
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    output_handle.close()  
   
  else:  
                  
    seq2 = regex.search('(CGCGCGCGCG){e<=1}', str(record.seq))  
   
    if seq2:   
   
      countSEQ_2 +=1  
   
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
      output_handle.close()  
   
print("Reading from", sys.argv[1])
print(countseqs, "records analysed.")
print(countSEQ_1, "Sequence 1s detected.")
print(countSEQ_2, "Sequence 2s detected.")

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.
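If you don't have the regex module to hand, here's a rough pure-Python sketch of what that {e<=1} condition is doing: checking whether any window of the read sits within one edit (substitution, insertion or deletion) of the pattern. The function names are mine, and the real module is far more efficient - this is purely to illustrate the idea:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

def fuzzy_hit(pattern, read, max_errors=1):
    """True if some substring of read is within max_errors edits of pattern."""
    p = len(pattern)
    for start in range(len(read)):
        # windows one shorter and one longer allow for indels
        for length in (p - max_errors, p, p + max_errors):
            window = read[start:start + length]
            if window and edit_distance(pattern, window) <= max_errors:
                return True
    return False
```

So fuzzy_hit('ATATATATAT', 'CCATATATGTATCC') comes back True (one substitution away), while a read of pure Cs would not.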

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.

----

Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.
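For comparison, here's about the most minimal fastq reader you can write in plain Python. This isn't Heng Li's readfq - which is cleverer, coping with multi-line fasta as well - just my own sketch that leans on the strict four-lines-per-record structure:

```python
def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a fastq file handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' description line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual  # strip the leading '@' from the header
```

Because it skips all of the quality-score decoding, something like this is far quicker than SeqIO when all you need is the sequences themselves.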

Thursday, 15 August 2013

5 minute PyMOL tutorial


I'm very lucky to have a pretty great PhD project, deep-sequencing T cell receptor repertoires. It's an interesting, exciting field, that I have only one major complaint about - I don't get to make any pretty pictures.

It sounds like a minor complaint, but it makes a difference when it comes to presentation time. My main output is sequence data; at best I produce some nice graphs, which rather lack the impact of say, some beautiful confocal microscopy.

In order to get nice pictures to open a presentation on, I like to just fire up some molecular visualisation software and get some nice screenshots of whatever molecules are relevant to the talk. In case others wanted to do similarly - make nice pictures, without needing to become an expert in molecular visualisation or modelling - but didn't know how, here's a mini-tutorial as to how I do it.

First off, you need some visualisation software. If you're a student or teacher, I think you can get the excellent PyMOL for free, which I'll use in this blog, although there are others. I personally cut my visualisation teeth on a combination of Jmol, and (the now decidedly retro) RasMol - what you learn in one is broadly applicable to the others.

Let's get a structure to work on - head on over to the Protein Data Bank and look up something you're interested in. You can search by molecule name, ID (if you have it, you can find these in structural papers) or author name.

We need to download the PDB file for a molecule of interest, which is basically just a text file containing a set of x, y, z coordinates for all the atoms in that molecule (as described by solving the crystal structure, for example).

I've downloaded the PDB for 1MI5, a crystal structure for the LC13 TCR bound to HLAB8 complexed with a peptide from the EBNA 3A molecule of Epstein Barr Virus.



Fire up PyMOL - there are two windows, a viewer and an interface - and open your PDB (note that if you already know the ID of your molecule, you can use this to open the structure without downloading the PDB yourself, via the fetch command, e.g. 'fetch 1zgl').

At first it'll probably look like a big mess.


Orient yourself. It's pretty easy to move about with the mouse; it's just hard to remember which button does what. Handily, the designers provided a nice little reminder in the bottom right hand side.


Basically, the left mouse button rotates, the middle button translocates, and the right mouse button zooms. Have a play and get used to it - though it probably doesn't matter much anyway; if you're anything like me you'll forget and have to rediscover it every time.

Now let's make it look pretty.

By default, all molecules are selected. We can use the five buttons in the top right to perform various tasks on whatever's currently selected.

The left column represents the different selections. The buttons on the right are the tasks that can be applied to those selections. Left to right: Action; Show; Hide; Label; Color
Unless I'm looking at specific interactions, I like cartoon display, as I think this makes it easiest to see generally what goes where in a structure, particularly if there's more than one polypeptide present. Click 'S' for show, then choose cartoon.


It looks somewhat fuzzy now, as it's showing the cartoon display overlain on everything else that was there. To get rid of all the fluff, go to the 'H' (hide) menu, and hide the lines (the default layout) and the waters (the sporadic red dots).

Much better.

Now we want to be able to tell our different molecules apart. This is where the PDB web entry for our molecule really comes in handy, as it tells us which letter is used to refer to each separate molecule. This allows us to make different selections, which we can then use to apply actions to particular molecules or subsets of molecules.

So, looking at our entry, I can enter the following code into the command line area of the interface window to generate our selections.

 select MHC, (chain a)  
 select B2m, (chain b)  
 select peptide, (chain c)  
 select alpha, (chain d)  
 select beta, (chain e)  
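
Once those selections exist, you can also colour them from the command line rather than the menus, using PyMOL's color command - the colour choices here are just mine:

```
color green, MHC
color cyan, B2m
color yellow, peptide
color red, alpha
color blue, beta
```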

TCR-people, note that for the three or four other TCR-pMHC complexes I've tried this on, they've all used the same letters for each chain, so this might be the convention.

You can then use the 'C' color button on each selection to colour them separately. I've arranged the molecule so you see down the groove of the MHC molecule, as this shows the interaction site of the whole complex well.


However, I always think it's a bit hard to see the peptide in amongst all the other molecules, so I've gone back and changed it to show as spheres.

After that, we just need to save a high-resolution image, which we can get through ray tracing:

 ray 2400, 2400  

Just save the image after ray tracing. I then cropped this down using Gimp.
All done!

When using crystal structures that you didn't produce, remember to give proper attribution; either cite the original paper, or provide the PDB ID.

For a more in-depth understanding of PyMOL, why not read the manual or check out the PyMOL wiki.