Wednesday 21 November 2012

Generating random amino acid/DNA/RNA sequences in Python

This is my first post in what I hope to be a recurrent theme, that of bioinformatic practice.

I am by training a wet-lab scientist; my degrees were in molecular genetics, and I'm now doing my PhD in immunology. I've always had a keen interest in bioinformatics, and always tried to do as much as I can, but I've not been formally trained (and I've never claimed to be the most natural mathmatician/statistician).

Accordingly, whenever I'm trying to do some new bit of coding or analysis, I usually like to look around in manuals, papers, blogs and forums to see what and how people are doing things, and see if I can prevent some arduous wheel-reinventing.

I've also noticed that a lot of the time some little snippet of info or code that would have proved very useful to me is omitted by more practised programmers, to whom it would have been so obvious it goes unsaid or forgotten.

In that vein, I plan to post some of the bioinformatic tricks and codes I find and use as I go, in the hopes that I can pay it forward to others who find myself in my position, that is (usually), knowing exactly what I want to do, just not knowing off-hand the names of the functions to do it.

So, kicking off, here's something that I've been meaning to code for a little while; a way to generate random biological sequences (be they protein or nucleic acid sequences), written in Python, based on this biostars post.

This example produces random peptide sequences. For my current purposes, I also wanted these peptides to be of variable length, around a given normal distribution (that I'd measured empirically from the data set I was making these random sequences to act as a control for).

Note that I also wanted to plot the distribution of the newly created peptides; if you didn't want that functionality you can just delete every line with a "# plot" comment.

import random
import matplotlib.pyplot as plt # plot

def random_AA_seq(length):
    return ''.join(random.choice('ACDEFGHIKLMNPQRSTVWY') for i in range(length))

list_size = 3000

lengths = []

for j in range(list_size):

    a = int(random.normalvariate(17, 2))
    lengths.append(a)  # plot
    print random_AA_seq(a)

plt.hist(lengths, bins=14) # plot
plt.show() # plot

You can adjust the distribution of lengths by changing the values fed to normalvariate; the former is the mean, the latter the standard deviation (I deduced the values of 17 and 2 here from the sample I was trying to match).

Then of course these is easily adapted to DNA or RNA by swapping 'ACDEFGHIKLMNPQRSTVWY' for 'ACGT' or 'ACGU'.

Also, thanks go to David Craft, whose blog post on Syntax Highlighter came in useful in this post!

No comments:

Post a Comment