Counting known microRNAs in five easy steps

I'm assuming, as you're here, that you know what microRNAs are.  Some of you might even have seen my talk at the "UK Next-generation sequencing" meeting in 2011 entitled "21 ‐ 23 facts; the devil in the detail of microRNA‐Seq".  For a small species of RNA, these guys sure have a big impact, and there have been tons of bioinformatics tools published to deal with microRNA sequencing data (microRNA-Seq).  However, the basic task of counting known microRNAs is incredibly simple and can be carried out by anyone with even the most basic Unix skills (I use Novoalign below as it has some really cool features).  The code below should be executable as long as you have Novoalign in your path.  Here's how:


# Five easy steps to counting known microRNAs in human
# 1. get an example dataset.  
#
# This particular dataset is from a human study published in Genome Research:
#
# Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao
# Y, McDonald H, Zeng T, Hirst M, Eaves CJ, Marra MA. (2008) Application of massively
# parallel sequencing to microRNA profiling and discovery in human embryonic stem
# cells. Genome Res. 18(4):610-21. [link]

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026762/SRR026762.fastq.gz

# 2. get the current set of miRBase mature sequences
wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz

# 3. extract only human microRNAs, and for convenience, convert Uracil to Thymine
zcat  mature.fa.gz | grep -A 1 "^>hsa" | grep -v "^-" | sed 's/U/T/g' > human_mature.fa

# 4. create a Novoalign index.  The basic version of Novoalign is free and available from http://www.novocraft.com
novoindex human_mature.nidx human_mature.fa

# 5. Search and count in one line
novoalign -f SRR026762.fastq.gz -F STDFQ -d human_mature.nidx -l 18 -s 1 -a -R 0 -r Random -t 0 \
                 | grep "^@" | awk '$9 != "" {print $9}' \
                 | sort | uniq -c | sort -n -r > counts.txt

# take a look at the results
head counts.txt

# One of the conclusions from the paper was "The most abundant miRNA in this library was miR-103". 
# Oh look, they missed one (I assume hsa-miR-92a-3p was discovered after 2008)

#  126476 >hsa-miR-92a-3p
#  120762 >hsa-miR-103b
#  120576 >hsa-miR-103a-3p
#   62704 >hsa-miR-25-3p
#   53155 >hsa-miR-92b-3p
#   49215 >hsa-miR-372
#   42536 >hsa-miR-378a-3p
#   37287 >hsa-miR-140-3p
#   36178 >hsa-miR-21-5p

And there we have it!  This was carried out on Scientific Linux 5 using Novoalign V2.07.18.  If you try this and encounter problems, e-mail me.  Please note, the $9 in the awk statement above can change depending on the format of your read identifier, so please check which column the hit identifier gets put into when running Novoalign.

Mick