Quick and dirty analysis of human microbiome data

The NCBI and NIH have released their Human Microbiome data to much fanfare [link] and so I thought it would be fun to download some data and play around with it.  I chose, at random, dataset SRX098549 which is a paired-end HiSeq lane.  Run SRR353629 claims to have over 100million paired-end reads.

The code below should be executable on any unix/linux environment with the correct software installed and in your path.  This ends up being a rather simplistic example, but it gives a flavour of what can be achieved.

Software requirements:

  • sra-toolkit
  • sickle
  • SOAPdenovo
  • BioPerl
  • Moose
  • Pfam
  • pfam_scan.pl

After filtering, this ends up being a small dataset, but I did this analysis on a Linux machine with 64Gb of RAM.

# Get some data (http://www.ncbi.nlm.nih.gov/sra/SRX098549):
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX098/SRX098549/SRR353629/SRR353629.lite.sra

# Information:
# Layout: PAIRED, Orientation: 5'-3'Forward, 5'-3'Reverse, Nominal length: 281, Nominal Std Dev: 0.0E0

# Split the data using the sra-toolkit.  At this stage I lose a LOT of data, which appear to be reads that are 100% N!
fastq-dump --split-spot --split-files --qual-filter SRR353629.lite.sra

# trim using sickle (stringent cut-off of Q30 and at least 50bp length)
sickle pe -f SRR353629_1.fastq.gz -r SRR353629_2.fastq.gz -t sanger -o SRR353629_1_trimmed.fastq -p SRR353629_2_trimmed.fastq \
                                                        -s SRR353629_single.fastq -q 30 -l 50

# create a SOAPdenovo config file with the following information and save it in "config.txt" (remove the first # from each line)

# #maximal read length
# max_rd_len=100
# [LIB]
# avg_ins=280
# reverse_seq=0
# asm_flags=3
# rank=1
# #fastq file for read 1
# q1=SRR353629_1_trimmed.fastq
# #fastq file for read 2 always follows fastq file for read 1
# q2=SRR353629_2_trimmed.fastq
# [LIB]
# reverse_seq=0
# asm_flags=1
# rank=1
# q=SRR353629_single.fastq

# run SOAPdenovo to assemble into contigs
SOAPdenovo31mer pregraph -s config.txt -K 21 -d -D -o SRR353629.pregraph
SOAPdenovo31mer contig -g SRR353629.pregraph > contig.out

# find out the N50
grep N50 contig.out

# map and scaffold
SOAPdenovo31mer map -s config.txt -g SRR353629.pregraph
SOAPdenovo31mer scaff -g SRR353629.pregraph -L 300 > scaff.out

# find out the N50
grep N50 scaff.out

# extract ORFs >= 200bp
getorf -sequence SRR353629.pregraph.scafSeq -table 11 -minsize 200 -outseq stdout > SRR353629_orfs.fasta

# annotate domains
# prepare Pfam

wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz  
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz
gunzip Pfam-A.hmm.gz Pfam-A.hmm.dat.gz  active_site.dat.gz
hmmpress Pfam-A.hmm

# run pfam_scan.pl
perl pfam_scan.pl -fasta SRR353629_orfs.fasta -dir . > SRR353629_annotated_domains.txt

And there we have it!  A very quick assembly and annotation of protein-coding genes in the human microbiome data.