viRome

We have developed code in R to analyse short-read next-generation sequencing data from virus-infection studies where the siRNA or piRNA pathways have been implicated.  The package provides a range of functions to help scientists visualise the results of such experiments.

viRome takes aligned data (in the form of a sorted, indexed BAM file) and produces a variety of graphs and reports.

viRome is completely open source and is available on Sourceforge: http://sourceforge.net/projects/virome

Tutorial

Please use version 0.6 or above for this tutorial - previous versions will not work.


1. Computing requirements

The role of viRome is to summarise and plot millions of data points - you will therefore likely need a powerful computer.  The package should work on standard desktops and laptops, but may take longer than expected. viRome was developed on machines with large amounts of RAM and significant processing power. 

The tutorial below takes approximately 30 minutes on a 64 bit Windows PC with 24Gb of RAM and a dual-core 2.4GHz processor.  If you find that the tutorial takes significantly longer, then please use a PC with more RAM and/or a faster processor.


2. Download and install R and necessary packages

You will need R: navigate to http://www.r-project.org and download and install the latest version of R for your operating system

The next step is to ensure we have all the dependencies installed.  You only need to run this once:

# get the necessary Bioconductor packages
source("http://www.bioconductor.org/biocLite.R")
biocLite("Rsamtools")

# install optional Bioconductor packages
# biocLite("seqLogo")
# biocLite("motifStack")

# install the necessary R packages
install.packages(c("plyr","gsubfn","seqinr","reshape2"))

# install optional R packages
# install.packages("ggplot2")


3. Install the viRome package

Next install and load the viRome package (http://sourceforge.net/projects/virome):

  • On windows, download the .zip file and install via the menu option "Packages -> Install packages from local zip files"
  • For linux and Mac, download the .tar.gz file from http://sourceforge.net/projects/virome. Run the command "R CMD INSTALL viRome_0.6.tar.gz"

4. Example data

The data we are using is from:

Vodovar N, Bronkhorst AW, van Cleef KW, Miesen P, Blanc H, van Rij RP, Saleh MC. (2012) Arbovirus-derived piRNAs exhibit a ping-pong signature in mosquito cells. PLoS One. 7(1):e30861. doi: 10.1371/journal.pone.0030861. [link]

The data are in the SRA here: http://www.ncbi.nlm.nih.gov/sra?term=SRR389184

We have distributed aligned data from this study with the viRome package - the data were aligned to the Sindbis virus genome using Novoalign (http://www.novocraft.com).

We recommend that you use BAM files that are sorted according to genomic location, and indexed.  This can be carried out using samtools (http://samtools.sourceforge.net/)


5. Minimal commands

Each function in viRome has a number of parameters, and all of these paramters have default values.  In order to read about these parameters, and to see their default values, please access the help for wach function:

?read.bam
?clip.bam
?barplot.bam
?size.strand.bias.plot
?summarise.by.length
?size.position.heatmap
?stacked.barplot
?position.barplot
?sequence.report
?make.pwm
?pwm.heatmap
?read.dist.plot

However, as we have already set the defaults, there are minimal commands which will create plots based on those defaults.  The mimimum information required is the data that the command must read to summarise and create a plot, and this is typically given as the first argument to the function.  Examples of these minimal commands, and the type of data they reuire, is below:

# load the library
# find example data
library(viRome)
infile <- system.file("data/SRR389184_vs_SINV_sorted.bam", package="viRome")

# minimal commands

bam    <- read.bam(infile, chr="SINV") # requires the full path to a bam file, and the name of the reference the data are aligned to
bamc   <- clip.bam(bam)                # requires only the output of read.bam()
bpl    <- barplot.bam(bamc)            # requires only the output of clip.bam()
ssp    <- size.strand.bias.plot(bpl)   # requires only the output of barplot.bam()
dm     <- summarise.by.length(bamc)    # requires only the output of clip.bam()
sph    <- size.position.heatmap(dm)    # requires only the output of summarise.by.length()
sbp    <- stacked.barplot(dm)          # requires only the output of summarise.by.length()
sir    <- position.barplot(bamc)       # requires only the output of clip.bam(), though one should alter
                                       # minlen, maxlen and reflen
sr     <- sequence.report(bamc)        # requires only the output of clip.bam()
pwm    <- make.pwm(bamc)               # requires only the output of clip.bam()
pmh    <- pwm.heatmap(pwm)             # requires only the output of make.pwm()
rdp    <- read.dist.plot(sr)           # requires only the output of sequence.report()

So those are the minimal commands, and every other paramter has default values.  We encourage you to read the help files to learn about these parameters in more detail.


6. Load data

We load data into R by using the load.bam function.  This function takes a number of parameters.  We first load the package, then locate an example BAM file that is distributed with viRome, and read in the data:

library(viRome)
infile <- system.file("data/SRR389184_vs_SINV_sorted.bam", package="viRome")
bam    <- read.bam(bamfile = infile, chr = "SINV", start = 1, end = 11703,
                   what = c("qname", "flag", "rname", "strand", "pos", "qwidth", "mapq", "cigar", "mrnm", "mpos", "isize", "seq"),
                   tag = c("NM"), removeN = TRUE)

To see documentation for all of the parameters to read.bam, access the help for the function:

?read.bam

The parameters we provide to read.bam are:

  • bamfile: location of the target BAM file on disk
  • chr: the name of the reference within the BAM file.  We can only load data from one reference at a time
  • start: the minimum base within the reference to load alignments from
  • end: the maximum base within the reference to load alignments from
  • what: the fields to load.  read.bam uses Rsamtools to load the data, see ?scanBam for an explanation of these fields
  • tag: the names of tags within the BAM file to load
  • removeN: whether or not to remove sequences that contain N's

We then simply need to deal with any soft- and hard- clipping that has been applied to the reads by the aligner, and calculate some statistics about it:

bamc <- clip.bam(bam)

To find out exactly what clip.bam does, you can read the help:

?clip.bam

However, here is an excerpt:

Aligned sequences are stored in BAM files with the caveat that they may have been soft-clipped - i.e. the clipped sequence was not used in the alignment, but is included in the BAM file. This function looks at the CIGAR string and clips data in the "seq" column appropriately.

For example, conside the CIGAR string "12S25M" and the sequence read "CACCCGAGAATACCCCAGAACCATTATGCTGTGACTT". The sequence read is 37bp long; however, the cigar string tells us that only 25 "matched" (25M) i.e. only 25 were used in the alignment. The CIGAR string also tells us that 12bp were soft-clipped (12S). We can tell that they were soft-clipped from the start of the read as 12S occurs before 25M

clip.bam parses the CIGAR string for every read, and calculates mapping statistics based on the information


7. Distribution of mapped read-lengths

Once the BAM file has been read into viRome, the first stage of any analysis is to look at the distribution of read lengths mapped to the genome.  A peak at 21-22bp may indicate a siRNA response, and a peak at 25-29bp may inidcate a piRNA response:

b <- barplot.bam(vdf = bamc, minlen = 1, maxlen = 37, poscol="red", negcol="green",
                 main = "Sequence length distribution", xlab = "Map length", ylab = "Count",
                 legend = c("+ve strand", "-ve strand"),
                 legendx = NULL, legendy = NULL)

The parameters here can be used to set the min and max length of reads to consider, the colours of the bars, the plot and axes titles and the location and text of the legend.  This command will produce a plot:

A feature of viRome is that many of the plotting functions also return the data to the user:

> b
   lgth   pos   neg
1     1     0     0
2     2     0     0
3     3     0     0
4     4     0     0
5     5     0     0
6     6     0     0
7     7     0     0
8     8     0     0
9     9     0     0
10   10     0     0
11   11     0     0
12   12     0     0
13   13     0     0
14   14     0     0
15   15     0     0
16   16     0     0
17   17     0     0
18   18   984   419
19   19  1209   791
20   20  2121  2159
21   21 22079 26241
22   22 11137 11640
23   23  6703  7094
24   24  8028  6325
25   25 10753 10477
26   26 19802  9427
27   27 36931 14240
28   28 40081 13852
29   29 29943 11737
30   30  7935  4757
31   31  1957  1319
32   32   534  1862
33   33   180   137
34   34    27    81
35   35     6    23
36   36    12    19
37   37     0     0

One of the advantages of this is you may wish to plot the data using a different package, e.g. ggplot2:

library(ggplot2)
x <- data.frame(loc = rep(b$lgth, 2), strand=rep(c("pos","neg"), each=37), count=c(b$pos,b$neg))
ggplot(x,aes(x=loc,y=count, fill=factor(strand))) + geom_bar(stat="identity", position="dodge") + scale_color_discrete()

The above will produce

We can also make the plot look a bit nicer using options built into viRome:

b <- barplot.bam(vdf = bamc, minlen = 17, maxlen = 37, poscol="blue", negcol="yellow",
                 main = "Sequence length distribution", xlab = "Map length", ylab = "Count",
                 legend = c("+ve strand", "-ve strand"),
                 legendx = 17, legendy = NULL, down=TRUE, space=c(0,0))

We have changed the colour scheme, and made the counts on the negative strand point downwards:

From inspecting this image we can see a number of patterns in the data

  1. There is a peak at 21bp, suggesting a siRNA response
  2. There is also a peak betwen 25-29bp, suggesting a piRNA response
  3. The number of reads on the positive and negative strands for 21bp reads appears equal
  4. There appear to be more 25-29bp reads on the positive strand compared to the negative strand

We can further visualise a potential strand bias using the size.strand.bias.plot function:

size.strand.bias.plot(b, mar=c(4,4,3,1), pch=".", tpos=3, cextxt=0.8)

Here we provide the function with "b", the output of the barplot.bam function.  We also set some additional plotting parameters, and we end up with this:

This is an XY-scatterplot of counts on the positive and negative strand for each read-length.  The red line indicates y=x.  We can see that for read lengths 26, 27, 28 and 29 there is a bias towards the positive strand.


8. Position of reads mapped to the genome: heatmaps

So, from the above analysis we have evidence that there is both an siRNA and a piRNA response in this sample.  We also have a suggestion that there may be a strand bias in the 25-29bp reads. 

The next stage is to analyse where the reads are mapping in the genome.  We will first take a global view and look at all read lengths and all positions in the reference.

To do this, we must count the occurrence of each read length at each position in the genome.  The function we use to do this is called summarise.by.length.  This function requires a lot of RAM and processing power.  However, on the windows PC specified above, these commands take approximately 20 seconds:

dm <- summarise.by.length(bamc)
dmp <- summarise.by.length(bamc, strand="pos")
dmn <- summarise.by.length(bamc, strand="neg")

First, we cound occurrence for both strands, and store those counts in a variable called "dm".  We then calculate similar summaries, but this time limit to the positive and negative strands respectably.

This will return a data matrix, the rows of which are genomic positions, the columns of which are read lengths and the values of which are counts:

> dm[1:10,]
   18 19 20 21 22 23  24  25 26 27 28 29 30 31 32 33 34 35 36
1   1  0  7  6 67 66 211 156 58 62  8 26  0  0  0  0  0  0  0
2   0  0  0  0  7  5  17   9  2  7  0  1  0  0  0  0  0  0  0
3   0  1  5  0 98 26  17  31  5 28  0 13  0  0  0  0  0  0  0
4   0  0  7  0  3  7  15   3  4  2 14 22  5  0  0  0  0  0  0
5   0  0  0  0  0  0   0   0  0  0  0  0  0  0  0  0  0  0  0
6   0  0  0  0  7  0   0   1  0  0  0  0  0  0  0  0  0  0  0
7   0  0  0  1  0  0   0   3  0  0  0  0  0  0  0  0  0  0  0
8   0  0  0  0  0  3   0   4  0  0  0  0  0  0  0  0  0  0  0
9   0  0  0  0  0  0   0   0  0  0  0  0  0  0  0  0  0  0  0
10  0  1  0  0  0  5   0   0  0  0  0  0  0  0  0  0  0  0  0

We can now use these in a range of plotting functions.

The first function we will use is size.position.heatmap:

size.position.heatmap(dm, mar=c(3,3,2,1))

This draws a heatmap, with read lengths as rows and genomic position as columns.  With default parameters, the data are scaled for each genomic position:

From this we can see the distribution of read lengths mapped throughout the genome.  We can see hot-spots for each read length, and there is a suggestion that the 21bp reads are more evenly distributed throughout the genome.  We can look at the data another way using additional options:

size.position.heatmap(dm, scale=FALSE, log=TRUE, mar=c(3,3,2,1))

With these parameters, we are telling the function to log the data (this helps with scale), not to use R's in-built scale function and to adjust the plotting margins.  The command produces:

As we have log transformed the data, and chosen not to scale it, we can begin to see additional patterns - specifically, we can see that there is a hotspot for 25-29bp reads around position 8000 in the genome.

Both of these plots show counts for both strands - but what about strand bias?  Well, with the following commands we can use split.screen() to compare counts for both strands with those for the positive and negative strands:

split.screen(c(3,1))
screen(1)
size.position.heatmap(dm, log=TRUE, scale=FALSE, mar=c(2,2,2,1), main="Both")
screen(2)
size.position.heatmap(dmp, log=TRUE, scale=FALSE, mar=c(2,2,2,1), main="Pos")
screen(3)
size.position.heatmap(dmn, log=TRUE, scale=FALSE, mar=c(2,2,2,1), main="Neg")
close.screen(all=TRUE)

Here we are using split.screen to create a plot with three graphics as 3 rows and 1 column.  We then plot data for both strands at the top, the positive strand in the middle and the negative strand at the bottom:

This image indicates yet another pattern - not only does there appear to be a peak of 25-29bp matches around position 8000, but it also appears to be more prominent on the positive strand.


9. Position of reads mapped to the genome: barplots

Using the data we calculated above, we can also examine the distribution of reads using barplots.  The first function we will use looks at all reads across all positions, and plots a barplot for each read length:

stacked.barplot(dm, internal.margins=c(0,2,0,1), skip.x=4, main.adj=0.5)

This command takes "dm", which includes counts for both strands and creates a stacked barplot.  We use parameters to adjust the plotting margins, to adjust the size of the title for each plot and to only plot an x-axis every 4 plots:

The default colour scheme here is "rainbow", which can be difficult to see.  We can change this:

stacked.barplot(dm, internal.margins=c(0,2,0,1), skip.x=4, main.adj=0.5, col.fun=colorRampPalette(c("green", "red"), space = "rgb"))

 

Here again we can see the pattern that 21 and 22bp reads seem to be distributed throughout the genome, but that 25-29bp reads are a bit more localised, with a particularly large peak around position 8000.

As of version 0.7, we have added two new parameters to stacked.barplot.  The first, bicol, allows the user to provide a vector of two colours, and those colours will be used for alternate graphs:

stacked.barplot(dm, internal.margins=c(0,2,0,1), skip.x=4, main.adj=0.5, bicol=c("red","black"))

The second, samey, forces stacked.barplot to create the same Y-axis for each of the stacked barplot (by default, each has its own):

stacked.barplot(dm, internal.margins=c(0,2,0,1), skip.x=4, main.adj=0.5, samey=TRUE)

As can be seen above, due to the predominance of the longer reads, the bars for the shorter reads do not appear - however, this plot accurately relays that information.

It's now time to focus in on specific patterns in the 21-22bp (siRNA) and 25-29bp (piRNA) data.


10. Genomic position of 21-22bp reads

The above plots are great for looking at global patterns, but we can now focus in on particular subsets of reads.  For example, to look at the genomic position of the 21-22bp reads (siRNA response):

sirna <- position.barplot(vdf = bamc, minlen = 21, maxlen = 22, reflen = 11703, samp = "SRR389184")

This function takes the output of clip.bam, filters on reads longer than minlen and shorter than maxlen and produces a barplot:

Using this image we can get a far more accurate picture of where in the genome the 21-22bp reads are mapping, and where hotspots exist.

As with the barplot.bam function, here the data are returned should you wish to plot the data using another function:

> sirna[1:20,]
   position poscount negcount
1         1       32       41
2         2        0        7
3         3       73       25
4         4        0        3
5         5        0        0
6         6        0        7
7         7        0        1
8         8        0        0
9         9        0        0
10       10        0        0
11       11        0        0
12       12        0        1
13       13        0        0
14       14        0        0
15       15        0        0
16       16        0        0
17       17        0        0
18       18        0        2
19       19        0        0
20       20        0        1

11. Genomic position of 25-29bp reads

We can also use the same function to look at 25-29bp reads (the piRNA response:

pirna <- position.barplot(vdf = bamc, minlen = 25, maxlen = 29, reflen = 11703, samp = "SRR389184")

In contrast to the 21-22bp reads, we see far clearer localisation of the 25-29bp reads, specifically:

  1. A large peak on the positive strand around position 8000
  2. The above peak is mostly concentrated on the psoitive strand
  3. A smaller peak just after around position 2100, again focused on the positive strand
  4. A large stretch in the middle of the genome with very few 25-29bp reads

At this stage this is an explanatory analysis - we make no attempt to interpret these patterns.  The purpose of viRome is merely to find and present those patterns.  Differences in read lengths, position in the genome and biases in the strand may be due to issues with the sequencing technology used, or indeed the software used to map the data.  These options should be eliminated before interpreting the data in terms of a biological phenomenon.

However, we can speculate from these graphs that, whilst the siRNA response seems to occur throughout the genome, the piRNA response appears more localised, and we can design experiments to attempt to confirm these hypotheses.


12. Generating a sequence report

The SAM/BAM format is well known to bioinformaticians, however, often we need to summarise and count each individual sequence and where it maps.  This is done using the sequence.report function:

sr <- sequence.report(bamc, minlen=1, maxlen=37)

Again, the function can be used to look at all reads, or be limited to subsets.  An example of the report is below:

> sr[1:10,]
 ref pos strand                           seq len count
 SINV   1      +        ATTGACGGCGTAGTACACACTA  22    32
 SINV   1      +       ATTGACGGCGTAGTACACACTAT  23    26
 SINV   1      +      ATTGACGGCGTAGTACACACTATT  24    68
 SINV   1      +     ATTGACGGCGTAGTACACACTATTG  25    18
 SINV   1      +    ATTGACGGCGTAGTACACACTATTGA  26     3
 SINV   1      +   ATTGACGGCGTAGTACACACTATTGAA  27    15
 SINV   1      + ATTGACGGCGTAGTACACACTATTGAATC  29    26
 SINV   1      -            ATTGACGGCGTAGTACAC  18     1
 SINV   1      -          ATTGACGGCGTAGTACACAC  20     7
 SINV   1      -         ATTGACGGCGTAGTACACACT  21     6

This can be exported to csv format using write.csv, and opened in Excel.


13. Base composition

Research has shown that RNAs within the piRNA pathway often show a U1(T1) and A10 bias - that is, far more often than you would expect by chance, piRNAs have a U/T at the first base and an A at the tenth base. 

NOTE some users have pointed out that we should be showing U instead of T (as we are representing RNA).  However, we disagree.  It is a technicality, but we have not actually measured RNA, we have measured the cDNA form of that RNA.  As far as we know, the Helicos system is the only system to directly sequence RNA, all other systems (including the Illumina/Solexa system used in this data) converts RNA to DNA first.  As we have measured DNA, we show T and not U.

Relative frequencies of the four bases can be calculated using the make.pwm function:

pwm1 <- make.pwm(bamc, minlen=25, maxlen=29, strand="neg")
pwm2 <- make.pwm(bamc, minlen=25, maxlen=29)

This function takes the output of clip.bam, limits to reads within a certain range, and calculates counts based on reads mapping to the positive (default) or negative strand.

The data returned are pretty self explanatory - rows are bases, columns are positions and the values are relative proportions:

> pwm1[,1:8]
           1         2         3         4         5         6         7         8
A 0.07749485 0.2313964 0.2165637 0.2318986 0.2419433 0.2274287 0.2203807 0.2227747
C 0.05628380 0.2211173 0.3201078 0.1700902 0.2330705 0.1811227 0.2174342 0.1820602
G 0.07272362 0.3965480 0.2024676 0.3793883 0.3017093 0.3421224 0.2726299 0.2966367
T 0.79349773 0.1509383 0.2608608 0.2186229 0.2232769 0.2493262 0.2895552 0.2985285

We can visualise these in a number of ways.  Firstly, using the viRome function pwm.heatmap:

pwm.heatmap(pwm1, col.fun=colorRampPalette(c("green","red"), space="rgb"))
pwm.heatmap(pwm2, col.fun=colorRampPalette(c("green","red"), space="rgb"))

These commands produce:

Using a green-to-red colour scheme, we can clearly see the T1 and A10 biases.

We can also visualise the data as sequence logos, using the (third party) seqLogo package:

library(seqLogo) # must have this installed!
seqLogo(pwm1)
seqLogo(pwm2))

Of course, it also possible to use the humble barplot function to more accurately visualise the proportions:

barplot(pwm1, col=rainbow(4), legend.text=rownames(pwm1), args.legend=list(x=39), xlim=c(0,45))


14. 5' read-distance plots

Finally, we plot the distance between the 5' ends of overlapping reads on opposite strands.  A peak at 10bp has been shown to be a signature of the piRNA response.  The read.dist.plot function takes every read mapped to a given strand and within a given size range, counts the frequency of that read, and then counts the frequency of reads mapping to every location on the opposite strand within a given window.  A range of summary functions are offered by the function.  Despite the report of this 10bp distance in the literature, this paper represents the first full and accurate description of how the data may be summarized and plotted, and viRome represents the first software implementation.

rdp <- read.dist.plot(sr, minlen=25, maxlen=29, method="add")

Again, the data are returned to the user, so if you choose to you can plot using the ggplot function:

ggplot(rdp, aes(x=loc, y=count)) + geom_line(colour="blue", size=1.3)


And there we have it!

viRome is under active development and a manuscript is under review.  Please check back regularly for updates.

Comments, suggestions etc to mick.watson@roslin.ed.ac.uk