# Next-gen sequencer output

As both a bioinformatician and Director of a genomics facility, one question I get asked repeatedly (often by myself) is: "How much disk storage do we need?"

Given that we are always planning for growth and new equipment purchases, I decided to write an R script to do this for me :)  Here it is:

```r
# number of flowcells.  Four flowcells == 2 HiSeq 2000 instruments
nin = 4

# types of run
rtypes = c("100bp PE", "36bp SE")

# how often the types are run (should sum to 1)
# here we predict 75% of our runs will be 100bp PE and 25% will be 36bp SE
prun = c(0.75, 0.25)

# Output per run (Gb)
# one value per run type.
# 250Gb from a HiSeq 100bp PE run, 25Gb from a HiSeq 36bp SE run
rout = c(250, 25)

# Probability of analysis needed
# one value for each run type
# what is the probability that that type of run requires analysis?
# each value should be <= 1
# Some collaborators just want FastQ, others require us to analyse the data
# For each 100bp run, 75% of projects will require downstream analysis.
# For each 36bp run, 100% of projects will require downstream analysis
pan = c(0.75, 1)

# ratio of analysis data (how much the raw data grows when you analyse it)
# one value per run type, multiplicative.  Here we say the data doubles
# in size when we analyse it
arat = c(1, 1)

# number of runs per year: the average number of times each flowcell
# (counted in nin) will be run
nrun = 30

# vectors for the output
run <- vector()
flowcell <- vector()
type <- vector()
raw <- vector()
analysis <- vector()
cumulative <- vector()

# the cumulative output in Gb
cum <- 0

# iterate over the runs
for (i in 1:nrun) {

    # iterate over the flowcells
    for (j in 1:nin) {

        # what type of run is this?
        pr <- runif(1)
        idx = 1
        psum = prun[idx]
        while (pr > psum) {
            idx = idx + 1
            psum = psum + prun[idx]
        }

        # idx is the index of our run type
        ct <- rtypes[idx]

        # idx is the index of our raw output
        co <- rout[idx]
        cum <- cum + co

        # do we need analysis?
        cpa <- pan[idx]
        ca <- 0
        if (runif(1) < cpa) {
            ca <- arat[idx] * co
            cum <- cum + ca
        }

        # create the output
        run <- append(run, i)
        flowcell <- append(flowcell, j)
        type <- append(type, ct)
        raw <- append(raw, co)
        analysis <- append(analysis, ca)
        cumulative <- append(cumulative, cum)
    }
}

out <- data.frame(run=run, flowcell=flowcell, type=type,
                  raw=raw, analysis=analysis, cumulative=cumulative)

# run summary
rsum <- aggregate(out$cumulative, by=list(run=out$run), max)
colnames(rsum) <- c("run", "Gb")
rsum$Tb <- rsum$Gb / 1000

# plot the results
plot(rsum$run, rsum$Tb, main="Tb of data per run", xlab="Run", ylab="Tb", pch=16)
```

The results look like this, which means that, if all of my many assumptions are correct, 2 HiSeq 2000s will produce ~40Tb every 30 runs (approximately one year):
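The ~40Tb figure can also be reached without simulating: each flowcell run is expected to produce `sum(prun * rout * (1 + pan * arat))` Gb (raw output plus the expected analysis output), and scaling by the number of flowcells and runs per year gives the yearly total. A minimal sketch of that sanity check, reusing the same parameter values as the script above:

```r
# same assumptions as the simulation above
prun <- c(0.75, 0.25)   # run-type mix: 100bp PE, 36bp SE
rout <- c(250, 25)      # raw Gb per run, one value per run type
pan  <- c(0.75, 1)      # probability each run type needs analysis
arat <- c(1, 1)         # analysis output as a multiple of the raw output
nin  <- 4               # flowcells (2 HiSeq 2000 instruments)
nrun <- 30              # runs per year

# expected Gb per flowcell run: raw output plus expected analysis output
egb <- sum(prun * rout * (1 + pan * arat))

# expected yearly output in Tb
etb <- egb * nin * nrun / 1000
etb   # ~40.9 Tb, in line with the simulated ~40Tb
```

This closed-form expectation is what the simulated curve wobbles around; the simulation is still useful for seeing the run-to-run variability, which the expectation hides.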
