Next-gen sequencer output

As both a bioinformatician and Director of a genomics facility, one question I get asked repeatedly (often by myself) is: "How much disk storage do we need?"

Given that we are always considering growth and the purchase of new equipment, I decided to write a script in R to estimate this for me :)  Here it is:


# number of flowcells.  Four flowcells == 2 HiSeq 2000 instruments
nin = 4

# types of run
rtypes = c("100bp PE", "36bp SE")

# how often the types are run (should sum to 1)
# here we predict 75% of our runs will be 100bp PE and 25% will be 36bp SE

prun = c(0.75, 0.25)

# Output per run (Gb)
# one value per run type.  
# 250Gb from a HiSeq 100bp PE run, 25Gb from a HiSeq 36bp SE run

rout = c(250, 25)

# Probability of analysis needed
# one value for each run type
# what is the probability that that type of run requires analysis?
# each value should be <= 1
# Some collaborators just want FastQ, others require us to analyse the data
# For each 100bp run, 75% of projects will require downstream analysis.  For each 36bp run, 100% of projects will require downstream analysis

pan = c(0.75, 1)

# ratio of analysis data to raw data (how much extra data does analysis add?)
# one value per run type, multiplicative.  A value of 1 means analysis adds as much again as the raw output, so the data doubles in size when we analyse it

arat = c(1, 1)

# number of runs per year: the average number of times each of the nin flowcells will be run
nrun = 30

# vectors for the output
run <- vector()
flowcell <- vector()
type <- vector()
raw <- vector()
analysis <- vector()
cumulative <- vector()

 

# the cumulative output in Gb
cum <- 0

# iterate over the runs
for (i in 1:nrun) {
    
    # iterate over the flowcells
    for (j in 1:nin) {

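        # what type of run is this?
        # draw a uniform random number and walk along the cumulative probabilities in prun
        pr <- runif(1)
        idx = 1
        psum = prun[idx]
        # the bound on idx stops the loop running off the end if prun sums to slightly less than 1
        while(pr > psum && idx < length(prun)) {
            idx = idx + 1
            psum = psum + prun[idx]
        }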

        # idx is the index of our run type
        ct <- rtypes[idx]

        # idx is the index of our raw output
        co <- rout[idx]    
        cum <- cum + co

        # do we need analysis?
        cpa <- pan[idx]
        ca <- 0
        if (runif(1) < cpa) {
            ca <- arat[idx] * co
            cum <- cum + ca
        }    

        # create the output
        run <- append(run, i)
        flowcell <- append(flowcell, j)
        type <- append(type,ct)
        raw <- append(raw, co)
        analysis <- append(analysis,ca)
        cumulative <- append(cumulative, cum)

    }
}

out <- data.frame(run=run,flowcell=flowcell,type=type,raw=raw,analysis=analysis,cumulative=cumulative)

# run summary
rsum <- aggregate(out$cumulative, by=list(run=out$run), max)
colnames(rsum) <- c("run","Gb")
rsum$Tb <- rsum$Gb/1000

# plot the results
plot(rsum$run, rsum$Tb, main="Tb of data per run", xlab="Run", ylab="Tb", pch=16)
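
The nested loops above can also be written without explicit iteration. What follows is only a sketch of an equivalent approach, not the script used for the figures in this post: it assumes the same parameter values, draws run types with R's sample() instead of the hand-rolled cumulative-probability loop, and wraps everything in a helper called simulate_storage, which is a name made up for illustration.

```r
# Vectorised sketch of the same simulation.
# simulate_storage is a hypothetical helper name; the defaults are the
# parameter values assumed in the script above.
simulate_storage <- function(nin = 4, nrun = 30,
                             prun = c(0.75, 0.25),
                             rout = c(250, 25),
                             pan  = c(0.75, 1),
                             arat = c(1, 1)) {
  n <- nin * nrun   # total number of flowcell runs in the period

  # draw a run type for every flowcell run in one call
  idx <- sample(seq_along(prun), size = n, replace = TRUE, prob = prun)
  raw <- rout[idx]

  # decide, per flowcell run, whether downstream analysis is needed
  needs_analysis <- runif(n) < pan[idx]
  analysis <- ifelse(needs_analysis, arat[idx] * raw, 0)

  # same ordering as the loops: runs on the outside, flowcells on the inside
  data.frame(run = rep(seq_len(nrun), each = nin),
             flowcell = rep(seq_len(nin), times = nrun),
             raw = raw,
             analysis = analysis,
             cumulative = cumsum(raw + analysis))
}

set.seed(1)   # arbitrary seed, only so the sketch is repeatable
sim <- simulate_storage()
tail(sim$cumulative, 1) / 1000   # total Tb after all runs
```

Because the function takes its assumptions as arguments, it is easy to re-run with, say, nin = 6 to see what a third instrument would do to the total.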


The plot produced by the script shows cumulative storage against run number. If all of my many assumptions are correct, 2 HiSeq 2000s will produce ~40Tb every 30 runs (approximately one year).
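
As a quick sanity check on that figure, the expected total can be worked out directly from the parameters, with no simulation at all. This is a minimal sketch that assumes the parameter values used in the script; the names per_flowcell_run and expected_tb are introduced here for illustration only.

```r
# Closed-form expectation under the same assumed parameters
nin  <- 4
nrun <- 30
prun <- c(0.75, 0.25)
rout <- c(250, 25)
pan  <- c(0.75, 1)
arat <- c(1, 1)

# expected Gb per flowcell run: raw output, plus analysis output weighted by
# the probability that analysis is needed, averaged over the run types
per_flowcell_run <- sum(prun * rout * (1 + pan * arat))

# scale up to all flowcells and all runs, and convert Gb to Tb
expected_tb <- per_flowcell_run * nin * nrun / 1000

per_flowcell_run   # 340.625
expected_tb        # 40.875
```

Any single simulation will land a little above or below this value, since the run types and the need for analysis are drawn at random.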
