As both a bioinformatician and Director of a genomics facility, one question I get asked repeatedly (often by myself) is: "How much disk storage do we need?"
Given that we are always planning for growth and new equipment purchases, I decided to write an R script to work it out for me :) Here it is:
# number of flowcells. Four flowcells == 2 HiSeq 2000 instruments
# types of run
# how often the types are run (should sum to 1)
# Output per run (Gb)
# Probability of analysis needed
# ratio of analysis data (how much does raw data grow when you analyse it)
# number of runs per year: the average number of times each flowcell (the flowcell count is stored in nin) will be run
# vectors for the output
# the cumulative output in Gb
# iterate over the runs
# what type of run is this?
# idx is the index of our run type
# idx is also the index of our raw output
# do we need analysis?
# create the output
out <- data.frame(run=run,flowcell=flowcell,type=type,raw=raw,analysis=analysis,cumulative=cumulative)
# run summary
# plot the results
The results look like this, which means if all of my many assumptions are correct, 2 HiSeq 2000s will produce ~40Tb every 30 runs (approximately one year):
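The steps listed in the comments above can be sketched as a small simulation. This is a minimal reconstruction, not the original script: only `nin`, `idx` and the columns of `out` come from the post. The function name `simulate_storage`, the run types, the frequencies, the per-run outputs, the analysis probability and the analysis ratio are all hypothetical placeholders. It also assumes that analysis data is sized as raw output times the ratio and is stored in addition to the raw data.

```r
# A sketch of the storage simulation. All default values are hypothetical
# placeholders, NOT the numbers used in the original script.
simulate_storage <- function(nin = 4,                             # flowcells (4 == 2 HiSeq 2000s)
                             types = c("SE36", "PE50", "PE100"),  # run types (assumed names)
                             freq = c(0.2, 0.3, 0.5),             # how often each type is run
                             output_gb = c(30, 90, 180),          # raw output per flowcell run, Gb (assumed)
                             p_analysis = 0.5,                    # probability analysis is needed (assumed)
                             ratio = 2,                           # analysis size / raw size (assumed)
                             nruns = 30,                          # runs per flowcell per year
                             seed = 1) {
  stopifnot(length(types) == length(freq),
            length(types) == length(output_gb),
            isTRUE(all.equal(sum(freq), 1)))
  set.seed(seed)  # make the random draws reproducible

  # vectors for the output, one entry per flowcell per run
  n <- nin * nruns
  run <- rep(seq_len(nruns), each = nin)
  flowcell <- rep(seq_len(nin), times = nruns)
  type <- character(n)
  raw <- numeric(n)
  analysis <- numeric(n)
  cumulative <- numeric(n)

  total <- 0  # the cumulative output in Gb
  for (i in seq_len(n)) {
    # what type of run is this? idx indexes both the type and its raw output
    idx <- sample(seq_along(types), size = 1, prob = freq)
    type[i] <- types[idx]
    raw[i] <- output_gb[idx]

    # do we need analysis? If so, assume it adds raw * ratio on top of the raw data
    if (runif(1) < p_analysis) {
      analysis[i] <- raw[i] * ratio
    }

    total <- total + raw[i] + analysis[i]
    cumulative[i] <- total
  }

  # create the output
  data.frame(run = run, flowcell = flowcell, type = type,
             raw = raw, analysis = analysis, cumulative = cumulative)
}

out <- simulate_storage()

# run summary
print(summary(out))

# plot the results (only when a graphics device is available interactively)
if (interactive()) {
  plot(out$cumulative, type = "l",
       xlab = "flowcell run", ylab = "cumulative storage (Gb)")
}
```

Because the inputs here are invented, the totals from this sketch will not match the ~40Tb figure quoted above. Substituting your own facility's run mix and output sizes is the point of the exercise.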