Improving the organization of computational projects

Published: Fri 13 February 2015

I am very interested in the work of the Software Carpentry people, who try hard to improve research practices by teaching researchers how to write more reliable and efficient code.

A Software Carpentry post from Daniel Chen mentioned W. S. Noble's paper published in 2009 in PLoS Computational Biology, "A quick guide to organizing computational biology projects". It is very interesting, and as a condensed guide for myself I took detailed notes on the paper's main points (along with one or two personal remarks).

Be sure to read the original paper if you are interested in this topic!

1 Guiding principles

  • First principle: "Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why"
  • Second principle: Everything you did will probably need to be redone (to fix a bug, to run with updated data or with improved parameters, …)

2 Project folder structure

  • One folder per project with all the relevant files, except for files and scripts shared by several projects (those can have their own project folder)
  • The root folder of the project follows a logical organization:
    • data: fixed data
    • results: results of the computations performed on data
    • src: sources of programs and scripts
    • bin: binaries
  • data and results are organized chronologically rather than logically, because their structure cannot be predicted much in advance. A clear chronological structure helps when coming back to the project after a while.
  • I would personally add a documentation folder to gather all the important notes, workflow or pipeline graphs, and README files about data provenance. An alternative is to store README files in the relevant folders, but I like the idea of having all the documentation easily accessible in one folder. It also makes it easier to set up your version control tool, since you can then ignore the entire data folder (whereas the README files should be tracked). A sketch of such a layout follows this list.
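
To make this concrete, here is a sketch of such a project root (the dates and the doc folder name are illustrative, not prescribed by the paper):

```
project/
├── doc/           # notes, pipeline graphs, README files on data provenance
├── data/          # fixed input data, organized chronologically
│   └── 2015-02-01/
├── results/       # computed results, organized chronologically
│   └── 2015-02-13/
├── src/           # sources of programs and scripts
└── bin/           # compiled binaries and scripts
```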

3 Lab notebook

  • Entries (progress, observations, conclusions and ideas) should be dated and verbose, with links or embedded images or plots. Failed experiments should also be reported, along with a detailed explanation of why one concludes the experiment has failed.
  • Transcripts of discussions and e-mails help give a complete picture of the project.
  • It is also convenient to share the notebook online with collaborators (can be password-protected).

4 Performing an experiment

  • Every step should be recorded (every command line, every script), for example in a README file or using a driver script. Avoid editing intermediate files by hand (this breaks the automation of the analysis).
  • Driver script: "The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details" (and lots of comments). All file and directory names should be stored in this script, using relative paths.
  • The driver script is restartable: guarding each computation with a check like "if the output file does not exist, perform the computation" makes it possible to rerun specific parts of the analysis just by deleting some result files (see the sketch after this list).
  • A summary script, called by the last line of the driver script, creates a plot or HTML page showing the progress of the experiment.
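
As an illustration of such a restartable driver script, here is a minimal Python sketch (the file names and the align and summarize commands are hypothetical). Each step is guarded by an existence check, so deleting a result file and rerunning the script recomputes only that step:

```python
import os
import subprocess

def run_step(output, command):
    """Run `command` (expected to write `output`) unless `output` already exists."""
    if os.path.exists(output):
        print("skipping %s (already computed)" % output)
        return
    subprocess.check_call(command)  # raises an error on non-zero exit codes

# hypothetical experiment steps; all paths are relative to the project root
run_step("results/2015-02-13/alignments.txt",
         ["bin/align", "data/2015-02-01/sequences.fasta",
          "results/2015-02-13/alignments.txt"])
run_step("results/2015-02-13/summary.html",
         ["bin/summarize", "results/2015-02-13/alignments.txt",
          "results/2015-02-13/summary.html"])
```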

5 Handling of errors

  1. Use robust code to detect errors
    • check parameters, input, … (e.g. assert statements in Python, stopifnot in R)
      • use robust library functions for parsing input files rather than writing your own ad hoc parser
  2. Abort when an error occurs
    • important to make sure errors are not missed and conclusions are not drawn from false results
    • code should always check the return codes of functions called and commands executed
  3. Create each output file under a temporary name, and rename it once the file is complete
    • makes scripts restartable
    • prevents partial results from being mistaken for complete ones (see the sketch after this list)
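
Here is a minimal Python sketch combining these three points (the sort command and the file names are hypothetical): the input is checked before computing, a failing command raises an error instead of passing silently, and the output only receives its final name once it is complete:

```python
import os
import subprocess

def sorted_copy(input_path, output_path):
    # 1. detect errors early: validate the input before computing
    assert os.path.exists(input_path), "missing input: %s" % input_path
    # 3. write under a temporary name so that a crash cannot leave a
    #    partial file that looks like a complete result
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as out:
        # 2. abort on error: check_call raises CalledProcessError
        #    if the command exits with a non-zero return code
        subprocess.check_call(["sort", input_path], stdout=out)
    os.rename(tmp_path, output_path)  # only complete results get the final name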

6 Command line vs script vs program

  • scripts occupy a middle ground between one-off command lines and full programs; iteratively improving a script is often a good way to work
  • scripts can be:
    • driver script: top-level, one per experiment/folder
    • single-use script: called by the driver script
    • project-specific script: stored at the root of the project folder
    • multi-project script: stored outside the project folder, in a folder of its own
  • important: every script should have a well-documented interface and "should be able to produce a fairly detailed usage statement that makes it clear what the inputs and outputs are and what options are available"
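
For Python scripts, argparse is one way to get such a usage statement almost for free; here is a sketch with hypothetical inputs and options (running the script with -h prints the generated usage):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Align query sequences against a reference genome.")
parser.add_argument("queries", help="FASTA file of query sequences (input)")
parser.add_argument("alignments", help="file to write the alignments to (output)")
parser.add_argument("--threads", type=int, default=1,
                    help="number of worker threads (default: 1)")
args = parser.parse_args()
```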

7 Version control

  • Interest:
    • backup (if the local repository is pushed to a remote repository on a regular basis)
    • historical record (any previous results can be reproduced, tags can be placed on important commits)
    • collaborative projects
  • Discipline is required:
    • commit regularly (at least once a day, but I would suggest more, like after each feature addition or bug fix)
    • only hand-modified files should be version-controlled
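
Following the earlier remark about ignoring the data folder, a .gitignore along these lines (a sketch assuming the layout from section 2) keeps fixed data and generated files out of the repository:

```
# track only hand-modified files
data/
results/
bin/
*.tmp
```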

8 Cited references (among others)

  • Noble WS (2009) A quick guide to organizing computational biology projects. PLoS Comput Biol 5(7): e1000424