Parameterizable Reproducible Research

why do it

Our current best example of this structure is a report we generate that goes into per-customer technical details about their customized statistical model. It compares the version of the model in production with several “toy” models, along with a lot of explanation and education. The process of generating the report includes pulling data from remote servers, running statistical analyses, computing various metrics on the statistical models, then interleaving standard text, parameterized text (e.g., the name of the customer), and the results of computation, as well as auto-generated charts and graphs.

Doing this manually would mean following a 10- to 20-step checklist, would take at least 30 minutes per report, and would be error-prone.

Our team now generates these reports in about 3 minutes per customer, based on a simple configuration file. Here’s what a configuration file might look like:

user = "hharris"
data_source = "SOURCENAMEHERE"
data_location = "where in the source to pull data from"
cust_id = "9999"
cust_name = "State College"

That’s it. And here’s the command that generates the report:

./risk-model-report.R --verbose --config myconfig.R

That’s it! The result is a standalone HTML file customized for that specific customer, ready to be sent.
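For the curious, here is a minimal sketch of how a script invoked this way might handle its two flags, using only base R. The function name `parse_args` and the error messages are hypothetical and are not taken from the released code; a real script might well use an argument-parsing package instead.

```r
#!/usr/bin/env Rscript

# Parse a character vector of command-line arguments into a list of options.
# Only the two flags shown in the command above are recognized.
parse_args <- function(args) {
  opts <- list(verbose = FALSE, config = NULL)
  i <- 1
  while (i <= length(args)) {
    if (args[i] == "--verbose") {
      opts$verbose <- TRUE
    } else if (args[i] == "--config" && i < length(args)) {
      # The value of --config is the next argument.
      opts$config <- args[i + 1]
      i <- i + 1
    } else {
      stop("Unrecognized or incomplete argument: ", args[i])
    }
    i <- i + 1
  }
  if (is.null(opts$config)) {
    stop("--config <file> is required")
  }
  opts
}

# Inside the script itself, the options would come from the real command line:
#   opts <- parse_args(commandArgs(trailingOnly = TRUE))
```

Keeping the parser as a plain function of a character vector makes it easy to check without actually launching the script.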

Should you use this pattern? If you use R to create reports, you should definitely be using Rmarkdown. If you have to generate these reports repeatedly, with subtle variations each time, you should strongly consider this framework or an equivalent!

how we do it

We use a relatively straightforward pattern available in the R programming language. As mentioned above, the standard reproducible research workflow is to create one document per analysis. We wrap that document in a separate script that reads and validates a configuration file, then builds the parameterized document with the appropriate configuration variables. The build process is multi-step under the hood, but most of the heavy lifting is performed by the Rmarkdown package, which runs Pandoc, a cross-platform system for converting document formats.
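To make the wrapper idea concrete, here is a minimal sketch of its three jobs: read the configuration, validate it, and build the document. It is one way to do it, not necessarily how the released code works. It assumes the configuration file is plain R assignments like the sample above, that the template is called `risk-model-report.Rmd` (a hypothetical name), and that the template declares matching `params` in its YAML header, which `rmarkdown::render` requires before it will accept them.

```r
# Field names taken from the sample configuration file shown earlier.
required_fields <- c("user", "data_source", "data_location",
                     "cust_id", "cust_name")

# Evaluate the config file in its own environment so that its
# assignments do not leak into the caller's workspace.
read_config <- function(path) {
  env <- new.env(parent = baseenv())
  source(path, local = env)
  as.list(env)
}

# Fail early, with a readable message, if a field is absent or is not
# a single non-empty string.
validate_config <- function(cfg) {
  absent <- setdiff(required_fields, names(cfg))
  if (length(absent) > 0) {
    stop("Missing config fields: ", paste(absent, collapse = ", "))
  }
  ok <- vapply(
    cfg[required_fields],
    function(x) is.character(x) && length(x) == 1 && nzchar(x),
    logical(1)
  )
  if (!all(ok)) {
    stop("Invalid config fields: ",
         paste(required_fields[!ok], collapse = ", "))
  }
  invisible(cfg)
}

# Build one report. The template name and output file naming are
# assumptions for illustration only.
build_report <- function(cfg, template = "risk-model-report.Rmd",
                         output_dir = ".") {
  rmarkdown::render(
    input = template,
    output_file = paste0("report-", cfg$cust_id, ".html"),
    output_dir = output_dir,
    params = cfg,
    envir = new.env()  # keep each build isolated from the session
  )
}
```

With these in place, the body of the wrapper reduces to something like `build_report(validate_config(read_config("myconfig.R")))`.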

For an example, see this HTML document (it should work in any web browser). Note that the image is embedded, meaning that the HTML document stands alone. It’s relatively easy to generate other formats, such as PDF or even Word, instead.
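Switching formats mostly comes down to the `output_format` argument of `rmarkdown::render`. The sketch below uses hypothetical helper names (`output_format_for`, `render_as`) to map a short format name onto one of the package's built-in formats. The default `html_document` format produces a self-contained file, which is why the image ends up embedded; PDF output additionally needs a LaTeX installation.

```r
# Map a short, user-facing name to one of rmarkdown's built-in formats.
output_format_for <- function(kind = c("html", "pdf", "word")) {
  kind <- match.arg(kind)
  switch(kind,
    html = "html_document",
    pdf  = "pdf_document",
    word = "word_document"
  )
}

# Render the same template to whichever format was requested.
render_as <- function(template, kind = "html", params = NULL) {
  rmarkdown::render(
    input = template,
    output_format = output_format_for(kind),
    params = params,
    envir = new.env()
  )
}
```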

The document was generated by R and Rmarkdown code we’ve released into the public domain, hosted on GitHub. If this pattern is useful to you, please make use of and adapt it!

Note that because the generated output is HTML, you can add arbitrary JavaScript and markup to your generated reports. We’ve used collapsible areas of a page to expose technical details to only interested readers, added MathJax for mathematical expressions, and even added JavaScript libraries to send us data about how many times the document is viewed and whether outgoing links were followed.
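As one illustration of the collapsible idea, the sketch below emits a standard HTML `<details>` element from R. This is an assumption about how such a section could be built, not a description of the mechanism used in our reports, and the helper names `escape_html` and `collapsible` are hypothetical.

```r
# Escape the characters that would otherwise be read as HTML markup.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Wrap a block of content in a collapsible <details> element.
# The summary is escaped; the body is passed through unchanged.
collapsible <- function(summary, body) {
  paste0(
    "<details>\n<summary>", escape_html(summary), "</summary>\n\n",
    body,
    "\n\n</details>"
  )
}
```

Inside an Rmarkdown chunk with the option `results = "asis"`, calling `cat(collapsible("Technical details", "Text for interested readers."))` writes the markup straight into the generated page.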

Over time, we expect to use variants of this pattern to standardize a variety of reports, internal-facing as well as customer-facing. If you do something similar, or better, I’d love to hear about it!