I joined The DataLab in August 2021 to work as a data scientist on OpenSAFELY. This blog post describes my experience getting up and running with the OpenSAFELY pipeline.

My first task was to install the OpenSAFELY pipeline and its small number of dependencies. The documentation was clear, well presented and approachable, and the necessary commands could be copied, pasted and executed from the OpenSAFELY website without incident. Special mention is due to the hover-over definitions and copy buttons on the code snippets! I had my own version of the Getting Started project up and running within a few hours (in the interests of full disclosure, I am comfortable working on the command line, and this was a fresh install on a laptop with full installation privileges).

By the end of Week 1, I was working on a new project with another data scientist, looking at prescribing behaviours during the Covid-19 pandemic. Every OpenSAFELY project is built on top of a study definition: a flexible framework, written in Python, that lets the user specify the patient cohort and the variables needed to generate the research dataset. The detailed walkthrough in the OpenSAFELY documentation is especially helpful when getting to grips with this framework, and the OpenSAFELY GitHub organisation acts as a growing resource of example implementations that can guide users developing their own projects; one study definition in particular covers a broad range of primary care data.
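To give a flavour of what a study definition looks like, here is a minimal sketch modelled on the Getting Started example in the OpenSAFELY documentation. It is declarative configuration rather than analysis code: cohortextractor reads it to generate the dataset. The population and the single variable here are illustrative, not the actual definition from our prescribing project.

```python
# study_definition.py: a minimal, illustrative study definition.
# The cohort and variable choices below are examples only.
from cohortextractor import StudyDefinition, patients

study = StudyDefinition(
    # Expectations control the dummy data generated for local testing
    default_expectations={
        "date": {"earliest": "1900-01-01", "latest": "today"},
        "rate": "uniform",
        "incidence": 0.5,
    },
    # The cohort: patients registered with one practice across the period
    population=patients.registered_with_one_practice_between(
        "2019-02-01", "2020-02-01"
    ),
    # One extracted variable: age at the start of the study period
    age=patients.age_as_of(
        "2019-09-01",
        return_expectations={
            "rate": "universal",
            "int": {"distribution": "population_ages"},
        },
    ),
)
```

Adding a new variable is a matter of adding another keyword argument to `StudyDefinition` and regenerating the dataset.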

The process of developing the study definition is iterative: cohort characteristics and data variables can be added over time and the dataset regenerated. When an aspect of patient selection needed to be revised in my project, it was straightforward to amend the existing study definition and rerun the data generation. The idea of representing the data as a set of instructions, rather than as the data themselves, is very appealing to me, particularly as the decision process behind those instructions can be documented alongside the analysis code via git commits.

Two months on and we’re at the next stage of this project: implementing the analyses required to answer our research questions as scripted actions, running these analyses on the real data, interpreting the results and starting to document our findings. The speed at which we’ve arrived at this point, especially with an OpenSAFELY novice on board, is extremely impressive and attests to the flexibility and usability of the pipeline. However, I’m sure that the success of OpenSAFELY is due to more than just a robust analytical pipeline. My experience as a post-doc in academia was that you had to do all the things, some (many?) of them not very well. In my time I have written my own (bad) JavaScript, struggled to interpret Information Governance policies and spent hours manually resolving Linux dependencies with varying degrees of success. At The DataLab there are experienced software developers, front-end web developers, IG experts, policy experts, pharmacists and clinicians to work with and learn from, leaving data scientists like me free to focus on delivering the data science.
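The scripted actions mentioned above are declared in a project.yaml pipeline file, which chains data generation and analysis steps together. A sketch of its shape, based on the documented format (the action and file names here are illustrative, not our project's actual pipeline):

```yaml
version: "3.0"

actions:
  generate_study_population:
    run: cohortextractor:latest generate_cohort --study-definition study_definition
    outputs:
      highly_sensitive:
        cohort: output/input.csv

  describe_prescribing:
    run: python:latest python analysis/describe_prescribing.py
    needs: [generate_study_population]
    outputs:
      moderately_sensitive:
        table: output/prescribing_summary.csv
```

Running `opensafely run run_all` locally executes each action in dependency order against dummy data, so the whole pipeline can be tested before it goes anywhere near the real records.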

The DataLab has high aspirations for its products and is committed to keeping those products open source. In the very short time that I’ve been here, I can see why there is good reason to believe that these aspirations will be achieved: the right people with the right skills are in place, and the pipeline has been designed to be robust and flexible. I’m very excited about what I’m going to learn, and what I’ll be able to contribute, working at The DataLab.