OpenSAFELY: The Origin Story

On 7th May 2020, the OpenSAFELY Collaborative pre-printed the world’s largest study into factors associated with death from Covid-19, based on an analysis running across the full pseudonymised health records of 40% of the English population. This is an unprecedented scale of data.

It was only made possible because of a huge collaboration including our group (the Bennett Institute for Applied Data Science at the University of Oxford), the EHR research group at London School of Hygiene and Tropical Medicine, NHS England, and TPP. Over 42 days during the peak of the first wave of COVID-19 this team worked day and night to produce a fully open-source, privacy-preserving software platform, capable of running open and reproducible analytics across electronic health records, all held securely in situ. Since then the OpenSAFELY platform has expanded to a full scale analytic environment for secure data analysis, reproducible data curation, federated analysis, and code sharing, with every line of code for the platform, for data management, and for data analysis all shared openly by default, in re-usable forms, automatically, and without exception.

This has been a truly horrible year for everyone, around the globe, with horrible tragedies, deep suffering, and profound disruption to work and life. We recognise that being able to focus on work has been a great privilege for all of us working together on this project, but it has been a very difficult time for everyone. Many in our wide collaborative have been very specifically, tragically, and deeply affected by the pandemic. On our first anniversary, from the Policy Lead in the Bennett Institute, this is the brief story of the positive side from all our lives: how OpenSAFELY came to life, and what we’ve achieved so far.

An idea is sown

There’s a photo saved in the Bennett Institute archive from the 11th of February 2020, with four members of the Bennett Institute grinning and holding up some of our best papers in printed mug form. From May 2021 this image feels prehistoric: we are standing shoulder-to-shoulder, in a small office, in a way that would be unthinkable today. The World Health Organisation declared the Covid-19 pandemic exactly one month later. A few days afterwards we all left our small office for the last time.

For us, those early days of home working and talking on Zoom were - like they were for everyone - dominated by an overwhelming sense of uncertainty. We knew only that we wanted to help; that high-quality data analysis was going to be an essential component of the national response to the pandemic; and that we had the skills and experience we felt the NHS needed if it was going to make best use of its data.

To get us thinking about how we could leverage our skills and experience to help, we wrote a blog and held an all day team meeting on the 17th of March to generate ideas. We came up with a wide range of suggestions (too many!) from developing dashboards of medicine shortages, to tracking the spread of misinformation and using data to disprove misleading claims. Whilst many of these ideas were interesting, and could have led to useful research projects, one idea stood out as having the potential to deliver the most impact: The Open Covid Research Platform.

We realised that there was an increasingly urgent need to answer questions such as: which demographic characteristics or medical conditions made people more vulnerable to Covid-19; which drugs might help or hinder the treatment of Covid-19; and what happens to people after they have recovered from initial infection. We also realised that to answer these questions quickly researchers would need rapid access to unprecedented volumes of clinical data, and a means of conducting high-quality analytics in a collaborative fashion. Hence, we concluded that rather than a single study or data source, the NHS needed a platform that would enable many data analysis studies to be conducted in a single secure environment. And so the idea for what would become OpenSAFELY was born.

Getting started

First we needed the right team. The Bennett Institute, run by Ben Goldacre, has a strong existing team of software developers and “developer researchers” who understand software development and research in health data. The EHR group at LSHTM, run by Liam Smeeth, has over a decade of world-class experience working with complex, raw NHS patient data, and turning it into research insights. In the morning they started talking. By the end of the day they had a plan.

On the 18th March 2020 Ben and Liam wrote to the Department of Health and Social Care and NHS England, outlining our proposal. This outline stressed that the UK was uniquely positioned to provide this type of research and that, unlike a fragmented approach with multiple data extracts, the principles of our proposed platform design were the best for preserving patients’ privacy, while maximising the quality, speed, and quantity of outputs through mass collaboration. This was key: we wanted to move beyond slow, intermittent, and less secure data extraction, and move towards a better, faster, and more efficient means to work with raw NHS data.

Naturally we were excited at this point. It felt like we’d landed on an idea that might not only provide answers to urgently needed clinical questions, but also help show the NHS the art of the possible for open analytics and privacy by design. We had dreams of developing a platform that could substantially exceed (see Box 1) the current legal and ethical requirements on securing sensitive healthcare data. However our excitement was tempered by skepticism. Previous experience, and folklore, had led us to believe that getting started would take a long time. There was a mountain of paperwork to get through, all designed to ensure that patient data is kept secure and used in an appropriate manner. The technical challenge of handling so much data was also substantial. In hindsight we could have afforded to be relentlessly optimistic. We had neglected to consider the mountain-moving power of a shared mission.

Box 1: Key design features of OpenSAFELY

So far the idea was nice, but it couldn’t happen in the real world, as there were still several missing pieces. On the 20th March 2020 at 2pm we had a telephone meeting with TPP, one of the main providers of primary care electronic health record software (EHRs) for the NHS, more specifically with John Parry (their CMO) and Chris Bates who runs their research and analytics team. The meeting was to talk about something completely different. For fifty minutes we discussed various other projects (including an “open popups framework” for EHR systems, to which we will one day return!). A series of action points were written down, now long forgotten. Then, in the last ten minutes, we had time for Any Other Business. We nervously floated the idea of the “OpenCorona” research platform (as we first called it), and asked if there was any tiny possibility that TPP might be interested to consider discussing whether they might collaborate with us to build it.

The thing we all remember most from that conversation is that Chris Bates wouldn’t stop laughing, while repeatedly saying the word “yes”.

Chris and John went away to discuss with the team at TPP (for whom we now have deep, undying admiration) and we began to try and square off how this all might work. Seb Bacon, our Chief Technology Officer, began to think through the implications, and build plans with our developer team. At this point we were hugely lucky to have befriended Amir Mehrkar: a GP, previously a guiding light at InterOPEN and acting CMO at NHS Digital. We had no money to pay him, but he heroically accepted the task of managing our permissions and governance. We hit the phones, emails, and laptops, and things suddenly began to move with startling pace. Following the issuance of the COPI Notices by the Secretary of State for Health and Social Care (which gave us the legal basis for accessing EHR data on behalf of NHSE) the first line of code was committed to the OpenSAFELY codebase on the 31st March 2020. We began to build tools for data management and analysis in anticipation of data flowing. Soon we had agreed data flows for vital elements like “death in hospital from COVID” with the NHS England data team, who were crucial and highly skilled in their collaborative work. By the 14th April we had ethics approval and all the necessary data processing agreements in place.

There was then a truly monumental, enormous analytic push, with deep expertise from three different teams, most of whom had never met in person, working day and night, through the weekends, striving to turn tens of billions of rows of structured GP data into a clear picture of who was most at risk of death from COVID-19. Then, on the 7th of May, just seven weeks after the very first consideration of even trying to do such a thing, our collaborative group published the pre-print of our first ever paper: “OpenSAFELY: factors associated with COVID-19-related hospital death in the linked electronic health records of 17 million adult NHS patients”. And we announced OpenSAFELY to the world.

Going from strength-to-strength

Rather than relenting after going ’live,’ the pace of platform development and research output only intensified. In the latter half of 2020 (see timeline), we investigated the link between Covid-19 death and chronic obstructive pulmonary disease or asthma. After months of struggle we managed to get our first research funding for OpenSAFELY from UKRI/MRC; and then, at the end of 2020, our first platform-development funding from the Wellcome Trust. Nature published an updated version of our preprint; we brought EMIS (the other major provider of EHRs in England) on board as another collaborator; and OpenSAFELY was covered by multiple news outlets and periodicals including The Economist and The New York Times. As much as we were shouting about our research outputs, we were also spreading our modern, open, collaborative approach to computational data science: explaining how all data management and analysis code that is run on the platform is shared openly, by default and by design, so that anyone can view, check and re-use it all, driving quality and efficiency for all. We also began to more actively promote OpenSAFELY to the public (e.g. here), and to the NHS analyst workforce (e.g. here), as a worked example of how modern, efficient, open working methods can benefit science.

Into 2021 we continued to conduct and publish research related to urgent clinical questions, such as: the effect of hydroxychloroquine on Covid-19 mortality; the link between HIV infection and Covid-19 death; the factors associated with Covid-19 deaths vs other causes; the connection between the use of non-steroidal anti-inflammatory drugs and Covid-19 death; the existence of ethnic inequalities in Covid-19 deaths (and whether this has changed over time); the association between living with children and Covid-19 outcomes; whether it is possible to predict the risk of Covid-19 related death for adults in the general population; the risk associated with different variants of SARS-CoV-2; the longer-term clinical impact of Covid-19 hospitalisation; and the association between anticoagulants and Covid-19 outcomes. However, we also turned our attention to bigger-picture policy analyses, such as the impact of lockdowns on non-Covid-19 NHS services and the roll-out of the national vaccine programme.

Our work on vaccine coverage arguably led to one of our biggest achievements to date. OpenSAFELY is now a vast software project, and it is portable code: we are now implementing it as a software layer to preserve patients’ privacy, and provide efficient and open data management and data analysis, in a number of different settings, on some very diverse datasets (on which more soon!). But our two core environments today are OpenSAFELY-TPP, with 23 million patients’ records, and OpenSAFELY-EMIS, with 35 million patients’ records. This raises an interesting technical challenge. Running an analysis in just one of these environments is a big ask. The underlying data, and technical environment, is very different between these two settings. But we wanted them to be addressable as if they were one. On the 19th February 2021, we completed the first federated analysis in OpenSAFELY, running the same OpenSAFELY code, for the same analysis, in two different places (OpenSAFELY-EMIS and OpenSAFELY-TPP) against two completely different back-ends. This enabled the first analysis ever to be run across the full pseudonymised raw GP data of over 95% of the population of England. It was another significant milestone; one that, with a lot more work, ultimately led to us publishing a preprint detailing trends and clinical characteristics of Covid-19 vaccine recipients, derived from a federated analysis of 57.9 million patients’ primary care records in situ, and more to come.

This federated analysis was a truly massive technical achievement and speaks to the strength of the OpenSAFELY collaborative: it was driven, as ever, by the combination of skills that no single individual, or even team, is ever likely to embody alone, across EHR data analysis, EHR system design, software development, data management, open science, and more. Many others have done great work, over two decades, on small subsets of raw GP data, with data extracts like CPRD. Some, since our work on OpenSAFELY, have done preliminary work with the GPES data, a much smaller summary dataset derived from the raw GP data, but nonetheless covering 54 million patients. With our first analyses on the full raw GP records of 58 million patients we will forever feel, if you can forgive us, in our very geeky hearts and minds, like ‘Roger Bannister’: running the world’s first 4-minute mile, in plimsolls, having just eaten a ham sandwich.

We deeply admire the work of many other teams. And a year after we published our first preprint, we are undeniably proud of OpenSAFELY. We’re proud of the research; we’re proud to have tens of thousands of lines of code on GitHub; we’re proud of being called an ‘inspiring example of good practice’; we’re proud of our newly growing band of external users and the “co-pilots” in our team who help them drive the ship. We are proud to have shipped an analytics platform that imposes open working and open code sharing as the automatic default. And we’re proud of all the resources we’re making openly available for other researchers and teams, like OpenCodelists.

This belongs to everyone who has worked with us, up close and far away; but more than that, to everyone who has contributed a line to any code library or open tool we’ve re-used, to everyone who has created the systems and processes and code through which all NHS patient data flows, and to every patient whose data we’ve been able to access - securely, and non-disclosively - to generate insights on COVID. And, more than that even, this work literally belongs to everyone: every line of code for the entire OpenSAFELY platform and every analysis is free for review and re-use under open licenses at GitHub.com/OpenSAFELY, as should be the norm for all publicly funded code on NHS patients’ data.

We look forward to the day when these achievements are forgotten about, because that will mean we have all won, across the system: that the data infrastructure challenges of the NHS and research have all been addressed, and that modern, open, collaborative approaches to computational data science have become the norm, replacing the closed approaches of the past. For that to happen, the system for delivering, implementing, funding and evaluating data infrastructure will need to radically change. We are optimistic!

Onwards

Looking ahead into year 2, we see great things ahead. The Bennett Institute team has grown from 12 to nearly 30 (depending on how you count our inward secondments!). The EHR group at LSHTM is recruiting, our friendship and admiration for TPP and EMIS grows ever deeper, and our wider collaborations are flourishing. Our roadmap is packed, but we’re especially looking forward to: inviting more external users to conduct their own analyses using OpenSAFELY; delivering OpenSAFELY in more analytic environments, on more datasets; developing our documentation and building out into teaching analytics alongside open code; expanding the functionality of the platform (in some very exciting ways that are currently under wraps); producing more short data reports; and, most especially, one day meeting in 3D to celebrate an amazing first year.

Bravo, and onwards!