What is OpenSAFELY?

Working on behalf of NHS England, we have built a full, open source, highly secure analytics platform running across the full pseudonymised primary care records of 24 million people, rising soon to 55 million, 95% of the population of England. We have pursued a new model. For privacy, security, low cost, and near-real-time data access, we have built the analytics platform inside the data centres of the major electronic health record (EHR) providers, where the data already resides. In addition, we have built software that uses tiered, increasingly non-disclosive tables, so that researchers never need direct access to the disclosive underlying data to run analyses. Code is developed against simulated data using open platforms before moving to the live data environment. Everything has run smoothly. We are fully live inside TPP; with EMIS, we are signed off for full data access and are in end-stage technical development of the computational platform.

Data in OpenSAFELY

We have every individual patient's full primary care GP record, with all diagnoses, tests, referrals, prescriptions, etc., all linked onto their data from SUS (hospital admissions, outpatient visits), ECDS (coded A&E attendances), CPNS (death in hospital from Covid), SGSS (Covid test results), ONS (cause of death in and out of hospital), household data (other pseudonymised occupants, whether it is a care home, approximate location), ICNARC (ICU data), ISARIC (detailed hospital records of hospitalised Covid patients) and more.

Despite our unprecedented scale, privacy advocates like MedConfidential have actively praised our privacy model. Our first paper (the largest analysis of Covid risk factors in the world) has now been published in Nature, with another dozen coming. We are producing research on who gets Covid, the consequences of Covid, and the restoration of NHS services. A longer list of research outputs, ongoing and completed, is available on OpenSAFELY.org.

How it works

OpenSAFELY can be used to perform any task that requires large scale computation across a large number of patient records, whether for formal analysis or simple tables and graphs. Once set up, analyses can be re-run at regular intervals.

Analysts write code on GitHub using OpenSAFELY software to describe how they would like the raw patient data converted into their intermediate dataset for analysis. The OpenSAFELY software automatically creates a randomly generated dummy dataset that matches the analyst's specifications. Analysts use this dummy data to write data analysis or data visualisation code on GitHub using Stata, R or a similar language.

Tests check that the dataset-generation and data-analysis code can run on the live data. Once the code passes all of these tests, it is parcelled up safely and passed to the environment containing the real data, where it runs with permission from the core OpenSAFELY team. Only summary tables and graphs are released from the system, after manual review.

All code is shared openly on GitHub for review and efficient re-use. We have written extensive open source software to make developing, sharing and re-using code and code-lists easy and consistent across analyses. We are keen and ready to deploy OpenSAFELY in different environments, against different datasets, to spread this model of working with modern open methods by default.
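
The workflow described in this section (specify a dataset, develop against randomly generated dummy data, release only a summary table) can be illustrated with a minimal Python sketch. This is not the OpenSAFELY software or its API: the names SPEC, make_dummy_data and summarise, and the variables in the specification, are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter

# Hypothetical dataset specification (an assumption for illustration):
# each variable name maps to the values it is allowed to take.
SPEC = {
    "age_band": ["0-17", "18-49", "50-69", "70+"],
    "sex": ["F", "M"],
    "has_diabetes": [True, False],
}


def make_dummy_data(spec, n_patients, seed=0):
    """Generate random patient-level rows that match the specification.

    No real data is involved; values are drawn uniformly at random.
    """
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in spec.items()}
        for _ in range(n_patients)
    ]


def summarise(rows, variable):
    """Reduce patient-level rows to a summary table of counts per category."""
    return dict(Counter(row[variable] for row in rows))


# Analysis code is written and checked against dummy data only.
dummy = make_dummy_data(SPEC, n_patients=1000)

# Only the aggregate table, never the rows themselves, is the output.
table = summarise(dummy, "age_band")
print(table)
```

Because the analysis code only ever sees rows shaped by the specification, the same code could in principle be run unchanged in an environment that holds real records, with just the summary table leaving that environment.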

Video walkthroughs

If you're keen to know more about what OpenSAFELY is and how it works, you can watch the two videos below:

OpenSAFELY: a walkthrough in R (and a little bit of Python) by William Hulme

OpenSAFELY: A new paradigm in health analytics & research — Jonny Cockburn