This is a draft discussion paper, the first of a series exploring “open team science” approaches to managing health data, and specifically how to create a collaborative computational data science ecosystem in which the sharing and reuse of objects such as codelists and code is facilitated, encouraged, recognised, and rewarded. As a microcosm of this we have first explored “codelists”. No definitive ‘answers’ or preferred solutions are given at this stage. We will be holding an open discussion with the research community on 2nd March at 3pm - you can book to join us here.
Dr. Caroline Morton, Jessica Morley, Brian MacKenna, Dr. Ben Goldacre, Dr. Helen Curtis, Dr. Merlin Dunlop, Peter Inglesby, Dr. Rosalind Eggo, Dr. Anna Schultze
Codelists are an important part of epidemiological research, and good or bad practice with codelists can make or break a research project. Codelists also form an important foundation of many clinical decision support tools used within electronic health record (EHR) systems. Codelists are lists of codes that denote a particular variable, for example, “hypertension”. Individual codes are used in the EHR and can be very specific, for example, “Referral to hypertension clinic”. A single code is therefore not sufficiently precise to identify people with a condition: searching the EHR for “hypertension” might miss those with a code for “O/E: BP reading very high”, or may incorrectly include those with “Ocular Hypertension”. Deciding exactly which codes to include or exclude within a codelist is often context specific, and requires domain knowledge of both the condition and the clinical setting in which the code was generated. The codes included or excluded may also vary depending on whether one is looking to make data entry consistent (in which case one might wish to restrict the available codes in a list) or to analyse data (in which case looking for all possible ways in which a parameter has been recorded might be more useful). In short, creating an accurate codelist requires time and expertise (Tai et al. 2007). In addition, codelists may benefit from being regularly reviewed and updated, including by people who are not the original authors, as coding systems are refined and new codes added (many rapid additions were made to the SNOMED CT codes available, for example, in response to the COVID-19 pandemic).
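To make this concrete, the idea of a codelist and its use in identifying patients can be sketched in a few lines of code. The codes and descriptions below are placeholders for discussion, not real SNOMED CT identifiers or a validated hypertension codelist:

```python
# A codelist is a structured set of (code, description) pairs that together
# define one study variable. All codes here are illustrative placeholders.
hypertension_codelist = {
    "H001": "Essential hypertension",
    "H002": "O/E: BP reading very high",
    "H003": "Referral to hypertension clinic",
}

# Codes deliberately excluded, with the reasoning documented alongside,
# so later reviewers can see why they were left out:
excluded_codes = {
    "X400": "Ocular hypertension",  # eye condition, not systemic blood pressure
}

def has_hypertension(patient_codes):
    """Flag a patient if any of their recorded codes appears in the codelist."""
    return any(code in hypertension_codelist for code in patient_codes)

print(has_hypertension(["H003", "Z999"]))  # True: referral code is in the list
print(has_hypertension(["X400"]))          # False: ocular hypertension is excluded
```

Recording the excluded codes alongside the included ones is one simple way to preserve the reasoning that, as noted above, is so often lost.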
To minimise duplicated effort, and maximise return on the investment of time in creating a codelist, they are often reused within research groups for multiple projects. They can also be used by other external researchers for related projects, or tweaked for reuse by external researchers for unrelated projects. This reuse (and external review) is enabled by the sharing of codelists, typically either by email upon request, or as an appendix to the associated research paper. It is best practice that codelists are critically reviewed before reuse to identify errors and ensure that they are appropriate for the new use-case. Checking which codes are not included within a particular codelist is difficult but important, and can be facilitated via software tools such as the OpenCodelist tree view. More recently, there have also been some attempts to widen sharing through open, online, searchable repositories of codelists, although these are typically implemented as “copy and paste” noticeboards, such as CALIBER, ClinicalCodeLists, Data Compass and a similar resource from Cambridge University’s Primary Care Unit.
All such attempts at sharing codelists for reuse are commendable, but it is worth highlighting the limitations of existing methods for sharing so that these might be overcome. These limitations fall into (at least) one of three categories: discoverability, provenance, and credit. These largely overlap with the FAIR principles (Wilkinson et al. 2016).
Requesting a codelist by email requires external researchers to know that the codelist exists in order to request it. If the existence of said codelist is not noted in the public domain, it is essentially undiscoverable, and the codelist is therefore unshareable. Similarly, appended codelists are only discoverable if the researcher has read the paper in question. As academic papers are often numerous and can be exceptionally niche, it is not a given that a researcher will have read (or even seen) all papers on a specific topic and may, therefore, miss the paper with the appended codelist. A researcher looking at stroke admissions may require a codelist denoting hypertension and so search for papers dealing with the factors associated with hypertension. If the relevant codelist is appended to a paper dealing with hypertensive control after stroke, they will not find it. In addition, even when researchers do find the relevant paper with the appropriate codelists, this may be in a PDF rather than in a machine-readable, reusable format.
Even if a codelist is discoverable, it might not be understandable if there is no documentation explaining why certain codes are included and others excluded, and which coding system the codes are from (e.g. SNOMED CT, BNF, NHS dm+d). This information provides researchers who were not involved in the original creation of the list with essential context, without which they may not be able to evaluate whether or not the codelist can be used for their project (i.e. they cannot evaluate its reliability). This can result in either underuse of existing codelists and duplicated effort (because researchers decide that it is ‘safer’ to develop a new codelist than reuse one they don’t fully understand) or an inappropriate match between codelist and research question (if a researcher uses a codelist generated in one context to investigate a clinical problem in a different context without appropriately adjusting it). There may also be sociopolitical reasons why certain decisions about a codelist were made, which researchers need to understand in order to judge whether they can reasonably reuse a codelist they find online in a different context. Detailed documentation is, however, rarely available since its creation is neither incentivised by funders nor required by journals. In some cases, it may be beneficial to be able to initiate a dialogue with the codelist creator(s) in order to further understand editorial decisions made or other nuances which were not adequately reflected in the initial documentation.
As has been established, creating a codelist is a complex and time-consuming task, and yet none of the typical methods for sharing provide a means of ensuring the original codelist creator is credited appropriately. This is likely because little attention has been paid to this issue by the research community. The work involved in the creation of a codelist is, unfortunately, not valued as highly as the work involved in running the analysis using the codelist or writing up the findings in a paper. Consequently, there are no established practices for ‘citing’ a codelist, for example. Without an equivalent to ‘citations’ for codelists, their creation cannot contribute to a researcher’s career progression unless that researcher is also named as an author on the associated paper - something that does not always happen. Some attempts to overcome this hurdle have been made. CALIBER, for example, lists the creator of the codelist. However, simply listing the name of the codelist creator when it is published online does not guarantee that they will be credited each time the codelist is used, nor is it possible to either track how many times the codelist has been reused or find the papers that have reused it.
Unless these problems with credit can be overcome, researchers have very little incentive to openly share the codelists they have created. Here we set out the options for ensuring credit attribution, exploring options from both academia and open source software development. Our intention is to spark discussion amongst these communities so that we can collectively come to an agreed upon solution.
How could current credit mechanisms be useful?
There are four main ways in which credit attribution is currently handled in academia and open-source software: citations, co-authorship, acknowledgements, and licensing. All could be used to attribute credit to codelist creators.
The practice of citing is familiar to all those in research. A published or pre-printed paper is provided with a DOI and standardised meta-data which ensures that every time that DOI is quoted in another paper, this is logged publicly. This enables cited researchers to see which of their papers have been cited, and both the original researcher and all those reading their paper to see how many times a specific paper has been cited. These counts are then used in metrics such as the h-index or Altmetric score, which can be used as indicators of impact or as measures of performance for career progression.
In theory it is already possible to cite any codelist that has been published online, in the same way that it is possible to cite any website. However, unlike DOIs for papers which have standardised associated meta-data that is automatically pulled in by citation software (e.g. Zotero, Mendeley, Paperpile) and ensures that each published citation is logged and counted, citations for websites are generally created ad hoc by individual researchers in a non-standardised format and not tracked. To make citations work for codelists, there would need to be an effort to create standardised meta-data for codelists that can be associated with an automatically generated unique DOI that always points to the original codelist, ensuring its reuse can be tracked and logged by websites such as Google Scholar and Web of Science. A web scraper could then be built to pull this information back into a central location, or repository, where codelists are stored. This is a sophisticated solution that would cover most of the ‘needs’ of codelist creators, but it would require a considerable amount of coordination across organisational boundaries.
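As a hedged illustration of what such standardised codelist metadata might contain, the snippet below sketches one possible record. Every field name, DOI, and value is an assumption made for discussion, not an existing standard or schema:

```python
import json

# Hypothetical metadata record for a minted codelist DOI. All field names
# and values are illustrative placeholders, not an existing schema.
codelist_metadata = {
    "doi": "10.0000/example.codelist.hypertension.v2",  # placeholder DOI
    "title": "Hypertension (diagnosis) codelist",
    "creators": ["Researcher X", "Researcher Y"],
    "coding_system": "SNOMED CT",
    "version": "2.0",
    "derived_from": "10.0000/example.codelist.hypertension.v1",  # earlier version
    "methodology": "Keyword search of the terminology plus clinician review.",
}

# Serialising to JSON gives citation tools something standardised to pull in,
# analogous to the metadata already attached to paper DOIs.
print(json.dumps(codelist_metadata, indent=2))
```

A `derived_from` field of this kind would also let reuse chains be traced back to the original creator, which becomes relevant to the credit questions discussed later.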
Overall we (the DataLab/OpenSAFELY team) think that DOI and citation is likely to be the most viable option; this will require some resource investment in minting DOIs.
In academia, particularly in science disciplines, it is reasonably common for all those who have made a contribution to a paper to be listed as a co-author. The International Committee of Medical Journal Editors (ICMJE), for example, recommends that co-authorship be based on the following four criteria:
Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND
Drafting the work or revising it critically for important intellectual content; AND
Final approval of the version to be published; AND
Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
It is perhaps not unreasonable to argue that codelist creators could meet these criteria. Provided that the creator’s name and email address are discoverable, this could be relatively simple to organise. However, there are questions to consider, such as: what if the codelist is used for research the codelist creator thinks is poor quality? What if this process further slows the rate of academic publication? Should the codelist creator be credited every time a codelist is reused, even if it has been altered significantly? These questions are not easy to answer, and it is likely that, if this were the preferred option, guidelines would need to be developed to ensure a consistent approach.
A slightly softer option than co-authorship is acknowledgement. The ICMJE, for example, recommends that any individual who meets fewer than all four of the above criteria not be listed as a co-author, but simply acknowledged instead. This recommendation covers those who provided writing assistance, technical editing, language editing or proofreading. This is a relatively friction-free option for codelist attribution, assuming that the codelist’s creator is discoverable. Some codelist authors may feel it under-reflects their contribution, particularly if the entire data management pipeline for a study is reused by another team; this approach might therefore only be regarded as acceptable in cases where the codelist was either simple to create, or has been adapted significantly for the project in question. Acknowledgement also still implies endorsement, and so this option still raises questions about whether a codelist creator would want to be seen as endorsing research they consider to be poor quality.
Contributors to open source code often license their code under a Creative Commons license. This allows others to copy and distribute their code as long as attribution is given to the original source. Licenses can be imposed on onward users as well, which provides a degree of control over future use. Platforms for open-source development, such as GitHub, also provide mechanisms for logging this process. For example, it is possible to see publicly if a code repository has been forked (copied), who has contributed to it, and how many contributions individual coders have made to different open source projects.
This too could be an option for ensuring codelist attribution. Journals and funders could require all authors to make their codelists publicly available online under a Creative Commons license, in the same way that some journals and funders now require authors to ensure an open access version of their publication is available online. Software solutions can be developed to aid this process, as GitHub has aided open-source code contributors. OpenSAFELY Codelists, for example, publishes codelists in a standardised format ensuring the creator(s), provenance, and version of each is clear. This might require more work on behalf of the codelist creator upfront, for example by producing detailed documentation of the process of codelist creation including versions, edits and contributors, and inclusion/exclusion criteria. However, it does minimise the risks of being associated with poor quality research highlighted above.
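A minimal sketch of what such a standardised published format could look like, assuming a simple two-column CSV of code and term; the column names and codes are illustrative assumptions, not a claim about the exact format any existing platform uses:

```python
import csv
import io

# Write a codelist as a plain two-column CSV: trivially machine-readable,
# diffable under version control, and easy to review line by line.
# Codes and terms are illustrative placeholders.
rows = [
    ("code", "term"),
    ("H001", "Essential hypertension"),
    ("H002", "Hypertensive heart disease"),
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Publishing the list in a plain-text form like this, rather than as a PDF appendix, is what makes the forking, diffing, and contribution-tracking described above possible in the first place.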
Example Codelist journeys
To prompt discussion on which of the four above options for codelist credit attribution is the preferred option, we have outlined a series of ‘codelist journeys’ which highlight the different considerations which should be taken into account.
Codelist Journey 1:
A codelist is created by a single researcher and used for one research project. It is never re-used or shared; perhaps the researcher moves on, and any publication arising from the research is not required to share the codelists by the journal. Countless hours may be spent developing the codelist, only for that to be lost to the outside world. Another researcher working in the same area wants to use the codelist but is unable to do so, and has to re-create the codelist.
Codelist Journey 2:
A codelist is created by Researcher X for a specific project. It is saved as an Excel document on a local hard-drive, where it remains for a few years. Later, Researcher Y, from the same research group, is doing a project within the same area. They email around the department, and Researcher X responds with their codelist. Researcher X describes the original project briefly, but doesn’t remember the specifics of how the codelist was put together. Researcher Y has questions about why certain codes were used instead of others and worries about the implications this has for their research. Nonetheless, Researcher Y uses it as a starting point.
Codelist Journey 3:
A researcher spends a great deal of time creating a well-constructed codelist requiring lots of domain knowledge. It is shared in an appendix of a paper, or perhaps even online. Another researcher sees this codelist, and adapts it to their research question, crediting the first researcher as author of the original codelist and publishing their version. A different researcher sees the second paper, and adapts it to their own purposes, giving credit to the second researcher. The contribution of the first researcher is lost: they are not credited, and their documentation is not read and appropriately considered.
Codelist Journey 4:
A researcher makes a well-constructed codelist and makes it publicly available. It is then taken and adapted by a second researcher, but unfortunately this is not done well, and there are obvious gaps and biases in the second codelist. Automatic credit now presents a problem: the first researcher may not want to be credited with, or associated with, this work. Would the first researcher get the opportunity to comment on the work, review it or change it, and how would this be managed?
Codelist Journey 5:
A researcher makes a well-constructed codelist and makes it publicly available. It is then taken and adapted by a second researcher. The adaptation of the codelist is good; however, the research it is used for is poor. If there were a move towards codelist creators being included in the authorship of any subsequent papers using their codelist, would there be an opportunity to comment and discuss, and how would this be practically managed?
Tai, Tracy Waize, Sobanna Anandarajah, Neil Dhoul, and Simon de Lusignan. 2007. “Variation in Clinical Coding Lists in UK General Practice: A Barrier to Consistent Data Entry?” Informatics in Primary Care 15 (3): 143–50.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (March): 160018.