The principles developed by industry and the European Medicines Agency (EMA) have made it clear that protecting the privacy of individuals is a necessary part of any policy to share participant data from clinical trials. But I often get asked: what are we protecting these participants from? In this short piece I will answer this question in a pragmatic manner, and also make clear that ensuring privacy through anonymization protects the sponsors as well as the participants.
Sponsors currently share clinical trial data through one of two mechanisms: (a) as data files, or (b) through an online portal. Either mechanism may make data (i) publicly available, or (ii) available under some kind of contract (i.e., “non-public” data). Crossing the two mechanisms with the two modes of availability therefore gives four possible ways that the data can be shared.
Shared clinical trial data are vulnerable to “attacks”: attempts by an “adversary” (the term used for someone who attacks a database to discover the identity of patients) to figure out the identity of one or more participants, and to learn sensitive health information about them from the clinical trial dataset.
But when data are shared, there is also the real possibility that participants will complain to their regulators about their personal health information being shared. A complaint will typically trigger an investigation, even if no one was actually identified in the dataset, there was no attack on the data, and there was no data breach.
Sharing personal health information in a clinical trial dataset without express consent may be a breach of regulations. An investigation may result in penalties or other sanctions if it is determined that the data were not properly anonymized, and may cause reputational harm to the sponsor at the same time. It is therefore important to keep in mind that successful attacks are not the only risk that needs to be managed here: it is also necessary to employ anonymization practices that will be convincing to a regulator in every country from which data are collected. These practices have to withstand an investigation.
Now let’s look at the attacks.
Publicly shared data are open to demonstration attacks. These attacks are almost always performed by academics or the media to demonstrate that a dataset can be attacked, and a successful attack correctly determines the identity of at least one participant in the dataset. The adversaries who conduct these attacks view themselves as “white hats” doing a service to the community by revealing weaknesses in privacy and security practices, and weaknesses in disseminated datasets.
Why would these white hats attack health data? In general, within the computer security community, finding a vulnerability in a widely deployed system bestows notoriety. These demonstration attacks are not theoretical and have been conducted against health datasets. Consequently, it is important to protect against them.
For clinical trial data that are not made publicly available, and where the data are provided to a known recipient (for example, a researcher or qualified investigator), there are three types of attacks to protect against:
1. Deliberate attack: The qualified investigator or one of his/her staff deliberately attempts to re-identify individuals in the dataset. Having the investigator sign a contract, providing access to the data through a portal, and providing privacy training all mitigate this risk.
2. Data breach: Imposing strict security controls on the qualified investigator, or using an online portal with strong built-in security controls to share the data, mitigates the risk of data breaches occurring.
3. Inadvertent re-identification: This is when a qualified investigator or one of his/her staff recognizes someone they know in the dataset from the information in the file. Such recognition would be accidental, but would still be considered a privacy breach. The only way to mitigate this risk in a consistent manner is to anonymize the data.
While there are no known re-identification attacks specifically on clinical trial data to date, the likelihood of such attacks, and of complaints, increases as more trial data are shared. The risks from these four types of attacks can be managed by putting an appropriate privacy framework in place, one that includes security and privacy controls, contractual controls, and data anonymization.
Read Khaled E Emam’s previous blogs in this series:
Pseudonymous data is not anonymous data
Towards standards for anonymizing clinical trials data
Khaled E Emam is the Canada research chair in electronic health information at the University of Ottawa, and an associate professor in the department of pediatrics, and is cross-appointed to the school of electrical engineering and computer science.
I have read and understood BMJ policy on declaration of interests and declare the following interests: I have financial interests in Privacy Analytics Inc., a University of Ottawa and Children's Hospital of Eastern Ontario spin-off company, which develops anonymization software for the health sector.