Khaled E Emam: Pseudonymous data is not anonymous data

Recently, efforts have been made to make health data more generally available for secondary purposes, including research. These include the recent policy announcements from the European Medicines Agency (EMA) on making clinical trials data available, industry efforts to do the same, as well as care.data in the UK.

All of these are premised on being able to anonymize the data properly before it is shared, and in a manner that will meet multiple requirements: (a) ensure that the probability of re-identifying individual patients is small, (b) meet the regulatory and legal thresholds for what is an anonymized dataset, and (c) ensure that the anonymized data quality is sufficiently high to allow meaningful analysis.

It is important to be clear about which practices will not meet the first two of these requirements, and to clarify some terminology, so that we can have an informed discussion on data sharing practices. A health dataset will have two types of variables that we care about from a privacy perspective: (a) direct identifiers, and (b) indirect identifiers.

Direct identifiers are details such as names, addresses, social insurance numbers, telephone numbers, email addresses, and any other unique identifiers. These direct identifiers are typically removed from the dataset when it is shared for secondary purposes. Unique identifiers, such as a medical record number, are converted to pseudonyms so that they can still be used to relate all of the records that belong to the same patient. When the direct identifiers have been handled in this way to protect against re-identifying individuals, the dataset is considered pseudonymized.
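To make the process concrete, here is a minimal sketch of pseudonymization in Python. The field names, the sample records, and the choice of a keyed hash (HMAC-SHA256) to derive the pseudonym are all assumptions made for illustration; they are not a description of what any particular data custodian does.

```python
import hashlib
import hmac

# Hypothetical field names; a real dataset will have its own schema.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def pseudonymize(records, secret_key, id_field="mrn"):
    """Drop direct identifiers and replace the unique ID with a pseudonym."""
    output = []
    for record in records:
        # A keyed hash maps the same ID to the same pseudonym every time,
        # so all records for one patient can still be related to each other.
        digest = hmac.new(
            secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        cleaned = {
            field: value
            for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS and field != id_field
        }
        cleaned["pseudonym"] = digest[:16]
        output.append(cleaned)
    return output


# Fictional records, invented for this example.
records = [
    {"mrn": "1001", "name": "A. Patient", "phone": "555-0100",
     "birth_date": "1970-03-14", "postal_code": "K1H 8L1", "diagnosis": "asthma"},
    {"mrn": "1001", "name": "A. Patient", "phone": "555-0100",
     "birth_date": "1970-03-14", "postal_code": "K1H 8L1", "diagnosis": "fracture"},
    {"mrn": "1002", "name": "B. Patient", "phone": "555-0101",
     "birth_date": "1982-11-02", "postal_code": "K2P 1L4", "diagnosis": "diabetes"},
]

pseudonymized = pseudonymize(records, secret_key=b"example-key-not-for-real-use")
```

Note what survives in the output: the names and telephone numbers are gone, but the date of birth and postal code are left untouched.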

The EU Data Protection Directive still treats pseudonymized data as personally identifying information, and the UK Information Commissioner’s Office considers pseudonymized data to be personal information. There are good reasons for that.

Pseudonymized data leave all of the indirect identifiers intact, and individuals can still be identified using them. Indirect identifiers are details such as dates of birth, postal codes, and dates of events, which do not name a person but can single one out in combination. All known re-identification attacks on clinical, administrative, and survey data relied on the indirect identifiers. Recently, an academic researcher gathered information from newspaper articles about vehicle accidents to re-identify individuals in a hospital discharge database, using information such as the year of birth, date of accident, the hospital the individual went to, and where they lived.
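The mechanics of such a linkage are simple, which is why custodians need to test for it before release. The sketch below is a toy illustration in Python, not a reconstruction of the study mentioned above: the field names and all of the records are invented, and the matching rule (exact agreement on every indirect identifier, with exactly one candidate) is an assumption.

```python
# Hypothetical indirect identifiers shared by both sources.
QUASI_IDENTIFIERS = ("birth_year", "event_date", "hospital", "town")


def link(public_records, released_records, keys=QUASI_IDENTIFIERS):
    """Return (name, released_record) pairs that match on every key
    and have exactly one candidate in the released data."""
    index = {}
    for record in released_records:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    matches = []
    for person in public_records:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a likely re-identification
            matches.append((person["name"], candidates[0]))
    return matches


# Fictional "newspaper" details, invented for this example.
public_records = [
    {"name": "C. Driver", "birth_year": 1975, "event_date": "2011-05-02",
     "hospital": "General", "town": "Riverton"},
    {"name": "D. Cyclist", "birth_year": 1990, "event_date": "2011-07-19",
     "hospital": "General", "town": "Lakeside"},
]

# Fictional pseudonymized discharge records with indirect identifiers intact.
released_records = [
    {"pseudonym": "a1f3", "birth_year": 1975, "event_date": "2011-05-02",
     "hospital": "General", "town": "Riverton", "diagnosis": "fracture"},
    {"pseudonym": "9c2e", "birth_year": 1990, "event_date": "2011-07-19",
     "hospital": "General", "town": "Lakeside", "diagnosis": "concussion"},
    {"pseudonym": "77b0", "birth_year": 1990, "event_date": "2011-07-19",
     "hospital": "General", "town": "Lakeside", "diagnosis": "laceration"},
]

matches = link(public_records, released_records)
```

In this toy data the first person matches exactly one released record, so the pseudonym offers no protection. The second person matches two records, so the match is ambiguous.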

This is not surprising. There is evidence that basic demographics, such as the date of birth and the postal code, can uniquely identify almost all of the population. For example, these two pieces of information are unique identifiers for almost all of the population in Canada and the Netherlands, and for a high percentage of the population in the United States. These basic demographics are easy to get from public sources and can be used to re-identify individuals.
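One simple way to see how identifying a set of variables is in a given dataset is to count how many records are unique on those variables. The following Python sketch does that for a handful of invented records; the field names and data are assumptions, and uniqueness within a sample is only a rough indicator, not a full risk measurement for the underlying population.

```python
from collections import Counter


def uniqueness_rate(records, keys):
    """Fraction of records whose combination of values on `keys`
    is shared with no other record in the dataset."""
    combos = [tuple(record[k] for k in keys) for record in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)


# Fictional records, invented for this example.
sample = [
    {"birth_date": "1970-03-14", "postal_code": "K1H 8L1"},
    {"birth_date": "1970-03-14", "postal_code": "K1H 8L1"},
    {"birth_date": "1982-11-02", "postal_code": "K1H 8L1"},
    {"birth_date": "1982-11-02", "postal_code": "K2P 1L4"},
    {"birth_date": "1991-06-30", "postal_code": "K2P 1L4"},
]

rate_postal_only = uniqueness_rate(sample, ["postal_code"])           # 0.0
rate_combined = uniqueness_rate(sample, ["birth_date", "postal_code"])  # 0.6
```

Neither variable needs to be rare on its own. In this toy data no record is unique on postal code alone, yet three of the five become unique once date of birth is added.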

The risks are not theoretical: there have been re-identification attacks on health data. There is also the issue of public trust. How can the public trust that data custodians are protecting their data properly when they are sharing it without consent, if even the basics of disclosure control are not being adhered to?

There is a need for anonymization standards that go beyond pseudonymization to cover the indirect identifiers. As data sharing initiatives take flight, there is an urgent need to close the standards gap before data go out the door.

Read Khaled E Emam’s previous blogs in this series:

Towards standards for anonymizing clinical trials data

What are the privacy concerns when sharing clinical trials data?

Khaled E Emam is the Canada research chair in electronic health information at the University of Ottawa and an associate professor in the department of pediatrics, and is cross-appointed to the school of electrical engineering and computer science.

Competing Interests: I have read and understood BMJ policy on declaration of interests and declare the following interests: I have financial interests in Privacy Analytics Inc., a University of Ottawa and Children's Hospital of Eastern Ontario spin-off company, which develops anonymization software for the health sector.