University of Alberta: Even better than the real thing: Simulated, anonymized data could be key to health-care innovations

A University of Alberta researcher is developing an inventive solution to a problem plaguing health-care research around the world: how to make data-driven decisions without compromising the privacy of personal medical records.

Dean Eurich, professor in the School of Public Health, is academic lead on a project that has successfully created a “synthetic data” set that mirrors Albertans’ use of the health-care system without breaching their privacy.

“In Alberta we have some of the best health-care data available of any jurisdiction in Canada with pharmacy, lab, physician, hospital and immunization records all integrated across the province,” said Eurich.

“Now we have the potential to liberate that data with this privacy-enhancing technology,” Eurich said. “The synthetic data can be used to conduct research, train students and plan public health measures while protecting patients and their records.”

Opening access, protecting privacy
The synthetic health data project is sponsored by Health Cities, an Edmonton-based non-profit, in partnership with the School of Public Health, the Institute of Health Economics, pharmaceutical company Merck Canada and synthetic data generation group Replica Analytics, with advice from Alberta Innovates and the Office of the Information and Privacy Commissioner of Alberta.

In the first project of its kind in Canada, the researchers used machine learning to create a computer model based on 100,000 Albertans’ health-care records. Using baseline characteristics and starting values from the real data, they then generated synthetic doctors’ visits, diagnoses and prescriptions. Further testing showed the resulting data set mimics the real-world data but cannot be traced back to individual patients.
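The general idea of learning patterns from real records and then sampling new ones can be illustrated with a minimal Python sketch. This is not the project's actual method, which the article does not describe in detail. It assumes a toy set of made-up records and a simple frequency-count model in place of machine learning. The names `REAL_RECORDS`, `fit`, `draw` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy "real" records: (age_band, diagnosis, prescription).
# These values are invented for illustration, not drawn from any data set.
REAL_RECORDS = [
    ("40-59", "type 2 diabetes", "metformin"),
    ("40-59", "type 2 diabetes", "metformin"),
    ("40-59", "hypertension", "ramipril"),
    ("60+", "type 2 diabetes", "insulin"),
    ("60+", "hypertension", "ramipril"),
    ("60+", "hypertension", "amlodipine"),
    ("18-39", "asthma", "salbutamol"),
    ("18-39", "asthma", "salbutamol"),
]


def fit(records):
    """Count baseline values and how often each later value follows an earlier one."""
    baseline = Counter()
    dx_given_age = defaultdict(Counter)
    rx_given_dx = defaultdict(Counter)
    for age, dx, rx in records:
        baseline[age] += 1
        dx_given_age[age][dx] += 1
        rx_given_dx[dx][rx] += 1
    return baseline, dx_given_age, rx_given_dx


def draw(counter, rng):
    """Sample one key from a Counter, weighted by its count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def generate(model, n, seed=0):
    """Build n synthetic records: baseline first, then diagnosis, then prescription."""
    baseline, dx_given_age, rx_given_dx = model
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        age = draw(baseline, rng)
        dx = draw(dx_given_age[age], rng)
        rx = draw(rx_given_dx[dx], rng)
        synthetic.append((age, dx, rx))
    return synthetic


model = fit(REAL_RECORDS)
synthetic_records = generate(model, 1000)
```

Each synthetic record is assembled from learned frequencies rather than copied from a patient, so the overall mix of ages, diagnoses and prescriptions resembles the toy input. The sketch does not measure re-identification risk; a real project would need separate testing of the kind the researchers describe.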

“Health Cities works with innovators from the health system, academia and industry to drive new models of care,” explained Health Cities CEO Reg Joseph. “The goal is to drive economic growth and improve overall health outcomes, and the key to that is data.”

“That’s where Health Cities gets excited about this project — opening doors that have been previously closed.”

Training students, testing innovations, improving outcomes
Joseph noted that synthetic health data could be used by computer science students who would normally have to jump through many regulatory hoops to practise their skills using real health records.

“Those students need access to really big data sets,” Joseph said. “If they don’t get health data, they will move on to other fields such as the environment or transportation, and we really need data scientists in health care.”

Eurich, who is also program director of the U of A’s clinical epidemiology program and a member of the Alberta Diabetes Institute, hopes synthetic data can help pharmaceutical companies improve their medications.

“Companies don’t currently have good access to data about how their drugs are being used in the population — what are the average patients’ age or geographic location, and how do their lab results change as they progress from one drug to the next?” he explained.

“Analyzing this data could help them target patient groups where medications and services are being underutilized based on the evidence, so they can identify the right drug for the right person at the right time.”

Eurich noted that while synthetic data has the potential to give a leg up to innovation in the health-care system, results will always need to be confirmed by further research.

“This method is only hypothesis-generating,” he said. “You’re always going to have to go back and do proper studies with the real data.”

The synthetic health data could also be used to predict outcomes for new policies. “If we introduce a new drug that could reduce hospitalizations, how would that change costs to the health-care system, for example?” Eurich said.

“The model allows us to see into the future by simulating conditions that may not yet exist in the real data.”

Eurich said this first synthetic data study was relatively small compared with the complexity and number of records in the real health-care system. Now his team will tackle a larger data set based on 250,000 records of Albertans with diabetes, building in more than 50 variables such as drugs and other illnesses, in an attempt to reliably reproduce useful anonymized records.