Yesterday we got a lot of new detail on the likely shape of President Obama’s proposed “Precision Medicine Initiative” (or PMI). This project envisions creating a research database of genetic data on one million volunteers, an idea that promises potentially huge medical benefits, but also raises significant privacy questions.
The initiative was first announced in the president’s State of the Union address last January, and then launched with fanfare a few weeks later in a White House gathering of genetics experts, patients, academics, and government officials. To their credit, White House officials have recognized the privacy challenges, and have sought input from privacy experts from the beginning (the ACLU was invited to the program launch, and to several meetings since to discuss the privacy issues involved). In July, the White House produced a document that outlined some basic “privacy and trust” principles to which the program is supposed to adhere.
But many of the crucial details have still been lacking, making it hard to evaluate what has so far been essentially just a vision. The privacy document is quite good and covers all the bases, but it seems to defer the hard choices between using data in every possible way for research and protecting privacy. For example, in meetings I attended, top health officials waxed enthusiastic about the potential advantages of crowdsourcing genetic databases – leaving them wide open so that a thousand people can explore them, which they said in their experience inevitably yields discoveries that don’t come to light when data access is granted only to a small cadre of certified professional “researchers.” Certainly that comports with everything we know about the advantages of open-source and crowdsourced data. At the same time, the document envisions strict rules and limits around how people’s data will be used and shared. It’s not clear to me how those two visions will coexist.
It’s been hard to make judgments without more details, which is why I’ve been eagerly awaiting the report of an NIH advisory group tasked with working out some of the nuts and bolts of how this initiative would actually function. Yesterday, in a public meeting broadcast live via conference call, the advisory group formally presented its recommendations to NIH Director Francis Collins. At the end of the meeting, Collins announced that he was accepting the recommendations, clearing the way for implementation according to the report’s blueprint.
These are very complex issues and I haven’t had time to properly digest them, but here are some significant features of the program outlined in the report (a slide show was also presented at the meeting):
- Volunteers whose information is entered into the database (the program prefers to call them “participants”) are envisioned as being drawn mainly from health providers such as Kaiser Permanente, though any other individual will also be able to volunteer directly to be included.
- The data obtained from each participant will include not just their genome, but also Electronic Health Record data—essentially their complete medical records, including such things as narrative documents, EKG and EEG waveform data, scan imagery, and “mobile health” data from wearable sensors. It will also include a bio-sample of their actual tissue—most likely a blood sample—and the results of a baseline physical exam. Basically, all available medical data that could prove useful. One reason that the system will seek to draw from established health providers like Kaiser is that they already have all this information on their patients in one place.
- The million-person Cohort is envisioned as being longitudinal—it will feature an ongoing relationship with participants, including continuous information collection.
- Information and findings will also be fed back to participants—both aggregate scientific findings, and also findings of individual relevance.
- The database would be open for exploration by researchers of all kinds—anyone from academic professionals to high school students.
- Any new data that results from (for example) running a new algorithm on the Cohort would have to be shared back with the project and available to others. This is good; this database will belong to the public and its fruits should likewise belong to the public.
- The report details a governance system that includes significant input from program participants. Also good.
Perhaps most significantly for privacy, the report recommends that the program “should create and use de-identified data for research whenever feasible to do so.” At the same time, it also wants participants to be “re-contactable.” In its key paragraphs on privacy, the report recognizes the complexities involved:
A national cohort that includes a highly interactive approach to communicating with and soliciting input from study participants will necessarily have to operate in two data management modes, while respecting participant preferences and terms of consent. The “fully identified” mode of operations will be needed for messaging, study appointment reminders, phone interactions, etc….
Aggregate data assembled for analysis will need to be de-identified by removal of standard classes of personal identifiers such as those specified by HIPAA Limited Data Set and Safe Harbor provisions. These are imperfect privacy standards, however, and the clinical and research-generated data are expected to be rich in features that make each individual’s contribution unique. Uniqueness is not synonymous with re-identification (which requires, in addition, a naming source), but the proliferation of data mining methods and potential naming sources (voter lists, public registries, social media postings, ancestry web sites, etc.) means that technology alone will be insufficient to address issues of data privacy for the PMI cohort. Expert testimony presented at [a program workshop] brought forth the view that de-identification should not be thought of as a guarantee of anonymity, but rather simply “another disincentive to attempting re-identification of individuals.” Acceptable use policies with substantial enforceable sanctions will need to be developed or adapted from other similar research efforts to complement the technical approaches to deidentification of data.
In short, it may be possible to re-identify participants from medical records in the database, but those who attempt to do so will be subject to as-yet-unspecified “sanctions.”
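To make the report’s point concrete, here is a minimal, purely illustrative sketch in Python of the two steps it describes: stripping HIPAA Safe Harbor-style direct identifiers from a record, and then checking how many records remain unique on the clinical features that are left. The field names, sample records, and choice of quasi-identifiers below are hypothetical examples of mine, not anything specified in the PMI report.

```python
# A purely illustrative sketch: field names and records are hypothetical,
# not drawn from the PMI report or any real dataset.
from collections import Counter

# A few of the HIPAA Safe Harbor identifier classes (the full rule lists 18).
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email", "ssn", "mrn", "birth_date"}

def deidentify(record):
    """Drop direct identifiers; the rich clinical/genomic features remain."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

def uniqueness_counts(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values.
    A count of 1 means that combination is unique in the dataset, which is the
    'uniqueness' the report warns about even when no name is attached."""
    combos = [tuple(r.get(q) for q in quasi_identifiers) for r in records]
    return Counter(combos)

if __name__ == "__main__":
    cohort = [
        {"name": "A. Example", "ssn": "000-00-0000", "zip3": "837",
         "birth_year": 1948, "diagnosis": "Parkinson's", "rare_variant": True},
        {"name": "B. Example", "ssn": "000-00-0001", "zip3": "837",
         "birth_year": 1972, "diagnosis": "lupus", "rare_variant": False},
    ]
    deidentified = [deidentify(r) for r in cohort]
    counts = uniqueness_counts(deidentified, ("zip3", "birth_year", "diagnosis"))
    unique = sum(1 for n in counts.values() if n == 1)
    print(f"{unique} of {len(deidentified)} de-identified records are still unique "
          f"on (zip3, birth_year, diagnosis)")
```

In a million-person cohort rich in genomic, longitudinal, and sensor data, the number of feature combinations unique to a single participant will be enormous, which is exactly why the report concludes that technology alone will be insufficient and that acceptable-use policies with enforceable sanctions are needed to complement it.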
Ultimately, the report thus punts on the hardest details for now, recommending that the program “engage data privacy experts to create an effective combination of technology and policy to minimize risks to re-identification of de-identified data.” On yesterday’s call, as in prior meetings, I was favorably impressed with the thoughtfulness and thoroughness with which White House and NIH staff have approached the policy issues raised by this project, the privacy implications among a number of other knotty ones. That said, strictly from a privacy point of view, there remain some significant questions for those contemplating volunteering for this program. It does not look as though this will be an airtight, privacy-protective system in which subjects’ data are technologically guaranteed to stay private. And of course, as with any large data store in today’s world, the cybersecurity questions are considerable. Participants will have to place a fair amount of trust in those who run this program.
Of course, many people will be inspired to volunteer for this program out of a desire to help researchers fight diseases—diseases that have already affected them or people they love—or out of an abstract desire to contribute to humanity. Those are motivations we can all honor. Scientists say there’s real potential for this kind of database to revolutionize many areas of medicine. The exploitation of medical data for good is not like using big data to try to spot terrorists, a misguided effort in which the privacy downsides vastly eclipse the (unlikely) benefits. In a chart included in yesterday’s report, the authors estimate that in a population of a million people there will be, within 5 years, some 6,400 cases of Parkinson’s, 18,000 cases of lupus, 32,600 cases of breast cancer, and similar numbers for many other conditions. That will allow a lot of exploration of genetic and environmental causes of disease. Such possibilities are something that we privacy advocates do not fail to take into consideration when judging uses of data.
And not everyone feels they need airtight privacy, even for their medical records and the sensitive information they so often contain. Some people are already making their genomes public.
But it’s also important for people to have a clear understanding of what the privacy risks might be, both so that those risks can be ameliorated where possible, and so that individuals can make a fully informed decision about whether they want to participate. We want volunteers to go in with their eyes wide open. The proposal outlined yesterday, and the project overall as it unfolds, will have to be studied and analyzed closely by privacy advocates.