‘Something bad might happen’: Lawyers, Anonymization and Risk

June 18, 2013

If you wanted to predict the future, who would you call upon? An economist; a statistician; Nate Silver? A lawyer might not be high on your list. Yet when faced with questions of individual privacy and data anonymization, predicting the future is precisely what lawyers are being asked to do. This article aims to illustrate why that is so, and consequently why lawyers need help from statisticians and computer scientists.

Legal background

Anonymization presents lawyers with something of a challenge. Take the European Data Protection Directive, for instance. It applies to ‘personal data’, that is, any information relating to an identified or identifiable natural person; an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.

If data is personal, the organization which decides how and why the data is processed (the data controller) becomes subject to a number of (sometimes complex) duties and responsibilities designed to safeguard the data.  If personal data can be converted into anonymized form in such a way that a living individual can no longer be identified from it (taking into account all the means likely reasonably to be used by anyone receiving the data), disclosure of information in this anonymized form will not be disclosure of personal data, and therefore those duties and responsibilities will not apply to the disclosed data.   

Identifiability

Where personal data has been anonymized, a crucial question that the courts often have to decide is whether this supposedly anonymized dataset in fact falls within the definition of ‘personal data’.  Why then does this involve predicting the future?  Because it involves an assessment of risk, or as David Spiegelhalter has put it, ‘the possibility that something bad might happen.’  This is usually deconstructed into ‘the likelihood of something happening and the impact if it actually does.’  Then some attempt is made to quantify the magnitude of these two dimensions.
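To make that abstract exercise concrete, here is a minimal sketch (my own illustration, with invented scales and thresholds rather than anything drawn from regulatory guidance) of how the two dimensions might be combined into a single score:

```python
# Illustrative only: scoring re-identification risk as likelihood x impact,
# the two dimensions described above. The scales and thresholds are invented.

def risk_score(likelihood: float, impact: float) -> float:
    """Combine likelihood (0-1) and impact (0-1) into a single score."""
    return likelihood * impact

def classify(score: float) -> str:
    """Map a score onto the kind of labels a lawyer might reach for."""
    if score < 0.01:
        return "remote"
    if score < 0.1:
        return "low"
    return "significant"

# Example: a 5% chance of re-identification with severe impact (0.8).
score = risk_score(likelihood=0.05, impact=0.8)
print(round(score, 2), classify(score))  # 0.04 low
```

The difficulty, of course, lies not in the multiplication but in putting defensible numbers on the two inputs.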

In its Code of Practice ‘Anonymization: managing data protection risk’, the Information Commissioner’s Office (ICO) advised that the Data Protection Act 1998 does not require the process of anonymization to be completely risk-free: data controllers must mitigate the risk of re-identification until the risk is ‘remote’.

So when lawyers are presented with an anonymized dataset and asked ‘is it personal data?’ they have to assess the possibility of something bad happening, ie the likelihood of someone being able to re-identify an individual, and the harm or impact if that re-identification occurred.  And tending to be conservative creatures, lawyers may be tempted to respond: yes, it could happen, therefore there is a risk, therefore the data is personal data.  

The possibility of something bad happening

There are situations where the risk of re-identification is undoubtedly high: the likelihood of re-identification is high, the potential for harm is high, and both can be asserted with some certainty. Take the request under the Freedom of Information Act 2000 for details of disciplinary action taken against employees of Magherafelt District Council in Northern Ireland. Should the Council disclose a schedule recording the penalty issued and the reason for the action, but omitting the date of the action and the gender, job title and department of each employee?

No, said the Upper Tribunal.  The information was personal data, and it would be unfair to disclose it.  But the information had been anonymized – why was it personal data?  The issue was not whether the information was personal data in the hands of the Council, but whether it was personal data in the hands of the general public.  A crucial question was whether the public could identify the individuals to whom the summarised schedule related. 

The Tribunal considered evidence that the Council was a small authority with only 150 employees, all known to each other, in a district with a population of 39,500.  The Council was likened to a family, with a high level of knowledge of each other’s affairs.

The Tribunal made use of the motivated intruder test, a motivated intruder being someone who has access to the internet and public documents and would use investigatory techniques such as making enquiries of people likely to have additional knowledge.  The requestor was an investigative journalist, and so might have been highly motivated to identify individuals using other information available and common investigative steps.  The Tribunal concluded that an investigative journalist ‘would have little difficulty in making the necessary enquiries which could lead to the identification of individuals subject to disciplinary proceedings,’ particularly as the community was small and close-knit, and that identification would be all the more likely when the sanction was suspension or dismissal.

Big and Open data

The Magherafelt decision dealt with a relatively small set of data where prior or personal knowledge about a particular individual may already have existed or could have been obtained.  The UK Government’s Open Data agenda is particularly concerned with regional or national datasets where the likelihood of personal knowledge having an impact on re-identification risk is minimized.  How should we assess the risk of re-identification in relation to such datasets?

Paul Ohm, in his paper ‘Broken Promises of Privacy’, argued that ‘re-identification science exposes the promise made by [privacy/data protection] laws – that anonymization protects privacy – as an empty one.’ Ohm highlighted ‘release-and-forget’ anonymization, in which identifiers are generalised rather than suppressed, as a particular concern, and noted that other ‘data fingerprints’, such as search queries or social media postings, can be combined with anonymized data to attempt re-identification. Others disagree with Ohm’s view that re-identification can be achieved with ‘astonishing ease’. In her new guidance, ‘Looking Forward: De-identification Developments – New Tools, New Challenges’, Ann Cavoukian, the Information & Privacy Commissioner of Ontario, Canada, restated her opinion that re-identification ‘is not easy’ and that the most significant privacy risks arise from ineffectively de-identified data. Commenting on Big Data, she said, ‘As masses of information are linked across multiple sources it becomes more difficult to ensure the anonymity of the information.’ On the other hand, Big Data could make de-identification easier to achieve: ‘smaller datasets are more challenging to de-identify as it is easier to be unique in a small dataset’, as seen in the Magherafelt case discussed above.
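The uniqueness point can be illustrated with a toy calculation (a sketch of my own, using randomly generated records rather than any real dataset): the smaller the dataset, the larger the share of records that are unique on even a handful of quasi-identifiers.

```python
# Toy illustration of why it is easier to be unique in a small dataset:
# count how many records are unique on a few quasi-identifiers
# (here gender, year of birth and a coarse region code).
import random
from collections import Counter

def fraction_unique(n_records: int, seed: int = 0) -> float:
    """Share of records whose quasi-identifier combination appears only once."""
    rng = random.Random(seed)
    records = [
        (rng.choice("MF"), rng.randint(1950, 2000), rng.randint(1, 20))
        for _ in range(n_records)
    ]
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / n_records

for n in (150, 1_500, 150_000):
    print(n, round(fraction_unique(n), 3))
# With only 150 records most combinations occur once; with 150,000 almost none do.
```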

So who should we believe? 

Trust, risk and anonymization studies

‘…our views of the facts about big risks are often prompted by our politics and behaviour, even as we insist that the rock on which we build our beliefs is scientific and objective, not the least bit personal’ (The Norm Chronicles, Michael Blastland & David Spiegelhalter, Profile Books, 2013, p 110)

In Nick Pidgeon’s view, emotional responses are very important in the assessment of risk.  ‘If you do not trust the parties who manage the risk, you are not likely to have confidence that the risk is being safely managed’.  Kieron O’Hara has said that ‘trust is an important risk and complexity management tool…The stronger X’s trust, the higher the degree, and the greater the risk he is willing to take.’

A recent Ipsos MORI study of Public Understanding of Statistics examined how much trust the participants had in information provided by certain categories of people.  Information provided by scientists was the most trusted (28% trusted the information ‘a great deal’, 46% ‘a fair amount’, compared to politicians: 1% ‘a great deal’, 7% ‘a fair amount’). 

And so we might expect scientific anonymization studies to increase public (and decision-makers’) understanding of the risks of re-identification. The opposite may sometimes be the case.

Trust is bound to be affected by what Daniel Barth-Jones has described as ‘anxiety-inducing media storms’ over recent re-identification research and demonstrations. He has argued that ‘many, if not most, re-identification demonstration attacks, particularly because of the way their results have been reported to the public, serve to inherently distort the public’s (and, perhaps, policy-maker’s?) perceptions of the likelihood of “real-world” re-identification risks’. He analysed a recently reported ‘DNA hack’ carried out by Yaniv Erlich’s lab, covered in the media under headlines such as ‘DNA hack could make medical privacy impossible’.

Barth-Jones’s analysis of the study concluded that only 6% of the US population was at risk of having their last name correctly guessed by the method used (which excluded all females); a correctly guessed surname is not the same as a re-identification. Additional demographics could be used to attempt a unique identification: Erlich’s study estimated that 17-18% of males in the US might be unique with regard to the combination of surname, age in years and state of residence, and so potentially re-identifiable. In a real-world attack, however, the intruder would not know whether the last name was correct or a false positive. Barth-Jones also pointed out that this research targeted a population sub-group with Mormon ancestry, already at increased risk of re-identification due to their participation in other projects. While not downplaying the likelihood that the risks associated with this attack will increase over the next decade, Barth-Jones commented on the impact of fear on the ability to assess risk rationally: ‘when a re-identification attack has been brought to life, like some Frankenstein monster, our assessment of the probability of it actually being implemented in the real-world may subconsciously become 100 percent, which is highly distortive of the true risk/benefit calculus that we face’.

Even this non-techie author could identify discrepancies between some media reporting of Latanya Sweeney’s Personal Genome Project re-identification exercise and the reality as expressed in the Paper; this article will attempt to analyse those discrepancies. It was not the case, as reported in one article, that the study ‘elucidated the genome’ of more than 1,000 participants of the Personal Genome Project, or that 84-97% of participants were accurately re-identified. Sweeney’s paper disclosed that some Personal Genome Project participants had volunteered demographic data, such as date of birth, gender and ZIP code. In addition, some documents that had been uploaded from outside sources were found to contain the participant’s name or nickname. Sweeney’s study analysed only about half of the public profiles available (579 out of 1,130), namely those disclosing date of birth, gender and five-digit ZIP code. Re-identification was attempted using a sample of voter registrations and an online public records website. The tests yielded 241 names (42% of the profiles analysed) that might match a profile. These were submitted to the Project, which confirmed that 84% had been matched correctly (rising to 97% when possible nicknames were taken into account).
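Taking the Paper’s own figures at face value, a back-of-the-envelope calculation (mine, with rounding) shows how far the headline claims strayed from what was actually demonstrated:

```python
# Back-of-the-envelope arithmetic using only the figures reported above.
total_participants = 1130   # public profiles available at the time of the study
profiles_analysed  = 579    # profiles disclosing date of birth, gender and ZIP
candidate_matches  = 241    # names that might match a profile

confirmed      = 0.84 * candidate_matches   # ~202 confirmed correct matches
with_nicknames = 0.97 * candidate_matches   # ~234 when nicknames are allowed

print(round(candidate_matches / profiles_analysed, 2))  # 0.42 of analysed profiles
print(round(confirmed / profiles_analysed, 2))          # ~0.35 of analysed profiles
print(round(confirmed / total_participants, 2))         # ~0.18 of all participants
print(round(with_nicknames / total_participants, 2))    # ~0.21 of all participants
# The 84-97% figures describe the accuracy of the candidate matches,
# not the share of participants who were re-identified.
```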

The Paper noted that, to reduce the risk of re-identification, the participant could make the date of birth and ZIP code information less specific and (a rather obvious step, it must be said) remove his or her name from uploaded documents. The Paper did not deal with what Barth-Jones has termed ‘the myth of the perfect population register’, ie that without a complete and accurate population listing an intruder could not be certain whether the name was a correct identification or a false positive, unless additional information was available about the individual. ‘Some people will always be missing from any easily obtained source of data’, and Barth-Jones has argued that studies often omit the step of assessing the impact of this imperfection, and so arrive at highly conservative (that is, overstated) estimates of the true re-identification risks.
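A toy simulation (my own sketch, not a model taken from Barth-Jones or the Paper) illustrates the point: when the intruder’s register covers only part of the population, some matches that look unique, and therefore confident, will in fact point at the wrong person, and the intruder cannot tell which.

```python
# Toy Monte Carlo illustration of the "myth of the perfect population register".
import random
from collections import defaultdict

def match_precision(coverage: float, population: int = 10_000,
                    trials: int = 2_000, seed: int = 0) -> float:
    """Fraction of apparently unique register matches that name the right person."""
    rng = random.Random(seed)
    # Each member of the population has an id and a quasi-identifier combination.
    people = [(i, (rng.randint(1950, 2000), rng.choice("MF"), rng.randint(1, 50)))
              for i in range(population)]
    # The intruder's register covers only a fraction of the population.
    index = defaultdict(list)                 # quasi-identifiers -> register ids
    for pid, quasi in people:
        if rng.random() < coverage:
            index[quasi].append(pid)

    declared = correct = 0
    for _ in range(trials):
        target_id, target_quasi = rng.choice(people)  # person behind a released record
        hits = index.get(target_quasi, [])
        if len(hits) == 1:                    # the intruder sees a "unique" match
            declared += 1
            correct += (hits[0] == target_id)
    return correct / declared if declared else float("nan")

for c in (1.0, 0.8, 0.5):
    print(c, round(match_precision(c), 2))
# With full coverage a unique hit is always the right person; with partial
# coverage a noticeable share of confident-looking matches are false positives.
```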

Time and online content

In the debate over re-identification risk, it is also relevant to draw attention to recent scholarship on the impact of time on privacy and information lifecycles.  Contrary to the common view that information posted on the internet will be there to haunt the individual forever, in her paper ‘It’s About Time’, Meg Ambrose considered studies that have shown that 59% of web content disappears after one week, and 85% after a year (which is in itself concerning from an historical records perspective).  Ambrose commented that ‘it is possible that content can be easily accessible for a very long time, but permanence does not, at this point, appear to be a pervasive threat to most’.

But as data’s online permanence is not yet predictable, it ought for the time being to remain a factor.  Contributions from computer scientists will be essential to the on-going debate on how this factor, amongst others, should feed into the assessment of re-identification risk. 

Final thoughts

We will all be aware of the arguments in favour of anonymization: it protects privacy while allowing information to be used for important secondary purposes, for instance monitoring the quality of healthcare.  More could be done to increase transparency about the methods, risks, reasons for and benefits of anonymization, thus contributing to genuine understanding and potentially reducing the fear factor bound to affect not only lawyers but others involved in taking decisions about data disclosure.  ‘People need full information and guidance for action, rather than just reassurance, and their concerns must be taken seriously’ (The Norm Chronicles, p 119).

But we cannot be complacent. Decisions to release anonymized data must be taken with great care, taking into account the latest anonymization techniques and risk-assessment procedures and continually reassessing the changing risk environment. Cavoukian noted that the increase in genetic research, including the trend towards large-scale biobanks, poses new privacy risks: ‘Improved methods for the de-identification of genome sequences or genomic data are needed.’

In addition, it may only be a matter of time before a re-identification risk is created by the open and uncoordinated release by separate public bodies of two similar or identical datasets, one anonymized effectively, the other not.  And what about re-identification risks that may be created by personal data disclosed by a breach?

Courts faced with a dispute over a proposed or existing release of an anonymized dataset will increasingly be called upon to assess the robustness of the risk assessment. But what will be judged an acceptable risk? 10%, 2%, 0.001%? How do percentages equate to ‘remote’ or ‘minimal’ risks? The question of whether data is, or is not, personal data is ultimately a legal one, but it is not something lawyers can answer alone; they need context in order to tackle it.

‘…data protection is not sufficient for preserving privacy, or public trust, or indeed the usability of data, and the right discussions should be more wide-ranging, including not only lawyers but also representatives of all interested parties in the domain, those demanding the data, the data controllers (who undertake the risks of publication), domain experts (who know about the power of data in that domain, and the potential harms), and technical experts’ (Kieron O’Hara). 

In questions of anonymization, this is sound advice. 

Marion Oswald is a practising solicitor and Head of the Centre for Information Rights at the University of Winchester.  Before joining the University, Marion worked in legal management roles within private practice, international technology companies and UK central government, including the Ministry of Defence, and specializes in data protection, freedom of information and information technology: marion.oswald@winchester.ac.uk