Knowledge extraction on anonymized data – Differential Privacy

Anonymity at scale

This article is the 3rd in a series on data anonymization. Remember that the purpose of data anonymization is to protect the privacy of an entity in a dataset while allowing the extraction of useful statistical information from the complete set. One of the main arguments of this series is the idea that anonymity is not a universally defined concept. It is a highly contextual notion, and in such its success depends on the definition we give it, somewhat arbitrarily. Any process of anonymisation must therefore necessarily be preceded by a thorough risk analysis and background study.

The first article focused on presenting the concepts and basic terminology of the field, presenting the fundamental tradeoff of privacy and utility. The second article presented the idea of k-anonymity, a definition of privacy on a set of data where each combination of sensitive information (quasi-identifiers) is indistinguishable from at least k-1 other records. While this approach is satisfactory for protection against some linkage attacks, it has a number of flaws:

  • Its success depends on what you define is a quasi-identifier,
  • It is not robust in time; a new external dataset, event or background information disclosure could in principle break the property or make previously benign information sensitive,
  • While it makes it hard to figure out which record is which, it gives no guarantee of protection whatsoever against determining whether a particular entry is present in the dataset.

This article introduces the concept of differential privacy, which aims to solve those concerns.  Differential privacy is a model of data privacy inspired by cryptography, and is now considered the most robust. It’s been successfully adopted by tech companies like Google, Apple and Uber.

Defining privacy

Let us first give a little bit of context. In 1977, statistician Tore Dalenius gave a formal definition of what data privacy is: given a sensitive dataset, an attacker should learn nothing about an individual that they didn’t know before using the dataset. While this seems like a pretty decent, and strict, definition of privacy, Microsoft researcher Cynthia Dwork proved in 2006 that this criterion is indeed impossible to achieve. Any access to sensitive information will violate this definition of privacy because it is impossible to account for all types of background information that could in principle lead to new inferences about individuals in the dataset. With this in mind, Dwork proposed a new way of thinking about privacy. Inspired by the concept of plausible deniability, she proposed a definition of anonymity in which one can learn *almost* nothing more, on top of what one would already know, about someone if this person is present or not in the dataset.

Semi-formally, let  A be a be a randomized query algorithm, one that computes aggregate statistics within a margin of error, e.g., counts, sums, and means. Then A is said to be differentially private if, for any pair of data sets D1 and D2 that differ by a single element, a query by A on D1 or D2 does not allow them to be distinguished with a reasonable probability (say ½ + ε, where ε is an exponentially small parameter relative to the size of the dataset and the sensitivity of the query). The real definition is just slightly more mathematically involved, but the principles are the same.

Inputs differing by one record are indistinguishable from the output alone

While this definition might seem a little bit dry, what it says is that such a query algorithm, while still providing useful statistical information on the inputs, is trying to minimize the probability of identifying the differing record, hence preserving its privacy in a quantifiable way. Here are a few remarks on this definition:

  • It is a property of an algorithm, not a set of data (in contrast with k-anonymity, which intrinsically describes a dataset).
  • It provides a guarantee that the response to any query on the database will not be noticeably different if any of the individuals are present or absent from this database, allowing plausible deniability.
  • It is robust over time because it does not depend on background information or external knowledge.
  • The parameter ε can be composed, so privacy can be preserved for a fixed (but possibly large) number of consecutive (or grouped) queries.
  • The sensitivity of a query can roughly be defined as the maximum amount of knowledge you can gain by “asking the same question” on two arbitrary data sets that differs by only one element.

Most of the techniques that make it possible to obtain differential privacy consist of adding randomness according to a known statistical distribution during the request (or data collection). If the number of requests and the size of the database are large enough, the resulting sample statistic should converge with a known bias, which is therefore “factorizable” from the result a posteriori.

These ideas are illustrated by the following example:

Suppose that an organization wants to know the password robustness practices of its users. The organization will ask its members, all of whom know and apply the protocol faithfully, the following question: Is your password more than 5 characters long? One method for obtaining differential privacy at the data collection level could follow the following scheme:

  1. In secret, flip a coin.
  2. If the coin lands on tails, tell the truth about your practice.
  3. If the coin lands on heads, toss the coin again and respond positively if heads, negatively otherwise.

It is easy to see that any individual can plausibly deny having answered the truth in this investigation. Furthermore, because the bias introduced by the protocol is known, it is still possible to form a correct estimate of the number of participants who have bad password habits.

If p is the true proportion of people with a poor password, we expect to get ( ¼ )( 1-p ) + ( ¾ )p positive responses, so we can estimate p by simple arithmetic a posteriori.

Although this example is simple, it illustrates the general technique of adding a known statistical noise to the information gathering mechanism. In practice we need a much more sophisticated distribution that scales with the size of the dataset and the sensitivity of the queries.

In Practice

Differential privacy emerged in 2006 through the work of Dwork, which earned her the prestigious Gödel prize. Since her foundational article was published, the concept has been widely accepted as the strongest and most useful notion of confidentiality. Many large projects use differential privacy at all stages of the information life cycle.

Data Collection

A real world example of differential privacy for data collection is RAPPOR, an algorithm developed by Google to anonymize the data of Google Chrome users. The code is available on github. Another example is the anonymization mechanism Apple uses at the iOS user device level to improve its text and emoji prediction mechanism. While the project is not as open as Google’s RAPPOR, it is quite well documented and commented.

Data publication

There are also examples of differential privacy used to publicize data. With the collaboration of the University of California, Berkeley, Uber has developed Chorus, a middleware that modifies arbitrary SQL queries to render them differentially private.

Result publication

There is a strong interest within the machine learning community toward addressing the privacy issues of machine learning mechanisms. A supervised learning model usually requires a large number of examples from which to learn the latent statistical distribution of the data source. A potential privacy concern is that the model might leak examples from the sensitive training set. This could happen either through access to its learned parameters, or through special queries against the model. For example, we use a neural network to help our web crawler find vulnerable web pages. This network contains web pages with real vulnerabilities in its training data. It is therefore essential to prevent the identification of such pages from any external queries to the model, because these vulnerabilities might potentially not have been remedied. If the network is built with the differential confidentiality property, then such security can be mathematically guaranteed.


The differential privacy model enjoys tremendous traction and enthusiasm due to its combination of usefulness and formal character built on sound mathematical principles. It has been implemented by many tech firms caught between the opportunity to access large amounts of user data and the strong privacy concerns such an access raises. This trend will continue as people become increasingly concerned about the problems of anonymity and privacy.