Knowledge extraction on anonymized data – K-anonymity

Disclaimer: The example in this article is artificially constructed to illustrate the purpose of the research; while relevant, it does not directly represent the methods used for Delve Labs customer data handling.

How to protect your meaningful data in a provably secure way

In the first article of this series we introduced the notion of data anonymization that tries to retain some usefulness, along with three well-known techniques for achieving this property. The fundamental problem of data anonymization is to maximize the utility of the information while maintaining a minimum threshold of anonymity. We consider three phases of the data life cycle: data collection, data publication, and publication of results derived from these data. For each phase, there are different ways of formalizing the problem, depending on the objectives, the technical means, and the context of the analysis.

In a k-anonymous dataset, any individual-specific record is indistinguishable from at least k-1 other records.

We will discuss in this article the first theoretical formalization of the problem, k-anonymity, as well as the natural solution that follows this formalism. We will also present an example of data publication for vulnerability analysis which will motivate further exploration of the notions uncovered.

An example (part 1)

Delve Labs cybersecurity researchers study the context in which network and web vulnerabilities appear in order to better understand the remediation process and the generation of false positives. Suppose we have the following data model: each client has one or more networks, on which several types of assets reside: servers, databases, workstations, websites, etc. These assets will most likely have one or more vulnerabilities (for example, Spectre and Meltdown). These vulnerabilities generally appear on multiple clients’ networks.

We want to find a method of publishing this data that would allow researchers to dive into vulnerability analysis without being able to find the identity of clients from the structured set of data. We would therefore like to identify which subset of the present information would re-identify customers and then mitigate this risk using k-anonymity. Here is an overview of the data model we consider, with specific information fields:

The data model used in this example

Clients

  • Name
  • Users
  • Industries

Networks

  • IP range
  • Topology
  • Number of assets in a given subnet
  • Number of internet-facing assets
  • NetFlow data
  • Egress/Ingress rules
  • Domain name information

Assets

  • Type
  • Private/Public Network Address
  • URL/Paths
  • Open ports
  • Software stacks/versions
  • Asset Traffic
  • Metadata (Service Banners)
  • TLS certificates
  • Users

Vulnerabilities

  • Name (identifier)
  • CVE number
  • Severity
  • Exploitability (known exploits)
  • Affected software
  • Remediation
  • Detection timestamp

k-anonymity

The idea of k-anonymity appeared in 1998 in an article by Latanya Sweeney and Pierangela Samarati, following Sweeney's demonstration of a linkage attack on a partially anonymized medical dataset. The health institution had only naively masked the obvious identifying information: names, addresses, telephone numbers, social security numbers, and so on. The authors demonstrated that certain combinations of “quasi-identifiers”, for example ZIP code and age, appeared rarely enough to allow total re-identification with a simple telephone directory. A quasi-identifier is thus a value that does not allow unique identification of the subject in isolation, but makes identification possible through correlation with other values.

The lesson the authors drew is simple: if each combination of quasi-identifier values appears at least k times, it is not possible to identify an individual’s data by correlation with probability better than 1/k. We say that these k indistinguishable records belong to the same equivalence class.
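This property is easy to verify mechanically. The following sketch computes the k-anonymity level of a toy dataset; the field names (`client`, `industry`, `asset_count`) are invented for illustration and do not come from any real schema:

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the largest k such that every combination of quasi-identifier
    values appears at least k times (0 for an empty dataset)."""
    combos = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(combos.values(), default=0)

# Toy records; every quasi-identifier combination below appears twice.
records = [
    {"client": "a", "industry": "finance", "asset_count": 100},
    {"client": "b", "industry": "finance", "asset_count": 100},
    {"client": "c", "industry": "retail", "asset_count": 50},
    {"client": "d", "industry": "retail", "asset_count": 50},
]

print(k_anonymity_level(records, ["industry", "asset_count"]))  # 2
```

Here the dataset is 2-anonymous with respect to the chosen quasi-identifiers: an attacker who knows a client's industry and asset count can narrow the match down to two records, but no further.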

The data anonymization problem then becomes: determine a k that meets our anonymization criteria, isolate a set of quasi-identifiers, and modify the sensitive fields to obtain the desired property (k-anonymity). There is a vast body of research on proper definitions of quasi-identifiers and on satisfying metrics for choosing k, both of which depend directly on the context and sensitivity of the data. For example, several articles have focused on ways to anonymize search engine query logs for secure publication that respects the privacy of users while allowing for large-scale studies.

The problem of modifying the data to obtain the optimal k-anonymity property is very complex, especially as the number of dimensions grows (the problem is NP-hard). However, several heuristics can find satisfactory approximate solutions in a reasonable time. The basic components of any k-anonymization algorithm are the generalization of aggregates of values and the deletion of values that stray too far from them (thus becoming unique identifiers). The Mondrian and micro-clustering algorithms are conceptually interesting examples; we will apply the latter to our example data model.

It can be very difficult to determine the distribution of these data, that is, to choose a space and a distance measure between elements that would aggregate certain values into equivalence classes. We will only consider quasi-identifiers that can be generalized in the published data.

The anonymization procedure therefore proceeds as follows:

  1. Hide the direct identifiers.
  2. Determine the quasi-identifiers.
  3. Remove quasi-identifiers that cannot be aggregated.
  4. Partition the remaining quasi-identifiers into equivalence classes.
  5. Suppress extreme values.
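The five steps above can be sketched end to end on a toy table. Everything here is an assumption made for illustration: the field names, the bin width used for generalization, and the choice of quasi-identifiers are invented and do not reflect Delve Labs' actual procedure:

```python
from collections import Counter

def generalize_count(value, bin_width=50):
    """Step 4 helper: replace an exact count by its range (generalization)."""
    low = (value // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"

def anonymize(records, k=2):
    # Step 1: hide the direct identifier ('client' is the invented name here).
    rows = [{f: v for f, v in r.items() if f != "client"} for r in records]
    # Steps 2-4: treat 'industry' and the generalized 'asset_count' as the
    # quasi-identifiers and partition rows by their combined values.
    for row in rows:
        row["asset_count"] = generalize_count(row["asset_count"])
    classes = Counter((r["industry"], r["asset_count"]) for r in rows)
    # Step 5: suppress rows whose equivalence class has fewer than k members.
    return [r for r in rows if classes[(r["industry"], r["asset_count"])] >= k]

published = anonymize([
    {"client": "a", "industry": "finance", "asset_count": 103},
    {"client": "b", "industry": "finance", "asset_count": 120},
    {"client": "c", "industry": "retail", "asset_count": 7},
])
```

After generalization the two finance clients fall in the same asset-count bin and form a valid equivalence class, while the lone retail record would be unique and is suppressed rather than published.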

An example (part 2)

We want to extract knowledge from vulnerabilities, assets and networks without clients being identifiable.

For the sake of simplicity, suppose we have already hidden (with the help of a non-invertible hash function) the names of clients, domain names and public IP addresses, since these are obvious identifiers. Suppose also that the customer database is large enough for an extraction of relevant statistical significance. First of all, we have to determine which data can be considered as “quasi-identifiers”, that is to say what combination of data and metadata can be used to reconstruct the client’s identity from external documents (e.g. Shodan, Maltego, scans.io, DNS queries and other OSINT). We must then make a selection to make the partitioning procedure numerically possible.
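One way to do this hiding, sketched below under assumptions of our own (the key value is a placeholder), is a keyed hash rather than a bare hash: direct identifiers such as client names or IP addresses have low entropy, so an unkeyed SHA-256 could be inverted by a dictionary attack over plausible values.

```python
import hashlib
import hmac

# Placeholder key: in practice it must be long, random, and stored apart
# from the published dataset, otherwise the masking can be replayed.
SECRET_KEY = b"replace-with-a-long-random-secret"

def mask_identifier(value: str) -> str:
    """Deterministic, non-invertible masking via HMAC-SHA-256.

    Equal inputs map to equal digests, so relationships between records
    (e.g. two assets belonging to the same client) are preserved even
    though the original identifier is hidden.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

The determinism matters for the analysis: researchers can still join vulnerabilities, assets and networks on the masked identifiers without ever seeing the originals.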

Identification of quasi-identifiers

We will not consider anonymizing IP network traffic in this analysis because there are several techniques available to address this problem (see RFC 6235).

Network Information

  • IP ranges
  • Network Topology
  • Number of assets in the subnets
  • Number of “internet facing” assets
  • Firewall rules (ingress / egress)
  • Subdomains and TLDs
  • Traffic timestamps

Asset Information

  • Private network address
  • URL Components / Paths
  • Open ports
  • Software metadata (Service Banners)
  • Installed software (and versions)
  • Intensity of use
  • TLS metadata

Choice of relevant quasi-identifiers

Numerical data

  • Number of assets in the subnets
  • Number of “internet facing” assets
  • Intensity of use
  • Traffic timestamps

Semantic data

  • Subdomains and TLDs
  • Private network address
  • URL Components / Paths
  • Software metadata (Service Banners)
  • Installed software (and versions)
  • TLS metadata

Categorical data

  • IP ranges
  • Network Topology
  • Firewall rules (ingress / egress)

Partitioning procedure

There are several possible methods for partitioning the data into equivalence classes according to their nature (categorical, semantic or numerical) and their complexity (dimensionality, distribution, scale). The general idea is to create classes that group together several values. For example, categorical subnet topologies (star, mesh, bus) can be classified according to the firewall and routing rules and the number of assets per category. The clustering techniques of unsupervised machine learning make it possible to create classes algorithmically if one can embed the data into a space with a notion of distance between its elements. We can thus iteratively partition the space into regions that contain at least k elements, then choose the central value of each region (as in the k-means algorithm). This is particularly easy for aggregating IP traffic data and information about the number of assets in each subnet.
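For one-dimensional numerical quasi-identifiers (such as asset counts), this idea reduces to a simple micro-aggregation heuristic, sketched here as an illustration rather than the full Mondrian or micro-clustering algorithms:

```python
def microaggregate(values, k=3):
    """Sort numeric values, split them into consecutive groups of at least k,
    and publish each group's mean as its representative.

    The returned representatives are in sorted order, one per input value,
    so every published value is shared by at least k records.
    """
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # If the last group is smaller than k, merge it into the previous one
    # so that no equivalence class falls below k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return [sum(g) / len(g) for g in groups for _ in g]

print(microaggregate([1, 2, 3, 10, 11, 12, 100], k=3))
# [2.0, 2.0, 2.0, 33.25, 33.25, 33.25, 33.25]
```

Note how the isolated value 100 is absorbed into the nearest group: its exact value disappears from the output, at the cost of pulling that group's representative upward, which is precisely the utility/anonymity trade-off discussed above.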

 

Suppression of extreme values

In order to maintain anonymity and optimal utility, it may be necessary to eliminate values that stray too far from any equivalence class. These values could degrade the statistical significance of aggregates of information or, worse, lead to the isolation of specific data.

From left to right: the clustering process in 2 dimensions finds groups of at least k elements, chooses a representative for each, and removes extreme values.
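In code, this suppression step amounts to dropping any point whose distance to every class centroid exceeds a threshold; the 2-D coordinates and threshold below are invented for the sketch:

```python
def suppress_outliers(points, centroids, max_dist):
    """Remove 2-D points farther than max_dist from every class centroid.

    Such isolated values cannot be generalized into any equivalence class
    and would otherwise act as unique identifiers.
    """
    def dist(p, c):
        return ((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
    return [p for p in points if min(dist(p, c) for c in centroids) <= max_dist]

kept = suppress_outliers(
    points=[(1, 1), (9, 9), (50, 50)],
    centroids=[(0, 0), (10, 10)],
    max_dist=5,
)  # (50, 50) is dropped: it is far from both centroids
```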

Contextualizing

It is very difficult to determine what constitutes good anonymity. These decisions need to be context-specific (will the data be public? are they governed by laws? could they lead to physical attacks?) and based on the intrinsic nature of the data (numerical, categorical, semantic, etc.). There is also no consensus on how to measure utility. Some information-theoretic techniques calculate the information loss induced by the transformation by comparing the statistical distributions of the database before and after. These techniques are often very difficult to implement, especially when the data has a strong semantic or networked component, and they remain an active research direction in the scientific community.

As mentioned above, k-anonymity was historically the first formal model of confidentiality. It is particularly useful for publishing sensitive data and has a rich scientific literature. However, the difficulty of computing utility and the algorithmic complexity of finding an optimal partitioning in high dimensions limit its applicability. Some critics also note that this model cannot mask an individual’s presence or absence in the dataset. To address these issues, we will present the confidentiality model of differential privacy in a subsequent article.
