The California Consumer Privacy Act (CCPA), the nation’s most expansive data privacy law, will go into effect in 2020. It governs the ways companies can use the personal information of consumers, and grants individuals certain rights to data that companies have collected about them. Before CCPA takes effect, California’s Attorney General will issue regulations to clarify parts of the law and, in particular, to update the definition of personal information.
One type of personal information is a probabilistic identifier. Understanding the meaning of probabilistic identifiers is critical to understanding the scope of CCPA, but the legislative language is vague and flawed. The Attorney General has the power and mandate to clarify the meaning of the term through regulation. In order to reduce uncertainty during transition to CCPA enforcement, the Attorney General’s regulations on CCPA should clarify three points.
- Probabilistic identifiers can take many forms.
- The power of the data, not its form, is important.
- Probabilistic identifiers might not consist of personal information.
Before I can explain what these three points mean, we need to understand what’s wrong with the legislative definition of probabilistic identifiers. And to do that, we need to talk about something called cross-device tracking.
In the digital advertising industry, the term probabilistic identification is most often used to refer to a set of technologies enabling cross-device tracking. Cross-device tracking is used to link consumer behavior across devices and contexts.
Here’s how it works.
It all starts with an online tracker. An online tracker collects information about people and devices that connect to websites. Typically trackers are operated not by the webpage being visited, but by third parties that monitor access to thousands of different websites all at once.
Let’s imagine an online tracker that collects from each webpage access: the URL of the webpage, the time, the type of device being used (e.g., iPhone, Windows laptop), and the device’s location rounded to the nearest 100 feet. Our tracker collects this information constantly from hundreds of webpages, resulting in a raw dataset with thousands or millions of new records every hour.
This information may seem rather innocuous. After all, the tracker doesn’t even record any persistent device identifiers (i.e., unique identifiers like cookies, IP addresses, or advertising IDs).
Even so, our tracker can learn a lot about me by analyzing my movement and browsing patterns over time. It can learn that I have two devices, a phone and a computer. It can deduce what time I get out of bed, whether I’m late to work, and that I may be in the market for new clothes. The tracker can build a profile of me, my devices, my habits, and my interests, all without knowing my name.
The inferences drawn from this data are probabilistic: they are essentially best guesses based on the analysis of data, but in practice these best guesses are quite good. This type of tracking and inference is happening on thousands of websites, to millions of people and devices, all the time.
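To make the idea concrete, here is a minimal sketch of how a tracker might guess that two devices belong to the same person using only the kind of records described above. Everything in it is an assumption made for illustration: the records, the location labels, the choice of a simple overlap score (the Jaccard index), and the 0.5 cutoff, which loosely echoes the "more probable than not" idea. Each device type stands in for a single device, which would not hold in real data. Real cross-device tracking systems are far more sophisticated.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical raw tracker records: (device_type, hour_of_day, location_cell).
# No names, cookies, IP addresses, or advertising IDs appear in this data.
RECORDS = [
    ("iPhone", 7, "home_A"), ("Windows laptop", 8, "home_A"),
    ("iPhone", 10, "office_X"), ("Windows laptop", 11, "office_X"),
    ("iPhone", 20, "home_A"), ("Windows laptop", 21, "home_A"),
    ("Android phone", 9, "office_X"), ("Android phone", 19, "home_B"),
    ("Android phone", 22, "home_B"),
]

def link_scores(records):
    """Score each device pair by the overlap (Jaccard index) of the places they appear."""
    places = defaultdict(set)
    for device, _hour, cell in records:  # the hour is ignored by this simple score
        places[device].add(cell)
    scores = {}
    for a, b in combinations(sorted(places), 2):
        scores[(a, b)] = len(places[a] & places[b]) / len(places[a] | places[b])
    return scores

def likely_same_user(records, threshold=0.5):
    """Return device pairs whose overlap score exceeds the (assumed) threshold."""
    return [pair for pair, score in link_scores(records).items() if score > threshold]

print(likely_same_user(RECORDS))
```

In this toy data, the iPhone and the Windows laptop always show up in the same two places, so they are linked; the Android phone shares only the office and is not. The result is a best guess, not a certainty.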
All of these inferences feed into a device network or device graph. This is a network of devices, their users, the relationships among them, and profiles of them. Device networks are proprietary, large, constantly updated, and very useful for online advertising. Most often, companies that construct these networks do not give away the whole thing. Instead, companies sell limited rights of access through some interactive interface or application. An interface might allow access to individualized information or only provide aggregated statistical information. However, even aggregated statistical information can often allow information about an individual to be inferred.
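The last point, that aggregate statistics can reveal individual information, can be illustrated with a so-called differencing attack. The sketch below is hypothetical throughout: the profiles, the field names, and the `count` interface are invented for illustration, and it assumes the target is the only device of its type in its area.

```python
# Hypothetical device-network profiles; the vendor never releases these directly.
PROFILES = [
    {"device": "iPhone", "area": "A", "shops_for_clothes": True},
    {"device": "Windows laptop", "area": "A", "shops_for_clothes": True},
    {"device": "Android phone", "area": "A", "shops_for_clothes": False},
    {"device": "iPhone", "area": "B", "shops_for_clothes": False},
]

def count(predicate):
    """The only access the hypothetical interface offers: a count of matching profiles."""
    return sum(1 for p in PROFILES if predicate(p))

def infer_shopper(area, device):
    """Differencing attack: compare two aggregate counts that differ by one profile."""
    everyone = count(lambda p: p["area"] == area and p["shops_for_clothes"])
    all_but_target = count(
        lambda p: p["area"] == area
        and p["device"] != device
        and p["shops_for_clothes"]
    )
    # If excluding the target lowers the count by one, the target has the attribute.
    return everyone - all_but_target == 1

print(infer_shopper("A", "Windows laptop"))
print(infer_shopper("A", "Android phone"))
```

Neither query returns anything about an individual, yet the difference between the two counts reveals whether one specific device's user is shopping for clothes.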
Cross-Device Tracking and Probabilistic Identifiers
The data used for cross-device tracking takes many forms: it transitions from raw data collected by a tracker, to inferred individual or device profiles, to a device network describing various connections and relationships among devices and consumers, to a (possibly statistical) interface making use of the device network.
Cross-device tracking and probabilistic identification go hand in hand. The data used for cross-device tracking must count as a “probabilistic identifier” under CCPA if the term is to have any meaning at all.
But the meaning of a probabilistic identifier under CCPA is uncertain. The law defines the term as “the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.”
It is relatively clear that the inferred profiles and the device network are probabilistic identifiers under the current law, so long as they are accurate “to a degree of certainty of more probable than not.” They are the results of “the identification of a consumer or a device.”
On the other hand, the status of the raw data underlying these inferences is not so clear. Certainly this raw data is information that can be used to identify a consumer or a device. But it would not be unreasonable to argue that because the raw data hasn’t been used to perform this identification, it does not itself constitute “the identification of a consumer or a device” as in the definition of probabilistic identifier. Similarly, while an interface to the device network may allow the identification of an individual or device, the interface itself is arguably not a probabilistic identifier, especially if the interface is statistical in nature.
The data that underlies probabilistic identification is collected en masse and later analyzed to extract individualized information. As the wellspring of probabilistic identifiers, this data should itself be a probabilistic identifier. But according to one reading of CCPA, it would not be. This is because the legislative definition seems to distinguish between what can be probabilistically inferred from data and what has been inferred. This is a distinction without a difference, and one that every related definition in CCPA rejects.
The Attorney General has the power and mandate to clarify the meaning of the term through regulation, and the regulations should make the following three points. Data that can be used to probabilistically identify users should be subject to the same legislative framework as other personally identifying information, which is already clearly covered under CCPA. Clarifying the term “probabilistic identifier” in regulation would resolve the flawed and vague language in the Act, and would clearly extend the responsibilities of companies and the rights of consumers to these kinds of data collection.
Probabilistic identifiers can take many forms.
The data used to track users online comes in many forms, including those described below. The Attorney General’s regulations should clarify that all of these forms are covered by the law. Doing so would further clarify that the regulation covers datasets that have already been shown to enable probabilistic identification.
- A collection of information about multiple consumers or devices from which a probabilistic identifier may be inferred (e.g., raw data, device network).
- A digital application or interface from which a probabilistic identifier may be inferred (e.g., interface to device network).
- A collection of multiple attributes or other data about an individual consumer or device (e.g., inferred profile, information used for browser fingerprinting).
- Any other information that can be used for probabilistic cross-device tracking or device-fingerprinting.
The power of the data, not its form, is important.
The term “probabilistic identifier” is currently defined as the “identification of a consumer.” This focuses the definition on the form of the data, not its power; on what has been done with the data, not what can be done with it. As a result, raw data may be treated differently from the inferences drawn from it, even though the two carry the same identifying power.
The Attorney General should clarify that a “probabilistic identifier” is information which can be used to identify or recognize a consumer or device, rather than the identification of the consumer or device itself.
The definition of “probabilistic identifier” is constructed inconsistently with the other definitions of information and identifiers in CCPA. The authors of CCPA recognized this distinction in the definitions of “personal information” (“information that identifies…”), “unique identifier” (“a persistent identifier which can be used to recognize…”), and “deidentified” (“information which cannot reasonably identify”). The difference is one of the many inconsistencies born of CCPA’s frenzied passage and should not be considered a sign of legislative intent.
Probabilistic identifiers might not consist of personal information.
The Act suffers from a circular definition of “probabilistic identifier.” A probabilistic identifier is a subtype of personal information, but it must be based on “categories of personal information.” This creates an ambiguity that threatens to strip probabilistic identifiers of all meaning.
Taking the circular definition literally, the probabilistic identification of a consumer based on a collection of many discrete data points may not be considered a “probabilistic identifier” if each of the data points separately does not rise to the level of personal information. It is hard to believe that the legislature intended this interpretation.
The Attorney General should clarify that—while based on categories of information included in, or similar to, the categories enumerated in the definition of personal information—probabilistic identifiers may consist of a collection of multiple pieces of information that do not separately constitute personal information or unique identifiers.
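A small illustration may help show why the combination matters more than the individual pieces. The sketch below uses invented records and field names (all hypothetical) and measures what fraction of records are unique when looking at one field versus several fields together.

```python
from collections import Counter

# Hypothetical records: no single field is identifying on its own.
PEOPLE = [
    {"device": "iPhone", "area": "A", "wake_hour": 7},
    {"device": "iPhone", "area": "A", "wake_hour": 9},
    {"device": "iPhone", "area": "B", "wake_hour": 7},
    {"device": "Android phone", "area": "A", "wake_hour": 7},
    {"device": "Android phone", "area": "B", "wake_hour": 9},
    {"device": "Android phone", "area": "B", "wake_hour": 7},
]

def unique_fraction(records, fields):
    """Fraction of records whose values for `fields` are shared with no other record."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)

for fields in (["device"], ["area"], ["wake_hour"], ["device", "area", "wake_hour"]):
    print(fields, unique_fraction(PEOPLE, fields))
```

In this toy dataset, no record is unique on any single field, but every record is unique once the three fields are combined. A definition that looks only at whether each data point separately counts as personal information would miss exactly this effect.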
 For example, the data releases published by the US Census Bureau, which have been shown to allow the accurate reconstruction and reidentification of tens of millions of Americans. https://twitter.com/john_abowd/status/1114942180278272000?lang=en
 For example, interactive statistics applications such as those developed by the Australian Bureau of Statistics and the Israeli Central Bureau of Statistics, which have been shown to allow the reconstruction of the underlying datasets. https://www.haaretz.com/surveys-not-as-anonymous-as-respondents-think-1.5288950, https://arxiv.org/abs/1902.06414.
 For example, the Netflix Prize Dataset, which consisted of deidentified data on individual users’ movie-watching habits and was famously shown to be vulnerable to partial reidentification. https://arxiv.org/abs/cs/0610105