WTF is the difference between deterministic and probabilistic identity data?

By Kate Kaye • April 1, 2021

Ivy Liu

This article is a WTF explainer, in which we break down media and marketing’s most confusing terms.

“Deterministic” and “probabilistic” identity data have become the new buzzwords in digital ad circles.

These terms have been familiar to digital advertisers, publishers and ad tech executives for years. But now that the entire industry is on the hunt for alternatives to the third-party cookie, they seem to be tossed around more frequently, especially in descriptions of how the new crop of so-called cookieless identifiers work.

Ad tech, of course, is riddled with made-up terminology. Not this time. Deterministic and probabilistic methods for making identifiable data connections have been around for years, in a variety of fields that have absolutely nothing to do with digital advertising, from public health to education to risk analysis.

Better yet: the words actually reflect their meaning. (Even better yet — no acronyms!)

What is deterministic data?
Deterministic data is information that is known to be true and accurate because it is supplied by people directly or is personally identifiable, such as names or email addresses. It’s often referred to as authenticated data.

What is probabilistic data?
Probabilistic data is based on probabilities. It consists of individual pieces of information, such as a device’s operating system or IP address, that are compiled to piece together a conclusion. In the case of ad tech, probabilistic data can be used to create an identifier.

How is deterministic data used for advertising identity?
Deterministic identifiers use deterministic data to assign an identity to a person who is online or using a mobile device, so that the identified person can be tracked across websites or apps for ad targeting or measurement. The key ingredient in deterministic identity is typically information someone supplied herself, usually by logging in with a name, email address or phone number.

So, is deterministic data the same as first-party data?
Well, sometimes. First-party data gathered directly from people by a brand or publisher includes deterministic data such as names, emails or phone numbers. But first-party data also includes a variety of other information reflecting actions taken on a website, articles read, purchase transactions or other behavioral data.

So how is deterministic data used to assign identity?
Deterministic identity is achieved when an email address supplied by a publisher or advertiser is matched to the same email address in an identity graph or database of logged-in users. Or, a deterministic ID match could happen if two entities both recognize an ID and can accurately match their records for it. Sometimes three pieces of deterministic information can be used to connect the dots. For example, if it’s known that ID1234 is johndoe@johndoe.com and johndoe@johndoe.com is ID6789, then ID1234 is a deterministic match to ID6789. Ultimately, to achieve a deterministic match, data fields must agree.
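To make the ID1234 example concrete, here is a minimal Python sketch of a deterministic match. The function names, the dictionaries and the extra sample records are hypothetical and exist only for illustration. The sketch assumes each system stores an ID alongside an email address and that trimming and lowercasing the address is enough to make the fields agree.

```python
def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so the same address compares equal."""
    return email.strip().lower()


def deterministic_matches(ids_a: dict[str, str], ids_b: dict[str, str]) -> dict[str, str]:
    """Link IDs from two systems only when their email fields agree exactly."""
    # Index the second system by normalized email for direct lookup.
    email_to_b = {normalize_email(email): id_b for id_b, email in ids_b.items()}
    matches = {}
    for id_a, email in ids_a.items():
        id_b = email_to_b.get(normalize_email(email))
        if id_b is not None:
            matches[id_a] = id_b
    return matches


# Hypothetical records: ID -> email address known to each system.
publisher_ids = {"ID1234": "johndoe@johndoe.com", "ID5555": "jane@example.com"}
advertiser_ids = {"ID6789": "JohnDoe@johndoe.com ", "ID7777": "someone.else@example.com"}

print(deterministic_matches(publisher_ids, advertiser_ids))
# {'ID1234': 'ID6789'}
```

There is no guessing here: an ID is linked only when the email fields agree, and records without an agreeing field are left unmatched.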

So what’s probabilistic data, and how is it used for advertising?
First, a bit on why probabilistic data is used. Deterministic data is hard to come by. Very often ad tech systems can’t match identities because someone is not logged in or an email address or other piece of deterministic data is not available. When advertisers complain about low match rates, it’s because there is a lack of deterministic data links.
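As a rough illustration of what a low match rate means, the Python sketch below counts how many site visitors can be linked to a known email address. The visitor records, the `known_emails` set and the `match_rate` function are all hypothetical, and real systems may define match rate differently.

```python
def match_rate(visitors: list[dict], known_emails: set[str]) -> float:
    """Share of visitors whose email is present and found in the known set."""
    if not visitors:
        return 0.0
    matched = sum(1 for v in visitors if v.get("email") in known_emails)
    return matched / len(visitors)


# Hypothetical data: visitors who are not logged in have no email.
known_emails = {"johndoe@johndoe.com"}
visitors = [
    {"visitor": "a", "email": "johndoe@johndoe.com"},  # logged in, known
    {"visitor": "b", "email": None},                   # not logged in
    {"visitor": "c", "email": None},                   # not logged in
    {"visitor": "d", "email": "jane@example.com"},     # logged in, not known
]

print(f"{match_rate(visitors, known_emails):.0%}")
# 25%
```

In this made-up sample only one of four visitors has a deterministic link, which is the kind of gap probabilistic methods are used to fill.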

Systems using probabilistic methods employ a variety of data points to decipher who a user might be. The easiest way to think about these methods is that they assign identity that is probably accurate. Basically, they’re taking their best guess to infer identity.

When publishers want to assign identity to someone who is not logged in, or a demand-side platform or identity graph provider wants to figure out if there’s a match between a site visitor and another existing ID, they employ probabilistic methods to assign identity based on a variety of probabilistic data points.

Do companies communicate whether an identity has been assigned based on deterministic or probabilistic data?
While identity tech firms provide information about how they create or link IDs in technical documentation and materials provided to clients, their IDs themselves don’t reveal whether deterministic or probabilistic methods are used. In fact, some firms take a hybrid approach to creating or matching identifiers.

What types of information are used to assign probabilistic identity?
Some identity tech firms call the information used to piece together probabilistic identity “soft signals” or “non-unique device characteristics.” Typical data points used include IP address, timestamps, browser version or screen resolution.
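One way to picture how such signals could be combined is a simple weighted score. The Python sketch below is an illustration only, not any vendor’s actual method: the weights, the 30-minute time window, the threshold of 70 and all names are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    ip_address: str
    browser_version: str
    screen_resolution: str
    timestamp: int  # seconds since the epoch


# Hypothetical weights in points; they sum to 100.
WEIGHTS = {"ip_address": 40, "browser_version": 20, "screen_resolution": 20, "timestamp": 20}
MAX_TIME_GAP = 1800  # assumed window: visits within 30 minutes count as close


def match_score(a: Signals, b: Signals) -> int:
    """Add up points for every soft signal on which the two visits agree."""
    score = 0
    if a.ip_address == b.ip_address:
        score += WEIGHTS["ip_address"]
    if a.browser_version == b.browser_version:
        score += WEIGHTS["browser_version"]
    if a.screen_resolution == b.screen_resolution:
        score += WEIGHTS["screen_resolution"]
    if abs(a.timestamp - b.timestamp) <= MAX_TIME_GAP:
        score += WEIGHTS["timestamp"]
    return score


def probably_same(a: Signals, b: Signals, threshold: int = 70) -> bool:
    """Best guess: treat two visits as one identity if the score is high enough."""
    return match_score(a, b) >= threshold


visit_1 = Signals("203.0.113.7", "Chrome 89", "1920x1080", 1_617_235_200)
visit_2 = Signals("203.0.113.7", "Chrome 89", "1920x1080", 1_617_235_800)
visit_3 = Signals("198.51.100.9", "Chrome 89", "1920x1080", 1_617_321_600)

print(match_score(visit_1, visit_2), probably_same(visit_1, visit_2))  # 100 True
print(match_score(visit_1, visit_3), probably_same(visit_1, visit_3))  # 40 False
```

None of these signals is unique on its own, which is why the result is a best guess rather than a confirmed match. Two different people with the same browser version and screen resolution would still earn points here.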

Um, isn’t this just fingerprinting?
Fingerprinting also triangulates a variety of data points to establish identity, but ad and identity tech execs often stress that there are distinctions between the two. They’re particularly compelled to draw distinctions because the practice of fingerprinting has fallen out of favor, especially since 2019, when Google said its Chrome browser would restrict its use. Google also prohibits ad tech vendor partners from using fingerprinting for identification. Other browsers like Safari and Firefox also restrict fingerprinting.

Companies employing probabilistic identification methods give varying reasons for why their techniques are distinct from fingerprinting. But the distinctions can seem convoluted or semantic.

For example, some identity tech firms argue that fingerprinting happens mainly on the advertiser side, when advertisers or ad tech firms want to create persistent identifiers without the knowledge or approval of people or publishers. Others, however, say fingerprinting happens on the publisher side, when publishers want to create IDs. Still others suggest the distinction is that fingerprinting happens only at the device level.

“It’s just in the language and that makes me furious,” one ad tech exec who spoke anonymously told Digiday. “Most ad tech companies, most identity solutions, the probabilistic IDs, these are based on fingerprinting technology — but they’re not calling it fingerprinting.”
