The Big Data Accuracy Myth

Amit Avner is the CEO of Taykey, a media technology company that identifies target audiences based on their real-time interests.

The adulatory hoopla over big data and real-time bidding, as exemplified in “Bye Bye, Traditional Media Buying,” is premised on many highly debatable notions, most notably that “big data” will eliminate waste in advertising.

It has been noted repeatedly that the foundational units of “big data,” the cookie and the look-alike model, are often extraordinarily inaccurate. According to the anonymous ad tech executive who confessed in Digiday this spring, “We’ve seen agencies run tests against the validity of cookies on a data exchange. The gender is wrong 30-35 percent of the time.” Targeting either gender at random is wrong about 50 percent of the time, so a 30-35 percent error rate is an improvement, but it hardly eliminates waste.

The problem only gets worse when you add filters to reach a more specific audience. Some of this is for obvious reasons: people share computers, so you’ll never know at any given moment whether your algorithm has bought my girlfriend or me. But some of it is just the nature of the platform. Quantcast published a white paper noting that “the half-life of an average third-party cookie … is approximately three days, and cookies for one third of online users last for less than an hour.” Finally, there is a fundamental flaw in the idea of using historical browsing data to predict future interests and behavior. As Jeff Hawkins said in a recent New York Times piece, “It only makes sense to look at old data if you think the world doesn’t change.” Many of the look-alike models that RTB platforms rely on to target multidimensional audience profiles are backward-looking, expensive to update and rarely validated.
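
To make the compounding concrete, here is a rough back-of-the-envelope sketch in Python. It rests on two simplifying assumptions that are mine, not the data providers': that each targeting filter is wrong independently of the others, and that cookie loss follows a simple exponential decay with the three-day half-life Quantcast cites. (The same quote says a third of users' cookies vanish within an hour, so real decay is probably steeper early on.) The function names are hypothetical.

```python
def combined_accuracy(per_filter_accuracy: float, num_filters: int) -> float:
    """Chance that every filter is right for one impression.

    Assumes each filter errs independently, which is a simplification.
    """
    if not 0.0 <= per_filter_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return per_filter_accuracy ** num_filters


def surviving_fraction(days: float, half_life_days: float = 3.0) -> float:
    """Share of cookies still alive after `days`.

    Assumes pure exponential decay with the quoted three-day half-life.
    """
    return 0.5 ** (days / half_life_days)


if __name__ == "__main__":
    # 0.70 corresponds to the "wrong 30 percent of the time" end of the quote.
    for filters in range(1, 5):
        print(filters, round(combined_accuracy(0.70, filters), 3))
    for days in (3, 6, 30):
        print(days, round(surviving_fraction(days), 4))
```

Under those assumptions, three filters that are each right 70 percent of the time are all right together only about 34 percent of the time, and a month-old profile is built on cookies that have almost entirely turned over.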

It is not generally in either a data provider’s or an agency’s interest to call the data into question. But if agencies were to run a small quantity of their media in the form of surveys to validate targeting and hold their data providers financially accountable to a minimum level of accuracy (call it MLA so we can have another acronym), one of two things would happen. Most likely, the data providers would balk and thereby reveal their own level of confidence in their data. But the better outcome would be an improvement in the methodologies used to identify audiences, one that would make the data (not just the bidding) more real-time. The result might be a smaller audience pool that could be bought with more confidence.
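
An MLA check of this kind could be sketched as follows. This is only an illustration under stated assumptions: the survey respondents are treated as a random sample of the targeted audience, the Wilson score interval is my choice of statistical test (nothing in the proposal above specifies one), and the function names and thresholds are hypothetical.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion.

    `correct` is the number of surveyed users whose declared attribute
    matched the data provider's label; z = 1.96 gives roughly 95% confidence.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def meets_mla(correct: int, total: int, mla: float) -> bool:
    """True if even the pessimistic accuracy estimate clears the agreed MLA."""
    return wilson_lower_bound(correct, total) >= mla


if __name__ == "__main__":
    # Hypothetical survey: 700 of 1,000 respondents matched the label.
    print(meets_mla(700, 1000, mla=0.65))
    print(meets_mla(700, 1000, mla=0.70))
```

Using the lower bound rather than the raw match rate is a deliberate choice in this sketch: it keeps a small survey from passing on luck, and it gives the provider a reason to want larger validation samples.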

The other solution is a blunt instrument that would put a lot of companies out of business but would absolutely improve data accuracy: the Facebook ad network, where all targeting is based on declared data and there are few look-alike models, only (anonymous) individuals who can be bought on the basis of granular knowledge and not inference. Even Facebook will need to be more agile (that you liked Nokia two years ago doesn’t explain the Galaxy in your pocket today), but if it drops an identifier with every login, it solves the computer-sharing problem, the cookie-deletion problem, the mobile-targeting problem and almost all other big data problems in one fell swoop. Pricing may go up, but so will quality. Advertisers are used to accepting waste. But once companies figure out how best to buy Facebook targeting based on what audiences are engaging with in real time, it will be possible to reach those audiences at a lower cost by following unpredictable flows of topical interest, and to virtually eliminate waste. Then we won’t be talking about Big Data; we’ll be talking about Good Data.
