Why data mined from social media alone is garbage

Rudi Anggono is head of creative at Google in New York.

I can hear the groans already. Another article about data? Don’t worry. I’m not here to talk about ROI, KPI or any other I, which is usually how data is discussed. I want to talk about other aspects of data that may not be obvious to some people — especially data collected in the social media space. It’s a two-faced, insincere, duplicitous, lying sack of shit.

Consider declared data.
Declared data is the perfect vacation selfie you post on Instagram, the adorable baby video you upload, the numerous “likes” you give, the witty remarks you leave, the polite white lie you tell your waiter even if you hate the food. You get the idea. It’s influenced by your mood, your prejudices, your political agenda, your insecurities, which shape your carefully (or not so carefully) crafted public image. And it’s not entirely reliable because it’s mostly made up of half-truths.

For example, just because I give a “like” to a cute cat video, it doesn’t mean I like cats. In fact, I hate cats. I do that to show my support for the poster. Because I know he just lost his partner and that cat is the only living thing that ties him to his loved one. This is context that the act of liking (the declared data) fails to recognize. Yet it’s crucial context. Or I may comment on an issue that I don’t particularly care about, but I do that to appear smart. I may even write in the poster’s native language even though I don’t speak it.

It’s fake. It’s insincere. Yet this is the data people declare to the world. Because it’s what humans do. We lie about a lot of things. Renowned cultural anthropologist Genevieve Bell once said we lie because we want to tell better stories, to project better versions of ourselves. It’s part of our genetic makeup as political animals to be accepted and survive. Unfortunately, these lies are being captured as data. Declared data. A lying sack of shit.

Which brings us to intent and behavior.
Typically, in dealing with most lying sacks of shit, you look for the intent. You confirm that intent through the behavior. The problem is that intent is not always obvious. You have to cast a much wider net, beyond the environment where the data is declared, so you can extrapolate and cross-reference. Search data is a good place to start because people search with intent, but it’s not always enough. You need to look at other data points to collect more intents and behaviors.

If I like the cute cat video, and at the same time I’m searching for “the loss of long-term partner,” those two data points alone are not enough to provide context. But if you also know that I’ve been watching videos on YouTube about dying patients with their pets, while also shopping for books on Amazon about coping with loss and caring for pets after the owner passes away, then these four data points will start showing you the fuller picture, the context.

These are digital behaviors, fueled by intents. The initial declared data of “liking” a cat video becomes more than just that. It’s deeper. This is where a much-improved algorithm for data-driven prediction (aka machine learning), combined with human intuition, comes in handy.

The more you know.
We should then use declared data in concert with intent and behavior. True, in most cases, you’ll hear arguments for one or the other, which is a little bit like a pineapple arguing with a banana about which one is the real fruit.

But whether you’re a marketer, a political strategist, an agency planner or an intern researching a paper, you can’t trust data based on what people tell you. It’s almost always a lie. You have to prod, extrapolate, look for the intent, play good cop bad cop, get the full story, get the context, get the real insights. Use all the available analytical tools at your disposal. Or if not, get access to those tools. Only then can you trust this data.
