From ad tech tax to AI data brokers: the new middlemen keep 100%, publishers say
Publishers have weathered every digital headache going — ad tech tax, murky supply chains, algorithm whiplash — but there’s a new one that, even in passing, makes jaws tighten: third-party content scraping.
For some publishers, it lands as an even bigger affront than the ad tech tax they’ve spent years navigating — not a share of the pie, but the pie itself.
One publishing exec, who requested anonymity so he could speak freely, likened the new crop of AI data brokers to ad tech middlemen running demand-side platforms (DSPs) for content. “We’ve got all these 30, 40, 50 startup DSPs for content, but they’re taking a 100% fee,” he said. “That is the market that we are seeing emerging now.”
Publishers have long been willing to tolerate the ad tech tax as long as it clearly adds value — the persistent frustration has been less about the cost itself, and more about the lack of clarity over where, exactly, that value is being created.
Chris Dicker, CEO of Candr Media, believes third-party scraping makes the ad tech tax look like small fry. “At least with ad tech middlemen publishers got something back,” he said. “With scrapers, the value extraction is total. They’re taking 100% of the content, paying 0% and then in some cases using that content to create competing products that remove the publisher entirely. It’s not a tax, it’s a hostile takeover funded by our own IP.”
What compounds the issue is the bad-faith behavior layered on top, he noted, whether that is companies using stealth, undeclared crawlers to slip past detection and evade websites’ no-crawl directives, or companies publicly announcing that they won’t adhere to those directives at all. “So you’re not just dealing with freeloading, you’re dealing with active deception and abuse of scale designed to defeat the few defensive tools publishers have left. If the message is ‘no crawl,’ then they need to remember that no means no,” said Dicker, who is also on the board of the Independent Media Alliance.
Media analyst Matthew Scott Goldstein’s recent report on the “scraper economy” underlined that this is a $1 billion industry, citing Mordor Intelligence data. Yet it’s an industry publishers make zilch from.
What’s worse, Goldstein believes that third-party web scrapers are now rebranding as “agentic infrastructure” so they can continue stealing in plain sight. In a post he wrote on LinkedIn on April 29, he called out Parallel Web Systems as a company doing just that.
“The scraper economy is being rebranded as agent infrastructure, and while the technology is getting sharper and the enterprise pitch is getting cleaner, the underlying economics have not changed because agents will consume the web at a scale that dwarfs human behavior,” he wrote, “and until a real marketplace layer exists to price and govern that consumption, this category is fundamentally competing on who can extract the most value from the web the fastest while the question of who gets paid remains unresolved.”
Goldstein’s report identified 21 vendors doing this, including Firecrawl, Exa, Tavily, Brave, You.com, Perplexity Sonar and Bright Data. (TollBit also has a running index on third-party scrapers, identifying nearly 40 vendors.)
Publishers have repeatedly leaned on the idea that they’re the “hosts” being eaten alive, arguing that without their content, future LLMs wouldn’t exist. Yet increasingly, it feels like they’re shouting into the void, with licensing deals driven less by recognition of value and more by platforms moving to limit legal exposure.
Napster has become the go-to cautionary tale: a moment when the music industry saw its value stripped out at scale, much like what publishers fear is happening now.
“We’re in a world with more and more Napsters, but we don’t yet have iTunes or Spotify… we’re only in a race with the pirates, and the pirates are quicker, as they always are,” said the same publishing exec.
For those publishers that syndicate content elsewhere across the web, blocking AI crawlers is increasingly a game of whack‑a‑mole. Even if they lock down their own domains, their stories often reappear on large portals and customer sites that carry their feeds, a publishing exec previously told Digiday on background.
When these publishers challenge AI firms about scraping that content via third parties, they’re frequently told the problem lies with the portals’ settings rather than the AI companies’ own crawling practices — effectively shifting responsibility further down the chain.