Why protecting paywalled content from AI bots is difficult business
Update: After this story’s publication, OpenAI introduced the ability for publishers to block its web crawler from accessing their sites.
People have used OpenAI’s ChatGPT to bypass publishers’ paywalls. So how can publishers protect their subscription businesses against generative AI chatbots siphoning their subscriber-only content?
Digiday checked in with publishers, paywall management companies and consultants to find out, and their answers largely boil down to a need for generative AI chatbot makers to signal when they are trying to access publishers’ content so publishers can treat them similarly to search engines’ content crawlers.
Generative AI chatbots like OpenAI’s ChatGPT work in a similar way to search engine bots, which crawl and collect information from sites to surface it in search results. While OpenAI suspended this feature last month, Google’s Bard and Microsoft’s Bing have not yet formally turned off their bots’ ability to do this.
Publishers can turn off the ability for bots to crawl their content, but it’s difficult to distinguish AI bots from the ones coming from search engines like Google that allow pages to get indexed and appear in search results.
“If a DNC (do not crawl) flag is set by a publisher but the compliance is optional, it is unlikely to stop [large language models] from crawling websites,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice. “To my knowledge, there is not a unified ‘do not crawl’ standard in place nor any technology [available] on the market to selectively block a crawler.”
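To make the opt-out mechanism concrete, here is a minimal sketch of how a well-behaved crawler would consult a site's robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the example URL and the helper name `allowed` are assumptions for illustration, and "GPTBot" is used as the user-agent token for OpenAI's crawler. As Tchivzhel notes, compliance is voluntary: nothing in this check stops a crawler that never runs it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/subscriber-only/story"
    for agent in ("GPTBot", "Googlebot"):
        print(agent, allowed(agent, url))
```

The rule only works if the crawler announces itself with a known token and chooses to honor the file, which is the signaling problem the publishers describe.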
A CDN (content delivery network) works by loading the page on a separate server and not letting the page load on a device until a reader logs in. Examples of CDNs include Cloudflare and AWS, as well as Zuora’s Zephr, which built its own CDN.
A CDN is stronger against AI bots, but it remains unclear if it can truly block them, according to two paywall management companies.
Paywall technology “could, in theory, block access to an AI-crawler… However, this would rely on AI organizations flagging their crawlers as such — such as using a consistent and known IP address [and] not altering it,” said Felix Danczak, senior director of subscriber at Zephr, a subscription platform owned by subscription technology provider Zuora.
Paywall platform Piano is developing a product called Edge Experience, which can lock content in a CDN. It will launch in beta with around five clients in the next month. [Editor’s note: Piano is a contracted vendor with Digiday.] Piano’s CDN would also be able to block generative AI crawling, “as long as the client is able to identify the user agent they want to block for that particular crawler,” said Michael Silberman, Piano’s SVP of strategy.
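A user-agent block of the kind Silberman describes could look roughly like the sketch below. It is not Piano's or Zephr's implementation: the function name `edge_decision`, the blocklist entries and the status codes are all hypothetical, and a real CDN would express the rule in its own configuration language.

```python
from typing import Mapping, Sequence

# Hypothetical blocklist of user-agent tokens, compared in lowercase.
BLOCKED_AGENT_TOKENS = ("gptbot", "examplebot-ai")


def edge_decision(
    headers: Mapping[str, str],
    blocked: Sequence[str] = BLOCKED_AGENT_TOKENS,
) -> int:
    """Return the HTTP status an edge rule would send: 403 if blocked, else 200."""
    user_agent = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so normalize before comparing.
        if name.lower() == "user-agent":
            user_agent = value.lower()
            break
    if any(token in user_agent for token in blocked):
        return 403
    return 200


if __name__ == "__main__":
    print(edge_decision({"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}))
    print(edge_decision({"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}))
```

The check is only as good as the header it reads. A crawler that sends a generic browser user agent passes straight through, which is why both vendors condition their answers on AI companies identifying their bots consistently.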
Those interviewed for this story said there needs to be a unified approach from publishers against AI bot crawlers. One example would be signing deals with generative AI companies like OpenAI to license content, such as the deal the AP signed with OpenAI last month.
Because it’s difficult to track where bots are coming from, publishers like the Inquirer look for “a huge spike in requests from a small range of IPs or a single IP” as a red flag, Boggie said. “But it’s definitely a difficult thing to do in real time… Often, in the course of a day, those things go unnoticed,” he added.
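The red flag Boggie describes, many requests from one IP or a small range, can be approximated after the fact by grouping logged request IPs into networks and counting them. The sketch below is an illustration under stated assumptions, not the Inquirer's tooling: the function name `busiest_networks`, the /24 grouping and the threshold of 100 are hypothetical choices, and it assumes IPv4 addresses already extracted from a log.

```python
import ipaddress
from collections import Counter


def busiest_networks(request_ips, prefix_len=24, threshold=100):
    """Count requests per IPv4 network and return those at or above threshold."""
    counts = Counter()
    for ip in request_ips:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return {net: n for net, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Made-up traffic: 150 requests from ten hosts in one /24, 5 from elsewhere.
    ips = [f"203.0.113.{i % 10}" for i in range(150)] + ["198.51.100.7"] * 5
    print(busiest_networks(ips))
```

A batch count like this only surfaces a spike once the log is reviewed, which matches Boggie's point that such traffic often goes unnoticed over the course of a day.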
The Washington Post published a report in April showing the websites that were used to train AI chatbots. Boggie said the Inquirer’s URLs appeared in that dataset.