Why protecting paywalled content from AI bots is difficult business

Update: After this story’s publication, OpenAI introduced the ability for publishers to block its web crawler from accessing their sites.

People have used OpenAI’s ChatGPT to bypass publishers’ paywalls. So how can publishers protect their subscription businesses against generative AI chatbots siphoning their subscriber-only content?

Digiday checked in with publishers, paywall management companies and consultants to find out, and their answers largely boil down to a need for generative AI chatbot makers to signal when they are trying to access publishers’ content so publishers can treat them similarly to search engines’ content crawlers.

Generative AI chatbots like OpenAI’s ChatGPT work in a similar way to search engine bots, which crawl and collect information from sites to surface them in search results. While OpenAI suspended this capability last month, Google’s Bard and Microsoft’s Bing have not yet formally turned off their bots’ ability to crawl sites.

Publishers can turn off the ability for bots to crawl their content, but it’s difficult to distinguish AI bots from the ones coming from search engines like Google that allow pages to get indexed and appear in search results.

“If a DNC (do not crawl) flag is set by a publisher but the compliance is optional, it is unlikely to stop [large language models] from crawling websites,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice. “To my knowledge, there is not a unified ‘do not crawl’ standard in place nor any technology [available] on the market to selectively block a crawler.”

To understand the tools at publishers’ disposal, we first need to go over the two main mechanisms for delivering a paywall: JavaScript-based paywalls and paywalls built on a content delivery network (CDN). 

  • JavaScript-based paywalls work by having a page load on a reader’s device, and then overlaying a pop-up that requires a reader to log in to read more. It’s a similar delivery mechanism to overlaying an ad on a page.
  • CDN-based paywalls work by holding the page on a separate server and not letting it load on a device until a reader logs in. Examples of CDNs are Cloudflare and AWS, as well as Zuora’s Zephr, which built its own CDN.

A CDN is stronger against AI bots, but it remains unclear if it can truly block them, according to two paywall management companies.
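
To make the difference between the two delivery mechanisms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the `Article` class and both render functions are hypothetical names invented for this example. The point it shows is that a client-side overlay still ships the full text in the response, while a server-side gate withholds it.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical article record used only for this illustration."""
    title: str
    teaser: str
    body: str


def render_js_paywall(article: Article, logged_in: bool) -> str:
    """Client-side paywall: the full body is always present in the HTML.

    The overlay only hides the text visually, so anything that reads the
    raw response (such as a crawler) still receives the whole article.
    """
    overlay = "" if logged_in else '<div id="paywall-overlay">Log in to keep reading</div>'
    return f"<h1>{article.title}</h1><p>{article.body}</p>{overlay}"


def render_edge_paywall(article: Article, logged_in: bool) -> str:
    """Server- or CDN-side paywall: the body is withheld until login."""
    content = article.body if logged_in else article.teaser
    prompt = "" if logged_in else "<p>Log in to keep reading</p>"
    return f"<h1>{article.title}</h1><p>{content}</p>{prompt}"
```

In this sketch, an anonymous request to `render_js_paywall` returns the subscriber-only text hidden behind an overlay, whereas the same request to `render_edge_paywall` returns only the teaser.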

Paywall technology “could, in theory, block access to an AI-crawler… However, this would rely on AI organizations flagging their crawlers as such — such as using a consistent and known IP address [and] not altering it,” said Felix Danczak, senior director of subscriber at Zephr, a subscription platform owned by subscription technology provider Zuora.

Paywall platform Piano is developing a product called Edge Experience, which can lock content in a CDN. It’ll launch in beta with around five clients in the next month. [Editor’s note: Piano is a contracted vendor with Digiday.] Its CDN would also be able to block generative AI crawling, “as long as the client is able to identify the user agent they want to block for that particular crawler,” said Michael Silberman, Piano’s SVP of strategy.
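
The user-agent approach Silberman describes could look something like the following Python sketch. This is not Piano's or Zephr's code: the crawler token `ExampleAIBot`, the function names and the status codes returned are all assumptions made for illustration. As Danczak notes, it only works if the crawler honestly identifies itself in the `User-Agent` header.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical crawler token; a publisher would list the tokens it has
# actually identified for the crawlers it wants to keep out.
BLOCKED_AGENT_TOKENS = ("ExampleAIBot",)


def should_block(user_agent: Optional[str],
                 blocked_tokens: Sequence[str] = BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent string contains a blocked token."""
    if not user_agent:
        # No header at all: this sketch lets the request through.
        return False
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocked_tokens)


def handle_request(headers: Mapping[str, str]) -> int:
    """Return an HTTP status: 403 for blocked crawlers, 200 otherwise."""
    return 403 if should_block(headers.get("User-Agent")) else 200
```

A crawler that sends a generic browser string, or none at all, passes straight through this check, which is why the sources here stress that AI companies need to flag their crawlers consistently.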

Those interviewed for this story said there needs to be a unified approach from publishers against AI bot crawlers. One example would be signing deals with generative AI companies like OpenAI to allow them to license content, such as the one AP signed with OpenAI last month.

The best way to monitor AI crawlers is by analyzing bot traffic, said Matt Boggie, chief technology and product officer at The Philadelphia Inquirer. The Inquirer has a metered paywall, and a hard paywall on premium content. He declined to share whether the Inquirer’s paywall is built on JavaScript or a CDN.

Because it’s difficult to track where bots are coming from, publishers like the Inquirer look for “a huge spike in requests from a small range of IPs or a single IP” as a red flag, Boggie said. “But it’s definitely a difficult thing to do in real time… Often, in the course of a day, those things go unnoticed,” he added.
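
The red flag Boggie describes, a spike in requests from a single IP or a small range of IPs, can be sketched as a simple count over a batch of request logs. This is a rough illustration rather than the Inquirer's method: the function name, the threshold and the choice to group IPv4 addresses into /24 blocks are all assumptions.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable


def find_spiking_ranges(request_ips: Iterable[str],
                        threshold: int,
                        prefix_len: int = 24) -> Dict[str, int]:
    """Count requests per IPv4 network and return those at or above threshold.

    Each address is folded into its enclosing network (a /24 by default),
    so heavy traffic spread across neighboring addresses is still caught.
    """
    counts: Counter = Counter()
    for ip in request_ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return {net: n for net, n in counts.items() if n >= threshold}
```

Run over a batch of logged addresses, this returns only the ranges whose request count crosses the chosen threshold. It works after the fact on collected logs, which reflects Boggie's point that such spikes are hard to catch in real time.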

The Washington Post published a report in April showing the websites that were used to train AI chatbots. Boggie said the Inquirer’s URLs appeared in that dataset.

