Why protecting paywalled content from AI bots is difficult business

Update: After this story’s publication, OpenAI introduced the ability for publishers to block its web crawler from accessing their sites.

People have used OpenAI’s ChatGPT to bypass publishers’ paywalls. So how can publishers protect their subscription businesses against generative AI chatbots siphoning their subscriber-only content?

Digiday checked in with publishers, paywall management companies and consultants to find out, and their answers largely boil down to a need for generative AI chatbot makers to signal when they are trying to access publishers’ content so publishers can treat them similarly to search engines’ content crawlers.

Generative AI chatbots like OpenAI’s ChatGPT work in a similar way to search engine bots, which crawl sites and collect information to surface in search results. While OpenAI suspended ChatGPT’s ability to do this last month, Google’s Bard and Microsoft’s Bing have not yet formally turned off their bots’ ability to do so.

Publishers can turn off the ability for bots to crawl their content, but it’s difficult to distinguish AI bots from the ones coming from search engines like Google that allow pages to get indexed and appear in search results.

“If a DNC (do not crawl) flag is set by a publisher but the compliance is optional, it is unlikely to stop [large language models] from crawling websites,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice. “To my knowledge, there is not a unified ‘do not crawl’ standard in place nor any technology [available] on the market to selectively block a crawler.”
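Tchivzhel’s point about optional compliance can be illustrated with robots.txt, the long-standing convention sites use to ask crawlers to stay away. What follows is a minimal sketch in Python, not any publisher’s actual setup: the crawler names `ExampleAIBot` and `ExampleSearchBot`, the `/premium/` path and the example.com URLs are all hypothetical. It shows that the rules can single out a crawler by name, but only if the crawler announces that name and chooses to check the rules at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named AI crawler entirely,
# and keep every other crawler out of the premium section only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /premium/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks this question before each fetch.
# Nothing in the protocol forces a crawler to ask, or to obey the answer.
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/news/story")
search_allowed = parser.can_fetch("ExampleSearchBot", "https://example.com/news/story")
search_premium = parser.can_fetch("ExampleSearchBot", "https://example.com/premium/story")

print(ai_allowed, search_allowed, search_premium)
```

The check runs on the crawler’s side, which is why a rule like this works as a request rather than a lock.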

To understand the tools at publishers’ disposal, we first need to go over the two main mechanisms for delivering a paywall: JavaScript-based paywalls and paywalls built on a content delivery network (CDN). 

  • JavaScript-based paywalls work by having a page load on a reader’s device, and then overlaying a pop-up that requires a reader to log in to read more. It’s a similar delivery mechanism to overlaying an ad on a page.
  • A CDN-based paywall works by loading the page on a separate server, and not letting the page load on a device until a reader logs in. Examples of CDN providers are Cloudflare and AWS; Zuora’s Zephr built its own CDN.
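The difference between the two mechanisms can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any vendor’s product works: the `Article` type, both render functions and the HTML strings are hypothetical. The point is what the response contains before a reader logs in.

```python
from dataclasses import dataclass


@dataclass
class Article:
    teaser: str
    body: str


def render_js_paywall(article: Article, logged_in: bool) -> str:
    """Client-side approach: the full text is always sent to the device.

    An overlay is added on top for readers who are not logged in,
    much like an ad overlay, but the text is still in the response.
    """
    overlay = "" if logged_in else '<div class="paywall-overlay">Log in to read more</div>'
    return f"<article>{article.teaser} {article.body}</article>{overlay}"


def render_server_gated(article: Article, logged_in: bool) -> str:
    """Server-side (CDN-style) approach: the body is withheld until login."""
    if logged_in:
        return f"<article>{article.teaser} {article.body}</article>"
    return f"<article>{article.teaser}</article><p>Log in to read more</p>"


story = Article(teaser="Opening paragraph.", body="SUBSCRIBER-ONLY TEXT")

# What an anonymous visitor, human or bot, receives in each case.
js_response = render_js_paywall(story, logged_in=False)
gated_response = render_server_gated(story, logged_in=False)

print("SUBSCRIBER-ONLY TEXT" in js_response)     # body is in the page source
print("SUBSCRIBER-ONLY TEXT" in gated_response)  # body never left the server
```

A bot that reads the raw response and ignores the overlay gets the full text in the first case and only the teaser in the second.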

A CDN-based paywall is stronger against AI bots, but it remains unclear whether it can truly block them, according to two paywall management companies.

Paywall technology “could, in theory, block access to an AI-crawler… However, this would rely on AI organizations flagging their crawlers as such — such as using a consistent and known IP address [and] not altering it,” said Felix Danczak, senior director of subscriber at Zephr, a subscription platform owned by subscription technology provider Zuora.

Paywall platform Piano is developing a product called Edge Experience, which can lock content in a CDN. It’ll launch in beta with around five clients in the next month. [Editor’s note: Piano is a contracted vendor with Digiday.] The product would also be able to block generative AI crawling, “as long as the client is able to identify the user agent they want to block for that particular crawler,” said Michael Silberman, Piano’s svp of strategy.
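The kind of user-agent check Silberman describes might look something like the following Python sketch. It is an assumption-laden illustration, not Piano’s implementation: the token `exampleaibot`, the function names and the status codes chosen are all hypothetical. It also shows the limit Danczak raised, since the check only works when a crawler identifies itself consistently.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical list of user-agent tokens a publisher has decided to block.
BLOCKED_AGENT_TOKENS: Sequence[str] = ("exampleaibot",)


def is_blocked(user_agent: Optional[str],
               blocked: Sequence[str] = BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the self-reported user agent contains a blocked token."""
    if not user_agent:
        # No user agent at all: this sketch lets the request through.
        return False
    lowered = user_agent.lower()
    return any(token in lowered for token in blocked)


def status_for_request(headers: Mapping[str, str]) -> int:
    """Pick an HTTP status at the edge, before any article content is served."""
    return 403 if is_blocked(headers.get("User-Agent")) else 200


declared_bot = status_for_request({"User-Agent": "Mozilla/5.0 (compatible; ExampleAIBot/1.0)"})
browser = status_for_request({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
# The same crawler sending a browser-like string is not caught by this check.
disguised_bot = status_for_request({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"})

print(declared_bot, browser, disguised_bot)
```

Because the user agent is supplied by the requester, this filter depends on the same cooperation from AI companies that the sources above say is missing.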

Those interviewed for this story said there needs to be a unified approach from publishers against AI bot crawlers. One example would be signing deals with generative AI companies like OpenAI to allow them to license content, such as the one AP signed with OpenAI last month.

The best way to monitor AI crawlers is by analyzing bot traffic, said Matt Boggie, chief technology and product officer at The Philadelphia Inquirer. The Inquirer has a metered paywall, and a hard paywall on premium content. He declined to share if the Inquirer’s paywall is built on JavaScript or a CDN.

Because it’s difficult to track where bots are coming from, publishers like the Inquirer look for “a huge spike in requests from a small range of IPs or a single IP” as a red flag, Boggie said. “But it’s definitely a difficult thing to do in real time… Often, in the course of a day, those things go unnoticed,” he added.
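The red flag Boggie describes, many requests from one IP or a small range, can be approximated by counting requests per network block. The Python sketch below is a rough illustration, not the Inquirer’s tooling: the function names, the /24 grouping, the threshold and the sample addresses (taken from ranges reserved for documentation) are all assumptions, and it handles IPv4 addresses only.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable


def network_of(ip: str, prefix: int = 24) -> str:
    """Map an IPv4 address to its enclosing network, e.g. 203.0.113.7 -> 203.0.113.0/24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def find_spikes(request_ips: Iterable[str], threshold: int, prefix: int = 24) -> Dict[str, int]:
    """Return the networks whose request count in this batch meets the threshold."""
    counts = Counter(network_of(ip, prefix) for ip in request_ips)
    return {net: n for net, n in counts.items() if n >= threshold}


# Hypothetical batch of client IPs pulled from one window of access logs.
window = ["203.0.113.5"] * 3 + ["203.0.113.9"] * 2 + ["198.51.100.1"]

spikes = find_spikes(window, threshold=5)
print(spikes)
```

Running a count like this over logs after the fact is straightforward. Doing it in real time, with a threshold that does not also flag ordinary traffic, is the harder part Boggie points to.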

The Washington Post published a report in April showing the websites that were used to train AI chatbots. Boggie said the Inquirer’s URLs appeared in that dataset.
