
AI bots that scrape publishers’ sites for real-time information now hit those sites more often than the bots used to train large language models. And they’re harder to detect.
That’s according to the latest report from TollBit, a data marketplace for publishers and AI companies. From Q4 2024 to Q1 2025, per-site bot scrapes for retrieval-augmented generation, or RAG, grew 49%, more than 2.5 times the rate of training bot scrapes, which grew 18% over the same period.
An increase in bots scraping content from publishers’ sites represents a threat to their businesses. But scraping for AI training and scraping for real-time outputs present different challenges — and some opportunities — for publishers. And not all of them are fully understood.
Training scrapes are “one-and-done… to feed a model’s general knowledge,” said Josh Jaffe, AI and media consultant and former president of media at the publisher Ingenio.
RAG scrapes, on the other hand, are continuous. They run constantly to power responses to users’ questions in AI chatbots and search engines, he said. “It’s the difference between selling your archive once versus being part of an ongoing syndication feed. One is finite. The other has compounding value, assuming publishers can tap into it,” Jaffe said.
Here is a look at some of the misconceptions:
Myth: All AI bot scraping is the same
There are two main types of AI bots — RAG AI bots and training data bots.
RAG AI bots, or agents, retrieve factual, current information in real-time. They respond to user prompts in AI products like Perplexity and ChatGPT by searching the web. Responses include links or citations to the original sources, such as publishers’ sites. RAG can surface and summarize articles without storing them in training data, which makes the threat to traffic and monetization even more immediate and harder to regulate.
“Despite the high commercial value of RAG to AI developers, the vast majority of companies take the raw materials required to create summarised simulacrums without any form of remuneration, licensing arrangement, or traffic back to the source publisher website. This is contrary to the terms of service of many publishers, and is neither fair nor sustainable,” reads a report from the Financial Times, submitted last month to the House of Lords Communications and Digital Select Committee’s inquiry into media literacy, which also described publishers’ ability to prevent this process as “minimal.”
Training data bots, on the other hand, crawl the web for data to feed into LLMs, such as Meta’s Llama or OpenAI’s GPT. Those large datasets are then used to train the models how to “speak,” or generate responses.
And once the models have learned to “speak” and grow smarter, training bots hit publishers’ sites less frequently. RAG bots, on the other hand, need to keep crawling publishers’ sites to access up-to-date information, which is why those scrapes are happening more often.
AI companies have taken on the responsibility of defining these bots so publishers can differentiate them. For example, OpenAI has an agent called “ChatGPT-User,” its RAG AI bot, which scrapes the web for real-time information, while “GPTBot,” its training data bot, scrapes to train OpenAI’s LLMs.
But not all of them do so publicly.
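For publishers that want to treat the two differently, that distinction shows up in robots.txt. A minimal sketch using the user-agent tokens OpenAI documents publicly would block the training crawler while still letting the real-time agent in (though, as detailed below, not every bot honors the file):

    # Block OpenAI's training crawler from the whole site
    User-agent: GPTBot
    Disallow: /

    # Allow its real-time retrieval (RAG) agent to keep fetching pages
    User-agent: ChatGPT-User
    Allow: /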
Myth: RAG scraping is easy to detect
What makes things even more complicated is that smarter AI agents are emerging that mimic human behavior (and can even solve CAPTCHAs and bypass advanced cyber tools), according to an AI startup exec, who asked for anonymity to speak freely. This makes the agents increasingly difficult to detect, and without that visibility, publishers have a hard time knowing how many bots are scraping their sites, how often, and what the impact is on their businesses.
Also, search engines like Google and Bing don’t separate their RAG bots from the bots they use to index content for search results, which means publishers can’t “hide” from RAG bots without potentially also hiding themselves from search and the referral traffic that comes with it.
“This puts publishers in a difficult position as they would risk losing search rankings by restricting all bots — including search bots,” said Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice.
For example, Google’s “Google-Extended” token in robots.txt lets publishers block their content from being used to train and improve Google’s AI models. But Gemini, Google’s LLM, and AI Overviews, its AI search feature, do not rely on Google-Extended for real-time data retrieval, meaning publishers can’t block Google from crawling their sites for RAG without also blocking the crawlers behind Google’s general search product.
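In other words, robots.txt lets a publisher opt out of Google’s AI training without touching search, but the RAG side stays coupled to regular indexing. A minimal sketch:

    # Opt out of content being used to train Google's AI models
    User-agent: Google-Extended
    Disallow: /

    # Googlebot is deliberately left alone: blocking it would also pull the
    # site out of Google Search and the AI features that draw on its index.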
TollBit’s report detected 436 million AI bot scrapes (both RAG and training scrapes) in Q1 2025, up 46% from Q4 2024. “The harder you block bots, the harder they will work to evade detection,” said Olivia Joslin, co-founder of TollBit.
Myth: Monetizing training data is the only way publishers can make money
AI companies like OpenAI have signed large, lump-sum deals with publishers to allow them to ingest their content to train their LLMs. But it’s not the only way publishers can monetize AI bots crawling their sites.
Publishers could charge RAG AI bots for crawling their sites — either when they scrape for content, or when they’re cited in responses to users’ questions in AI products. Two digital publishing execs told Digiday this will be key to monetizing the increase in bot scraping.
“The revenue is not there yet as the LLM platforms are still in the early days of building their commercial models, but [I would] expect that to be an area of growth,” said one publishing exec, who traded anonymity for candor.
TollBit, for example, gives AI scrapers the option to pay a “toll” to access a publisher’s content. A web scraper or AI agent tries to go to a publisher’s webpage, gets redirected to TollBit’s platform and is then asked to pay a transaction fee to access that page. TollBit has struck deals with over 2,000 publishers, including Penske Media and Time. However, it’s unclear how much money publishers are actually making from TollBit’s marketplace.
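TollBit hasn’t published the technical details of that handshake, but the redirect-and-pay flow described above can be sketched generically. Everything in the snippet below, including the URLs, headers and fee handling, is a hypothetical illustration of a pay-per-crawl pattern, not TollBit’s actual API:

    import requests

    # Hypothetical pay-per-crawl sketch, loosely modeled on the redirect-and-pay
    # flow described above. The URL, header and fee logic are placeholders for
    # illustration only; they are not TollBit's real endpoints.

    ARTICLE_URL = "https://example-publisher.com/some-article"
    BOT_API_KEY = "demo-key"  # hypothetical credential identifying the paying bot

    def fetch_with_toll(url: str) -> str | None:
        # 1. The scraper requests the publisher page directly.
        resp = requests.get(url, allow_redirects=False)

        # 2. A tolled page answers with a redirect (or 402 Payment Required)
        #    pointing at a payment gateway instead of serving the article.
        if resp.status_code in (302, 402):
            gateway_url = resp.headers.get("Location")
            if not gateway_url:
                return None

            # 3. The bot accepts the per-page fee and receives licensed content.
            paid = requests.get(
                gateway_url,
                headers={"Authorization": f"Bearer {BOT_API_KEY}"},
            )
            return paid.text if paid.ok else None

        # Untolled pages are served as normal.
        return resp.text if resp.ok else None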
The IAB Tech Lab is also in the early stages of developing an API called LLM Content Ingest, a technical framework that could help control how publishers’ content is accessed and monetized by AI systems, though it will need buy-in from the AI companies to work.
Publishers are likely to shift toward monetizing RAG bot scraping rather than signing ever more licensing deals with LLM developers. Recent deals between AI companies and publishers seem to be moving away from sharing publishers’ content to train LLMs and toward feeding data to AI models in response to queries in AI search engines through a RAG system. (Arguably, many AI companies have already trained their LLMs on huge amounts of data available on the web.)
But, it’s not an easy process, according to Tchivzhel.
“Preventing and monetizing the scraping is very difficult for the average local publisher. Unless you have significant legal resources and scale, you are unlikely to generate meaningful ROI on monetizing the input into RAG models directly,” Tchivzhel said. “There is likely more ROI on monetizing the output from LLMs and RAG models and striking deals with intermediaries who have built attribution models and can prove a specific piece of content was used in an AI answer.”
Myth: Scrape-to-referral ratio is the same for all AI crawlers
Another key finding in TollBit’s report is that AI bots crawling publishers’ sites are scraping way more than they are referring traffic — meaning publishers are losing out on monetizing those audiences.
On average across TollBit’s partners’ sites, Bing returns one human visit for every 11 scrapes, a scrape-to-referral ratio of 11:1. The ratio for OpenAI is 179:1, Perplexity’s is 369:1 and Anthropic’s is 8,692:1, according to the report.
Overall across TollBit’s publisher network, AI apps drove 0.04% of total external referral traffic to sites from Q4 2024 to Q1 2025.
RAG scraping is also happening more often because adoption of AI tools is growing, the AI startup exec said. Data shows more people are using AI tools for search, for example. As AI companies invest in more search-focused tools, RAG is needed to keep responses up to date.
“We don’t see training bots hammering publishers’ sites thousands of times a day,” the AI company exec said.
Myth: Robots.txt protects publishers from AI bots
If publishers aren’t managing to monetize AI bot scraping, the alternative is to block those bots from accessing the content on their websites.
Robots.txt, a plain-text file that tells web crawlers which URLs they can and can’t access, is the most straightforward way to do this, requiring just a few lines of text. But it’s also the weakest tactic for blocking bot traffic.
The number of AI bots publishers attempted to block via robots.txt quadrupled between January 2024 and January 2025. But the percentage of AI bot scrapes that bypassed robots.txt surged from 3.3% in Q4 2024 to 12.9% by the end of Q1 2025. In March 2025 alone, over 26 million scrapes from AI bots bypassed robots.txt on sites using TollBit.
Recent updates to major AI companies’ terms of service state that their AI bots can act on behalf of user requests, which effectively means they can ignore robots.txt when used for RAG, according to the TollBit report.
Among websites with TollBit Analytics set up before January 2025, AI bot traffic volume nearly doubled in Q1, rising by 87%.
The FT report called this the era of “digital dumping.”
“AI developers flood the market for news and information with outputs that are created using generative AI models in response to natural language user prompts,” it read. “This probabilistic approach to the production of outputs is as far away from the process of producing high quality journalism as it is possible to be.”