Can AI analyses about AI content reveal anything about AI and copyright?

What happens when artificial intelligence analyzes human thoughts about AI and copyright?

As tech giants and startups alike move forward with AI models, the legal landscape around AI and copyright remains uncertain, both for current rules and for those still to come. Earlier this week, U.S. Rep. Adam Schiff introduced a new bill that would require AI companies to disclose AI training content including text, images, music and videos. Meanwhile, more authors, musicians and other creative professionals are also speaking out. Last week, 200 musicians, including Billie Eilish, Jason Isbell, Nicki Minaj and Bon Jovi, signed an open letter calling for companies to protect artists from “the predatory use of AI.”

The U.S. Patent and Trademark Office is also considering new rules related to AI and copyright, including publishing guidance in February and again this week. Along with looking at whether AI-assisted works can receive copyright protections, it’s also considering whether AI systems should be trained with already protected content. As part of the rule-making process, the USPTO received around 10,000 submitted comments from a range of stakeholders — including companies, AI experts, artists and organizations — that expressed a wide range of views about AI and intellectual property.

The massive trove of comments led to a broader question: What would an AI model notice by analyzing the whole collection of human comments about copyright? Can themes in the commentary help paint a picture of what various stakeholders want to see from the USPTO?

To better understand the sentiments, Digiday worked with the AI company IV.AI to explore the set of comments using natural language processing (NLP), a subset of AI that identifies patterns in words and phrases to extract meaning from text. Founded in 2016, IV.AI helps major brands make decisions using AI, find insights in unstructured data and deploy AI inside their businesses. Brands that have worked with the Los Angeles-based company include Netflix, Estée Lauder, Walmart, Uber and Capital One.

To frame its analysis, IV.AI looked at four key questions that the USPTO invited submissions to address: the training of AI with copyrighted materials, the copyrightability of AI-generated content, the liability for AI-created infringements, and the legal treatment of AI outputs that mimic human artists’ styles or identities.

While the comments taken en masse represent overall concerns about the creative rights of humans, they also reflect how companies, individuals and organizations think about ownership of content and data over time. Just as social media companies learned from the data users created, many AI companies are now doing the same by training their AI models on content posted to the various platforms.

Creators and companies both need to discuss and understand how data trains AI models, said IV.AI CEO and cofounder Vince Lynch, adding that society has already seen the negative impact of unchecked AI on social media and curated algorithms’ influence on society and culture.

“[Social platforms] all learn from all the data we create,” Lynch said. “And they just give us a space to [post] and then they profit from it. Now, people are taking that information, and then new AI companies are running with it … Everybody [just] keeps milking the general hoi polloi of humanity.”

AI appraises human sentiments

Numerous macro and micro themes emerged from the analysis. Many of the comments mentioned some form of fraud, with words like “theft,” “steal,” “infringement,” “plagiarism,” “threat” and “devalue.” Another theme IV.AI noticed was the myriad demands within comment submissions, which used words like “consent,” “compensation,” “permission,” “protection” and “incentive.”

Submissions also noted what’s at stake with the future of AI and copyright: What will the technology mean for human creativity, original creations and their creators?

To understand the sentiment in the submissions, IV.AI had its AI model look at the first 500 words in each submission and found that 74% of comments were identified as negative. The other 26% were identified as more positive — but mostly because commenters expressed hope that new regulations might be able to help address concerns about AI and copyright.
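For readers curious what scoring only the first 500 words of each submission might look like, here is a minimal Python sketch. It is not IV.AI's model, which the company did not detail. The word lists, the function names and the tie-breaking rule are all assumptions made for illustration, and a simple word-count lexicon stands in for a trained sentiment classifier.

```python
import re

# Hypothetical mini-lexicons for illustration only; not IV.AI's model.
NEGATIVE = {"theft", "steal", "infringement", "plagiarism", "threat", "devalue"}
POSITIVE = {"hope", "protect", "support", "benefit"}


def first_words(text, limit=500):
    """Return at most `limit` lowercase word tokens from the start of the text."""
    return re.findall(r"[a-z']+", text.lower())[:limit]


def label(text):
    """Label a comment by counting lexicon hits in its first 500 words.

    Ties (including no hits at all) fall back to 'positive', an arbitrary choice.
    """
    words = first_words(text)
    negative_hits = sum(word in NEGATIVE for word in words)
    positive_hits = sum(word in POSITIVE for word in words)
    return "negative" if negative_hits > positive_hits else "positive"


def negative_share(comments):
    """Fraction of comments labeled negative."""
    labels = [label(comment) for comment in comments]
    return labels.count("negative") / len(labels)
```

Calling `negative_share` on a list of comment strings returns a fraction between 0 and 1. A production system would swap the lexicon for a trained classifier, but the truncate-then-tally shape would stay the same.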

Many of the comments came from artists, writers and musicians who are worried about having their content scraped by AI models without consent or compensation. Voice actors expressed worry about losing their jobs to AI. Fan-fiction writers pointed out that they’re not allowed to make money from their work, but AI models might do the same thing and make money off of it. One of the more noteworthy findings: More than 400 submissions came from members of the Writers Guild of America, according to IV.AI, which also noted many WGA members seemed to copy and paste a statement based on a template provided by WGA.

IV.AI also identified key themes based on the most frequently used terms and contiguous words. By identifying patterns and relationships between words, the company was able to extract meaningful topics from the comments. For instance, the analysis revealed that the terms “infringement” and “copyright” frequently appeared together, indicating that copyright infringement was a significant topic in the responses. It also noticed clusters of related topics, such as the use of copyrighted content in training AI models, whether AI-generated content can be copyrighted, and issues related to legal liability with AI and copyright infringement.

Unsurprisingly, the most popular words identified included “AI,” “work,” and “copyright.” However, when looking at multi-word concepts, the most popular was “train AI model,” followed by other terms related to training AI, copyright and content. The concept “without permission” came up nearly 900 times while “theft” showed up nearly 1,300 times and “replace human creativity” appeared nearly 500 times.
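Counting contiguous words and terms that appear together can be sketched in a few lines of Python. This is an illustrative approximation rather than IV.AI's pipeline: the function names are hypothetical, and the sketch skips lemmatization, so it would count “train AI models” and “training AI model” as different phrases.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(comments, n):
    """Count every run of n contiguous tokens across all comments."""
    counts = Counter()
    for comment in comments:
        tokens = tokenize(comment)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


def cooccurrence(comments, term_a, term_b):
    """Count the comments in which both terms appear at least once."""
    return sum(1 for c in comments if {term_a, term_b} <= set(tokenize(c)))
```

For example, `ngram_counts(comments, 2).most_common(10)` would list the ten most frequent two-word phrases, and `cooccurrence(comments, "copyright", "infringement")` would report how many comments use both words.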

“We need to clean up how AI learns and the impact it makes,” Lynch said. “We have ways of doing this ethically as we deploy AI for companies and governments. It’s important that these tried and tested best practices are included in all AI engagements.”

When looking at which companies were most mentioned, Google came out on top with 183 mentions, followed by Disney (138), Adobe (95), Amazon (95), YouTube (73), Microsoft (42), Netflix (31) and Instagram (30). The most mentioned platform was ChatGPT, which was mentioned 319 times. Others with the most mentions included Midjourney (204), Stable Diffusion (136), Photoshop (94), DALL-E (57), DeviantArt (48), Stability AI (44) and Glaze (39). Platforms working to protect artists from AI also got dozens of mentions, including Glaze and Nightshade, which got 39 and 26 mentions, respectively.

The submissions also came from hundreds of companies ranging from tech giants to startups and content companies. Some examples include Qualcomm, Meta, Yelp, Adobe, Microsoft, OpenAI, Cohere, Getty Images, Shutterstock, The New York Times and National Public Radio. Others came from The Recording Academy, Motion Picture Association and various publishing houses. Brands like The Knot, the NFL and Duolingo also submitted. Many of these top names attached separate files to their submissions that weren’t included in the main data set IV.AI analyzed.

AI v. AI — analyses on other topics

Another thing IV.AI analyzed was lawsuits related to AI and copyright against companies like OpenAI and others. Using NLP to analyze several initial complaints — including those filed by The New York Times, Getty Images, publishers and groups of authors — it identified frequent terms and phrases to understand key themes, such as “copyright infringement.” IV.AI also observed how certain terms, like “Getty Images” and “Microsoft,” varied in frequency depending on the context of the documents. The analysis helped pinpoint common topics and the significance of various terms within the legal discussions about AI technologies, providing insights into areas of concern or interest in these proceedings.

Other AI companies are also using their own AI models to identify AI-generated content and to track which publishers are attempting to block AI crawlers from scanning their content without permission. Another startup, Originality.AI, created a dashboard to track how many of the top websites have blocked AI web crawlers from various AI companies. Of the top 1,000 websites by traffic volume, 34% had blocked OpenAI’s GPTBot, 19% had blocked Google’s Google-Extended, 11% had blocked the nonprofit Common Crawl’s CCBot, and just 5% had blocked Anthropic’s crawler.

It’s also worth noting which websites blocked or allowed various crawlers. For example, YouTube allows all four, but Facebook and Instagram block OpenAI’s and Google’s. Meanwhile, Amazon blocks OpenAI’s and Common Crawl’s, but allows Anthropic’s and Google’s.
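Sites typically signal these choices in a robots.txt file, and checking one for AI crawler rules is straightforward with Python’s standard library. The sketch below is an assumption about how such a check could work, not Originality.AI’s actual method. The sample file is invented, the helper name `blocked_crawlers` is hypothetical, and the `anthropic-ai` token is assumed to be the one Anthropic’s crawler identifies with.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration; not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# User-agent tokens for the four crawlers discussed; "anthropic-ai" is an assumption.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai"]


def blocked_crawlers(robots_txt, crawlers=AI_CRAWLERS, path="/"):
    """Return the crawlers that the given robots.txt text bars from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in crawlers if not parser.can_fetch(agent, path)]
```

Running `blocked_crawlers` over the robots.txt files of a list of sites and tallying the results per crawler would yield percentages of the kind the dashboard reports. Note that robots.txt is a request, not an enforcement mechanism, so a block only works if the crawler honors it.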

“Google-Extended is really interesting,” said Originality.AI founder and CEO Jon Gilham. “Why is it three times less likely to be blocked than GPTBot? Is Google using its potential monopolistic power in search to get an unfair advantage in an emerging field of AI?”

Another AI startup, Patronus AI, built a tool called Copyright Catcher to detect how likely various LLMs are to produce copyrighted content. Last month, the startup’s initial results found OpenAI’s GPT-4 produced copyrighted content in 44% of prompts, Mistral AI produced it in 22%, Llama 2 in 10% and Anthropic in just 8%. According to Patronus co-founder Anand Kannappan, accidentally outputting copyrighted content still puts a brand or a company’s reputation at risk.

“A lot of companies still feel really uncomfortable because they don’t know where the liability actually is and who’s at risk or who’s responsible for the risk,” Kannappan said. “…If you’re a user of a foundation model and you ultimately accidentally output copyrighted content, that still puts the brand at risk, or the reputation of the company at risk. And so even if it’s not a legal issue, there’s just other kinds of issues that you know most companies just don’t want to be involved [in].”

https://digiday.com/?p=540860
