The case for and against open-source large language models for use in newsrooms

As publishers develop generative AI tools for their newsrooms, they have two primary choices when deciding what to build upon: open-source or private large language models (LLMs).

Chatbots like OpenAI’s ChatGPT and Google’s Bard are built using private, proprietary LLMs: systems that learn to generate text by training on large amounts of data.

On the other hand, open source is computer code that can be freely used and modified by anyone on the internet. Open-source LLMs allow publishers to download that code and fine-tune the models for specific tasks using their own data, as well as see what the model was initially trained on and examine the model for any potential limitations and biases.

Last month Meta shared the code that powers Llama 2, its LLM, allowing users, including publishers, to use the model for free to build their own customized chatbots, for example.

“Open source drives innovation because it enables many more developers to build with new technology,” Meta CEO Mark Zuckerberg wrote in a Facebook post. “It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.”

Although they are free, not all publishers are convinced open-source LLMs are the models they should be using to build generative AI tools for their newsrooms.

Here are the cases for and against open-source LLMs.

The case for open-source LLMs

Open-source models like Llama 2 are free, as opposed to private, proprietary LLMs, such as GPT or Bard, which charge companies based on usage of the model.

Publishers can experiment with building generative AI tools and products on top of open-source models and bypass the initial cost of paying to use a private model, explained Francesco Marconi, a computational journalist and co-founder of real-time information company AppliedXL. 

Marconi said AppliedXL chose to go the open-source route to build a language model for journalists, called AXL-1, due to cost considerations and for the transparency that open-source models provide, as well as the ability to connect the models to real-time data. The company fine-tuned open-source models including Llama 2 and Falcon (another open-source LLM) through Amazon’s cloud services provider, Amazon Web Services (AWS). AppliedXL’s model was used to build a new tool with STAT, The Boston Globe’s news site focused on health, medicine and science, to analyze, identify and summarize real-time clinical trial data. The tool sifts through data and, based on parameters set by journalists, determines whether a clinical trial update is noteworthy, then generates a news report on that finding.

“The idea is that the open innovation approach invites more scrutiny, potentially addressing issues related to bias and transparency, because there are many people collaborating and improving the model,” Marconi said.

Publishers can also run open-source LLMs internally, meaning they aren’t sharing their proprietary data with a large tech company or training that company’s LLMs on their own content.

Open-source models “allow organizations to avoid sending sensitive data to external systems, enhancing security and privacy. It also stops them from training and improving the models of large technology companies,” said Felix Simon, a communication researcher at the Oxford Internet Institute, who studies the implications of AI for journalism.

Few publishers can build their own LLMs, due to the cost and the need to recruit dedicated data scientists and engineers to create and maintain those models. Bloomberg is one of the few that took that route, training its BloombergGPT model on Bloomberg’s financial data. But LLMs like the ones behind ChatGPT can cost millions of dollars to create and run.

OpenAI must be feeling the competition as more free open-source LLMs are becoming available to the public. The Information reported in May that OpenAI is preparing to release an open-source model – notably, OpenAI’s first two versions of its GPT models were open source.

The case against open-source LLMs

While there’s a lot of innovation happening around open-source models, the case for a publisher hosting and maintaining tools built on one by itself (with the human and computing power required to run them) has yet to be proven, according to David Caswell, founder of StoryFlow, which helps publishers apply LLMs and generative AI technology in their organizations. That is especially true considering recent improvements to OpenAI’s GPT-4 model that have advanced its capabilities, he said.

“So far I’ve seen nothing that remotely approaches [OpenAI’s LLM] GPT-4 in terms of general capabilities,” Caswell told Digiday.

Renn Turiano, SVP and head of product for Gannett, said the lack of service agreements between a company offering an open-source LLM and a publisher means the publisher is on its own when it comes to developing generative AI tools and products from those models.

“There may be some similarities here to the early days of cloud computing, when lots of big news [organizations] wanted their own cloud… but ended up using AWS for sheer practicality,” Caswell said.

Instead of building tools on top of open-source models, publishers are increasingly taking the route of signing deals with generative AI companies to access their proprietary models — although they need to negotiate terms to have some control over the data they are sharing.

OpenAI struck a licensing partnership with the Associated Press last month, giving OpenAI access to the AP’s data to train its models, while the AP uses OpenAI’s tech and resources to develop AI tools. An AP spokesperson said OpenAI is paying to license part of the AP’s text archive, but declined to share financial terms of the deal.

Gannett, meanwhile, has a partnership with a proprietary LLM company called Cohere. Signing direct deals with companies that offer proprietary LLMs means publishers like Gannett can “keep control of our own training data,” Turiano said. Gannett retains ownership and control of its content as part of the licensing deal with Cohere, a Gannett spokesperson said.

Not all open-source LLMs are necessarily open source

Some of the publishing executives and researchers that Digiday interviewed for this story argued that Meta’s Llama 2 isn’t truly an open-source model in the first place, because Meta has control over the conditions around the usage of the model in its acceptable use policy. Meta also hasn’t released the training data used to teach the Llama 2 model, which is key to helping newsrooms spot any bias in the AI systems. (The Washington Post published a report in April showing the websites that were used to train AI chatbots, including Meta’s Llama.)

“Big tech still benefits when the open-source community and outside engineers improve its models, which Meta can then use to improve its own tools. This way, Meta doesn’t have to put all of its resources into catching up to OpenAI and Google,” said Oxford Internet Institute’s Simon.

This story was updated to reflect that the AXL-1 model was built by AppliedXL and used to develop the STAT tool.
