In 2023, a major clash between artificial intelligence and journalism emerged as many leading news websites across 10 countries decided to block the crawlers of OpenAI and Google, two of the most prominent AI companies in the world. The move reflects growing concern over the use and regulation of generative AI tools, which rely on web content to train their models and generate new text.
What are AI crawlers, and why do they matter?
AI crawlers are programs that scan and collect data from websites for various purposes. For example, Google’s Googlebot indexes web pages for its search engine, while OpenAI’s GPTBot gathers data from the internet to train the large language models (LLMs) behind products such as ChatGPT. These LLMs can generate natural-language text on a wide range of topics, including news articles, summaries, reviews, and more.
AI crawlers can offer many benefits for users and businesses, such as improving search quality, providing relevant information, and creating new content. However, they also pose significant risks and challenges for original content creators, especially news publishers, who invest time, money, and resources to produce high-quality and reliable journalism. Some of these challenges include:
- Copyright infringement: AI crawlers may violate the intellectual property rights of news publishers by copying, reproducing, or modifying their content without permission or compensation.
- Competition and cannibalization: AI tools trained on crawled content may become competing or substitute products that reduce the demand for and revenue of news publishers. For example, ChatGPT can generate summaries or headlines that discourage users from reading the full articles on the news websites.
- Quality and credibility: those same tools may generate inaccurate, misleading, or harmful content that damages the reputation of and trust in news publishers. For example, ChatGPT may produce false or biased information that contradicts or undermines the reporting published by the news websites.
How did news publishers react to AI crawlers?
In the absence of clear and consistent regulatory frameworks governing generative AI tools, many news publishers took matters into their own hands to protect their content, data, and revenues from AI crawlers. According to a study by the Reuters Institute, nearly half (48%) of the top news websites, measured by reach, across 10 countries had blocked OpenAI’s crawler by the end of 2023, while nearly a quarter (24%) had blocked Google’s AI crawler. The study analyzed the robots.txt files of the 15 online news sources with the widest reach in each country, including titles such as The New York Times, BuzzFeed News, The Wall Street Journal, The Washington Post, CNN, and NPR, across countries such as Germany, India, Spain, the U.K., and the U.S.
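For context, robots.txt is a plain-text file placed at a site's root that tells crawlers, identified by their user-agent tokens, which paths they may fetch. A blocking policy of the kind the study counted might look like the following sketch; the directives are illustrative, using OpenAI's published GPTBot token and Google-Extended, the token Google introduced for opting out of AI training, while leaving ordinary search indexing untouched.

```
# Illustrative robots.txt entries: opt out of AI training crawls
# while still allowing conventional search indexing.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```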
The study grouped the outlets into three categories: legacy print publications, television and radio broadcasters, and digital-born outlets. It found that print publications were the most likely to block AI crawlers, followed by broadcasters and digital outlets. The study also revealed differences between news outlets in the Global North and Global South, with the former being more inclined to block AI crawlers than the latter. For example, in the U.S., 79% of the top online news websites blocked OpenAI, while only 20% did so in Mexico and Poland.
The study suggested that the blocking of AI crawlers was driven by various factors, such as the type and quality of content, the business model and revenue streams, the competitive environment and market position, the legal and ethical standards, and the technological capabilities and resources of the news publishers. The study also noted that once a news website decided to block an AI crawler, it did not reverse its decision, indicating a strong and persistent stance against the use of generative AI tools.
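Checks of this kind are straightforward to reproduce. Below is a minimal sketch, using Python's standard-library urllib.robotparser, of how one might test whether a site's robots.txt disallows the GPTBot and Google-Extended tokens; the site list is a placeholder, and this is only an approximation of the basic idea, not the study's actual methodology.

```python
from urllib import robotparser

# Placeholder list of sites to check; the Reuters Institute study looked at
# the highest-reach news outlets in each of 10 countries.
SITES = ["https://www.example-news-site.com"]

# User-agent tokens published by the crawler operators.
AI_AGENTS = ["GPTBot", "Google-Extended"]

for site in SITES:
    parser = robotparser.RobotFileParser()
    parser.set_url(site.rstrip("/") + "/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt
    for agent in AI_AGENTS:
        # can_fetch() applies the parsed rules for the given user agent.
        verdict = "allows" if parser.can_fetch(agent, site + "/") else "blocks"
        print(f"{site} {verdict} {agent}")
```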
What are the implications and prospects of the blocking of AI crawlers?
The blocking of AI crawlers by news publishers has significant consequences for both the AI industry and the journalism sector. On one hand, it may limit the availability of web content for AI tools, which may affect their performance, quality, and diversity. On the other hand, it may also create opportunities and incentives for collaboration and innovation between the two fields, leading to better products, services, and solutions for users and society.
Some of the possible scenarios and outcomes of the blocking of AI crawlers are:
- Legal and regulatory interventions: The blocking of AI crawlers may prompt legal and regulatory actions from the AI companies, the news publishers, or the governments to resolve the disputes and establish the rules and norms for the use of generative AI tools. For example, AI companies may challenge the blocking of their crawlers as a violation of fair use or antitrust laws, while news publishers may demand compensation or licensing fees for the use of their content. Alternatively, governments may intervene and create laws and policies that balance the interests and rights of both parties as well as the public good.
- Technical and strategic adaptations: The blocking of AI crawlers may encourage technical and strategic adaptations from the AI companies, the news publishers, or both to overcome the challenges and leverage the opportunities of generative AI tools. For example, AI companies may develop new methods or sources to collect and process data, such as user-generated content, social media platforms, or alternative websites. Meanwhile, news publishers may adopt or create their own AI tools to enhance their content production, distribution, and monetization, such as using automated summarization, personalization, or recommendation systems.
- Collaboration and partnership: The blocking of AI crawlers may foster collaboration and partnership between the AI companies and the news publishers, creating mutual benefits and value for both sides, as well as for the users and society. For example, AI companies and news publishers may agree on terms and conditions for the use of generative AI tools, such as sharing data, revenue, or credit, or setting quality and ethical standards. Alternatively, they may jointly develop and launch new products or services that combine the strengths and advantages of both fields, such as offering customized news feeds, interactive chatbots, or immersive storytelling.
The blocking of AI crawlers by news publishers is a remarkable phenomenon that reflects the complex and dynamic relationship between artificial intelligence and journalism. It also raises important questions about the future of web content, data, and information in the age of generative AI tools. How will AI companies and news publishers resolve their conflicts and cooperate? How will users and society benefit or suffer from the use of generative AI tools? How will web content, data, and information evolve in the coming years? These are questions that all the stakeholders in this emerging field will need to address.