The Battle Against Web Crawling Bots: A Closer Look

An analysis by data journalist Ben Welsh turned up some eye-opening statistics about how news websites handle web crawling bots. Just over a quarter of the news websites he surveyed block Applebot-Extended, compared with 53 percent that block OpenAI’s bot and, despite its more recent introduction, nearly 43 percent that block Google-Extended. Applebot-Extended may still be flying under the radar, but Welsh notes that the number of sites blocking it has been gradually increasing over time.
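
In practice, "blocking" these crawlers usually means listing them in a site's robots.txt file, which well-behaved bots check before crawling. The entries below are a minimal, generic illustration rather than any particular publisher's file; the user-agent tokens are the bots' documented names (OpenAI's crawler identifies itself as GPTBot).

```
# Illustrative robots.txt entries disallowing AI data-collection crawlers
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Compliance is voluntary: robots.txt is a convention rather than an enforcement mechanism, so these entries only deter crawlers that choose to honor them.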

Welsh’s ongoing project monitoring how news outlets treat major AI agents has revealed a noticeable divide among publishers: some block these bots, while others leave them alone. The reasoning behind each news organization’s decision is rarely made public, but deals and partnerships appear to play a major role. Last year, The New York Times reported on Apple’s attempts to strike AI deals with publishers, and competitors like OpenAI and Perplexity have since followed suit. These strategic partnerships hint at a broader business strategy, possibly one of withholding data until an agreement is reached.

Partnering for Access

Evidence that partnerships are a driving force can be seen in the recent actions of Condé Nast and Buzzfeed. Condé Nast previously blocked OpenAI’s web crawlers, but unblocked them after the two companies announced a partnership. A Buzzfeed spokesperson, meanwhile, said the company blocks AI web-crawling bots unless a paid partnership agreement is in place. Maintaining a comprehensive block list is getting harder as new AI agents keep appearing, which has led to services like Dark Visitors that automate updates to a site’s robots.txt file. Major publishers are increasingly turning to such tools, driven by copyright concerns and the need to keep up with a fast-changing crawler landscape.
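
As a rough illustration of what that kind of automation involves (a hypothetical sketch, not the Dark Visitors service or its API), a script might maintain a list of known AI crawler user agents and regenerate the corresponding section of robots.txt whenever the list changes:

```python
from pathlib import Path

# Assumed list of AI crawler user agents to disallow; a real service
# would keep this list current as new agents appear.
AI_AGENTS = [
    "GPTBot",
    "Google-Extended",
    "Applebot-Extended",
]

ROBOTS_PATH = Path("robots.txt")
MARKER = "# --- AI crawler block (auto-generated) ---"


def build_block(agents):
    """Return a robots.txt stanza disallowing each listed user agent."""
    lines = [MARKER]
    for agent in agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines).rstrip() + "\n"


def update_robots(path, agents):
    """Replace (or append) the auto-generated block in robots.txt."""
    existing = path.read_text() if path.exists() else ""
    # Drop any previously generated block before writing a fresh one,
    # so hand-written rules above the marker are preserved.
    kept = existing.split(MARKER)[0].rstrip()
    new_text = (kept + "\n\n" if kept else "") + build_block(agents)
    path.write_text(new_text)


if __name__ == "__main__":
    update_robots(ROBOTS_PATH, AI_AGENTS)
    print(ROBOTS_PATH.read_text())
```

Keeping the generated stanza behind a marker lets the script replace its own output on each run without disturbing any hand-written rules elsewhere in the file.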

In the age of AI, robots.txt files have taken on new significance for digital publishers and become a matter for media executives. Some CEOs of major media companies are reportedly directly involved in deciding which bots to block. Certain outlets have made clear that they block AI scraping tools because they have no commercial agreement with the bots’ owners. Vox Media, for example, blocks Applebot-Extended across all its properties, along with various other AI scraping tools, in the absence of a commercial partnership. That stance underscores how central partnerships have become in determining which web crawling bots get access.

The battle against web crawling bots continues to unfold in the digital sphere, with news outlets facing strategic decisions regarding blocking or allowing access to AI agents. Partnerships, business strategies, and the manual editing of robots.txt files play crucial roles in shaping the digital landscape for publishers. As the influence of AI grows, media executives will be at the forefront of these decisions, determining the future relationship between news outlets and web crawling bots.
