The New York Times Makes Its Content Off-Limits for Training AI Models

By Anshul Panda
August 17, 2023 00:13 +08

The New York Times has taken steps to prevent its content from being used to train artificial intelligence models. As Adweek reported, the NYT updated its Terms of Service on August 3 to prohibit the use of its content, including text, images, photographs, audio/video clips, metadata, "look and feel," and compilations, in the development of "any software program, including but not limited to training machine learning or artificial intelligence (AI) systems."

The updated terms also state that automated tools such as website crawlers designed to use, access, or collect such content cannot be used without written permission from the publication. The NYT says that failure to comply with the new restrictions may result in unspecified fines or penalties. Despite the policy change, the publication does not appear to have modified its robots.txt file, which tells search engine crawlers which URLs they may access.

The move may be a response to Google's recent privacy policy update, which disclosed that the company may collect public data from the web to train its AI services, such as Bard or Cloud AI. Many large language models behind popular AI services, such as OpenAI's ChatGPT, are trained on vast datasets that may include copyrighted or otherwise protected material scraped from the web without the original creator's permission.

The New York Times, however, signed a $100 million deal with Google in February that allows the search giant to feature Times content on some of its platforms over the next three years. The publication said the two companies would work together on tools for content distribution, subscriptions, marketing, advertising, and "experimentation." Given that arrangement, the changes to the NYT's terms of service may be aimed at other companies, such as OpenAI or Microsoft.

According to a recent Semafor report, The New York Times has withdrawn from a media coalition that is negotiating collectively with tech companies over AI training data. This suggests that the NYT would strike any deals with individual companies on a case-by-case basis.

OpenAI recently began allowing website operators to block its GPTBot web crawler from scraping their sites. Microsoft has also updated its terms and conditions with new restrictions that bar users from using its AI products to "develop, train, or enhance (directly or indirectly) any other AI service." The updated terms also prohibit users from extracting data from its AI tools.
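
For readers curious how such a block works in practice, here is a minimal sketch, not any publisher's actual configuration. It assumes a hypothetical robots.txt that disallows the GPTBot user agent site-wide while leaving other crawlers unrestricted, and uses Python's standard-library robots.txt parser to check what each crawler would be permitted to fetch. The example.com URL and the is_allowed helper are illustrative names only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration):
# block GPTBot everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Under these assumed rules, the check reports that GPTBot is denied and Googlebot is permitted. Note that robots.txt is a voluntary convention: it only affects crawlers that choose to honor it.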

Earlier this month, several news organizations, including The Associated Press and the European Publishers' Council, published an open letter urging lawmakers worldwide to adopt rules requiring transparency about training datasets and consent from rights holders before their data is used for training.
