Don’t use our content to train AI systems

By Chris Barnhart On Aug 10, 2023

Although Google wants all online content available for AI training, the New York Times clearly wants to opt out.

The Times has made numerous changes to its terms of service – all aimed at preventing AI companies from using the media organization’s content to train their systems.

Why we care. Many large language models are trained using website content (see: Search the 15.7 million websites in Google’s C4 dataset). While Google is exploring alternative or supplemental ways of controlling crawling and indexing beyond robots.txt, many brands (e.g., Reddit) are making it clear right now that they don’t want their content used to improve the products and increase the profits of Google, Microsoft and OpenAI – at least not without compensation. You may want to consider adding similar AI-related messaging to your website’s terms page.
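For readers who want to see how the robots.txt control mentioned above behaves in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The crawler name `ExampleAIBot`, the rules and the example.com URL are hypothetical and used only for illustration; they are not taken from the Times or from any real crawler. Keep in mind that robots.txt is advisory, so it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is a made-up crawler name
# (an assumption for this sketch), blocked site-wide; all others are allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/news/story.html"  # placeholder URL
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

A check like this can help confirm that a rule does what you intended before you publish it, but it does not replace the kind of terms-of-service language discussed here.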

What has changed. The New York Times updated its terms of service page Aug. 3. It includes AI-specific additions that apply to its content (which it defines as “including, but not limited to text, photographs, images, illustrations, designs, audio clips, video clips, ‘look and feel,’ metadata, data, or compilations”).

In the “Prohibited use of the services” section:

(3) use the Content for the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.

Will AI companies compensate publishers? OpenAI and the Associated Press signed a deal last month. OpenAI licensed the AP’s news article archive dating back to 1985 for training.

Google and the New York Times Co. already have a lucrative “commercial agreement” in place, but that deal is about working together on “tools for content distribution and subscriptions.”

Microsoft is also promising publishers some sort of revenue sharing. However, most of the benefits will apparently go to members of its Start program.