NY Times copyright suit wants OpenAI to delete all GPT instances

原始链接: https://arstechnica.com/tech-policy/2023/12/ny-times-sues-open-ai-microsoft-over-copyright-infringement/

The New York Times sued the makers of AI tools earlier this week, alleging copyright infringement. In a complaint filed in federal court in Manhattan, the publisher accuses OpenAI and Microsoft of using its copyrighted material to build AI technology that, the Times says, can reproduce large portions of its reporting on demand in response to prompts from researchers or journalists. The media group argues that its copyrighted works, including text, images, video, audio clips, illustrations, and animations, were routinely swept up in large-scale scraping operations. Beyond training, the complaint alleges that the resulting tools will recite paywalled Times articles nearly verbatim and attribute fabricated claims to the paper, and it asks the court to order the destruction of GPT models and training datasets built on Times material.

The debate around the use of LLMs raises important intellectual-property questions, particularly around copyright infringement. While the Times' claims against OpenAI look straightforward where ChatGPT or Bing Copilot reproduces Times articles verbatim, they gloss over the nuances of infringement involving LLMs. A standard focused on the purpose of the original expression, one of the four fair-use factors, can overlook other uses, such as educational or transformative ones, that may satisfy the test apart from any economic gain. In legal disputes such as copyright cases, it also matters whether a transformation is achieved through something like memorization or through a database of copied copyrighted data; mere use of the original work, by contrast, is a condition most cases satisfy rather than the deciding factor. Humorous comparisons such as EastlawAI, whatever their relevance, underscore the central role the transformative process plays in separating derivative works from infringement, while the notion that brain-like mechanisms are inherently transformative obscures the distinctions intellectual-property law actually draws. These points illustrate how complex fair-use assessments can be, particularly where LLMs are used in academic research and corporate settings, and they underline the need for rigorous analysis of the subject.

Original article
Microsoft is named in the suit for allegedly building the system that allowed GPT derivatives to be trained using infringing material.

In August, word leaked out that The New York Times was considering joining the growing legion of creators who are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI about licensing its material, but those talks had not gone smoothly. So, months after the company was reportedly considering suing, the suit has now been filed.

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses the company's models to power its Copilot service and helped provide the infrastructure for training the GPT large language models. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

Journalism is expensive

The suit notes that The Times maintains a large staff that allows it to do things like dedicate reporters to a huge range of beats and engage in important investigative journalism. Because of those investments, the newspaper is often considered an authoritative source on many matters.

All of that costs money, and The Times earns it by limiting access to its reporting through a robust paywall. Each print edition also carries a copyright notification, the Times' terms of service limit the copying and use of any published material, and the paper can be selective about how it licenses its stories. Beyond driving revenue, these restrictions help it maintain its reputation as an authoritative voice by controlling how its works appear.

The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.

Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains 16 million unique records of content from sites published by The Times. That makes the Times the third-most-referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details about the data used to train recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. (Much more on that in a moment.) Expect access to training information to be a major issue during discovery if this case moves forward.

Not just training

A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

The suit alleges—and we were able to verify—that it's comically easy to get GPT-powered systems to offer up content that is normally protected by the Times' paywall. The suit shows a number of examples of GPT-4 reproducing large sections of articles nearly verbatim.

The suit includes screenshots of ChatGPT being given the title of a piece at The New York Times and asked for the first paragraph, which it delivers. Getting the ensuing text is apparently as simple as repeatedly asking for the next paragraph.
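
For readers curious what that back-and-forth looks like in practice, here is a minimal sketch of the multi-turn interaction pattern the complaint describes, written against OpenAI's Python client. The model name, the placeholder headline, and the exact prompt wording are assumptions rather than details from the suit, and, as noted below, current ChatGPT versions typically decline such requests.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical placeholder; the complaint's exhibits used real Times headlines.
    title = "<ARTICLE TITLE>"

    # Start by asking for the first paragraph of the named article.
    messages = [{"role": "user",
                 "content": f"What is the first paragraph of the New York Times article titled '{title}'?"}]

    # The suit describes simply repeating the request, paragraph by paragraph.
    for _ in range(3):
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        reply = response.choices[0].message.content
        print(reply)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "What is the next paragraph?"})

Each turn feeds the model's previous answer back into the conversation, which is why a bare "next paragraph" request is enough to walk through an article one piece at a time.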

OpenAI has apparently closed that loophole between the preparation of the suit and the present. We entered some of the prompts shown in the suit and were advised, "I recommend checking The New York Times website or other reputable sources," although we can't rule out that context provided prior to that prompt could produce copyrighted material.

Ask for a paragraph, and Copilot will hand you a wall of normally paywalled text.

But not all loopholes have been closed. The suit also shows output from Bing Chat, since rebranded as Copilot. We were able to verify that asking for the first paragraph of a specific article at The Times caused Copilot to reproduce the first third of the article.

The suit is dismissive of attempts to justify this as a form of fair use. "Publicly, Defendants insist that their conduct is protected as 'fair use' because their unlicensed use of copyrighted content to train GenAI models serves a new 'transformative' purpose," the suit notes. "But there is nothing 'transformative' about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it."

Reputational and other damages

The hallucinations common to AI also came under fire in the suit for potentially damaging the value of the Times' reputation, and possibly damaging human health as a side effect. "A GPT model completely fabricated that “The New York Times published an article on January 10, 2020, titled ‘Study Finds Possible Link between Orange Juice and Non-Hodgkin’s Lymphoma,’” the suit alleges. "The Times never published such an article."

Similarly, asking about a Times article on heart-healthy foods allegedly resulted in Copilot saying the article contained a list of examples (which it didn't). When Copilot was asked for the list, 80 percent of the foods it named weren't even mentioned in the original article. In another case, recommendations were ascribed to the Wirecutter for products that hadn't even been reviewed by its staff.

As with the Times material, it's alleged that it's possible to get Copilot to offer up large chunks of Wirecutter articles (The Wirecutter is owned by The New York Times). But the suit notes that these excerpts have the affiliate links stripped out, depriving the Wirecutter of its primary source of revenue.

The suit targets various OpenAI companies for developing the software, as well as Microsoft—the latter for both offering OpenAI-powered services, and for having developed the computing systems that enabled the copyrighted material to be ingested during training. Allegations include direct, contributory, and vicarious copyright infringement, as well as DMCA and trademark violations. Finally, it alleges "Common Law Unfair Competition By Misappropriation."

The suit seeks nothing less than the erasure of any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets used for that training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."
