The FSF considers large language models

Original link: https://lwn.net/Articles/1040888/

## The FSF's response to LLM-generated code

The Free Software Foundation (FSF) is actively investigating the implications of large language models (LLMs) for free-software licensing, but has no definitive answers yet. At the recent GNU Tools Cauldron, the FSF acknowledged the complex challenges posed by LLM-generated code, focusing on copyright, licensing, and potential infringement.

There are currently no plans to revise the GNU General Public License (GPL); instead, the FSF is reassessing the Free Software Definition itself. Chief concerns include the non-free nature of most LLMs and their training data, and whether LLM-generated code is copyrightable at all. The FSF is surveying projects to learn their current positions on accepting LLM-generated code.

To reduce risk, the FSF suggests that projects accepting LLM-generated code document its provenance in detail, including the model and version used, the training data (if known), and the prompts employed. Such code should be clearly marked, and any use restrictions recorded. The discussion highlighted the difficulty of tracing code origins and the potential for inadvertent copyright infringement through training-data leakage. The FSF stressed that humans remain responsible for the code they submit, even with LLM assistance.


Original article
By Jonathan Corbet
October 14, 2025
Cauldron
The Free Software Foundation's Licensing and Compliance Lab concerns itself with many aspects of software licensing, Krzysztof Siewicz said at the beginning of his 2025 GNU Tools Cauldron session. These include supporting projects that are facing licensing challenges, collecting copyright assignments, and addressing GPL violations. In this session, though, there was really only one topic that the audience wanted to know about: the interaction between free-software licensing and large language models (LLMs).

Anybody hoping to exit the session with clear answers about the status of LLM-created code was bound to be disappointed; the FSF, too, is trying to figure out what this landscape looks like. The organization is currently running a survey of free-software projects with the intent of gathering information about what position those projects are taking with regard to LLM-authored code. From that information (and more), the FSF eventually hopes to come up with guidance of its own.

Nick Clifton asked whether the FSF is working on a new version of the GNU General Public License — a GPLv4 — that takes LLM-generated code into account. No license changes are under consideration now, Siewicz answered; instead, the FSF is considering adjustments to the Free Software Definition first.

Siewicz continued that LLM-generated code is problematic from a free-software point of view because, among other reasons, the models themselves are usually non-free, as is the software used to train them. Clifton asked why the training code mattered; Siewicz said that at this point he was just highlighting the concern that some feel. There are people who want to avoid proprietary software even when it is being run by others.


Siewicz went on to say that one of the key questions is whether code that is created by an LLM is copyrightable and, if not, if there is some way to make it copyrightable. It was never said explicitly, but the driving issue seems to be whether this software can be credibly put under a copyleft license. Equally important is whether such code infringes on the rights of others. With regard to copyrightability, the question is still open; there are some cases working their way through the courts now. Regardless, though, he said that it seems possible to ensure that LLM output can be copyrighted by applying some human effort to enhance the resulting code. The use of a "creative prompt" might also make the code copyrightable.

Many years ago, he said, photographs were not generally seen as being copyrightable. That changed over time as people figured out what could be done with that technology and the creativity it enabled. Photography may be a good analogy for LLMs, he suggested.

There is also, of course, the question of copyright infringements in code produced by LLMs, usually in the form of training data leaking into the model's output. Prompting an LLM for output "in the style of" some producer may be more likely to cause that to happen. Clifton suggested that LLM-generated code should be submitted with the prompt used to create it so that the potential for copyright infringement can be evaluated by others.

Siewicz said that he does not know of any model that says explicitly whether it incorporates licensed data. As some have suggested, it could be possible to train a model exclusively on permissively licensed material so that its output would have to be distributable, but even permissive licenses require the preservation of copyright notices, which LLMs do not do. A related concern is that some LLMs come with terms of service that assert copyright over the model's output; incorporating such code into a free-software project could expose that project to copyright claims.

Siewicz concluded his talk with a few suggested precautions for any project that accepts LLM-generated code, assuming that the project accepts it at all. These suggestions mostly took the form of collecting metadata about the code. Submissions should disclose which LLM was used to create them, including version information and any available information on the data that the model was trained on. The prompt used to create the code should also be provided. The LLM-generated code should be clearly marked. If there are any use restrictions on the model output, those need to be documented as well. All of this information should be recorded and saved when the code is accepted.
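The article does not prescribe any particular mechanism for recording this metadata, but one natural fit for Git-based projects would be commit-message trailers alongside the familiar Signed-off-by line. The trailer names below are purely illustrative; no project convention is implied:

```
Add fast-path lookup for symbol table

Assisted-by: ExampleLLM v2.1 (training data undisclosed)
LLM-Prompt: "Write a C function that caches symbol lookups in a hash table"
LLM-Output-Terms: no restrictions asserted by provider
Signed-off-by: Jane Developer <jane@example.com>
```

Keeping the information in trailers means it travels with the code through merges and can be searched later (for example with `git log --grep`) if questions about a model's training data arise.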

A member of the audience pointed out that the line between LLMs and assistive (accessibility) technology can be blurry, and that any outright ban of the former can end up blocking developers needing assistive technology, which nobody wants to do.


There were some questions about how to distinguish LLM-generated code from human-authored code, given that some contributors may not be up-front about their model use. Clifton said that there must always be humans in the loop; they, in the end, are responsible for the code they submit. Jeff Law added that the Developer Certificate of Origin (DCO), under which code is submitted to many projects, includes a statement that the contributor has the right to submit the code in question. Determining whether that right is something the contributor truly holds is not a new concern; developers could be, for example, submitting code that is owned by their employer.
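The DCO certification Law mentions is typically recorded as a Signed-off-by trailer, which `git commit -s` generates from the committer's configured identity. A quick demonstration in a throwaway repository (the name and email are hypothetical):

```shell
# Create a scratch repository so the example is self-contained.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git config user.name  "Jane Developer"
git config user.email "jane@example.com"

echo 'demo' > file.txt && git add file.txt

# -s appends a Signed-off-by trailer; under the DCO, that trailer
# certifies the contributor has the right to submit the change.
git commit -q -s -m "Add demo file"

# The commit message now ends with:
#   Signed-off-by: Jane Developer <jane@example.com>
git log -1 --format=%B
```

As the discussion noted, the sign-off asserts a right the project cannot independently verify; that limitation predates LLMs.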

A real concern, Siewicz said, is whether contributors are sufficiently educated to know where the risks actually are.

Mark Wielaard said that developers are normally able to cite any inspirations for the code they write; an LLM is clearly inspired by other code, but is unable to make any such citations. So there is no way to really know where LLM-generated code came from. A developer would have to publish their entire session with the LLM to even begin to fill that in.

The session came to an end with, perhaps, participants feeling that they had a better understanding of where some of the concerns are, but nobody walked out convinced that they knew the answers.

A video of this session is available on YouTube.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to this event.]


