CodeScientist: Automated scientific discovery system for code-based experiments

Original link: https://github.com/allenai/codescientist

CodeScientist's FAQ covers costs, infrastructure, data storage, and troubleshooting. Cost estimates consist mainly of the large language model (LLM) charges for debugging and running experiments, and exclude initial code generation and Modal container costs. Local Docker containers can be used, but Modal containers are preferred for speed and scalability. Experiment data is stored as JSON in the `data/` directory, with the exception of the PaperStore. Experiment costs vary widely, averaging about $4 according to the CodeScientist paper, mostly from code debugging with `gpt-4o-mini`. Claude Sonnet 3.5 is the default model for its code generation and debugging abilities, but `litellm` makes it easy to switch to other LLMs. Troubleshooting tips: view PDF reports in Chrome; if PDF generation fails, check the raw LaTeX report in the experiment directory; make sure the paper data has been downloaded to resolve "ERROR: Could not retrieve source" errors. Log server output for debugging with `python -u src/CodeScientistWebServer.py > debuglog.txt 2>&1`. Monitor API key usage, since LLM-generated code may make API calls. Protect API keys with an exclusion file and pre-commit hooks.

CodeScientist is a new automated scientific discovery (ASD) system designed to overcome the limitations of existing systems, which are confined to narrow design spaces and limited code evaluation. It uses a genetic search over combinations of research papers and code blocks to generate and test ideas, focusing on domains such as agents and virtual environments. By automating experiment construction, the system enables broader and more varied discovery than simple benchmark optimization. Researchers ran hundreds of experiments with CodeScientist, yielding 19 discoveries, all of which were rigorously evaluated through external review, code review, and replication attempts. Of these, six were judged both sound and incrementally novel. The discoveries span new tasks, agents, metrics, and data, marking a step toward broader scientific exploration. Code and more information are available on GitHub.

Original text

11. Frequently Asked Questions

Q: What is and is not included in the cost estimates?

A: Generally, the cost of running/debugging the experiment is what's calculated -- chiefly the LLM cost for the Experiment Builder reflection loop, plus the cost of any LLM calls made by the experiment code itself. The ideation costs, initial experiment code generation costs, and Modal container costs are not included (as these are generally small compared to the debugging costs).

Q: Why not use local/free Docker containers instead of Modal containers?

A: You could do this, and create a version of `src/modules/ModuleRunPythonInModal.py` intended for local Docker use instead of Modal use. The initial version of CodeScientist did this, but the containers were very slow to spin up, and it limited the number of simultaneous experiments that could be spooled up to what was available on the local machine. The Modal containers are generally very inexpensive, very fast, and allow scaling.
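
For reference, here is a minimal sketch of what a local-Docker substitute might look like, using the `docker` Python SDK (docker-py). The function name and return format are hypothetical illustrations, not the actual interface of the Modal module:

```python
# Hypothetical local-Docker analogue of src/modules/ModuleRunPythonInModal.py.
# The function name and return format are illustrative only; the real module's
# interface may differ.
import docker

def run_python_in_local_docker(code: str, image: str = "python:3.11-slim") -> str:
    """Run a snippet of experiment code in a throwaway local container."""
    client = docker.from_env()
    # remove=True deletes the container after it exits; the call blocks and
    # returns the combined stdout/stderr as bytes.
    output = client.containers.run(image, command=["python", "-c", code], remove=True)
    return output.decode("utf-8")

if __name__ == "__main__":
    print(run_python_in_local_docker("print('hello from a local container')"))
```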

Q: Where are the internal files for ideas, experiments, etc., stored?

A: These are generally stored in easily parsable JSON format in `data/`, with the exception of the PaperStore, which is stored in `paperstore/`.
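
A quick way to see what is stored is to enumerate the JSON files; this sketch makes no assumptions about specific file names, since those vary by installation:

```python
# Minimal sketch: enumerate the JSON stores in data/ and report their sizes.
# File names are installation-specific; this just inspects whatever is present.
import json
from pathlib import Path

for path in sorted(Path("data").glob("*.json")):
    with path.open() as f:
        records = json.load(f)
    size = len(records) if isinstance(records, (list, dict)) else 1
    print(f"{path.name}: {size} top-level entries")
```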

Q: How much does each experiment cost to run?

A: This is highly variable, and depends on many unknowns (like the number of debugging iterations required, the size of the prompts the LLM generates, etc.). The experiments in the CodeScientist paper cost an average of about $4 each, though with high variance, and almost entirely in code-debugging costs. Those experiments were generally prompted to use `gpt-4o-mini` for all experiment code, which is generally quite inexpensive.

Q: Why use Claude Sonnet 3.5 as a base model? Why not use [My Favorite Model]?

A: Claude is excellent at generating and debugging code. The library for LLM calls is `litellm`, so you should be able to easily retarget to many common language models -- though the performance may vary.
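
For example, a minimal `litellm` sketch where retargeting is just a matter of changing the model string (the model names below are examples; code-generation and debugging quality will vary by model):

```python
# litellm exposes an OpenAI-style completion API, so swapping models only
# requires changing the model string (and having the corresponding provider
# API key set in your environment).
from litellm import completion

response = completion(
    model="claude-3-5-sonnet-20240620",  # or e.g. "gpt-4o-mini"
    messages=[{"role": "user", "content": "Write a Python hello-world."}],
)
print(response.choices[0].message.content)
```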

Q: Help, when I click the "PDF" report link in the Experiment Monitor, the PDF doesn't appear.

A: It appears to work in Chrome, but may not work in other browsers (like Firefox). As a workaround, you can download the experiment ZIP file (which includes the report).

Q: The reports for some experiments do not generate.

A: This is an uncommon but known issue -- sometimes the LLM doesn't generate valid LaTeX, or the LaTeX is complicated and doesn't compile. The report generator makes several attempts. If the LaTeX compilation was unsuccessful, the raw LaTeX report should still be in the experiment directory, just without the compiled PDF.

Q: Ideas or experiments don't appear to be working, and I'm seeing errors in the server console of the form `ERROR: Could not retrieve source for paper with ID: xxxx.xxxxx`.

A: If your ideas don't appear to be generating in any mode (idea generation, batch autonomous experimentation, etc.), and you're seeing errors of the form `ERROR: Could not retrieve source for paper with ID: xxxx.xxxxx` in the server console, then you likely have not yet downloaded the paper data. See the data section for a link to the paper corpus.

Q: Help, something is failing, but since the debug cycles take minutes to hours, I'm not sure precisely what failed.

A: You might want to run the CodeScientist server in a way that saves the output to a log file that you can examine after the issue occurs. For example: `python -u src/CodeScientistWebServer.py > debuglog.txt 2>&1`. Alternately, if experiments appear to be running, you might manually examine the history and other logs in the relevant experiment directory in `generated-experiments/`.
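
As a small convenience, here is a hedged sketch that scans the debug log and any experiment logs for error lines after the fact; the log locations are assumptions based on the setup described above:

```python
# Minimal sketch: grep debuglog.txt and logs under generated-experiments/
# for error lines, so failures can be located after long debug cycles.
from pathlib import Path

def grep_errors(path: Path, needle: str = "ERROR") -> None:
    for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
        if needle in line:
            print(f"{path}:{lineno}: {line.strip()}")

if Path("debuglog.txt").exists():
    grep_errors(Path("debuglog.txt"))
for log in Path("generated-experiments").rglob("*.txt"):
    grep_errors(log)
```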

This software is an autonomous agent that not only makes a significant number of LLM calls itself (each with an associated cost), but also spawns containers running LLM-generated code that can itself make LLM calls or incur other costs (like container costs). Further, while an effort is made to hide the API keys from the LLM-generated code, nothing is completely safe. What's more, while this code includes cost-estimation tools to help put hard limits in place and control runaway costs, these are only estimates, and nothing is foolproof.

All this is to say: The only API keys you should provide to CodeScientist are those with hard limits, and you should continually monitor your API key usage to measure the actual system cost.

Setting up exclusion patterns for API keys: The contents of the containers are stored on disk, and this can include the API keys. You may wish to place your keys in an exclusion file (e.g. `EXCLUDE_PATTERNS.TXT`) and set up pre-commit hooks that check for these patterns, to help prevent the keys from accidentally being committed.
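
A minimal sketch of such a hook follows (saved as `.git/hooks/pre-commit` and made executable). The `EXCLUDE_PATTERNS.TXT` format assumed here -- one literal pattern per line -- is an illustration, not a format the repository prescribes:

```python
#!/usr/bin/env python3
# Minimal pre-commit hook sketch: abort the commit if any staged file contains
# a pattern listed in EXCLUDE_PATTERNS.TXT (assumed: one literal string per line).
import subprocess
import sys
from pathlib import Path

patterns = [line.strip() for line in Path("EXCLUDE_PATTERNS.TXT").read_text().splitlines()
            if line.strip()]

# Files staged for this commit.
staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.split()

for name in staged:
    path = Path(name)
    if not path.is_file():
        continue  # e.g. deleted files
    text = path.read_text(errors="ignore")
    for pattern in patterns:
        if pattern in text:
            print(f"pre-commit: excluded pattern found in {name}; commit aborted.")
            sys.exit(1)
sys.exit(0)
```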

If you use this work, please reference the following citation:

```bibtex
@misc{jansen2025codescientistendtoendsemiautomatedscientific,
      title={CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation},
      author={Peter Jansen and Oyvind Tafjord and Marissa Radensky and Pao Siangliulue and Tom Hope and Bhavana Dalvi Mishra and Bodhisattwa Prasad Majumder and Daniel S. Weld and Peter Clark},
      year={2025},
      eprint={2503.22708},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.22708},
}
```

CodeScientist is released under an Apache 2.0 License. The text of that license is included in this repository.

Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

For any questions, please contact Peter Jansen ([email protected]). For issues, bugs, or feature requests, please submit a GitHub issue.
