Q: What is and is not included in the cost estimates?
A: Generally, only the cost of running/debugging the experiment is calculated -- chiefly the LLM cost of the Experiment Builder reflection loop, plus the cost of any LLM calls made by the experiment code itself. The ideation costs, initial experiment code generation costs, and Modal container costs are not included (they are generally small compared to the debugging costs).
Q: Why not use local/free Docker containers instead of Modal containers?
A: You could do this by creating a version of src/modules/ModuleRunPythonInModal.py intended for local Docker use instead of Modal use. The initial version of CodeScientist did this, but the containers were very slow to spin up, and the number of simultaneous experiments was limited to what the local machine could support. The Modal containers are generally very inexpensive, very fast, and allow scaling.
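If you want to try a local alternative, a minimal sketch (not part of CodeScientist) of running generated code in a local Docker container via the docker-py SDK is shown below; the image, host path, and script name are placeholders:

```python
# Minimal sketch: run an experiment script in a throwaway local Docker container.
# Assumes the docker-py SDK (pip install docker) and a running Docker daemon.
import docker

client = docker.from_env()

# Hypothetical paths/image -- substitute the actual generated-experiment directory.
output = client.containers.run(
    image="python:3.11-slim",
    command=["python", "/work/experiment.py"],
    volumes={"/path/to/generated-experiment": {"bind": "/work", "mode": "rw"}},
    remove=True,  # clean up the container when it exits
)
print(output.decode("utf-8"))
```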
Q: Where are the internal files for ideas, experiments, etc., stored?
A: These are generally stored in easily parsable JSON format in data/, with the exception of the PaperStore, which is stored in paperstore/.
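Because the stores are plain JSON, they can be inspected with a few lines of Python. The filename below is hypothetical -- look inside data/ for the actual store files:

```python
# Minimal sketch: load one of the JSON stores from data/ for inspection.
import json
from pathlib import Path

store_path = Path("data") / "ideas.json"  # hypothetical filename
with store_path.open("r", encoding="utf-8") as f:
    records = json.load(f)

print(f"Loaded {len(records)} records from {store_path}")
```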
Q: How much does each experiment cost to run?
A: This is highly variable and depends on many unknowns (such as the number of debugging iterations required and the size of the prompts the LLM generates). The experiments in the CodeScientist paper cost an average of about $4 each, but this was highly variable, and almost entirely code-debugging cost. Those experiments were generally prompted to use gpt-4o-mini for all experiment code, which is quite inexpensive.
Q: Why use Claude Sonnet 3.5 as a base model? Why not use [My Favorite Model]?
A: Claude is excellent at generating and debugging code. The library used for LLM calls is litellm, so you should be able to easily retarget to many common language models -- though performance may vary.
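For reference, a minimal sketch of a litellm call is shown below; the model name is an example only, and retargeting generally amounts to changing the model string (plus setting the corresponding API key):

```python
# Minimal sketch: a litellm completion call (pip install litellm).
# Swap the model string for another litellm-supported model to retarget.
from litellm import completion

response = completion(
    model="claude-3-5-sonnet-20241022",  # example model name
    messages=[{"role": "user", "content": "Summarize the experiment results."}],
)
print(response.choices[0].message.content)
```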
Q: Help, when I click the "PDF" report link in the Experiment Monitor, the PDF doesn't appear.
A: This appears to work in Chrome, but may not work in other browsers (like Firefox). As a workaround, you can download the experiment ZIP file, which includes the report.
Q: The reports for some experiments do not generate.
A: This is an uncommon but known issue -- sometimes the LLM doesn't generate valid LaTeX, or the LaTeX is complicated and doesn't compile. The report generator makes several attempts. If the LaTeX compilation was unsuccessful, the raw LaTeX report should still be in the experiment directory, just without the compiled PDF.
Q: Ideas or experiments don't appear to be working, and I'm seeing errors in the server console of the form ERROR: Could not retrieve source for paper with ID: xxxx.xxxxx.
A: If your ideas don't appear to be generating in any mode (idea generation, batch autonomous experimentation, etc.), and you're seeing errors of the form ERROR: Could not retrieve source for paper with ID: xxxx.xxxxx in the server console, then you likely have not yet downloaded the paper data. See the data section for a link to the paper corpus.
Q: Help, something is failing, but since the debug cycles take minutes to hours, I'm not sure precisely what failed.
A: You might want to run the CodeScientist server in a way that saves the output to a log file that you can examine after the issue occurs, for example: python -u src/CodeScientistWebServer.py > debuglog.txt 2>&1. Alternatively, if experiments appear to be running, you can manually examine the history and other logs in the relevant experiment directory under generated-experiments/.
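As a small convenience (not part of CodeScientist), a few lines of Python can pull the ERROR lines out of the resulting log, along with a little surrounding context:

```python
# Minimal sketch: scan the debug log produced by the command above for ERROR lines.
from pathlib import Path

log_lines = Path("debuglog.txt").read_text(encoding="utf-8", errors="replace").splitlines()
for i, line in enumerate(log_lines):
    if "ERROR" in line:
        # Print the error plus the two preceding lines for context.
        for context_line in log_lines[max(0, i - 2): i + 1]:
            print(context_line)
        print("-" * 40)
```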
This software is an autonomous agent that not only makes a significant number of LLM calls itself (each with an associated cost), but also spawns containers with LLM-generated code that can themselves make LLM calls or incur other costs (such as container costs). Further, while an effort is made to hide the API keys from the LLM-generated code, nothing is completely safe. And while this code includes cost-estimation tools to help set hard limits and control runaway costs, these are only estimates, and nothing is foolproof.
All this is to say: The only API keys you should provide to CodeScientist are those with hard limits, and you should continually monitor your API key usage to measure the actual system cost.
Setting up exclusion patterns for API keys: The contents of the containers are stored on disk, and this can include the API keys. You may wish to place your keys in an exclusion file (e.g. EXCLUDE_PATTERNS.TXT) and set up pre-commit hooks that check for these patterns, to help prevent the keys from accidentally being committed.
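A minimal sketch of such a pre-commit hook is shown below (save it as .git/hooks/pre-commit and make it executable); it assumes EXCLUDE_PATTERNS.TXT contains one literal pattern per line:

```python
#!/usr/bin/env python3
# Minimal sketch of a pre-commit hook: refuse to commit if any staged file
# contains a pattern listed in EXCLUDE_PATTERNS.TXT (one pattern per line).
import subprocess
import sys
from pathlib import Path

patterns = [p.strip() for p in Path("EXCLUDE_PATTERNS.TXT").read_text().splitlines() if p.strip()]

# List the files staged for this commit.
staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for filename in staged:
    path = Path(filename)
    if not path.is_file():
        continue
    contents = path.read_text(errors="replace")
    for pattern in patterns:
        if pattern in contents:
            print(f"Refusing to commit: pattern from EXCLUDE_PATTERNS.TXT found in {filename}")
            sys.exit(1)

sys.exit(0)
```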
If you use this work, please reference the following citation:
@misc{jansen2025codescientistendtoendsemiautomatedscientific,
title={CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation},
author={Peter Jansen and Oyvind Tafjord and Marissa Radensky and Pao Siangliulue and Tom Hope and Bhavana Dalvi Mishra and Bodhisattwa Prasad Majumder and Daniel S. Weld and Peter Clark},
year={2025},
eprint={2503.22708},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.22708},
}
CodeScientist is released under an Apache 2.0 License. The text of that license is included in this repository.
Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
For any questions, please contact Peter Jansen ([email protected]). For issues, bugs, or feature requests, please submit a GitHub issue.