# Can you save on LLM tokens using images instead of text?

Original link: https://pagewatch.ai/blog/post/llm-text-as-image-tokens/

A recent experiment investigated whether converting text prompts to images can reduce token usage and cost when calling the OpenAI API, prompted by an observation about how image inputs consume tokens. The experiment used Karpathy's blog post on digital hygiene, asking ChatGPT to summarize its advice via both text and image inputs. The results showed that with image inputs, GPT-5 used **40%** fewer prompt tokens. However, for most models this saving was offset by a **large increase in completion tokens**. Completion tokens cost more, so for every model except GPT-5-chat, the total cost with images was *higher*. The author concludes that while token savings are possible, the increase in completion-token usage is likely to cancel out any financial gain, making the image-based approach generally **not worth it**, despite its potential to lower prompt costs. Careful attention to the model and its token pricing is essential.

## LLMs & Image Token Efficiency - Hacker News Summary A Hacker News discussion revolves around the idea of using images to potentially reduce LLM token usage. The core question is whether images can effectively represent information more compactly than text – essentially, if “a picture is worth a thousand words” holds true for LLMs. Recent research, including the CALM paper, suggests this is possible. CALM claims to compress 4 tokens into a single embedding, potentially equating to over 1000 words per image. However, users debate the practical implications and decompression methods, clarifying it's about compressing *embeddings* not tokens directly, using techniques like MLPs for compression/decompression or learned prediction. Other points raised include the increased processing time for images compared to text, and the potential for multimodal models to utilize hidden captioning steps. A humorous anecdote about Charles Babbage correcting Tennyson’s population calculations highlights the importance of accuracy even in poetic expression. Finally, Y Combinator's Winter 2026 application period is announced.

## Original post

TLDR: Yes, but only under specific conditions, and it is probably not worth the trade-offs.

The recent OCR paper from DeepSeek and the related discussion reminded me of something curious I noticed with the OpenAI API: prompts that use images as input consume almost the same number of tokens as those that use text, even though the images can be 50x the size of the text.

This made me wonder: could you actually save on tokens by converting your text to an image and sending that to the API?

## Setting up an experiment

To test the theory, I ran a simple prompt using both a text-only input and an image-based one.

For the prompt, I’m asking ChatGPT to pick out and summarize the information from this post by Karpathy on digital hygiene. It should be something LLMs are good at, but there is also a clear correct answer, so it’s easy to check.

The prompt is simply the following:

```
List every hygiene tip from the collection, with a short sentence summary.
List the titles *EXACTLY* as in the text
Follow the following format:
**Title 1** - summary 1
**Title 2** - summary 2
DO NOT ask any followup questions, simply return the list.
```

To run the prompt, I extracted the text-only version of the blog post (skipping the headers and image) and fed it to gpt-5.
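The exact script isn’t shown in the post; a minimal sketch of the text-based call with the OpenAI Python SDK (the variable names `instructions` and `post_text` are placeholders, not the author’s actual code) could look like this:

```python
from openai import OpenAI

client = OpenAI()

# `instructions` is the prompt shown above; `post_text` is the extracted
# text of the blog post. Both names are placeholders.
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": f"{instructions}\n\n{post_text}"}],
)
print(response.choices[0].message.content)
```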

The output is as expected:

```
**Password manager** - Use a password manager to store and generate unique passwords to prevent reuse and leaks across services.
**Hardware security key** - Strengthen your accounts with physical security keys like YubiKeys for superior protection against attacks and SIM swaps.
**Biometrics** - Use biometric authentication, such as FaceID, as an additional factor of security where available.
**Security questions** - Avoid traditional security questions; use random passwords instead to prevent easy guessing or phishing.
... etc
```

The next step is to run the prompt on the images instead of the text.

## Converting the prompt to an image

When creating the image, we need to be careful to get the dimensions right; otherwise OpenAI will resize the image and the model won’t be able to make out any text.

Looking through their documentation, the ideal size seems to be 768x768, so I wrote a basic Puppeteer script to convert the post to an image at these dimensions.
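The Puppeteer script itself isn’t included in the post; a rough Python equivalent using Playwright (the URL, file names, and second-screenshot step are illustrative assumptions, not the author’s actual code) might look like:

```python
import base64
from playwright.sync_api import sync_playwright

# Render the post at the target resolution and screenshot it. A second
# image (prompt_2.png) would be captured the same way after scrolling.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 768, "height": 768})
    page.goto("https://karpathy.bearblog.dev/digital-hygiene/")  # illustrative URL
    page.screenshot(path="prompt_1.png")
    browser.close()

def to_data_url(path: str) -> str:
    """Base64-encode a local PNG as a data URL the API accepts."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("ascii")

im_1 = to_data_url("prompt_1.png")
im_2 = to_data_url("prompt_2.png")
```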

To fit into the desired resolution, I had to break the prompt into two images; you can see them here and here. When running the prompt, you have to specify both images with `"detail": "high"`:

```python
response = client.chat.completions.create(
    model='gpt-5',
    messages=[{
        'role': 'user',
        'content': [
            # im_1 and im_2 are the data URLs (or hosted URLs) of the two images
            {"type": "image_url", "image_url": {"url": im_1, "detail": "high"}},
            {"type": "image_url", "image_url": {"url": im_2, "detail": "high"}},
        ],
    }],
)
```

This worked perfectly, and the output was similar to that of the text-based prompt (though it took almost twice as long to process).

## The Results

Running the prompt with a few different models, we can see there are indeed significant savings in prompt tokens.

With gpt-5 in particular, we get over a 40% reduction in prompt tokens.
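For a rough sense of why the image inputs are so compact: for earlier vision models (the GPT-4o family), OpenAI’s published accounting charges a high-detail image a fixed base cost plus a fixed cost per 512px tile, regardless of how much text the image contains. Assuming (and this is only an assumption) that GPT-5’s accounting is similar:

```python
import math

def image_tokens(width: int, height: int, base: int = 85, per_tile: int = 170) -> int:
    """High-detail image token cost under the GPT-4o-era formula:
    a base cost plus a fixed cost per 512x512 tile. The constants are
    an assumption for GPT-5, used only for illustration."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base + per_tile * tiles

print(image_tokens(768, 768))  # 2x2 tiles -> 85 + 4*170 = 765 tokens per image
```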

## Completion Tokens

Prompt tokens are only half the story, though.

Running the prompt five times with each model and taking the average, we see the following:

All models apart from gpt-5-chat use significantly more completion tokens with the image inputs.

Completion tokens are also significantly more expensive, so unless you use the chat model, you are not getting any savings.
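For reference, here is a sketch of how such per-model averages can be collected from the `usage` field returned by the Chat Completions API (reusing the `client` from the earlier sketch; the helper name is mine, and `content` can be either the text prompt or the image list):

```python
def average_usage(model: str, content, runs: int = 5) -> tuple[float, float]:
    """Run the same prompt `runs` times and average the token usage."""
    prompt_total = completion_total = 0
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": content}],
        )
        prompt_total += resp.usage.prompt_tokens
        completion_total += resp.usage.completion_tokens
    return prompt_total / runs, completion_total / runs
```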
