Glyph is a framework for scaling the context length through visual-text compression. Instead of extending token-based context windows, Glyph renders long textual sequences into images and processes them using vision–language models (VLMs). This design transforms the challenge of long-context modeling into a multimodal problem, substantially reducing computational and memory costs while preserving semantic information.
(Upper) Comparison of two paradigms for long-context tasks: conventional approaches directly feeding plain text into LLMs, and the proposed VLM-based paradigm, Glyph, which renders text as compact images to achieve substantial input-token compression. (Lower) Glyph attains competitive performance on LongBench and MRCR, while offering significant compression and inference speedup over its text backbone model on 128K-token inputs.
We provide a ready-to-run demo script that deploys both a baseline text model (e.g., Qwen3 or GLM-4) and Glyph, enabling a direct comparison of long-context inference efficiency.
After downloading the model, run the following to see a side-by-side comparison of the outputs from Qwen3 and Glyph:
cd demo
bash run_demo_compared.sh

This demo will:
- Start a text-only LLM
- Start Glyph with visual–text compression
- Provide a simple testing interface for long-context question answering
If you wish to view only the output from Glyph, run the following command in the demo directory:
🎬 A short demonstration is provided below, showing the faster prefill speed of Glyph on long-context inputs:
demo.mp4
Glyph achieves notably improved prefill efficiency on long-context inputs, with increasing benefits as the sequence length grows. 🚀
Our model is built on GLM-4.1V-9B-Base. The fine-tuned model is publicly available on Hugging Face.
You are welcome to download and use it!
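If you prefer to download the checkpoint ahead of time, you can fetch it with huggingface_hub; the target directory below is just an illustrative choice.

from huggingface_hub import snapshot_download

# Download the released Glyph checkpoint into a local directory.
local_path = snapshot_download(repo_id="zai-org/Glyph", local_dir="./Glyph")
print(f"Model downloaded to: {local_path}")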
The continual pre-training data of Glyph will be added to a new version of GLM-4.1V-9B-Base, which will be released later.
First, please install the required dependencies using the following command:
apt-get install poppler-utils
pip install transformers==4.57.1
# Optional
pip install vllm==0.10.2 sglang==0.5.2

Then, run the following code:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://raw.githubusercontent.com/thu-coai/Glyph/main/assets/Little_Red_Riding_Hood.png"
            },
            {
                "type": "text",
                "text": "Who pretended to be Little Red Riding Hood's grandmother"
            }
        ],
    }
]

processor = AutoProcessor.from_pretrained("zai-org/Glyph")
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path="zai-org/Glyph",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
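As a quick, optional check of how much input the visual path saves, you can compare the token count of the multimodal inputs built above with the token count of the same story as plain text. This is only a sketch: it assumes the processor exposes its text tokenizer as processor.tokenizer and that a hypothetical story.txt contains the plain-text version of the story; neither is prescribed by the repository.

# Rough compression check: visual-path tokens vs. plain-text tokens.
# Reuses `processor` and `inputs` from the snippet above.
with open("story.txt", "r", encoding="utf-8") as f:  # hypothetical plain-text copy of the story
    raw_text = f.read()

visual_tokens = inputs["input_ids"].shape[1]                    # tokens actually fed to the VLM
text_tokens = len(processor.tokenizer(raw_text)["input_ids"])   # tokens a text-only LLM would consume

print(f"visual-path tokens: {visual_tokens}")
print(f"plain-text tokens:  {text_tokens}")
print(f"approx. compression: {text_tokens / max(visual_tokens, 1):.1f}x")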
We provide the post-training configurations for both English and Chinese in the config directory, along with the corresponding fonts.

You can customize the newline behavior through the newline-markup option in the config file, which can affect the compression ratio (a small override example follows the lists below):

- Set "newline-markup": "<font color=\"#FF0000\"> \\n </font>" to render newlines as a special visual marker.
- Set "newline-markup": "<br/>" to use standard line breaks.

The compression ratio is also influenced by the DPI setting:

- DPI=72: achieves an average compression of 3–4×; this is the best trade-off between compression ratio and performance.
- DPI=96: achieves an average compression of 2–3× and usually yields better accuracy than DPI=72.
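As a concrete illustration, you could load the provided English config, switch the newline markup, adjust the DPI, and save a custom variant to pass to the rendering script. This is only a sketch: it assumes the config is plain JSON and that the DPI field is literally named "dpi"; please check config/config_en.json for the exact key names.

import json

# Load the provided English rendering config and customize it.
with open("../config/config_en.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["newline-markup"] = "<br/>"  # standard line breaks (see the options above)
config["dpi"] = 72                  # assumed key name; 72 favors compression (~3-4x), 96 favors accuracy

with open("../config/config_custom.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)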
A rendering example:
We provide scripts to render long text into images for your convenience.
This is a simple example of rendering a single text file (e.g., input.txt) into a sequence of images. You can adjust the rendering style by modifying CONFIG_EN_PATH.
from test_word2png_function_fast import text_to_images

CONFIG_EN_PATH = '../config/config_en.json'
OUTPUT_DIR = './output_images'
INPUT_FILE = './input.txt'

# Read text from file
with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    text = f.read()

# Convert text to images
images = text_to_images(
    text=text,
    output_dir=OUTPUT_DIR,
    config_path=CONFIG_EN_PATH,
    unique_id='test_001'
)

print(f"\nGenerated {len(images)} image(s):")
for img_path in images:
    print(f"  {img_path}")

Note: The current text rendering feature is implemented using the reportlab library. While the overall process is stable, there is still significant room for acceleration.
Our model supports vLLM acceleration for inference, which significantly improves throughput and response speed in long-context scenarios. Use the following command to start the vLLM-served model:
vllm serve YOUR_MODEL_PATH --port 5002 --served-model-name glyph --allowed-local-media-path / --media-io-kwargs '{"video": {"num_frames": -1}}'

After rendering the text into images, you can perform inference with the VLM.
from vlm_inference import vlm_inference

response = vlm_inference(
    question="Based on the story in the figures, what is the ending of the wolf?",
    image_paths=["./output_images/Little_Red_Riding_Hood/page_001.png"]
)
print("VLM's Response:")
print(response)
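Since vllm serve exposes an OpenAI-compatible endpoint, you can also query the deployed model directly instead of going through the vlm_inference helper. The following is a minimal sketch rather than part of the repository: it assumes the server from the command above is reachable at localhost:5002 under the served model name glyph, and it passes a rendered page as a base64 data URL.

import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5002/v1", api_key="EMPTY")

# Encode one rendered page as a base64 data URL so it can travel in the request body.
with open("./output_images/Little_Red_Riding_Hood/page_001.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glyph",  # must match --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
                {"type": "text", "text": "Based on the story in the figures, what is the ending of the wolf?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)

To send more pages, append additional image_url entries to the content list in reading order.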
We provide evaluation scripts and test cases for benchmarks including LongBench, MRCR, and RULER. For detailed instructions on running the evaluations, please refer to the guide in evaluation/readme.md.

Through visual–text compression, Glyph effectively scales the context window, matching the performance of text LLMs that use 3×–4× longer contexts.
Speedup ratios of Glyph over the text backbone model for prefill, decoding, and training across different sequence lengths.
- Sensitivity to rendering parameters: Glyph’s performance can vary with rendering settings such as resolution, font, and spacing. Since our search procedure adopts a fixed rendering configuration during post-training, the model may not generalize well to unseen or substantially different rendering styles.
- OCR-related challenges: Recognizing fine-grained or rare alphanumeric strings (e.g., UUIDs) remains difficult for vision–language models, especially with ultra-long inputs, sometimes leading to minor character misclassification.
- Limited generalization: The training of Glyph mainly targets long-context understanding, and its capability on broader tasks is yet to be studied.
If you find our model or code useful in your research, please cite our paper:
@article{cheng2025glyphscalingcontextwindows,
  title={Glyph: Scaling Context Windows via Visual-Text Compression},
  author={Jiale Cheng and Yusen Liu and Xinyu Zhang and Yulin Fei and Wenyi Hong and Ruiliang Lyu and Weihan Wang and Zhe Su and Xiaotao Gu and Xiao Liu and Yushi Bai and Jie Tang and Hongning Wang and Minlie Huang},
  journal={arXiv preprint arXiv:2510.17800},
  year={2025}
}

