Show HN: Chonky – a neural text semantic chunker goes multilingual

原始链接: https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1

## Chonky: Smart Text Splitting for RAG

Chonky is a transformer model designed to intelligently split text into semantically coherent chunks, making it well suited to retrieval-augmented generation (RAG) systems. Now multilingual, it strengthens RAG pipelines by preparing text for embedding and retrieval. The model `mirth/chonky_mmbert_small_multilingual_1` was fine-tuned on datasets such as MiniPile, BookCorpus, and Project Gutenberg, and achieves strong performance in many languages, including German, Spanish, and Russian (F1 scores up to 0.97). It handles sequences up to 1024 tokens. A dedicated Python library, `chonky`, simplifies integration; alternatively, the model can be used through a standard named-entity-recognition (NER) pipeline with Hugging Face Transformers. Chonky was trained on a single H100 GPU for a few hours, balancing performance and efficiency.

## Chonky: Multilingual Text Chunking

Hessdalenlight has released a new multilingual model, "Chonky", on Hugging Face, extending earlier text-splitting models. It is built on mmBERT, a model pretrained on 1833 languages, and aims to improve semantic chunking across diverse texts. The developer augmented the training dataset with Project Gutenberg books in multiple languages and applied a technique of randomly removing punctuation from the last word of training chunks to improve robustness. Evaluation was challenging due to the lack of real-world, unformatted text datasets, so existing book corpora and papers were used instead. Interestingly, the larger mmBERT model performed *worse* than the smaller one. The goal is to split text into meaningful chunks efficiently, especially for data from sources such as OCR, transcriptions, and meeting notes. The developer welcomes feedback and encourages users to test the model: [https://huggingface.co/mirth/chonky_mmbert_small_multilingua](https://huggingface.co/mirth/chonky_mmbert_small_multilingua).
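The punctuation-removal augmentation mentioned above is simple to reproduce; below is a minimal sketch of the idea (the helper name and probability are illustrative, not taken from the actual training code):

import random
import string

def maybe_strip_trailing_punct(chunk: str, p: float = 0.5) -> str:
    # Illustrative augmentation: with probability p, drop trailing punctuation
    # so the model does not learn to rely on punctuation cues at chunk ends.
    if random.random() < p:
        return chunk.rstrip(string.punctuation)
    return chunk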

Original model card

Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in RAG systems.

🆕 Now multilingual!

Model Description

The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
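As an illustration of that downstream step (not part of the model card), here is a minimal sketch that embeds the produced chunks and retrieves the best match for a query; the sentence-transformers embedder and the placeholder chunks are assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical RAG retrieval step: embed the chunks and rank them against a query.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["First semantic chunk ...", "Second semantic chunk ..."]  # output of the splitter
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

query_vec = embedder.encode(["What did the author work on before college?"], normalize_embeddings=True)[0]
best = int(np.argmax(chunk_vecs @ query_vec))
print(chunks[best])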

⚠️ This model was fine-tuned on sequences of length 1024 (by default mmBERT supports sequence lengths up to 8192).

How to use

I've made a small Python library for this model: chonky

Here is the usage:

from chonky import ParagraphSplitter

# on the first run it will download the transformer model
splitter = ParagraphSplitter(
  model_id="mirth/chonky_mmbert_small_multilingual_1",
  device="cpu"
)

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
  print(chunk)
  print("--")

Sample Output:

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep
--
. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
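Since the model was fine-tuned on 1024-token sequences (see the note above), very long documents may need to be cut into windows before splitting. A minimal sketch, assuming the windows are produced with the model's own tokenizer and then passed to the splitter one by one (the helper is illustrative, not part of the chonky library):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_mmbert_small_multilingual_1")

def windows(long_text, max_tokens=1024):
    # Illustrative helper: yield decoded windows of at most max_tokens tokens each.
    ids = tokenizer(long_text, add_special_tokens=False)["input_ids"]
    for start in range(0, len(ids), max_tokens):
        yield tokenizer.decode(ids[start:start + max_tokens])

for piece in windows(text):
    for chunk in splitter(piece):
        print(chunk)
        print("--")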

Alternatively, you can use this model with the standard NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_mmbert_small_multilingual_1"

tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Two token classes: "O" for ordinary tokens, "separator" for chunk boundaries
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Token-classification ("ner") pipeline; "simple" aggregation merges adjacent separator tokens into spans
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

pipe(text)

Sample output

[{'entity_group': 'separator',
  'score': np.float32(0.66304857),
  'word': ' deep',
  'start': 332,
  'end': 337}]
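The pipeline returns only the separator spans; to recover the chunks themselves, you can cut the input text at each predicted end offset. A small sketch (not part of the model card) based on the output format shown above:

def chunks_from_separators(text, separators):
    # Cut the text after every predicted separator span; the remainder becomes the last chunk.
    chunks, prev = [], 0
    for sep in separators:
        chunks.append(text[prev:sep["end"]].strip())
        prev = sep["end"]
    if prev < len(text):
        chunks.append(text[prev:].strip())
    return chunks

for chunk in chunks_from_separators(text, pipe(text)):
    print(chunk)
    print("--")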

Training Data

The model was trained to split paragraphs from the MiniPile, BookCorpus, and Project Gutenberg datasets.

Metrics

Token-based F1 score.
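For intuition, this can be read as binary F1 over per-token labels, where 1 marks a token that ends a chunk. A toy illustration (not the actual evaluation code), assuming scikit-learn:

from sklearn.metrics import f1_score

# Toy example: gold vs. predicted separator labels for eight tokens.
gold = [0, 0, 1, 0, 0, 0, 1, 0]
pred = [0, 0, 1, 0, 0, 1, 0, 0]
print(f1_score(gold, pred))  # F1 of the positive ("separator") class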

Project Gutenberg validation:

| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multi_1 🆕 | 0.88 | 0.78 | 0.91 | 0.93 | 0.86 | 0.81 | 0.81 | 0.88 | 0.97 | 0.91 | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1M | 1M | 1M | 1M | 1M | 1M | 38k | 1M | 24k | 1M | 132k |

Various English datasets:

| Model | bookcorpus | en_judgements | paul_graham | 20_newsgroups |
|---|---|---|---|---|
| chonky_modernbert_large_1 | 0.79 | 0.29 | 0.69 | 0.17 |
| chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |

Hardware

The model was fine-tuned on a single H100 for several hours.
