Norway's 2 petabytes of Huawei flash storage and LLM training

Original link: https://www.blocksandfiles.com/flash/2026/05/22/norways-2-petabytes-of-huawei-flash-storage-and-llm-training/5244910

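Norway's National Library is developing a sovereign large language model (LLM) intended to preserve the country's cultural heritage, history and language. Commissioned by the Ministry of Culture, the library is drawing on its 20 PB digitized collection of books, newspapers and broadcast content. To address the lack of a dedicated Norwegian-language model, the library has built a high-performance AI data pipeline. The infrastructure uses 2 PB of Huawei OceanStor Dorado all-flash storage to clean, deduplicate and normalize data before the processed output is sent to Norway's national supercomputer, Sigma2 Olivia, for training. The project highlights the significant technical challenge of moving PB-scale data from cost-optimized archive storage into a high-throughput, low-latency AI training environment, a bottleneck the library's team had to solve on its own. Beyond the technical hurdles, the library is also working out solutions for sovereign AI governance, custom evaluation tools for Norwegian dialects, and large-scale system orchestration. As Marius Husnes, Head of IT Platform, noted, Norway's effort offers an important blueprint for non-English-speaking countries seeking to build AI that reflects their own cultural and historical identity.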

Hacker News discussion: "Norway's 2 PB of Huawei flash storage and LLM training" (blocksandfiles.com), 16 points, posted by rbanffy, 2 comments.

jauntywundrkind: A 384-core CPU cluster? 2 PB? Dell just launched a 2U server that can hold nearly 10 PB of data. It may not offer 384 cores of processing power, but that is entirely feasible today, given that a single Epyc chip has 192 cores! https://www.techradar.com/pro/dell-launches-record-shatterin...

7e: 2 PB? They can't possibly complete training with that amount of data. Maybe in a few years.

Original article

Norway’s National Library is developing a large language model (LLM) that understands the Norwegian language and is using 2 PB of Huawei OceanStor Dorado flash storage in its AI training data pipeline.

Marius Husnes, Head of IT Platform at the library (Nasjonalbiblioteket), discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a Norwegian-language LLM. He asserted that any country with its own language but no sovereign LLM trained in that language was at a disadvantage, because a globally trained, English-speaking LLM would not know about the history, news and culture described in the local language.

Norway’s Ministry of Culture tasked the National Library with building a sovereign AI (LLM) because the library has the single largest digital collection of Norwegian books, newspapers, web pages and so forth in the country. Like many state libraries, it is entitled to receive copies of every published book and all broadcast content. Its legal deposit mandate extends beyond books, as it is duty-bound to collect and preserve all of Norway’s cultural heritage.

An agreement with Norwegian newspapers permitted LLM training on copyrighted content and, Husnes said: “No private company has this.”

The library was also well placed to do this because it had been digitizing its collection since 2005 and had amassed 20 PB of unique data stored in 3-2-1 form (3 copies, 2 media types, 1 off-site), meaning some 60 PB overall. Digitizing the raw text, sound, moving pictures, still images and web content involved extensive OCR scanning and generated a lot of metadata, as well as APIs for online access.

The bulk of the data was deposited in a preservation system, a digital disk-plus-tape archive. Husnes’ task was to get this data to the LLM training system. He said the bottleneck was not compute; it was data quality, cleaning and pipeline throughput. There were two main processing stages. First there was in-house computation, using an Nvidia DGX H200 system, a 384-core CPU cluster and multiple Huawei OceanStor Dorado all-flash arrays totalling 2 PB of flash capacity. This is low-latency storage for the data pipelines and training preparation.

Husnes - training national LLM.

The pipeline has data ingestion, cleaning, deduplication, format normalization, validation and preparation steps. Once the data has passed through the pipeline, it is sent to Norway’s national supercomputer, the Sigma2 Olivia system, for the actual training runs. The Olivia system is an HPE Cray Supercomputing EX system with 448 GPUs and 64,512 CPU cores. It uses a 5.3 PB Cray ClusterStor E1000 storage system.
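The article does not describe how the library implements these stages. As a rough illustration only, the Python sketch below chains format normalization, cleaning, validation and exact-match deduplication over an in-memory list of documents. Every function name, the `min_chars` threshold and the hashing choice are assumptions made for this example, not details from the talk.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Format normalization: put text into Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def clean(text: str) -> str:
    """Drop control/format characters (newline and tab are kept so they
    become spaces), then collapse all whitespace runs to single spaces."""
    kept = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    return " ".join(kept.split())


def is_valid(text: str, min_chars: int) -> bool:
    """Validation: reject fragments shorter than a hypothetical threshold."""
    return len(text) >= min_chars


def pipeline(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield normalized, cleaned, valid documents, skipping exact duplicates.

    Deduplication here is exact matching on a SHA-256 digest of the
    case-folded text; near-duplicate detection is out of scope.
    """
    seen = set()
    for raw in docs:
        text = clean(normalize(raw))
        if not is_valid(text, min_chars):
            continue
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Calling `list(pipeline(docs))` on a small list of strings returns the surviving documents in input order. At PB scale the in-memory `seen` set would not be practical; a real pipeline would need a distributed or on-disk index, which this sketch deliberately leaves out.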

One large problem area has been reconciling the needs of two different storage systems. The 60 PB preservation system is optimized for durability and cost, not fast IO, and has high read latency because it is designed for infrequent access. The AI pipeline storage is designed for high-throughput, low-latency, parallel data IO. Husnes said he learnt that nobody was talking about the problems involved in moving PB-scale datasets from an archive to, and through, an AI data pipeline system. His team had to find out how to do it themselves.
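To give a feel for why archive read throughput matters at this scale, here is a back-of-the-envelope Python calculation of how long a bulk copy takes at a sustained rate. The article gives no throughput figures for either system, so the rates used below are purely hypothetical inputs, and the function name is an invention for this example.

```python
PB_IN_BYTES = 10**15      # decimal petabyte
GB_IN_BYTES = 10**9       # decimal gigabyte
SECONDS_PER_DAY = 86_400


def transfer_days(size_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to move size_pb petabytes at a sustained rate.

    Ignores tape mount times, retries and protocol overhead, so the
    result is a lower bound rather than an estimate of real behaviour.
    """
    seconds = size_pb * PB_IN_BYTES / (throughput_gb_per_s * GB_IN_BYTES)
    return seconds / SECONDS_PER_DAY


# Hypothetical rates only; neither figure comes from the article.
for rate in (1, 10):
    print(f"2 PB at {rate} GB/s: {transfer_days(2, rate):.1f} days")
```

Under these assumed rates, filling 2 PB takes roughly 23 days at 1 GB/s and a little over 2 days at 10 GB/s, which shows how strongly the sustained read rate of the source system determines how quickly a pipeline can be fed.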

Husnes - preservation and AI pipeline storage.

The LLM training is ongoing, and he finished his talk with a summary of what his team is still learning about:

Evaluation - there are no standard evaluation tools to assess a sovereign Norwegian LLM. The language has two written forms, multiple dialects and historical changes. They are building their own evaluation tool on the fly.
Governance - who controls access to a sovereign LLM? Who decides what it can be used for? These are institutional and political questions with no easy answers.
Orchestration - making three systems (the preservation archive, the on-prem AI environment and the national Sigma2 supercomputer) work smoothly together is an ongoing project.

Our takeaways here are, one, that Huawei storage is playing a serious and significant role in the European market, and two, that any country developing a sovereign, local language LLM would do well to consult with Husnes and get acquainted with what’s involved.

As Husnes put it: Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.
