TScale – Distributed training on consumer GPUs

原始链接: https://github.com/Foreseerr/TScale

TScale is a C++/CUDA transformer library designed for efficient training and inference of large language models (LLMs) on consumer hardware, particularly NVIDIA GPUs. It achieves this through an optimized architecture, faster convergence, reduced attention costs, and support for FP8/INT8 precision. Key features include: CPU offload to reduce GPU memory usage; synchronous distributed training across homogeneous hosts; and asynchronous distributed training across heterogeneous, geographically dispersed hosts, using 1-bit gradient compression to minimize network overhead. TScale also demonstrates a novel take on "model size": a smaller model paired with a huge (1TB) index for token prediction, which dramatically reduces perplexity. Building requires CUDA v12.3 and a C++ compiler (MSVC on Windows, CMake/Clang on Linux). Training is driven by data and train scripts, and distributed training supports a power-of-2 number of worker hosts, each of which can use multiple GPUs. Inference is provided by gpt_infer, a basic HTTP server that serves model continuations, though it is currently optimized for demonstration rather than speed.

A Hacker News thread discusses TScale, a project for distributed training on consumer GPUs. One user pointed out a missing file, and another speculated that it was a hastily released weekend project, criticizing its code quality and its reimplementation of a config-file parser. This sparked a debate in which some argued that large language models may encourage reinventing the wheel, while others noted how hard it is to manage C/C++ dependencies. The thread also touched on TScale's "1T index" technique for keeping the model itself small, and on the difficulty of splitting inference across multiple hosts due to network bottlenecks. Further discussion drifted into the broader AI supply chain, ASML's role in semiconductor manufacturing, and the potential geopolitical impact of an ASML shutdown, concluding that while ASML is critical, redundancy and other countries catching up would soften the blow.

Original text

This repo contains transformer training and inference code written in C++ and CUDA.

TScale is designed to run on consumer hardware. To achieve the best results it features

  • Optimized transformer architecture with faster convergence and ~2x reduced attention costs
  • Support for fp8 and int8 precision for model weights and activations
  • Optimized for consumer NVIDIA GPUs, including fast reduced-precision training without sacrificing model quality
  • CPU offload reduces GPU memory requirements for training
  • Sync distributed training on several identically configured hosts
  • 1-bit gradient compression, allowing regular Ethernet links to be used as the interconnect (see the sketch after this list)
  • Async distributed training on arbitrary hosts with negligible network traffic. In this mode training can be run on geographically separated hosts
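
The 1-bit compression mentioned in the list above generally works by sending only the sign of each gradient component plus a single scale, and feeding the quantization error back into the next step so nothing is lost on average. The C++ sketch below illustrates that generic sign-plus-error-feedback pattern; it is not TScale's actual implementation, and all names (TBitCompressor, Compress, Decompress) are illustrative.

#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch of 1-bit gradient compression with error feedback.
// Not TScale's actual code; the structure and names are assumptions.
struct TBitCompressor {
    std::vector<float> Residual; // carries quantization error into the next step

    // Pack sign bits of (grad + residual) and compute a scale the receiver
    // can use to reconstruct magnitudes.
    std::vector<uint32_t> Compress(const std::vector<float> &grad, float *scale) {
        if (Residual.empty())
            Residual.assign(grad.size(), 0.0f);
        std::vector<uint32_t> bits((grad.size() + 31) / 32, 0);
        double sumAbs = 0;
        for (size_t i = 0; i < grad.size(); ++i)
            sumAbs += std::fabs(grad[i] + Residual[i]);
        *scale = grad.empty() ? 0.0f : float(sumAbs / grad.size());
        for (size_t i = 0; i < grad.size(); ++i) {
            float v = grad[i] + Residual[i];
            float q = (v >= 0) ? *scale : -*scale; // value the receiver will reconstruct
            if (v >= 0)
                bits[i / 32] |= (1u << (i % 32));  // store only the sign
            Residual[i] = v - q;                   // error feedback
        }
        return bits;
    }
};

// Receiver side: rebuild an approximate gradient from sign bits and the scale.
std::vector<float> Decompress(const std::vector<uint32_t> &bits, size_t count, float scale) {
    std::vector<float> grad(count);
    for (size_t i = 0; i < count; ++i)
        grad[i] = ((bits[i / 32] >> (i % 32)) & 1) ? scale : -scale;
    return grad;
}

Because each component is reduced to one bit plus a shared scale, per-step gradient traffic shrinks to roughly 1/32 of fp32, which is why a regular Ethernet link can keep up.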

By using inexpensive GPUs and the async distributed mode, TScale trains LLMs fast and affordably. Log loss for the 1.5B model trained on fineweb-edu for 2 days and $500 on several 4090 spot instances:

[Train loss graph]

A 1T model size sounds beyond the reach of most people and even organisations. However, if we consider creative ways to count model size, nothing is impossible. In this case we build a model with a 1T index that we look up for every token, so the prediction itself is made by a much smaller model. In terms of log loss/perplexity this construction easily achieves stellar results. The index for fineweb-edu occupies about 1T of disk space. A training run of a 125M model with this ~1T index achieves an 8x perplexity reduction:

Model             Perplexity
125M              19.02
125M + 1T index   2.28
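
The README does not spell out here how the index and the small model are combined. A common pattern for pairing a language model with a large retrieval index (kNN-LM-style) is to mix the model's next-token distribution with a distribution derived from index lookups for the current context; the sketch below shows only that generic pattern with an assumed interpolation weight, and is not TScale's actual indexing scheme.

#include <vector>

// Generic sketch: blend a small model's next-token distribution with a
// distribution obtained by looking the current context up in a large on-disk
// index. A common retrieval-augmented pattern, not TScale's actual method.
std::vector<float> BlendNextTokenProbs(const std::vector<float> &modelProbs, // from the 125M model
                                       const std::vector<float> &indexProbs, // from ~1T index lookups
                                       float lambda)                         // mixing weight (assumed)
{
    std::vector<float> res(modelProbs.size());
    for (size_t i = 0; i < modelProbs.size(); ++i)
        res[i] = lambda * indexProbs[i] + (1 - lambda) * modelProbs[i];
    return res;
}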

Training 125M model

Training 1.5B model

Training 1T (!) model in your kitchen

Async distributed train

Notes on model and compute precision

TScale transformer model

Data indexing

Tokenizer

To build the code, CUDA v12.3 and a C++ compiler are required: MSVC for Windows, CMake + Clang for Linux. To support cross-platform build file generation this repo uses fo, a lightweight solution/build file generator. To generate build files, compile fo/fo.cpp and run it with two arguments: the first is the root of the source tree, the second is the directory to store the build files in.

D:\TScale>fo.exe code sln

Then open code.sln from D:\TScale\sln\code.sln.

To compile TScale for Linux, compile fo.cpp, generate the CMakeLists.txt file, run cmake, and run make.

~/TScale/fo$ clang++17 fo.cpp -o fo
~/TScale/fo$ cd ..
~/TScale$ ./fo/fo code make.dir
~/TScale$ cd make.dir
~/TScale/make.dir$ cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo .
~/TScale/make.dir$ make

Examples in the code use the enwik9 dataset and its truncated version enwik8. The Hugging Face hosted datasets openwebtext, ontocord/CulturaY, and danasone/librusec are also used in examples. To import them, use hf_import.

gpt_train is used to train a model. It is controlled by a train script and a data script. Default scripts are stored in main_gpt.cpp. To load the scripts from files, run gpt_train with the '-d data_script.txt -s train_script.txt' arguments.

Compile gpt_train. Run it in the root directory:

~/TScale$ ./make.dir/gpt_train
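
For example, assuming the scripts sit in the current directory and the binary was built into make.dir as above, a run driven by external scripts might look like:

~/TScale$ ./make.dir/gpt_train -d data_script.txt -s train_script.txt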

Currently, training can only be distributed among a power-of-2 (pow2) number of worker hosts.

To start a worker process, run gpt_train with the '-w 10000' argument, where 10000 is the port number to use.

To run the master process, call the net_train('worker.txt') function in the train script. List the worker IP addresses in the file provided to net_train().

To use multiple GPU devices, set the DEVICE_COUNT variable in the train script to the number of GPUs to use. For distributed runs, DEVICE_COUNT applies to each worker; heterogeneous configurations are not supported.
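
Putting the distributed pieces together: start a worker on every host, list the workers' addresses in a file, and point net_train() at that file from the master's train script. The snippet below is a hedged sketch; the one-address-per-line worker.txt format and the exact train-script syntax are assumptions based only on the names mentioned above.

On each worker host (here using port 10000):

~/TScale$ ./make.dir/gpt_train -w 10000

worker.txt on the master host (assumed format, one worker address per line; two workers satisfy the pow2 requirement):

192.168.1.10
192.168.1.11

In the master's train script (assumed syntax):

DEVICE_COUNT = 4
net_train('worker.txt')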

Description of scripts used in training: data script, train script

To try inference with a trained model you can use gpt_infer. It runs a basic HTTP server on port 11311 and allows sampling continuations from the model. The current implementation is slow and designed for demonstration purposes only.

MIT
