SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Original link: https://machinelearning.apple.com/research/seedlm-compressing

SeedLM is a new data-free post-training compression method for large language models (LLMs), designed to reduce the high runtime cost of deploying them. The method uses a Linear Feedback Shift Register (LFSR), seeded with a compact seed value, to reconstruct weight blocks during inference; in essence, SeedLM replaces frequent memory accesses with on-the-fly computation. By finding a suitable seed for each weight block, the LFSR generates a random matrix that, linearly combined with compressed coefficients, accurately reconstructs the original weights. This trades compute for memory access, which is especially beneficial for memory-bound workloads and yields faster inference. Crucially, SeedLM requires no calibration data, so it generalizes broadly across tasks. Tests on the demanding Llama3 70B model show that SeedLM achieves accuracy on par with or better than state-of-the-art methods at 4-bit and 3-bit compression, while maintaining performance comparable to the FP16 baseline. FPGA tests further confirm that, as model size grows, 4-bit SeedLM approaches a 4x speed-up over FP16 Llama 2/3 baselines.

Paper (by researchers at Apple and Meta): https://arxiv.org/abs/2410.10714

Original text

Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama3 70B, which is particularly challenging, show zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
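The core idea above can be sketched in a few lines of numpy: generate a pseudo-random matrix from an LFSR seed, fit the compressed coefficients by least squares, and reconstruct each weight block as the matrix-coefficient product. This is a minimal illustration, not the paper's implementation: the register width, tap positions, seed-search range, and block/rank sizes are assumed for the example, and the real method also quantizes the coefficients (kept as floats here for clarity).

```python
import numpy as np

def lfsr_bits(seed, n_bits, taps=(16, 14, 13, 11), width=16):
    """Fibonacci LFSR bit stream. The 16-bit width and tap polynomial
    here are a standard maximal-length choice, assumed for illustration;
    the paper's exact feedback polynomial may differ."""
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR state must be nonzero"
    bits = []
    for _ in range(n_bits):
        bits.append(state & 1)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = (state >> 1) | (fb << (width - 1))
    return np.array(bits, dtype=np.float64)

def lfsr_matrix(seed, rows, cols):
    """Map the LFSR bit stream to a {-1, +1} matrix of shape (rows, cols)."""
    bits = lfsr_bits(seed, rows * cols)
    return 2.0 * bits.reshape(rows, cols) - 1.0

def compress_block(w, rank=4, n_seeds=256):
    """Search candidate seeds; for each, fit coefficients t by least
    squares and keep the seed with the smallest reconstruction error.
    Only (seed, t) need to be stored for the block."""
    best = None
    for seed in range(1, n_seeds + 1):
        U = lfsr_matrix(seed, w.size, rank)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = np.linalg.norm(U @ t - w)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]

def decompress_block(seed, block_size, t):
    """At inference time, regenerate U from the seed (compute, not
    memory traffic) and reconstruct the block as U @ t."""
    U = lfsr_matrix(seed, block_size, t.size)
    return U @ t
```

Because the zero coefficient vector is always feasible, the least-squares fit guarantees the reconstruction error never exceeds the norm of the block itself; in the actual method the seed search and quantized coefficients are chosen so the per-block error stays small enough to preserve zero-shot accuracy.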

† Meta
