Mistral: Mixtral of Experts

原始链接: https://mistral.ai/news/mixtral-of-experts/

Introducing Mixtral, the latest addition to the family of open models developed by Mistral AI. Mixtral is a state-of-the-art sparse mixture-of-experts model designed to handle complex tasks efficiently, and its open weights let developers adapt and fine-tune it to their own requirements. With 46.7B total parameters, Mixtral delivers impressive language capabilities and performs strongly across a wide range of benchmarks. Compared with other popular models such as Llama 2 70B and GPT3.5, its graceful handling of contexts up to 32k tokens makes it particularly well suited to longer texts and multilingual content. Developers can use Mixtral in several ways: deploy it with an open-source stack, call it through the endpoint offered by Mistral AI, become a partner, or submit a press enquiry. Mistral AI aims to operate transparently and in accordance with its privacy policy and terms of service and use. For further details or questions, get in touch by email or on LinkedIn.

Mixtral takes its name from the mixture-of-experts technique it is built on: at every layer, a router network selects two of eight groups of feedforward parameters (the “experts”) for each token and combines their outputs. This lets the model hold 46.7B total parameters while using only about 12.9B per token, so it runs at the speed and cost of a much smaller dense model. On most standard benchmarks Mixtral matches or outperforms Llama 2 70B and GPT3.5, with roughly 6x faster inference than Llama 2 70B, and it handles English, French, Italian, German, and Spanish as well as code generation. An instruction-tuned variant, Mixtral 8x7B Instruct, reaches a score of 8.30 on MT-Bench. Overall, the release marks rapid progress toward increasingly capable open models that can be fine-tuned for specialised skills and deployed with a fully open-source stack.

Original Text

Mistral AI continues its mission to deliver the best open models to the developer community. Moving forward in AI requires taking new technological turns beyond reusing well-known architectures and training paradigms. Most importantly, it requires making the community benefit from original models to foster new inventions and usages.

Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture-of-experts model (SMoE) with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.

Mixtral has the following capabilities.

  • It gracefully handles a context of 32k tokens.
  • It handles English, French, Italian, German and Spanish.
  • It shows strong performance in code generation.
  • It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

Pushing the frontier of open models with sparse architectures

Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.
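For intuition, here is a minimal PyTorch sketch of such a top-2 routed feedforward block. The layer sizes, the SwiGLU-style expert shape, and all names are illustrative assumptions, not Mixtral's actual implementation.

    # Minimal sketch of a top-2 sparse mixture-of-experts feedforward block.
    # Dimensions and the SwiGLU-style expert are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Expert(nn.Module):
        """One feedforward 'expert' (SwiGLU-style MLP)."""
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.gate = nn.Linear(d_model, d_ff, bias=False)
            self.up = nn.Linear(d_model, d_ff, bias=False)
            self.down = nn.Linear(d_ff, d_model, bias=False)

        def forward(self, x):
            return self.down(F.silu(self.gate(x)) * self.up(x))

    class SparseMoEBlock(nn.Module):
        """Router picks 2 of 8 experts per token and sums their weighted outputs."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                       # x: (n_tokens, d_model)
            logits = self.router(x)                 # (n_tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)    # normalise over the chosen experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e           # tokens routed to expert e in slot k
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

    # Usage: route a batch of token representations through the block.
    block = SparseMoEBlock(d_model=64, d_ff=256)
    tokens = torch.randn(10, 64)
    print(block(tokens).shape)   # torch.Size([10, 64])

Only the two selected experts run for each token, which is where the compute savings in the next paragraph come from.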

This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12.9B model.
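As a rough sanity check on those numbers, the back-of-the-envelope count below uses dimensions from the open-weight release (hidden size 4096, expert FFN size 14336, 32 layers, 8 experts, grouped-query attention with 8 KV heads, a 32k vocabulary). It ignores norms and other small terms, so treat it as an approximation rather than an official breakdown.

    # Back-of-the-envelope parameter count for a Mixtral-style model.
    # Configuration values are approximate and taken from the open-weight
    # release; norms and biases are ignored.
    d_model, d_ff = 4096, 14336          # hidden size, expert FFN size
    n_layers, n_experts, top_k = 32, 8, 2
    n_heads, n_kv_heads = 32, 8          # grouped-query attention
    vocab = 32000
    head_dim = d_model // n_heads

    attn = d_model * d_model * 2 + d_model * n_kv_heads * head_dim * 2   # Wq, Wo + Wk, Wv
    expert = 3 * d_model * d_ff                                          # gate, up, down
    router = d_model * n_experts
    embeddings = 2 * vocab * d_model                                     # input + output

    total = n_layers * (attn + n_experts * expert + router) + embeddings
    active = n_layers * (attn + top_k * expert + router) + embeddings

    print(f"total  ~ {total / 1e9:.1f}B parameters")    # ~ 46.7B
    print(f"active ~ {active / 1e9:.1f}B per token")    # ~ 12.9B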

Mixtral is pre-trained on data extracted from the open Web – we train experts and routers simultaneously.

Performance

We compare Mixtral to the Llama 2 family and the GPT3.5 base model. Mixtral matches or outperforms Llama 2 70B, as well as GPT3.5, on most benchmarks.

In the following figure, we measure the quality versus inference budget tradeoff. Mistral 7B and Mixtral 8x7B belong to a family of highly efficient models compared to Llama 2 models.

The following table gives detailed results for the figure above.

Hallucination and biases. To identify possible flaws to be corrected by fine-tuning / preference modelling, we measure the base model performance on TruthfulQA/BBQ/BOLD.

Compared to Llama 2, Mixtral is more truthful (73.9% vs 50.2% on the TruthfulQA benchmark) and presents less bias on the BBQ benchmark. Overall, Mixtral displays more positive sentiments than Llama 2 on BOLD, with similar variances within each dimension.

Language. Mixtral 8x7B masters French, German, Spanish, Italian, and English.

Instructed models

We release Mixtral 8x7B Instruct alongside Mixtral 8x7B. This model has been optimised through supervised fine-tuning and direct preference optimisation (DPO) for careful instruction following. On MT-Bench, it reaches a score of 8.30, making it the best open-source model, with a performance comparable to GPT3.5.

Note: Mixtral can be gracefully prompted to ban certain outputs, which is useful for building applications that require a strong level of moderation, as exemplified here. Proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will simply follow whatever instructions it is given.
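A minimal sketch of this kind of prompting with the Hugging Face transformers library is shown below. The model id, the guardrail wording, and the generation settings are assumptions for illustration, not an official recipe.

    # Sketch: prepend a moderation/guardrail instruction to the user turn
    # before generating with the instruct model. The model id and the
    # guardrail wording below are illustrative assumptions.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    guardrail = (
        "Always assist with care, respect, and truth. "
        "Avoid harmful, unethical, or prejudiced content."
    )
    user_question = "How do I write a polite rejection email?"

    # The instruct model uses [INST] ... [/INST] turns; the tokenizer's
    # chat template handles that formatting for us.
    messages = [{"role": "user", "content": f"{guardrail}\n\n{user_question}"}]
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))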

Deploy Mixtral with an open-source deployment stack

To enable the community to run Mixtral with a fully open-source stack, we have submitted changes to the vLLM project, which integrates Megablocks CUDA kernels for efficient inference.
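As one possible sketch, the snippet below uses vLLM's offline Python API; the model id and tensor_parallel_size are assumptions that depend on your checkpoint and hardware, and a vLLM version with Mixtral support is required.

    # Sketch: run Mixtral with vLLM's offline Python API.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        tensor_parallel_size=2,          # shard the model across 2 GPUs; adjust to your hardware
    )
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(
        ["[INST] Explain mixture-of-experts in one paragraph. [/INST]"], params
    )
    print(outputs[0].outputs[0].text)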

SkyPilot allows the deployment of vLLM endpoints on any cloud instance.

Use Mixtral on our platform

We’re currently using Mixtral 8x7B behind our endpoint mistral-small, which is available in beta. Register to get early access to all generative and embedding endpoints.
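A minimal sketch of calling the hosted endpoint over HTTP is shown below; the URL, payload shape, and model name follow the public API at the time of writing and may change, and an API key from the platform is required.

    # Sketch: call the hosted mistral-small endpoint via the chat
    # completions API. Requires the MISTRAL_API_KEY environment variable.
    import os
    import requests

    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={
            "model": "mistral-small",
            "messages": [{"role": "user", "content": "Summarise what a sparse MoE is."}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])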

Acknowledgement

We thank the CoreWeave and Scaleway teams for their technical support while we trained our models.
