Performance overview: Mixtral Mixture-of-Experts (MoE 8x7B) with vLLM
Explore the performance benchmarks of the latest Mixtral-8x7B (Mixture of Experts) with our CPO Oscar Rovira.
4 min read · Published 12/12/2023 · Tags: MLOps · Author: Oscar Rovira
About Mixtral and vLLM
Mixtral 8x7B is an exciting new LLM released by Mistral AI that sets a new standard for open-access models and outperforms GPT-3.5 across many benchmarks.
Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the "experts") to process the token and combines their outputs additively.
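To make the routing concrete, here is a minimal, illustrative sketch of top-2 expert routing in PyTorch. It is not Mixtral's actual implementation; the `router` and `experts` objects are assumed stand-ins for a learned gating layer and the 8 feed-forward experts.

```python
import torch
import torch.nn.functional as F

def moe_layer(token, router, experts, top_k=2):
    """Illustrative top-2 routing for a single token representation.

    token:   1-D tensor of hidden size
    router:  nn.Linear(hidden_size, num_experts) producing routing logits
    experts: list of feed-forward modules (Mixtral has 8 per layer)
    """
    logits = router(token)                        # one score per expert
    weights, indices = torch.topk(logits, top_k)  # keep the two best-scoring experts
    weights = F.softmax(weights, dim=-1)          # renormalise over the selected experts

    # Only the selected experts run; their outputs are combined additively.
    return sum(w * experts[i](token) for w, i in zip(weights, indices))
```

Because only 2 of the 8 experts run per token, each token touches roughly 13B of the model's ~47B parameters, which keeps inference cost closer to that of a much smaller dense model.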
Features:
- Outperforms Llama 2 70B and matches or beats GPT-3.5 on most benchmarks.
- Context length of 32k tokens.
- Languages supported: English, French, Italian, German and Spanish.
- Strong performance in code generation.
- Commercially permissive license.
The image below (from the Hugging Face blog) shows how Mixtral Instruct ranks on MT-Bench compared to other SOTA LLMs. MT-Bench tests how natural and meaningful a conversation with an LLM can be.
We ran a set of experiments to benchmark the throughput and latency of running Mixtral with vLLM.
For context, vLLM is an open-source inference engine that enables high-throughput and memory-efficient inference for LLMs. Its features include optimized CUDA kernels, continuous batching and PagedAttention (which reduces wasted KV-cache memory and allows it to be shared during inference - especially useful for long input sequences), among other optimizations. As of today, vLLM is Mystic's recommended way for our customers to run LLMs faster on our platform.
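As a rough illustration of the setup benchmarked below, Mixtral can be served with vLLM's offline LLM API. The model ID, parallelism and sampling settings here are assumptions for the sketch, not our exact benchmark harness.

```python
from vllm import LLM, SamplingParams

# Load Mixtral Instruct across 2 GPUs with tensor parallelism, float16 weights.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```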
Performance benchmark
We used 2x A100 80GB GPUs for inference, running at float16 precision and repeating each sweep 4 times to extract the average latencies and throughputs.
For both plots we used the following settings (a simplified sketch of the measurement loop follows the list):
- Batch sizes: [1, 4, 8, 16, 32, 64, 128]
- Input tokens: 1000
- Output tokens: [1, 16, 64, 128, 256, 512, 2048]
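Below is a simplified sketch of how such a sweep could be measured with vLLM. Our actual harness may differ; the prompt construction is a rough stand-in for a 1000-token input, and `ignore_eos` is used so every request generates the full number of output tokens.

```python
import time
from vllm import LLM, SamplingParams

BATCH_SIZES = [1, 4, 8, 16, 32, 64, 128]
OUTPUT_TOKENS = [1, 16, 64, 128, 256, 512, 2048]
REPEATS = 4

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
          tensor_parallel_size=2, dtype="float16")
prompt = "word " * 1000  # rough ~1000-token input; a real benchmark would count tokens exactly

results = []
for batch_size in BATCH_SIZES:
    for max_tokens in OUTPUT_TOKENS:
        params = SamplingParams(max_tokens=max_tokens, ignore_eos=True)
        latencies = []
        for _ in range(REPEATS):
            start = time.perf_counter()
            llm.generate([prompt] * batch_size, params)
            latencies.append(time.perf_counter() - start)
        latency = sum(latencies) / REPEATS
        throughput = batch_size * max_tokens / latency  # generated tokens per second
        results.append((batch_size, max_tokens, latency, throughput))
```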
In the first plot, we answer the question: how long does it take to serve requests of different output lengths? We show how latency varies as more tokens are requested from the model. Each line corresponds to a different batch size, and each data point to a different number of generated tokens.
The plot shows that the number of generated tokens has a much larger effect on latency than batch size does. In other words, the time it takes to serve requests is driven more by the length of the sequences being processed than by how many requests are handled simultaneously. Increasing the batch size is therefore advantageous when feasible: it lets you process more inputs at little additional latency cost, which is particularly valuable when a high volume of requests must be served.
In the second plot, we show throughput against latency. It answers the question: what is the optimal operating point for this LLM? Each line corresponds to a different number of generated tokens, and each subsequent data point to a different batch size.
Plotting these two metrics against each other helps in understanding the trade-off between processing a large number of requests quickly (high throughput) and minimizing the delay for each request (low latency). This balance is crucial for LLMs, especially in user-facing applications where both low response time and the simultaneous handling of multiple users are equally important.
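One way to act on this trade-off, continuing the hypothetical sweep sketch above, is to pick the highest-throughput configuration that still meets a latency budget. The 10-second SLO below is purely illustrative.

```python
def best_operating_point(results, latency_slo_s):
    """Return the highest-throughput (batch_size, max_tokens, latency, throughput)
    tuple from the sweep results that still meets the latency budget."""
    feasible = [r for r in results if r[2] <= latency_slo_s]
    return max(feasible, key=lambda r: r[3]) if feasible else None

# Example: serve as many tokens per second as possible while keeping
# end-to-end latency under 10 seconds per request.
# point = best_operating_point(results, latency_slo_s=10.0)
```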
About Mystic
At Mystic AI, we've developed a platform that lets you run any open-source or custom ML model as an API on your own cloud. Our system is built around simplicity and enables users to instantly leverage advanced optimisations like GPU fractionalization and smart auto-scaling without the engineering hassle. With our Python SDK, you can upload your own model and our platform handles all the engineering for a secure, scalable and cost-efficient endpoint running in your own cloud or on-prem.
Book a demo | Explore Pipeline Core (Enterprise)
Bibliography:
- https://mistral.ai/news/mixtral-of-experts/
- https://huggingface.co/blog/mixtral
- https://github.com/vllm-project/vllm
- https://github.com/mystic-ai/pipeline