Technical Article · Announcement
Llama-Juiced: The Fastest Llama 3.1-8B Endpoint (1.89× Faster than vLLM and Available on AWS)
Jun 24, 2025

Louis Leconte
ML Research Engineer

Quentin Sinig
Go-to-Market Lead

Bertrand Charpentier
Cofounder, President & Chief Scientist
Large Language Models (LLMs) like GPT, LLaMA, and Mistral have transformed the landscape of artificial intelligence, enabling capabilities like natural conversation, summarization, code generation, and reasoning. However, leveraging these models in real-world applications, be it in chatbots, agents, or automated workflows, requires both an efficient LLM and efficient serving: compressing the model itself and managing the requests around it to achieve scalable, low-latency, and cost-effective inference. In this blog, we focus on Amazon SageMaker, a fully managed platform that provides end-to-end capabilities from data labeling and model tuning to inference deployment and monitoring. More specifically, we show how Amazon SageMaker can combine state-of-the-art compression from Pruna AI with popular serving frameworks like TritonServer and vLLM to efficiently deploy LLMs at scale.
Why benchmark LLM and serving efficiency?
Running LLMs on AWS should be straightforward: you pick a base model, spin up an endpoint, and that’s it. Your software team can build an agent. But in practice, driving adoption means delivering speed. If the experience lags, users will default to ChatGPT & co, and you’ll face shadow AI. Once adoption kicks in, the next challenge is cost. You need confidence that scaling won’t send your infrastructure budget off a cliff. AWS won’t cover you with credits for life, right?
We wanted to see:
👉 Can you stay fully on AWS, use open tooling, and still build high-performing, cost-efficient applications?
👉 Can Pruna, combined with TritonServer or vLLM, already outperform default setups, whether it's a vanilla base model or an already pre-optimized one?
All of that without quality loss.
Turns out: yes.
TL;DR
In terms of cost efficiency, the Llama 3.1-8B (meta-llama/Llama-3.1-8B-Instruct) deployed with Pruna + vLLM is up to 3.31× more efficient than the vanilla base model, and up to 2.52× more efficient than the model deployed with vLLM alone.
In terms of inference speed, the Llama 3.1-8B (meta-llama/Llama-3.1-8B-Instruct) deployed with Pruna + vLLM is up to 2.6× faster than the vanilla base model, and up to 1.89× faster than the model deployed with vLLM alone.
Got your attention? Good. Now let’s break down what the numbers really say.
Let’s dig into the latency numbers

Using a serving platform is a must. Even before tuning anything else, pairing Pruna with TritonServer or vLLM already unlocks massive performance gains on vanilla Llama 3.1-8B.
If Time-to-First-Token (TTFT)* is your bottleneck, Pruna + vLLM comes out on top. Indeed, any app (copilot-style tools, autocomplete suggestions…) where users are actively waiting for the model to respond in the loop benefits heavily from low TTFT. Once Pruna OSS integration is released, expect an additional 1.89× boost over the current vLLM performance.
*TTFT (Time-to-First-Token): measures how long it takes after sending a request to receive the first generated token. TTFT reflects queueing time, prompt tokenization, model activation, and early decoding. It is crucial in interactive settings where responsiveness is key.
If Inter-Token-Latency (ITL)* matters most (remember, it’s critical when the full output matters, and you need it fast, token by token), Pruna + TritonServer leads at smaller batch sizes (1 or 2). That makes it ideal for chat-like use cases or long-form content generation. At larger batch sizes (4 or 8), vLLM performs well on its own when it comes to TTFT. However, you can further reduce inter-token latency (ITL) by serving a Pruna-optimized LLM on top of vLLM.
*ITL (Inter-Token-Latency): measures how quickly the model produces each subsequent token after the first. It reflects the speed of the generation loop, which includes decoding, attention computation, and streaming to clients.
Note: We did not explore any sampling/batching/prefilling/KV-cache strategies in this blog. The optimization we propose with Pruna here lies at the model level. Hence, all compatible methods (like continuous batching or KV cache sharing, for example) can be combined with Pruna.
Benchmarked on Llama 3.1 8B (no fine-tuning, no adapters, no distributed inference) using SageMaker on L40 GPUs, with batch sizes 1 to 8 and prompt/output lengths of 50 and 400 tokens. We focused on latency (TTFT and token speed) with quantization, bfloat16, and compilation, under the constraint that quality remains stable (no visible degradation). All experiments were run offline (localhost) to avoid network noise, and the results reflect the average of 5 independent runs.
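To make the two metrics concrete, here is a minimal sketch of how TTFT and ITL can be measured with a streaming generation loop using Hugging Face transformers. It illustrates the measurement method only; it is not the exact benchmark harness used for the numbers above.

```python
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Summarize the following meeting notes: ...", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True)

start = time.perf_counter()
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=400, streamer=streamer)).start()

# Record the arrival time of each streamed chunk (chunks roughly correspond to tokens).
arrivals = [time.perf_counter() for _ in streamer]

ttft = arrivals[0] - start                                      # Time-to-First-Token
itl = (arrivals[-1] - arrivals[0]) / max(len(arrivals) - 1, 1)  # mean Inter-Token-Latency
print(f"TTFT: {ttft * 1000:.1f} ms | mean ITL: {itl * 1000:.1f} ms")
```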
When it comes to cost per token, how much do we actually save?
Let’s say you're building a GPT or Copilot-like assistant for internal use in a company with 20,000 users. Based on public data from GitHub Copilot usage patterns, we can conservatively estimate that 60% of users use AI assistants actively each day, averaging 3 sessions per day with around 20 requests per session, which brings us to ~792,000 inference requests per day.
50 tokens: reflects lightweight tasks such as autocompletion, CRM note cleanup, or summarizing short messages.
400 tokens: aligns with more involved outputs like drafting follow-up emails, generating campaign briefs, or internal documentation.
Assuming these ~17.4 million inference requests per month split 50/50 between the two workloads above, and using the actual benchmark results where Pruna provides a 1.89× inference speedup (best-case scenario), this leads to the following gains*:

*Total time = TTFT + (n_tokens - 1) × ITL
Considering that the L40 GPU on AWS (g6e.xlarge) is publicly priced at $1.861/hour (on-demand), Pruna + vLLM gives you nearly 3.31× better cost efficiency than a vanilla base model and 2.52× better than vLLM alone, all while speeding up inference.
That’s nearly $43,290/month saved over the baseline, or ~$28,464/month saved over vLLM alone, without compromising quality.
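For transparency on the method, here is a small sketch that applies the total-time formula above to the monthly request volume. The TTFT and ITL values below are hypothetical placeholders rather than the measured benchmark numbers, and the calculation assumes sequential single-request serving, so it illustrates how the savings are derived rather than reproducing the exact figures.

```python
# Illustrative cost arithmetic based on: Total time = TTFT + (n_tokens - 1) * ITL
GPU_PRICE_PER_HOUR = 1.861        # on-demand price used in this post ($/hour)
REQUESTS_PER_MONTH = 17_400_000   # ~792,000 requests/day over ~22 working days
N_TOKENS = (50 + 400) / 2         # 50/50 split between short and long outputs

TTFT = 0.20                       # seconds, hypothetical placeholder
ITL = 0.015                       # seconds per token, hypothetical placeholder

time_per_request = TTFT + (N_TOKENS - 1) * ITL        # seconds
gpu_hours = REQUESTS_PER_MONTH * time_per_request / 3600
monthly_cost = gpu_hours * GPU_PRICE_PER_HOUR
print(f"{gpu_hours:,.0f} GPU-hours -> ${monthly_cost:,.0f}/month")
```

Running the same calculation with the measured TTFT/ITL of each setup (vanilla, vLLM, Pruna + vLLM) and comparing the resulting monthly costs is what turns the latency gains into the dollar savings above.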
Productivity gains are also usually underrated. We estimate 1.27 hours saved per active user per month when using Pruna + vLLM compared to vLLM alone. And let's not forget: faster feedback loops boost productivity and reduce context-switching frustration. That's the kind of value you can't always see in GPU logs, but you do see it in team velocity.
Why compare to vLLM and TritonServer?
Serving LLMs is not merely about running a forward pass through a transformer. It requires sophisticated scheduling, memory management, batching, and GPU orchestration to meet real-time demands. Tools like Triton Inference Server and vLLM have emerged as leading open-source frameworks addressing this challenge, each bringing a unique approach to LLM inference.

Triton Inference Server (developed by NVIDIA) is a powerful, production-ready serving system designed for multimodal AI workloads. It supports a variety of backends (like Python, PyTorch, and ONNX Runtime) and provides robust features like dynamic batching, model ensembles, and GPU multi-model support.

vLLM is an open-source project (supported by Red Hat), purpose-built for LLMs. Its standout feature is PagedAttention, a novel memory management strategy that enables efficient batched inference with low latency and high throughput. vLLM supports continuous batching and optimized GPU memory reuse, making it ideal for high-throughput LLM workloads, such as serving chatbots or multi-user agents. Both frameworks are compatible with Pruna to improve latency and efficiency further.
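For readers who have not used vLLM before, a minimal offline-inference example looks like the sketch below (the prompt and sampling parameters are illustrative, not the benchmark configuration):

```python
from vllm import LLM, SamplingParams

# Load the same base model used in the benchmark with vLLM's PagedAttention engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, max_tokens=400)

outputs = llm.generate(["Draft a short follow-up email to a customer about their renewal."], sampling)
print(outputs[0].outputs[0].text)
```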
Which optimization algorithms did we use to obtain these numbers?
To run this benchmark, we used a Pruna configuration combining quantization and compilation, two of the most impactful and mature techniques available for optimizing LLM inference.
Specifically:
Quantization was done using HQQ (Half-Quadratic Quantization) at 4-bit precision, with computation in bfloat16. HQQ is a data-free algorithm that quickly quantizes pre-trained LLM weights and enables the use of efficient CUDA kernels (here, the torchao_int4 kernels). The model runs faster and uses less memory, without retraining.
Compilation was done using torch.compile, with dynamic shapes and full-graph compilation enabled. This allows PyTorch to trace and fuse operations across the entire model, leading to better runtime execution on the GPU.
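Expressed with plain transformers and torch.compile, a comparable configuration might look like the sketch below. The HqqConfig options shown are illustrative assumptions; in the benchmarked setup, Pruna selects the quantization settings and the torchao_int4 kernels for you.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit HQQ weight quantization; the group size is an illustrative choice.
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,          # compute in bfloat16
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-graph compilation with dynamic shapes, as described above.
model.forward = torch.compile(model.forward, fullgraph=True, dynamic=True)
```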
We didn't apply pruning or structured distillation in this experiment. The goal was to show what can be achieved through minimal, plug-and-play inference-time optimization, with no need to retrain, fine-tune, or manually rework the architecture.
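As an illustration of that plug-and-play workflow, the Pruna side can be as short as the sketch below. It assumes Pruna's SmashConfig/smash API; the exact option names and defaults may differ from the released package, so treat it as a sketch rather than copy-paste code.

```python
import torch
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"           # 4-bit HQQ weight quantization (assumed option name)
smash_config["compiler"] = "torch_compile"  # full-graph torch.compile (assumed option name)

smashed_model = smash(model=base_model, smash_config=smash_config)
# smashed_model can then be served behind TritonServer or vLLM, as in the benchmark.
```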
You can find more details on each of these techniques in this introduction blogpost.
How did we evaluate quality?
When optimizing inference, whether through quantization, operator fusion, caching strategies, or architectural changes, it's essential to ensure that the model still produces useful and accurate outputs.
One way to evaluate quality is by measuring perplexity on a dataset like WikiText2. It's simple and commonly used, but it doesn't capture the complexity of real-world usage. It's good for spotting major degradation, but it is worth complementing with other metrics that assess performance on more practical tasks.
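For reference, a standard sliding-window perplexity computation on WikiText2 looks like the sketch below. It is a simplified version of the usual recipe, not the exact evaluation script behind the numbers reported further down.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

window, nlls = 2048, []
for i in range(0, ids.size(1), window):
    chunk = ids[:, i : i + window]
    if chunk.size(1) < 2:                   # skip a trailing chunk too short to score
        continue
    with torch.no_grad():
        out = model(chunk, labels=chunk)    # loss = mean negative log-likelihood per token
    nlls.append(out.loss * chunk.size(1))

ppl = torch.exp(torch.stack(nlls).sum() / ids.size(1))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```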
Hence, we decided to benchmark on MMLU, a more comprehensive evaluation suite that includes multiple-choice questions across dozens of domains, such as science, law, history, and programming. It's more reflective of the kinds of tasks LLMs face in production, from question answering to reasoning in assistant-like applications.
Here’s what we observed:
While the original Llama 3.1 8B achieves 8.64 perplexity on WikiText2, the Pruna-optimized model achieves 9.17, which represents only a minor degradation.
While the original Llama 3.1 8B scores 68.64% accuracy on MMLU, the Pruna-optimized model scores 66.84%, less than 2 points of difference on a highly challenging benchmark.
That’s a small and acceptable change, especially considering we reduced the model from 16-bit to 4-bit weights.
Heads-up: external factors can impact performance
As stated in the benchmark conditions above, we ran all benchmark experiments in offline mode (localhost) to avoid any bias in our results. Now it's time to explain why.
Inference performance isn't just about the model and the code. It's also impacted by external factors that can make results harder to reproduce and that add complexity to implementation, for example when switching models on the fly in multi-model environments.
Here’s what we observed:
Model Warmup
LLMs often require a few initial runs to warm up GPU kernels and memory caches. Benchmarking before this warmup can produce misleading latency numbers. In our case, TritonServer+Pruna and vLLM+Pruna setups took around 1 minute to fully warm up, compared to under 10 seconds for the base model. This difference is mostly due to the compilation steps applied by both vLLM and Pruna.
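In practice, this means running a handful of untimed generations before recording anything. Here is a minimal sketch of that pattern (the helper below is hypothetical, not part of Pruna or the benchmark code):

```python
import time
import torch

def timed_generate(model, inputs, max_new_tokens=400, warmup_runs=5):
    """Run a few untimed warmup generations, then time a single measured one."""
    for _ in range(warmup_runs):
        model.generate(**inputs, max_new_tokens=32)   # populate compile and kernel caches
    torch.cuda.synchronize()                          # make sure warmup kernels have finished

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return output, time.perf_counter() - start
```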
Model Loading
Cold-start latency (loading the model into memory from disk or shared storage) can significantly affect TTFT. This is especially relevant in multi-model setups or serverless architectures. With TritonServer on SageMaker, the container does not have direct access to the Hugging Face Hub. That means the LLM must be compressed ahead of time and uploaded to S3, which can add several seconds depending on the network bandwidth. This also makes dynamic model switching harder to manage. One could fix this by giving the container access to the Hugging Face Hub.
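A hedged sketch of that ahead-of-time workflow is shown below: the compressed model is saved locally, then uploaded to S3 so the TritonServer container never needs Hub access. The bucket name, prefix, and local directory are hypothetical.

```python
import os
import boto3

local_dir = "smashed-llama-3.1-8b"   # e.g. the output of smashed_model.save_pretrained(local_dir)
bucket, prefix = "my-models-bucket", "llama-3.1-8b-smashed"   # hypothetical names

s3 = boto3.client("s3")
for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = f"{prefix}/{os.path.relpath(path, local_dir)}"
        s3.upload_file(path, bucket, key)   # upload each artifact under the chosen prefix
```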
What’s Next
Thanks for reading this deep dive. We hope you enjoyed the level of detail and transparency, not just in the numbers, but in how we got them. From benchmark conditions to optimization configs, we’ve tried to share exactly what we tested, why we tested it, and what it means for real-world usage.
If you want to go further, here’s what’s coming and how to get started:
🚀 Try it yourself: Run the Triton + Pruna notebook and reproduce the benchmarks.
🧪 Get early access: Register for the Pruna + vLLM private beta
☁️ Launch ready-to-go infra: Spin up an EC2 instance with Pruna pre-installed on AWS.
📦 Coming soon: Optimized Llama models available on the SageMaker Marketplace


