Case study
Pruna Cuts GLiNER Latency by 2x for the Largest Cloud Monitoring Platform
Jul 15, 2025

Begüm Çig
ML Research Engineer

Quentin Sinig
Go-to-Market Lead

Bertrand Charpentier
Cofounder, President & Chief Scientist
A Leading Large Cloud Monitoring-as-a-Service Platform (name can’t be publicly disclosed) had just rolled out a new AI-powered feature: real-time detection of Personally Identifiable Information (PII) inside log streams. Not a nice-to-have, this was a long-requested feature, now live and being gradually rolled out to customers.
The core challenge wasn’t about accuracy. The model (a custom proxy of GLiNER v2.1-large) had already been through an internal evaluation process and was selected for its balance of performance and zero-shot flexibility. The real challenge was about cost savings and latency. It was deployed at scale, running on a fleet of NVIDIA A10s for millions of logs per second. Even small efficiency gains could make a big financial difference at that volume. For example, shaving just 5–10% off latency could save tens of thousands of dollars a year and make the feature significantly more cost-effective.
They gave us a clear goal: beat the current latency by at least 10%, without touching quality.
We said: let’s go.
We Didn’t Know the Model. They Did.
From the start, we knew we were stepping into their territory. Their engineers had selected GLiNER for good reasons and had already adapted it to match the structure of their log data. While they were the experts on their new GLiNER model, our strength was making it run faster. So we did what we usually do:
We asked questions about their use case,
We reviewed what they had already tried,
We looked at how to make it more efficient with state-of-the-art expertise.
Our job wasn’t to re-train the model but to get more performance from what was already working.
They shared a synthetic dataset with us, small in size but shaped like production. No labels, but realistic in length and format, enough to measure latency and throughput reliably. The brief was simple and clear: don’t touch model quality, don’t change the serving flow, just make inference more efficient.
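(For reference, a P90 latency harness for that kind of unlabeled, production-shaped data can be as small as the sketch below; `model` and `batches` are hypothetical placeholders for the client's model and synthetic inputs, not their actual serving code.)

```python
import time
import numpy as np
import torch

@torch.inference_mode()
def p90_latency_ms(model, batches, warmup=5):
    """Measure P90 per-batch latency in milliseconds on a CUDA device."""
    model.eval()
    timings = []
    for i, batch in enumerate(batches):
        torch.cuda.synchronize()          # make sure earlier kernels are done
        start = time.perf_counter()
        model(**batch)                    # forward pass only; no labels needed
        torch.cuda.synchronize()          # wait for the GPU to finish
        if i >= warmup:                   # skip warm-up iterations
            timings.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings, 90))
```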
How We Combined Optimization Algorithms
We had both the small and large versions of GLiNER available. We started with the large one because it offered better performance. Once optimized, the same techniques could later be applied to the small version as well.
We experimented with many combinations of compression methods:
We used float16 (half precision). It cut latency by roughly a third across the board without quality loss.
We compiled with compilers like torch.compile, using max-autotune and targeting the three heaviest sub-modules:

| Sub-module we compiled | What it actually does | Compile flags |
| --- | --- | --- |
| BERT encoder stack | Turns raw tokens into contextual embeddings | fullgraph=True |
| Span-representation feed-forward head | Converts token embeddings into span vectors | fullgraph=True |
| Lightweight RNN head | Adds sequence-level context | default graph |
We also experimented with the “module list” feature in Pruna, which compiles modules one by one. While it nudged runtime latency up a bit, it cut the warm-up time from 6–7 minutes to around 1 minute!
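To make this concrete, here is a minimal sketch of the idea, assuming a standard PyTorch setup; the attribute names (`encoder`, `span_head`, `rnn_head`) are illustrative placeholders, not the exact names in the client's GLiNER variant or in Pruna's internal configuration:

```python
import torch

# Half precision first: cuts latency by roughly a third with no observed quality loss.
model = model.half().eval()

# Compile the two shape-stable heads as full graphs with aggressive autotuning;
# the RNN head keeps the default graph so occasional graph breaks stay cheap.
model.encoder = torch.compile(model.encoder, mode="max-autotune", fullgraph=True)
model.span_head = torch.compile(model.span_head, mode="max-autotune", fullgraph=True)
model.rnn_head = torch.compile(model.rnn_head, mode="max-autotune")
```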
We quantized to lower precision with Torch-AO dynamic quantization. It brought speedups at larger batch sizes:
| Batch | Compile only (P90) | + Torch-AO (P90) | Speed-up |
| --- | --- | --- | --- |
| 8 | 89 ms | 91 ms | ≈ −2 % |
| 16 | 178 ms | 175 ms | ≈ 2 % |
| 32 | 353 ms | 312 ms | ≈ 11 % |
| 64 | 674 ms | 599 ms | ≈ 11 % |
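The Torch-AO step itself is a one-liner in the public torchao API; a hedged sketch, assuming a recent torchao release (the exact integration inside Pruna is handled for you):

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Dynamically quantize the linear layers: int8 weights, activations quantized
# on the fly at runtime. As the table above shows, this pays off mainly at batch ≥ 32.
quantize_(model, int8_dynamic_activation_int8_weight())
```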
We applied structured pruning. We targeted selected layers (4% of the linear layers and 20% of the attention heads in the encoder) so as not to impact accuracy:
| Batch | Before pruning (P90) | After pruning (P90) | Extra gain |
| --- | --- | --- | --- |
| 8 | 89 ms | 72 ms | 20 % |
| 16 | 178 ms | 130 ms | 27 % |
| 32 | 312 ms | 280 ms | 10 % (on top of Torch-AO) |
| 64 | 599 ms | 452 ms | 25 % (on top of Torch-AO) |
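Our exact pruning recipe is part of Pruna, but the general shape of structured pruning can be sketched with PyTorch's built-in utilities; the 4% ratio and the `encoder` name filter below are illustrative assumptions:

```python
import torch
import torch.nn.utils.prune as prune

# L2-structured pruning of a small fraction of output channels in selected
# encoder linear layers. (Pruna's method additionally removes attention heads
# and rebuilds module shapes so the pruned weights are actually dropped.)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "encoder" in name:
        prune.ln_structured(module, name="weight", amount=0.04, n=2, dim=0)
        prune.remove(module, "weight")  # fold the pruning mask into the weight tensor
```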
From 35 ms to 19 ms, Without Touching Quality

The outcome was clear. In our best-case configuration, we brought latency down from 35 milliseconds to 19 milliseconds*, a nearly 2x speedup, with a conservative setup we would recommend for production, while also cutting memory usage in half. This comfortably surpassed the original target of a 10% latency improvement.
But we didn’t stop there. We kept iterating, and eventually built a performance envelope that covered everything from simple drop-in gains to more aggressive tuning, depending on how far the team wanted to take it.
While the objective was already reached, we kept iterating to make the model even smaller. Smaller models free up memory to fit more samples and model instances, delivering higher throughput per machine.
And all of this came without requiring changes to their production flow. No code rewrites, no system overhaul. Just a more efficient model, dropped in place.
*It’s worth noting that the 35 ms baseline reflects the client’s already-optimized production setup. If you compare against the original, unoptimized base model, the efficiency gains can reach up to 5.5x, depending on batch size.
ROI: €28K to €58K/year in total savings
The chart below makes the cost dynamics visual. Each point along the blue line shows the net savings per task (“net” means after Pruna’s license cost is deducted) at a given latency. The dashed green line marks the break-even point (the point beyond which latency becomes too costly to justify).
Bottom line: In a real-world production scenario (not a lab benchmark), Pruna AI already pays for itself through the speed efficiency it unlocks, delivering a 7% net savings just from latency. On top of that, memory compression allows you to cut GPU needs in half, bringing total savings to €28K/year on a modest 10-GPU setup. And that’s just for one feature: the impact scales with your service. Meanwhile, your users get a better experience, which drives both revenue and retention.

The “Safe” scenario at 19ms sits in the profitable zone.
Break-even at 20.55 ms → savings for switching from 35 ms to 19 ms = 7.53%, or €308.08/month, €3,696.93/year
Memory savings (–50%) → switching from 10 GPUs to only 5 GPUs = €2,046.96/month, €24,563.52/year
Total combined savings → €28,260.45/year
*Pricing Hypothesis based on A10G Anyscale Public Pricing: $0.5686 per hour (g5.4xlarge) and Pruna Pro Public Pricing: $0.40/hour
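For transparency, those figures follow from a simple cost model we assume here: per-task cost scales with latency times the effective hourly rate (GPU rate, plus the Pruna license on the optimized side), over 720 hours/month on a 10-GPU fleet. A small sketch that reproduces the numbers above:

```python
GPU_RATE = 0.5686       # A10G (g5.4xlarge), Anyscale public pricing, per hour
PRUNA_RATE = 0.40       # Pruna Pro public pricing, per hour
HOURS_PER_MONTH = 720   # assumed 24/7 utilisation
N_GPUS = 10             # modest baseline fleet

baseline_cost = 35 * GPU_RATE                   # per-task cost proxy at 35 ms
optimized_cost = 19 * (GPU_RATE + PRUNA_RATE)   # 19 ms, license included

break_even_ms = baseline_cost / (GPU_RATE + PRUNA_RATE)        # ≈ 20.55 ms
latency_savings = 1 - optimized_cost / baseline_cost           # ≈ 7.53 %

monthly_gpu_bill = N_GPUS * GPU_RATE * HOURS_PER_MONTH         # ≈ 4,093.92 / month
latency_savings_month = latency_savings * monthly_gpu_bill     # ≈ 308.08 / month
memory_savings_month = 5 * GPU_RATE * HOURS_PER_MONTH          # 5 fewer GPUs ≈ 2,046.96
total_per_year = 12 * (latency_savings_month + memory_savings_month)  # ≈ 28,260 / year
```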
As a side note, this estimate reflects our recommended “safe” production launch setup. We also evaluated an “extreme” configuration delivering 5.5x faster inference (on batch size 64) and 68% net savings from latency alone, leading to over €58K in annual savings—though this path requires additional engineering effort to ensure a smooth implementation.
Bonus: A New Pruning Method Shared with the Community
There was an added outcome from this project. While working through the optimization process, our research team developed a pruning method that allows you to prune entire target modules without manually listing every leaf submodule to exclude, making it easier to prune scoped blocks like encoder stacks without breaking shape compatibility. Instead of keeping it internal, we contributed it to our pruna package.
That’s one of the parts we appreciated most about this collaboration. It wasn’t just about performance gains for one team, it was a chance to improve the tooling for anyone using similar models in production.
It’s a wrap.
👉🏻 If you’re working with similar models, the new pruning method will be released in pruna 0.2.7, stay tuned!
👉🏻 Want to see what you could unlock on your stack? Try Pruna now or contact our team to run a benchmark like this on your models.


