Case study
Pruna Cuts GLiNER Latency by 2x for the Largest Cloud Monitoring Platform
Jul 15, 2025

Begüm Çig
ML Research Engineer

Quentin Sinig
Go-to-Market Lead

Bertrand Charpentier
Cofounder, President & Chief Scientist
A Leading Large Cloud Monitoring-as-a-Service Platform (name can’t be publicly disclosed) had just rolled out a new AI-powered feature: real-time detection of Personally Identifiable Information (PII) inside log streams. Not a nice-to-have, this was a long-requested feature, now live and being gradually rolled out to customers.
The core challenge wasn’t about accuracy. The model (a custom proxy of GLiNER v2.1-large) had already been through an internal evaluation process and was selected for its balance of performance and zero-shot flexibility. The real challenge was about cost savings and latency. It was deployed at scale, running on a fleet of NVIDIA A10s for millions of logs per second. Even small efficiency gains could make a big financial difference at that volume. For example, shaving just 5–10% off latency could save tens of thousands of dollars a year and make the feature significantly more cost-effective.
They gave us a clear goal: beat the current latency by at least 10%, without touching quality.
We said: let’s go.
We Didn’t Know the Model. They Did.
From the start, we knew we were stepping into their territory. Their engineers had selected GLiNER for good reasons and had already adapted it to match the structure of their log data. While they were the experts on their new GLiNER model, our strength was making it run faster. So we did what we usually do:
We asked questions about their use case,
We reviewed what they had already tried,
We looked at how to make it more efficient with state-of-the-art expertise.
Our job wasn’t to re-train the model but to get more performance from what was already working.
They shared a synthetic dataset with us, small in size but shaped like production. No labels, but realistic in length and format, enough to measure latency and throughput reliably. The brief was simple and clear: don’t touch model quality, don’t change the serving flow, just make inference more efficient.
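(For reference, a P90 latency harness for that kind of unlabeled, production-shaped data can be as small as the sketch below; `model` and `batches` are hypothetical placeholders for the client's model and synthetic inputs, not their actual serving code.)

```python
import time
import numpy as np
import torch

@torch.inference_mode()
def p90_latency_ms(model, batches, warmup=5):
    """Measure P90 per-batch latency in milliseconds on a CUDA device."""
    model.eval()
    timings = []
    for i, batch in enumerate(batches):
        torch.cuda.synchronize()          # make sure earlier kernels are done
        start = time.perf_counter()
        model(**batch)                    # forward pass only; no labels needed
        torch.cuda.synchronize()          # wait for the GPU to finish
        if i >= warmup:                   # skip warm-up iterations
            timings.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings, 90))
```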
How We Combined Optimization Algorithms
We had both the small and large versions of GLiNER available. We started with the large one because it offered better performance. Once optimized, the same techniques could later be applied to the small version as well.
We experimented with many combinations of compression methods:
We used float16 (half precision). It cut latency by roughly a third across the board without quality loss.
We compiled with compilers like torch.compile, using max-autotune and targeting the three heaviest sub-modules:

| Sub-module we compiled | What it actually does | Compile flags |
| --- | --- | --- |
| BERT encoder stack | Turns raw tokens into contextual embeddings | fullgraph=True |
| Span-representation feed-forward head | Converts token embeddings into span vectors | fullgraph=True |
| Lightweight RNN head | Adds sequence-level context | default graph |
We also experimented with the “module list” feature in Pruna, which compiles modules one by one. While it nudged runtime latency up a bit, it cut the warm-up time from 6–7 minutes to around 1 minute!
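To make this concrete, here is a minimal sketch of the idea, assuming a standard PyTorch setup; the attribute names (`encoder`, `span_head`, `rnn_head`) are illustrative placeholders, not the exact names in the client's GLiNER variant or in Pruna's internal configuration:

```python
import torch

# Half precision first: cuts latency by roughly a third with no observed quality loss.
model = model.half().eval()

# Compile the two shape-stable heads as full graphs with aggressive autotuning;
# the RNN head keeps the default graph so occasional graph breaks stay cheap.
model.encoder = torch.compile(model.encoder, mode="max-autotune", fullgraph=True)
model.span_head = torch.compile(model.span_head, mode="max-autotune", fullgraph=True)
model.rnn_head = torch.compile(model.rnn_head, mode="max-autotune")
```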
We quantized to lower precision with Torch-AO dynamic quantization. It brought speedups at larger batch sizes:
| Batch | Compile only (P90) | + Torch-AO (P90) | Speed-up |
| --- | --- | --- | --- |
| 8 | 89 ms | 91 ms | ≈ −2 % |
| 16 | 178 ms | 175 ms | ≈ 2 % |
| 32 | 353 ms | 312 ms | ≈ 11 % |
| 64 | 674 ms | 599 ms | ≈ 11 % |
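The Torch-AO step itself is a one-liner in the public torchao API; a hedged sketch, assuming a recent torchao release (the exact integration inside Pruna is handled for you):

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Dynamically quantize the linear layers: int8 weights, activations quantized
# on the fly at runtime. As the table above shows, this pays off mainly at batch ≥ 32.
quantize_(model, int8_dynamic_activation_int8_weight())
```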
We applied structured pruning. We targeted selected layers (4% of the linear layers and 20% of the attention heads in the encoder) so as not to impact accuracy:
| Batch | Before pruning (P90) | After pruning (P90) | Extra gain |
| --- | --- | --- | --- |
| 8 | 89 ms | 72 ms | 20 % |
| 16 | 178 ms | 130 ms | 27 % |
| 32 | 312 ms | 280 ms | 10 % (on top of Torch-AO) |
| 64 | 599 ms | 452 ms | 25 % (on top of Torch-AO) |
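Our exact pruning recipe is part of Pruna, but the general shape of structured pruning can be sketched with PyTorch's built-in utilities; the 4% ratio and the `encoder` name filter below are illustrative assumptions:

```python
import torch
import torch.nn.utils.prune as prune

# L2-structured pruning of a small fraction of output channels in selected
# encoder linear layers. (Pruna's method additionally removes attention heads
# and rebuilds module shapes so the pruned weights are actually dropped.)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "encoder" in name:
        prune.ln_structured(module, name="weight", amount=0.04, n=2, dim=0)
        prune.remove(module, "weight")  # fold the pruning mask into the weight tensor
```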
From 35 ms to 19 ms, Without Touching Quality

The outcome was clear. In our best-case configuration, we brought latency down from 35 milliseconds to 19 milliseconds*, a nearly 2x speedup, with a conservative setup we would recommend for production, while also cutting memory usage in half. This comfortably surpassed the original target of a 10% latency improvement.
But we didn’t stop there. We kept iterating, and eventually built a performance envelope that covered everything from simple drop-in gains to more aggressive tuning, depending on how far the team wanted to take it.
While the objective was already reached, we kept iterating to make the model even smaller. Smaller models free up memory to fit more samples and model instances, delivering higher throughput per machine.
And all of this came without requiring changes to their production flow. No code rewrites, no system overhaul. Just a more efficient model, dropped in place.
*It’s worth noting that the 35 ms baseline reflects the client’s already-optimized production setup. If you compare against the original, unoptimized base model, the efficiency gains can reach up to 5.5x, depending on batch size.
ROI: €28K to €58K/year in total savings
The chart below makes the cost dynamics visual. Each point along the blue line shows the net savings per task (“net” means after Pruna’s license cost is deducted) at a given latency. The dashed green line marks the break-even point (the point beyond which latency becomes too costly to justify).
Bottom line: In a real-world production scenario (not a lab benchmark), Pruna AI already pays for itself through the speed efficiency it unlocks, delivering a 7% net savings just from latency. On top of that, memory compression allows you to cut GPU needs in half, bringing total savings to €28K/year on a modest 10-GPU setup. And that’s just for one feature: the impact scales with your service. Meanwhile, your users get a better experience, which drives both revenue and retention.

The “Safe” scenario at 19ms sits in the profitable zone.
Break-even at 20.55 ms → savings for switching from 35 ms to 19 ms = 7.53%, or €308.08/month, €3,696.93/year
Memory savings (–50%) → switching from 10 GPUs to only 5 GPUs = €2,046.96/month, €24,563.52/year
Total combined savings → €28,260.45/year
*Pricing Hypothesis based on A10G Anyscale Public Pricing: $0.5686 per hour (g5.4xlarge) and Pruna Pro Public Pricing: $0.40/hour
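For transparency, those figures follow from a simple cost model we assume here: per-task cost scales with latency times the effective hourly rate (GPU rate, plus the Pruna license on the optimized side), over 720 hours/month on a 10-GPU fleet. A small sketch that reproduces the numbers above:

```python
GPU_RATE = 0.5686       # A10G (g5.4xlarge), Anyscale public pricing, per hour
PRUNA_RATE = 0.40       # Pruna Pro public pricing, per hour
HOURS_PER_MONTH = 720   # assumed 24/7 utilisation
N_GPUS = 10             # modest baseline fleet

baseline_cost = 35 * GPU_RATE                   # per-task cost proxy at 35 ms
optimized_cost = 19 * (GPU_RATE + PRUNA_RATE)   # 19 ms, license included

break_even_ms = baseline_cost / (GPU_RATE + PRUNA_RATE)        # ≈ 20.55 ms
latency_savings = 1 - optimized_cost / baseline_cost           # ≈ 7.53 %

monthly_gpu_bill = N_GPUS * GPU_RATE * HOURS_PER_MONTH         # ≈ 4,093.92 / month
latency_savings_month = latency_savings * monthly_gpu_bill     # ≈ 308.08 / month
memory_savings_month = 5 * GPU_RATE * HOURS_PER_MONTH          # 5 fewer GPUs ≈ 2,046.96
total_per_year = 12 * (latency_savings_month + memory_savings_month)  # ≈ 28,260 / year
```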
As a side note, this estimate reflects our recommended “safe” production launch setup. We also evaluated an “extreme” configuration delivering 5.5x faster inference (on batch size 64) and 68% net savings from latency alone, leading to over €58K in annual savings—though this path requires additional engineering effort to ensure a smooth implementation.
Bonus: A New Pruning Method Shared with the Community
There was an added outcome from this project. While working through the optimization process, our research team developed a pruning method that allows you to prune entire target modules without manually listing every leaf submodule to exclude, making it easier to prune scoped blocks like encoder stacks without breaking shape compatibility. Instead of keeping it internal, we contributed it to our pruna package.
That’s one of the parts we appreciated most about this collaboration. It wasn’t just about performance gains for one team, it was a chance to improve the tooling for anyone using similar models in production.
It’s a wrap.
👉🏻 If you’re working with similar models, the new pruning method will be released in pruna 0.2.7, stay tuned!
👉🏻 Want to see what you could unlock on your stack? Try Pruna now or contact our team to run a benchmark like this on your models.


