Case study

Pruna Cuts GLiNER Latency by 2x for the Largest Cloud Monitoring Platform

Jul 15, 2025

Begüm Çig

ML Research Engineer

Quentin Sinig

Go-to-Market Lead

Bertrand Charpentier

Cofounder, President & Chief Scientist

A Leading Large Cloud Monitoring-as-a-Service Platform (name can’t be publicly disclosed) had just rolled out a new AI-powered feature: real-time detection of Personally Identifiable Information (PII) inside log streams. This wasn’t a nice-to-have: it was a long-requested feature, now live and being gradually rolled out to customers.

The core challenge wasn’t accuracy. The model (a custom proxy of GLiNER v2.1-large) had already been through an internal evaluation process and was selected for its balance of performance and zero-shot flexibility. The real challenge was cost and latency. The model was deployed at scale, running on a fleet of NVIDIA A10s and processing millions of logs per second. At that volume, even small efficiency gains make a big financial difference: shaving just 5–10% off latency could save tens of thousands of dollars a year and make the feature significantly more cost-effective.

They gave us a clear goal: beat the current latency by at least 10%, without touching quality.

We said: let’s go.

We Didn’t Know the Model. They Did.

From the start, we knew we were stepping into their territory. Their engineers had selected GLiNER for good reasons and had already adapted it to match the structure of their log data. While they were the experts on their new GLiNER model, our strength was making it run faster. So we did what we usually do:

  • We asked questions about their use case,

  • We reviewed what they had already tried,

  • We looked at how to make it more efficient with state-of-the-art expertise.

Our job wasn’t to re-train the model but to get more performance from what was already working.

They shared a synthetic dataset with us, small in size but shaped like production. No labels, but realistic in length and format, enough to measure latency and throughput reliably. The brief was simple and clear: don’t touch model quality, don’t change the serving flow, just make inference more efficient.
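As a reference point, here is a minimal sketch of the kind of harness this implies: it measures P90 batch latency and throughput for any callable that runs the model over a list of production-shaped strings. The batch sizes, warm-up counts, and the GLiNER-style call in the final comment are illustrative assumptions, not the client’s actual setup.

```python
# Minimal P90 latency / throughput harness (sketch; batch sizes and run counts are assumptions).
import time
import numpy as np
import torch

def measure(predict_fn, texts, batch_size, n_warmup=5, n_runs=50):
    """Return (P90 latency in ms, throughput in samples/s) for one batch size."""
    batch = texts[:batch_size]
    for _ in range(n_warmup):                 # let compilation / autotuning warm up
        predict_fn(batch)
    timings = []
    for _ in range(n_runs):
        torch.cuda.synchronize()              # make sure GPU work is actually finished
        start = time.perf_counter()
        predict_fn(batch)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
    p90_ms = float(np.percentile(timings, 90)) * 1000
    throughput = batch_size / float(np.mean(timings))
    return p90_ms, throughput

# Example (assuming a GLiNER-style API with batch_predict_entities):
# for bs in (8, 16, 32, 64):
#     print(bs, measure(lambda b: model.batch_predict_entities(b, labels), synthetic_logs, bs))
```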

How We Combined Optimization Algorithms

We had both the small and large versions of GLiNER available. We started with the large one because it offered better performance. Once optimized, the same techniques could later be applied to the small version as well.
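To make that concrete, here is a hedged loading sketch using the open-source gliner package and its public checkpoints; the client’s model is a custom proxy, so the checkpoint names, labels, and threshold below are stand-in assumptions.

```python
# Illustrative only: the client's model is a custom GLiNER proxy; the public
# checkpoints and PII-style labels below are stand-in assumptions.
import torch
from gliner import GLiNER

device = "cuda" if torch.cuda.is_available() else "cpu"

# Start with the large variant (better quality); the same optimizations can be
# re-applied to the small variant later.
large = GLiNER.from_pretrained("urchade/gliner_large-v2.1").to(device).eval()
small = GLiNER.from_pretrained("urchade/gliner_small-v2.1").to(device).eval()

labels = ["person name", "email address", "ip address"]  # assumed entity types
spans = large.predict_entities(
    "User jane.doe@example.com logged in from 10.0.0.1", labels, threshold=0.5
)
print(spans)
```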

We experimented with many combinations of compression methods (a combined code sketch follows the list below):

  • We cast the model to float16 (half precision). It cut latency by roughly a third across all batch sizes without quality loss.

  • We compiled the model with compilers such as torch.compile. Using max-autotune, we targeted the three heaviest sub-modules:

    | Sub-module we compiled | What it actually does | Compile flags |
    | --- | --- | --- |
    | BERT encoder stack | Turns raw tokens into contextual embeddings | fullgraph=True |
    | Span-representation feed-forward head | Converts token embeddings into span vectors | fullgraph=True |
    | Lightweight RNN head | Adds sequence-level context | default graph |

    We also experimented with the “module list” feature in Pruna, which compiles modules one by one. While it nudged runtime latency up a bit, it cut the warm-up time from 6–7 minutes to around 1 minute!

  • We quantized to lower precision with Torch-AO dynamic quantization. It brought speedups at larger batch sizes:

    | Batch | Compile only (P90) | + Torch-AO (P90) | Speed-up |
    | --- | --- | --- | --- |
    | 8 | 89 ms | 91 ms | ≈ -2% |
    | 16 | 178 ms | 175 ms | ≈ 2% |
    | 32 | 353 ms | 312 ms | ≈ 11% |
    | 64 | 674 ms | 599 ms | ≈ 11% |

  • We applied structured pruning. We targeted selected layers (4% of the linear layers and 20% of the attention heads in the encoder) so as not to impact accuracy:

    | Batch | Before pruning (P90) | After pruning (P90) | Extra gain |
    | --- | --- | --- | --- |
    | 8 | 89 ms | 72 ms | 20% |
    | 16 | 178 ms | 130 ms | 27% |
    | 32 | 312 ms | 280 ms | 10% (on top of Torch-AO) |
    | 64 | 599 ms | 452 ms | 25% (on top of Torch-AO) |
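Putting the pieces together, here is a minimal sketch of the stack described above: half precision, Torch-AO dynamic int8 quantization of the linear layers, and torch.compile with max-autotune on the heaviest sub-modules. The sub-module attribute names are placeholders (GLiNER internals vary by version), and in practice we drive these steps through Pruna’s configuration rather than by hand, so treat this as an outline under those assumptions rather than the exact production recipe.

```python
# Sketch of the combined stack. Assumptions: `encoder`, `span_head`, and `rnn_head`
# are placeholder attribute names; the real pipeline is applied via Pruna, not by hand.
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

def optimize(model: torch.nn.Module) -> torch.nn.Module:
    # 1) Half precision: roughly a third off latency at every batch size, no measured quality loss.
    model = model.half().eval().cuda()

    # 2) Torch-AO dynamic quantization on the linear layers; pays off at batch sizes >= 32.
    quantize_(model, int8_dynamic_activation_int8_weight())

    # 3) torch.compile with max-autotune on the three heaviest sub-modules.
    model.encoder = torch.compile(model.encoder, mode="max-autotune", fullgraph=True)
    model.span_head = torch.compile(model.span_head, mode="max-autotune", fullgraph=True)
    model.rnn_head = torch.compile(model.rnn_head, mode="max-autotune")  # no fullgraph: default graph handling

    return model
```

Structured pruning (4% of linear layers, 20% of attention heads in the encoder) sits on top of this stack and is applied through Pruna’s pruner, so it is not reproduced in the sketch.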

From 35 ms to 19 ms, Without Touching Quality

The outcome was clear. In the conservative setup we would recommend for production, we brought latency down from 35 milliseconds to 19 milliseconds*, a nearly 2x speedup, while cutting memory usage in half. This comfortably surpassed the original target of a 10% latency improvement.

But we didn’t stop there. We kept iterating, and eventually built a performance envelope that covered everything from simple drop-in gains to more aggressive tuning, depending on how far the team wanted to take it.

Beyond latency, we also kept making the model smaller. A smaller model frees up memory to fit more samples and more model instances per GPU, delivering higher throughput per machine.

And all of this came without requiring changes to their production flow. No code rewrites, no system overhaul. Just a more efficient model, dropped in place.

*It’s worth noting that the 35 ms baseline reflects the client’s already-optimized production setup. If you compare against the original, unoptimized base model, the efficiency gains can reach up to 5.5x, depending on batch size.

ROI: €28K to €58K/year in total savings

The chart below makes the cost dynamics visual. Each point along the blue line shows the net savings per task (“net” means after Pruna’s license cost is deducted) at a given latency. The dashed green line marks the break-even point, beyond which latency becomes too costly to justify.

Bottom line: In a real-world production scenario (not a lab benchmark), Pruna AI already pays for itself through the speed efficiency it unlocks, delivering a 7% net savings just from latency. On top of that, memory compression allows you to cut GPU needs in half, bringing total savings to €28K/year on a modest 10-GPU setup. And that’s just for one feature: the impact scales with your service. Meanwhile, your users get a better experience, which drives both revenue and retention.

The “Safe” scenario at 19 ms sits in the profitable zone.

  • Break-even at 20.55 ms → savings for switching from 35 ms to 19 ms = 7.53%, or €308.08/month (€3,696.93/year)

  • Memory savings (–50%) → switching from 10 GPUs to only 5 GPUs = €2,046.96/month (€24,563.52/year)

  • Total combined savings: €28,260.45/year

*Pricing Hypothesis based on A10G Anyscale Public Pricing: $0.5686 per hour (g5.4xlarge) and Pruna Pro Public Pricing: $0.40/hour
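For transparency, here is a small back-of-the-envelope script reproducing the figures above from those listed prices. The assumptions are a 10-GPU fleet running around the clock (720 hours per GPU per month) and, as in the numbers above, the $ prices read directly as € amounts.

```python
# Back-of-the-envelope reproduction of the "safe" scenario savings.
# Assumptions: 10-GPU fleet, 720 GPU-hours per GPU per month, prices as listed above.
GPU_HR = 0.5686           # A10G (g5.4xlarge) price per hour
PRUNA_HR = 0.40           # Pruna Pro license per GPU-hour
HOURS_PER_MONTH = 720
FLEET_SIZE = 10

baseline_ms, optimized_ms = 35.0, 19.0

# Cost per task scales with latency times the hourly rate (base + license).
break_even_ms = baseline_ms * GPU_HR / (GPU_HR + PRUNA_HR)                                # ~20.55 ms
net_latency_savings = 1 - (optimized_ms * (GPU_HR + PRUNA_HR)) / (baseline_ms * GPU_HR)   # ~7.5 %

fleet_monthly_cost = FLEET_SIZE * GPU_HR * HOURS_PER_MONTH                                # ~4,093.92 / month
latency_savings_month = net_latency_savings * fleet_monthly_cost                          # ~308 / month

# Halving memory lets 5 GPUs do the work of 10.
memory_savings_month = (FLEET_SIZE / 2) * GPU_HR * HOURS_PER_MONTH                        # ~2,046.96 / month

total_per_year = 12 * (latency_savings_month + memory_savings_month)                      # ~28.3K / year
print(f"break-even: {break_even_ms:.2f} ms, total savings: {total_per_year:,.2f} per year")
```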

As a side note, this estimate reflects our recommended “safe” production launch setup. We also evaluated an “extreme” configuration delivering 5.5x faster inference (on batch size 64) and 68% net savings from latency alone, leading to over €58K in annual savings—though this path requires additional engineering effort to ensure a smooth implementation.

Bonus: A New Pruning Method Shared with the Community

There was an added outcome from this project. While working through the optimization process, our research team developed a pruning method that allows you to prune entire target modules without manually listing every leaf submodule to exclude, making it easier to prune scoped blocks like encoder stacks without breaking shape compatibility. Instead of keeping it internal, we contributed it to our pruna package.

That’s one of the parts we appreciated most about this collaboration. It wasn’t just about performance gains for one team; it was a chance to improve the tooling for anyone using similar models in production.

It’s a wrap.

👉🏻 If you’re working with similar models, the new pruning method will be released in pruna 0.2.7; stay tuned!

👉🏻 Want to see what you could unlock on your stack? Try Pruna now or contact our team to run a benchmark like this on your models.
