Combination Engine

Combine Optimization Algorithms To Get The Most Out Of Your Model

Stop cluttering your codebase with manual algorithm implementations. Pruna's library combines the best optimization algorithms and the latest compression methods.

pip install pruna


Don’t be fooled by our name, we do more than Pruning!

Pruning

Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.

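The idea can be sketched in a few lines. This is a toy magnitude-pruning pass, illustrative only and not Pruna's implementation: it zeroes out the weights with the smallest absolute values.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is removed."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.1], sparsity=0.5)
# The three smallest-magnitude weights (-0.05, 0.01, 0.1) are zeroed
```

The surviving weights keep their values; the zeroed connections can then be skipped by sparse kernels.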

Quantization

Quantization reduces the precision of the model’s weights and activations, making them much smaller in terms of memory required.

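As a toy illustration (not Pruna's actual kernels), symmetric quantization maps each float to a low-precision integer through a single scale factor:

```python
def quantize(values, bits=8):
    """Map floats to signed integers of `bits` precision with a single scale."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [v * scale for v in q]

q, scale = quantize([0.5, -1.0, 0.25])
# Each int8 value takes a quarter of the memory of a float32 weight
```

Dequantizing recovers the original values up to a small rounding error, which is the accuracy cost traded for the memory savings.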

Batching

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.

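A toy sketch of the idea, assuming a model call whose fixed overhead dominates the per-input cost (`run_model` is a stand-in, not a Pruna API):

```python
calls = 0

def run_model(batch):
    """Stand-in forward pass: the per-call overhead dominates, not the batch size."""
    global calls
    calls += 1
    return [x * 2 for x in batch]

inputs = [1, 2, 3, 4]

# Unbatched: one model call per input
unbatched = [run_model([x])[0] for x in inputs]
calls_unbatched = calls

# Batched: a single model call processes all inputs together
calls = 0
batched = run_model(inputs)
```

Both paths produce identical outputs, but the batched path pays the per-call overhead once instead of once per input.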

Enhancing

Enhancers improve the quality of the model’s output. They range from post-processing to test-time compute algorithms.

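One common test-time compute pattern is best-of-N sampling, sketched below with toy `generate` and `score` stand-ins (hypothetical names, not Pruna APIs):

```python
def best_of_n(generate, score, n=4):
    """Generate n candidate outputs and keep the one the scorer likes best."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: deterministic "samples" and a scorer that prefers larger values
samples = iter([0.2, 0.9, 0.5, 0.1])
best = best_of_n(lambda: next(samples), score=lambda s: s, n=4)
```

The trade-off is explicit: N times more compute at inference for a higher-quality final answer.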

Caching

Caching is a technique used to store intermediate results of computations to speed up subsequent operations, particularly useful in reducing inference time for machine learning models.

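The principle in miniature, using Python's standard memoization decorator (illustrative only; Pruna's cachers operate on model internals such as diffusion steps):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def expensive_step(x):
    """Stand-in for a costly intermediate computation."""
    global calls
    calls += 1
    return x ** 2

results = [expensive_step(x) for x in [1, 2, 1, 2, 1]]
# Five requests, but only two distinct inputs, so only two real computations
```

Repeated inputs hit the cache instead of recomputing, which is exactly the latency win caching buys during inference.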

Recovery

Recovery restores the performance of a model after compression.

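A toy illustration of the idea (not Pruna's recovery procedure): a one-parameter model whose weight was perturbed by lossy compression is briefly fine-tuned back toward accuracy.

```python
# Toy "model": y = w * x, with true w = 2.0; compression perturbed w to 1.5
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]
w = 1.5                              # weight after (lossy) compression
lr = 0.01

for _ in range(200):                 # brief fine-tuning pass to recover accuracy
    for x, y in data:
        grad = 2 * (w * x - y) * x   # derivative of squared error w.r.t. w
        w -= lr * grad
# w has moved back to (approximately) its pre-compression value
```

The same principle scales up: a short post-compression training run recovers most of the quality the compression step cost.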

Factorization

Factorization batches several small matrix multiplications into one large fused operation which, while neutral on memory and raw latency, unlocks notable speed-ups when used alongside quantization.

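A minimal sketch of QKV-style fusion with plain Python lists (illustrative only): three small projections are replaced by one wide matmul whose output is split back apart.

```python
def matmul(A, B):
    """Naive matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Three small projection matrices (think Q, K, V in attention)
Wq, Wk, Wv = [[1, 0], [0, 1]], [[2, 0], [0, 2]], [[3, 0], [0, 3]]
X = [[1.0, 2.0]]

# Unfused: three separate small matmuls
q, k, v = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)

# Fused: concatenate the weights column-wise, run one wide matmul, split the result
W_fused = [rq + rk + rv for rq, rk, rv in zip(Wq, Wk, Wv)]
out = matmul(X, W_fused)
q2 = [row[0:2] for row in out]
k2 = [row[2:4] for row in out]
v2 = [row[4:6] for row in out]
```

The results are identical; the win is launching one large kernel instead of three small ones, which pairs well with quantized weights.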

Distillation

Distillation trains a smaller, simpler model to mimic a larger, more complex model.

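The core loss can be sketched as cross-entropy between the teacher's softened output distribution and the student's (a standard formulation, not necessarily Pruna's exact recipe):

```python
import math

def softmax(logits, T=1.0):
    """Convert logits to probabilities, softened by temperature T."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher (targets) and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A student that matches the teacher incurs a lower loss than a mismatched one
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Minimizing this loss pushes the small student's full output distribution toward the large teacher's, not just its top prediction.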

Compilation

Compilation optimizes the model for specific hardware.


Pruna combines several compression algorithms with one feature

Our SmashConfig feature lets you define your objectives and choose the algorithms you need to optimize your model in just a few lines of code. And if you don’t know which combination to use, have a look at our tutorials or our Optimization Agent.

Recommended configuration for compressing Qwen

from pruna import SmashConfig

# Initialize the SmashConfig
smash_config = SmashConfig(cache_dir_prefix="/efs/smash_cache")
smash_config.add_tokenizer(model_name)  # model_name: the Hugging Face ID of the model being compressed
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
smash_config._prepare_saving = False


Recommended configuration for compressing Flux

from pruna import SmashConfig

smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
smash_config["quantizer"] = "fp8"
smash_config["factorizer"] = "qkv_diffusers"
smash_config["cacher"] = "auto"
smash_config["auto_cache_mode"] = "taylor"
smash_config["auto_objective"] = "quality"
smash_config._prepare_saving = False


Speed Up Your Models With Pruna AI

Inefficient models drive up costs, slow down your productivity, and increase carbon emissions. Make your AI more accessible and sustainable with Pruna AI.

pip install pruna
