Combination Engine
Combine Optimization Algorithms To Get The Most Out Of Your Model
Stop cluttering your codebase with hand-rolled algorithm implementations. Pruna's library combines the best algorithms and the latest compression methods.


pip install pruna
Don’t be fooled by our name, we do more than Pruning!
Pruning
Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.
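A minimal sketch of the idea behind magnitude pruning (a toy illustration, not Pruna's implementation): zero out the weights with the smallest absolute value, leaving a sparser network.

```python
# Toy magnitude pruning: set the smallest-magnitude weights to zero.
# Real pruners (structured, movement-based, ...) are far more involved;
# this only illustrates how sparsity is induced.
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the fraction `sparsity` of smallest |w| zeroed."""
    n_prune = int(len(weights) * sparsity)
    # Threshold = magnitude of the n_prune-th smallest weight (nothing pruned if 0).
    threshold = sorted(abs(w) for w in weights)[n_prune - 1] if n_prune else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(prune_by_magnitude(weights, sparsity=0.5))
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0] - the three smallest weights are gone
```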
Quantization
Quantization reduces the precision of the model’s weights and activations, making them much smaller in terms of memory required.
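The core idea can be sketched with symmetric 8-bit quantization (a toy example, not Pruna's implementation): map floats to integers in [-127, 127] plus one float scale, cutting storage from 32 to 8 bits per weight.

```python
# Toy symmetric int8 quantization: one scale per tensor, integer weights.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each value fits in a signed byte
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.27, 0.03])
print(q)  # [50, -127, 3]
# dequantize(q, scale) recovers each value to within one quantization step.
```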
Batching
Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.
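In its simplest form (a minimal sketch, not Pruna's batching logic), this means grouping inputs into fixed-size chunks so the model processes them in one call instead of one at a time:

```python
# Toy batching: split a stream of inputs into fixed-size batches.
def batches(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

print(batches([1, 2, 3, 4, 5], batch_size=2))
# [[1, 2], [3, 4], [5]] - three model calls instead of five
```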
Enhancing
Enhancers improve the quality of the model’s output. They range from post-processing to test time compute algorithms.
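One classic test-time-compute enhancer is best-of-N sampling (a hypothetical sketch, not Pruna's API): generate several candidate outputs and keep the one a scoring function prefers.

```python
# Toy best-of-N enhancer: spend extra compute at inference time to pick
# the best of several sampled outputs.
def best_of_n(generate, score, n=4):
    """Call `generate` n times and return the highest-scoring result."""
    return max((generate() for _ in range(n)), key=score)

# Deterministic stand-in for a sampled model output:
samples = iter([0.2, 0.9, 0.5, 0.1])
print(best_of_n(lambda: next(samples), score=lambda x: x))  # 0.9
```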
Caching
Caching is a technique used to store intermediate results of computations to speed up subsequent operations, particularly useful in reducing inference time for machine learning models.
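Caching in miniature, using Python's standard library (model-level cachers apply the same idea inside the forward pass, e.g. reusing intermediate results across diffusion steps):

```python
from functools import lru_cache

# Memoize a pure function so repeated inputs are answered from memory.
@lru_cache(maxsize=None)
def expensive(x):
    return x * x  # stand-in for a costly computation

expensive(3)
expensive(3)  # served from the cache, not recomputed
print(expensive.cache_info().hits)  # 1
```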
Recovery
Recovery restores the performance of a model after compression.
Factorization
Factorization batches several small matrix multiplications into one large fused operation which, while neutral on memory and raw latency, unlocks notable speed-ups when used alongside quantization.
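QKV-style fusion in miniature (a toy illustration, not Pruna's implementation): instead of three separate multiplications x·Wq, x·Wk, x·Wv, concatenate the weight matrices column-wise and do one larger multiplication, then split the result.

```python
# Toy QKV fusion: one fused matmul replaces three small ones.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

x = [[1.0, 2.0]]                                   # 1x2 input
Wq, Wk, Wv = [[1.0], [0.0]], [[0.0], [1.0]], [[1.0], [1.0]]  # three 2x1 weights
W_fused = [rq + rk + rv for rq, rk, rv in zip(Wq, Wk, Wv)]   # one 2x3 weight

fused = matmul(x, W_fused)[0]                      # single fused operation
q, k, v = fused[0:1], fused[1:2], fused[2:3]       # split back into Q, K, V
print(q, k, v)  # [1.0] [2.0] [3.0] - identical to three separate matmuls
```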
Distillation
Distillation trains a smaller, simpler model to mimic a larger, more complex model.
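The training signal is typically a soft-target loss (a minimal sketch of the idea, not Pruna's implementation): the student is pushed to match the teacher's softened output distribution rather than hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's probabilities under the teacher's soft targets."""
    teacher_p = softmax(teacher_logits, temperature)
    student_p = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_p, student_p))

# A student that agrees with the teacher gets a lower loss than one that doesn't.
loss_close = distillation_loss([2.0, 0.1], teacher_logits=[2.1, 0.0])
loss_far = distillation_loss([0.0, 2.0], teacher_logits=[2.1, 0.0])
print(loss_close < loss_far)  # True
```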
Compilation
Compilation optimizes the model for specific hardware.
Get faster inference without the trial-and-error process.
We combine 46+ algorithms across nine technique families, including proprietary ones, so you don’t have to implement or test them manually.
And more...
Pruna combines several compression algorithms with one feature
Our SmashConfig feature lets you define your objectives and choose the algorithms you need to optimize your model in just a few lines of code. If you don’t know which combination to use, have a look at our tutorials or our Optimization Agent.
Recommended configuration for compressing Qwen
from pruna import SmashConfig

# Initialize the SmashConfig
smash_config = SmashConfig(cache_dir_prefix="/efs/smash_cache")
smash_config.add_tokenizer(model_name)
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
smash_config._prepare_saving = False
Recommended configuration for compressing Flux
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
smash_config["quantizer"] = "fp8"
smash_config["factorizer"] = "qkv_diffusers"
smash_config["cacher"] = "auto"
smash_config["auto_cache_mode"] = "taylor"
smash_config["auto_objective"] = "quality"
smash_config._prepare_saving = False
Learn more about the Combination Engine in our blog articles
Speed Up Your Models With Pruna AI
Inefficient models drive up costs, slow down your productivity and increase carbon emissions. Make your AI more accessible and sustainable with Pruna AI.
pip install pruna
© 2025 Pruna AI - Built with Pretzels & Croissants 🥨 🥐