Pruna is a frictionless solution to help you optimize and compress your ML models for efficient inference
With only a few lines of code, it automatically adapts and combines the best machine learning efficiency and compression methods for your use-case.
Make your pipelines efficient by letting Pruna take care of all the tasks involved, whether in GenAI, LLMs, Computer Vision, NLP, Graphs & more
Keep the freedom to try new models and customize your model architecture for your needs; Pruna takes care of the rest
Find the best compute provider for your needs and budget, then squeeze out as much efficiency as you can by leveraging Pruna
Create customized efficiency configs based on your needs, easily save and load the efficient models, and don't worry about compatibility
"As billions are invested in AI development, it is imperative to maximize the efficiency and impact of these resources."
Our product adapts and combines the best efficiency methods for each use-case. This can include quantization, pruning, compilation and other algorithmic optimizations from the latest research and our own work. You can see the details in our documentation and each Hugging Face model's README.
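To make the idea of quantization concrete, here is a toy sketch of symmetric int8 weight quantization in plain Python. This is an illustration of the general technique only, not Pruna's implementation; real systems quantize per-channel, calibrate activations, and use optimized kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto the int range [-127, 127]."""
    # One scale factor for the whole tensor, set by the largest magnitude.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]

# Toy weights standing in for a real layer's parameters.
weights = [0.52, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is now a small integer plus one shared float scale, which is what shrinks storage (8 bits instead of 32 per weight); the reconstruction error is bounded by half the scale, which is why well-tuned quantization barely affects output quality.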
We showcase detailed results for specific models, hardware, and parameters in our list of models on Hugging Face. Gains are often 2-10x, sometimes more and sometimes less. Exact results depend on your own pipelines, so the best way to find out is to request a trial.
Your side. Pruna is a tool that makes your models more efficient on your infrastructure, whether that's a cloud provider you selected (AWS, Google Cloud...), your own cluster, or an edge device.
It depends on the specific configs selected for our product. Some configs do not change quality, while others can slightly alter the output (usually to make the model even faster and smaller). Choose what suits you best, or let our product choose for you. We have put a lot of work into adapting efficiency methods so that their combined impact on model output is minimized.
You can use the efficient models we put on Hugging Face for free (if you respect the original model's license). These are optimized for inference for specific but popular use-cases. If you want the same for other custom models and use-cases, you will need access to our product. Pricing varies and is meant to be win-win, so you get more than you pay for.
Our current product makes your AI models more efficient at inference. Use it after training your models and before deploying them on your target hardware. Our next product iteration will make your model training more efficient too and we're eager for people to try it :)
Our approach integrates a suite of cutting-edge AI model compression techniques. These methods are the culmination of our years of research and numerous presentations at ML conferences including NeurIPS, ICML, and ICLR.
Our product only needs your AI model and specifications about your target hardware for inference. The smashed models could be less flexible if you have a very specific use-case, but that can be worked out with a little support.
We aim to maintain the predictive performance of all smashed AI models, ensuring they're as accurate as their original versions. However, while practical results have consistently met our goals, we cannot provide a theoretical guarantee that predictions will exactly match the original model's. We recommend testing the smashed models on your own internal benchmarks.
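One simple way to run such a check is to measure how often the original and smashed models agree on a held-out sample set. The sketch below is framework-agnostic and purely illustrative; the model stand-ins and function names are hypothetical, not Pruna's API.

```python
def prediction_agreement(original_predict, smashed_predict, samples):
    """Fraction of samples on which both models return the same prediction."""
    matches = sum(
        1 for x in samples if original_predict(x) == smashed_predict(x)
    )
    return matches / len(samples)

# Toy stand-ins for real models: classify a number by its sign.
original = lambda x: x >= 0
smashed = lambda x: x >= 0.01  # slightly perturbed decision boundary

samples = [-1.0, -0.5, 0.005, 0.5, 1.0]
agreement = prediction_agreement(original, smashed, samples)
# agreement == 0.8: the two models disagree only on the sample 0.005
```

In practice you would replace the lambdas with inference calls to the original and smashed models and use your own validation data, plus any task-specific metrics (accuracy, perplexity, image quality scores) that matter for your pipeline.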
Tell us about your use-case, measure what Pruna can do for you and focus on what you do best.