Technical Article

Measuring What Matters: Objective Metrics for Image Generation Assessment

May 21, 2025

Begüm Çig

ML Research Engineer

Stephan Günnemann

Cofounder & Chief Strategy Officer

Bertrand Charpentier

Cofounder, President & Chief Scientist

David Berenstein

ML & DevRel

Generating high-quality visuals with state-of-the-art models is more accessible than ever. Open-source models run on laptops, and cloud services turn text into images in seconds. These models are already reshaping industries like advertising, gaming, fashion, and science.

But creating images is the easy part. Judging their quality is much harder. Human feedback is slow, expensive, biased, and often inconsistent. Plus, quality has many faces: creativity, realism, and style don’t always align. Improving one can hurt another.

That’s why we need clear, objective metrics that capture quality, coherence, and originality. In this post, we’ll look at ways to measure image quality and compare models with Pruna, beyond just "does it look cool?"

Metrics Overview

There is no single correct way to categorize evaluation metrics, as a metric can belong to multiple categories depending on how it is used and the data it evaluates. In our repository, all quality metrics can be computed in single and pairwise modes.

  • Single mode evaluates a model by comparing the generated images to input references or ground truth images, producing one score per model.

  • Pairwise mode compares two models by directly evaluating the generated images from each model together, producing a single comparative score for these two models.

This flexibility enables absolute evaluations (assessing each model individually) and relative evaluations (direct comparisons between models).
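As a rough illustration of the difference (using hypothetical helper names, not Pruna's actual API), the two modes can be sketched like this:

    # Conceptual sketch only: illustrative helper names, not Pruna's API.
    def single_mode_score(metric, generated_images, reference_images):
        # Single mode: one score per model, comparing its outputs to the references.
        return metric(generated_images, reference_images)

    def pairwise_mode_score(metric, images_model_a, images_model_b):
        # Pairwise mode: one comparative score for the pair of models,
        # computed directly on their respective outputs.
        return metric(images_model_a, images_model_b)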

On top of the evaluation modes, it also makes sense to think about metrics in terms of their evaluation criteria to provide structure and clarity. Our metrics fall into two overarching categories:

  • Efficiency Metrics: Measure the speed, memory usage, carbon emissions, energy consumption, etc., of models during inference. At Pruna, we focus on making your models smaller, faster, cheaper, and greener, so evaluating your models with these efficiency metrics is a natural fit. However, because efficiency metrics are not specific to image generation tasks, we won't discuss them in detail in this blog post. If you'd like to learn more about them, please refer to our documentation.

  • Quality Metrics: Measure generated images' intrinsic quality and alignment to intended prompts or references. These include:

    • Distribution Alignment: How closely generated images resemble real-world distributions.

    • Prompt Alignment: Semantic similarity between generated images and their intended prompts.

    • Perceptual Alignment: Pixel-level or perceptual similarity between generated and reference images.

The table below summarizes the most common quality metrics available at Pruna, their categories, score ranges, and key limitations to help guide metric selection.

| Metric | Measures | Category | Range (↑ higher is better / ↓ lower is better) | Limitations |
| --- | --- | --- | --- | --- |
| FID | Distributional similarity to real images | Distribution Alignment | 0 to ∞ (↓) | Assumes Gaussianity, requires a large dataset, depends on a surrogate model |
| CMMD | CLIP-space distributional similarity | Distribution Alignment | 0 to ∞ (↓) | Kernel choice affects results, depends on a surrogate model |
| CLIPScore | Image-text alignment | Prompt Alignment | 0 to 100 (↑) | Insensitive to image quality, depends on a surrogate model |
| PSNR | Pixel-wise similarity | Perceptual Alignment | 0 to ∞ (↑) | Correlates poorly with human perception |
| SSIM | Structural similarity | Perceptual Alignment | -1 to 1 (↑) | Can be unstable for small input variations |
| LPIPS | Perceptual similarity | Perceptual Alignment | 0 to 1 (↓) | Depends on a surrogate model |

Distribution Alignment Metrics

Distribution alignment metrics measure how closely generated images resemble real-world data distributions, comparing both low- and high-level features. In pairwise mode, they compare outputs from different models to produce a single score that reflects relative image quality.

For example:

  • When the generated image closely resembles the real one, the distributions are well aligned, suggesting good quality.

  • When the generated image is noticeably off, the distributions differ significantly, which the metric captures as a mismatch.

  • Fréchet Inception Distance (FID): FID (introduced here) is one of the most popular metrics for evaluating how realistic AI-generated images are. It compares the feature distribution of the reference images (e.g., authentic photographs) to that of the images generated by the model under assessment.

Here’s how it works in a nutshell:

  1. We take a pretrained surrogate model and pass both authentic and generated images through it. The surrogate is usually Inception v3, which explains the metric's name.

  2. The model turns each image into a feature embedding (a numerical summary of the image). We assume the embeddings from each set form a Gaussian distribution.

  3. FID then measures the distance between the two distributions — the closer they are, the better.

A lower FID score indicates that the generated images are more similar to real ones, meaning better image quality. (A minimal code sketch covering both FID and CMMD follows at the end of this section.)

  • CLIP Maximum Mean Discrepancy (CMMD): CMMD (introduced here) is another way to measure how close your generated images are to real ones. Like FID, it compares feature distributions, but instead of using Inception features, it uses embeddings from a pretrained CLIP model.

    Here’s how it works:

    1. We take a pretrained surrogate model and pass both real and generated images through it. The surrogate is usually a CLIP model.

    2. The model turns each image into a feature embedding (a numerical summary of the image). We do not assume the embeddings from each set form a Gaussian distribution.

    3. A kernel function (usually an RBF kernel) is used to measure how the two sets of embeddings differ, without assuming they follow a Gaussian distribution.

    A lower CMMD score indicates that the feature distributions of generated images are more similar to those of real images, meaning better image quality.
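To make both metrics concrete, here is a minimal sketch. The FID part relies on torchmetrics' FrechetInceptionDistance (which needs the torch-fidelity extra installed); the CMMD part is a hand-rolled RBF-kernel MMD over placeholder embeddings, with an illustrative bandwidth rather than the settings used in the CMMD paper or in Pruna.

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance  # needs the `torch-fidelity` extra

    # FID: tiny placeholder batches just to show the API; a real evaluation needs many more images.
    fid = FrechetInceptionDistance(feature=2048)  # Inception v3 pooling features
    real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
    fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    print("FID:", fid.compute().item())  # lower is better

    # CMMD core idea: squared MMD with an RBF kernel over CLIP image embeddings.
    def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
        # sigma is an illustrative bandwidth; in an actual CMMD computation, x and y
        # would be CLIP embeddings of the real and generated images.
        def k(a, b):
            return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma**2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    real_emb = torch.randn(128, 512)       # placeholder "real" embeddings
    generated_emb = torch.randn(128, 512)  # placeholder "generated" embeddings
    print("MMD^2:", rbf_mmd2(real_emb, generated_emb).item())  # lower is better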

Prompt Alignment Metrics

Prompt alignment metrics evaluate how well generated images match their input prompts, especially in text-to-image tasks. In pairwise mode, they instead measure semantic similarity between outputs from different models, shifting focus from prompt alignment to model agreement.

  • CLIPScore: CLIPScore (introduced here) tells you how well a generated image matches the text prompt that produced it. It uses a pretrained CLIP model, which maps both text and images into the same embedding space.

    Here’s the idea:

    1. Pass the image and its prompt through the surrogate CLIP model to get their embeddings.

    2. Measure how close these two embeddings are. The closer they are, the better the alignment between the image and the prompt.

    CLIPScore ranges from 0 to 100. A higher score means the image is more semantically aligned with the prompt. Note that this metric doesn’t look at visual quality, just the match in meaning.
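As a standalone sketch (outside Pruna), CLIPScore can be computed directly with torchmetrics, which the TorchMetricWrapper shown later appears to build on; the random image tensor and the prompt below are placeholders:

    import torch
    from torchmetrics.multimodal.clip_score import CLIPScore  # needs `transformers` installed

    # Placeholder image; in practice this is a generated image tensor (C, H, W) with values in [0, 255].
    image = torch.randint(255, (3, 224, 224), generator=torch.manual_seed(42))
    prompt = "a photo of an astronaut riding a horse"

    clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")
    score = clip_score(image, prompt)  # cosine similarity in CLIP space, rescaled to 0-100
    print(f"CLIPScore: {score.item():.2f}")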

Perceptual Alignment Metrics

Perceptual alignment metrics evaluate the perceptual quality and internal consistency of generated images by comparing pixel-level or feature-level differences between them. These metrics are often pairwise by nature, since directly comparing one set of generated images against another is the natural setup for, say, pixel-by-pixel comparisons.

  • Peak Signal-to-Noise Ratio (PSNR): PSNR measures the pixel-level similarity between a generated image and its reference (ground truth) image. It is widely used for evaluating image compression and restoration models.

    A higher PSNR value indicates better image quality, but PSNR does not always correlate well with human perception.

  • Structural Similarity Index (SSIM): SSIM improves upon PSNR by comparing local patterns of pixel intensities instead of just raw pixel differences. It models human visual perception by considering luminance, contrast, and structure in small image patches.

    SSIM ranges from -1 to 1, where 1 indicates perfect similarity.

  • Learned Perceptual Image Patch Similarity (LPIPS): LPIPS is a deep-learning-based metric that measures perceptual similarity between images using features from a pre-trained neural network (e.g., VGG, AlexNet). Unlike PSNR and SSIM, LPIPS captures high-level perceptual differences rather than pixel-wise differences.

    Let's look at the following example to illustrate how different distortions impact metric scores. The image below showcases various distortions applied to an original image and how metrics like SSIM, PSNR, and LPIPS react to these changes.



    The results in the image illustrate how different types of distortions affect the scores given by these task-based metrics. Notably:

    • Blurred images tend to score higher in SSIM than in PSNR. This suggests that while fine details are lost, the overall structure and patterns of the image remain intact, which aligns with SSIM’s focus on structural consistency.

    • Pixelated images, on the other hand, maintain relatively high PSNR values but drop in SSIM ranking. This indicates that while pixel intensity differences remain small, the structural coherence of the image is significantly degraded—highlighting SSIM’s sensitivity to spatial relationships rather than just pixel-level accuracy.

    These observations demonstrate why selecting the right metric is crucial. Each of the metrics captures different aspects of image quality, making them useful in different scenarios depending on the type of distortion and the perceptual quality being assessed.
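To tie these three metrics together, here is a minimal sketch using torchmetrics; the random tensors stand in for generated and reference image batches, and the VGG backbone for LPIPS is just one common choice:

    import torch
    from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
    from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity  # needs `torchvision`

    # Placeholder batches; in practice these are generated and reference images in [0, 1], shape [N, 3, H, W].
    generated = torch.rand(4, 3, 256, 256)
    reference = torch.rand(4, 3, 256, 256)

    psnr = PeakSignalNoiseRatio(data_range=1.0)
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)  # normalize=True -> inputs in [0, 1]

    print("PSNR :", psnr(generated, reference).item())   # higher is better
    print("SSIM :", ssim(generated, reference).item())   # higher is better
    print("LPIPS:", lpips(generated, reference).item())  # lower is better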

Confidently evaluate AI models with the Evaluation Agent!

The evaluation framework in Pruna consists of several key components:

  • Step 1: Define what you want to measure

    Use the Task object to specify which quality metrics you'd like to compute. You can provide the metrics in three different ways, depending on how much control you need.

    from pruna.evaluation.task import Task
    from pruna.data.pruna_datamodule import PrunaDataModule
    from pruna.evaluation.metrics.metric_torch import TorchMetricWrapper
    
    # Method 1: plain text from predefined options
    evaluate_image_generation_task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string('LAION256'))

    # Method 2: list of metric names
    metrics = ['clip_score', 'psnr']
    evaluate_image_generation_task = Task(request=metrics, datamodule=PrunaDataModule.from_string('LAION256'))

    # Method 3: list of metric instances
    clip_score_metric = TorchMetricWrapper("clip_score", model_name_or_path="openai/clip-vit-base-patch32")
    psnr_metric = TorchMetricWrapper('psnr', base=2.0)
    metrics = [clip_score_metric, psnr_metric]
    evaluate_image_generation_task = Task(request=metrics, datamodule=PrunaDataModule.from_string('LAION256'))
  • Step 2: Run the Evaluation Agent

    Pass your model to the EvaluationAgent and let it handle everything: running inference, computing metrics, and returning the final scores.

    from pruna.evaluation.evaluation_agent import EvaluationAgent
    
    eval_agent = EvaluationAgent(evaluate_image_generation_task)
    results = eval_agent.evaluate(your_model)

As AI-generated images become more prevalent, evaluating their quality effectively is more critical than ever. Whether you're optimizing for realism, accuracy, or perceptual similarity, selecting the right evaluation metric is key. With Pruna now open-source, you have the freedom to explore, customize, and even contribute new evaluation metrics to the community.

Our documentation and tutorials (here) provide a step-by-step guide on how to add your metrics, making it easier than ever to tailor evaluations to your needs. Try it out today, contribute, and help shape the future of AI image evaluation!

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.

© 2025 Pruna AI - Built with Pretzels & Croissants 🥨 🥐
