Technical Article

An Introduction to the Architectures Powering Today's LLMs

Sep 9, 2025

Sara Han Díaz

DevRel Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

Large Language Models (LLMs) have rapidly become a hot topic in various fields over the past few years. At Pruna, the focus has been clear: make these models smaller, faster, cheaper, and greener. To this end, the team has explored and provided different optimization techniques, from caching and model compilation to advanced quantization.

For an overview of AI model optimization techniques, see this blog.

However, these individual optimizations are just pieces of a much larger machine; to understand how it works, we must lift the hood and examine the engine. This blog post gives an overview of the key architectures powering today's language models: Autoregressive Models, State-Space Models, and Diffusion-based Models. It does not attempt to cover every mathematical detail, but focuses on the central intuition behind each.

Where It All Begins: Tokenizers and Embeddings

Before we dive into the intricate inner workings, it’s worth remembering that an LLM can’t “think” until it first “reads” your request, something it does through tokenization and embedding.

For example, if you ask, "How do I optimize a model?", the model doesn’t receive that sentence as you wrote it. Instead, it's first tokenized, i.e., the text is broken into smaller, more frequent chunks known as tokens. The process involves the following steps:

  1. Text normalization standardizes case and punctuation to ensure consistency.

  2. Pre-tokenization breaks the text into rough chunks, such as words or subwords.

  3. The actual tokenization kicks in. This step can vary slightly between models depending on design choices: the tokenization method (most commonly Byte Pair Encoding, or BPE, and its variants), the vocabulary and special tokens that define the model’s “dictionary,” and the training data that influences how the tokenizer learns the patterns to split the input.

When it’s time to generate text, the model maps each token’s ID back to its original text fragment. But tokens alone aren’t enough — the model needs to understand their meaning and relationships, and work with numerical representations. That’s where embeddings come in. Each token ID is transformed into a high-dimensional vector that captures the meaning of the word based on how it was used in the training set. This is what allows LLMs to grasp intent, subtlety, and meaning far beyond basic definitions.
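To make this concrete, here is a minimal sketch of the tokenize-then-embed step using the Hugging Face transformers library; the gpt2 checkpoint is only an example, and other models follow the same pattern.

```python
# A rough sketch of tokenization and embedding with the Hugging Face
# "transformers" library; "gpt2" is just an example checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "How do I optimize a model?"

# 1) Tokenization: the text is split into tokens and mapped to integer IDs
#    from the model's vocabulary.
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print(tokens)      # subword strings
print(token_ids)   # their integer IDs

# 2) Embedding: each ID is looked up in the embedding table and becomes a
#    high-dimensional vector (hidden_size is 768 for gpt2).
input_ids = torch.tensor([token_ids])
embeddings = model.get_input_embeddings()(input_ids)
print(embeddings.shape)   # (batch, sequence_length, hidden_size)

# 3) Decoding maps token IDs back to text fragments.
print(tokenizer.decode(token_ids))
```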

Token-by-Token: The Autoregressive Way

Many LLMs are autoregressive, i.e., they generate text by predicting the next token in a sequence, one token at a time. The Transformer architecture, which works this way, powers most of today's leading models.

Once we step into a Transformer, we find a stack of transformer blocks. Each block processes the incoming tokens and passes its results to the next. At the heart of each block are two operations: self-attention and a feed-forward network.

The self-attention mechanism determines how important each token is relative to all others in the sequence. The process works as follows (a minimal code sketch follows the list):

  • The model computes attention scores by multiplying the query vector of the current token with the key vectors of all other tokens.

  • After scaling and a softmax normalization, each score is used to weight the corresponding value vector. The weighted sum of these values becomes the output of the attention layer.

  • When a query and key are a strong match — meaning they produce a high attention score — the associated value has a stronger influence on the final output.

  • Transformers use multi-head attention, i.e., multiple attention mechanisms ("heads") are run in parallel to increase the model's ability to capture different types of relationships. Each head focuses on different aspects of the input, combining their outputs to form a richer representation.
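To ground the bullet points above, here is a minimal single-head sketch of scaled dot-product attention in PyTorch. Real implementations add causal masking, dropout, learned query/key/value projections, and multiple heads; treat this as an illustration of the scoring-and-weighting idea only.

```python
# Minimal single-head scaled dot-product attention (illustrative only).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (sequence_length, head_dim)
    d = q.size(-1)
    # 1) Attention scores: each token's query against the keys of all tokens.
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (seq_len, seq_len)
    # 2) Normalize the scores with a softmax so each row sums to 1.
    weights = F.softmax(scores, dim=-1)
    # 3) Each output is a weighted sum of the value vectors: a strong
    #    query/key match gives the matching value more influence.
    return weights @ v                            # (seq_len, head_dim)

seq_len, head_dim = 5, 16
x = torch.randn(seq_len, head_dim)
# In a real transformer block, q, k, v come from learned linear projections of x;
# multi-head attention runs several such heads in parallel and concatenates them.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([5, 16])
```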

After the self-attention step, the output at each position is passed through a feed-forward neural network, a stack of dense layers with non-linear activation functions like ReLU or GeLU. This helps the model detect complex patterns that attention alone might miss.

Finally, each sub-layer (self-attention and feed-forward) is wrapped with residual connections and layer normalization, which helps stabilize the model and allows for deeper networks.
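Putting the pieces together, a simplified transformer block might look like the sketch below. It uses PyTorch's built-in multi-head attention and a pre-norm layout (some models instead apply layer normalization after each sub-layer, as in the original paper); it is a didactic sketch, not any particular model's implementation.

```python
# A simplified pre-norm transformer block: self-attention + feed-forward,
# each wrapped in layer normalization and a residual connection.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network: dense layer -> non-linearity -> dense layer.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Self-attention sub-layer with residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with residual connection.
        return x + self.ffn(self.norm2(x))

block = TransformerBlock()
tokens = torch.randn(2, 10, 512)   # (batch, sequence_length, d_model)
print(block(tokens).shape)         # torch.Size([2, 10, 512])
```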

Source: https://arxiv.org/abs/1706.03762

To make the Transformer more efficient, several optimizations are often applied to different parts of the transformer block:

  • Since the attention mechanism is typically the main computational bottleneck, various strategies have focused on reducing its load:

    • KV caching stores previously computed keys and values to speed up text generation significantly by avoiding redundant computations (a minimal sketch of this idea follows the list).

    • Sparse Attention limits focus to a subset of tokens.

    • Sliding Window Attention restricts attention to the most recent tokens.

    • Flash Attention improves GPU memory usage and throughput.

    • Paged Attention manages KV caches more effectively for long sequences.

    • Multi-Query Attention (MQA) lowers computational cost by sharing keys and values across all attention heads.

  • The feed-forward network can be improved with another powerful approach, the Mixture of Experts (MoE). It replaces the traditional single feed-forward block with multiple expert networks specialized in different patterns or topics, selectively activated through a gating mechanism. Because only a subset of experts runs for each input, the model can scale its parameter count efficiently without a proportional increase in compute.
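As promised above, here is a toy sketch of the KV-caching idea during autoregressive decoding. It ignores batching, masking, multiple heads, and the rest of the block; the projection matrices are random stand-ins for learned weights.

```python
# Toy sketch of KV caching: at each decoding step, only the new token's key and
# value are computed; earlier keys/values are reused from the cache instead of
# being recomputed from scratch.
import torch
import torch.nn.functional as F

def attend(query, keys, values):
    # query: (1, d); keys/values: (t, d) -> attention output for the new token.
    scores = query @ keys.T / keys.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ values

d = 16
w_k = torch.randn(d, d)          # random stand-ins for the learned
w_v = torch.randn(d, d)          # key/value projection matrices

cached_k = torch.empty(0, d)     # the KV cache grows by one row per token
cached_v = torch.empty(0, d)

for step in range(5):
    x_new = torch.randn(1, d)                        # embedding of the newest token only
    cached_k = torch.cat([cached_k, x_new @ w_k])    # project ONLY the new token
    cached_v = torch.cat([cached_v, x_new @ w_v])
    out = attend(x_new, cached_k, cached_v)          # attend over all tokens so far
    # 'out' would feed the rest of the block and the next-token prediction.
```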

Thinking in States: A Different Way to Model Sequences

While autoregressive models like Transformers generate text by predicting the next token based on all previously seen tokens, State Space Models (SSMs) take inspiration from physics: at each time step, they map the input sequence into a latent state representation and predict the output sequence from that state.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state

SSMs can represent the relationship between input, state, and output in three different ways, each capturing only the most relevant information. Depending on the task, the stage of the process, or the type of data, it is possible to switch between these representations, although it takes some advanced methods to exploit the most efficient one for the problem at hand and maximize performance.

Continuous Representation

  • Core idea: describes how the state changes smoothly over time.

  • Advantages: ideal for data with irregular or time-shifted sampling; mathematically tractable analysis.

  • Disadvantages: very slow training and inference.

  • Suitability: handling continuous data.

Recurrent Representation

  • Core idea: breaks time into steps, updating the current state based on the previous state and new input.

  • Advantages: a natural fit for sequences; efficient inference.

  • Disadvantages: slow training; gradient issues in very long sequences.

  • Suitability: efficient inference.

Convolutional Representation

  • Core idea: updates the current state using a weighted history of previous states.

  • Advantages: local, interpretable features; parallelizable training.

  • Disadvantages: inefficient in online/autoregressive use; fixed context size.

  • Suitability: fast training via parallelization.

To handle the complexity of natural language, deep SSMs stack multiple state space layers and add non-linear transformations. In this setup, the SSM blocks handle dependencies across tokens in the sequence, while the non-linear layers capture dependencies across embedding dimensions. This division of labor allows the model to represent intricate language patterns while still benefiting from the efficiency of state-tracking mechanisms.
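As a rough illustration of the recurrent view, the toy scan below applies the standard discretized state-space update, where the state is folded forward one step at a time and the output is read from the state. The matrices are random stand-ins; real SSM layers such as Mamba learn them (and make them input-dependent), which this sketch does not attempt to show.

```python
# Toy recurrent state-space scan: h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
# A, B, C are random stand-ins here; real SSM layers learn them (and, in
# selective SSMs like Mamba, make them depend on the input).
import numpy as np

state_dim, input_dim, seq_len = 8, 4, 10
rng = np.random.default_rng(0)

A = rng.normal(size=(state_dim, state_dim)) * 0.1   # state transition
B = rng.normal(size=(state_dim, input_dim))         # input -> state
C = rng.normal(size=(input_dim, state_dim))         # state -> output

x = rng.normal(size=(seq_len, input_dim))           # input sequence
h = np.zeros(state_dim)                             # latent state

outputs = []
for t in range(seq_len):
    h = A @ h + B @ x[t]       # fold the new input into the state
    outputs.append(C @ h)      # read the output from the state
y = np.stack(outputs)          # (seq_len, input_dim)
print(y.shape)
```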

Removing the Noise: Diffusion LLMs

In the world of computer vision, one of the most groundbreaking advances in recent years has been diffusion models. The core idea is quite intuitive: start with an image and gradually add random noise over many steps until it turns into pure noise — resembling TV static or white noise. Then, train a model to reverse this process — step by step — by learning how to remove the noise and recover the original image (or generate a completely new one). Through this iterative denoising, the model learns the underlying patterns and structures of visual data, encoding that knowledge into a latent space, i.e., a map of all the possible images the model could generate, where each point represents a unique combination of learned features.

Similar principles have recently been explored in the context of language modeling, where researchers are adapting diffusion-based approaches to generate text. In this case, the process begins with a random noise representation, which is then gradually refined and “denoised” into a coherent sequence of tokens.

Unlike traditional autoregressive models that generate one token at a time, diffusion-based language models produce the entire sequence simultaneously (although they can also operate in a semi-autoregressive fashion, predicting the sequence block by block). This makes the process inherently parallelizable and potentially more efficient, especially during inference. In addition, because they consider the whole text structure at once, they may be naturally better at logical reasoning and generating well-structured responses. Their ability to continuously refine the output also holds promise for reducing hallucinations and minimizing errors.
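To make the iterative-refinement idea concrete, here is a toy sketch of masked-diffusion-style text generation: start from a fully masked sequence, predict every position in parallel, and commit only the most confident predictions at each step. The "denoiser" below is a random stand-in for a trained model and the schedule is invented for illustration; this is not the exact algorithm of LLaDA, Mercury Coder, or any specific model.

```python
# Toy sketch of iterative denoising for text: start fully masked, repeatedly
# predict all positions in parallel, and "commit" the most confident ones.
# fake_denoiser is a random stand-in for a trained diffusion language model.
import torch

vocab_size, seq_len, n_steps = 100, 12, 4
MASK = -1  # placeholder ID for a "noisy" (masked) position

def fake_denoiser(tokens):
    # A trained model would return logits conditioned on the partially
    # denoised sequence; random logits are enough to show the control flow.
    return torch.randn(seq_len, vocab_size)

tokens = torch.full((seq_len,), MASK)            # pure "noise": everything masked

for step in range(n_steps):
    still_masked = tokens == MASK
    if not still_masked.any():
        break
    logits = fake_denoiser(tokens)               # predict every position at once
    confidence, candidates = logits.softmax(dim=-1).max(dim=-1)
    confidence[~still_masked] = -1.0             # never overwrite committed tokens
    # Commit a fraction of the remaining masked positions, most confident first.
    n_to_fill = max(1, int(still_masked.sum()) // (n_steps - step))
    fill_at = confidence.topk(n_to_fill).indices
    tokens[fill_at] = candidates[fill_at]

print(tokens)  # a fully denoised sequence of token IDs
```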

Source: https://arxiv.org/abs/2502.09992

Overview at a Glance

Now that we’ve walked through the main architectures, it’s time to recap!


Autoregressive LLMs

  • Core idea: sequential token prediction via conditional probabilities.

  • Computational cost: high

  • Inference speed: slow to medium

  • Long-context: limited by memory

  • Interpretability: medium

  • Examples: GPT, LLaMA, Mistral

State-Space LLMs

  • Core idea: sequence modeling via state-space equations.

  • Computational cost: low

  • Inference speed: fast

  • Long-context: designed for long sequences

  • Interpretability: medium

  • Examples: Mamba

Diffusion LLMs

  • Core idea: iterative noise reduction.

  • Computational cost: high

  • Inference speed: medium to fast

  • Long-context: limited by memory

  • Interpretability: low

  • Examples: LLaDA, Mercury Coder

While we’ve gone over the core ideas of these architectures, keep in mind that each admits further variations depending on how encoding and decoding are designed for specific tasks.

What's Next?

In this blog post, we gave an overview of the main architectures behind today’s cutting-edge LLMs. Understanding these foundations is key to optimizing performance and choosing where to focus your efforts.

Enjoy the Quality and Efficiency!

Want to take it further?

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.
