17. U‑Net: Timeless by Design
How an elegant biomedical tool became the backbone of modern diffusion models
Introduction
If any research contribution deserves a test-of-time award this year, in my opinion, it has to be the U-Net paper from 2015 [1]. Today, we’re taking a closer look at how a network originally designed to efficiently segment microscope images has not only remained a strong baseline for dense prediction tasks, but has also evolved into a core component of modern generative AI systems.
We already encountered the U-Net last week, when talking about Stable Diffusion. This time, we’re taking a step back to appreciate the original model, still incredibly popular for simple image segmentation, before looking into its more modern adaptation for latent denoising. Let’s go!

The Original U‑Net (2015)
The U‑Net was introduced to address a practical problem in biomedical imaging: producing pixel-precise segmentation masks from only a handful of labeled examples. Beyond the careful architectural design, the authors relied heavily on data augmentation (shifts, rotations, elastic deformations, and gray-value variations) to teach the network the desired invariance and robustness properties. And it worked!
Previous methods, most notably the Fully Convolutional Networks (FCNs) [2], had already shown that classification models could be successfully adapted to dense prediction tasks. The core idea was to replace the final fully connected layers with convolutions (allowing the model to handle inputs of arbitrary size) and to attach a deconvolution layer to upsample the features to pixel-dense outputs.
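To make that idea concrete, here is a toy sketch (my own minimal illustration, not the original FCN code): a 1×1 convolution stands in for the fully connected classifier, and a single, aggressive transposed convolution upsamples the coarse score map back to the input resolution.

```python
import torch
import torch.nn as nn

# Toy FCN-style prediction head: a 1x1 convolution replaces the fully connected
# classifier, and one big transposed convolution upsamples the coarse scores x32.
backbone_channels, num_classes = 512, 21  # illustrative values
head = nn.Sequential(
    nn.Conv2d(backbone_channels, num_classes, kernel_size=1),
    nn.ConvTranspose2d(num_classes, num_classes,
                       kernel_size=64, stride=32, padding=16),
)

features = torch.randn(1, 512, 7, 7)   # coarse features from a conv backbone (224 / 32 = 7)
print(head(features).shape)            # torch.Size([1, 21, 224, 224])
```

Recovering a 224×224 prediction from a 7×7 grid in a single jump is exactly why the outputs end up so coarse.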

However, FCNs typically suffered from coarse outputs, due to repeated aggressive downsampling and the lack of an effective mechanism to recover fine-grained spatial detail. As seen in the figure, the prediction head (calling it a decoder would be an exaggeration XD) was both asymmetric and rudimentary, resulting in oversmoothed predictions.
U‑Net modifies and extends the FCN architecture to produce more precise segmentation maps. The main idea is to gradually recover the pixel-level resolution through a sequence of upsampling layers, while improving localization by combining high-resolution features from the encoder path with the upsampled decoder features.
The model gets its name from its distinctive U-shaped architecture, where the input is gradually compressed down to a low-resolution representation and then expanded back up to full resolution, in a roughly symmetric manner.

The architecture consists of three main parts:
Encoder (contracting path), which progressively halves the spatial resolution while doubling the number of feature channels at each step, enabling the network to capture rich contextual information. The encoder is similar to what we find in conventional CNN-based classification models, where the goal is to distill the input image into a compact but very rich feature representation.
Decoder (expansive path), which gradually upsamples the feature maps back to high resolution, in order to make pixel-level predictions. The decoder uses up-convolutions that double the resolution and halve the number of feature channels at each step. Crucially, at every level in the decoder, upsampled features are concatenated with the corresponding intermediate states from the encoder.
“Skip connections”, which copy feature maps from each encoder block straight into the matching decoder block, recovering structures at multiple scales. These are not the additive skip connections we find in ResNets or Transformers, but a straightforward concatenation of feature maps along the channel dimension.
What is the idea? In simple terms, the deeper we go into the encoder, the more the features become semantic. At the bottleneck level, the model might recognize that there is “a cat and a dog in the image”, but may no longer know exactly where they are. By concatenating feature maps from earlier encoder layers, which maintain spatial details, we bring back the “where”. Combined with the semantic “what” from deeper layers, this can produce highly precise segmentation results.
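To make the encoder–decoder–skip pattern concrete, here is a minimal PyTorch sketch of a U-Net-style network. It is a simplified illustration rather than the original architecture: fewer levels and channels, padding-preserving convolutions instead of the paper’s valid convolutions, and names like TinyUNet and DoubleConv are my own.

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """Two 3x3 convolutions with ReLU, as used at every U-Net level."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class TinyUNet(nn.Module):
    """Minimal 3-level U-Net: contract, bottleneck, expand with skip concatenation."""
    def __init__(self, in_ch=3, num_classes=2, base=32):
        super().__init__()
        # Encoder: halve the resolution and double the channels at each step
        self.enc1 = DoubleConv(in_ch, base)
        self.enc2 = DoubleConv(base, base * 2)
        self.bottleneck = DoubleConv(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        # Decoder: up-convolutions double the resolution and halve the channels
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = DoubleConv(base * 4, base * 2)   # *4: upsampled + skip channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = DoubleConv(base * 2, base)
        # 1x1 convolution maps features to per-pixel class scores
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                    # high-resolution features (the "where")
        s2 = self.enc2(self.pool(s1))
        b = self.bottleneck(self.pool(s2))   # most semantic features (the "what")
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))  # skip = channel concat
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)                 # (N, num_classes, H, W) logits


x = torch.randn(1, 3, 128, 128)
print(TinyUNet()(x).shape)  # torch.Size([1, 2, 128, 128])
```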
The Stable Diffusion U‑Net (2021)
The U‑Net is especially well-suited for image-to-image tasks that require both high-level understanding of the semantics and preservation of precise details. More generally, if we think of an image as just a tensor with three channels, it’s easy to see how the architecture can be extended to handle inputs with more channels.
Fast‑forward almost seven years, and text‑to‑image diffusion models needed a network that, given a noisy latent image and a text prompt, could predict the noise added, so that the picture could be denoised step‑by‑step. Stable Diffusion, which we covered in detail last week, keeps the U‑Net skeleton but adapts it to operate on latent tensors with a few crucial modifications:
A Time Embedding projection is added, which encodes the scalar timestep (typically with sinusoidal features) and projects it into a learned vector through a small stack of linear layers. This vector is then injected into each block as an additive bias, telling the network how much noise to remove at the current step.
Self- and Cross‑Attention Layers are also used, in contrast to the classic U‑Net, which is purely convolutional. Self‑Attention lets each spatial location in the latent access context from every other location (e.g., regions on opposite sides of the image) by having the sequence of flattened grid tokens attend to one another. Cross‑Attention injects the text‑prompt information, so the network can align visual features with textual semantics at every spatial location.
ResNet blocks are incorporated (with true additive skip connections this time), providing more stable training for deeper variants and improved expressivity by enabling residual learning at every level.
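A rough sketch of how these pieces slot together, under my own simplifying assumptions (channel sizes, layer names, normalization choices, and the single cross-attention layer are illustrative, not Stable Diffusion’s exact implementation): a sinusoidal timestep embedding projected by a small MLP, a ResNet block that receives it as an additive per-channel bias, and a cross-attention layer where flattened latent tokens attend to text embeddings.

```python
import math
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a batch of scalar timesteps, as used in diffusion models."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (N, dim)


class ResBlockWithTime(nn.Module):
    """Residual block that injects the time embedding as a per-channel additive bias."""
    def __init__(self, ch, t_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, ch)
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, ch)   # maps the time vector to one bias per channel
        self.norm2 = nn.GroupNorm(8, ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.conv1(self.act(self.norm1(x)))
        h = h + self.t_proj(self.act(t_emb))[:, :, None, None]  # broadcast over H, W
        h = self.conv2(self.act(self.norm2(h)))
        return x + h                                            # additive (ResNet) skip


class CrossAttention(nn.Module):
    """Flattened latent tokens (queries) attend to text-prompt tokens (keys/values)."""
    def __init__(self, ch, txt_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.to_kv = nn.Linear(txt_dim, ch)

    def forward(self, x, txt):
        n, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)      # (N, H*W, C): one token per spatial location
        kv = self.to_kv(txt)                  # (N, T, C): projected text embeddings
        out, _ = self.attn(q, kv, kv)
        return x + out.transpose(1, 2).reshape(n, c, h, w)


# Toy forward pass: a 64x64 latent, two timesteps, and 77 text tokens per sample.
x = torch.randn(2, 64, 64, 64)
t_emb = nn.Sequential(nn.Linear(128, 256), nn.SiLU(), nn.Linear(256, 256))(
    timestep_embedding(torch.tensor([10, 500]), 128))
txt = torch.randn(2, 77, 768)
h = ResBlockWithTime(64, 256)(x, t_emb)
h = CrossAttention(64, 768)(h, txt)
print(h.shape)  # torch.Size([2, 64, 64, 64])
```

In the real model, each U-Net level stacks several such ResNet blocks together with transformer blocks combining self- and cross-attention, but the injection pattern is the same idea.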
In summary, the architecture looks roughly like this, with each block containing multiple ResNet layers with skip connections and attention layers, enabling the model to capture long-range dependencies and align to external conditioning signals:

Conclusions
The U‑Net architecture has proven extremely versatile, with a clear design pattern that outlived the specifics of any individual layer type. Its multi‑scale features and the skip connections between the downsampling and the upsampling blocks make it a natural fit for tasks that require translating the input into another modality while preserving pixel-level spatial alignment.
If you need a baseline for any dense prediction task, such as segmentation, training a simple U‑Net is always a good starting point. The best library to do so is without a doubt Segmentation Models PyTorch (SMP) [3], as in the short example below. Hopefully, you now have a clearer picture of what’s happening under the hood when you use one. Thanks for reading this far, see you!
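For instance, a minimal snippet (the encoder choice, input channels, and number of classes are placeholders to adapt to your dataset) might look like this:

```python
import torch
import segmentation_models_pytorch as smp

# U-Net with a ResNet-34 encoder pretrained on ImageNet; swap in_channels and
# classes for your own data (e.g., 1 input channel for grayscale microscopy).
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

x = torch.randn(4, 3, 256, 256)  # a batch of RGB images
logits = model(x)                # (4, 1, 256, 256) per-pixel segmentation logits
print(logits.shape)
```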
References
[1] U-Net: Convolutional Networks for Biomedical Image Segmentation
[2] Fully Convolutional Networks for Semantic Segmentation
[3] Segmentation Models PyTorch (SMP)