Introduction
Welcome back to the second article in our monthly series on personalizing Stable Diffusion! This time, we explore how to teach a pre-trained text-to-image model completely new ideas (such as a personal object, a distinct artistic style, or a custom subject like yourself) while, surprisingly, keeping all of its weights frozen!
This technique, called Textual Inversion, was proposed in August 2022 by researchers at Tel Aviv University and NVIDIA [1]. In simple terms, it allows us to introduce new concepts into a diffusion model by fitting a new text token to represent that idea.
Now, I know I promised DreamBooth in the previous article. That was a small oversight on my end, but don’t worry, it’s coming in the next one! Textual inversion actually makes for the perfect introduction, so I decided to keep the original planned order. Let’s go!
Model inversion and beyond
A big topic in machine learning is model inversion, the process of finding an input that will produce a specific output from a pre-trained model. In image generation, this means finding a latent vector that creates a certain image when fed to the model. Inversion techniques are somewhat linked to personalization, as generating a copy of the input concept is the first step toward creating new variations of it.
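To make the idea concrete, here is a minimal sketch of inversion by direct latent optimization. The names `generator`, `target_image`, and `latent_dim` are hypothetical placeholders standing in for any frozen pre-trained image generator and its input; this is an illustration of the general recipe, not code from the paper.

```python
import torch

# Minimal sketch of model inversion: optimize a latent so that a frozen
# generator reproduces a target image. `generator` and `target_image`
# are hypothetical placeholders for any pre-trained image generator.
def invert_image(generator, target_image, latent_dim=512, steps=1000, lr=0.05):
    generator.eval()
    for p in generator.parameters():
        p.requires_grad_(False)  # the model stays frozen throughout

    z = torch.randn(1, latent_dim, requires_grad=True)  # only the latent is optimized
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        recon = generator(z)  # current reconstruction
        loss = torch.nn.functional.mse_loss(recon, target_image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return z.detach()  # a latent that (approximately) maps to the target image
```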
Editing techniques are also relevant in this space. These methods often incorporate elements of inversion, such as starting from a latent vector that corresponds to the existing image, and then modifying it to reflect a desired change, like altering an attribute or introducing a new element.
Both of these approaches keep the model frozen at all times; it is all about clever pipelining around it and some optimization. Now, can we do the same for personalization with diffusion models? Can we find a way to invert not a specific image into the model’s latent space, but a user-provided concept instead? Of course we can!
Method
The core assumption of textual inversion is that the model already understands the concept we are trying to reproduce. It knows how to generate perfectly realistic pets, people, or teapots. It just doesn't know how to generate you specifically, or that shiny old teapot sitting on your grandmother's kitchen shelf.
Stable Diffusion's only direct conditioning mechanism is the text prompt. So we just need to find, or rather learn, the “right input vector” that describes the specific person, object, or style we want to generate, and use it as a signal to condition the frozen model.
In practice, we designate a placeholder string S∗ to represent the new concept we wish to learn. During the embedding lookup, that token is mapped to a newly learned embedding vector v∗. With this token in the vocabulary, we can then compose new sentences containing the concept, just as we would with any other word.
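To give a feel for what registering such a placeholder can look like in code, here is a rough sketch built on the transformers tokenizer and text encoder of a Stable Diffusion checkpoint. The checkpoint name, the placeholder string <my-teapot>, and the "teapot" initializer word are my own illustrative choices, not prescribed by the paper.

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Sketch: register the placeholder S* as a new token and give it a trainable
# embedding v*, initialized from a coarse description of the concept ("teapot").
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

placeholder_token = "<my-teapot>"                     # this is S*
tokenizer.add_tokens(placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))  # make room for the new row

placeholder_id = tokenizer.convert_tokens_to_ids(placeholder_token)
init_ids = tokenizer.encode("teapot", add_special_tokens=False)
init_id = init_ids[0]  # assumes the initializer word maps to a single token

embeddings = text_encoder.get_input_embeddings().weight.data
embeddings[placeholder_id] = embeddings[init_id].clone()  # v* starts out as "teapot"
```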

To learn this new embedding, we use a small set of images (typically 3-5) that depict the target concept across multiple settings, such as various backgrounds or poses. We find v∗ through direct optimization, backpropagating the standard diffusion loss through the text encoder. More specifically, at each training step, we:
Sample one of the images of the concept and a prompt of the form “A photo of S∗”, “A rendering of S∗”, and so on.
Add noise to the image, reusing the same training scheme as the original latent diffusion model.
Backpropagate the loss between the predicted noise and the ground-truth added noise, updating only the embedding v∗. Neither the diffusion model nor the text encoder is altered!
Notably, this is a reconstruction task, and as such, we expect the learned embedding to capture fine visual details unique to the concept. We repeat for a few thousand optimization steps, sampling different noise levels and prompts every time, until convergence.
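For readers who want something more tangible, here is a rough sketch of what a single such optimization step could look like with diffusers-style components. The names (vae, unet, noise_scheduler, placeholder_id, and so on) are assumptions on my side, and details like device placement and data loading are left out; treat it as a sketch of the training loop, not a drop-in implementation.

```python
import torch
import torch.nn.functional as F

# One optimization step of textual inversion, assuming `vae`, `unet`,
# `noise_scheduler`, `text_encoder`, and `tokenizer` come from a Stable Diffusion
# checkpoint, `image` is a preprocessed concept image batch, and `prompt` is e.g.
# "A photo of <my-teapot>". The optimizer only holds the token-embedding parameters.
def training_step(image, prompt, vae, unet, text_encoder, tokenizer,
                  noise_scheduler, optimizer, placeholder_id):
    # 1. Encode the image into the latent space of the frozen VAE
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    # 2. Add noise at a random timestep, exactly as in regular diffusion training
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # 3. Condition on the prompt containing S*
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
    encoder_states = text_encoder(ids)[0]

    # 4. Predict the noise and compare it to the ground-truth added noise
    pred = unet(noisy_latents, t, encoder_hidden_states=encoder_states).sample
    loss = F.mse_loss(pred, noise)

    # 5. Backpropagate, then zero out gradients for every embedding row except v*,
    #    so only the new token's vector is updated
    loss.backward()
    grads = text_encoder.get_input_embeddings().weight.grad
    mask = torch.arange(grads.shape[0]) != placeholder_id
    grads[mask] = 0.0
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```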
Main advantages
As we just saw, textual inversion is conceptually very simple: we just learn one text embedding vector and that’s it! Compared to LoRA or full fine-tuning, this offers several key benefits:
Storage efficiency. The resulting embedding vector is extremely compact (just a few kilobytes), making it easy to store, share, and reuse across projects without the need to move around large model checkpoints.
Modularity. Multiple custom embeddings can be trained for different concepts and used interchangeably or in combination, as shown in the sketch after this list.
Lightweight training. Optimizing only the embedding vector for the new token keeps the complexity minimal. Backpropagating through the frozen U-Net and text encoder is still expensive, but in general, it can be performed without issues on a standard consumer GPU.
Preservation of the original model. The base model remains unchanged, ensuring that its capabilities are not altered for any other generation. However, this is also the biggest limitation: because the model itself never specializes, very particular subjects may not be captured faithfully. At the end of the day, we are trying to compress an identity into roughly a thousand numbers, not easy!
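As a small illustration of the modularity point, here is what combining two independently trained embeddings could look like using the load_textual_inversion helper from diffusers. The checkpoint name, embedding files, and tokens are made up for the example, and in practice you would load whatever files your own training runs produced.

```python
from diffusers import StableDiffusionPipeline

# Sketch: two independently trained embeddings loaded into the same frozen
# pipeline and combined in a single prompt. File names and tokens are illustrative.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
# pipe = pipe.to("cuda")  # optional: move to GPU if available

pipe.load_textual_inversion("./my_teapot_embedding.bin", token="<my-teapot>")
pipe.load_textual_inversion("./my_style_embedding.bin", token="<my-style>")

image = pipe("A photo of <my-teapot> in the style of <my-style>").images[0]
image.save("combined_concepts.png")
```

Because the base model is untouched, adding or removing an embedding never affects generations for any other prompt.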
Conclusion
Textual inversion is the simplest way to personalize text-to-image models to generate images that reflect unique and specific concepts. Despite its elegance and efficiency, it’s not as widely used today, with most of the current focus shifting to DreamBooth, coming (for real this time) in the next article!
Still, if you want to experiment and test your understanding of diffusion models, implementing textual inversion yourself can be a great exercise. If not, you can use the ready-made script provided in the diffusers library [2].
As usual, thanks for reading this far, and see you in the next one!
References
[1] R. Gal et al., “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion”, 2022. arXiv:2208.01618
[2] Hugging Face diffusers, textual inversion training example: https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion