Introduction
Unlike words, which often carry standalone semantic meaning, a single pixel in an image offers little to no contextual information. But if we take a sequence of pixels, can we learn something useful from these atomic visual units?
This is the question posed in the paper Generative Pretraining from Pixels [1] by OpenAI, published at ICML 2020. Drawing inspiration from advances in unsupervised learning for language (GPT-2 and BERT at the time), this work investigates whether similar approaches can also capture meaningful representations for images.
Why do we need pretraining?
The main motivation behind unsupervised (i.e. with no labels) pretraining of networks is that learning the general data distribution facilitates transferring to downstream tasks later on. For instance, a model pretrained to predict missing pixels in an image must learn fundamental features such as edges and textures to solve the task, which can later benefit applications like image classification.

This idea of pretraining on images is not new. Autoencoders, including the denoising variant, are built on a similar principle. The innovation here lies in recognizing that generative sequence modeling is a general-purpose algorithm, and therefore worth applying to vision with no inductive prior. The paper employs a plain transformer architecture: no biases such as spatial locality or hierarchical structure are enforced, letting the model learn image features purely from data.
Challenges in modeling pixel sequences
There is a big problem to tackle right off the bat: modeling pixel sequences is extremely expensive. First, each (R, G, B) component can take 256 possible values, resulting in approximately 17 million unique input pixel colors. Second, the memory requirements of a transformer scale quadratically with the context length, so flattening even a small 224 x 224 image results in an enormous context size and even a single layer would not fit on a GPU.
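A quick back-of-the-envelope calculation makes the scale of both issues concrete (my own arithmetic, not figures from the paper):

```python
# Back-of-the-envelope arithmetic for both problems (my own numbers, not from the paper).
rgb_colors = 256 ** 3                 # every (R, G, B) combination
print(f"{rgb_colors:,} possible pixel colors")          # 16,777,216 (~17M)

seq_len = 224 * 224                   # one token per pixel of a "small" image
attn_entries = seq_len ** 2           # self-attention memory grows quadratically
print(f"{seq_len:,} tokens -> {attn_entries:,} attention entries per head, per layer")
# 50,176 tokens -> ~2.5 billion entries, far too large for a single GPU
```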
To address the first issue, the paper introduces a custom color palette by clustering RGB values into 512 clusters. This way, the input space vocabulary is significantly reduced while still encoding color relatively faithfully. To deal with the second one, the solution is to process images at a lower resolution, so that the flattened pixels are at most a few thousand. Processing resolutions used are 32 x 32, 48 x 48, and 64 x 64 pixels depending on the model capacity.
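To illustrate the idea, here is a minimal sketch of how such a palette could be built and applied with scikit-learn and PIL. The 512 clusters and the 32 x 32 resolution come from the paper, while the function names and implementation details are my own assumptions, not the authors' code:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def build_palette(sample_images, n_colors=512):
    """Cluster (R, G, B) values from a sample of images into a small color palette."""
    pixels = np.concatenate(
        [np.asarray(im.convert("RGB")).reshape(-1, 3) for im in sample_images]
    )
    return KMeans(n_clusters=n_colors, n_init=1).fit(pixels)

def image_to_sequence(image, palette, resolution=32):
    """Downsample, map each pixel to its nearest palette color, and flatten in raster order."""
    small = image.convert("RGB").resize((resolution, resolution), Image.BILINEAR)
    pixels = np.asarray(small).reshape(-1, 3)
    return palette.predict(pixels)  # e.g. 32 * 32 = 1024 tokens, each in [0, 511]
```

Note how quantizing each pixel to a single palette index also avoids splitting every pixel into three separate R, G, B tokens, keeping the sequence three times shorter.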
Training strategies, models, and evaluation setup
Two pretraining strategies are analyzed, both borrowed directly from natural language tasks (a rough sketch of each loss follows this list):
The autoregressive objective (GPT) [2], or in simple words, predicting the next element in the sequence given the previous context. Pixels are fed in the natural raster order.
The BERT objective [3], where we mask out 15% of the pixels and train the model to fill in the blanks conditioned on the available information.
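To make the difference between the two objectives concrete, here is a rough PyTorch sketch of both losses. It assumes a hypothetical model that maps a token sequence to per-position logits over the 512-color vocabulary, and uses a dedicated mask token id as a stand-in for however the masking is actually implemented, so treat it as an illustration rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(model, tokens):
    # Predict each pixel token from the ones before it (the model is assumed
    # to apply causal masking internally, as in GPT).
    logits = model(tokens[:, :-1])                       # (batch, seq_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def bert_loss(model, tokens, mask_token_id, mask_prob=0.15):
    # Corrupt ~15% of the positions and predict only the masked pixels.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_token_id)
    logits = model(corrupted)                            # (batch, seq_len, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])
```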
Four iGPT model classes are trained based on the GPT-2 architecture, ranging from 74M to 6.8B parameters, with feature sizes scaling proportionally from 512 to 3072 dimensions. All models are pretrained on ImageNet (1.2 million images) as a proxy for a large unlabeled corpus, but the largest variant, iGPT-XL, additionally ingests a corpus of 100M unlabeled web images filtered to be similar to ImageNet.

The learned representations are evaluated on small, classic labeled datasets (CIFAR-10, CIFAR-100, STL-10) for image classification, as proxies for downstream tasks, using two common approaches:
Full fine-tuning. Features from the last layer are averaged, a classification head is added on top, and the entire model is trained, adapting all weights to the new task. The pretrained weights are seen as a favorable initialization compared to, for example, a random one.
Linear probing. The model is treated as a feature extractor: we take the learned representations at a particular layer, average them, keep them frozen, and train a linear classifier on top. This rests on the assumption that good features should linearly separate the classes of the transfer task.
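To make the linear probing recipe concrete, here is a minimal sketch that uses scikit-learn's logistic regression as a stand-in for the linear classifier. It assumes the per-position features have already been extracted from the frozen model at the layer of interest, so treat it as an illustration rather than the paper's exact protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Features have shape (n_images, seq_len, d): one d-dimensional vector per
    pixel position, taken from a single layer of the frozen pretrained model."""
    # Average over the sequence dimension to get one vector per image.
    X_train = train_feats.mean(axis=1)
    X_test = test_feats.mean(axis=1)
    # Train only a linear classifier on top; the backbone is never updated.
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```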
Results
An interesting finding from linear probing is that the best features sit in the middle of the network: probe accuracy first improves with depth, peaks around the middle layers, and then declines, so feature quality is unimodal rather than increasing monotonically. This suggests that the model might operate in two phases: a first one in which it builds a global, contextualized representation of the image, and a second one in which that representation is used to predict pixel values. This makes sense, since the objective of pixel prediction is not related to image classification at all, so there is no reason why features from the last layer should be the most predictive of object category.

The second result worth mentioning is that improved generative performance, measured by a lower validation loss, correlates with better feature quality and therefore better results on downstream tasks. Moreover, bigger models produce better representations than smaller ones at the same value of the validation loss. So once again, scaling transformers (if you have the compute XD) seems the way to go when in doubt.

Finally, BERT models perform significantly worse in linear probing settings compared to autoregressive models, but they catch up after full fine-tuning. Overall, full fine-tuning consistently achieves higher accuracy on downstream tasks than linear probing.
Conclusions
This work demonstrates that a sequence transformer can effectively compete with convolutional networks for unsupervised pretraining. A generative model can successfully learn high-quality global representations that are transferable to downstream tasks by simply predicting pixels.
However, this route is far less parameter- and compute-efficient than other methods, and its direct application is limited to low resolutions. On the other hand, considering that no image-specific prior is passed to the model, it also shows that it is possible to trade domain knowledge for compute and still get the job done.
More importantly, showing that something is possible is enough to spark plenty of future research, and we’ll see in future articles how this approach inspired other pretraining methods. Until then, stay tuned, and thanks for reading this far!
References
[1] Chen et al., Generative Pretraining from Pixels, ICML 2020
[2] Radford et al., GPT-2: Language Models are Unsupervised Multitask Learners, 2019
[3] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019