M.Sc. Thesis · FAU Erlangen-Nürnberg · 2026

Finetuning Visual Autoregressive Models for Controllable Image Generation

Computer Vision · Generative AI · PyTorch · First of its kind

The first framework to integrate spatial control into a scale-wise autoregressive text-to-image model. Two novel architectures, six control modalities, and a unified model that routes them all with a single checkpoint.

2.5B model parameters
6 control modalities
1st in research landscape

Motivation

The Empty Cell

Spatial control existed for diffusion models. For autoregressive T2I models, one specific combination remained unexplored.

The research landscape for controllable autoregressive generation can be mapped along two dimensions: the generation task (text-to-image vs class-to-image) and the prediction paradigm (next-token vs next-scale).

When examined collectively, prior work — ControlAR, CAR, ControlVAR, and SCALAR — leaves exactly one cell vacant: text-to-image with next-scale prediction.

Why is this cell the hardest?
C2I models avoid dual conditioning (one class label, not open-ended text). ControlAR avoids hierarchical alignment (flat token stream, not 10 resolution scales). Our framework confronts both simultaneously.
Columns: Next-Token · Next-Scale
Text-to-Image: ControlAR · ★ This work (Switti-Control)
Class-to-Image: CAR · ControlVAR · SCALAR
Dual-conditioning complexity · Hierarchical alignment · Multi-scale control

Methodology

Two Architectures, One Philosophy

Both approaches keep the SWITTI backbone frozen and use zero-initialized projections to start training from pretrained behavior.
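
A minimal PyTorch sketch of the frozen-backbone, zero-initialized-projection idea described above. The class name `ZeroInitInjection`, the helper `freeze`, the stand-in backbone block, and all tensor sizes are hypothetical and chosen for illustration only; the thesis code may be organized differently.

```python
import torch
import torch.nn as nn


class ZeroInitInjection(nn.Module):
    """Adds projected control features to hidden states (hypothetical module)."""

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(control_dim, hidden_dim)
        # Zero weight and bias: the projection outputs exactly 0 before training.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        return hidden + self.proj(control)


def freeze(module: nn.Module) -> None:
    """Disable gradients for every parameter of a pretrained module."""
    for p in module.parameters():
        p.requires_grad_(False)


torch.manual_seed(0)
backbone_block = nn.Linear(64, 64)  # stand-in for one pretrained block (assumption)
freeze(backbone_block)
inject = ZeroInitInjection(control_dim=32, hidden_dim=64)

tokens = torch.randn(2, 16, 64)   # (batch, tokens, hidden_dim)
control = torch.randn(2, 16, 32)  # (batch, tokens, control_dim)

baseline = backbone_block(tokens)       # output of the frozen block alone
controlled = inject(baseline, control)  # same output plus a zero-valued control term
```

Before any training step, `controlled` equals `baseline`, so the model starts from pretrained behavior. The projection still receives non-zero weight gradients, so it can learn to inject control while the backbone stays fixed.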

Architecture overview

Encoder Injection

Lightweight

A frozen DINOv2 ViT-B/14 encoder extracts rich features from the control image. These features are pooled to match each of SWITTI's 10 generation scales, then injected after self-attention and before text cross-attention.

DINOv2 ViT-B/14 · 31M params (additive) · 282M params (cross-attn) · Frozen backbone · 10 scale pools
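
The per-scale pooling step could look roughly like the sketch below. The side lengths in `SCALE_SIDES` are hypothetical (the actual scale schedule is defined by the backbone and is not given here), the use of average pooling is an assumption, and a random tensor stands in for real encoder features.

```python
import torch
import torch.nn.functional as F

# Hypothetical side lengths for the 10 generation scales (assumption).
SCALE_SIDES = (1, 2, 3, 4, 6, 9, 13, 18, 24, 32)


def pool_to_scales(features: torch.Tensor, sides=SCALE_SIDES):
    """Pool a (B, C, H, W) feature map to one (B, side*side, C) tensor per scale."""
    pooled = []
    for side in sides:
        # Average-pool the patch grid down to side x side cells.
        x = F.adaptive_avg_pool2d(features, output_size=(side, side))
        # Flatten the grid into a token sequence: (B, C, side, side) -> (B, side*side, C).
        pooled.append(x.flatten(2).transpose(1, 2))
    return pooled


# A 518x518 input with patch size 14 yields a 37x37 patch grid; 768 is the ViT-B width.
patch_features = torch.randn(1, 768, 37, 37)  # random stand-in for encoder output
per_scale = pool_to_scales(patch_features)
```

Each entry of `per_scale` then has as many tokens as the corresponding generation scale, so it can be injected at that scale.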

Parallel Control Branch

Full-scale

A full copy of the SWITTI backbone runs in parallel, processing the control image and injecting control features via zero-initialized linear layers — the same design principle that made ControlNet work, reimagined for next-scale autoregressive generation.

ControlNet-style · ~2.5B trainable · Zero-initialized injection · Learned modality embedding

Results

Qualitative Results

Parallel Control Branch outputs across three modalities. The model follows the structural layout of the control map while respecting the text prompt.

Figure grid (columns: Original · Control Map · Generated)
HED: original scene · HED edge map · generated from HED
Depth: original scene · segmentation map · generated from segmentation
Canny: original scene · Canny edge map · generated from Canny

Key Results

By the Numbers

+236%
Segmentation mIoU improvement
Encoder Injection → Parallel Branch
+84%
Canny F1 improvement
0.197 → 0.362 (Additive → Parallel)
10–15%
Gap: unified vs. specialist models
One checkpoint serves all 6 modalities
10×
Faster than next-token AR
10 scale steps vs. 1024 token steps
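
As a quick check, the Canny percentage can be recomputed from the two F1 scores reported above. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a percentage."""
    return (after - before) / before * 100.0


# Canny F1 values reported above: Additive 0.197, Parallel 0.362.
canny_gain = relative_improvement(0.197, 0.362)
print(f"Canny F1: {canny_gain:+.1f}%")  # prints "Canny F1: +83.8%"
```

Rounded to the nearest percent, this gives the +84% quoted above.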

Unified Control

One Model. Six Modalities.

A single checkpoint with a learned modality embedding routes six structurally distinct control signals without interference. Within 10–15% of per-modality specialists.
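
One plausible way to realize a learned modality embedding is sketched below: one learned vector per modality, added to every control token. The class name `ModalityConditioner`, the additive combination, and the tensor sizes are assumptions; the text does not specify how the embedding enters the model.

```python
import torch
import torch.nn as nn

# The six control modalities named on this page.
MODALITIES = ("canny", "hed", "normals", "segmentation", "grayscale", "depth")


class ModalityConditioner(nn.Module):
    """Adds a learned per-modality vector to every control token (hypothetical)."""

    def __init__(self, dim: int, num_modalities: int = len(MODALITIES)):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, dim)

    def forward(self, control_tokens: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # (B, dim) -> (B, 1, dim), then broadcast over the token axis.
        return control_tokens + self.embed(modality_ids).unsqueeze(1)


torch.manual_seed(0)
conditioner = ModalityConditioner(dim=64)
tokens = torch.randn(2, 16, 64)  # (batch, tokens, dim), random stand-in
ids = torch.tensor([MODALITIES.index("canny"), MODALITIES.index("depth")])
out = conditioner(tokens, ids)
```

Because each sample carries its own modality id, a single set of weights can serve mixed-modality batches, which matches the single-checkpoint setup described above.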

Panels: original · Canny · HED · normals · segmentation · grayscale · depth

Same source scene, six different structural control signals — all from one unified model.

Design Principle

Pretrained Quality Preserved

The frozen backbone strategy ensures the base model's text-to-image behavior is fully preserved. Without a control signal, all variants generate normally.

Prompt

“Professional motorcycle racer in extreme lean on a race course.”

Panels: Pure T2I (no control) · Control Map (surface normals) · Controlled (same model)