M.Sc. Thesis · FAU Erlangen-Nürnberg · 2026

Finetuning Visual
Autoregressive Models
for Controllable
Image Generation

Computer Vision · Generative AI · PyTorch · First of its kind

The first framework to integrate spatial control into a scale-wise autoregressive text-to-image model. Two novel architectures, six control modalities, and a unified model that routes them all with a single checkpoint.

GitHub

2.5B model parameters

6 control modalities

1st in research landscape

control map (normals)

generated image

control map (depth)

generated image

Motivation

The Empty Cell

Spatial control existed for diffusion models. For autoregressive T2I models, one specific combination remained unexplored.

The research landscape for controllable autoregressive generation can be mapped along two dimensions: the generation task (text-to-image vs class-to-image) and the prediction paradigm (next-token vs next-scale).

When examined collectively, prior work — ControlAR, CAR, ControlVAR, and SCALAR — leaves exactly one cell vacant: text-to-image with next-scale prediction.

Why is this cell the hardest?
Class-to-image (C2I) models avoid dual conditioning because they take a single class label rather than open-ended text. ControlAR avoids hierarchical alignment because it works on a flat token stream rather than 10 resolution scales. Our framework confronts both at once.

	Next-Token	Next-Scale
Text-to-Image	ControlAR	★ This work (Switti-Control)
Class-to-Image	—	CAR · ControlVAR · SCALAR

Dual-conditioning complexity · Hierarchical alignment · Multi-scale control

Methodology

Two Architectures,
One Philosophy

Both approaches keep the SWITTI backbone frozen and use zero-initialized projections to start training from pretrained behavior.
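To make the shared idea concrete, here is a minimal PyTorch sketch of a zero-initialized projection on top of a frozen network. The tiny `backbone`, the feature widths, and the names `control_proj` and `forward_with_control` are illustrative assumptions, not the thesis code. It shows only the property both architectures rely on: at initialization the control path contributes exactly zero, so the output matches the pretrained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pretrained backbone (not the real SWITTI model).
backbone = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))
for p in backbone.parameters():
    p.requires_grad_(False)  # frozen: excluded from gradient updates

# Trainable projection from control features into the backbone width.
# Weight and bias start at zero, so its output is zero before training.
control_proj = nn.Linear(8, 16)
nn.init.zeros_(control_proj.weight)
nn.init.zeros_(control_proj.bias)

def forward_with_control(x, control):
    # Control features are added to the frozen backbone's output.
    return backbone(x) + control_proj(control)

x = torch.randn(4, 16)
control = torch.randn(4, 8)

same_at_init = torch.equal(forward_with_control(x, control), backbone(x))
trainable = sum(p.numel() for p in control_proj.parameters())
print(same_at_init, trainable)
```

Because the projection's weight gradient depends on the control features rather than on the (zero) weight itself, training can still move it away from zero after the first step.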

Encoder Injection

Lightweight

A frozen DINOv2 ViT-B/14 encoder extracts rich features from the control image. These features are pooled to match each of SWITTI's 10 generation scales, then injected after self-attention and before text cross-attention.

DINOv2 ViT-B/14 · 31M params (additive) · 282M params (cross-attn) · Frozen backbone · 10 scale pools
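The per-scale pooling step can be sketched as follows. This is an illustration under stated assumptions: a random tensor stands in for the frozen DINOv2 ViT-B/14 patch features (a 224×224 input yields a 16×16 grid of 768-dimensional patches), and `scale_sides` is a hypothetical list of 10 side lengths, since the actual SWITTI scale schedule is not given here. `pool_to_scales` is likewise an illustrative name.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for frozen encoder output, shaped (batch, channels, height, width).
feats = torch.randn(1, 768, 16, 16)

# HYPOTHETICAL side lengths for 10 generation scales (assumption, not SWITTI's).
scale_sides = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

def pool_to_scales(feats, sides):
    """Average-pool one feature map to every scale and flatten to tokens."""
    pooled = []
    for s in sides:
        m = F.adaptive_avg_pool2d(feats, s)           # (B, C, s, s)
        pooled.append(m.flatten(2).transpose(1, 2))   # (B, s*s, C)
    return pooled

pooled = pool_to_scales(feats, scale_sides)
for s, p in zip(scale_sides, pooled):
    print(s, tuple(p.shape))
```

Each entry of `pooled` has one token per spatial position at that scale, so it can be aligned with the tokens the backbone predicts at the same scale before being injected.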

Parallel Control Branch

Full-scale

A full copy of the SWITTI backbone runs in parallel, processing the control image and injecting control features via zero-initialized linear layers — the same design principle that made ControlNet work, reimagined for next-scale autoregressive generation.

ControlNet-style · ~2.5B trainable · Zero-initialized injection · Learned modality embedding
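A rough sketch of the parallel-branch pattern, assuming a toy `Block` in place of SWITTI's real transformer blocks. The class names, the per-block injection point, and the dimensions are assumptions made for illustration. What the sketch does show is the general recipe: copy the blocks, freeze the originals, train the copy, and add its features through zero-initialized linear layers.

```python
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """Hypothetical stand-in for one backbone block (not SWITTI's real block)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class ParallelControl(nn.Module):
    def __init__(self, blocks, dim):
        super().__init__()
        self.control = copy.deepcopy(blocks)  # trainable copy, taken before freezing
        self.backbone = blocks
        for p in self.backbone.parameters():
            p.requires_grad_(False)           # original blocks stay frozen
        self.inject = nn.ModuleList([nn.Linear(dim, dim) for _ in blocks])
        for layer in self.inject:             # zero-init: no effect at step 0
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x, control):
        for frozen, ctrl, inj in zip(self.backbone, self.control, self.inject):
            control = ctrl(control)           # control branch runs alongside
            x = frozen(x) + inj(control)      # inject control features
        return x

torch.manual_seed(0)
dim = 32
model = ParallelControl(nn.ModuleList([Block(dim) for _ in range(3)]), dim)
x, control = torch.randn(2, 5, dim), torch.randn(2, 5, dim)

with torch.no_grad():
    plain = x
    for blk in model.backbone:
        plain = blk(plain)
    unchanged_at_init = torch.allclose(model(x, control), plain)
print(unchanged_at_init)
```

The copy is taken before freezing so that the control branch starts from the pretrained weights but remains trainable, while the original blocks receive no gradient updates.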

Results

Qualitative Results

Parallel Control Branch outputs across three modalities. The model follows the structural layout of the control map while respecting the text prompt.

Original

Control Map

Generated

HED

Depth

Canny

Key Results

By the Numbers

+236%

Segmentation mIoU improvement

Encoder Injection → Parallel Branch

+84%

Canny F1 improvement

0.197 → 0.362 (Additive → Parallel)
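The percentage follows from the two F1 scores as a relative improvement. A quick check (the helper name `relative_improvement` is just for illustration):

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

canny_gain = relative_improvement(0.197, 0.362)
print(f"{canny_gain:+.0f}%")  # +84%
```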

10–15%

Gap: unified vs. specialist models

One checkpoint serves all 6 modalities

10×

Faster than next-token AR

10 scale steps vs. 1024 token steps

Unified Control

One Model.
Six Modalities.

A single checkpoint with a learned modality embedding routes six structurally distinct control signals without interference. Within 10–15% of per-modality specialists.
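One plausible way to realize a learned modality embedding is sketched below. It assumes the embedding is simply added to every control token; the class name `ModalityConditioner`, the token width, and the additive design are assumptions for illustration, and the thesis model may condition differently.

```python
import torch
import torch.nn as nn

# The six control modalities listed on this page.
MODALITIES = ["canny", "hed", "normals", "segmentation", "grayscale", "depth"]

class ModalityConditioner(nn.Module):
    """Adds one learned vector per modality to all control tokens (assumed design)."""
    def __init__(self, dim, num_modalities=len(MODALITIES)):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, dim)

    def forward(self, control_tokens, modality):
        idx = torch.tensor([MODALITIES.index(modality)], device=control_tokens.device)
        # (1, dim) -> (1, 1, dim), broadcast over batch and token axes.
        return control_tokens + self.embed(idx).unsqueeze(1)

torch.manual_seed(0)
conditioner = ModalityConditioner(dim=32)
tokens = torch.randn(2, 10, 32)  # (batch, tokens, dim) control features

as_depth = conditioner(tokens, "depth")
as_canny = conditioner(tokens, "canny")
print(as_depth.shape, torch.allclose(as_depth, as_canny))
```

The same control tokens produce different inputs depending on the modality tag, which is what lets a single set of weights tell, for example, a depth map apart from an edge map.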

original

Canny

HED

normals

segmentation

grayscale

depth

Same source scene, six different structural control signals — all from one unified model.

Design Principle

Pretrained Quality
Preserved

The frozen backbone strategy ensures the base model's text-to-image behavior is fully preserved. Without a control signal, all variants generate normally.

Prompt

“Professional motorcycle racer in extreme lean on a race course.”

→

+ normals

→

Pure T2I

no control

Control Map

surface normals

Controlled

same model
