FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

Abstract

Motivation

Method Overview

Dual-branch self-attention

FreeSpec replaces self-attention with global full-window and local sliding-window branches during inference while keeping the pretrained short-video model frozen.
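The two branches can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes single-head attention over a 1D token sequence, and the helper names (`attention`, `sliding_window_mask`, `dual_branch_attention`) and the symmetric window definition are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention for arrays of shape (tokens, dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def sliding_window_mask(n, window):
    # Assumption: token i may attend to token j when |i - j| <= window // 2.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

def dual_branch_attention(q, k, v, window):
    # Global branch: every token attends to the full sequence.
    out_global = attention(q, k, v)
    # Local branch: every token attends only to its sliding window.
    out_local = attention(q, k, v, sliding_window_mask(len(q), window))
    return out_global, out_local
```

No weights are updated in this sketch: both branches reuse the same `q`, `k`, and `v`, which is consistent with keeping the pretrained model frozen. When the window covers the whole sequence, the two outputs coincide.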

SVD-guided spectral fusion

Global and local outputs are decomposed by SVD, and global information is injected through timestep- and rank-aware singular-spectrum modulation.
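A rough sketch of this step follows. The page does not give the modulation formula, so the weighting below is purely an assumption: `t` is a hypothetical blend strength in [0, 1] derived from the denoising timestep, `tau` is a hypothetical rank-decay constant, and the function names are illustrative. Only the calls to `np.linalg.svd` are standard.

```python
import numpy as np

def modulation_weights(num_ranks, t, tau=4.0):
    # Hypothetical schedule: the weight is largest at rank 0 (the dominant,
    # low-rank structure) and decays exponentially for higher ranks.
    ranks = np.arange(num_ranks)
    return t * np.exp(-ranks / tau)

def fuse_spectra(out_global, out_local, t, tau=4.0):
    # Singular values of each branch output, in descending order.
    s_global = np.linalg.svd(out_global, compute_uv=False)
    s_local = np.linalg.svd(out_local, compute_uv=False)
    # Blend the spectra rank by rank; high ranks stay close to the local spectrum.
    w = modulation_weights(len(s_local), t, tau)
    return (1.0 - w) * s_local + w * s_global
```

Under this assumed schedule, `t = 0` returns the local spectrum unchanged, and `t = 1` replaces the leading singular value with the global one while leaving the tail mostly local.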

Local-basis reconstruction

The modulated spectrum is reconstructed under the local singular basis, preserving high-rank spatial-temporal variations while retaining controlled low-rank structural guidance.
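Reconstruction under the local basis amounts to keeping the local singular vectors and swapping in the modulated singular values. The sketch below assumes a 2D branch output of shape (tokens, dim); the function name and the argument `s_mod` are hypothetical.

```python
import numpy as np

def reconstruct_in_local_basis(out_local, s_mod):
    # Thin SVD of the local branch output: out_local = U @ diag(s) @ Vt.
    u_local, _, vt_local = np.linalg.svd(out_local, full_matrices=False)
    # Keep the local singular vectors, replace the spectrum with s_mod.
    return (u_local * s_mod) @ vt_local
```

Because only the spectrum changes, passing the original local singular values as `s_mod` returns `out_local` unchanged (up to floating-point error).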

Method Comparison

Quantitative Results

All results are evaluated at 4× the native base-model inference length. Inference time denotes runtime on an A100 GPU. Best scores are shown in bold, second-best scores are underlined, and the FreeSpec row is highlighted.

Wan2.1-1.3B

LTX-Video-2B-dev

Qualitative Results

More Results

Additional side-by-side examples. The left column shows Direct and the right column shows FreeSpec (Ours).

BibTeX

@article{chen2026freespec,
  title={FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction},
  author={Fangda Chen and Shanshan Zhao and Longrong Yang and Chuanfu Xu and Zhigang Luo and Long Lan},
  year={2026},
  journal={arXiv preprint arXiv:2605.06509},
}