JointTuner:
Appearance-Motion Adaptive Joint Training for Customized Video Generation

1National University of Defense Technology
2Alibaba International Digital Commerce Group

*Indicates Corresponding Author

Abstract

Recent advances in customized video generation have substantially improved the adaptation of both appearance and motion. However, existing approaches typically decouple appearance and motion training, which leads to concept interference and inaccurate feature rendering. These methods also frequently suffer from appearance contamination, where background and foreground elements from reference videos leak into and distort the generated output. To address these limitations, we propose JointTuner, a framework designed to jointly optimize appearance and motion dynamics. We introduce two key innovations: Gated Low-Rank Adaptation (GLoRA) and an Appearance-independent Temporal Loss (AiT Loss). GLoRA employs a context-aware activation layer that dynamically steers LoRA modules toward optimizing either appearance or motion while preserving spatio-temporal consistency. Complementing this, the AiT Loss leverages channel-temporal shift noise to isolate motion-related high-frequency signals, forcing the model to prioritize motion learning without compromising appearance. JointTuner is architecture-agnostic, supporting both UNet and Diffusion Transformer backbones such as CogVideo and Wan2.1. We also establish a systematic evaluation framework covering 90 scenarios, on which JointTuner demonstrates superior semantic alignment, motion dynamism, temporal consistency, and perceptual quality.
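The abstract describes GLoRA as a low-rank update whose contribution is modulated by a context-aware activation. The paper's exact formulation is not reproduced on this page; the sketch below is only a rough, hypothetical illustration of that idea (the sigmoid gate, all shapes, and the function names are assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def glora_forward(x, W, A, B, gate_w, gate_b):
    """Hypothetical gated low-rank adaptation of a frozen linear layer.

    x: (tokens, d_in) input activations
    W: (d_in, d_out) frozen pretrained weight
    A: (r, d_in), B: (d_out, r) trainable low-rank factors
    gate_w: (d_in, 1), gate_b: scalar -- context-aware gate parameters
    """
    base = x @ W                                  # frozen pretrained path
    lora = (x @ A.T) @ B.T                        # low-rank update path
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))  # per-token sigmoid gate
    return base + gate * lora                     # gate scales the LoRA contribution

# Toy shapes; B starts at zero (the standard LoRA init), so the adapted
# layer initially reproduces the frozen model exactly.
d_in, d_out, r, n = 8, 8, 2, 4
x = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))
A = 0.01 * rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))
gate_w = rng.standard_normal((d_in, 1))
gate_b = 0.0
out = glora_forward(x, W, A, B, gate_w, gate_b)
```

With B initialized to zero the gated branch contributes nothing at the start of training; the gate can then learn, per token, how strongly to apply the appearance- or motion-oriented update.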

Method

JointTuner

Architecture of JointTuner, an adaptive joint training framework with two primary steps: (1) integrating GLoRA into Transformer blocks for efficient fine-tuning, and (2) optimizing GLoRA via two complementary losses. The original diffusion loss uses reference images to preserve appearance details, while the AiT Loss leverages reference videos to focus on motion patterns. The pre-trained text-to-video model remains frozen; only the GLoRA parameters are updated during training. At inference, the trained GLoRA weights are loaded to generate customized videos conditioned on the input prompt.
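The AiT Loss is described as using channel-temporal shift noise to isolate motion-related high-frequency signals. Its exact definition is in the paper; the sketch below is only one plausible reading of the principle (the roll-based shifts and the squared-error form are assumptions), where differencing each frame's noise against a temporally and channel-shifted copy cancels content that is static across frames:

```python
import numpy as np

def ait_loss(pred_noise, target_noise, channel_shift=1):
    """Hypothetical Appearance-independent Temporal (AiT) loss sketch.

    Both inputs have shape (frames, channels, h, w). Differencing each
    frame against a temporally rolled, channel-shifted copy removes
    content that is constant across frames, so the squared error mostly
    penalizes motion-related (temporal high-frequency) discrepancies.
    """
    def shift_diff(z):
        nxt = np.roll(z, -1, axis=0)               # shift along the time axis
        nxt = np.roll(nxt, channel_shift, axis=1)  # shift along the channel axis
        return z - nxt
    diff = shift_diff(pred_noise) - shift_diff(target_noise)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 8, 8))  # (frames, channels, h, w)
y = rng.standard_normal((4, 3, 8, 8))
loss = ait_loss(x, y)
```

One property this sketch shares with the stated goal: adding a static global offset (a crude stand-in for appearance content) to the prediction leaves the loss unchanged, since the shift-difference cancels it.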

Customized Videos on Wan2.1-1.3B (DiT-based Diffusion Model)

Appearance + Motion = Customized Videos

Customization Comparison on DiT Models

Appearance + Motion = Customized Videos

Customized Videos on ZeroScope (UNet-based Diffusion Model)

Appearance + Motion = Customized Videos

BibTeX

@article{chen2025jointtuner,
  title={JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation},
  author={Fangda Chen and Shanshan Zhao and Chuanfu Xu and Long Lan},
  journal={arXiv preprint arXiv:2503.23951},
  year={2025}
}