JointTuner:
Appearance-Motion Adaptive Joint Training for Customized Video Generation

1National University of Defense Technology
2Alibaba International Digital Commerce Group

*Indicates Corresponding Author

Abstract

Recent advancements in customized video generation have enabled the simultaneous adaptation of appearance and motion. However, prior methods typically decouple appearance and motion training, which often introduces concept interference and results in inaccurate rendering of appearance features or motion patterns. In addition, these methods often suffer from appearance contamination, in which background and foreground elements from reference videos distort the customized video. This paper proposes JointTuner to alleviate these issues. The core idea of JointTuner is to jointly optimize the appearance and motion components, upon which two key innovations are built: Gated Low-Rank Adaptation (GLoRA) and Appearance-independent Temporal Loss (AiT Loss). Specifically, GLoRA uses a context-aware activation layer, analogous to a gating regulator, to dynamically steer LoRA modules toward learning either appearance or motion while maintaining spatio-temporal consistency. Moreover, based on the finding that channel-temporal shift noise suppresses appearance-related low-frequency components while enhancing motion-related high-frequency components, we design the AiT Loss. This loss applies the same shift to the diffusion model's predicted noise during fine-tuning, forcing the model to prioritize learning motion patterns. JointTuner's architecture-agnostic design supports both UNet (e.g., ZeroScope) and Diffusion Transformer (e.g., CogVideoX) backbones, so its customization capability scales with the evolution of foundational video models. Furthermore, we present a systematic evaluation framework for appearance-motion combined customization, covering 90 combinations evaluated along four critical dimensions: semantic alignment, motion dynamism, temporal consistency, and perceptual quality.
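To make the GLoRA idea concrete, the following is a minimal sketch of a gated low-rank adapter on a linear layer: a frozen base weight plus a LoRA branch whose contribution is scaled by a context-aware gate computed from the input. This is an illustrative interpretation, not the paper's implementation; the class name, rank, and gate design (a per-token sigmoid scalar) are assumptions.

```python
import torch
import torch.nn as nn

class GLoRALinear(nn.Module):
    """Illustrative gated LoRA layer (hypothetical; not the official code).

    The frozen base projection is augmented by a low-rank update whose
    strength is modulated per token by a learned, input-dependent gate.
    """
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_down = nn.Linear(in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # zero-init: starts as an identity update
        # context-aware gate: one scalar in [0, 1] per token, steering the branch
        self.gate = nn.Sequential(nn.Linear(in_features, 1), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)  # shape (..., 1)
        return self.base(x) + g * self.lora_up(self.lora_down(x))

x = torch.randn(2, 16, 64)      # (batch, tokens, dim)
layer = GLoRALinear(64, 64)
y = layer(x)                    # same output as base(x) before any training
```

Because the up-projection is zero-initialized, the adapter initially leaves the frozen model's behavior unchanged; the gate then learns where (appearance- vs. motion-related contexts) the low-rank update should be active.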

Method

JointTuner

Architecture of JointTuner, an adaptive joint training framework with two main steps: (1) integrating GLoRA into the transformer blocks for efficient fine-tuning, and (2) optimizing GLoRA with two complementary losses: the original diffusion loss leverages reference images to preserve appearance details, while the AiT Loss uses reference videos to capture motion patterns. The pre-trained text-to-video model remains frozen throughout training; only the GLoRA parameters are updated. During inference, the trained GLoRA weights are loaded, and customized videos are generated conditioned solely on the input prompt.
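The AiT Loss described above can be sketched as follows. Under the stated finding, shifting a video noise tensor along the channel and temporal axes and taking the residual damps slowly-varying, appearance-related content while amplifying frame-to-frame, motion-related changes; applying the same shift to both the predicted and target noise focuses the loss on that residual. The shift offsets and function names below are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def channel_temporal_shift(eps, c_shift=1, t_shift=1):
    """Hypothetical channel-temporal shift for a noise tensor.

    eps: (B, C, T, H, W). Rolling along the channel (dim 1) and temporal
    (dim 2) axes and subtracting acts like a difference filter: low-frequency
    (appearance) components cancel, high-frequency (motion) components remain.
    """
    shifted = torch.roll(eps, shifts=(c_shift, t_shift), dims=(1, 2))
    return eps - shifted

def ait_loss(pred_noise, target_noise):
    # Apply the identical shift to prediction and target, then compare,
    # so the optimization prioritizes motion-related structure.
    return F.mse_loss(channel_temporal_shift(pred_noise),
                      channel_temporal_shift(target_noise))

noise = torch.randn(1, 4, 3, 8, 8)   # (batch, channels, frames, H, W)
loss = ait_loss(noise, noise)        # zero when prediction matches target
```

In training, this term would be computed on reference-video batches alongside the standard diffusion loss on reference images, as in the two-loss setup shown in the architecture figure.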

More Results on CogVideoX-5B (DiT-based Diffusion Model)

More Results on ZeroScope (UNet-based Diffusion Model)

BibTeX

@article{chen2025jointtuner,
  title={JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation},
  author={Fangda Chen and Shanshan Zhao and Chuanfu Xu and Long Lan},
  journal={arXiv preprint arXiv:2503.23951},
  year={2025}
}