Recent advances in text-to-video generation have enabled coherent video synthesis from prompts and extended to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference, where naive decoupled optimization causes a feature-domain mismatch, or exhibit appearance contamination, where spatial features leak because motion and appearance remain entangled in reference-video reconstruction. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers of the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference-video reconstruction through an appearance-agnostic noise prediction task. The key innovation is to add frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution so that the spatial attributes tied to each frame are disrupted while temporal coherence is preserved. Furthermore, we construct a benchmark comprising 90 customized appearance-motion combinations and 10 automatic metrics of multiple types across four dimensions, enabling a more comprehensive evaluation of this customization task. Extensive experiments demonstrate that our method outperforms current state-of-the-art approaches.
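To make the appearance-agnostic noise-prediction target concrete, the following is a minimal PyTorch-style sketch of the frame-wise offset perturbation described above; the function name, the per-frame-per-channel offset shape, and the offset_scale value are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def appearance_independent_temporal_loss(eps_pred, eps_gt, offset_scale=0.1):
    """Appearance-agnostic noise-prediction loss (illustrative sketch).

    eps_pred: predicted noise from the diffusion model, shape (B, F, C, H, W).
    eps_gt:   ground-truth Gaussian noise of the same shape.

    A single offset is sampled per frame (and channel) and broadcast over the
    spatial dimensions, perturbing each frame's spatial statistics while
    leaving the frame-to-frame temporal structure of the target intact.
    """
    b, f, c, _, _ = eps_gt.shape
    # Frame-wise offset noise added to the ground-truth Gaussian noise.
    frame_offset = offset_scale * torch.randn(b, f, c, 1, 1, device=eps_gt.device)
    perturbed_target = eps_gt + frame_offset
    return F.mse_loss(eps_pred, perturbed_target)
```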
The architecture of JointTuner, our adaptive joint training framework, comprises three parts: (1) converting reference images into pseudo-videos by duplicating each image across frames so that they match the format of the motion reference videos; (2) inserting Adaptive LoRA components into the spatial and temporal Transformers for parameter-efficient fine-tuning; and (3) optimizing the Adaptive LoRA parameters with two specialized losses: the One Frame Spatial Loss preserves appearance details, while the Appearance-independent Temporal Loss captures motion patterns. Notably, the pre-trained text-to-video diffusion model remains frozen during training; only the injected LoRA parameters are fine-tuned. During inference, the trained LoRA weights are loaded to generate customized videos.
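As a concrete illustration of part (2), here is a minimal PyTorch sketch of a gated LoRA adapter wrapped around a frozen linear projection inside a spatial or temporal Transformer block; the class name AdaptiveLoRALinear, the rank, and the token-wise sigmoid gate are assumptions about the context-aware gating, not the released implementation.

```python
import torch
import torch.nn as nn

class AdaptiveLoRALinear(nn.Module):
    """Gated low-rank adapter around a frozen linear layer (illustrative sketch)."""

    def __init__(self, base_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained projection stays frozen

        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_down = nn.Linear(in_f, rank, bias=False)
        self.lora_up = nn.Linear(rank, out_f, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter starts as a zero update

        # Context-aware gate: a per-token scalar in (0, 1) computed from the
        # token features themselves, modulating the low-rank update.
        self.gate = nn.Sequential(nn.Linear(in_f, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features) token features from a spatial or temporal block
        return self.base(x) + self.gate(x) * self.lora_up(self.lora_down(x))

# Example: wrap the query projection of an attention layer so that only the
# adapter parameters receive gradients during fine-tuning.
# attn.to_q = AdaptiveLoRALinear(attn.to_q, rank=4)
```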
@article{chen2025jointtuner,
title={JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation},
author={Fangda Chen and Shanshan Zhao and Chuanfu Xu and Long Lan},
journal={arXiv preprint arXiv:2503.23951},
year={2025}
}