Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from previously generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts, requiring only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
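To make the "latent concatenation and negative RoPE shifts" idea concrete, here is a minimal sketch of how temporal positions might be assigned. It is illustrative only: the function names, the exact offsets (memory frames at -M..-1), and the RoPE dimension and base are assumptions, not details confirmed by the text above.

```python
# Illustrative sketch only: names and offsets are assumptions, not StoryMem's code.

def temporal_positions(num_memory, num_video):
    """Assign temporal indices to a [memory | video] latent sequence.

    Assumption: memory frames take negative indices (-M..-1) while the
    frames of the current shot keep their usual indices (0..T-1).
    """
    memory_pos = list(range(-num_memory, 0))
    video_pos = list(range(num_video))
    return memory_pos + video_pos


def concat_latents(memory_latents, video_latents):
    """Concatenate memory latents in front of the noisy video latents
    along the temporal axis (plain lists stand in for tensors here)."""
    return list(memory_latents) + list(video_latents)


def rope_angles(position, dim=8, base=10000.0):
    """Standard RoPE rotation angles for one position: pos * base^(-2i/dim).

    Negative positions simply yield negative rotation angles.
    """
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


if __name__ == "__main__":
    positions = temporal_positions(num_memory=3, num_video=4)
    latents = concat_latents(["m0", "m1", "m2"], ["v0", "v1", "v2", "v3"])
    for latent, pos in zip(latents, positions):
        print(latent, pos, [round(a, 4) for a in rope_angles(pos)])
```

Under this assumption, the frames of the current shot keep the same positional layout a single-shot model would see on its own, which is presumably one reason lightweight LoRA fine-tuning can suffice.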
StoryMem generates each shot conditioned on a memory bank that stores keyframes from previously generated shots (Memory-to-Video, M2V). During generation, the selected memory frames are encoded by a 3D VAE, fused with noisy video latents and binary masks, and fed into a LoRA-finetuned memory-conditioned Video DiT to synthesize the current shot. After generating each shot, semantic keyframe selection and aesthetic preference filtering are applied to obtain informative and reliable memory frames, enabling long-range cross-shot consistency and natural narrative progression. By iteratively generating shots with memory updates, StoryMem produces coherent minute-long, multi-shot story videos.
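The iterative loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the released implementation: `generate_shot` is a hypothetical stand-in that returns random frame features instead of running a video model, the similarity threshold and aesthetic cutoff are made-up values, and the first-in-first-out eviction in `update_memory` is only one possible way to keep the memory bank compact.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Frame:
    shot_id: int
    index: int
    embedding: list   # hypothetical semantic feature vector
    aesthetic: float  # hypothetical aesthetic score in [0, 1]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def generate_shot(prompt, memory, shot_id, num_frames=8):
    """Stand-in for the memory-conditioned video model (assumption):
    returns frames with random features instead of real video."""
    rng = random.Random(shot_id)
    return [
        Frame(shot_id, i, [rng.gauss(0, 1) for _ in range(4)], rng.random())
        for i in range(num_frames)
    ]


def select_keyframes(frames, sim_threshold=0.9, min_aesthetic=0.3):
    """Greedy selection: drop low-aesthetic frames, then keep a frame only
    if it is semantically dissimilar from the keyframes kept so far."""
    kept = []
    for frame in frames:
        if frame.aesthetic < min_aesthetic:
            continue
        if all(cosine(frame.embedding, k.embedding) < sim_threshold for k in kept):
            kept.append(frame)
    return kept


def update_memory(memory, new_keyframes, capacity=6):
    """Keep the bank compact; FIFO eviction is an assumption of this sketch."""
    return (memory + new_keyframes)[-capacity:]


def generate_story(prompts, capacity=6):
    """Generate shots one by one, refreshing the memory bank after each."""
    memory, shots = [], []
    for shot_id, prompt in enumerate(prompts):
        frames = generate_shot(prompt, memory, shot_id)
        shots.append(frames)
        memory = update_memory(memory, select_keyframes(frames), capacity)
    return shots, memory


if __name__ == "__main__":
    shots, memory = generate_story(["shot 1", "shot 2", "shot 3"], capacity=4)
    print(len(shots), "shots;", len(memory), "frames in memory")
```

In a real system the frame embeddings and aesthetic scores would come from learned models, and the memory frames would be encoded and passed to the video model rather than ignored as they are in this stub.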
@article{zhang2025storymem,
title={{StoryMem}: Multi-shot Long Video Storytelling with Memory},
author={Zhang, Kaiwen and Jiang, Liming and Wang, Angtian and Fang, Jacob Zhiyuan and Zhi, Tiancheng and Yan, Qing and Kang, Hao and Lu, Xin and Pan, Xingang},
journal={arXiv preprint},
volume={arXiv:2512.19539},
year={2025}
}