ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Published in CVPR 2026
Story visualization aims to generate coherent image sequences that faithfully represent a narrative and match given character references. Despite progress in generative models, existing benchmarks remain narrow in scope: they are often limited to short prompts, settings without character references, or single-image generation. As a result, they fail to reflect real-world narrative complexity and obscure true model capabilities.
This paper introduces ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across varied narrative structures, visual styles, and character settings. Its core features include richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models (LLMs) assist in story summarization and script generation, with all outputs verified by humans for coherence and fidelity. Character references are carefully curated to maintain consistency across different artistic styles.
ViStoryBench proposes a suite of multi-dimensional automated metrics to evaluate character consistency, style similarity, prompt alignment, aesthetic quality, and artifacts such as copy-paste behavior. These metrics are validated through human studies and applied to assess a broad range of open-source and commercial models, enabling systematic analysis and encouraging advances in the field of visual storytelling. The paper has been accepted for presentation at CVPR 2026.
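The paper does not specify its metric implementations here, but consistency-style scores of this kind commonly reduce to averaged similarity between feature embeddings. As a purely illustrative sketch (the function name, the toy 4-dimensional vectors, and the use of plain cosine similarity are all assumptions, not ViStoryBench's actual pipeline), a character-consistency score could compare a reference character's embedding against embeddings of that character's appearances across the generated shots:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def character_consistency(ref_embedding, shot_embeddings):
    """Mean cosine similarity between a reference-character embedding and
    the embeddings of that character's crops in each generated shot.
    (Hypothetical scoring function, for illustration only.)"""
    return float(np.mean([cosine_sim(ref_embedding, e) for e in shot_embeddings]))

# Toy 4-d vectors standing in for real image-encoder features.
ref = np.array([1.0, 0.0, 0.0, 0.0])
shots = [
    np.array([0.9, 0.1, 0.0, 0.0]),  # shot 1: character closely matches reference
    np.array([0.8, 0.2, 0.1, 0.0]),  # shot 2: slight drift in appearance
]
score = character_consistency(ref, shots)  # close to 1.0 = consistent
```

In practice such embeddings would come from a pretrained image encoder, and per-shot scores would be aggregated over all characters and stories in the benchmark; the paper's validation against human studies is what grounds whichever formulation it actually uses.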
