We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. A model that genuinely understands physical interactions should be able to reverse an action after performing it; this is the core principle of the Do-Undo task.
We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility despite achieving high semantic fidelity. We demonstrate that explicit supervision on reversible action pairs significantly improves both semantic awareness and action understanding in image generation models.
A model that genuinely understands physical actions should be able to reverse an action it has just performed. This seemingly simple requirement exposes fundamental failures in current state-of-the-art generative models, revealing gaps between semantic plausibility and true action understanding.
State-of-the-art models achieve high semantic fidelity but fail at action understanding, especially in the reverse (Undo) direction. High DINO scores do not correlate with high action accuracy.
| Method | DINO-F ↑ | DINO-R ↑ | CLIP ↑ | A-F ↑ | A-R ↑ | N-F ↑ | N-R ↑ | EPE-F ↓ | EPE-R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-Image | 0.817 | 0.815 | 0.258 | 52.33 | 29.71 | 61.20 | 52.77 | 89.23 | 80.86 |
| BAGEL | 0.793 | 0.796 | 0.262 | 57.87 | 33.48 | 55.65 | 50.55 | 121.0 | 94.07 |
| Flux Kontext | 0.750 | 0.746 | 0.240 | 52.23 | 30.12 | 53.23 | 48.18 | 111.2 | 95.87 |
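The decoupling of semantic fidelity from action accuracy can be illustrated with a minimal sketch: DINO-style scores reward cosine similarity between global image embeddings, which can remain high even when the action itself was applied incorrectly, since most of the scene is unchanged. The placeholder feature vectors below stand in for a real DINO backbone; they are illustrative, not the benchmark's actual evaluation code.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical global embeddings: an edit that perturbs the scene only
# slightly keeps a high DINO-style score regardless of whether the
# intended action (or its reverse) was actually carried out.
rng = np.random.default_rng(0)
target = rng.normal(size=768)                      # ground-truth result embedding
wrong_action = target + 0.1 * rng.normal(size=768) # small global change, wrong action

print(round(cosine_similarity(target, wrong_action), 3))
```

This is why the tables report action accuracy (A-F/A-R) and flow end-point error (EPE) alongside DINO and CLIP: the similarity metrics alone cannot detect a reversed or omitted action.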
Fine-tuning BAGEL with explicit supervision on reversible action pairs improves both forward and reverse metrics over the base model and over Do-only variants:

| Method | DINO-F ↑ | DINO-R ↑ | CLIP ↑ | A-F ↑ | A-R ↑ | N-F ↑ | N-R ↑ | EPE-F ↓ | EPE-R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| BAGEL | 0.796 | 0.793 | 0.262 | 57.87 | 33.48 | 55.65 | 50.55 | 121.0 | 94.07 |
| BAGEL-Do(SP) | 0.818 | 0.819 | 0.254 | 55.65 | 34.81 | 54.55 | 47.23 | 118.8 | 93.27 |
| BAGEL-Do | 0.821 | 0.816 | 0.250 | 55.92 | 34.60 | 56.87 | 46.21 | 124.5 | 93.70 |
| BAGEL-DoUndo (Ours) | 0.836 | 0.832 | 0.251 | 58.77 | 36.26 | 58.53 | 50.47 | 118.4 | 90.88 |
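The pair-supervision idea behind BAGEL-DoUndo can be sketched as follows: each training example couples a forward (Do) instruction with its inverse (Undo), so the model is supervised on both directions of the same transformation. The names below (`make_do_undo_pair`, the verb table) are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative inverse-verb table; the real benchmark curates such
# reversible action pairs from real-world scenarios.
INVERSE = {
    "open": "close", "close": "open",
    "fill": "empty", "empty": "fill",
    "stack": "unstack", "unstack": "stack",
}

def make_do_undo_pair(verb: str, obj: str) -> tuple[str, str]:
    """Return (Do, Undo) edit instructions for a reversible action."""
    if verb not in INVERSE:
        raise ValueError(f"no known inverse for {verb!r}")
    return (f"{verb} the {obj}", f"{INVERSE[verb]} the {obj}")

do, undo = make_do_undo_pair("open", "drawer")
# Both directions supervise the same transformation:
#   (source image, do)   -> edited image
#   (edited image, undo) -> source image
print(do, "|", undo)
```

Training on both directions makes reversibility an explicit objective rather than a property the model is merely hoped to acquire from forward edits alone.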