Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan1,2*  Shreya Kadambi3*  Hoang Le3  Rajeev Yasarla3  Apratim Bhattacharyya3  Munawar Hayat3  Fatih Porikli3
* Equal contribution
1York University   2Vector Institute for AI   3Qualcomm AI Research

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. A model that genuinely understands physical interactions should be able to reverse an action after performing it; this is the core principle of the Do-Undo task.

We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility despite achieving high semantic fidelity. We demonstrate that explicit supervision on reversible action pairs significantly improves both semantic awareness and action understanding in image generation models.


The Do-Undo Task

A model that genuinely understands physical actions should be able to reverse an action it has just performed. This seemingly simple requirement exposes fundamental failures in current state-of-the-art generative models, revealing gaps between semantic plausibility and true action understanding.

[Task overview]
Input State: initial scene; the action has not yet been performed. (forward prompt → Do)
Do (Forward): the action is performed and the scene state is updated. (reverse prompt → Undo)
Undo (Reverse): the action is reversed; the result must match the input state.
Key insight: reversible actions are the acid test for physical understanding. "Cut the paper" is irreversible, so the round trip cannot be tested. "Open the drawer" is reversible, so the model must generate consistent cause-and-effect transformations that can be inverted.
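
In code, the round trip is just two chained editing calls. Below is a minimal sketch, assuming a generic `edit(image, prompt) -> image` callable as a stand-in for whichever editing API a given model (BAGEL, Flux Kontext, Qwen-Image) actually exposes:

```python
from PIL import Image

def do_undo_round_trip(edit, input_image: Image.Image,
                       forward_prompt: str, reverse_prompt: str):
    """Apply the forward action, then its reverse.

    A model with genuine action understanding should return a
    reverse image that matches the original input state.
    """
    forward_image = edit(input_image, forward_prompt)    # Do: perform the action
    reverse_image = edit(forward_image, reverse_prompt)  # Undo: invert it
    return forward_image, reverse_image
```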

Example Prompt Pair

▶ Forward Prompt
"Open the drawer with left hand by pulling it backward until it is fully opened. The drawer is a wooden kitchen drawer located beneath the countertop, currently in a closed position..."
◀ Reverse Prompt
"Close the drawer with left hand by pushing it forward until it is fully closed and flush with the cabinet. The drawer is a wooden kitchen drawer currently in an open position..."

Dataset

22,529 training image pairs with forward & reverse annotations
451 test samples spanning 10 action classes
45,058 total annotations (one forward + one reverse per training pair)
~120 words average expanded prompt length

Data Curation Pipeline

01
Frame Quality Filtering
Starting from EPIC-Kitchens-100 videos, exclude frames with poor lighting or blur. Use Qwen2-VL-7B to verify background and action consistency.
02
Reversible Action Mining
Filter for physically reversible actions: pick-up, put-down, open, close, turn-on, turn-off, move, grab, place, remove. Irreversible actions (e.g., cut) are excluded; a sketch of this step follows the list.
03
Prompt Expansion
Short EPIC-Kitchens narrations (avg. 3 words) are expanded to ~120-word prompts using Qwen3-VL, encoding object attributes, hand posture, and spatial context.
04
Human Verification
A secondary human annotation pass validates action consistency, background preservation, and that the start/final images clearly demonstrate the action boundary.
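
A minimal sketch of the reversible-action mining step (02). The verb-to-inverse pairings below are illustrative assumptions; the paper's exact mapping may differ:

```python
# Illustrative verb-to-inverse map; pairings marked "assumed" are
# guesses for exposition, not confirmed by the paper.
INVERSE = {
    "pick-up": "put-down", "put-down": "pick-up",
    "open": "close",       "close": "open",
    "turn-on": "turn-off", "turn-off": "turn-on",
    "grab": "place",       "place": "grab",
    "move": "move",        # assumed self-inverse (move back)
    "remove": "place",     # assumed pairing
}

def mine_reversible(narrations):
    """Keep narrations whose verb has an inverse; drop the rest.

    `narrations` yields (verb, sentence) pairs, e.g. ("open",
    "open the drawer"). Irreversible verbs like "cut" are absent
    from INVERSE and are therefore filtered out.
    """
    for verb, sentence in narrations:
        if verb in INVERSE:
            yield verb, INVERSE[verb], sentence
```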

Action Distribution (Test Set)

turn-on: 81 · turn-off: 78 · take: 61 · open: 56 · pick-up: 55 · put: 39 · put-down: 33 · get: 24 · close: 14 · move: 10 (451 total)

Results

State-of-the-art models achieve high semantic fidelity but fail at action understanding, especially in the reverse (Undo) direction. High DINO scores do not correlate with high action accuracy.

Zero-Shot Evaluation

Method        DINO-F ↑  DINO-R ↑  CLIP ↑  A-F ↑   A-R ↑   N-F ↑   N-R ↑   EPE-F ↓  EPE-R ↓
Qwen-Image    0.817     0.815     0.258   52.33   29.71   61.20   52.77   89.23    80.86
BAGEL         0.793     0.796     0.262   57.87   33.48   55.65   50.55   121.0    94.07
Flux Kontext  0.750     0.746     0.240   52.23   30.12   53.23   48.18   111.2    95.87

Fine-Tuning with Do-Undo Training Signal

Method               DINO-F ↑  DINO-R ↑  CLIP ↑  A-F ↑   A-R ↑   N-F ↑   N-R ↑   EPE-F ↓  EPE-R ↓
BAGEL                0.796     0.793     0.262   57.87   33.48   55.65   50.55   121.0    94.07
BAGEL-Do(SP)         0.818     0.819     0.254   55.65   34.81   54.55   47.23   118.8    93.27
BAGEL-Do             0.821     0.816     0.250   55.92   34.60   56.87   46.21   124.5    93.70
BAGEL-DoUndo (Ours)  0.836     0.832     0.251   58.77   36.26   58.53   50.47   118.4    90.88
BAGEL-DoUndo outperforms all baselines across both semantic and action understanding metrics.
User study: BAGEL-DoUndo was preferred 66.7% of the time vs. BAGEL (33.3%) for both semantic awareness and action understanding. Training with reverse image pairs significantly improves action reversibility.
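
The Do-Undo training signal amounts to supervising both directions of each pair in a single step. A minimal sketch, where `editing_loss` is a hypothetical stand-in for BAGEL's actual conditional editing objective:

```python
def do_undo_step(model, batch, editing_loss):
    """One training step on a forward/reverse pair.

    editing_loss(model, source_image, prompt, target_image) is a
    placeholder for the model's conditional editing objective.
    """
    src, tgt = batch["input_image"], batch["forward_image"]
    loss_f = editing_loss(model, src, batch["forward_prompt"], tgt)  # Do
    loss_r = editing_loss(model, tgt, batch["reverse_prompt"], src)  # Undo
    return loss_f + loss_r  # joint supervision on both directions
```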

Evaluation Metrics

Semantic Awareness
DINO-F: similarity between the generated forward image and the ground-truth forward image.
DINO-R: similarity between the generated reverse image and the original input; tests restoration quality.
CLIP: similarity of the generated forward image with the ground-truth caption (accounts for diversity).

Action Understanding
A-F / A-R: action-classifier accuracy on the generated forward and reverse images (LaViLa-based).
N-F / N-R: object/noun accuracy in the generated forward and reverse images.
EPE ↓: optical-flow end-point error (RAFT) against ground truth; measures physical plausibility of motion.
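
As a concrete example, the DINO metrics reduce to a cosine similarity between backbone features of two images. A minimal sketch, assuming a DINOv2 ViT-B/14 backbone from torch.hub (the benchmark's exact DINO variant and preprocessing are not specified here):

```python
import torch
import torch.nn.functional as F

# DINOv2 ViT-B/14 backbone; its forward pass returns a global
# (CLS) feature of shape (batch, 768).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def dino_similarity(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """Cosine similarity of DINO features for two preprocessed images.

    Each image is a (3, H, W) tensor, already resized to a multiple
    of 14 and ImageNet-normalized. DINO-R would call this with the
    generated reverse image and the original input image.
    """
    feat_a = dino(img_a.unsqueeze(0))
    feat_b = dino(img_b.unsqueeze(0))
    return F.cosine_similarity(feat_a, feat_b, dim=-1).item()
```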

BibTeX

@article{mahajan2026doundo,
  title   = {Do-Undo Bench: Reversibility for Action Understanding in Image Generation},
  author  = {Mahajan, Shweta and Kadambi, Shreya and Le, Hoang and Yasarla, Rajeev and Bhattacharyya, Apratim and Hayat, Munawar and Porikli, Fatih},
  journal = {arXiv preprint arXiv:2512.13609},
  year    = {2026}
}