Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan1,2*  Shreya Kadambi3*  Hoang Le3  Rajeev Yasarla3  Apratim Bhattacharyya3  Munawar Hayat3  Fatih Porikli3
* Equal contribution
1York University   2Vector Institute for AI   3Qualcomm AI Research

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. A model that genuinely understands physical interactions should be able to reverse an action after performing it; this is the core principle of the Do-Undo task.

We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility despite achieving high semantic fidelity. We demonstrate that explicit supervision on reversible action pairs significantly improves both semantic awareness and action understanding in image generation models.


The Do-Undo Task

A model that genuinely understands physical actions should be able to reverse an action it has just performed. This seemingly simple requirement exposes fundamental failures in current state-of-the-art generative models, revealing gaps between semantic plausibility and true action understanding.

[Task overview]
Input State: initial scene; the action has not yet been performed. (forward prompt → Do)
Do (Forward): the action is performed and the scene state is updated. (reverse prompt → Undo)
Undo (Reverse): the action is reversed; the result must match the input state.
Key insight: reversible actions are the acid test for physical understanding. "Cut the paper" is irreversible, so the round trip cannot be tested. "Open the drawer" is reversible, so the model must generate consistent cause-and-effect transformations that can be inverted.
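
In code, the round trip is just two chained editing calls. Below is a minimal sketch, assuming a generic `edit(image, prompt) -> image` callable as a stand-in for whichever editing API a given model (BAGEL, Flux Kontext, Qwen-Image) actually exposes:

```python
from PIL import Image

def do_undo_round_trip(edit, input_image: Image.Image,
                       forward_prompt: str, reverse_prompt: str):
    """Apply the forward action, then its reverse.

    A model with genuine action understanding should return a
    reverse image that matches the original input state.
    """
    forward_image = edit(input_image, forward_prompt)    # Do: perform the action
    reverse_image = edit(forward_image, reverse_prompt)  # Undo: invert it
    return forward_image, reverse_image
```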

Example Prompt Pair

▶ Forward Prompt
"Open the drawer with left hand by pulling it backward until it is fully opened. The drawer is a wooden kitchen drawer located beneath the countertop, currently in a closed position..."
◀ Reverse Prompt
"Close the drawer with left hand by pushing it forward until it is fully closed and flush with the cabinet. The drawer is a wooden kitchen drawer currently in an open position..."

Dataset

22,529 training image pairs with forward & reverse annotations
451 test samples spanning 10 action classes
45,058 total annotations (one forward + one reverse per training pair)
~120 words average expanded prompt length

Data Curation Pipeline

01
Frame Quality Filtering
Starting from EPIC-Kitchens-100 videos, exclude frames with poor lighting or blur. Use Qwen2-VL-7B to verify background and action consistency.
02
Reversible Action Mining
Filter for physically reversible actions: pick-up, put-down, open, close, turn-on, turn-off, move, grab, place, remove. Irreversible actions (e.g., cut) are excluded; a sketch of this step follows the list.
03
Prompt Expansion
Short EPIC-Kitchens narrations (avg. 3 words) are expanded to ~120-word prompts using Qwen3-VL, encoding object attributes, hand posture, and spatial context.
04
Human Verification
A secondary human annotation pass validates action consistency, background preservation, and that the start/final images clearly demonstrate the action boundary.
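
A minimal sketch of the reversible-action mining step (02). The verb-to-inverse pairings below are illustrative assumptions; the paper's exact mapping may differ:

```python
# Illustrative verb-to-inverse map; pairings marked "assumed" are
# guesses for exposition, not confirmed by the paper.
INVERSE = {
    "pick-up": "put-down", "put-down": "pick-up",
    "open": "close",       "close": "open",
    "turn-on": "turn-off", "turn-off": "turn-on",
    "grab": "place",       "place": "grab",
    "move": "move",        # assumed self-inverse (move back)
    "remove": "place",     # assumed pairing
}

def mine_reversible(narrations):
    """Keep narrations whose verb has an inverse; drop the rest.

    `narrations` yields (verb, sentence) pairs, e.g. ("open",
    "open the drawer"). Irreversible verbs like "cut" are absent
    from INVERSE and are therefore filtered out.
    """
    for verb, sentence in narrations:
        if verb in INVERSE:
            yield verb, INVERSE[verb], sentence
```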

Action Distribution (Test Set)

turn-on: 81 · turn-off: 78 · take: 61 · open: 56 · pick-up: 55 · put: 39 · put-down: 33 · get: 24 · close: 14 · move: 10 (451 total)

Results

State-of-the-art models achieve high semantic fidelity but fail at action understanding, especially in the reverse (Undo) direction. High DINO scores do not correlate with high action accuracy.

Zero-Shot Evaluation

Method        DINO-F ↑  DINO-R ↑  CLIP ↑  A-F ↑   A-R ↑   N-F ↑   N-R ↑   EPE-F ↓  EPE-R ↓
Qwen-Image    0.817     0.815     0.258   52.33   29.71   61.20   52.77   89.23    80.86
BAGEL         0.793     0.796     0.262   57.87   33.48   55.65   50.55   121.0    94.07
Flux Kontext  0.750     0.746     0.240   52.23   30.12   53.23   48.18   111.2    95.87

Fine-Tuning with Do-Undo Training Signal

Method               DINO-F ↑  DINO-R ↑  CLIP ↑  A-F ↑   A-R ↑   N-F ↑   N-R ↑   EPE-F ↓  EPE-R ↓
BAGEL                0.796     0.793     0.262   57.87   33.48   55.65   50.55   121.0    94.07
BAGEL-Do(SP)         0.818     0.819     0.254   55.65   34.81   54.55   47.23   118.8    93.27
BAGEL-Do             0.821     0.816     0.250   55.92   34.60   56.87   46.21   124.5    93.70
BAGEL-DoUndo (Ours)  0.836     0.832     0.251   58.77   36.26   58.53   50.47   118.4    90.88
BAGEL-DoUndo outperforms all baselines across both semantic and action understanding metrics.
User study: BAGEL-DoUndo was preferred 66.7% of the time vs. BAGEL (33.3%) for both semantic awareness and action understanding. Training with reverse image pairs significantly improves action reversibility.
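
The Do-Undo training signal amounts to supervising both directions of each pair in a single step. A minimal sketch, where `editing_loss` is a hypothetical stand-in for BAGEL's actual conditional editing objective:

```python
def do_undo_step(model, batch, editing_loss):
    """One training step on a forward/reverse pair.

    editing_loss(model, source_image, prompt, target_image) is a
    placeholder for the model's conditional editing objective.
    """
    src, tgt = batch["input_image"], batch["forward_image"]
    loss_f = editing_loss(model, src, batch["forward_prompt"], tgt)  # Do
    loss_r = editing_loss(model, tgt, batch["reverse_prompt"], src)  # Undo
    return loss_f + loss_r  # joint supervision on both directions
```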

Evaluation Metrics

Semantic Awareness
DINO-F: similarity between the generated forward image and the ground-truth forward image.
DINO-R: similarity between the generated reverse image and the original input; tests restoration quality.
CLIP: similarity of the generated forward image with the ground-truth caption (accounts for diversity).

Action Understanding
A-F / A-R: action-classifier accuracy on the generated forward and reverse images (LaViLa-based).
N-F / N-R: object/noun accuracy in the generated forward and reverse images.
EPE ↓: optical-flow end-point error (RAFT) against ground truth; measures physical plausibility of motion.
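
As a concrete example, the DINO metrics reduce to a cosine similarity between backbone features of two images. A minimal sketch, assuming a DINOv2 ViT-B/14 backbone from torch.hub (the benchmark's exact DINO variant and preprocessing are not specified here):

```python
import torch
import torch.nn.functional as F

# DINOv2 ViT-B/14 backbone; its forward pass returns a global
# (CLS) feature of shape (batch, 768).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def dino_similarity(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """Cosine similarity of DINO features for two preprocessed images.

    Each image is a (3, H, W) tensor, already resized to a multiple
    of 14 and ImageNet-normalized. DINO-R would call this with the
    generated reverse image and the original input image.
    """
    feat_a = dino(img_a.unsqueeze(0))
    feat_b = dino(img_b.unsqueeze(0))
    return F.cosine_similarity(feat_a, feat_b, dim=-1).item()
```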

BibTeX

@article{mahajan2026doundo,
  title   = {Do-Undo Bench: Reversibility for Action Understanding in Image Generation},
  author  = {Mahajan, Shweta and Kadambi, Shreya and Le, Hoang and Yasarla, Rajeev and Bhattacharyya, Apratim and Hayat, Munawar and Porikli, Fatih},
  journal = {arXiv preprint arXiv:2512.13609},
  year    = {2026}
}