Supplementary Materials

Composite action-annotated video generations across all 46 Melting Pot environments.

Each video annotates the prescribed action for every player at each timestep. Before the action is applied, icons have a white border; after the action, the border turns green (correct) or red (incorrect), indicating whether the generated frame matches the expected effect. Colored circle markers show the model’s predicted coordinates for each player.

Action Icons (relative to player orientation)
F Forward
B Backward
L Strafe Left
R Strafe Right
TL Turn Left
TR Turn Right
Interact
Noop
Icon Border Color
White: Pre-action (about to apply)
Green: Correct effect
Red: Incorrect effect
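The border rule above can be written as a small lookup. This is only an illustrative sketch: the function name, the `Optional[bool]` encoding of "not yet applied", and the exact RGB values are assumptions, since the page names the colors but not how the annotation tool is implemented.

```python
from typing import Optional, Tuple

# Hypothetical RGB values; the page only names the colors.
WHITE = (255, 255, 255)
GREEN = (0, 255, 0)
RED = (255, 0, 0)


def border_color(effect_matches: Optional[bool]) -> Tuple[int, int, int]:
    """Return the icon border color for one player at one timestep.

    effect_matches is None before the action is applied, True if the
    generated frame shows the expected effect, and False otherwise.
    """
    if effect_matches is None:
        return WHITE
    return GREEN if effect_matches else RED
```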
Baseline Comparison Layout
GT | Ours | Text-Action | Pretrained AR | Zero-shot I2V
Zero-shot I2V shows only a single next-frame prediction; its remaining steps are black.
Orientation Cones (detected player facing direction)
A semi-transparent cone is drawn from each detected player in the direction they are facing.
Facing Up (North)
Facing Right (East)
Facing Down (South)
Facing Left (West)
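Because the action icons are relative to player orientation, the expected on-screen effect of an action depends on the facing direction shown by the cone. The sketch below illustrates that mapping under stated assumptions: image coordinates with y increasing downward, one grid cell per move, and no walls, collisions, or other game rules. The names `FACING`, `STEP`, and `apply_action` are hypothetical and are not taken from Melting Pot or from our code.

```python
from typing import Tuple

FACING = ["N", "E", "S", "W"]  # clockwise order
# Grid step per facing direction, with y increasing downward on screen.
STEP = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
# Clockwise quarter-turns from the facing direction for each move action.
MOVE_OFFSET = {"F": 0, "R": 1, "B": 2, "L": 3}
# Change in facing index for each turn action.
TURN_OFFSET = {"TL": -1, "TR": 1}


def apply_action(x: int, y: int, facing: str, action: str) -> Tuple[int, int, str]:
    """Return the expected (x, y, facing) after one egocentric action."""
    i = FACING.index(facing)
    if action in MOVE_OFFSET:
        dx, dy = STEP[FACING[(i + MOVE_OFFSET[action]) % 4]]
        return x + dx, y + dy, facing
    if action in TURN_OFFSET:
        return x, y, FACING[(i + TURN_OFFSET[action]) % 4]
    # Interact and Noop leave the pose unchanged in this simplified model.
    return x, y, facing
```

For example, a player facing East who strafes left moves one cell up on screen, while a player facing North who turns left stays in place and ends up facing West.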
Example Generation (Coins, single seed)

Sections

Long Horizon Generation — Coins (20 Actions, Sliding Window)

We demonstrate that our model, trained on 4-action sequences (5 latent frames), can generate coherent 20-action rollouts via sliding-window autoregressive inference. The grid shows Ground Truth (top row) vs. Ours (bottom row) across 5 seeds, with action icons annotated at player positions.
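A minimal sketch of how a sliding-window rollout of this kind can be organized is given below. It assumes each window is conditioned on the single most recent frame plus the next 4 actions, and that windows advance by 4 actions; the actual context length and stride are not specified here. `generate_window` is a hypothetical stand-in for the model, and the toy version used in the example treats frames as integers so that the loop can be run without a checkpoint.

```python
from typing import Callable, List, Sequence

WINDOW_ACTIONS = 4  # the model is trained on 4-action sequences (5 latent frames)


def sliding_window_rollout(
    generate_window: Callable[[int, Sequence[int]], List[int]],
    first_frame: int,
    actions: Sequence[int],
) -> List[int]:
    """Roll out len(actions) steps, one 4-action window at a time.

    generate_window(context_frame, chunk) is assumed to return the context
    frame followed by one new frame per action in the chunk.
    """
    frames = [first_frame]
    for start in range(0, len(actions), WINDOW_ACTIONS):
        chunk = actions[start:start + WINDOW_ACTIONS]
        window = generate_window(frames[-1], chunk)
        frames.extend(window[1:])  # drop the repeated context frame
    return frames


def toy_generate_window(context_frame: int, chunk: Sequence[int]) -> List[int]:
    """Toy stand-in for the model: each action adds its value to the frame."""
    out = [context_frame]
    for action in chunk:
        out.append(out[-1] + action)
    return out


# 20 actions are covered by 5 windows of 4 actions each.
frames = sliding_window_rollout(toy_generate_window, 0, [1] * 20)
```

With 20 actions the loop runs 5 windows and returns 21 frames: the initial frame plus one per action.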

GT vs. Ours — 5 Seeds × 20 Actions

Variable Player Count — Coins (1–8 Players)

We test our model’s ability to generalize to unseen player counts. The model was trained on games with 2–7 players (padded to 8), so the 1-player and 8-player settings were never seen during training. Here we generate Coins rollouts with 1 to 8 active players using the same checkpoint, demonstrating compositional generalization across player counts. Each row shows a different player count; columns are independent seeds.
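One common way to handle a variable number of players with a fixed-size input is to pad the per-player actions to the maximum count and carry a mask of active slots. The sketch below illustrates that idea only. The page states that inputs are padded to 8 but not how padded slots are encoded, so the `"Noop"` filler, the boolean mask, and the function name are assumptions.

```python
from typing import List, Sequence, Tuple

MAX_PLAYERS = 8
PAD_ACTION = "Noop"  # assumption: padded slots receive a no-op action


def pad_player_actions(actions: Sequence[str]) -> Tuple[List[str], List[bool]]:
    """Pad per-player actions to MAX_PLAYERS and return an active-player mask."""
    n = len(actions)
    if not 1 <= n <= MAX_PLAYERS:
        raise ValueError(f"expected 1 to {MAX_PLAYERS} players, got {n}")
    padded = list(actions) + [PAD_ACTION] * (MAX_PLAYERS - n)
    mask = [True] * n + [False] * (MAX_PLAYERS - n)
    return padded, mask


# Three active players, five padded slots.
padded, mask = pad_player_actions(["F", "TL", "B"])
```

Under this scheme the same fixed-size input can represent any count from 1 to 8 by changing only which slots are marked active.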

Generated Rollouts — 5 Seeds × 8 Player Counts

Ablation Study — Coins (20-step, 256×256)

Architecture Ablations (5 Seeds)

Columns: Ground Truth | Ours | w/o MSA | w/o MCA | Frame-wise MCA | No RoPE in SA. Rows: Seeds 0–4.

Qualitative Results — Per-Game Videos

Each game section below contains ground truth rollouts, our generated results (4-action and 20-action), and side-by-side baseline comparisons with action icon annotations.