Composite action-annotated video generations across all 46 Melting Pot environments.
Each video is annotated with the prescribed action for every player at each timestep. Before the action is applied, icons have a white border; after the action, the border turns green (correct) or red (incorrect), indicating whether the generated frame matches the expected effect. Colored circle markers show the model’s predicted coordinates for each player.
We demonstrate that our model, trained on 4-action sequences (5 latent frames), can generate coherent 20-action rollouts via sliding-window autoregressive inference. The grid shows Ground Truth (top row) vs. Ours (bottom row) across 5 seeds, with action icons annotated at player positions.
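As a rough illustration of the sliding-window idea, the sketch below chains 4-action windows, conditioning each window on the last frame produced so far. It assumes non-overlapping windows and a single carried-over context frame; the actual stride and context length are not stated here. `generate_window` and `toy_generate_window` are hypothetical stand-ins for the model, not its real interface.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")

WINDOW_ACTIONS = 4  # actions per training window (from the text)


def sliding_window_rollout(
    first_frame: Frame,
    actions: Sequence[int],
    generate_window: Callable[[Frame, Sequence[int]], List[Frame]],
    window: int = WINDOW_ACTIONS,
) -> List[Frame]:
    """Generate one frame per action by chaining short windows.

    Assumption: each window is conditioned only on the most recent frame,
    and windows do not overlap. The real model may differ.
    """
    frames: List[Frame] = [first_frame]
    for start in range(0, len(actions), window):
        chunk = actions[start:start + window]
        # Hypothetical model call: returns one new frame per action in chunk.
        new_frames = generate_window(frames[-1], chunk)
        if len(new_frames) != len(chunk):
            raise ValueError("expected one generated frame per action")
        frames.extend(new_frames)
    return frames


def toy_generate_window(context_frame: int, chunk: Sequence[int]) -> List[int]:
    """Toy stand-in for the model: a 'frame' is an integer running sum."""
    out = []
    state = context_frame
    for action in chunk:
        state = state + action
        out.append(state)
    return out


# 20 actions are processed as 5 windows of 4 actions each.
rollout = sliding_window_rollout(0, [1] * 20, toy_generate_window)
```

Under these assumptions a 20-action rollout contains 21 frames: the initial frame plus one generated frame per action.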
We test our model’s ability to generalize to unseen player counts. The model was trained on games with 2–7 players (padded to 8). Here we generate Coins rollouts with 1 to 8 active players using the same checkpoint, so the 1-player and 8-player settings lie outside the training range. The results demonstrate compositional generalization across player counts. Each row shows a different player count; columns are independent seeds.
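One common way to handle a variable number of players with a fixed-size input is to pad to the maximum and carry a mask of active slots. The sketch below shows that pattern for per-player actions. The padding value `NOOP_ACTION`, the boolean mask, and the function name `pad_players` are assumptions for illustration; the text only states that inputs are padded to 8.

```python
from typing import List, Sequence, Tuple

MAX_PLAYERS = 8   # padded size (from the text)
NOOP_ACTION = 0   # hypothetical padding value; not specified in the text


def pad_players(
    actions: Sequence[int],
    max_players: int = MAX_PLAYERS,
    pad_value: int = NOOP_ACTION,
) -> Tuple[List[int], List[bool]]:
    """Pad per-player actions to a fixed length and return an active mask.

    mask[i] is True for real players and False for padded slots.
    """
    n = len(actions)
    if not 1 <= n <= max_players:
        raise ValueError(f"player count must be in 1..{max_players}, got {n}")
    n_pad = max_players - n
    padded = list(actions) + [pad_value] * n_pad
    mask = [True] * n + [False] * n_pad
    return padded, mask


# Three active players occupy the first three slots; the rest are padding.
padded_actions, active_mask = pad_players([3, 1, 2])
```

With this layout, the 1-player and 8-player cases use the same input shape as the training data and differ only in how many mask entries are active.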
Columns: Ground Truth | Ours | w/o MSA | w/o MCA | Frame-wise MCA | No RoPE in SA. Rows: Seeds 0–4.
Each game section below contains ground truth rollouts, our generated results (4-action and 20-action), and side-by-side baseline comparisons with action icon annotations. Click a game name to expand.