Supplementary Materials

Composite action-annotated video generations across all 46 Melting Pot environments.

Each video annotates the prescribed action for every player at each timestep. Before the action is applied, icons have a white border; after the action, the border turns green (correct) or red (incorrect), indicating whether the generated frame matches the expected effect. Colored circle markers show the model’s predicted coordinates for each player.

Action Icons (relative to player orientation)

Forward

Backward

Strafe Left

Strafe Right

Turn Left

Turn Right

Interact

Noop

Icon Border Color

White border: pre-action (about to apply)

Green border: correct effect

Red border: incorrect effect
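
The border rule in this legend can be summarized in a few lines. This is a minimal sketch, not the annotation code used for the videos: the function name `border_color`, the idea of comparing an expected player state against a detected one, and the choice to mark an undetected player as incorrect are all assumptions made for illustration.

```python
from typing import Hashable, Optional


def border_color(expected: Hashable, detected: Optional[Hashable], applied: bool) -> str:
    """Return the icon border color for one player at one timestep.

    expected: state the action should produce (hypothetical representation).
    detected: state read off the generated frame, or None if no player was found.
    applied:  False while the action is still pending.
    """
    if not applied:
        return "white"  # pre-action: about to apply
    if detected is not None and detected == expected:
        return "green"  # generated frame matches the expected effect
    return "red"  # mismatch, or (by assumption) the player was not detected


# Example: a player expected to end up at grid cell (3, 2).
pending = border_color((3, 2), None, applied=False)
correct = border_color((3, 2), (3, 2), applied=True)
wrong = border_color((3, 2), (3, 3), applied=True)
```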

Baseline Comparison Layout

Columns, left to right: GT (ground truth) | Ours | Text-Action | Pretrained AR | Zero-shot I2V. The Zero-shot I2V baseline predicts only a single next frame, so its panel is black for the remaining steps.
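
As a rough illustration of this layout, the sketch below pads a short video with black frames and concatenates one frame per method side by side. It is only a sketch under stated assumptions: frames are represented as nested lists of grayscale pixel values, and the helper names `pad_with_black` and `compose_timestep` are hypothetical, not taken from the authors' tooling.

```python
from typing import Dict, List

Frame = List[List[int]]  # rows of grayscale pixels (assumed representation)

COLUMN_ORDER = ["GT", "Ours", "Text-Action", "Pretrained AR", "Zero-shot I2V"]


def pad_with_black(frames: List[Frame], total_steps: int) -> List[Frame]:
    """Append all-black frames until the video has total_steps frames.

    Inputs that are already long enough are returned unchanged.
    """
    if not frames:
        raise ValueError("need at least one frame to infer the frame size")
    black = [[0 for _ in row] for row in frames[0]]
    return list(frames) + [black] * (total_steps - len(frames))


def compose_timestep(videos: Dict[str, List[Frame]], t: int) -> Frame:
    """Concatenate the t-th frame of every method horizontally, in column order."""
    panels = [videos[name][t] for name in COLUMN_ORDER]
    height = len(panels[0])
    return [[px for panel in panels for px in panel[r]] for r in range(height)]


# Toy data: 2x2 frames, three timesteps; the I2V baseline has a single frame.
def solid(value: int) -> Frame:
    return [[value, value], [value, value]]


videos = {name: [solid(9)] * 3 for name in COLUMN_ORDER[:-1]}
videos["Zero-shot I2V"] = pad_with_black([solid(5)], total_steps=3)

first = compose_timestep(videos, 0)
last = compose_timestep(videos, 2)
```

In this toy example the rightmost panel carries the single predicted frame at the first timestep and is black afterwards.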

Orientation Cones (detected player facing direction)

A semi-transparent cone is drawn from each detected player in the direction they are facing.

Facing Up (North)

Facing Right (East)

Facing Down (South)

Facing Left (West)
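
Because the action icons are relative to player orientation, checking an action's effect requires converting it to a world-frame change using the facing direction. The sketch below shows one way to do that. It assumes image coordinates (x grows to the right, y grows downward, so North is `(0, -1)`), one grid cell per move, and no walls or collisions; the action names and the `apply_action` helper are hypothetical.

```python
from typing import Tuple

ORDER = ["N", "E", "S", "W"]  # clockwise, starting from Facing Up (North)
DELTA = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}

# Clockwise quarter turns from the facing direction for each movement action.
MOVE_OFFSET = {"forward": 0, "strafe_right": 1, "backward": 2, "strafe_left": 3}


def apply_action(x: int, y: int, facing: str, action: str) -> Tuple[int, int, str]:
    """Return the expected (x, y, facing) after one orientation-relative action."""
    i = ORDER.index(facing)
    if action == "turn_left":
        return x, y, ORDER[(i - 1) % 4]
    if action == "turn_right":
        return x, y, ORDER[(i + 1) % 4]
    if action in MOVE_OFFSET:
        dx, dy = DELTA[ORDER[(i + MOVE_OFFSET[action]) % 4]]
        return x + dx, y + dy, facing  # moving does not change the facing
    return x, y, facing  # interact and noop leave the pose unchanged


# A player facing East who strafes left moves one cell up (North) on screen.
moved = apply_action(3, 3, "E", "strafe_left")
```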

Example Generation (Coins, single seed)

Sections

Long Horizon Generation

Variable Player Count

Ablation Study (Coins)

Qualitative Results (All Games)

Long Horizon Generation — Coins (20 Actions, Sliding Window)

We demonstrate that our model, trained on 4-action sequences (5 latent frames), can generate coherent 20-action rollouts via sliding-window autoregressive inference. The grid shows Ground Truth (top row) vs. Ours (bottom row) across 5 seeds, with action icons annotated at player positions.

GT vs. Ours — 5 Seeds × 20 Actions
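
The sliding-window idea can be sketched as follows. This is an illustrative sketch, not the released inference code: it assumes the window advances by 4 actions at a time and that each new window is conditioned only on the last frame generated so far. The page does not specify the stride or the conditioning context, and `generate_window` stands in for the real model call.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Action = TypeVar("Action")


def sliding_window_rollout(
    first_frame: Frame,
    actions: Sequence[Action],
    generate_window: Callable[[Frame, Sequence[Action]], List[Frame]],
    window: int = 4,
) -> List[Frame]:
    """Roll out len(actions) steps using a model limited to `window` actions."""
    frames = [first_frame]
    for start in range(0, len(actions), window):
        chunk = actions[start:start + window]
        # Condition on the most recent frame, then append the new frames.
        new_frames = generate_window(frames[-1], chunk)
        if len(new_frames) != len(chunk):
            raise ValueError("model must return one frame per action")
        frames.extend(new_frames)
    return frames


# Toy stand-in for the model: a frame is an integer, an action adds to it.
calls = []


def toy_model(context: int, chunk: Sequence[int]) -> List[int]:
    calls.append(len(chunk))
    out, state = [], context
    for a in chunk:
        state += a
        out.append(state)
    return out


frames = sliding_window_rollout(0, [1] * 20, toy_model)
```

With 20 actions and a window of 4, the toy model is called five times and the rollout contains the initial frame plus 20 generated frames.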

Variable Player Count — Coins (1–8 Players)

We test our model’s ability to generalize to unseen player counts. The model was trained on games with 2–7 players (padded to 8), so the 1-player and 8-player settings fall outside the training range. Here we generate Coins rollouts with 1 to 8 active players using the same checkpoint, demonstrating compositional generalization across player counts. Each row shows a different player count; columns are independent seeds.

Generated Rollouts — 5 Seeds × 8 Player Counts
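
One common way to handle a variable number of players with a fixed-size input is to pad the per-player actions to the maximum count and carry a mask of active slots. The sketch below illustrates that pattern only. The page states that inputs are padded to 8 but does not say how, so the fill value `PAD_ACTION`, the boolean mask, and the `pad_actions` helper are assumptions.

```python
from typing import List, Sequence, Tuple

MAX_PLAYERS = 8
PAD_ACTION = -1  # hypothetical filler for inactive player slots


def pad_actions(
    actions: Sequence[int], max_players: int = MAX_PLAYERS
) -> Tuple[List[int], List[bool]]:
    """Pad one timestep of per-player actions to a fixed number of slots.

    Returns the padded action list and a mask that is True for active players.
    """
    n = len(actions)
    if not 1 <= n <= max_players:
        raise ValueError(f"expected 1 to {max_players} players, got {n}")
    padding = max_players - n
    padded = list(actions) + [PAD_ACTION] * padding
    mask = [True] * n + [False] * padding
    return padded, mask


# Three active players; the remaining five slots are padding.
padded, mask = pad_actions([0, 4, 6])
```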

Ablation Study — Coins (20-step, 256×256)

Architecture Ablations (5 Seeds)

Qualitative Results — Per-Game Videos

Each game section below contains ground truth rollouts, our generated results (4-action and 20-action), and side-by-side baseline comparisons with action icon annotations.

Allelopathic Harvest: Open

Bach Or Stravinsky In The Matrix: Arena

Bach Or Stravinsky In The Matrix: Repeated

Chemistry: Three Metabolic Cycles

Chemistry: Three Metabolic Cycles With Plentiful Distractors

Chemistry: Two Metabolic Cycles

Chemistry: Two Metabolic Cycles With Distractors

Chicken In The Matrix: Arena

Chicken In The Matrix: Repeated

Clean Up

Coins

Collaborative Cooking: Asymmetric

Collaborative Cooking: Circuit

Collaborative Cooking: Cramped

Collaborative Cooking: Crowded

Collaborative Cooking: Figure Eight

Collaborative Cooking: Forced

Collaborative Cooking: Ring

Commons Harvest: Closed

Commons Harvest: Open

Commons Harvest: Partnership

Coop Mining

Daycare

Externality Mushrooms: Dense

Factory Commons: Either Or

Fruit Market: Concentric Rivers

Gift Refinements

Paintball: Capture The Flag

Paintball: King Of The Hill

Predator Prey: Alley Hunt

Predator Prey: Open

Predator Prey: Orchard

Predator Prey: Random Forest

Prisoners Dilemma In The Matrix: Arena

Prisoners Dilemma In The Matrix: Repeated

Pure Coordination In The Matrix: Arena

Pure Coordination In The Matrix: Repeated

Rationalizable Coordination In The Matrix: Arena

Rationalizable Coordination In The Matrix: Repeated

Running With Scissors In The Matrix: Arena

Running With Scissors In The Matrix: One Shot

Running With Scissors In The Matrix: Repeated

Stag Hunt In The Matrix: Arena

Stag Hunt In The Matrix: Repeated

Territory: Inside Out

Territory: Rooms

Qualitative Results — Per-Game Videos

Allelopathic Harvest: Open

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Bach Or Stravinsky In The Matrix: Arena

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Bach Or Stravinsky In The Matrix: Repeated

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chemistry: Three Metabolic Cycles

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chemistry: Three Metabolic Cycles With Plentiful Distractors

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chemistry: Two Metabolic Cycles

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chemistry: Two Metabolic Cycles With Distractors

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chicken In The Matrix: Arena

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Chicken In The Matrix: Repeated

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Clean Up

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Coins

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Collaborative Cooking: Asymmetric

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Collaborative Cooking: Circuit

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Collaborative Cooking: Cramped

Ground Truth Rollouts (5 Seeds)

Our Results (5 Seeds, 4 Actions)

20-Step Inference (GT vs. Ours)

Baseline Comparison

Collaborative Cooking: Crowded