ActionParty: Multi-Subject Action Binding in Generative Video Games

1Snap Research, 2University of Oxford, 3University of Toronto, 4MBZUAI
Action Binding Problem

Existing video generation models struggle to bind distinct actions to specific subjects. Even in a minimal setup with two coloured shapes on a white background, a state-of-the-art model (Veo 3) fails when prompted with per-subject action sequences such as “The red triangle moves right and the blue square moves up…”: actions are frequently swapped between subjects or ignored entirely. ActionParty addresses this by introducing per-subject state tokens with a spatial biasing mechanism, enabling correct multi-subject action control.

Abstract

Recent advances in video diffusion have enabled the development of “world models” capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.

In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. We propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates.

We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions. See the examples below for generations across all 46 games.

ActionParty Generations

ActionParty operates as a generative game engine, jointly modeling video frames and the state of each subject. Given an initial frame and per-player action sequences, it autoregressively generates future game states entirely through learned video diffusion. Each player is independently controlled: per-step actions are shown as icons, orientation cones indicate detected facing direction, and green/red borders denote whether each action was correctly executed. Generations across all 46 environments with baseline comparisons are available in the full results gallery.
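The rollout described above can be sketched as a simple loop. This is a minimal sketch only: `denoise_fn`, the state-token representation, and all shapes are illustrative assumptions, not the released API.

```python
def rollout(initial_frame, state_tokens, action_sequences, denoise_fn, num_steps):
    """Generate num_steps frames, one per action step, carrying
    per-subject state tokens forward between steps (illustrative sketch)."""
    frames = [initial_frame]
    for t in range(num_steps):
        # One action per subject at this timestep.
        actions_t = [seq[t] for seq in action_sequences]
        # The diffusion model denoises the next frame conditioned on the
        # current frame, state tokens, and actions, and returns updated
        # state tokens (e.g. new subject coordinates).
        next_frame, state_tokens = denoise_fn(frames[-1], state_tokens, actions_t)
        frames.append(next_frame)
    return frames
```

Because state tokens are threaded through every step, each subject's identity and position persist across the autoregressive rollout.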

Action Icons (relative to player orientation): F = Forward, B = Backward, L = Strafe Left, R = Strafe Right, TL = Turn Left, TR = Turn Right; separate icons mark Interact and Noop.

Border Color: pre-action (about to apply), correct effect, incorrect effect.

Orientation Cones (detected player facing direction): Up (N), Right (E), Down (S), Left (W).

Coins (2 players): step-by-step generation with action annotations.

Cooking: Asymmetric (2 players)

Clean Up (4 players)

Coop Mining (4 players)

See the full results gallery for all 46 games with seed variations and baseline comparisons.

Architecture Overview

ActionParty operates as a generative game engine, jointly modeling subject state tokens and video latents. Given context frames, per-subject state tokens, and per-subject actions, the Diffusion Transformer (DiT) denoises the next frame while simultaneously predicting updated subject coordinates. Attention masking enforces action-subject correspondence, enabling consistent multi-subject action control in a single forward pass.

ActionParty architecture

Video tokens and subject state tokens are concatenated and processed through a DiT with 3D RoPE for spatial binding and attention masking for action isolation.

Action-Subject Binding

The core challenge in multi-subject control is ensuring each player's actions only affect that player, not others. We solve this with two complementary mechanisms:

RoPE spatial binding

(a) Self-attention: 3D RoPE biases each subject state token to the spatial location of its corresponding subject in the video.
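As a toy illustration of this spatial biasing idea (the function name, patch-grid layout, and coordinate convention are assumptions, not the paper's implementation), a subject state token can simply be assigned the rotary position of the video patch its subject occupies, so it shares rotary phases with the video content it should bind to:

```python
def rope_positions(grid_hw, subject_xy, patch_size=1):
    """Toy illustration: video patch tokens get (row, col) grid positions;
    each subject state token reuses the (row, col) of the patch containing
    its subject, so rotary embeddings place it at the same spatial phase
    as that region of the video."""
    H, W = grid_hw
    # One (row, col) position per video patch token, in raster order.
    video_pos = [(r, c) for r in range(H) for c in range(W)]
    # Each state token inherits the grid cell of its subject's (x, y) location.
    state_pos = [(int(y) // patch_size, int(x) // patch_size) for x, y in subject_xy]
    return video_pos, state_pos
```

Under this scheme, attention logits between a state token and nearby video patches are boosted by the relative-position structure of RoPE, anchoring the token to its subject.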

Cross-attention masking

(b) Cross-attention: Masks ensure each subject state token only attends to its own actions. Text attends to video tokens.
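A minimal sketch of such a mask, assuming one state token per subject and action tokens grouped subject-by-subject (this layout is an assumption, not the paper's exact masking code):

```python
import numpy as np

def action_binding_mask(num_subjects, actions_per_subject):
    """Boolean cross-attention mask: entry [i, j] is True iff subject i's
    state token may attend to action token j. With action tokens laid out
    subject-by-subject, the mask is block-diagonal, so no state token can
    read another subject's actions."""
    mask = np.zeros((num_subjects, num_subjects * actions_per_subject), dtype=bool)
    for i in range(num_subjects):
        mask[i, i * actions_per_subject:(i + 1) * actions_per_subject] = True
    return mask
```

Inside cross-attention, the masked-out (False) entries would typically be set to negative infinity before the softmax, zeroing their attention weight.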

Together, RoPE spatial biasing anchors each subject's state to specific spatial coordinates within the video, while attention masking isolates action-to-subject correspondence, even when subjects are visually identical. These mechanisms let ActionParty scale to seven simultaneous players with reliable subject localization and consistent action binding.

Related Links

ActionParty builds on recent progress in video diffusion and world models. Genie and GameNGen demonstrate video generation as world models for single-agent game environments. Wan2.1 is the open-source DiT backbone that ActionParty fine-tunes from. The Melting Pot benchmark provides the 46 multi-agent game environments used for training and evaluation.

Recent concurrent work on multi-player world models, including Multiverse, Solaris, and MultiGen, generates a separate first-person view for each player; MultiGen additionally predicts explicit state, with each action targeting its player's own video stream. In contrast, ActionParty controls multiple subjects within a single shared video frame, directly tackling the action binding problem in world models.

BibTeX

@article{pondaven2026actionparty,
  title={ActionParty: Multi-Subject Action Binding in Generative Video Games},
  author={Pondaven, Alexander and Wu, Ziyi and Gilitschenski, Igor and Torr, Philip and Tulyakov, Sergey and Pizzati, Fabio and Siarohin, Aliaksandr},
  year={2026}
}