arXiv Papers on World Models

Abstract:
Vegetation monitoring under climate stress requires answering not only how vegetation will evolve given the expected weather, but also how it would respond to alternative meteorological conditions. Forecasting models return the expected vegetation state for the observed weather and cannot answer these scenario‑conditioned questions, because future weather is fixed to the recorded trajectory. We present VegSim, a geospatial world model for scenario‑conditioned vegetation simulation. VegSim infers a latent vegetation state from sparse satellite‑derived NDVI histories, past meteorological covariates, and static spatial context, propagates it forward under future weather forcing through recurrent latent dynamics, and decodes predictive NDVI quantiles at each lead time. Because future forcing enters as a controllable input, the same trained model supports probabilistic forecasting under observed weather and conditional simulation under user‑defined meteorological forcing, without supervision on scenario responses. We evaluate VegSim on GreenEarthNet across in‑distribution data and spatial, temporal, and joint spatial‑temporal shift, where it achieves strong point and probabilistic accuracy against time series and Earth observation forecasting baselines while using a compact architecture. We then simulate vegetation responses across Europe under four meteorological scenarios, and in a France summer 2022 case study, obtaining spatially coherent patterns consistent with known sensitivity to temperature and precipitation. The code is available at https://github.com/arco‑group/vegsim.

Abstract:
Imitation learning has emerged as a powerful paradigm for learning visuomotor policies, but its generalisation and stability are limited by the scale and quality of demonstration data needed. A promising direction is to leverage more abundant but heterogeneous data sources, which differ in action space and often lack action labels altogether. Existing co‑training approaches that combine heterogeneous data sources rely on heuristic and hand‑engineered alignment techniques. In contrast, we argue that action representations should be grounded in prediction: actions that produce the same effect on the environment should share the same representation, regardless of their sources. To this end, we instantiate this principle by using a grounded latent‑action world model (GLAM), a pair of generative models with a shared latent action space across data sources that is grounded by predicting future observations consistently across sources. This latent action space is used to train downstream behavioural cloning (BC) policies which map observations to latent actions and decode them back to robot actions, providing a paradigm for learning from heterogeneous data. Empirically, we demonstrate that GLAM successfully learns an aligned latent action space that facilitates action transfer across data sources with and without action labels. Across five manipulation tasks in simulation and in the real world, GLAM‑aligned policies significantly outperform BC baselines and prior latent‑action methods, achieving an average of +48% improvement in task success rate with the same data‑scarce setting. Videos and code are available at https://viccccciv.github.io/glam/.

Abstract:
World Action Models (WAMs) are embodied predictive‑action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision‑language backbones without a video‑generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action‑grounded video world models, Vision‑Language‑Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video‑generation‑free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive‑action methods whose design choices trade representational richness against compute, memory, latency, and action‑label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world‑action‑models.github.io/.

Abstract:
Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first‑frame‑anchored source‑to‑state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo‑World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather‑dependent appearance and particle effects. Additionally, Scene‑Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over‑amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo‑World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video‑to‑video weather editing baselines on weather‑state generation. Our project page is available at https://xiangchenyin.github.io/Holo‑World/.

Abstract:
Interactive world models aim to simulate environment dynamics under real‑time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt‑to‑full‑video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation‑centric generators to support mid‑rollout object interaction within a chunk‑autoregressive framework. We argue that the navigation‑interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human‑object interaction data with accurate, dense labels. Second, a memory bottleneck: recency‑biased history compression in existing world models discards the event‑transition frames that causally determine subsequent object states, leading to an action‑forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per‑chunk captions via chain‑of‑thought reasoning. On the model side, we introduce a hierarchical action‑aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event‑update and object‑identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation‑only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

Abstract:
Inference‑time steering adapts pre‑trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact‑rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo‑tactile inference‑time steering framework that formulates multimodal guidance as a bi‑level optimization problem. At the high level, visual sampling‑and‑verification performs long‑horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile‑guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome‑based steering, ViTaL learns a visuo‑tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text‑conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real‑world contact‑rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin‑wu98.github.io/vital_website.

Abstract:
The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far‑reaching ‑‑ policy evaluation, policy improvement, and test‑time planning ‑‑ all with limited real‑world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state‑of‑the‑art results on robotic manipulation tasks. WEAVER is a multi‑view WM trained to predict future latents and reward values via a flow‑matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long‑horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER on robotic hardware, demonstrating its effectiveness at policy evaluation (ρ=0.870 correlation with real‑world success rate), policy improvement (real‑world success rate improvement of 38% on top of the π_0.5 robot foundation model), and test‑time planning (real‑world success rate improvement of 14% with a 5‑10× speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out‑of‑distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .

Abstract:
Goal‑conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed‑loop control. We propose Navigation World Action Model (NavWAM), a diffusion‑transformer policy that turns navigation world‑model prediction into executable action by representing future observations, goal‑progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed‑loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real‑robot adaptation, and evaluate it on image‑goal navigation against planning‑based world models and a representative direct navigation policy. Across offline benchmarks and closed‑loop real‑robot deployment, NavWAM improves over planning‑based world‑model baselines in our evaluations while using the default policy mode without CEM‑style action search. Project page: https://dachii‑azm.github.io/navwam/

Abstract:
Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi‑step certificate of the predictable horizon: T‑step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel‑by‑channel by the predictor's Lyapunov spectrum, T_j(ε) ~ log(1/ε)/λ_j. The horizon is two‑sided ‑‑ a matching lower bound makes approximate equivariance provably horizon‑limited ‑‑ and the certificate is exclusive to structure: orbit‑constant error characterizes equivariance, so no non‑equivariant model has it at any scale. Empirically, on 40‑D Lorenz‑96 only a Z_N‑equivariant network recovers the full Lyapunov spectrum (R^2=0.98); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a c×‑inflated certificate provably needs c× the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot ‑‑ with zero calibration data. The same read‑out, unchanged, audits public pretrained world models training‑free: TD‑MPC2 checkpoints land on the certificate's own scope taxonomy ‑‑ calibrated where strongly expansive (ratio 0.94‑1.02), optimistic where weakly expansive, correctly abstaining where contracting ‑‑ a map a deployed monitor replicates cell‑by‑cell, out‑of‑sample. Across the official 1M‑317M multitask ladder, calibration does not improve with parameters. On V‑JEPA 2‑AC (1B, real robot data) the measured cross‑check correctly overrides an over‑promising tangent spectrum ‑‑ the cross‑validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
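
As a rough illustration of the horizon scaling log(1/ε)/λ_j quoted in this abstract, the sketch below evaluates that expression channel by channel. The function name, the example exponents, and the choice to return an infinite horizon for non‑positive exponents are assumptions made for illustration. The scaling also hides a constant factor, so this is not the paper's certificate.

```python
import math

def predictable_horizon(epsilon, lyapunov_exponent):
    """Leading-order horizon log(1/epsilon) / lambda_j for one channel.

    Assumption: for a non-positive exponent the scaling gives no finite
    horizon, so math.inf is returned instead.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if lyapunov_exponent <= 0.0:
        return math.inf
    return math.log(1.0 / epsilon) / lyapunov_exponent

# Hypothetical Lyapunov spectrum, most expansive channel first.
spectrum = [1.5, 0.5, 0.1, -0.3]
horizons = [predictable_horizon(1e-2, lam) for lam in spectrum]

for lam, horizon in zip(spectrum, horizons):
    print(f"lambda = {lam:+.1f} -> horizon ~ {horizon:.2f}")
```

Channels with larger exponents get shorter horizons, which is the channel‑by‑channel stratification the abstract describes.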

Abstract:
Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main‑track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross‑venue topic by 2025, diffusion models rose with comparable abruptness, and language‑model methods crossed into computer vision via vision‑language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large‑scale, cross‑venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early‑warning signature, four publication‑dynamics criteria frozen on 2017‑2021 data, and evaluate it out of sample on 2023‑2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test‑time compute, agentic AI, multimodal LLMs, retrieval‑augmented generation, and world models as topics to monitor over 2026‑2028. The source code is also publicly available on GitHub at https://github.com/KurbanIntelligenceLab/ai‑phase‑transitions.
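
To put the reported precision and recall in context, the short sketch below computes the lift over the base rate and the F1 score from the three figures quoted in this abstract. The lift and F1 values are derived here for illustration and are not reported in the abstract; the function names are hypothetical.

```python
def lift_over_base_rate(precision, base_rate):
    """Ratio of the detector's precision to the precision of flagging at random."""
    return precision / base_rate

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Figures quoted in the abstract.
precision, recall, base_rate = 0.27, 0.63, 0.135

lift = lift_over_base_rate(precision, base_rate)
f1 = f1_score(precision, recall)
print(f"lift = {lift:.2f}x, F1 = {f1:.3f}")  # prints "lift = 2.00x, F1 = 0.378"
```

Under these numbers a flagged topic is about twice as likely to be a true transition as a topic picked at random.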

Abstract:
Self‑evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM‑agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure‑level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low‑level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task‑specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self‑evolution capability over strong baselines. Our code has been released at https://github.com/antman9914/proplay.

Abstract:
Compression progress is a long‑standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed‑audit loss, r_t = E(theta_{t‑1}) ‑ E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false‑positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon‑free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high‑capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite‑audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC‑TGI grid‑transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite‑audit deviation scales as n^(‑0.527); signed progress resists clip‑farming, stream leakage, and noisy‑TV curiosity; naive reusable audits are exploitable by black‑box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
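
The telescoping claim in this abstract can be checked on a toy example. The sketch below sums the signed reward, defined as the previous audit loss minus the current one, over a made‑up loss trajectory and compares it with the endpoint improvement. It also shows how clipping negative rewards breaks the identity. The loss values and function names are hypothetical, and this is not the paper's experiment suite.

```python
def signed_progress_rewards(audit_losses):
    """Signed reward per step: previous audit loss minus current audit loss."""
    return [prev - curr for prev, curr in zip(audit_losses, audit_losses[1:])]

def clipped_progress_rewards(audit_losses):
    """Same rewards with negative steps clipped to zero (a failure mode)."""
    return [max(0.0, r) for r in signed_progress_rewards(audit_losses)]

# Hypothetical audit-loss trajectory that oscillates before settling.
losses = [1.0, 0.7, 0.9, 0.6, 0.8, 0.6]

signed_total = sum(signed_progress_rewards(losses))
clipped_total = sum(clipped_progress_rewards(losses))
endpoint_improvement = losses[0] - losses[-1]

print(f"signed total  = {signed_total:.2f}")
print(f"clipped total = {clipped_total:.2f}")
print(f"endpoint gain = {endpoint_improvement:.2f}")
```

The signed total matches the endpoint gain no matter how the loss oscillates. The clipped total exceeds it, because every regression followed by a recovery is paid for again.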

Abstract:
Contact‑rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact‑aware control by incorporating tactile or force feedback, but they rarely model the asymmetric spatiotemporal roles of global force and local tactile sensing. To address this, we propose TacForeSight, a lightweight force‑conditioned tactile foresight framework for real‑time manipulation. The core component is TacForceWM, a tactile world model that predicts short‑horizon tactile latent dynamics from dual‑finger tactile observations conditioned on high‑frequency wrist force and torque signals. Another key component, the Predictive Tactile‑Conditioned Policy, leverages the predicted latents as anticipatory contact priors, models the current‑to‑future tactile evolution via cross‑attention, and adaptively fuses visuo‑tactile features through a tactile‑guided gating module. By forecasting purely within a compact latent space, TacForeSight enables proactive contact reasoning with efficient real‑time inference suitable for high‑frequency manipulation control. Real‑robot experiments on five representative tasks and three in‑process perturbation settings show that TacForeSight consistently outperforms existing baselines, particularly under dynamic contact disturbances. All models and datasets will be made publicly available on the project website at https://tacforesight.github.io/ProjectPage.

Abstract:
We introduce WorldOlympiad, a benchmark for diagnosing video‑based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short‑term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world‑model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM‑as‑judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross‑view coherence, and camera‑trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real‑world videos, capturing diverse challenges from interactive control and embodied manipulation to open‑domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state‑of‑the‑art models reveal substantial gaps in physical reasoning, 3D consistency, and long‑horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

Abstract:
World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact‑rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream‑Tac, a unified Tactile‑World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream‑Tac introduces (i) contact‑gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact‑aware attention bias to better regulate cross‑modal interactions during manipulation. To support real‑time deployment, we further design a dual‑level acceleration strategy, reformulating the contact‑aware bias to preserve the fused attention path during training and introducing cache‑based diffusion acceleration at inference, achieving up to 2.9× faster training and 1.8× faster inference. Across six contact‑rich manipulation tasks, Dream‑Tac improves action accuracy by 31.7% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream‑Tac.

Abstract:
Recent video diffusion foundation models have achieved remarkable progress in high‑quality video generation, yet turning them into real‑time interactive video world models remains challenging. Interactive world models require controllable, causal, and low‑latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine‑tuning, autoregressive training, few‑step distillation, and streaming inference. In this work, we present minWM, a full‑stack open‑source framework for building real‑time interactive video world models. minWM provides an end‑to‑end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera‑controllable few‑step autoregressive world models. Specifically, minWM first fine‑tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few‑step autoregressive generator for low‑latency rollout. The framework is modular and architecture‑extensible: we instantiate it on representative open backbones, including Wan2.1‑T2V‑1.3B and HY1.5‑TI2V‑8B, covering both cross‑attention‑based condition injection and MMDiT‑style architectures. minWM also supports adapting existing video world models, such as HY‑WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch‑size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real‑time interactive video world models. Project Page: https://github.com/shengshu‑ai/minWM

Abstract:
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long‑tail workloads ‑‑ our profiling shows that 43% of real‑world subgraphs experience end‑to‑end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation ‑‑ where LLMs author structured graph transformations that integrate directly into compiler pipelines ‑‑ is the more appropriate abstraction. We propose PassNet, the first large‑scale ecosystem for LLM‑based compiler pass generation, comprising: (1) PassNet‑Dataset, over 18K unique computational graphs from 100K real‑world models; and (2) PassBench, 200 curated long‑tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error‑aware Speedup Score (ES_t) ‑‑ a metric unifying correctness, stability, and performance ‑‑ with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler ‑‑ indicating that the bottleneck is consistency, not capability. Fine‑tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier‑model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM‑driven compiler optimization. All data, benchmarks, and tooling are publicly available.

Abstract:
The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black‑box heuristics or gradient‑free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts‑as‑Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision‑making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity‑preserving embedding space is constructed to encode reasoning chain‑response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi‑scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts‑as‑Planning outperforms state‑of‑the‑art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts‑as‑Planning.

Abstract:
Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi‑turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi‑turn interaction sequence, covering diverse scenes, styles, subjects, and both first‑ and third‑person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6‑DoF pose, and discrete‑action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub‑metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state‑of‑the‑art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan‑longcat/WBench.

Abstract:
World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector‑valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM‑JEPA), a JEPA world model with a density‑matrix latent on a joint system‑environment space and a learned unitary predictor. The construction preserves the joint‑state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden‑velocity indicator task requiring five‑step forward simulation under a given action sequence with the target observation masked, UWM‑JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter‑matched LSTM‑JEPA trained under the same counterfactual‑target objective and action head collapses to majority‑class accuracy (0.53) under every action condition. Under blind rollout, UWM‑JEPA loses fewer than ten points of probe R^2 at short horizons while vector‑latent baselines lose forty‑one and sixty‑eight; both nevertheless tie on a held‑out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher‑forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context‑encoding capacity alone.
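
The spectrum‑preservation property mentioned in this abstract can be illustrated on a toy case. The sketch below rolls a 2x2 density matrix forward by repeated unitary conjugation and checks that its eigenvalues do not change, then contrasts this with a simple mixing map that does change them. The matrices, the fixed (not learned) unitary, and all function names are illustrative assumptions, not the UWM‑JEPA architecture.

```python
import cmath
import math

def matmul(a, b):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def spectrum(rho):
    """Ascending eigenvalues of a 2x2 Hermitian matrix via trace and determinant."""
    tr = complex(rho[0][0] + rho[1][1]).real
    det = complex(rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    gap = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return [(tr - gap) / 2.0, (tr + gap) / 2.0]

def unitary(theta, phi):
    """A 2x2 unitary parameterised by a rotation angle and a phase."""
    c, s = math.cos(theta), math.sin(theta)
    phase = cmath.exp(1j * phi)
    return [[c, -s * phase.conjugate()], [s * phase, c]]

def rollout(rho, u, steps):
    """Apply rho -> U rho U^dagger repeatedly, with no observations in between."""
    for _ in range(steps):
        rho = matmul(matmul(u, rho), dagger(u))
    return rho

# Hypothetical mixed state with eigenvalues 0.2 and 0.8.
rho0 = [[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]]
rho5 = rollout(rho0, unitary(0.4, 1.1), steps=5)
before, after = spectrum(rho0), spectrum(rho5)

# Contrast: a non-unitary map that mixes toward the maximally mixed state.
mixed = [[0.9 * rho0[i][j] + (0.05 if i == j else 0.0) for j in range(2)]
         for i in range(2)]
```

Here `before` and `after` agree up to floating‑point error, while `spectrum(mixed)` moves toward 0.5 in both entries. A predictor that is not constrained to be unitary can alter the represented uncertainty in this way.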

Abstract:
Recent video‑based world models have made pixel‑space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real‑world interaction is inherently object‑centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object‑level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory‑centric control pipeline: First, Normalized World Trajectory (NWT) represents user‑drawn motion in a camera‑invariant world coordinate system and dynamically re‑projects it under the current camera pose, separating object motion from camera‑induced screen‑space displacement; Spatial‑Pathway LoRA (SP‑LoRA) then injects this world‑space signal through the model's spatial‑control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory‑Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory‑conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video‑based world model's camera fidelity under camera‑only evaluation, and maintains object state across long autoregressive rollouts with off‑camera excursions.

Abstract:
Minority sampling aims to generate low‑density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model‑specific notions that may poorly reflect real‑world semantics. In this work, we propose a world‑centric perspective on minority sampling, which defines rarity with respect to real‑world priors rather than generator‑induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint‑Embedding Predictive Architecture (JEPA), a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low‑density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real‑world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class‑conditional, and text‑to‑image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator‑centric baselines in capturing real‑world notions of rarity. Code is available at https://github.com/soobin‑um/jepa‑guidance.

Abstract:
Recently, world models have made significant progress in enhancing end‑to‑end driving systems through both future situation forecasting and improved scene understanding. However, existing driving world models are typically built upon dense scene representations, causing high computational costs and redundant information. In this paper, we present SparseWorld, a lightweight world model that focuses on predicting only the critical layout of the scene, enabling efficient future forecasting for end‑to‑end driving systems. SparseWorld first performs autoregressive rollout to forecast future map elements and surrounding agents, enabling the model to learn how driving scenarios evolve over time. It then leverages these predicted futures to refine downstream motion prediction and trajectory planning. Specifically, we propose a Sparse Dreamer that anticipates future instances in the latent space through joint temporal and spatial attention. By interacting with predicted future instances, the motion planner captures more accurate motion patterns and generates more informed and safety‑aware trajectories. Extensive experiments demonstrate that SparseWorld significantly reduces collision risk and achieves state‑of‑the‑art performance on the open‑loop planning metrics of the nuScenes dataset with a collision rate of 0.05%. Moreover, it substantially outperforms the baseline method in closed‑loop planning metrics on the Bench2Drive benchmark. Supplementary material is available at the project page: https://wryzju.github.io/SparseWorld/.

Abstract:
Large language models (LLMs) are promising for autonomous driving, but semantics‑only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason‑Imagine‑Act (RIA), a closed‑loop framework that couples an LLM reasoner with an action‑conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub‑actions, the world model performs short‑horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point‑goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed‑loop interface, RIA consistently outperforms training‑free baselines, including CARLA TM and MADA, on core closed‑loop metrics. For reproducibility, code is available at https://github.com/pku‑smart‑city/source_code/tree/main/RIA.
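
The propose, roll out, score, select loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the RIA code: `propose_candidates` stands in for the LLM reasoner and returns fixed accelerations, `world_model_step` is a hand-written car-following model rather than a learned world model, and `safety_score` is an assumed metric (the minimum margin to a 5 m gap over the rollout). All names and constants are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    gap: float         # distance to the lead vehicle (m)
    speed: float       # ego speed (m/s)
    lead_speed: float  # lead vehicle speed (m/s), assumed constant


def propose_candidates(state):
    """Stand-in for the LLM reasoner: candidate accelerations in m/s^2."""
    return [-3.0, 0.0, 2.0]


def world_model_step(state, accel, dt=0.5):
    """Toy action-conditioned dynamics (assumed; a learned model would go here)."""
    speed = max(0.0, state.speed + accel * dt)
    gap = state.gap + (state.lead_speed - speed) * dt
    return State(gap, speed, state.lead_speed)


def rollout(state, accel, horizon=6):
    """Short-horizon imagined trajectory under a constant candidate action."""
    states = []
    for _ in range(horizon):
        state = world_model_step(state, accel)
        states.append(state)
    return states


def safety_score(states, min_gap=5.0):
    """Worst-case margin over the rollout; negative means the gap fell below min_gap."""
    return min(s.gap - min_gap for s in states)


def select_action(state):
    """Score every candidate by imagined rollout and keep the safest one."""
    scored = [(safety_score(rollout(state, a)), a) for a in propose_candidates(state)]
    best_score, best_action = max(scored)
    # The returned score could be passed back to the reasoner as feedback.
    return best_action, best_score


start = State(gap=12.0, speed=10.0, lead_speed=6.0)
action, score = select_action(start)
print(f"selected acceleration {action} m/s^2 with safety margin {score} m")
```

In this toy setting only braking keeps the imagined gap above the threshold, so it is selected. A practical scorer would likely also weigh progress toward the goal, which this sketch leaves out.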

Abstract:
World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision‑making. Yet, despite rapid progress in industry‑scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action‑conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long‑horizon rollout procedures. This design enables controlled studies of world‑modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real‑robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world‑model research.

Abstract:
Large language models achieve strong performance in language generation and knowledge‑intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long‑horizon planning. We argue that these limitations may arise from an objective‑level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural‑language rules. As a proof‑of‑concept case study, the rules are first compiled into an explicit state‑transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement‑learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long‑horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state‑tracking errors, and short‑horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX‑RL‑Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long‑horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

Abstract:
Interactive world models for first‑person shooter (FPS) games must resolve high‑frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per‑pixel temporal sequences so that each position computes its action response from local visual content. This separates in‑scope effects from out‑of‑scope generation without segmentation labels. We also introduce CrossFPS, the first multi‑game FPS dataset with frame‑aligned action telemetry. It comprises 69K clips from 7 titles with 10‑DoF controller signals, curated to remove gameplay bias. The model learns general visual‑to‑action mappings rather than game‑specific patterns, enabling zero‑shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross‑game generalization.

Abstract:
How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain‑of‑thought), trained end‑to‑end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision‑making into three systems: simulative reasoning (System II) grounding deliberation in future‑state prediction via a world model; self‑regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine‑grained action. Simulative reasoning provides unified planning across diverse tasks without per‑domain engineering, while self‑regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self‑Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain‑of‑thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi‑module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1‑8B and v1.0‑30B achieve Pass@1 competitive with 120‑355B and 685B‑1T parameter systems respectively, while v1.0‑30B uses 25.8‑95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self‑regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

Abstract:
Human‑in‑the‑loop reinforcement learning systems achieve near‑perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re‑collecting demonstrations and re‑running HIL on every workstation is incompatible with deployment, and naively fine‑tuning on shifted‑light data triggers catastrophic forgetting of the source workstation. To close this cross‑domain gap, we present RoHIL, an offline fine‑tuning framework that uses no extra real‑robot interaction. RoHIL combines (i) a world‑model‑based image relighter that re‑synthesises the visual stream of source‑workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination‑Retention Replay (IRR), a data‑level anti‑forgetting mechanism that interleaves relit adaptation transitions with original‑light retention transitions to preserve source‑workstation Bellman coverage; and (iii) an anchored Bellman‑actor regulariser that constrains representation and policy drift from the original source‑workstation policy. Across four real‑robot manipulation tasks under significant cross‑workstation illumination variations, RoHIL substantially improves shifted‑light performance where standard HIL‑RL collapses, while preserving source‑workstation performance, eliminating the need to re‑collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/
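
The interleaving idea behind Illumination‑Retention Replay can be illustrated with a minimal batch sampler. This is a sketch under assumptions, not the RoHIL implementation: the fixed `retention_fraction`, the function name `irr_batch`, and the dictionary transition format are all hypothetical, and the abstract does not say how the two streams are actually balanced.

```python
import random


def irr_batch(relit, retained, batch_size, retention_fraction, rng):
    """Draw one training batch mixing relit and original-light transitions.

    relit:     transitions whose images were re-synthesised under virtual lighting
    retained:  unmodified source-workstation transitions
    retention_fraction: assumed share of the batch reserved for retained data
    """
    n_keep = round(batch_size * retention_fraction)
    n_adapt = batch_size - n_keep
    batch = rng.sample(relit, n_adapt) + rng.sample(retained, n_keep)
    rng.shuffle(batch)  # avoid a fixed ordering of the two sources within a batch
    return batch


# Toy buffers: each transition is tagged with its lighting source.
relit_buffer = [{"id": i, "light": "relit"} for i in range(100)]
retained_buffer = [{"id": i, "light": "original"} for i in range(100)]

rng = random.Random(0)
batch = irr_batch(relit_buffer, retained_buffer, batch_size=8,
                  retention_fraction=0.25, rng=rng)
n_original = sum(t["light"] == "original" for t in batch)
print(f"{n_original} original-light and {len(batch) - n_original} relit transitions")
```

Keeping a guaranteed share of original-light transitions in every batch is one simple way to keep source-workstation data in the update while adapting to the relit data.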

Abstract:
Generating a consistent whole‑house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross‑view spatial coherence. Pure 2D generators produce appealing single panoramas but re‑imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi‑room scale. We introduce PanoWorld, a generative spatial world model that treats whole‑house synthesis as autoregressive generation of node‑based 360‑degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan‑derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed‑forward panoramic LRM designed for metric‑scale multi‑room 360‑degree inputs lifts generated panoramas into local 3DGS updates, while Room‑aware Group Attention suppresses cross‑room feature interference. A topology‑aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell‑based geometry guidance from cache‑rendered visual memory, PanoWorld preserves high‑frequency 2D synthesis quality while improving cross‑node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld‑project‑home/

Abstract:
3D open‑world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self‑supervised visual foresight reasoning approaches often suffer from multi‑step error accumulation, many recent studies resort to injecting domain‑specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task‑relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher‑level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self‑supervised manner. The higher‑level residual representations are used to modulate lower‑level predictions, allowing the world model to scale effectively with only linearly increasing cross‑layer communication costs. Experiments show that ResDreamer achieves state‑of‑the‑art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open‑ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.

Abstract:
Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state‑of‑the‑art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro‑symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule‑based simulator and scene graph representations, models motion dynamics and tool‑tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim‑to‑real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

Abstract:
Token‑based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long‑horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next‑frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token‑based transformer world models that formulates next‑frame prediction as a structured assignment problem with latent token correspondence variables: each next‑frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state‑of‑the‑art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax‑classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu‑mllab/Identifiable‑Token‑Correspondence.

Abstract:
We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically‑grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end‑to‑end differentiability throughout the entire simulation loop, from explicit state transitions to visual observation generation, OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient‑based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state‑of‑the‑art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

Abstract:
We tackle the challenge of building embodied AI agents that can reliably solve long‑horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low‑level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long‑horizon plans from imitation learning alone. In contrast, high‑level (HL), symbolic abstractions facilitate efficient and interpretable long‑horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long‑horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end‑to‑end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

Abstract:
Aerial vision‑language navigation (VLN) requires agents to follow natural‑language instructions through closed‑loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction‑driven world‑action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model (WAM) for aerial VLN. Unlike full‑sequence video‑generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short‑horizon world‑state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed‑loop world‑action prediction. We further introduce a two‑stage training framework that first grounds the video prior in instruction‑conditioned navigation dynamics and then develops Action‑aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision‑Language‑Action baselines with 12%+ success‑rate gains and larger advantages on challenging cases. It further transfers zero‑shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

Abstract:
Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction‑ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object‑level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object‑level editing, collision‑aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.

Abstract:
We present PanoWorld, a panoramic video world model that generates geometry‑consistent 360‑degree video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry‑ and dynamics‑consistent latent state modeling problem rather than pure visual synthesis. Building on a pre‑trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground‑truth panoramic depth, and a trajectory consistency loss that supervises the 3D world‑frame positions of tracked points across time. We further apply spherical‑geometry‑aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry‑aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

Abstract:
Current game world models simulate environments from a subjective, player‑centric perspective. However, by treating the Non‑Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action‑induced NPC reactivities. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high‑level NPC responses (e.g., Offense, Control, Defense) are grounded through cross‑attention modules. Crucially, these modules learn a game‑agnostic representation of interactive logic. This enables zero‑shot strategy transfer: our learned modules can be plugged directly into off‑the‑shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain‑specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine‑grain player controllability while achieving robust, prompt‑aligned NPC strategy adherence, paving the way for scalable, strategy‑rich interaction with the NPC.

Abstract:
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI‑Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object‑centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world‑space coordinates via monocular reconstruction, and compute a set of projective‑geometry residuals capturing three failure dimensions: scale‑depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI‑Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state‑of‑the‑art video generators, PDI reveals consistent geometry‑specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi‑bench.github.io/.

Abstract:
We introduce SANA‑WM, an efficient 2.6B‑parameter open‑source world model natively trained for one‑minute generation, synthesizing high‑fidelity, 720p, minute‑scale videos with precise camera control. SANA‑WM achieves visual quality comparable to large‑scale industrial baselines such as LingBot‑World and HY‑WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame‑wise Gated DeltaNet (GDN) with softmax attention for memory‑efficient long‑context modeling. (2) Dual‑Branch Camera Control ensures precise 6‑DoF trajectory adherence. (3) Two‑Stage Generation Pipeline applies a long‑video refiner to stage‑1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric‑scale 6‑DoF camera poses from public videos to yield high‑quality, spatiotemporally consistent action labels. Driven by these designs, SANA‑WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric‑scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one‑minute world‑model benchmark, SANA‑WM demonstrates stronger action‑following accuracy than prior open‑source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.

Abstract:
Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi‑Agent Framework for Generative Operational Planning and High‑Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi‑Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high‑fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi‑platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission‑critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single‑step large language model (LLM) planning baseline. Compared with a traditional rule‑based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at https://github.com/zhigao3ks/IFPV.

Abstract:
Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially‑observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation‑action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language‑model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP‑inductor): an LLM proposes candidate POMDP models from a few observation‑action trajectories and iteratively refines them to optimize a belief‑based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM‑based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language‑model priors as a practical tool for sample‑efficient world‑model learning under partial observability, and a step toward generalist agents in real‑world environments. Code is available at https://github.com/atomresearch/pinductor.

Abstract:
Recent large vision‑language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova‑U1, a native unified multimodal paradigm built upon NEO‑unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova‑U1‑8B‑MoT and SenseNova‑U1‑A3B‑MoT, built on dense (8B) and mixture‑of‑experts (30B‑A3B) understanding baselines, respectively. Designed from first principles, they rival top‑tier understanding‑only VLMs across text understanding, vision‑language perception, knowledge reasoning, agentic decision‑making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge‑intensive any‑to‑image (X2I) synthesis, complex text‑rich infographic generation, and interleaved vision‑language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre‑/post‑training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision‑language‑action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

Abstract:
Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior‑dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long‑horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment‑specific dynamics, while end‑to‑end fine‑tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM‑derived conceptual priors into world‑model‑based planning through a decoupled rollout‑training design. During rollout, a novel root‑prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world‑model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, and its value estimates are then leveraged to provide fine‑grained credit assignment signals for stable LLM fine‑tuning via alternating optimization. Experiments across diverse benchmarks, including text‑based adventure games in Jericho and instruction‑following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM‑empowered decision‑making. Our code is available at https://github.com/opendilab/LightZero.
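
Illustrative sketch (not the PriorZero implementation): the idea of injecting an LLM prior only at the MCTS root can be pictured as blending two action distributions before a PUCT‑style selection step, while deeper nodes would keep using the world model's own policy prior. The linear mixing rule, the `weight` parameter, the exact exploration term, and the action names are assumptions chosen for illustration.

```python
import math

def mix_root_prior(network_prior, llm_prior, weight):
    """Blend the world model's policy prior with an LLM prior (root node only)."""
    mixed = {
        a: (1 - weight) * network_prior[a] + weight * llm_prior.get(a, 0.0)
        for a in network_prior
    }
    z = sum(mixed.values())
    return {a: p / z for a, p in mixed.items()}  # renormalize to sum to 1

def puct_select(priors, q_values, visit_counts, c_puct=1.25):
    """Pick the action maximizing Q plus a prior-weighted exploration bonus."""
    total = sum(visit_counts.values())

    def score(a):
        explore = c_puct * priors[a] * math.sqrt(total + 1) / (1 + visit_counts[a])
        return q_values[a] + explore

    return max(priors, key=score)

# Hypothetical text-game actions and probabilities.
network_prior = {"go north": 0.5, "open door": 0.25, "take lamp": 0.25}
llm_prior = {"take lamp": 0.8, "open door": 0.2}
zeros = {a: 0 for a in network_prior}

root_prior = mix_root_prior(network_prior, llm_prior, weight=0.5)
action_with_llm = puct_select(root_prior, zeros, zeros)
action_without_llm = puct_select(network_prior, zeros, zeros)
```

With no visits yet, selection follows the prior, so the blended root prior steers the first expansion toward the action the LLM favors.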

Abstract:
This paper addresses the Motion Execution Gap, the disconnect between high‑level symbolic task descriptions using semantic constraints and executable robot motions. Motion Statecharts are introduced as an executable symbolic representation for complex motions. They allow the arbitrary arrangement of motion constraints, monitors, or nested statecharts in parallel and in sequence. World‑centric motion specification and generalization across embodiments are enabled through the use of a unified differentiable kinematic world model of both robots and environments. Motion execution is realized through a lMPC‑based implementation of the task‑function approach, in which smooth transitions during task switches are ensured using jerk bounds. Cross‑platform transferability was demonstrated by deploying the method on eight robot platforms operating in diverse environments. The proposed framework is called Giskard and is available open source: https://github.com/cram2/cognitive_robot_abstract_machine.

Abstract:
Closed‑loop driving simulation requires real‑time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student‑side degradation training. The former transfers poorly to driving due to fast ego‑motion and rapid scene changes, while the latter remains bounded by the teacher's single‑pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded‑horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout‑capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti‑drifting training‑and‑distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground‑truth future clips from prediction‑corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout‑capable teacher is extended via AR rollout, providing long‑horizon distribution‑matching supervision under bounded memory, while a short‑window student aligns to it with teacher rollout DMD (TRD) for efficient real‑time deployment. HorizonDrive natively supports minute‑scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long‑horizon streaming baselines, while remaining competitive with single‑pass driving video generators.

Abstract:
Today's driving world models can generate remarkably realistic dash‑cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed‑loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world‑model fidelity across the full spectrum, from pixel quality and 4D geometry to closed‑loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture‑rich models violate geometry, geometry‑aware models lack behavioral fidelity, and even the strongest performers achieve only 2‑3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens‑26K, a 26,808‑entry human‑annotated preference dataset pairing numerical scores with textual rationales, and WorldLens‑Agent, a vision‑language evaluator distilled from these judgments that enables scalable, explainable auto‑assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

Abstract:
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real‑world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics‑focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law‑specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics‑aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria‑grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid‑body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub‑questions to enable per‑law diagnostics. We evaluate eight modern video generation models through a large‑scale, quality‑controlled human study grounded in social‑science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine‑grained labels; after quality control, the retained annotations exhibited high split‑half model‑ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge‑9B, an open physics‑specialized VLM judge. PhyJudge‑9B achieves substantially lower aggregate relative bias than Gemini‑3.1‑Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
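
Illustrative sketch (not from the paper): a split‑half model‑ranking correlation can be computed by scoring each model on two disjoint halves of the annotations and taking Spearman's rho between the two score lists. The eight per‑model scores below are made up for illustration, and the closed‑form formula used here assumes no tied scores.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical mean scores of eight models on each half of the annotations.
half_a = [0.61, 0.55, 0.72, 0.40, 0.48, 0.66, 0.35, 0.58]
half_b = [0.63, 0.52, 0.70, 0.42, 0.50, 0.64, 0.37, 0.51]

rho = spearman_rho(half_a, half_b)
```

In this toy data the two halves disagree only on the order of two adjacent models, so the correlation stays close to 1.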

Abstract:
End‑to‑end autonomous driving systems are increasingly integrating Vision‑Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in‑depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's‑eye‑view (BEV) space, thereby enabling long‑horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long‑tail scenarios. The resulting approach achieves state‑of‑the‑art (SOTA) results on the closed‑loop Bench2Drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.

Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for end‑to‑end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning‑oriented intermediate representations: textual Chain‑of‑Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld‑VLA, a multi‑expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld‑VLA extracts complementary world information through multi‑source supervision and encodes it into expert tokens within the VLA, thereby providing planner‑accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld‑VLA employs a diffusion‑based hierarchical multi‑expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld‑VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI‑Research/CoWorld‑VLA.

Abstract:
Existing latent world models for autonomous driving have opened a promising path toward future‑aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future‑aware latent world modeling framework for autonomous driving that explicitly learns planning‑oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground‑truth future latent state via cross‑attention. The resulting future‑aware latent serves as an explicit condition for a diffusion‑based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground‑truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM‑v2 navhard, 89.9 EPDMS on NAVSIM‑v2 navtest, and 90.7 PDMS on NAVSIM‑v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision‑making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM‑v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e‑driving‑navhard) and achieves SOTA performance on NAVSIM‑v1 navtest (https://huggingface.co/spaces/AGC2024‑P/e2e‑driving‑navtest).

Abstract:
World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high‑dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video‑derived interactive physics‑neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics‑and‑appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as distributed compliant actuator for hand‑continuum interaction, represents material response with spatially varying constitutive experts, and drives high‑fidelity 4D appearance from the predicted physical evolution. Experiments on real‑world deformable‑object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state‑of‑the‑art baselines while supporting novel action rollout, material‑parameter variation, and dynamic novel‑view synthesis. Project page: https://can‑lee.github.io/deformmaster‑web/

Abstract:
Joint‑Embedding Predictive Architectures (JEPAs) provide a simple framework for learning world models by predicting future latent representations. However, JEPA training is subject to a bias‑variance tradeoff. Without sufficient structural constraints, excessive representational variance causes the model to collapse to trivial solutions. The recent LeWorldModel (LeWM) shows that this issue can be alleviated by simply constraining latent embeddings with an isotropic Gaussian prior. However, latent representations inherently lie on low‑dimensional manifolds within a high‑dimensional ambient space, and enforcing an isotropic Gaussian prior directly in this ambient space introduces an overly strong bias. In this work, we propose Sub‑JEPA, which seeks a favorable operating point on the bias‑variance frontier by applying Gaussian constraints in multiple random subspaces rather than in the original embedding space. This design relaxes the global constraint while preserving its anti‑collapse effect, leading to a better balance between training stability and representation flexibility. Extensive experiments across four continuous‑control environments demonstrate that Sub‑JEPA consistently outperforms LeWM by clear margins. Our method is simple yet effective, and serves as a strong baseline for future JEPA‑based world model research. The code is available at https://github.com/intcomp/Sub‑JEPA.

Abstract:
World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR‑based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM, a Generative LiDAR world model that leverages a deformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic‑static separator, a tri‑path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial‑temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what‑if" scenarios. Extensive experiments show that GEM achieves state‑of‑the‑art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.

Abstract:
A latent world model may achieve accurate short‑horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long‑horizon goal‑directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability‑Correction auxiliary objective (RC‑aux), a lightweight correction for this mismatch in reconstruction‑free latent world models. RC‑aux keeps the world‑model backbone unchanged and adds planning‑aligned supervision along two axes. Along the time axis, multi‑horizon open‑loop prediction trains the model beyond one‑step consistency. Along the space axis, budget‑conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability‑aware planner to favor trajectories that are both goal‑directed and attainable under the available budget. We instantiate RC‑aux on LeWorldModel and evaluate it under both continuation‑training and matched‑from‑scratch settings. Across goal‑conditioned pixel‑control tasks and a LIBERO‑Goal extension, RC‑aux improves LeWM‑style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC‑aux.

Abstract:
World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature‑based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature‑based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high‑dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA‑WM), which predicts RLA values via flow matching. RLA‑WM outperforms both state‑of‑the‑art feature‑based and video‑diffusion world models on simulation and real‑world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA‑WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video‑aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla‑wm

Abstract:
World models enable model‑based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner‑facing latents: history‑conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM‑World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state‑space memory as the history‑conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy‑derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non‑conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM‑World reaches the highest Avg. AUC (117.9, +9.5%), reduces long‑horizon rollout error to 45% of a strong baseline model, and wins 11 of the 12 MSE cells for k ∈ {3, 5, 7}. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM‑World achieves the highest return in every condition, with average OOD‑return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action‑free Hamiltonian‑energy drift, structured energy variation under policy rollouts, and coherent control‑induced energy transfer, supporting the intended Soft‑Hamiltonian dynamics design.
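
Illustrative sketch (not the HaM‑World model): a Hamiltonian vector field moves (q, p) according to dq/dt = ∂H/∂p and dp/dt = −∂H/∂q, and the energy H stays nearly constant when no action is applied. The example below uses a hand‑written oscillator energy and a leapfrog integrator, both assumptions chosen for illustration; the learned energy, the residual/control dynamics, and the context subspace c described above are not modeled.

```python
def hamiltonian(q, p):
    """Energy of a unit-mass, unit-stiffness oscillator (illustrative choice)."""
    return 0.5 * p * p + 0.5 * q * q

def leapfrog_step(q, p, dt):
    """One symplectic step of dq/dt = dH/dp = p and dp/dt = -dH/dq = -q."""
    p_half = p - 0.5 * dt * q          # half kick from the force -dH/dq
    q_next = q + dt * p_half           # full drift using dH/dp
    p_next = p_half - 0.5 * dt * q_next  # second half kick at the new position
    return q_next, p_next

# Action-free rollout: track how far the energy strays from its initial value.
q, p = 1.0, 0.0
initial_energy = hamiltonian(q, p)
max_drift = 0.0
for _ in range(10_000):
    q, p = leapfrog_step(q, p, dt=0.01)
    max_drift = max(max_drift, abs(hamiltonian(q, p) - initial_energy))
```

Over this long rollout the energy error oscillates within a small band instead of growing, which is the kind of bounded action‑free energy drift that a diagnostic of this sort would look for.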

Abstract:
We present JoyAI‑Image, a unified multimodal foundation model for visual understanding, text‑to‑image generation, and instruction‑guided image editing. JoyAI‑Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long‑text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry‑aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long‑text rendering, and editing benchmarks show that JoyAI‑Image achieves state‑of‑the‑art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel‑view‑assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision‑language‑action systems and world models.

Abstract:
Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real‑world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly‑coupled Dynamic GraphSLAM architecture that integrates socially‑aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant‑velocity heuristics or deterministic single‑agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic "argmax problem". Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision‑free robot navigation in densely crowded environments.
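
Illustrative sketch (not the DynoSLAM code): given a set of sampled future positions for one pedestrian, the empirical mean and covariance define a Gaussian summary, and the Mahalanobis distance measures how far a query point lies from that summary in units of the predicted spread. The four 2D samples below stand in for Monte Carlo rollouts and are made up; in the paper they would come from the trained GNN.

```python
import math

def mean_and_cov(samples):
    """Empirical mean and unbiased 2x2 covariance of (x, y) samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2D point from a Gaussian (mean, cov)."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    # The inverse of [[a, b], [b, d]] is [[d, -b], [-b, a]] / det.
    return math.sqrt((d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)

# Hypothetical rollout endpoints for one pedestrian (metres).
rollouts = [(2.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
mean, cov = mean_and_cov(rollouts)
distance = mahalanobis((2.0, 0.0), mean, cov)
```

A planner could treat points with a small distance as likely to be occupied by the pedestrian; the threshold to use is a design choice not specified here.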

Abstract:
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi‑view spatial information into a structure compatible with LLMs. Second, we introduce LLM‑enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current‑to‑Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry‑aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H‑EmbodVis/HERMESV2.

Abstract:
Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine‑tuning alone cannot handle long‑horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi‑tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world‑model‑based training, which can yield substantial performance gains; and the spontaneous emergence of System‑2‑style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent‑native infrastructure.

Abstract:
Vision‑language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion‑conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision‑language model. Given an initial observation and a parameterized camera trajectory, we use a view‑consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action‑to‑outcome) and inverse (outcome‑to‑action) spatial reasoning. We post‑train the VLM with a two‑stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT‑Real, SAT‑Synthesized, VSI‑Bench, and MindCube. It also outperforms the test‑time world‑model‑coupled methods while eliminating the need for expensive inference‑time generation. Our results suggest that world models can serve not only as inference‑time tools, but also as effective training‑time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.

Abstract:
We propose X‑WAM, a Unified 4D World Model that unifies real‑time robotic action execution and high‑fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel‑space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X‑WAM imagines the future world by predicting multi‑view RGB‑D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real‑time execution, while dedicating the full sequence of steps to generate high‑fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X‑WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high‑fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.

Abstract:
Three‑dimensional content generation has progressed from producing isolated, visually plausible shapes to constructing structured assets that can be deployed in real‑time interactive environments. This trajectory is driven by converging demands from game development, embodied AI, world simulation, digital twins, and spatial computing, all of which require 3D content that goes beyond surface appearance to satisfy engine‑level constraints on topology, UV parameterization, physically based materials, skeletal rigging, and physics‑aware scene layout. Despite rapid advances in generative modeling, a persistent gap separates the outputs of current methods from the production‑ready standard expected by interactive applications. This survey addresses that gap by organizing the literature around the asset production pipeline rather than algorithmic families. Along the horizontal axis we distinguish three asset tiers, namely general objects, characters, and scenes, while the vertical axis traces each tier through the full production lifecycle from data foundations and geometry synthesis through topology optimization, UV unwrapping, PBR appearance, rigging, and scene assembly. Through this two‑dimensional taxonomy we assess not only what current methods can generate but whether their outputs are directly usable in downstream engines and simulation platforms. We further consolidate evaluation metrics and protocols that span geometric fidelity, appearance quality, asset usability, and scene‑level physical plausibility. The survey concludes by identifying open challenges in data quality, generation controllability, end‑to‑end assetization, and physically grounded generation, and by situating production‑ready 3D content as foundational infrastructure for emerging interactive world models and embodied intelligent systems.

Abstract:
Self‑supervised learning in healthcare has largely relied on invariance‑based objectives, which maximize similarity between different views of the same patient. While effective for static anatomy, this paradigm is fundamentally misaligned with clinical diagnosis, as it mathematically compels the model to suppress the transient pathological changes it is intended to detect. We propose a shift towards Action‑Conditioned (or Event‑Conditioned) World Models that learn to simulate the dynamics of disease progression. Adapting the LeJEPA framework to physiological time‑series, we define pathology not as a static label, but as a transition vector acting on a patient's latent state. By predicting the future electrophysiological state of the heart given a disease onset, our model explicitly disentangles stable anatomical features from dynamic pathological forces. Evaluated on the MIMIC‑IV‑ECG dataset, our approach outperforms fully supervised baselines on the critical triage task. Crucially, we demonstrate superior sample efficiency: in low‑resource regimes, our world model outperforms supervised learning by over 0.05 AUROC. These results suggest that modeling biological dynamics provides a dense supervision signal that is far more robust than static classification. Source code is available at https://github.com/cljosegfer/lesaude‑dynamics

Abstract:
The temporal lag between actions and their long‑term consequences makes credit assignment a challenge when learning goal‑directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal‑reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long‑horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks. Code: https://github.com/aravindvenu7/occupancy_reward_shaping; Website: https://aravindvenu7.github.io/website/ors/

Abstract:
Local prediction‑error‑based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity‑Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per‑step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co‑trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction‑error curiosity formulations, from Schmidhuber (1991) to learned‑feature‑space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity‑Critic outperforms prediction‑error, visitation‑count, and Random Network Distillation methods in training speed and final world model accuracy.
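
The per-step surrogate described above can be illustrated with a minimal sketch. The function name and the numbers below are hypothetical and chosen only for illustration; the baseline estimates are supplied by hand here, whereas the abstract describes them as coming from a learned critic whose training is not shown.

```python
def curiosity_reward(prediction_error: float, baseline_estimate: float) -> float:
    """Per-step surrogate: current prediction error minus the estimated
    asymptotic error baseline of the same transition (hypothetical helper)."""
    return prediction_error - baseline_estimate


# Hypothetical values, not taken from the paper.
# A learnable transition: the current error is well above its asymptotic floor.
learnable = curiosity_reward(prediction_error=0.75, baseline_estimate=0.25)
# A purely stochastic transition: the current error already sits at the noise floor.
stochastic = curiosity_reward(prediction_error=0.5, baseline_estimate=0.5)

print(learnable, stochastic)
```

Under these assumed values the learnable transition earns a positive reward while the stochastic one earns none, which is the separation of reducible from irreducible error that the abstract describes.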

Abstract:
Chain‑of‑Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA‑based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real‑time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One‑step latent reasoning and planning with Vision‑Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future‑frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three‑stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer‑only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer‑only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token‑by‑token reasoning. Code has been open‑sourced to the community. Project Page: https://xiaomi‑embodied‑intelligence.github.io/OneVL

Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long‑horizon trajectories and evaluate their consequences, which limits performance in complex decision‑making tasks. In this work, we introduce the World‑Value‑Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long‑horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high‑value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent‑space inference reshapes the search distribution toward feasible regions, enabling efficient long‑horizon decision making. Extensive simulations and real‑world experiments demonstrate that the WAV model consistently outperforms state‑of‑the‑art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long‑horizon and compositional scenarios. Code is available at https://github.com/Win‑commit/WAV.
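
One simple way to see how such an exponential decay can arise is sketched below. This is an illustrative assumption, not the paper's derivation: it supposes that each step of a randomly sampled action sequence is feasible independently with probability `p_step`, so a length-H sequence is feasible with probability `p_step ** H`. The function names are hypothetical.

```python
def feasible_probability(p_step: float, horizon: int) -> float:
    """Probability that all `horizon` steps are feasible, assuming each step
    is independently feasible with probability `p_step` (an assumption)."""
    return p_step ** horizon


def expected_samples(p_step: float, horizon: int) -> float:
    """Expected number of independently sampled sequences needed to draw one
    feasible sequence (mean of a geometric distribution)."""
    return 1.0 / feasible_probability(p_step, horizon)


# The feasible fraction shrinks geometrically as the horizon grows.
for horizon in (1, 5, 10, 20):
    print(horizon, feasible_probability(0.5, horizon), expected_samples(0.5, horizon))
```

With `p_step = 0.5`, a horizon of 10 already requires about a thousand samples per feasible sequence on average, which is the kind of blow-up that motivates searching in a reshaped latent distribution instead.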

Abstract:
We introduce HY‑World 2.0, a multi‑modal world model framework that advances our prior project HY‑World 1.0. HY‑World 2.0 accommodates diverse input modalities, including text prompts, single‑view images, multi‑view images, and videos, and produces 3D world representations. With text or single‑view image inputs, the model performs world generation, synthesizing high‑fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four‑stage method: a) Panorama Generation with HY‑Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe‑based view generation model with consistent memory. We also upgrade WorldMirror, a feed‑forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi‑view images or videos. Also, we introduce WorldLens, a high‑performance 3DGS rendering platform featuring a flexible engine‑agnostic architecture, automatic IBL lighting, efficient collision detection, and training‑rendering co‑design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY‑World 2.0 achieves state‑of‑the‑art performance on several benchmarks among open‑source approaches, delivering results comparable to the closed‑source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

Abstract:
Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large‑scale, high‑quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real‑world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action‑video pairs while maintaining real‑world consistency. Our approach utilizes a closed‑loop real‑sim‑real data augmentation pipeline, leveraging a small amount of real‑world data to generate diverse, large‑scale training datasets that cover a broader spectrum of real‑world scenarios. We train a neural simulator to transform classical simulation videos into real‑world representations, improving the accuracy of policy models trained in real‑world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real‑world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real‑world robotics.

Abstract:
Imitation learning is a powerful paradigm for training robotic policies, yet its performance is limited by compounding errors: minor policy inaccuracies can drive robots into out‑of‑distribution (OOD) states unseen in the training set, where the policy can generate even larger errors, leading to eventual failure. While the Data Aggregation (DAgger) framework tries to address this issue, its reliance on continuous human involvement severely limits scalability. In this paper, we propose WM‑DAgger, an efficient data aggregation framework that leverages World Models to synthesize OOD recovery data without requiring human involvement. Specifically, we focus on manipulation tasks with an eye‑in‑hand robotic arm and only few‑shot demonstrations. To avoid synthesizing misleading data and overcome the hallucination issues inherent to World Models, our framework introduces two key mechanisms: (1) a Corrective Action Synthesis Module that generates task‑oriented recovery actions to prevent misleading supervision, and (2) a Consistency‑Guided Filtering Module that discards physically implausible trajectories by anchoring terminal synthesized frames to corresponding real frames in expert demonstrations. We extensively validate WM‑DAgger on multiple real‑world robotic tasks. Results show that our method significantly improves success rates, achieving a 93.3% success rate in soft bag pushing with only five demonstrations. The source code is publicly available at https://github.com/czs12354‑xxdbd/WM‑Dagger.

Abstract:
We present PhysInOne, a large‑scale synthetic dataset addressing the critical scarcity of physically‑grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground‑truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics‑aware video generation, long‑/short‑term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine‑tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics‑grounded world models in generation, simulation, and embodied AI.

Abstract:
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision‑making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill‑Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field's central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.

Abstract:
Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision‑making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground‑constrained autonomous driving scenes or relatively smooth human‑centric egocentric videos, and therefore lack realistic high‑dynamic 6‑DoF UAV motion priors. To address this gap, we present MotionScape, a large‑scale real‑world UAV‑view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV‑view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real‑world UAV videos are tightly coupled with accurate 6‑DoF camera trajectories and fine‑grained natural language descriptions. To build the dataset, we develop an automated multi‑stage processing pipeline that integrates CLIP‑based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large‑language‑model‑driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision‑making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape

Abstract:
Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed‑state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi‑state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis‑mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set‑prediction formulation, DailyArt recovers all joints simultaneously without requiring object‑specific templates, multi‑view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part‑level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part‑level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.

Abstract:
Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D‑aware visual guidance. By combining this plug‑and‑play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real‑world dataset, RealManip‑10K, for 3D‑aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi‑dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed‑source models, in both 3D geometric accuracy and manipulation consistency.

Abstract:
The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo3, and Bytedance's Seedance to powerful open‑source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real‑world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields (e.g., Generative Adversarial Networks (GANs) and diffusion models) or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto‑Regressive (AR) models and the integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR‑based and multimodal techniques. We conduct an in‑depth analysis of the foundational principles, key advancements, and comparative strengths and limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome‑Video‑Foundations.

Abstract:
Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision‑Language‑Action (VLA) approaches based on multimodal foundation models, including recent advances in vision‑language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open‑source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action‑head architecture that supports both VLM backbones (e.g., Qwen‑VL) and world‑model backbones (e.g., Cosmos) alongside representative action‑decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross‑embodiment learning and multimodal co‑training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa‑GR1, and BEHAVIOR‑1K, through a unified evaluation interface that supports both simulation and real‑robot deployment. StarVLA also ships simple, fully reproducible single‑benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world‑model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open‑source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.

Abstract:
World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long‑term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

Abstract:
End‑to‑end autonomous driving models based on Vision‑Language‑Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out‑of‑distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding‑and‑generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine‑grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out‑of‑distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety‑gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state‑of‑the‑art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.
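
The safety-gated reward and the GRPO step described above can be sketched as follows. The gating form, the weight `beta`, and the function names are assumptions made for illustration and are not specified in the abstract. The group-relative normalization follows the usual GRPO recipe of standardizing rewards within a sampled group; the population standard deviation is used here, and implementations differ on that detail.

```python
from statistics import mean, pstdev


def safety_gated_reward(task_reward: float, uncertainty: float,
                        is_safe: bool, beta: float = 0.1) -> float:
    """Hypothetical gate: the exploration bonus (world-model uncertainty)
    is added only when the trajectory is judged safe."""
    bonus = beta * uncertainty if is_safe else 0.0
    return task_reward + bonus


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against the mean and standard deviation of
    its own group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of three sampled trajectories with hypothetical values.
group = [
    safety_gated_reward(1.0, uncertainty=2.0, is_safe=True, beta=0.5),   # bonus applied
    safety_gated_reward(1.0, uncertainty=2.0, is_safe=False, beta=0.5),  # bonus gated off
    safety_gated_reward(0.0, uncertainty=0.0, is_safe=True, beta=0.5),
]
advantages = group_relative_advantages(group)
print(group, advantages)
```

In this toy group, the safe and novel trajectory receives the largest advantage, while an equally novel but unsafe trajectory gains nothing from its uncertainty.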

Abstract:
World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object‑centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three‑level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two‑stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M‑parameter model on the PushT robotic manipulation benchmark from the Open X‑Embodiment dataset, achieving 0.008 MSE next‑state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow‑ai/hclsm

Abstract:
We study language‑conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open‑loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human‑verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future‑state prediction, and action generation through two complementary model families. The first family combines LCVN‑WM, a diffusion‑based world model, with LCVN‑AC, an actor‑critic agent trained in the latent space of the world model. The second family, LCVN‑Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language‑conditioned world models. The code is available at https://github.com/F1y1113/LCVN.

Abstract:
World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next‑state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR‑Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi‑step instructions and generate diverse counterfactual rollouts; (ii) Long‑horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal‑directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open‑ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single‑turn and perceptual evaluations. Through extensive experiments with state‑of‑the‑art WMs, our results expose a substantial gap between current models and human‑level hypothetical reasoning, and establish WR‑Arena as both a diagnostic tool and a guideline for advancing next‑generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI‑IFM/WR‑Arena.

Abstract:
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re‑emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out‑of‑view intervals. To facilitate research in this direction, we construct HM‑World, the first large‑scale video dataset dedicated to hybrid memory. It features 59K high‑fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit‑entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance‑driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM‑World demonstrate that our method significantly outperforms state‑of‑the‑art approaches in both dynamic subject consistency and overall generation quality. Code is publicly available at https://github.com/H‑EmbodVis/HyDRA.

Abstract:
Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real‑world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories, such as imperfect trajectories generated by simulators or planning systems, producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics‑enhanced video generator that produces high‑fidelity multi‑view driving videos under these conditions. To effectively train these components, we construct a large‑scale, physics‑rich heterogeneous dataset. Specifically, in addition to real‑world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging‑trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state‑of‑the‑art methods, especially on challenging trajectories. Our project page is available at: https://wm‑research.github.io/PhyGenesis/.

Abstract:
Dynamical systems theory and reinforcement learning view world evolution as latent‑state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action‑conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel‑level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large‑scale action‑conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role‑playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per‑frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long‑horizon state consistency, highlighting the need for state‑aware video generation. The project page is https://shandaai.github.io/wildworld‑project/.

Abstract:
Video‑based world models offer a powerful paradigm for embodied simulation and planning, yet state‑of‑the‑art models often generate physically implausible manipulations, such as object penetration and anti‑gravity motion, due to training on generic visual data and likelihood‑based objectives that ignore physical laws. We present ABot‑PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action‑controllable videos. Built on a curated dataset of three million manipulation clips with physics‑aware annotation, it uses a novel DPO‑based post‑training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross‑embodiment control. To better evaluate generalization, we introduce EZSbench, the first training‑independent embodied zero‑shot benchmark combining real and synthetic unseen robot‑task‑scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot‑PhysWorld achieves new state‑of‑the‑art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.

Abstract:
Diffusion Transformers (DiTs) power high‑fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio‑temporal attention. Training‑free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero‑Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception‑Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion‑adaptive thresholds, saliency‑weighted drift estimation, optimal approximation via blending and warping, and phase‑aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion‑consistent feature reuse without retraining. On Cosmos‑Predict2.5‑2B evaluated on PAI‑Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training‑free caching approaches. Our code can be accessed at https://umair1221.github.io/World‑Cache/.
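
To make the Zero-Order Hold baseline concrete, the sketch below reuses a cached block output whenever the input has drifted little since the last recomputation. It illustrates the assumption the abstract criticizes, not WorldCache itself. The class name `ZeroOrderHoldCache`, the L1 relative-drift measure, and the threshold value are assumptions made purely for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_drift(current: Vector, reference: Vector) -> float:
    """L1 change of `current` relative to the L1 norm of `reference`."""
    change = sum(abs(c - r) for c, r in zip(current, reference))
    scale = sum(abs(r) for r in reference) or 1.0  # avoid division by zero
    return change / scale


class ZeroOrderHoldCache:
    """Hypothetical wrapper: reuse a stale output while input drift stays small."""

    def __init__(self, block: Callable[[Vector], Vector], threshold: float) -> None:
        self.block = block
        self.threshold = threshold
        self.cached_input: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.hits = 0      # steps served from the cache
        self.computes = 0  # steps that ran the expensive block

    def __call__(self, x: Vector) -> Vector:
        if (
            self.cached_input is not None
            and self.cached_output is not None
            and relative_drift(x, self.cached_input) < self.threshold
        ):
            self.hits += 1
            return self.cached_output  # static snapshot, no correction applied
        self.computes += 1
        self.cached_input = list(x)
        self.cached_output = self.block(x)
        return self.cached_output


# Toy "expensive block" standing in for a transformer layer.
cache = ZeroOrderHoldCache(lambda x: [2.0 * v for v in x], threshold=0.05)
first = cache([1.0, 1.0])    # computed
second = cache([1.0, 1.01])  # small drift, stale output reused
third = cache([2.0, 2.0])    # large drift, recomputed
```

Because `second` is returned unchanged even though the input moved, any motion between refreshes is frozen. This is the kind of staleness the abstract associates with ghosting and blur.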

Abstract:
World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer‑based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self‑attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof‑of‑concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction‑diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter‑matched three‑way ablation on unconditional UCF‑101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self‑attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single‑step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10‑15% higher spatial structure preservation and 18‑25% more effective dimensionality, and critically maintains coherent multi‑step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer‑grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large‑scale compute. These results establish that PDE‑based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter‑efficient alternative to both attention and convolutional recurrence for world modeling.
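
As a rough illustration of what "the PDE integration itself produces the prediction" can mean, the sketch below advances a 1D field with explicit Euler steps of a reaction-diffusion equation, du/dt = D * d2u/dx2 + f(u). FluidWorld's actual equations, dimensionality, and discretization are not given in the abstract. The periodic 1D grid, the logistic reaction term, and all parameter values here are assumptions.

```python
from typing import Callable, List

Field = List[float]


def laplacian_1d(u: Field, dx: float) -> Field:
    """Second-order central difference with periodic boundaries."""
    n = len(u)
    return [
        (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n]) / (dx * dx)
        for i in range(n)
    ]


def euler_step(
    u: Field,
    diffusion: float,
    reaction: Callable[[float], float],
    dt: float,
    dx: float,
) -> Field:
    """One explicit Euler step of du/dt = D * d2u/dx2 + f(u)."""
    lap = laplacian_1d(u, dx)
    return [ui + dt * (diffusion * li + reaction(ui)) for ui, li in zip(u, lap)]


def rollout(
    u0: Field,
    steps: int,
    diffusion: float = 1.0,
    reaction: Callable[[float], float] = lambda u: 0.0,
    dt: float = 0.2,
    dx: float = 1.0,
) -> Field:
    """Multi-step prediction is just repeated integration; cost is O(N) per step."""
    u = list(u0)
    for _ in range(steps):
        u = euler_step(u, diffusion, reaction, dt, dx)
    return u


initial = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
diffused = rollout(initial, steps=10)  # pure diffusion spreads the peak
grown = rollout(initial, steps=10, reaction=lambda u: u * (1.0 - u))  # logistic growth
```

The chosen step satisfies D * dt / dx^2 = 0.2, which is within the usual explicit-scheme stability bound of 0.5 for 1D diffusion. With no reaction term, the periodic scheme conserves the total of the field.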

Abstract:
Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary‑free probe that converts V‑JEPA 2 continuous latent vectors into discrete symbol sequences without task‑specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V‑JEPA 2 pre‑trained representations, not to the probe. We evaluate through category‑contrast experiments on Kinetics‑mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^‑4; MI 0.036–0.117 bits, NMI 1.2–3.9% of the 3‑bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V‑JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four‑stage roadmap toward an action‑conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
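
The reported statistics are internally consistent: dividing the MI range by the 3-bit maximum gives 0.036 / 3 = 0.012 and 0.117 / 3 = 0.039, matching the stated NMI of 1.2% to 3.9%. If the 3-bit maximum corresponds to an 8-entry codebook (an assumption; the abstract does not state the codebook size), a 62.5% active ratio would mean 5 of 8 codes in use. The sketch below shows how such quantities could be computed from symbol counts. The function names and toy tables are hypothetical, and the authors' exact estimators may differ.

```python
import math
from typing import List, Sequence


def mutual_information_bits(table: Sequence[Sequence[float]]) -> float:
    """MI between row variable (category) and column variable (symbol), in bits."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                joint = count / total
                # joint / (p_row * p_col) simplifies to count * total / (row * col)
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += joint * math.log2(ratio)
    return mi


def jensen_shannon_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def normalized(mi_bits: float, max_bits: float) -> float:
    """Fraction of the maximum attainable information."""
    return mi_bits / max_bits


perfectly_separated = [[10, 0], [0, 10]]  # toy counts: each category owns one symbol
indistinguishable = [[5, 5], [5, 5]]      # toy counts: symbols carry no category signal
nmi_low = normalized(0.036, 3.0)          # 0.012, i.e. 1.2%
nmi_high = normalized(0.117, 3.0)         # 0.039, i.e. 3.9%
```

On this scale, the reported values sit much closer to the `indistinguishable` table than to the `perfectly_separated` one, which fits the abstract's description of graded distributional differences rather than categorical boundaries.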

Abstract:
Recently, world models have been incorporated into autonomous driving systems to improve planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory‑conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow‑based dynamics to model the transition of world states under different driving actions. By adopting the rectified flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability‑aware multi‑mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at https://github.com/xiaolul2/DynFlowDrive.
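
The rectified flow idea mentioned above can be sketched generically: a straight-line interpolant between a start state and an end state defines a constant target velocity, and a learned velocity field is integrated with Euler steps at inference time. The code below is a minimal sketch of that general formulation, not DynFlowDrive's model. The `toy_velocity` field, which simply moves the latent by the action vector, is a hypothetical stand-in for a trained network.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float, Vector], Vector]


def interpolate(z0: Vector, z1: Vector, t: float) -> Vector:
    """Straight-line interpolant z_t = (1 - t) * z0 + t * z1 used for training pairs."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0: Vector, z1: Vector) -> Vector:
    """Regression target along the straight path: dz_t/dt = z1 - z0."""
    return [b - a for a, b in zip(z0, z1)]


def euler_rollout(
    z0: Vector, velocity_fn: VelocityFn, action: Vector, num_steps: int
) -> Vector:
    """Integrate dz/dt = v(z, t, action) from t = 0 to t = 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt, action)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def toy_velocity(z: Vector, t: float, action: Vector) -> Vector:
    """Hypothetical field: the action directly sets a constant latent velocity."""
    return list(action)


midpoint = interpolate([0.0, 0.0], [2.0, 4.0], 0.5)
velocity = target_velocity([0.0, 0.0], [2.0, 4.0])
final_state = euler_rollout([0.0, 0.0], toy_velocity, [1.0, -2.0], num_steps=4)
```

When the velocity along a path is constant, Euler integration reaches the endpoint regardless of the step count. This is the usual motivation for straight paths in few-step sampling.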

Abstract:
With the growing adoption of vision‑language‑action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter‑view inconsistency when applied to high‑resolution multi‑view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi‑view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross‑attention. For decoding, we employ a multi‑view transformer to reconstruct multi‑view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi‑view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

Abstract:
Contact‑rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo‑tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed‑loop control explicitly. In this paper, we present OmniViTac, a large‑scale visuo‑tactile‑action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics‑grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world‑model‑based visuo‑tactile manipulation framework that integrates four tightly coupled modules: a self‑supervised tactile encoder, a two‑stream visuo‑tactile world model for predicting short‑horizon contact evolution, a contact‑aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real‑robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high‑frequency tactile feedback for contact‑rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.

Abstract:
Reinforcement learning (RL) for large‑scale Vision‑Language‑Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug‑and‑play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO benchmark demonstrate that AcceRL achieves state‑of‑the‑art (SOTA) performance. Systematically, it exhibits super‑linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world‑model‑augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks. Code is publicly available at https://github.com/distanceLu/AcceRL.

Abstract:
Dynamics models, whether simulators or learned world models, have long been central to robotic manipulation, but most focus on minimizing prediction error rather than confronting a more fundamental challenge: real‑world manipulation is inherently uncertain. We argue that robust manipulation under uncertainty is fundamentally an integration problem: uncertainties must be represented, propagated, and constrained within the planning loop, not merely suppressed during training. We present and open‑source ManiDreams, a modular framework for uncertainty‑aware manipulation planning over intuitive physics models. It realizes this integration through composable abstractions for distributional state representation, backend‑agnostic dynamics prediction, and declarative constraint specification for action optimization. The framework explicitly addresses three sources of uncertainty: perceptual, parametric, and structural. It wraps any base policy with a sample‑predict‑constrain loop that evaluates candidate actions against distributional outcomes, adding robustness without retraining. Experiments on ManiSkill tasks show that ManiDreams maintains robust performance under various perturbations where the RL baseline degrades significantly. Runnable examples on pushing, picking, catching, and real‑world deployment demonstrate flexibility across different policies, optimizers, physics backends, and executors. The framework is publicly available at https://github.com/Rice‑RobotPI‑Lab/ManiDreams

Abstract:
A central challenge in image‑based Model‑Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction‑based methods often waste capacity on large task‑irrelevant regions. Decoder‑free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2‑Dreamer, a decoder‑free MBRL framework with a self‑supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy‑reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta‑World, R2‑Dreamer is competitive with strong baselines such as DreamerV3 and TD‑MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC‑Subtle with tiny task‑relevant objects. These results suggest that an effective internal regularizer can enable versatile, high‑performance decoder‑free MBRL. Code is available at https://github.com/NM512/r2dreamer.
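
For readers unfamiliar with the redundancy-reduction idea, the sketch below computes the original Barlow Twins loss on two batches of embeddings. It pushes the cross-correlation matrix toward the identity, so that matching features agree and different features are decorrelated. This is the published Barlow Twins objective, not R2-Dreamer's exact loss. How R2-Dreamer forms its two branches without data augmentation is not described in the abstract, and the weight `lam` and the toy batches are illustrative assumptions.

```python
import math
from typing import List

Batch = List[List[float]]


def standardize(batch: Batch, eps: float = 1e-8) -> Batch:
    """Zero-mean, unit-variance normalization of each feature over the batch."""
    n, d = len(batch), len(batch[0])
    means = [sum(x[j] for x in batch) / n for j in range(d)]
    stds = [
        math.sqrt(sum((x[j] - means[j]) ** 2 for x in batch) / n) for j in range(d)
    ]
    return [[(x[j] - means[j]) / (stds[j] + eps) for j in range(d)] for x in batch]


def barlow_twins_loss(za: Batch, zb: Batch, lam: float = 5e-3) -> float:
    """Sum of (1 - C_ii)^2 plus lam times the sum of squared off-diagonal C_ij."""
    a, b = standardize(za), standardize(zb)
    n, d = len(a), len(a[0])
    # Cross-correlation between feature i of branch A and feature j of branch B.
    c = [
        [sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
        for i in range(d)
    ]
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag


decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)  # near zero
loss_redundant = barlow_twins_loss(redundant, redundant)           # penalized
```

The second batch duplicates one feature in both dimensions, so only the off-diagonal term fires. A representation that wastes capacity on redundant features is penalized even when the two branches agree perfectly.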

Abstract:
Closed‑loop evaluation of autonomous‑driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history‑free initialization that mismatches policy inputs, (ii) multi‑step sampling latency that violates real‑time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego‑centric 64 m × 64 m lane‑agent vector‑graph tiles during rollout. VectorWorld aligns initialization with history‑conditioned policies by producing a policy‑compatible interaction state via a motion‑aware gated VAE. It enables real‑time outpainting via solver‑free one‑step masked completion with an edge‑gated relational DiT trained with interval‑conditioned MeanFlow and JVP‑based large‑step supervision. To stabilize long‑horizon rollouts, we introduce ΔSim, a physics‑aligned non‑ego (NPC) policy with hybrid discrete‑continuous actions and differentiable kinematic logit shaping. On Waymo Open Motion and nuPlan, VectorWorld improves map‑structure fidelity and initialization validity, and supports stable, real‑time 1 km+ closed‑loop rollouts (code: https://github.com/jiangchaokang/VectorWorld).

Abstract:
We present StereoWorld, a camera‑conditioned stereo world model that jointly learns appearance and binocular geometry for end‑to‑end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera‑frame RoPE that augments latent tokens with camera‑aware rotary positional encoding, enabling relative, view‑ and time‑consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo‑aware attention decomposition that factors full 4D attention into 3D intra‑view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity‑aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera‑motion fidelity over strong monocular‑then‑convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end‑to‑end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric‑scale depth grounding, and is compatible with long‑video distillation for extended interactive stereo synthesis.

Abstract:
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection‑based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt‑following generation. MosaicMem composes spatially aligned patches in the queried view via a patch‑and‑compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute‑level navigation, memory‑based scene editing, and autoregressive rollout.

Abstract:
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long‑horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long‑term 3D consistency. First, we define a physics‑based continuous action space and represent user inputs in the Lie algebra to derive precise 6‑DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long‑horizon navigation. To support this research, we introduce a large‑scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state‑of‑the‑art interactive gaming world models in action controllability, long‑horizon visual quality, and 3D spatial consistency.
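
The geometric coupling described above, where per-step actions become relative camera motions that accumulate into a global pose, can be sketched with the standard exponential map from the Lie algebra se(3) to SE(3). The code below is a generic textbook construction, not the paper's implementation. The mapping from user inputs to twists, the right-multiplication (camera-frame) convention, and the example twist are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Twist = Tuple[Sequence[float], Sequence[float]]  # (linear v, angular w)


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hat(w: Sequence[float]) -> Matrix:
    """Skew-symmetric matrix such that hat(w) @ x equals the cross product w x x."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]


def se3_exp(v: Sequence[float], w: Sequence[float]) -> Matrix:
    """Exponential map of the twist (v, w) to a 4x4 homogeneous transform."""
    theta = math.sqrt(sum(c * c for c in w))
    k = hat(w)
    k2 = matmul(k, k)
    if theta < 1e-8:  # Taylor limits of the coefficients as theta -> 0
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta**2
        c = (theta - math.sin(theta)) / theta**3
    eye = identity(3)
    # Rodrigues rotation R and the matrix V that maps v to the translation.
    rot = [[eye[i][j] + a * k[i][j] + b * k2[i][j] for j in range(3)] for i in range(3)]
    vmat = [[eye[i][j] + b * k[i][j] + c * k2[i][j] for j in range(3)] for i in range(3)]
    trans = [sum(vmat[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [rot[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def accumulate(twists: Sequence[Twist]) -> Matrix:
    """Compose per-step relative motions (in the camera frame) into a global pose."""
    pose = identity(4)
    for v, w in twists:
        pose = matmul(pose, se3_exp(v, w))
    return pose


quarter_turn = se3_exp([0.0, 0.0, 0.0], [0.0, 0.0, math.pi / 2])
# Four identical "move forward while turning left" steps trace a full circle.
loop_pose = accumulate([([1.0, 0.0, 0.0], [0.0, 0.0, math.pi / 2])] * 4)
```

After four quarter-circle arcs the accumulated pose returns to the identity. A pose-indexed memory needs exactly this property to recognize a revisited location.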

Abstract:
Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction‑planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future‑aware framework that couples an action‑conditioned world model with on‑policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi‑head decoders then produce future depth maps and human trajectories, yielding a future‑aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD‑PPO while injecting world‑model think‑ahead signals via: (i) action‑conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single‑ and multi‑robot Social‑HM3D show state‑of‑the‑art navigation success, with zero‑shot transfer to Social‑MP3D and real‑world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: https://hutslib.github.io/NavThinker.

Abstract:
End‑to‑end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner‑shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real‑time planning via unifying vision and motion representation. We first introduce a Trajectory‑aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi‑modal Planner, ensuring the driving policy operates on mature representations pre‑optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high‑quality, multi‑modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future‑aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real‑time. Extensive experiments on the NAVSIM, NAVSIM‑v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision‑only methods while maintaining high‑fidelity action‑controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.

Abstract:
Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge‑Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system‑aware Mixture‑of‑Experts (Sys‑MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero‑shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real‑world settings, WestWorld achieves significant improvements over competitive baselines in zero‑ and few‑shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model‑based control for different robots. Finally, we deploy our model on a real‑world Unitree Go1, where it demonstrates stable locomotion performance. The code is available at https://github.com/511205787/WestWorld.

Abstract:
While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real‑world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real‑world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS‑Bench‑Real and PhyFPS‑Bench‑Gen. Our evaluations reveal a harsh reality: state‑of‑the‑art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human‑perceived naturalness of AI‑generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.

Abstract:
Open‑world object manipulation remains a fundamental challenge in robotics. While Vision‑Language‑Action (VLA) models have demonstrated promising results, they rely heavily on large‑scale robot action demonstrations, which are costly to collect and can hinder out‑of‑distribution generalization. In this paper, we propose an explicit‑world‑model‑based framework for open‑world manipulation that achieves zero‑shot generalization by constructing a physically grounded digital twin of the environment. The framework integrates open‑set perception, digital‑twin reconstruction, and the sampling and evaluation of interaction strategies. By constructing a digital twin of the environment, our approach efficiently explores and evaluates manipulation strategies in a physics‑enabled simulator and reliably deploys the chosen strategy to the real world. Experimentally, the proposed framework is able to perform multiple open‑set manipulation tasks without any task‑specific action demonstrations, demonstrating strong zero‑shot generalization on both the task and object levels. Project Page: https://bojack‑bj.github.io/projects/thesis/

Abstract:
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO‑Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally‑occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO‑Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO‑Bench results provides new insight into potential data and architecture bias of present‑day video world models. Project website: https://glab‑caltech.github.io/STEVOBench/. Blog: https://ziqi‑ma.github.io/blog/2026/outofsight/

Abstract:
We present InSpatio‑WorldFM, an open‑source real‑time frame model for spatial intelligence. Unlike video‑based world models that rely on sequential frame generation and incur substantial latency due to window‑level processing, InSpatio‑WorldFM adopts a frame‑based paradigm that generates each frame independently, enabling low‑latency real‑time spatial inference. By enforcing multi‑view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine‑grained visual details across viewpoint changes. We further introduce a progressive three‑stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real‑time generator through few‑step distillation. Experimental results show that InSpatio‑WorldFM achieves strong multi‑view consistency while supporting interactive exploration on consumer‑grade GPUs, providing an efficient alternative to traditional video‑based world models for real‑time world simulation.

Abstract:
Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action‑conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine‑grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action‑conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder‑based Navigation World Model (RAE‑NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT‑DH) to model continuous transitions, and introduce a separate time‑driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.

Abstract:
Learning natural, stable, and compositionally generalizable whole‑body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco‑manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross‑skill gradient interference and motion pattern conflicts in high‑degree‑of‑freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld‑X, a hierarchical world model framework for humanoid control. Guided by a divide‑and‑conquer principle, our method decomposes complex control problems into a set of specialized expert policies (Specialized Expert Policies, SEP). Each expert is trained under human motion priors through imitation‑constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision‑Language Model (VLM), enabling semantic‑driven expert composition. The VLM‑guided router dynamically integrates expert policies according to high‑level task semantics, facilitating compositional generalization and adaptive execution in multi‑stage loco‑manipulation tasks.

Abstract:
World models enable planning in imagined future predicted space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action‑conditioned consistency, so visually plausible predictions can still drift under multi‑step rollout and degrade planning. Moreover, efficient deployment requires few‑step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training‑inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning‑based image‑goal navigation. Specifically, we introduce a two‑stage training framework that combines structure pretraining with Action‑Conditioned Consistency (ACC) post‑training to improve action‑conditioned rollout consistency. We further introduce Inference‑Consistent State Distillation (ICSD) for few‑step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real‑world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.

Abstract:
Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain‑WM, a pioneering brain GBM world model that unifies next‑step treatment prediction and future MRI generation, thereby capturing the co‑evolutionary dynamics between tumor and treatment. Specifically, Brain‑WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow‑based future MRI generation. Then, instead of a conventional monolithic framework, Brain‑WM adopts a novel Y‑shaped Mixture‑of‑Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross‑task synergies while preventing feature collapse. Finally, a synergistic multi‑timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression‑aware semantics. Extensive validation on internal and external multi‑institutional cohorts demonstrates the superiority of Brain‑WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain‑WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at https://github.com/thibault‑wch/Brain‑GBM‑world‑model.

Abstract:
Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out‑of‑sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor‑based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out‑of‑sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long‑term scene consistency, bridging the gap between existing 2D observation‑based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at https://zichengduan.github.io/LiveWorld/index.html.

Abstract:
We introduce Latent Particle World Model (LPWM), a self‑supervised object‑centric world model scaled to real‑world multi‑object datasets and applicable in decision‑making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end‑to‑end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state‑of‑the‑art results on diverse real‑world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision‑making, including goal‑conditioned imitation learning, as we demonstrate in the paper. Code, data, pre‑trained models and video rollouts are available: https://taldatech.github.io/lpwm‑web

Abstract:
Vision‑Language‑Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal‑causal structure underlying visual dynamics. World‑model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent‑action VLAs encode frame‑to‑frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain‑of‑World VLA), a new "Chain of World" paradigm that unifies world‑model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre‑training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co‑fine‑tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world‑model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world‑model and latent‑action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx‑hit.github.io/cowvla‑io.

Abstract:
Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model‑based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi‑task experience to learn robust, task‑agnostic representations. In contrast, model‑free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero‑Multitask (EZ‑M), a sample‑efficient multi‑task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole‑body control benchmark, EZ‑M achieves state‑of‑the‑art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at https://yewr.github.io/ez_m/.

Abstract:
Despite impressive progress in video generation, existing models remain limited to surface‑level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world‑related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world‑level constraints during training, and Multi‑Source Inner‑Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at https://github.com/ABU121111/DreamWorld.

Abstract:
Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real‑world robot learning, where access to hand‑designed rewards and demonstrator actions is not assumed. To address this data‑constrained setting, this work presents a planning‑based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real world demonstrate that this paradigm is effective for learning image‑based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre‑training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which have more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: https://uwrobotlearning.github.io/mpail2/.

Abstract:
Energy‑based predictive world models provide a powerful approach for multi‑step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long‑horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy‑based optimization, enabling stable multi‑step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3‑step planning and 2% SR improvement in 4‑step planning compared to the state‑of‑the‑art V‑JEPA 2. Project website: https://steve‑zeyu‑zhang.github.io/GeoWorld.
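
The mapping from Euclidean latents onto a hyperbolic manifold mentioned in the GeoWorld abstract can be illustrated with a minimal sketch. The abstract does not say which hyperbolic model or curvature is used, so the Poincaré ball, the exponential map at the origin, the curvature parameter `c`, and the function names below are all assumptions chosen for illustration, not the paper's implementation.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v onto the Poincare ball of curvature -c
    using the exponential map at the origin (assumed model, see lead-in)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    def sq(u):
        return sum(a * a for a in u)
    diff = sq([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Hypothetical 2-D latent; real latents would be higher-dimensional.
point = exp_map_origin([0.3, 0.4])
origin = [0.0, 0.0]
```

With `c = 1`, every mapped point has Euclidean norm below 1, and its hyperbolic distance from the origin is twice the Euclidean norm of the input vector, so distances grow quickly toward the boundary of the ball. This property is commonly cited as the reason hyperbolic spaces embed hierarchies well.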

Abstract:
World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long‑term content consistency when scenes are revisited and enabling precise camera control from user‑provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine‑grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long‑term memory and precise camera control via a time‑aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual‑stream diffusion transformer for high‑fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point‑cloud‑based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real‑world and synthetic benchmarks demonstrate that UCM significantly outperforms state‑of‑the‑art methods in long‑term scene consistency, while also achieving precise camera controllability in high‑fidelity video generation.

Abstract:
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry‑agnostic multiview world model for driving scenarios that employs a dual‑causal autoregressive framework. It follows both scale‑wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio‑temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio‑temporal representation across views, frames, and scales based on relative Plücker‑ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long‑horizon video generation. RAYNOVA achieves state‑of‑the‑art multi‑view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at https://raynova‑ai.github.io/.

Abstract:
Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast‑free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to a sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence of data at the missing time points, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient‑level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL), which constructs a patient‑specific template and constrains contents to align with this template. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL), which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space among interpolated sequences. Extensive experiments on two datasets show that our MRI CEKWorld achieves more realistic contents and kinetics. Code will be available at https://github.com/DD0922/MRI‑Contrast‑Enhancement‑Kinetics‑World‑Model.

Abstract:
Extended reality (XR) demands generative models that respond to users' tracked real‑world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human‑centric video world model that is conditioned on both tracked head pose and joint‑level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand‑object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.

Abstract:
Model‑based reinforcement learning promises strong sample efficiency but often underperforms in practice due to compounding model error, unimodal world models that average over multi‑modal dynamics, and overconfident predictions that bias learning. We introduce WIMLE, a model‑based method that extends Implicit Maximum Likelihood Estimation (IMLE) to the model‑based RL framework to learn stochastic, multi‑modal world models without iterative sampling and to estimate predictive uncertainty via ensembles and latent sampling. During training, WIMLE weights each synthetic transition by its predicted confidence, preserving useful model rollouts while attenuating bias from uncertain predictions and enabling stable learning. Across 40 continuous‑control tasks spanning DeepMind Control, MyoSuite, and HumanoidBench, WIMLE achieves superior sample efficiency and competitive or better asymptotic performance than strong model‑free and model‑based baselines. Notably, on the challenging Humanoid‑run task, WIMLE improves sample efficiency by over 50% relative to the strongest competitor, and on HumanoidBench it solves 8 of 14 tasks (versus 4 for BRO and 5 for SimbaV2). These results highlight the value of IMLE‑based multi‑modality and uncertainty‑aware weighting for stable model‑based RL.
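
The confidence weighting of synthetic transitions described in the WIMLE abstract can be sketched as follows. The abstract does not give the weighting function, so the inverse-variance form `1 / (1 + variance)`, the use of ensemble disagreement as the only uncertainty signal, and the function names are assumptions for illustration only.

```python
from statistics import pvariance

def confidence_weights(ensemble_preds):
    """ensemble_preds[i] lists each ensemble member's prediction for
    synthetic transition i. Higher disagreement gives a lower weight.
    The 1 / (1 + variance) form is an assumed, illustrative choice."""
    return [1.0 / (1.0 + pvariance(preds)) for preds in ensemble_preds]

def weighted_loss(losses, weights):
    """Weighted mean of per-transition losses, normalized by total weight."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total

# Hypothetical scalar predictions from a 3-member ensemble.
preds = [
    [1.0, 1.0, 1.0],  # members agree: full confidence
    [0.0, 2.0, 4.0],  # members disagree: down-weighted
]
weights = confidence_weights(preds)
loss = weighted_loss([1.0, 2.0], weights)
```

Here the second transition contributes less to the loss because the ensemble disagrees on it, which is the attenuation effect the abstract attributes to uncertainty-aware weighting.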

Abstract:
Vision‑language‑action (VLA) models that directly predict multi‑step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre‑trained on web‑scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain‑0.5M, a VLA model trained via world model‑based reinforcement learning. It is built upon GigaBrain‑0.5, which is pre‑trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain‑0.5M further integrates world model‑based reinforcement learning via RAMP (Reinforcement leArning via world Model‑conditioned Policy) to enable robust cross‑task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain‑0.5M exhibits reliable long‑horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real‑world deployment videos on our project page: https://gigabrain05m.github.io.

Abstract:
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial‑and‑error. However, its real‑world application is stifled by low sample efficiency. Recent Human‑in‑the‑Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits scalability, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent‑guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on three tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor‑free and scalable robot learning. Project website: https://agps‑rl.github.io/agps/.

Abstract:
Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low‑bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO‑WM on the Wall planning task, we run a paired‑goal mixed‑bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three‑regime pattern: 8‑bit and 6‑bit settings remain close to FP16, 3‑bit settings collapse, and 4‑bit settings are allocation‑sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near‑size asymmetric variants show the same encoder‑side direction. In a later strict 22‑cell replication with smaller per‑cell episode count, the mixed‑versus‑uniform INT4 sign becomes budget‑conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module‑aware, budget‑aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at https://github.com/suraj‑ranganath/DINO‑MBQuant.
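
The bitwidth regimes in the abstract above can be made concrete with a small sketch of symmetric uniform "fake" quantization. This is a generic scheme, not necessarily the one used with DINO‑WM in the paper; the function names, the example values, and the per-module allocation dictionary (`encoder`, `predictor`) are hypothetical.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantization: round to a signed integer grid
    with 2**(bits-1) - 1 positive levels, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)  # nothing to scale
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute rounding error introduced at a given bitwidth."""
    quantized = fake_quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)

# Hypothetical weights; real modules hold many more parameters.
weights = [0.9, -0.35, 0.12, 0.5, -0.77, 0.03]
errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 3)}

# A mixed allocation keeps more bits where precision matters most,
# e.g. the encoder-side preference reported in the abstract.
allocation = {"encoder": 8, "predictor": 4}  # hypothetical module names
```

On this toy vector the rounding error grows as the bitwidth shrinks from 8 to 4 to 3 bits. How such errors translate into planning success is an empirical question that the paper's evaluation addresses and this sketch does not.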

Abstract:
World models require robust relational understanding to support prediction, reasoning, and control. While object‑centric representations provide a useful abstraction, they are not sufficient to capture interaction‑dependent dynamics. We therefore propose C‑JEPA, a simple and flexible object‑centric world model that extends masked joint embedding prediction from image patches to object‑centric representations. By masking object‑level latents and requiring each masked object state to be inferred from the surrounding context, C‑JEPA imposes structured partial observability during training, creating counterfactual‑like prediction queries that discourage shortcut solutions and make interaction‑dependent prediction necessary under the learning objective. Empirically, C‑JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object‑level masking. On agent control tasks, C‑JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch‑based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object‑level masking induces useful inductive bias by controlling observability. Our code is available at https://github.com/galilai‑group/cjepa.

Abstract:
Vision‑Language‑Action (VLA) models are promising for generalist robot manipulation but remain brittle in out‑of‑distribution (OOD) settings, especially with limited real‑robot data. To resolve the generalization bottleneck, we introduce VISTA, a hierarchical Vision‑Language‑Action framework that leverages the generalization of a large‑scale pre‑trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. Our hierarchical framework consists of a world model as the high‑level planner and a VLA as the low‑level executor. The high‑level world model first divides manipulation tasks into subtask sequences with goal images, and the low‑level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low‑level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in a large number of out‑of‑distribution scenarios, and the performance of the same‑structured VLA in novel scenarios improves from 14% to 69% with the guidance generated by the world model. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out‑of‑distribution scenarios. Project page: https://vista‑wm.github.io

Abstract:
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end‑to‑end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR‑World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR‑World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future‑Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial‑temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state‑of‑the‑art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
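
The temporal-residual idea in the TR‑World abstract can be illustrated with a toy sketch: subtracting consecutive scene representations cancels static regions and leaves a signal only where something moved. The 2‑D grids of scalars below stand in for BEV feature maps, and the function names and threshold are hypothetical. The actual model operates on learned features and is not reproduced here.

```python
def temporal_residual(prev_bev, curr_bev):
    """Element-wise difference between two consecutive BEV grids."""
    return [
        [curr - prev for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_bev, curr_bev)
    ]

def dynamic_cells(residual, threshold=1e-6):
    """Grid coordinates whose residual magnitude exceeds the threshold."""
    return [
        (i, j)
        for i, row in enumerate(residual)
        for j, value in enumerate(row)
        if abs(value) > threshold
    ]

# Toy grids: an object moves from cell (0, 0) to cell (0, 1),
# while the second row stays static.
prev_bev = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
curr_bev = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
residual = temporal_residual(prev_bev, curr_bev)
moving = dynamic_cells(residual)
```

Only the two cells touched by the moving object are flagged, without any detection or tracking step, which mirrors the motivation given in the abstract.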

Abstract:
Large language models (LLMs) exhibit strong general‑purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules, particularly in corner cases, is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro‑Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule‑based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine‑tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS's consistent advantages over baselines in both WM prediction accuracy and data efficiency. Our models and code are available at https://github.com/tianyi‑lab/NeSyS.
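
One simple way a symbolic rule could modify an LLM's output distribution, as the NeSyS abstract describes at a high level, is to zero out candidates the rule rejects and renormalize the rest. The abstract does not specify the actual mechanism, so the hard masking, the fallback behavior, the function name, and the example rule below are assumptions for illustration.

```python
def constrain(probs, allowed):
    """Zero out candidate next states rejected by the symbolic rule
    `allowed`, then renormalize. Hard masking is an assumed mechanism."""
    masked = {state: (p if allowed(state) else 0.0) for state, p in probs.items()}
    total = sum(masked.values())
    if total == 0.0:
        # No candidate satisfies the rule: fall back to the LLM distribution.
        return dict(probs)
    return {state: p / total for state, p in masked.items()}

# Hypothetical LLM distribution over next-state descriptions.
llm_probs = {
    "door opens": 0.5,
    "door stays locked": 0.3,
    "door vanishes": 0.2,
}

def rule(state):
    """Hypothetical executable rule: objects cannot vanish."""
    return "vanishes" not in state

constrained = constrain(llm_probs, rule)
```

The rule-violating outcome receives zero probability while the relative ordering of the remaining candidates, which comes from the LLM's semantic prior, is preserved.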

Abstract:
Scaling action‑controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene‑specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ‑REPA, a sequence‑level control‑effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self‑supervised video encoder. Building on this, we present Olaf‑World, a pipeline that pretrains action‑conditioned video world models from large‑scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero‑shot action transfer and more data‑efficient adaptation to new control interfaces than state‑of‑the‑art baselines.
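
A minimal sketch of a sequence-level effect-alignment loss, under stated assumptions: per-step latent actions are integrated by summation, the observable effect is the difference between the last and first frame features of a frozen encoder, and alignment is measured with cosine distance. None of these choices is confirmed by the abstract, and a real system would likely need a projection between the two spaces; here both vectors share one dimensionality for simplicity.

```python
import math
from typing import List

Vector = List[float]


def integrate(latent_actions: List[Vector]) -> Vector:
    """Sum per-step latent actions over the clip (assumed integration rule)."""
    return [sum(dim) for dim in zip(*latent_actions)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def effect_alignment_loss(latent_actions: List[Vector], first_feat: Vector, last_feat: Vector) -> float:
    """1 - cos(integrated latent action, temporal feature difference)."""
    effect = [last - first for first, last in zip(first_feat, last_feat)]
    return 1.0 - cosine(integrate(latent_actions), effect)


# Two steps of a hypothetical "move right" latent action.
steps = [[0.5, 0.0], [0.5, 0.0]]
aligned = effect_alignment_loss(steps, [0.0, 0.0], [2.0, 0.0])
misaligned = effect_alignment_loss(steps, [0.0, 0.0], [0.0, 2.0])
```

The loss is near zero when the integrated action points in the same direction as the feature change and grows as the two diverge, which gives clips from different contexts a shared reference direction.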

Abstract:
Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi‑turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high‑quality observations. Notably, these environments are code‑driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large‑scale reinforcement learning for multi‑turn tool‑use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark‑specific ones, yields strong out‑of‑distribution generalization. The code is available at https://github.com/Snowflake‑Labs/agent‑world‑model.

Abstract:
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human‑like foresight by enabling action‑conditioned prediction. However, existing text‑ and pixel‑based approaches struggle to simultaneously achieve high visual fidelity and fine‑grained structural controllability. To this end, we propose Code2World, a vision‑language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high‑fidelity HTML and refining synthesized code through a visual‑feedback revision mechanism, yielding a corpus of over 80K high‑quality screen‑action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render‑Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World‑8B achieves the top‑performing next UI prediction, rivaling the competitive GPT‑5 and Gemini‑3‑Pro‑Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini‑2.5‑Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP‑ML/Code2World.

Abstract:
We study diffusion‑based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on‑policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub‑frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub‑frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor‑c/horizon‑imagination.

Abstract:
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open‑domain closed‑loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high‑quality videos at 1080p and 24 FPS, including 100 (first‑person) + 100 (third‑person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND‑World, a novel interactive Video‑to‑World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long‑term memory consistency and generalizing across action spaces. Code: https://github.com/CSU‑JPG/MIND.

Abstract:
End‑to‑end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision‑Language‑Action (VLA) with World Models to enhance decision‑making and forward‑looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld‑VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene‑evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld‑VLA incorporates the latent states of the world model as core decision‑making states for the VLA planner, helping the planner assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld‑VLA supports controllable, action‑conditioned imagination at the feature level, avoiding expensive pixel‑level rollouts. Extensive open‑loop and closed‑loop evaluations demonstrate the effectiveness of DriveWorld‑VLA, which achieves state‑of‑the‑art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and 0.16 3‑second average collision rate on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld‑VLA.git.

Abstract:
Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether, or to what extent, sample‑based training is able to capture the true structure of these languages, often referred to as the "world model". Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule‑based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine‑grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high‑quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.

Abstract:
We present PIRATR, an end‑to‑end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi‑class 6‑DoF poses and class‑specific parametric attributes directly from occlusion‑affected point cloud data. This formulation enables not only geometric localization but also the estimation of task‑relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class‑specific heads, making it straightforward to extend to novel object types without re‑designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine‑tuning. PIRATR establishes a new paradigm of pose‑aware, parameterized perception. This bridges the gap between low‑level geometric reasoning and actionable world models, paving the way for scalable, simulation‑trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.

Abstract:
We present EB‑JEPA, an open‑source library for learning representations and world models using Joint‑Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self‑contained implementations that illustrate how representation learning techniques developed for image‑level self‑supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action‑conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single‑GPU training within a few hours, making energy‑based self‑supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR‑10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi‑step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action‑conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.

Abstract:
Autonomous driving relies on robust models trained on high‑quality, large‑scale multi‑view driving videos. While world models offer a cost‑effective solution for generating realistic driving videos, they struggle to maintain instance‑level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance‑aware mechanisms, InstaDrive achieves state‑of‑the‑art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety‑critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.

Abstract:
Autonomous driving relies on robust models trained on large‑scale, high‑quality multi‑view driving videos. Although world models provide a cost‑effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance‑level temporal constraints. We introduce ConsisDrive, an identity‑preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance‑Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance‑Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state‑of‑the‑art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.

Abstract:
We propose Infinite‑World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real‑world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground‑truth, they lack an effective training paradigm for real‑world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose‑free Memory Compressor (HPMC) that recursively distills historical latents into a fixed‑budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty‑aware Action Labeling module that discretizes continuous motion into a tri‑state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action‑response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit‑Dense Finetuning Strategy using a compact, 30‑minute dataset to efficiently activate the model's long‑range loop‑closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite‑World achieves superior performance in visual quality, action controllability, and spatial consistency.

Abstract:
World models learn an internal representation of environment dynamics, enabling agents to simulate and reason about future states within a compact latent space for tasks such as planning, prediction, and inference. However, running world models incurs heavy computational cost and a large memory footprint, making model quantization essential for efficient deployment. To date, the effects of post‑training quantization (PTQ) on world models remain largely unexamined. In this work, we present a systematic empirical study of world model quantization using DINO‑WM as a representative case, evaluating diverse PTQ methods under both weight‑only and joint weight‑activation settings. We conduct extensive experiments on different visual planning tasks across a wide range of bit‑widths, quantization granularities, and planning horizons up to 50 iterations. Our results show that quantization effects in world models extend beyond standard accuracy and bit‑width trade‑offs: group‑wise weight quantization can stabilize low‑bit rollouts, activation quantization granularity yields inconsistent benefits, and quantization sensitivity is highly asymmetric between encoder and predictor modules. Moreover, aggressive low‑bit quantization significantly degrades the alignment between the planning objective and task success, leading to failures that cannot be remedied by additional optimization. These findings reveal distinct quantization‑induced failure modes in world model‑based planning and provide practical guidance for deploying quantized world models under strict computational constraints. The code will be available at https://github.com/huawei‑noah/noah‑research/tree/master/QuantWM.
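
To make the granularity point concrete, the sketch below compares per-tensor and group-wise symmetric uniform weight quantization on a toy weight vector. It is a generic round-to-nearest scheme written for illustration, not one of the PTQ methods evaluated in the study; the weights, the 4-bit setting, and the function names are assumptions.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization with one scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def fake_quantize(weights: List[float], bits: int, group_size: int) -> List[float]:
    """Quantize then dequantize, using a separate scale for each group of weights."""
    restored: List[float] = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[start:start + group_size], bits)
        restored.extend(code * scale for code in codes)
    return restored


def max_error(reference: List[float], approx: List[float]) -> float:
    """Largest absolute difference between two equally long vectors."""
    return max(abs(r - a) for r, a in zip(reference, approx))


# Hypothetical weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]

per_tensor = fake_quantize(weights, bits=4, group_size=len(weights))
per_group = fake_quantize(weights, bits=4, group_size=4)

small_err_tensor = max_error(weights[:4], per_tensor[:4])
small_err_group = max_error(weights[:4], per_group[:4])
```

With a single scale, the large weights dominate and the small ones collapse to zero; giving each group its own scale preserves them. This is one plausible reason finer weight granularity helps at low bit-widths, though the study's rollout-level findings go beyond per-weight error.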

Abstract:
Autoregressive video diffusion models enable streaming generation, opening the door to long‑form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long‑range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near‑duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross‑attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training‑free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross‑attention by selecting frame‑relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self‑attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end‑to‑end speedups of up to 5x to 10x while preserving near‑identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

Abstract:
World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer‑based models significantly hinders real‑time deployment. To address this efficiency‑performance bottleneck, we introduce DDP‑WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context‑driven background updates. DDP‑WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross‑attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP‑WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi‑body interactions. Specifically, on the challenging Push‑T task, DDP‑WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state‑of‑the‑art dense models. The results establish a promising path for developing efficient, high‑fidelity world models. Code is available at https://hcplab‑sysu.github.io/DDP‑WM/.

Abstract:
Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure‑ and dynamic‑aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene's structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that UniDWM can be viewed as a variant of the VAE, which provides theoretical guidance for the multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at https://github.com/Say2L/UniDWM.

Abstract:
Language model (LM)‑based embodied agents are increasingly deployed in real‑world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision‑making. To address this challenge, we extend the Mixture‑of‑Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre‑trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test‑time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi‑granular prototype‑based routing, which adapts mixtures across object‑ to scene‑level similarities, (ii) test‑time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture‑based augmentation, which efficiently constructs new models from few‑shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero‑shot adaptation and few‑shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
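
A minimal sketch of prototype-based routing over world models, assuming each model is summarized by one prototype vector and mixture weights come from a softmax over cosine similarities. The prototype names, the two-dimensional features, and the temperature are hypothetical; TMoW's multi-granular routing and test-time refinement are not reproduced here.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def route(feature: Vector, prototypes: Dict[str, Vector], temperature: float = 0.1) -> Dict[str, float]:
    """Softmax over feature-to-prototype similarities gives the mixture weights."""
    sims = {name: cosine(feature, proto) for name, proto in prototypes.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - peak) / temperature) for name, s in sims.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# One hypothetical prototype per world model.
prototypes = {"kitchen": [1.0, 0.0], "garage": [0.0, 1.0]}

# A test-time observation feature that mostly resembles the kitchen domain.
weights = route([0.9, 0.1], prototypes)
```

Because the weights depend only on the current feature and the stored prototypes, adding a new world model amounts to adding a prototype, with no retraining of a fixed router.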

Abstract:
This work highlights that video world modeling, alongside vision‑language pre‑training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot‑VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture‑of‑Transformers (MoT) architecture, (2) a closed‑loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground‑truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real‑world scenarios, where it shows significant promise in long‑horizon manipulation, data efficiency in post‑training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.

Abstract:
We present LingBot‑World, an open‑sourced world simulator stemming from video generation. Positioned as a top‑tier world model, LingBot‑World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute‑level horizon while preserving contextual consistency over time, which is also known as "long‑term memory". (3) It supports real‑time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open‑source and closed‑source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

Abstract:
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain‑of‑thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert‑level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human‑like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world‑model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks‑‑particularly those grounded in the physical world‑‑visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual‑verbal CoT reasoning, constructing a new evaluation suite, VisWorld‑Eval. Controlled experiments on a state‑of‑the‑art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human‑like multimodal AI.

Abstract:
While model‑based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision‑making. Motivated by this principle, we propose the Event‑Aware World Model (EAWM), a general framework that learns event‑aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio‑temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC‑GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%‑45%, setting new state‑of‑the‑art results across benchmarks. Our code is released at https://github.com/MarquisDarwin/EAWM.

Abstract:
Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels‑V3, a conditional video generation model built upon a unified multimodal in‑context learning framework with diffusion Transformers. SkyReels‑V3 supports three core generative paradigms within a single architecture: reference images‑to‑video synthesis, video‑to‑video extension, and audio‑guided video generation. (i) The reference images‑to‑video model is designed to produce high‑fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross‑frame pairing, image editing, and semantic rewriting, effectively mitigating copy‑paste artifacts. During training, an image‑video hybrid strategy combined with multi‑resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio‑temporal consistency modeling with large‑scale video understanding, enabling both seamless single‑shot continuation and intelligent multi‑shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute‑level audio‑conditioned video generation by training first‑and‑last frame insertion patterns and reconstructing key‑frame inference paradigms. Audio‑video synchronization is also optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels‑V3 achieves state‑of‑the‑art or near state‑of‑the‑art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed‑source systems. GitHub: https://github.com/SkyworkAI/SkyReels‑V3.

Abstract:
Humans can look at a static scene and instantly predict what happens next ‑‑ will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what‑if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT‑5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over‑rely on textual chain‑of‑thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: https://github.com/CausalSpatial/CausalSpatial

Abstract:
Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain‑specific data and costly fine‑tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two‑stage process mirrors a reasoning‑rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD‑LTX model outperforms all open‑source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita‑epfl.github.io/MAD‑World‑Model/

Abstract:
Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single‑step or fixed‑horizon rollouts, leaving their potential for complex task planning under‑exploited. We propose Imagine‑then‑Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi‑step "imagined" trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training‑free and reinforcement‑trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks. Our code and data will be publicly available at https://github.com/loyiv/ITP.

Abstract:
Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate‑Execute‑Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data‑centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict‑then‑Verify loop, achieving a 6x acceleration in convergence while surpassing execution‑based baselines by +6%. Our code and dataset are publicly available at https://github.com/zjunlp/predict‑before‑execute.

Abstract:
Video world models aim to simulate dynamic, real‑world environments, yet existing methods struggle to provide unified and precise control over camera and multi‑object motion, as videos inherently capture dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a geometry‑driven video world model that generates dynamic, realistic videos from a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state as a static background point cloud and per‑object 3D Gaussian trajectories. This representation captures each object's motion path and probabilistic 3D occupancy over time, providing a flexible, category‑agnostic alternative to rigid bounding boxes and parametric models. We render 4D Geometric Control into 4D control maps for a pretrained video diffusion model, enabling high‑fidelity, view‑consistent video generation that faithfully follows the specified dynamics. To enable training at scale, we develop an automatic data engine and construct VerseControl4D, a real‑world dataset of 35K training samples with automatically derived prompts and rendered 4D control maps. Extensive experiments show that VerseCrafter achieves superior visual quality and more accurate control over camera and multi‑object motion than prior methods.

Abstract:
World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision‑language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive‑WM, a unified VLM‑based world model that jointly performs driving‑scene understanding, trajectory planning, and trajectory‑conditioned future image generation within a single architecture. UniDrive‑WM's trajectory planner predicts a future trajectory, which conditions a VLM‑based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive‑WM produces high‑fidelity future images and improves planning performance by 7.3% in L2 trajectory error and 10.4% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM‑driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive‑wm.github.io/UniDrive‑WM.

Abstract:
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre‑trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or a few RGB‑D images and a sequence of low‑level robot action commands, PointWorld forecasts per‑pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment‑specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large‑scale dataset spanning real and simulated robotic manipulation in open‑world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single‑arm Franka and a bimanual humanoid. Through rigorous, large‑scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large‑scale 3D world modeling. With a real‑time (0.1s) inference speed, PointWorld can be efficiently integrated into the model‑predictive control (MPC) framework for manipulation. We demonstrate that a single pre‑trained checkpoint enables a real‑world Franka robot to perform rigid‑body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post‑training, and all from a single image captured in the wild. Project website at https://point‑world.github.io/.

Abstract:
Prevalent Vision‑Language‑Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often suffer from a lack of semantic grounding and exhibit brittleness in the presence of video prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA‑A1. This model employs a unified Mixture‑of‑Transformers architecture, coordinating three experts for scene understanding, visual foresight generation, and action execution. These components interact seamlessly through a unified masked self‑attention mechanism. Building upon InternVL3 and Qwen3‑VL, we instantiate InternVLA‑A1 at 2B and 3B parameter scales. We pre‑train these models on heterogeneous data sources spanning real‑world robot data, synthetic simulation data, and human videos, covering over 692M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim‑to‑real gap. We evaluate InternVLA‑A1 on 12 real‑world robotic tasks and a simulation benchmark. The results show that InternVLA‑A1 consistently outperforms prior leading models: compared with pi0.5, it achieves +4.4% on static manipulation tasks and +2.6% on the RoboTwin 2.0 simulation benchmark, and delivers a +26.7% boost on dynamic manipulation tasks.

Abstract:
A long‑standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state‑action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA‑WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conduct experiments using both simulated environments and real‑world robotic data, and study how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO‑WM and V‑JEPA‑2‑AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa‑wms.

Abstract:
Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and "physics" are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high‑level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web‑scale encyclopedic and narrative worlds, and simulation‑ and game‑like environments. Across these systems, we identify practical design principles for WWMs: separating code‑defined rules from model‑driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open‑ended environments. Project Page: https://github.com/Princeton‑AI2‑Lab/Web‑World‑Models.
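One possible reading of the "deterministic generation" principle is seeded procedural generation: content for any location is derived from a stable hash of its coordinates, so the world is unbounded yet identical on every visit. The Python sketch below illustrates that idea under stated assumptions. The names `tile_seed`, `generate_tile`, and `BIOMES` are hypothetical, a seeded random generator stands in for whatever model call a real system would make, and nothing here is taken from the WWM code base.

```python
import hashlib
import random

# Hypothetical content vocabulary, used only for illustration.
BIOMES = ["forest", "desert", "ocean", "mountain"]


def tile_seed(world_id: str, x: int, y: int) -> int:
    """Derive a stable integer seed from the world id and tile coordinates."""
    digest = hashlib.sha256(f"{world_id}:{x}:{y}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_tile(world_id: str, x: int, y: int) -> dict:
    """Generate tile content that depends only on (world_id, x, y).

    No tile is stored anywhere: revisiting the same coordinates
    regenerates exactly the same content from the same seed.
    """
    rng = random.Random(tile_seed(world_id, x, y))
    return {
        "x": x,
        "y": y,
        "biome": rng.choice(BIOMES),
        "elevation": round(rng.uniform(0.0, 1.0), 3),
    }


if __name__ == "__main__":
    first_visit = generate_tile("atlas", 3, -7)
    second_visit = generate_tile("atlas", 3, -7)
    print(first_visit == second_visit)  # same coordinates, same content
```

Because the seed is a pure function of the coordinates, the code‑defined layer stays consistent without a database, which is one way the "unlimited but structured" property could be obtained.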

Abstract:
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi‑modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task‑aware language‑guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual‑condition multi‑modal generation model, where the information captured by our vision‑language model is leveraged as a high‑level language condition in combination with a low‑level image condition, jointly guiding the multi‑modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state‑of‑the‑art performance. We will release the code publicly on GitHub at https://github.com/dtc111111/GaussianDWM.

Abstract:
Current methods for incremental object detection (IOD) primarily rely on Faster R‑CNN or DETR series detectors; however, these approaches do not accommodate the real‑time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO‑based incremental detectors: foreground‑background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO‑IOD, a real‑time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO‑World model, facilitating incremental learning via a stage‑wise parameter‑efficient fine‑tuning process. Specifically, YOLO‑IOD encompasses three principal components: 1) Conflict‑Aware Pseudo‑Label Refinement (CPR), which mitigates the foreground‑background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance‑based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross‑Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO‑IOD achieves superior performance with minimal forgetting.

Abstract:
Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. Examples include stating a fact that contradicts a knowledge base or producing a summary that contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we outline plans for a family of benchmarks using synthetic, fully specified reference world models to stress‑test and improve world modeling components.

Abstract:
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long‑term goals through adaptive environmental interaction. We address this by introducing L‑IVA (Long‑horizon Interactive Visual Avatar), a task and benchmark for evaluating goal‑directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed‑loop OTAR cycle (Observe‑Think‑Act‑Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual‑system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model‑specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi‑step task completion in open‑domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open‑loop and non‑reflective baselines in task success rate and behavioral coherence, validating our IWM‑inspired design for advancing video avatar intelligence from passive animation to active, goal‑oriented behavior.

Abstract:
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single‑target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi‑target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large‑scale dataset built for comprehensive training and evaluation across four critical dimensions of language‑guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity‑prone real‑world models. We also propose SegEarth‑R2, an MLLM architecture designed for comprehensive language‑guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single‑target and multi‑target scenarios. Experimental results demonstrate that SegEarth‑R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth‑insights/SegEarth‑R2.

Abstract:
Frame‑level autoregressive (frame‑AR) models have achieved significant progress, enabling real‑time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade‑off, we propose Memorize‑and‑Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG‑Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

Abstract:
Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable‑entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable‑entity models to support user‑specified characters capable of performing open‑ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object‑centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre‑trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long‑horizon coherence.

Abstract:
We present WorldCanvas, a framework for promptable world events that enables rich, user‑directed simulation by combining text, trajectories, and reference images. Unlike text‑only approaches and existing trajectory‑controlled image‑to‑video methods, our multimodal approach combines trajectories ‑‑ encoding motion, timing, and visibility ‑‑ with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi‑agent interactions, object entry/exit, reference‑guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user‑shaped simulators. Our project page is available at: https://worldcanvas.github.io/.

Abstract:
Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio‑Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP‑WM, a tokenizer‑free world model that maintains a dense voxel‑based scene state and incrementally fuses spatio‑temporal context over time. OccSTeP‑WM leverages a linear‑complexity attention backbone and a recurrent state‑space module to capture long‑range spatial dependencies while continually updating the scene memory with ego‑motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP‑WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at https://github.com/FaterYU/OccSTeP.
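For readers unfamiliar with the two reported metrics, the sketch below shows how they are commonly computed on voxel grids. It assumes the usual occupancy‑benchmark convention, which the abstract itself does not spell out: occupancy IoU treats every non‑empty voxel as occupied, and semantic mIoU averages per‑class IoU over the non‑empty classes. The label id `EMPTY = 0` and the function names are hypothetical, and voxel grids are flattened to plain lists for simplicity.

```python
from typing import Sequence

EMPTY = 0  # hypothetical label id for free space


def occupancy_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Binary IoU: a voxel counts as occupied if its label is not EMPTY."""
    inter = sum(1 for p, t in zip(pred, target) if p != EMPTY and t != EMPTY)
    union = sum(1 for p, t in zip(pred, target) if p != EMPTY or t != EMPTY)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean of per-class IoU over semantic classes 1..num_classes-1.

    Classes absent from both prediction and target are skipped so they
    do not distort the mean.
    """
    ious = []
    for c in range(1, num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 1, 1, 2]
    target = [0, 1, 2, 2]
    print(occupancy_iou(pred, target))     # geometry matches everywhere
    print(semantic_miou(pred, target, 3))  # one voxel has the wrong class
```

In the toy example the occupied region is predicted perfectly while one voxel carries the wrong class, which illustrates why semantic mIoU is typically the lower of the two numbers.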

Abstract:
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel‑space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicted as raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision‑language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large‑scale dataset consisting of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld

Abstract:
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long‑term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long‑term, high‑quality generation. We present LongVie 2, an end‑to‑end autoregressive framework trained in three stages: (1) Multi‑modal guidance, which integrates dense and sparse control signals to provide implicit world‑level supervision and improve controllability; (2) Degradation‑aware training on the input frame, bridging the gap between training and long‑term inference to maintain high visual quality; and (3) History‑context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high‑resolution one‑minute videos covering diverse real‑world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state‑of‑the‑art performance in long‑range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

Abstract:
A physics‑aware driving world model is essential for driving planning, out‑of‑distribution data synthesis, and closed‑loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics‑aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics‑informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high‑resolution 3D structures and dynamics. To facilitate effective compression of such high‑resolution occupancy, we propose a VAE that encodes occupancy into a latent tri‑plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end‑to‑end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47M parameters. Additionally, a Normalized Multi‑View Attention is introduced in the video generation model to generate multi‑view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi‑view consistent, and physics‑aware driving video generation.

Abstract:
The collection of large‑scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real‑world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim‑to‑real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment‑aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high‑quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real‑world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.

Abstract:
Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill‑suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image‑caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision‑Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high‑quality simulations across a wide range of dynamic scenarios.

Abstract:
Vision‑Language‑Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long‑horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter‑state changes while filtering static pixel‑level noise. From this perspective, we equip the VLA with a motion‑centric world model, enabling agents to reason about temporal dynamics and future evolution during action generation. Building on this idea, we propose HiF‑VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF‑VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight‑modulated joint expert to enable a "think‑while‑acting" paradigm for long‑horizon manipulation. As a result, HiF‑VLA surpasses strong baselines on LIBERO‑Long and CALVIN ABC‑D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF‑VLA achieves substantial improvements in real‑world long‑horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.

Abstract:
Recent advances in diffusion transformers have empowered video generation models to generate high‑quality video clips from texts or images. However, world models with the ability to predict long‑horizon futures from past observations and actions remain underexplored, especially for general‑purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real‑world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise‑augmented history memory to avoid over‑reliance on past frames, balancing responsiveness with temporal coherence. For precise action control, we introduce an action‑aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically routes heterogeneous action modalities, enhancing versatility across diverse real‑world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long‑term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long‑range prediction, and action alignment over existing state‑of‑the‑art world models.

Abstract:
Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web‑native platform for real‑time rendering of various Gaussian Splatting variants and meshes. Built on an efficient WebGPU renderer with per‑frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click‑to‑run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug‑and‑play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post‑processing. The platform further offers a plug‑in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU‑based primitive sorting. It already supports multiple variants, including MLP‑based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS‑family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.

Abstract:
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three‑dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real‑world cameras. We introduce Relative Ray Encoding, a geometry‑consistent representation that unifies complete camera information, including 6‑DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera‑controlled text‑to‑video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state‑of‑the‑art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera‑controllable video generation and highlight its potential as a general camera representation for Transformers across future multi‑view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.

Abstract:
Vision‑Language Models (VLMs) remain limited in spatial reasoning tasks that require multi‑view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test‑time scaling where a world model imagines action‑conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test‑time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty‑based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test‑time reward in verifiable, frame‑anchored micro‑claims. This principled verifier consistently improves spatial reasoning on the SAT‑Real benchmark and corrects trajectory‑selection biases through more balanced exploratory behavior. However, on the challenging MMSI‑Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine‑grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test‑time verification for world‑model‑based reasoning. Our code is available at https://github.com/chandar‑lab/visa‑for‑mindjourney.

Abstract:
Generating long, coherent egocentric videos is difficult, as hand‑object interactions and procedural tasks require reliable long‑term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end‑to‑end framework for egocentric long‑context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long‑Term Sparse KV Cache for stable global context with an attention‑based short‑term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid‑5M benchmark demonstrate that EgoLCD achieves state‑of‑the‑art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.

Abstract:
Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out‑of‑dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective for test‑time adaptation. By modeling a posterior over world models and training a history‑dependent agent to maximize expected return, the Bayesian approach directly addresses epistemic uncertainty without explicit conservatism. We first illustrate in a bandit setting that Bayesianism excels on low‑quality datasets where conservatism fails. Scaling to realistic tasks, we find that long‑horizon rollouts are essential to control value overestimation once conservatism is removed. We introduce design choices that enable learning from long‑horizon rollouts while mitigating compounding model errors, yielding our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative algorithms, achieving new state‑of‑the‑art on 7 datasets with rollout horizons of several hundred steps. Finally, we characterize datasets by quality and coverage to identify when NEUBAY is preferable to conservative methods.

Abstract:
Video‑based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC‑World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in‑context generation capability of large video models. We further finetune IC‑World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene‑level geometry consistency and object‑level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC‑World substantially outperforms state‑of‑the‑art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video‑based world models.
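
As a rough illustration of the group‑relative idea behind Group Relative Policy Optimization, the sketch below normalizes rewards within one group of sampled generations, so that each sample is scored against its own group rather than an absolute baseline. The helper name, the reward values, and the epsilon term are assumptions made for illustration; they are not details taken from IC‑World or its reward models.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical helper: score each reward relative to its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four videos generated from the same set of input images
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples above the group mean receive positive advantages and samples below it receive negative ones, which is what lets a policy update favor the more consistent generations within a group.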

Abstract:
Recent audio‑video generative systems suggest that coupling modalities benefits not only audio‑video synchrony but also the video modality itself. We pose a fundamental question: Does audio‑video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter‑efficient Audio‑Video Full DiT (AVFullDiT) architecture that leverages pre‑trained text‑to‑video (T2V) and text‑to‑audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V‑only counterpart under identical settings. Our results provide the first systematic evidence that audio‑video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision × impact sound), which in turn regularizes video dynamics. Our findings suggest that cross‑modal co‑training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

Abstract:
AI systems deployed in the real world must contend with distractions and out‑of‑distribution (OOD) noise that can destabilize their policies and lead to unsafe behavior. While robust training can reduce sensitivity to some forms of noise, it is infeasible to anticipate all possible OOD conditions. To mitigate this issue, we develop an algorithm that leverages a world model's inherent measure of surprise to reduce the impact of noise in world‑model‑based reinforcement learning agents. We introduce both multi‑representation and single‑representation rejection sampling, enabling robustness to settings with multiple faulty sensors or a single faulty sensor. While the introduction of noise typically degrades agent performance, we show that our techniques preserve performance relative to baselines under varying types and levels of noise across multiple environments within self‑driving simulation domains (CARLA and Safety Gymnasium). Furthermore, we demonstrate that our methods enhance the stability of two state‑of‑the‑art world models with markedly different underlying architectures: Cosmos and DreamerV3. Together, these results highlight the robustness of our approach across world modeling domains. We release our code at https://github.com/Bluefin‑Tuna/WISER.
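
The general idea of rejecting inputs by surprise can be sketched as follows. This is a toy sketch under stated assumptions, not the authors' algorithm: the scalar sensor readings, the absolute‑error surprise measure, the threshold, and the averaging step are all hypothetical stand‑ins for whatever the world model actually provides.

```python
def reject_surprising(readings, surprise_fn, threshold):
    """Keep only readings whose surprise is at or below the threshold."""
    return [r for r in readings if surprise_fn(r) <= threshold]


def fuse(readings, surprise_fn, threshold, fallback):
    """Average the accepted readings; use the fallback if all are rejected."""
    kept = reject_surprising(readings, surprise_fn, threshold)
    if not kept:
        return fallback
    return sum(kept) / len(kept)


# Assumption: the world model predicts 1.0, and surprise is the absolute error
predicted = 1.0
surprise = lambda reading: abs(reading - predicted)

# Two plausible sensors and one faulty sensor (made-up values)
fused = fuse([1.1, 0.9, 7.5], surprise, threshold=0.5, fallback=predicted)
```

Here the faulty reading is discarded before fusion, and when every reading is rejected the agent falls back on the model's own prediction.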

Abstract:
Vision‑language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error‑inducing static feature biases is critical prior to real‑world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error‑inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TRoVe, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground‑truth annotations for learned static feature biases. We use this framework to demonstrate that TRoVe can accurately identify error‑inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TRoVe to 7 off‑the‑shelf VLMs and 2 temporal understanding tasks, surfacing previously‑unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time. Our code is available at https://github.com/Stanford‑AIMI/TRoVe.

Abstract:
Recent advances in generative world models have enabled remarkable progress in creating open‑ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in‑game interactions and player‑driven dynamics. To address these challenges, we introduce Hunyuan‑GameCraft‑2, a new paradigm of instruction‑driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large‑scale, unstructured text‑video pairs into causally aligned interactive datasets. Built upon a 14B image‑to‑video Mixture‑of‑Experts (MoE) foundation model, our model incorporates a text‑driven interaction injection mechanism for fine‑grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction‑focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free‑form user instructions such as "open the door", "draw a torch", or "trigger an explosion".

Abstract:
Generating minute‑long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi‑autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary‑length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV‑cache‑induced long‑horizon error accumulation, and (ii) the lack of fine‑grained long‑video benchmarks and coherence‑aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic‑aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk‑wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV‑Bench, a fine‑grained benchmark for minute‑long videos, complete with new metrics evaluating long‑range coherence. Extensive experiments on VBench and LV‑Bench demonstrate that BlockVid consistently outperforms existing methods in generating high‑quality, coherent minute‑long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV‑Bench over state‑of‑the‑art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba‑damo‑academy/Inferix.

Abstract:
The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overlap between thin pointers and scale markings. Up to now, this area still lacks large‑scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large‑scale benchmark dataset for dial reading, termed RPM‑10K, which contains 10730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision‑language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image‑level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world‑model perspectives. Through cross‑attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments on the newly proposed benchmark dataset validate the effectiveness of our framework. Both the dataset and source code will be released at https://github.com/Event‑AHU/DialBench.

Abstract:
Vision‑Language‑Action (VLA) policies excel in aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations, and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real‑robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post‑training via a learned world model and an RL procedure tailored to flow‑based action heads. Specifically, we introduce Prophet, a unified action‑to‑video robot actuation pretrained across large‑scale, heterogeneous robot data to learn reusable action‑outcome dynamics. It is able to few‑shot adapt to new robots, objects, and environments, yielding a rollout‑ready simulator. Upon Prophet, we reinforce action policies with Flow‑action‑GRPO (FA‑GRPO), which adapts Flow‑GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per‑step gradients in the flow head. Together, Prophet, FA‑GRPO, and FlowScale constitute ProphRL, a practical, data‑ and compute‑efficient path to VLA post‑training. Experiments show 5‑17% success gains on public benchmarks and 24‑30% gains on real robots across different VLA variants.

Abstract:
Normalizing flows (NFs) are end‑to‑end likelihood‑based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state‑of‑the‑art systems almost exclusively rely on diffusion‑based models. In this work, we revisit this design space by presenting STARFlow‑V, a normalizing flow‑based video generator with substantial benefits such as end‑to‑end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow‑V operates in the spatiotemporal latent space with a global‑local architecture which restricts causal dependencies to a global latent space while preserving rich local within‑frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow‑score matching, which equips the model with a light‑weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow‑V employs a video‑aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text‑to‑video, image‑to‑video as well as video‑to‑video generation tasks. Empirically, STARFlow‑V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion‑based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high‑quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml‑starflow.

Abstract:
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must be stylistically diverse, fine‑grained, and controllable. However, existing methods struggle to balance the creative flexibility offered by text‑based generation with the object‑level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language‑driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four‑stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language‑grounded editing agent that supports five object‑level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high‑quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state‑of‑the‑art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect that our framework can inspire new avenues of research in 3D city generation. Our project page: https://longhz140516.github.io/MajutsuCity/.

Abstract:
World Generation Models are emerging as a cornerstone of next‑generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high‑fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world‑realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition‑4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image‑to‑3D/4D, Video‑to‑4D, Text‑to‑3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality‑conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM‑as‑judge, MLLM‑as‑judge, and traditional network‑based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross‑modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

Abstract:
City‑scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality‑Aligned Intelligent Synthesis Engine that creates detailed, city‑scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real‑world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self‑reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real‑world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win‑rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.

Abstract:
Cambrian‑S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI‑Super‑Recall (VSR) and VSI‑Super‑Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian‑S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag‑of‑words SigLIP model, yet near‑perfectly solves VSR, achieving 95% accuracy even on 4‑hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian‑S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC‑Repeat: we concatenate each video with itself 1‑5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian‑S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object‑count predictions unchanged; instead, the Cambrian‑S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI‑Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive‑sensing inference recipes used by Cambrian‑S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian‑S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
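
The logic of the VSC‑Repeat check can be illustrated with a toy model. In the sketch below, a video is a list of segments, each listing the objects visible in it; this representation, the function names, and the "shortcut" counter are hypothetical and only mimic the failure mode described above, not the actual Cambrian‑S inference algorithm.

```python
def repeat_video(segments, times):
    """Concatenate a video with itself so it plays `times` times in total."""
    return segments * times


def count_unique(segments):
    """Count distinct objects across the whole video."""
    return len({obj for segment in segments for obj in segment})


def count_assuming_no_revisit(segments):
    """Toy shortcut: treat every segment as a new room and sum its objects."""
    return sum(len(set(segment)) for segment in segments)


# Made-up video with two segments and three distinct objects
video = [["chair", "lamp"], ["sofa"]]
repeated = repeat_video(video, 3)

true_count = count_unique(repeated)                  # unchanged by repetition
shortcut_count = count_assuming_no_revisit(repeated)  # inflated by repetition
```

On the original video both counters agree, but on the repeated video only the set‑based count stays fixed, which is the behavior the sanity check is designed to expose.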

Abstract:
Synthesizing high‑fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid‑Cylindrical‑Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio‑Temporal Attention with Ray‑Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud‑aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high‑resolution, and layout‑guided compositional generation. Comprehensive experiments validate LiSTAR's state‑of‑the‑art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean‑luna.github.io/LiSTAR.gitub.io.

Abstract:
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human‑like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game‑to‑Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look‑ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world‑model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics‑centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal‑driven reasoning, and even surpasses GPT‑5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero‑shot transfers to unseen games. These results support physics‑centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr‑1.

Abstract:
Vision‑language‑action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real‑world environments. In this work, we introduce NORA‑1.5, a VLA model built from the pre‑trained NORA backbone by adding a flow‑matching‑based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA‑1.5 to outperform NORA and several state‑of‑the‑art VLA models across both simulated and real‑world benchmarks. To further improve robustness and task success, we develop a set of reward models for post‑training VLA policies. Our rewards combine (i) an action‑conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation‑from‑ground‑truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA‑1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward‑driven post‑training consistently improves performance in both simulation and real‑robot settings, demonstrating significant VLA model‑reliability gains through simple yet effective reward models. Our findings highlight NORA‑1.5 and reward‑guided post‑training as a viable path toward more dependable embodied agents suitable for real‑world deployment.
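
One way to picture how reward signals could turn into preference data is sketched below. The abstract does not say how the two signals are combined, so the linear combination, its weight, the tuple layout, and the best‑versus‑worst pairing rule are all assumptions made for illustration.

```python
def combined_reward(goal_score, deviation, weight=1.0):
    """Assumed scoring rule: reward goal progress, penalize deviation."""
    return goal_score - weight * deviation


def build_preference_pair(candidates):
    """Pick the best and worst candidate as a (chosen, rejected) pair.

    `candidates` is a list of (action_id, goal_score, deviation) tuples,
    a hypothetical layout used only for this sketch.
    """
    ranked = sorted(candidates, key=lambda c: combined_reward(c[1], c[2]))
    return {"chosen": ranked[-1][0], "rejected": ranked[0][0]}


# Made-up candidates: (action id, world-model goal score, deviation)
pair = build_preference_pair([
    ("a1", 0.9, 0.10),
    ("a2", 0.4, 0.50),
    ("a3", 0.7, 0.05),
])
```

Pairs of this shape are the kind of input a preference‑optimization objective such as DPO consumes, with the chosen action preferred over the rejected one.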

Abstract:
World models have been developed to support sample‑efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high‑dimensional, non‑stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision‑making. Motivated by this insight, we propose Slot Transformer Imagination with CAusality‑aware reinforcement learning (STICA), a unified framework in which object‑centric Transformers serve as the world model and causality‑aware policy and value networks. STICA represents each observation as a set of object‑centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token‑level dynamics and interactions. The policy and value networks then estimate token‑level cause‑effect relations and use them in the attention layers, yielding causality‑guided decision‑making. Experiments on object‑rich benchmarks demonstrate that STICA consistently outperforms state‑of‑the‑art agents in both sample efficiency and final performance.

Abstract:
Data attribution for text‑to‑image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real‑world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning‑based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium‑scale models trained on MSCOCO and large‑scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, 2,500x to 400,000x faster than existing methods. Our work represents a meaningful step towards the large‑scale application of data attribution methods on real‑world models such as Stable Diffusion.

Abstract:
The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent‑environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long‑term temporal consistency, and goal‑driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real‑time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next‑generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up‑to‑date list of related works is maintained at this link.

Abstract:
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task‑conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object‑centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero‑shot generalizable robotic manipulation. Experiments on diverse real‑world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage at https://pointscoder.github.io/PhysWorld_Web/ for details.

Abstract:
Transformers replace recurrence with a memory that grows with sequence length and self‑attention that enables ad‑hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next‑Latent Prediction (NextLat), which extends standard next‑token training with self‑supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat effectively encourages transformers to form compact internal world models with coherent belief states and transition dynamics, crucial properties not guaranteed by standard next‑token prediction alone. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next‑token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable‑length self‑speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/microsoft/NextLat.

Abstract:
Large language model (LLM)‑based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre‑training and test‑time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment‑specific components like observation formats, and a semantic misunderstanding of state‑transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct strategies for adapting LLM agents by leveraging environment‑specific information from interaction that is available during deployment. First, an online syntactic alignment (SA) method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format. Second, a deployment‑time dynamics grounding (DG) method employs a persona‑driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with an in‑context world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM‑based agents. For example, on the WebArena multi‑site split, this method increases the agent's success rate from 2% to 23%. We release our code.

Abstract:
Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy's actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off‑policy with WOrld Model), a framework that tightly integrates planning and off‑policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood‑free alignment loss that bootstraps the policy using the planner's non‑parametric action distribution, combined with a soft value‑weighted mechanism that prioritizes high‑return behaviors and mitigates variability in the planner's action quality within the replay buffer. Experiments on the high‑dimensional DeepMind Control Suite and Humanoid‑Bench show that BOOM achieves state‑of‑the‑art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.

Abstract:
Augmenting vision‑language‑action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal‑STream diffusion (DUST), a world‑model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross‑modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross‑modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference‑time scaling. Experimental results on simulated benchmarks like RoboCasa and GR‑1 show that DUST achieves up to 6% gains over state‑of‑the‑art VLA and world‑modeling baselines, with inference‑time scaling providing an additional 2‑5% improvement. In real‑world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action‑free videos and joint‑training with heterogeneous robot and human datasets.

Abstract:
We introduce Emu3.5, a large‑scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre‑trained end‑to‑end with a unified next‑token prediction objective on a corpus of vision‑language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision‑language inputs and generates interleaved vision‑language outputs. Emu3.5 is further post‑trained with large‑scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token‑by‑token decoding into bidirectional parallel prediction, accelerating per‑image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long‑horizon vision‑language generation, any‑to‑image (X2I) generation, and complex text‑rich image generation. It also exhibits generalizable world‑modeling abilities, enabling spatiotemporally consistent world exploration and open‑world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open‑source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

Abstract:
3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low‑texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta‑world guided implicit‑structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high‑frequency details and rendering efficiency. By leveraging the Atlanta‑world model, we ensure the accurate surface reconstruction for low‑texture regions, while the proposed novel implicit‑structured GS representations provide smoothness without sacrificing efficiency and high‑frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state‑of‑the‑art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

Abstract:
We propose Avi, a novel 3D Vision‑Language‑Action (VLA) architecture that reframes robotic action generation as a problem of 3D perception and spatial reasoning, rather than low‑level policy learning. While existing VLA models primarily operate on 2D visual inputs and are trained end‑to‑end on task‑specific action policies, Avi leverages 3D point clouds and language‑grounded scene understanding to compute actions through classical geometric transformations. Most notably, Avi does not train on previous action tokens; rather, we build upon a 3D Multi‑modal Large Language Model (MLLM) to generate the next point cloud and explicitly calculate the actions through classical transformations. This approach enables generalizable behaviors that are robust to occlusions, camera pose variations, and changes in viewpoint. By treating the robotic decision‑making process as a structured reasoning task over 3D representations, Avi bridges the gap between high‑level language instructions and low‑level actuation without requiring opaque policy learning. Our preliminary results highlight the potential of 3D vision‑language reasoning as a foundation for scalable, robust robotic systems. Check it out at https://avi‑3drobot.github.io/.

Abstract:
We tackle the challenge of generating infinitely extendable 3D worlds: large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D‑lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object‑centric, limiting their applicability to scene‑level generation. Our key insight is leveraging strong generation priors from pre‑trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high‑quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context‑aware scene extension; and (3) a coarse‑to‑fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large‑scale 3D‑FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large‑scale virtual environments and potential for building future world models.

Abstract:
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task‑relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as "semantic" world models through a supervised finetuning process on image‑action‑text data, enabling planning for decision‑making while inheriting many of the generalization and robustness properties from the pretrained vision‑language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open‑ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction‑based action‑conditional world modeling. Website available at https://weirdlabuw.github.io/swm.

Abstract:
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action‑free future state forecasting scheme. Through collaborative state‑action prediction, PWM can mimic the human‑like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context‑guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state‑of‑the‑art approaches that rely on multi‑view and multi‑modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy‑World‑Model.

Abstract:
Recent advancements in driving world models enable controllable generation of high‑quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D‑aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine‑tuned to produce the edited, multi‑view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi‑view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large‑scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D‑aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm‑research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm‑research/Dream4Drive

Abstract:
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high‑quality long‑horizon auto‑regressive generation. For action, we introduce a normalized panoramic Plücker ray‑map representation that encodes input trajectories into pixel‑level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image‑based models: instead, we leverage the generated 3D occupancy to directly define rule‑based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state‑of‑the‑art performance in video generation, control accuracy, and long‑horizon stability, while providing a reliable closed‑loop evaluation framework through occupancy‑grounded rewards. Project page is available at https://arlo0o.github.io/OmniNWM/.
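
To make the ray-map idea concrete, the following sketch computes a per-pixel Plücker ray map for a single pinhole camera. It shows only the textbook construction, in which a ray with unit direction d through camera center o is encoded as (d, o × d). The normalization and panoramic multi-camera layout used by OmniNWM are not described in the abstract and are not reproduced here; the function names and the pinhole parameters (`fx`, `fy`, `cx`, `cy`) are assumptions for illustration.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def matvec(R, v):
    """Multiply a 3x3 matrix (nested sequences) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """6-D Plücker coordinates (direction, moment) of the ray through pixel (u, v).

    R, t: camera-to-world rotation and camera center in world coordinates.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)  # back-project the pixel
    d = normalize(matvec(R, d_cam))              # unit direction in world frame
    m = cross(t, d)                              # moment o x d
    return d + m


def plucker_ray_map(width, height, fx, fy, cx, cy, R, t):
    """Ray map of shape [height][width], one 6-tuple per pixel center."""
    return [
        [plucker_ray(u + 0.5, v + 0.5, fx, fy, cx, cy, R, t) for u in range(width)]
        for v in range(height)
    ]


# Usage: a camera at (1, 0, 0) looking down +z with identity rotation.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ray_map = plucker_ray_map(2, 2, 1.0, 1.0, 1.0, 1.0, IDENTITY, (1.0, 0.0, 0.0))
```

Moving the camera along a trajectory changes `R` and `t` per frame, so a sequence of such maps encodes the trajectory as a pixel-aligned signal.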

Abstract:
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open‑loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World‑in‑World, the first open platform that benchmarks WMs in a closed‑loop world that mirrors real agent‑environment interactions. World‑in‑World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed‑loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post‑training with action‑observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference‑time compute allows WMs to substantially improve closed‑loop performance.

Abstract:
Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their "in‑place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios. In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range‑Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal‑spatial associations to enable extended‑range perception. To effectively capture the dynamics of the scene, we design a State‑Conditioned Forecasting module, which replaces classification‑based forecasting with a regression‑guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, we devise a Temporal‑Aware Self‑Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state‑of‑the‑art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency.

Abstract:
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three‑axis taxonomy encompassing: (1) Functionality, Decision‑Coupled vs. General‑Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state‑level understanding, and task performance. Furthermore, we offer a quantitative comparison of state‑of‑the‑art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade‑off between model performance and the computational efficiency required for real‑time control, and the core modeling difficulty of achieving long‑horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at https://github.com/Li‑Zn‑H/AwesomeWorldModels.

Abstract:
A major bottleneck in off‑road autonomous driving research lies in the scarcity of large‑scale, high‑quality datasets and benchmarks. To bridge this gap, we present ORAD‑3D, which, to the best of our knowledge, is the largest dataset specifically curated for off‑road autonomous driving. ORAD‑3D covers a wide spectrum of terrains, including woodlands, farmlands, grasslands, riversides, gravel roads, cement roads, and rural areas, while capturing diverse environmental variations across weather conditions (sunny, rainy, foggy, and snowy) and illumination levels (bright daylight, daytime, twilight, and nighttime). Building upon this dataset, we establish a comprehensive suite of benchmark evaluations spanning five fundamental tasks: 2D free‑space detection, 3D occupancy prediction, rough GPS‑guided path planning, vision‑language model‑driven autonomous driving, and world model for off‑road environments. Together, the dataset and benchmarks provide a unified and robust resource for advancing perception and planning in challenging off‑road scenarios. The dataset and code will be made publicly available at https://github.com/chaytonmin/ORAD‑3D.

Abstract:
World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel‑aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point‑to‑Gaussian variational autoencoder (P2G‑VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi‑view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state‑of‑the‑art performance in both reconstruction and generation with high 3D consistency.

Abstract:
Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics‑awareness. Specifically, PhysMaster is based on the image‑to‑video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end‑to‑end manner. PhysMaster provides a feasible solution for improving physics‑awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide‑ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug‑in solution for physics‑aware video generation and broader applications.

Abstract:
End‑to‑end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two‑stage paradigm (IL pretraining followed by RL fine‑tuning), we propose CoIRL‑AD, a competitive dual‑policy framework that enables IL and RL agents to interact during training. CoIRL‑AD introduces a competition‑based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long‑tail scenarios. Code is available at: https://github.com/SEU‑zxj/CoIRL‑AD.

Abstract:
Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally‑activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition‑effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter‑OO, our reimplementation of the Crafter environment that exposes a structured, object‑oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
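
The precondition‑effect structure described above can be illustrated with a minimal Python sketch. This is not OneLife's implementation: the `Law` class, the two example laws, the state variables, and the probabilities are all hypothetical, and the "later law overwrites earlier law" merge is a simplification standing in for inference in a probabilistic program. The sketch only shows how prediction can be routed through the laws whose preconditions hold in the current state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]
Distribution = Dict[int, float]  # next value -> probability


@dataclass
class Law:
    """A hypothetical programmatic law: it fires only when its precondition holds."""
    name: str
    precondition: Callable[[State], bool]
    effect: Callable[[State], Dict[str, Distribution]]


def active_laws(laws: List[Law], state: State) -> List[Law]:
    # Only laws relevant to the current state take part in prediction.
    return [law for law in laws if law.precondition(state)]


def predict(laws: List[Law], state: State) -> Dict[str, Distribution]:
    # Default assumption: a variable no active law mentions keeps its value.
    prediction = {var: {value: 1.0} for var, value in state.items()}
    for law in active_laws(laws, state):
        # Simplification: a later active law overwrites an earlier one.
        prediction.update(law.effect(state))
    return prediction


# Illustrative laws with made-up probabilities.
laws = [
    Law("zombie_attack",
        lambda s: s["near_zombie"] == 1,
        lambda s: {"health": {s["health"] - 2: 0.7, s["health"]: 0.3}}),
    Law("collect_wood",
        lambda s: s["near_tree"] == 1,
        lambda s: {"wood": {s["wood"] + 1: 1.0}}),
]

state = {"health": 5, "near_zombie": 1, "near_tree": 0, "wood": 0}
active_names = [law.name for law in active_laws(laws, state)]
prediction = predict(laws, state)
```

In this toy state only `zombie_attack` is active, so the `wood` prediction is left at its default and the inactive law contributes nothing.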

Abstract:
Ensuring safety in autonomous driving (AD) remains a significant challenge, especially in highly dynamic and complex traffic environments where diverse agents interact and unexpected hazards frequently emerge. Traditional reinforcement learning (RL) methods often struggle to balance safety, efficiency, and adaptability, as they primarily focus on reward maximization without explicitly modeling risk or safety constraints. To address these limitations, this study proposes a novel game‑theoretic risk‑shaped RL (GTR2L) framework for safe AD. GTR2L incorporates a multi‑level game‑theoretic world model that jointly predicts the interactive behaviors of surrounding vehicles and their associated risks, along with an adaptive rollout horizon that adjusts dynamically based on predictive uncertainty. Furthermore, an uncertainty‑aware barrier mechanism enables flexible modulation of safety boundaries. A dedicated risk modeling approach is also proposed, explicitly capturing both epistemic and aleatoric uncertainty to guide constrained policy optimization and enhance decision‑making in complex environments. Extensive evaluations across diverse and safety‑critical traffic scenarios show that GTR2L significantly outperforms state‑of‑the‑art baselines, including human drivers, in terms of success rate, collision and violation reduction, and driving efficiency. The code is available at https://github.com/DanielHu197/GTR2L.

Abstract:
Learned world models hold significant potential for robotic manipulation, as they can serve as simulators for real‑world interactions. While extensive progress has been made in 2D video‑based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three‑dimensional information, we propose MMTokenizer, which unifies multi‑modal inputs into a compact token representation. This design enables iMoWM to leverage large‑scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi‑modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model‑based reinforcement learning (MBRL) and facilitates real‑world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi‑modal world modeling for robotic manipulation. Homepage: https://xingyoujun.github.io/imowm/

Abstract:
Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation. Yet, state‑of‑the‑art systems typically rely on modular designs that decouple navigation planning from visual world modeling, which often induces state‑action misalignment and weak adaptability in novel or dynamic scenarios. We propose UniWM, a unified, memory‑augmented world model that integrates egocentric visual foresight and planning within a single multimodal autoregressive backbone. UniWM explicitly grounds action selection in visually imagined outcomes, tightly aligning prediction with control. Meanwhile, a hierarchical memory mechanism fuses short‑term perceptual cues with longer‑term trajectory context, supporting stable and coherent reasoning over extended horizons. Extensive experiments on four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) and the 1X Humanoid Dataset show that UniWM improves navigation success rates by up to 30%, substantially reduces trajectory errors against strong baselines, generalizes zero‑shot to the unseen TartanDrive dataset, and scales naturally to high‑dimensional humanoid control. These results position UniWM as a principled step toward unified, imagination‑driven embodied navigation. The code and models are available at https://github.com/F1y1113/UniWM.

Abstract:
The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics‑grounded methods enhance AI's real‑world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next‑generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome‑AI‑for‑Physics.

Abstract:
World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text‑to‑video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language‑guided framework that generates 4D scenes with multi‑view consistency and object‑level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory‑guided generation with feature field distillation, allowing edits to be applied interactively without full re‑generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric‑ai‑lab/Morph4D.

Abstract:
Prevailing Video‑to‑Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame‑level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end‑to‑end causality and targets low per‑frame latency with audio‑visual synchronization. Our model's backbone is a decoder‑only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end‑to‑end causality and efficiency. The model is trained through a diffusion pre‑training followed by consistency fine‑tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high‑quality full‑band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per‑frame waveform‑level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi‑saito‑sony.github.io/soundreactor/.

Abstract:
Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long‑horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine‑grained view control, then evolves the scene's 3D reconstruction using a feedforward plug‑and‑play transformer, and finally synthesizes futures by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state‑of‑the‑art methods that synthesize videos only, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long‑range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real‑world scenarios, with particular emphasis on loop‑closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long‑horizon spatially consistent world modeling.

Abstract:
Vision‑Language‑Action (VLA) models enable embodied decision‑making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real‑world interactions or suffers from sim‑to‑real gaps. We introduce VLA‑RFT, a reinforcement fine‑tuning framework that leverages a data‑driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory‑level rewards derived from goal‑achieving references. This design delivers an efficient and action‑aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine‑tuning steps, VLA‑RFT surpasses strong supervised baselines and achieves greater efficiency than simulator‑based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world‑model‑based RFT as a practical post‑training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla‑rft.github.io/.

Abstract:
Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high‑quality, low‑latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long‑horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long‑term global consistency. Third, we design an efficient training algorithm that enables few‑step distillation over largely extended denoising windows. This algorithm operates on non‑overlapping windows and mitigates exposure bias conditioned on self‑generated histories. Extensive experiments show that Rolling Forcing enables real‑time streaming generation of multi‑minute videos on a single GPU, with substantially reduced error accumulation.
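
The rolling window with progressively increasing noise levels can be pictured with a small bookkeeping sketch in Python. This is an illustrative assumption about how such a schedule might be tracked, not the Rolling Forcing implementation: there is no denoiser, attention sink, or distillation here, and the function name, integer noise units, and window size are all hypothetical. Each step lowers every frame's noise level by one unit, emits the frame that has become clean, and appends a fresh pure‑noise frame.

```python
from collections import deque
from typing import List, Tuple


def rolling_schedule(window: int, num_steps: int) -> Tuple[List[int], List[int], List[int]]:
    """Track frame ids and noise levels (in units of 1/window) over a rolling window."""
    levels = deque(range(1, window + 1))  # oldest frame is the least noisy
    frame_ids = deque(range(window))
    next_id = window
    emitted = []
    for _ in range(num_steps):
        # One joint denoising step: every frame in the window gets one unit cleaner.
        levels = deque(level - 1 for level in levels)
        # The oldest frame is now fully denoised and leaves the window.
        assert levels[0] == 0
        levels.popleft()
        emitted.append(frame_ids.popleft())
        # A new frame enters at the maximum noise level.
        levels.append(window)
        frame_ids.append(next_id)
        next_id += 1
    return emitted, list(frame_ids), list(levels)


emitted, in_window, noise_levels = rolling_schedule(window=4, num_steps=3)
```

After every step the window returns to the same staircase of noise levels, so in this sketch one clean frame is emitted per joint denoising step.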

Abstract:
Vision‑Language‑Action (VLA) models trained via imitation learning suffer from significant performance degradation in data‑scarce scenarios due to their reliance on large‑scale demonstration datasets. Although reinforcement learning (RL)‑based post‑training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non‑resettable nature of real‑world environments. This limitation is particularly critical in high‑risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World‑Env, an RL‑based post‑training framework that replaces physical interaction with a low‑cost world model‑based virtual simulator. World‑Env consists of two key components: (1) a physically‑consistent world simulator that generates temporally consistent future visual observations, and (2) a vision‑language model (VLM)‑guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World‑Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real‑world interaction, offering a practical and scalable solution for post‑training in resource‑constrained settings. Our code is available at https://github.com/amap‑cvlab/world‑env.

Abstract:
A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high‑performance agents often demands extensive environmental interactions. Model‑based reinforcement learning (MBRL) mitigates this by building world models that simulate environmental dynamics and generate synthetic experience, improving sample efficiency. However, conventional world models process observations holistically, failing to decouple dynamic objects and temporal features from static backgrounds. This approach is computationally inefficient, especially for visual tasks where dynamic objects significantly influence rewards and decision‑making performance. To address this, we introduce DyMoDreamer, a novel MBRL algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. DyMoDreamer employs differential observations derived from a novel inter‑frame differencing mask, explicitly encoding object‑level motion cues and temporal dynamics. Dynamic modulation is modeled as stochastic categorical distributions and integrated into a recurrent state‑space model (RSSM), enhancing the model's focus on reward‑relevant dynamics. Experiments demonstrate that DyMoDreamer sets a new state‑of‑the‑art on the Atari 100k benchmark with a 156.6% mean human‑normalized score, establishes a new record of 832 on the DeepMind Visual Control Suite, and gains a 9.5% performance improvement after 1M steps on the Crafter benchmark. Our code is released at https://github.com/Ultraman‑Tiga1/DyMoDreamer.
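
As a rough illustration of inter‑frame differencing, the following Python sketch builds a binary mask from two consecutive grayscale frames and uses it to keep only the pixels that changed. The threshold value, the binary form of the mask, and the function names are assumptions made for this example; the mask used by DyMoDreamer may be defined differently.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of intensities in [0, 1]


def differencing_mask(prev: Frame, curr: Frame, threshold: float = 0.1) -> List[List[int]]:
    """Return 1 where a pixel changed by more than `threshold`, else 0."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def differential_observation(prev: Frame, curr: Frame, threshold: float = 0.1) -> Frame:
    """Keep current-frame pixels only where motion was detected."""
    mask = differencing_mask(prev, curr, threshold)
    return [
        [c * m for c, m in zip(curr_row, mask_row)]
        for curr_row, mask_row in zip(curr, mask)
    ]


# A bright pixel moves one step to the right between frames.
prev_frame = [[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
curr_frame = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]]

mask = differencing_mask(prev_frame, curr_frame)
diff_obs = differential_observation(prev_frame, curr_frame)
```

The mask marks both the vacated and the newly occupied pixel, while the static background is zeroed out of the differential observation.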

Abstract:
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image, eliminating the need for multi‑stage pipelines or auxiliary modalities. Building upon this foundation, PoseDiff extends naturally to video‑to‑action inverse dynamics: by conditioning on sparse video keyframes generated by world models, it produces smooth and continuous long‑horizon action sequences through an overlap‑averaging strategy. This unified design enables scalable and efficient integration of perception and control. On the DREAM dataset, PoseDiff achieves state‑of‑the‑art accuracy and real‑time performance for pose estimation. On Libero‑Object manipulation tasks, it substantially improves success rates over existing inverse dynamics modules, even under strict offline settings. Together, these results show that PoseDiff provides a scalable, accurate, and efficient bridge between perception, planning, and control in embodied AI. The video visualization results can be found on the project page: https://haozhuo‑zhang.github.io/PoseDiff‑project‑page/.
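
One plausible reading of an overlap‑averaging strategy is sketched below in Python: action chunks predicted for overlapping time windows are merged by averaging every timestep that more than one chunk covers. The fixed stride, equal chunk lengths, scalar actions, and the function name are assumptions for illustration; the abstract does not specify how PoseDiff aligns or weights its chunks.

```python
from typing import List


def overlap_average(chunks: List[List[float]], stride: int) -> List[float]:
    """Merge equal-length action chunks, where chunk i starts at timestep i * stride."""
    chunk_len = len(chunks[0])
    if not 1 <= stride <= chunk_len:
        raise ValueError("stride must be in [1, chunk_len] so that no timestep is left uncovered")
    total = stride * (len(chunks) - 1) + chunk_len
    sums = [0.0] * total
    counts = [0] * total
    for i, chunk in enumerate(chunks):
        for j, action in enumerate(chunk):
            sums[i * stride + j] += action
            counts[i * stride + j] += 1
    # Timesteps covered by several chunks receive the mean of their predictions.
    return [s / c for s, c in zip(sums, counts)]


# Two chunks of four scalar actions, overlapping on timesteps 2 and 3.
chunks = [[0.0, 1.0, 2.0, 3.0],
          [2.0, 4.0, 4.0, 5.0]]
trajectory = overlap_average(chunks, stride=2)
```

Averaging in the overlap region smooths the seam between consecutive chunks, which is the effect the sketch is meant to show.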

Abstract:
Vision‑Language‑Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short‑sighted next‑action prediction and struggle with long‑horizon trajectory tasks due to incremental deviations. To address this problem, we propose a plug‑in framework named VLA‑Reasoner that effectively empowers off‑the‑shelf VLAs with the capability of foreseeing future states via test‑time scaling. Specifically, VLA‑Reasoner samples and rolls out possible action trajectories, using the sampled actions as rationales to generate future states via a world model, which enables VLA‑Reasoner to foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where step‑wise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline value estimation strategy to score predicted futures and correct deviations with long‑term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that our proposed VLA‑Reasoner achieves significant improvements over state‑of‑the‑art VLAs. Our method highlights a potential pathway toward scalable test‑time computation for robotic manipulation. The project website is available at: https://vla‑reasoner.github.io/.

Abstract:
Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel‑level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion‑aware representation, but overlook the fine‑grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture‑of‑world‑model framework that fuses representations from hybrid world models for embodied action planning. Our approach combines motion‑aware latent world model features with pixel‑space features, enabling MoWM to emphasize action‑relevant visual details for action decoding. Extensive evaluations on the CALVIN benchmark and real‑world manipulation tasks demonstrate that our method achieves state‑of‑the‑art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua‑fib‑lab/MoWM.

Abstract:
Video‑based world models hold significant potential for generating high‑quality embodied manipulation data. However, current video generation methods struggle to achieve stable long‑horizon generation: classical diffusion‑based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra‑chunk diffusion denoising with inter‑chunk autoregressive causal generation. Our core innovation is an action‑guided, variable‑length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context‑aware Mixture‑of‑Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long‑horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua‑fib‑lab/Longscape.

Abstract:
Model‑based reinforcement learning (MBRL) has achieved remarkable success in robotics due to its high sample efficiency and planning capability. However, extending MBRL to physical multi‑robot cooperation remains challenging due to the complexity of joint dynamics. To address this challenge, we propose the Sequential World Model (SeqWM), a novel framework that integrates the sequential paradigm into multi‑robot MBRL. SeqWM employs independent, autoregressive agent‑wise world models to represent joint dynamics, where each agent generates its future trajectory and plans its actions based on the predictions of its predecessors. This design lowers modeling complexity and enables the emergence of advanced cooperative behaviors through explicit intention sharing. Experiments on Bi‑DexHands and Multi‑Quadruped demonstrate that SeqWM outperforms existing state‑of‑the‑art model‑based and model‑free baselines in both overall performance and sample efficiency, while exhibiting advanced cooperative behaviors such as predictive adaptation, temporal alignment, and role division. Furthermore, SeqWM has been successfully deployed on physical quadruped robots, validating its effectiveness in real‑world multi‑robot systems. Demos and code are available at: https://github.com/zhaozijie2022/seqwm

Abstract:
The field of 4D world modeling ‑ aiming to jointly capture spatial geometry and temporal dynamics ‑ has witnessed remarkable progress in recent years, driven by advances in large‑scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high‑quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi‑domain diversity, and spatial‑temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera‑control video generation. To address this gap, we introduce OmniWorld, a large‑scale, multi‑domain, multi‑modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld‑Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld‑Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state‑of‑the‑art (SOTA) approaches in modeling complex 4D environments. Moreover, fine‑tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general‑purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.

Abstract:
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB‑D imagery, occupancy grids, and LiDAR point clouds for large‑scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video‑based (VideoGen), occupancy‑based (OccGen), and LiDAR‑based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/awesome‑3d‑4d‑world‑models

Abstract:
In heterogeneous multi‑task decision‑making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi‑task world models like UniZero excel in single‑task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture‑of‑Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task‑specific representations to specialized sub‑networks. This finding leads to our proposed model, ScaleZero. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task‑specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single‑task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi‑task planning. Our code is available at https://github.com/opendilab/LightZero.

Abstract:
Data collection is crucial for learning robust world models in model‑based reinforcement learning. The most prevalent strategies are to actively collect trajectories by interacting with the environment during online training or training on offline datasets. At first glance, the nature of learning task‑agnostic environment dynamics makes world models a good candidate for effective offline training. However, the effects of online vs. offline data on world models and thus on the resulting task performance have not been thoroughly studied in the literature. In this work, we investigate both paradigms in model‑based settings, conducting experiments on 31 different environments. First, we showcase that online agents outperform their offline counterparts. We identify a key challenge behind performance degradation of offline agents: encountering Out‑Of‑Distribution states at test time. This issue arises because, without the self‑correction mechanism in online agents, offline datasets with limited state space coverage induce a mismatch between the agent's imagination and real rollouts, compromising policy training. We demonstrate that this issue can be mitigated by allowing for additional online interactions in a fixed or adaptive schedule, restoring the performance of online training with limited interaction data. We also showcase that incorporating exploration data helps mitigate the performance degradation of offline agents. Based on our insights, we recommend adding exploration data when collecting large datasets, as current efforts predominantly focus on expert data alone.

Abstract:
Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real‑world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non‑stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well‑defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic‑ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five‑type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state‑of‑the‑art model‑based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.

Abstract:
Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low‑latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large‑scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64× reduction ratio, which effectively alleviates the long‑horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine‑grained multimodal controllability.

Abstract:
Many Vision‑Language‑Action (VLA) models are built upon an internal world model trained via next‑frame prediction $v_t \rightarrow v_{t+1}$. However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. This lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as $v_t \rightarrow f_t \rightarrow v_{t+1}$, where $f_t$ is an intermediate optical flow prediction that inherently encodes motion. By forcing the model to first follow the motion plan encoded by $f_t$, this process inherently aligns the pre‑training objective of dynamics prediction with the downstream task of action generation. We conduct experiments on challenging robotics manipulation benchmarks, as well as real‑robot evaluations. Our FlowVLA not only generates more coherent and physically plausible visual predictions, but also achieves state‑of‑the‑art policy performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling in VLAs. Project page: https://irpn‑lab.github.io/FlowVLA/

Abstract:
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human‑like ability to decompose visual scenes, ground intermediate concepts, and perform multi‑step logical inference. While early surveys focus on monolithic vision‑language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five‑stage paradigm shift: from prompt‑enhanced language‑centric pipelines, through tool‑enhanced LLMs and tool‑enhanced VLMs, to recently minted chain‑of‑thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain‑of‑thought faithfulness, and high‑resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM‑based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world‑model integration, human‑AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.

Abstract:
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule‑based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision‑Language‑Action (VLA) models, built upon Large Vision‑Language Models (VLMs) pretrained on vast image‑text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy‑oriented review of large VLM‑based VLA models for robotic manipulation. We begin by clearly defining large VLM‑based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single‑system and dual‑system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in‑depth examination of large VLM‑based VLA models: (1) integration with advanced domains, including reinforcement learning, training‑free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi‑agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian‑VL/Large‑VLM‑based‑VLA‑for‑Robotic‑Manipulation

Abstract:
Partial observability presents a significant challenge for Safe Reinforcement Learning (Safe RL), as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information in Safe RL. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer (PIGDreamer), a model‑based RL approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor‑critic structure. Our empirical results demonstrate that PIGDreamer significantly outperforms existing Safe RL methods. Furthermore, compared to alternative privileged RL methods, our approach exhibits enhanced performance, robustness, and efficiency. Codes are available at: https://github.com/hggforget/PIGDreamer.

Abstract:
What does it mean to plan? Current agentic systems, whether scaffolded workflows or end‑to‑end policies, rely on reactive decision‑making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain‑of‑thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re‑engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal‑directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general‑purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern‑matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal‑oriented architecture instantiating simulative reasoning using an LLM‑based world model with natural‑language belief states, while remaining model‑agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi‑hop information aggregation, and general instruction following, in a web‑browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open‑web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task‑specific tuning.

Abstract:
Segments in computer vision are often defined by semantic considerations and are highly dependent on category‑specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects: groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category‑agnostic causal motion relationships, which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well‑defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected‑displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied on regions of high motion‑affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off‑the‑shelf object manipulation models.

Abstract:
The performance of learned robot visuomotor policies is heavily dependent on the size and quality of the training dataset. Although large‑scale robot and human datasets are increasingly available, embodiment gaps and mismatched action spaces make them difficult to leverage. Our main insight is that skills performed across different embodiments produce visual similarities in motions that can be captured using off‑the‑shelf action representations such as optical flow. Moreover, World Models (WMs) can leverage sub‑optimal data since they focus on modeling dynamics. In this work, we aim to improve visuomotor policies in low‑data regimes by first pretraining a WM using optical flow as an embodiment‑agnostic action representation to leverage accessible or easily collected data from multiple embodiments (robots, humans). Given a small set of demonstrations on a target embodiment, we finetune the WM on this data to better align the WM predictions, train a base policy, and learn a robust value function. Using our finetuned WM and value function, our approach evaluates action candidates from the base policy and selects the best one to improve performance. Our approach, which we term Latent Policy Steering (LPS), improves behavior‑cloned policies by 10.6% on average across four Robomimic tasks, even though most of the pretraining data comes from the real world. In the real‑world experiments, LPS achieves larger gains: 70% relative improvement with 30‑50 target‑embodiment demonstrations, and 44% relative improvement with 60‑100 demonstrations, compared to a behavior‑cloned baseline. Qualitative results can be found on the website: https://yiqiwang8177.github.io/LatentPolicySteering/.
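
The candidate-selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `base_policy`, `world_model`, `value_fn`, and `steer` are hypothetical names, and the toy 1-D functions stand in for the learned base policy, the finetuned WM, and the value function.

```python
import random

def base_policy(state, rng):
    """Hypothetical stochastic base policy: proposes a random action in [-1, 1]."""
    return rng.uniform(-1.0, 1.0)

def world_model(state, action):
    """Hypothetical one-step dynamics: the action shifts the scalar state."""
    return state + action

def value_fn(state, goal=0.0):
    """Hypothetical value: higher when the predicted state is closer to the goal."""
    return -abs(state - goal)

def steer(state, num_candidates=16, seed=0):
    """Sample action candidates, score each predicted next state, return the best action."""
    rng = random.Random(seed)
    candidates = [base_policy(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda a: value_fn(world_model(state, a)))
```

Calling `steer(0.5)` returns the sampled action whose predicted next state scores highest under the toy value function. In the paper's setting the prediction and scoring would presumably happen in the WM's latent space.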

Abstract:
Existing world models for autonomous driving struggle with long‑horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state‑of‑the‑art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side‑by‑side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb‑freiburg.github.io/orbis.github.io/.

Abstract:
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state‑of‑the‑art vision‑language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test‑time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi‑view evidence gathered during the interactive exploration. Without any fine‑tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test‑time scaling offers a simple, plug‑and‑play route to robust 3D reasoning. Our method also improves upon test‑time inference VLMs trained through reinforcement learning, demonstrating the potential of world models for test‑time scaling.

Abstract:
World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi‑modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph‑structured states with multi‑modal information and represents diverse tasks as actions. The core of a GWM is a generic message‑passing algorithm to aggregate structured information, either over a unified multi‑modal token space by converting multi‑modal data into text (GWM‑T) or a unified multi‑modal embedding space by modality‑specific encoders (GWM‑E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi‑modal generation and matching, recommendation, graph prediction, multi‑agent, retrieval‑augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain‑specific baselines' performance, benefits from multi‑hop structures, and demonstrates strong zero‑shot/few‑shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab‑uiuc/GWM.

Abstract:
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy‑based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose I^2‑World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra‑scene and inter‑scene tokenizers. The intra‑scene tokenizer employs a multi‑scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter‑scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder‑only GPT‑style autoregressive models, I^2‑World adopts an encoder‑decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high‑level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that I^2‑World achieves state‑of‑the‑art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real‑time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II‑World.
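
To illustrate the residual quantization idea mentioned in this abstract, here is a minimal sketch on a single scalar. It shows only the generic principle (each stage quantizes what the previous stages left over) and is not the paper's multi-scale 3D tokenizer. The names `residual_quantize` and `reconstruct` and the uniform step sizes are assumptions made for the example.

```python
def residual_quantize(x, step_sizes):
    """Quantize x in stages; each stage rounds the remaining residual to its step size."""
    codes = []
    residual = x
    for step in step_sizes:
        code = round(residual / step)   # integer code for this stage
        codes.append(code)
        residual -= code * step         # what is left for finer stages
    return codes, residual

def reconstruct(codes, step_sizes):
    """Sum the per-stage contributions to approximate the original value."""
    return sum(code * step for code, step in zip(codes, step_sizes))
```

With coarse-to-fine step sizes such as `[1.0, 0.25, 0.0625]`, the reconstruction error is bounded by half of the finest step, so later stages add detail without changing the coarse codes.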

Abstract:
Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general‑purpose models, we ask whether frozen self‑supervised video models trained only to predict future frames can be prompted, without fine‑tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine‑tuning; that strategy is ill‑suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim‑to‑real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point‑wise correspondences by injecting a small tracer perturbation into a next‑frame predictor and tracking its propagation, we extend this idea to generative video models for zero‑shot flow extraction. We explore several popular architectures and find that successful zero‑shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio‑temporal patch independently; and (3) random‑access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL‑tracing: a novel test‑time inference procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback‑Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow‑specific fine‑tuning, our method is competitive with state‑of‑the‑art, task‑specific models on the real‑world TAP‑Vid DAVIS benchmark and the synthetic TAP‑Vid Kubric. Our results show that counterfactual prompting of controllable generative video models is an effective alternative to supervised or photometric‑loss methods for high‑quality flow.
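
The KL-tracing procedure described in this abstract can be sketched on toy data as follows. This is an illustrative sketch, not the authors' code: the dictionaries of per-location categorical distributions stand in for the model's predictive distributions, `kl_divergence` and `trace_flow` are hypothetical names, and the direction KL(perturbed || unperturbed) is an assumption.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def trace_flow(clean_pred, perturbed_pred, source_xy):
    """Return the displacement from source_xy to the location with the largest KL.

    clean_pred / perturbed_pred map (x, y) -> categorical distribution predicted
    for the next frame without / with the tracer perturbation at source_xy.
    """
    kl_map = {xy: kl_divergence(perturbed_pred[xy], clean_pred[xy]) for xy in clean_pred}
    target = max(kl_map, key=kl_map.get)   # where the perturbation ended up
    return (target[0] - source_xy[0], target[1] - source_xy[1])
```

If the two predictions differ only at the location one pixel to the right of the perturbed point, `trace_flow` returns `(1, 0)`, meaning the tracer moved one pixel in x.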

Abstract:
How to enable agents to predict the outcomes of their own motion intentions in three‑dimensional space has been a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six‑degree‑of‑freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video‑intention pairs. This dataset includes first‑person‑view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two‑phase schedule to train a foundation model, initially devoid of embodied spatial knowledge, into a world model that is controllable by motion intentions and adheres to physical spatio‑temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination capabilities, especially with over a 50% improvement in metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.

Abstract:
Bird's‑Eye View (BEV) semantic segmentation is an indispensable perception task in end‑to‑end autonomous driving systems. Unsupervised and semi‑supervised learning for BEV tasks, which is pivotal for real‑world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise‑resilient learning framework for BEV semantic segmentation. Specifically, a Perspective‑Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi‑Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non‑mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state‑of‑the‑art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi‑supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn‑yu/NRSeg.
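
The Dirichlet branch mentioned in this abstract relies on evidential deep learning for uncertainty quantification. The sketch below shows the common subjective-logic formulation, in which non-negative per-class evidence parameterizes a Dirichlet distribution. Whether NRSeg uses exactly this form is an assumption, and `dirichlet_prediction` is a hypothetical name.

```python
def dirichlet_prediction(evidence):
    """Map non-negative per-class evidence to expected probabilities and uncertainty."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]       # Dirichlet concentration parameters
    strength = sum(alpha)                      # Dirichlet strength S
    probs = [a / strength for a in alpha]      # expected class probabilities
    uncertainty = len(evidence) / strength     # K / S, equals 1 when there is no evidence
    return probs, uncertainty
```

With zero evidence the prediction is uniform and the uncertainty is 1. As evidence for one class accumulates, its probability rises and the uncertainty shrinks.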

Abstract:
We present a novel approach to knowledge transfer in model‑based reinforcement learning, addressing the critical challenge of deploying large world models in resource‑constrained environments. Our method efficiently distills a high‑capacity multi‑task agent (317M parameters) into a compact model (1M parameters) on the MT30 benchmark, significantly improving performance across diverse tasks. Our distilled model achieves a state‑of‑the‑art normalized score of 28.45, surpassing the original 1M parameter model score of 18.93. This improvement demonstrates the ability of our distillation technique to capture and consolidate complex multi‑task knowledge. We further optimize the distilled model through FP16 post‑training quantization, reducing its size by approximately 50%. Our approach addresses practical deployment limitations and offers insights into knowledge representation in large world models, paving the way for more efficient and accessible multi‑task reinforcement learning systems in robotics and other resource‑constrained applications. Code available at https://github.com/dmytro‑kuzmenko/td‑mpc‑opt.
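
The size reduction from FP16 post-training quantization follows from storage arithmetic: each weight goes from 4 bytes (FP32) to 2 bytes (FP16). The sketch below checks this with Python's standard `struct` module on a hypothetical list of weights. It covers weight storage only and is not tied to the authors' framework or checkpoint format.

```python
import struct

def packed_size(weights, fmt):
    """Bytes needed to store the weights with struct format 'f' (FP32) or 'e' (FP16)."""
    return len(struct.pack(f"<{len(weights)}{fmt}", *weights))

def to_fp16(weights):
    """Round-trip each weight through IEEE 754 half precision."""
    packed = struct.pack(f"<{len(weights)}e", *weights)
    return list(struct.unpack(f"<{len(weights)}e", packed))
```

For any list of weights, `packed_size(weights, "e")` is exactly half of `packed_size(weights, "f")`. Values such as 0.1 that are not exactly representable in half precision pick up a small rounding error.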

Abstract:
Recent advances in vision‑language models (VLMs) have enabled robots to follow open‑ended instructions and demonstrate impressive commonsense reasoning. However, current vision‑language‑action (VLA) frameworks primarily rely on static representations and limited temporal context, restricting agents to short‑horizon, reactive behaviors and hindering robust generalization in dynamic embodied environments. Inspired by cognitive neuroscience theories of episodic memory, we propose, to our knowledge, one of the first formalized episodic world models in VLA, enabling embodied robots to accumulate, recall, and predict sequential experiences. As an instantiation of this concept, our unified TriVLA realizes the episodic world model through a triple‑system architecture: integrating multimodal grounding from a pretrained VLM (System 2) and temporally rich dynamics perception from a video diffusion model (System 3). This enables the agent to accumulate and recall sequential experiences, interpret current contexts, and predict future environmental evolution. Guided by episodic representations that span both the past and anticipated future, the downstream policy (System 1) generates coherent, context‑aware action sequences through flow‑matching and cross‑modal attention mechanisms. Experimental results show that TriVLA operates efficiently at approximately 36 Hz and consistently outperforms baseline models on standard benchmarks and challenging real‑world manipulation tasks. It demonstrates strong long‑horizon planning and open‑ended intent understanding, showcasing the advantages of episodic world model‑inspired reasoning for robust, generalizable robot intelligence. Project Page: https://zhenyangliu.github.io/TriVLA/.

Abstract:
The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real‑world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high‑fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision‑making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real‑world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up‑to‑date literature and open‑source projects at https://github.com/NJU3DV‑LoongGroup/Embodied‑World‑Models‑Survey.

Abstract:
End‑to‑end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation‑free, end‑to‑end planning via self‑supervised learning. In this paper, we present World4Drive, an end‑to‑end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi‑modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial‑semantic priors provided by vision foundation models. It then generates multi‑modal planning trajectories based on current scene features and driving intentions and predicts multiple intention‑driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation‑free, end‑to‑end planning through self‑supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state‑of‑the‑art performance without manual perception annotations on both the open‑loop nuScenes and closed‑loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, 46.7% lower collision rate, and 3.75× faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.

Abstract:
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion‑based world models struggle with flexible‑length, long‑horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed‑length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine‑grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end‑to‑end framework. Our architecture enables high‑resolution, long‑duration generation while introducing a novel chain‑of‑forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state‑of‑the‑art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real‑time motion planner, outperforming strong end‑to‑end planners on NAVSIM benchmarks. Code will be publicly available at https://github.com/Kevin‑thu/Epona/.

Abstract:
Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision‑Language‑Action (VLA) models build policies on top of Vision‑Language Models (VLMs), seeking to transfer their open‑world semantic knowledge. However, their zero‑shot capability lags significantly behind the base VLMs, as the instruction‑vision‑action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal‑VLA, a zero‑shot framework that leverages Image‑Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high‑level and low‑level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training‑free low‑level control. To further improve robustness, we introduce a Reflection‑through‑Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real‑world experiments demonstrate that Goal‑VLA achieves strong performance and inspiring generalizability in manipulation tasks. Supplementary materials are available at https://nus‑lins‑lab.github.io/goalvlaweb/.

Abstract:
Panoramic video generation aims to synthesize 360‑degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high‑quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine‑grained visual details simultaneously. With our proposed Pano‑Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state‑of‑the‑art performance and surpassing previous methods.

Abstract:
Vision‑and‑Language Navigation in Continuous Environments (VLN‑CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self‑evolving world model framework that enhances environmental understanding and decision‑making in VLN‑CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene‑contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN‑CE benchmarks. Code is available at https://github.com/Feliciaxyao/NavMorph.

Abstract:
World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact‑rich robotic scenarios. In this paper, we present RoboScape, a unified physics‑informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics‑informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics‑informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua‑fib‑lab/RoboScape.

Abstract:
3D world models (i.e., learning‑based 3D dynamics models) offer a promising approach to generalizable robotic manipulation by capturing the underlying physics of environment evolution conditioned on robot actions. However, existing 3D world models are primarily limited to single‑material dynamics using a particle‑based Graph Neural Network model, and often require time‑consuming 3D scene reconstruction to obtain 3D particle tracks for training. In this work, we present ParticleFormer, a Transformer‑based point cloud world model trained with a hybrid point cloud reconstruction loss, supervising both global and local dynamics features in multi‑material, multi‑object robot interactions. ParticleFormer captures fine‑grained multi‑object interactions between rigid, deformable, and flexible materials, trained directly from real‑world robot perception data without an elaborate scene reconstruction. We demonstrate the model's effectiveness both in 3D scene forecasting tasks, and in downstream manipulation tasks using a Model Predictive Control (MPC) policy. In addition, we extend existing dynamics learning benchmarks to include diverse multi‑material, multi‑object interaction scenarios. We validate our method on six simulation and three real‑world experiments, where it consistently outperforms leading baselines by achieving superior dynamics prediction accuracy and less rollout error in downstream visuomotor tasks. Experimental videos are available at https://suninghuang19.github.io/particleformer_page/.

Abstract:
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open‑world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open‑world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model‑specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open‑world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state‑of‑the‑art models by large margins, particularly for unseen novel attacks. Source code: https://github.com/yzheng97/CDAL.

Abstract:
We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision‑Language‑Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding visual understanding and in turn helping the visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.
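
The masking idea in the last sentence can be illustrated with a minimal sketch. The abstract does not specify the exact mask, so everything here is an assumption: the token labels `"obs"` and `"act"`, the per-token `action_ids`, and the rule itself (a causal mask in which a token of the current action may attend to observation tokens and to earlier tokens of the same action, but not to tokens of previously generated actions).

```python
def build_action_mask(token_kinds, action_ids):
    """Return mask[q][k] == True where query token q may attend to key token k.

    token_kinds: list of "obs" or "act" labels (hypothetical labelling).
    action_ids:  for "act" tokens, the index of the action they belong to;
                 None for "obs" tokens.
    """
    n = len(token_kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            hides_prior_action = (
                token_kinds[q] == "act"
                and token_kinds[k] == "act"
                and action_ids[k] != action_ids[q]
            )
            if not hides_prior_action:
                mask[q][k] = True
    return mask


# Two observation tokens followed by two actions of two tokens each.
kinds = ["obs", "obs", "act", "act", "act", "act"]
ids = [None, None, 0, 0, 1, 1]
mask = build_action_mask(kinds, ids)
```

Under this rule the second action is generated from the observations alone, so an error in the first action cannot propagate into it through attention.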

Abstract:
Multi‑task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual‑arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding the multi‑body spatiotemporal dynamics. An existing method, ManiGaussian, pioneers encoding the spatiotemporal dynamics into the visual representation via a Gaussian world model for single‑arm settings, but it ignores the interaction of multiple embodiments in dual‑arm systems, which leads to a significant performance drop. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi‑task bimanual manipulation by digesting multi‑body scene dynamics through a hierarchical Gaussian world model. To be specific, we first generate task‑oriented Gaussian Splatting from intermediate visual features, which aims to differentiate acting and stabilizing arms for multi‑body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with the leader‑follower architecture, where the multi‑body spatiotemporal dynamics is mined for intermediate visual representation via future scene prediction. The leader predicts Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method significantly outperforms the current state‑of‑the‑art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves 60% success rate on average in 9 challenging real‑world tasks. Our code is available at https://github.com/April‑Yz/ManiGaussian_Bimanual.

Abstract:
We introduce Matrix‑Game, an interactive world foundation model for controllable game world generation. Matrix‑Game is trained using a two‑stage pipeline that first performs large‑scale unlabeled pretraining for environment understanding, followed by action‑labeled training for interactive video generation. To support this, we curate Matrix‑Game‑MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high‑quality labeled clips with fine‑grained keyboard and mouse action annotations. Our model adopts a controllable image‑to‑world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix‑Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix‑Game consistently outperforms prior open‑source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double‑blind human evaluations further confirm the superiority of Matrix‑Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image‑to‑world generation, we will open‑source the Matrix‑Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix‑Game.

Abstract:
Action‑labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action‑free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large‑scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action‑free videos and an inverse dynamics model on a limited set of action‑labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2‑2.2x improvement in low‑data regimes, a 1.4x average improvement by learning from action‑free human videos, and the first generalization to LIBERO tasks from zero in‑distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify‑robotics.github.io/.

Abstract:
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego‑vehicle motion (perspective shifts) from scene evolvement (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego‑motion by leveraging the scene‑centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene‑centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego‑irrelevant, spatially consistent future features through a scene‑centric prediction branch, which are then converted into scene condition using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes‑Occ3D dataset show that COME achieves consistent and significant improvements over state‑of‑the‑art (SOTA) methods across diverse configurations, including different input sources (ground‑truth, camera‑based, fusion‑based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves 26.3% better mIoU metric than DOME and 23.7% better mIoU metric than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio‑temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.

Abstract:
A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents ‑‑ virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine‑grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents' activities. Our approach produces rich, privacy‑preserving sensor data that reflects real‑world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low‑resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real‑world datasets. These results highlight the potential of using LLM‑guided embodied agents for scalable and cost‑effective sensor data generation in HAR. Our code is publicly available at https://github.com/ZikangLeng/AgentSense.

Abstract:
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation‑ and scenario‑based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule‑based systems, knowledge‑driven models, and data‑driven synthesis, often producing limited diversity and unrealistic safety‑critical cases. With the emergence of foundation models, which represent a new generation of pre‑trained, general‑purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision‑language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open‑source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM‑AVS/FM‑for‑Scenario‑Generation‑Analysis.

Abstract:
The flourishing of video generation technologies has endangered the credibility of real‑world information and intensified the demand for AI‑generated video detectors. Despite some progress, the lack of high‑quality real‑world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large‑scale, high‑quality, and real‑world simulation dataset for AI‑generated video detection. GenWorld features the following characteristics: (1) Real‑world Simulation: GenWorld focuses on videos that replicate real‑world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state‑of‑the‑art video generation models to provide realistic and high‑quality forged videos; (3) Cross‑prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high‑quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real‑world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi‑view consistency as a strong criterion for real‑world AI‑generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI‑generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI‑generated video detection. Project Page: https://chen‑wl20.github.io/GenWorld

Abstract:
Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain‑specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM‑based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in‑context learning abilities of LLMs to guide an LLM‑based world model's predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity‑driven reinforcement learning policy that explores the environment to find transitions with a low log‑likelihood under our LLM‑based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human‑interpretable theories of environment dynamics.

Abstract:
Model‑based reinforcement learning (MBRL) has been used to efficiently solve vision‑based control tasks in high‑dimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task‑irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real‑world scenarios. In this study, we propose a novel self‑supervised method, Dream to Generalize (Dr. G), for zero‑shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task‑relevant features among multi‑view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open‑sourced and available at https://github.com/JeongsooHa/DrG.git

Abstract:
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish ‑‑ giving them dense supervision across a narrow set of states ‑‑ rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test‑time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction‑efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently out‑performs state‑of‑the‑art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5‑10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .

Abstract:
Open‑vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out‑of‑domain generalization, we propose Cost Aggregation with Optimal Transport (OV‑COAST) for open‑vocabulary semantic segmentation. To align visual‑language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two‑stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT‑Seg model. We evaluate state‑of‑the‑art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost‑aggregation model CAT‑Seg with ViT‑B backbone, achieving superior results, surpassing CAT‑Seg by 1.72% and SAN‑B by 4.9% mIoU. The code is available at https://github.com/adityagandhamal/OV‑COAST/.
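
As a rough illustration of the first stage, the sketch below runs standard Sinkhorn iterations on a small cost matrix. It is not the OV‑COAST implementation: the uniform marginals, the regularisation strength `eps`, the iteration count, and the toy 2x2 cost matrix (standing in for a cost volume between image and text features) are all assumptions made for the example.

```python
import math


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    cost: list of rows; cost[i][j] is the cost of matching source i to target j.
    Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    # Gibbs kernel K = exp(-cost / eps)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost matrix: matching item i to item i is free, the mismatch costs 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
```

The resulting plan concentrates its mass on the low‑cost diagonal pairs while keeping each row and column sum at 0.5, which is the kind of soft alignment that a second training stage could then use as guidance.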

Abstract:
Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer serious error accumulations when predicting the long‑term future, which limits the practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll‑out generations struggle to produce consistent and reasonable long videos due to the training‑inference gap. To this end, we propose several solutions to build a simple yet effective long‑term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine‑grained video flows are self‑supervised signals for coarse‑grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse‑grained and fine‑grained modules are coordinated to generate long‑term and temporally coherent videos. In the public benchmark NuScenes, compared with the state‑of‑the‑art front‑view model, our model improves FVD by 27% and reduces inference time by 85% for the video task of generating 110+ frames. More videos (including 90s duration) are available at https://Wang‑Xiaodong1899.github.io/longdwm/.

Abstract:
World model based planning has significantly improved decision‑making in complex environments by enabling agents to simulate future states and make informed choices. However, such planning carries a computational burden that is particularly restrictive in robotics, where resources are severely constrained. To address this limitation, we propose Sparse Imagination for Efficient Visual World Model Planning, which enhances computational efficiency by reducing the number of tokens processed during forward prediction. Our method leverages a sparsely trained vision‑based world model based on transformers with a randomized grouped attention strategy, allowing the model to flexibly adjust the number of tokens processed based on the available computational resources. By enabling sparse imagination during latent rollout, our approach significantly accelerates planning while maintaining high control fidelity. Experimental results demonstrate that sparse imagination preserves task performance while dramatically improving inference efficiency. This general technique for visual planning is applicable from simple test‑time trajectory optimization to complex real‑world tasks with the latest VLAs, enabling the deployment of world models in real‑time scenarios.

Abstract:
Recent advancements in language models have demonstrated remarkable in‑context learning abilities, prompting the exploration of in‑context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in‑context inference. In this paper, we propose Scalable In‑Context Q‑Learning (S‑ICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt‑based multi‑head transformer architecture that simultaneously predicts optimal policies and in‑context value functions using separate heads. We pretrain a generalized world model to capture task‑relevant information, enabling the construction of a compact prompt that facilitates fast and precise in‑context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper‑expectile of the Q‑function, and distill the in‑context value functions into policy extraction using advantage‑weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU‑RL/SICQL.
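
Fitting a value function to an upper expectile is usually done with an asymmetric squared loss, and the sketch below shows that generic loss rather than the S‑ICQL code. The function names, the choice `tau = 0.8`, and the scalar toy data are assumptions for illustration.

```python
def expectile_loss(q_values, v_values, tau=0.8):
    """Asymmetric squared error: under-estimates of Q are weighted by tau,
    over-estimates by (1 - tau), so tau > 0.5 pulls V toward the upper tail."""
    total = 0.0
    for q, v in zip(q_values, v_values):
        diff = q - v
        weight = tau if diff > 0 else 1.0 - tau
        total += weight * diff * diff
    return total / len(q_values)


def expectile(samples, tau=0.8, n_iters=50):
    """Scalar minimiser of the expectile loss, found by iterative reweighting."""
    v = sum(samples) / len(samples)  # start from the mean (the tau = 0.5 case)
    for _ in range(n_iters):
        weights = [tau if q > v else 1.0 - tau for q in samples]
        v = sum(w * q for w, q in zip(weights, samples)) / sum(weights)
    return v


# Two equally likely Q-values for the same state.
upper = expectile([0.0, 1.0], tau=0.8)  # lies above the mean of 0.5
```

With `tau = 0.5` the loss reduces to half the ordinary mean squared error, and raising `tau` moves the fitted value toward the best actions seen in the data without ever taking an explicit maximum over actions.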

Abstract:
Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk‑aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches exhibit deficiencies in maintaining robust 3D geometric consistency or accumulating artifacts during occlusion handling, both critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user‑specified ego‑car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.

Abstract:
World models have recently gained prominence for action‑conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long‑term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state‑of‑the‑art world models, which are diffusion‑based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long‑context tasks by integrating features from a state‑space model, representing the entire interaction history. This design restores long‑term memory while preserving the high‑fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion‑only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state‑space representations into diffusion models is highly effective in demonstrating both visual details and long‑term memory. Project page: https://insait‑institute.github.io/StateSpaceDiffuser/.

Abstract:
World models have recently attracted growing interest in Multi‑Agent Reinforcement Learning (MARL) due to their ability to improve sample efficiency for policy learning. However, accurately modeling environments in MARL is challenging due to the exponentially large joint action space and highly uncertain dynamics inherent in multi‑agent systems. To address this, we reduce modeling complexity by shifting from jointly modeling the entire state‑action transition dynamics to focusing on the state space alone at each timestep through sequential agent modeling. Specifically, our approach enables the model to progressively resolve uncertainty while capturing the structured dependencies among agents, providing a more accurate representation of how agents influence the state. Interestingly, this sequential revelation of agents' actions in a multi‑agent system aligns with the reverse process in diffusion models‑‑a class of powerful generative models known for their expressiveness and training stability compared to autoregressive or latent variable models. Leveraging this insight, we develop a flexible and robust world model for MARL using diffusion models. Our method, Diffusion‑Inspired Multi‑Agent world model (DIMA), achieves state‑of‑the‑art performance across multiple multi‑agent control benchmarks, significantly outperforming prior world models in terms of final return and sample efficiency, including MAMuJoCo and Bi‑DexHands. DIMA establishes a new paradigm for constructing multi‑agent world models, advancing the frontier of MARL research. Codes are open‑sourced at https://github.com/breez3young/DIMA.

Abstract:
Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one‑shot setting, the agent generates a policy after observing a single expert demonstration without additional fine‑tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with different semantic or structural requirements. While some recent methods attempt to address this, they exhibit low success rates on hard test tasks that, despite being visually similar to some training tasks, differ in context and require distinct responses. Additionally, most existing methods lack an explicit model of environment dynamics, limiting their ability to reason about future states. To address these limitations, we propose a novel framework for one‑shot visual imitation learning via world‑model‑guided trajectory generation. Given an expert demonstration video and the agent's initial observation, our method leverages a learned world model to predict a sequence of latent states and actions. This latent trajectory is then decoded into physical waypoints that guide the agent's execution. Our method is evaluated on two simulated benchmarks and three real‑world robotic platforms, where it consistently outperforms prior approaches, with over 30% improvement in some cases. The code is available at https://github.com/raktimgg/osvi‑wm.

Abstract:
Imitation learning is a promising approach for enabling generalist capabilities in humanoid robots, but its scaling is fundamentally constrained by the scarcity of high‑quality expert demonstrations. This limitation can be mitigated by leveraging suboptimal, open‑ended play data, often easier to collect and offering greater diversity. This work builds upon recent advances in generative modeling, specifically Flow Matching, an alternative to Diffusion models. We introduce a method for estimating the minimum or maximum of the learned distribution by leveraging the unique properties of Flow Matching, namely, deterministic transport and support for arbitrary source distributions. We apply this method to develop several goal‑conditioned imitation and reinforcement learning algorithms based on Flow Matching, where policies are conditioned on both current and goal observations. We explore and compare different architectural configurations by combining core components, such as critic, planner, actor, or world model, in various ways. We evaluated our agents on the OGBench benchmark and analyzed how different demonstration behaviors during data collection affect performance in a 2D non‑prehensile pushing task. Furthermore, we validated our approach on real hardware by deploying it on the Talos humanoid robot to perform complex manipulation tasks based on high‑dimensional image observations, featuring a sequence of pick‑and‑place and articulated object manipulation in a realistic kitchen environment. Experimental videos and code are available at: https://hucebot.github.io/extremum_flow_matching_website/

Abstract:
Vision‑Language‑Action (VLA) models offer significant potential for end‑to‑end driving, yet their reasoning is often constrained by textual Chains‑of‑Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio‑temporal relations and discarding fine‑grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio‑temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically‑plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio‑temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse‑dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre‑training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future‑frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio‑temporal CoT bridges the perception‑planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV‑XJTU/FSDrive.

Abstract:
World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision‑making. However, existing world models often require extensive domain‑specific training and still produce low‑fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large‑scale internet data have demonstrated impressive capabilities in generating high‑quality videos that capture diverse real‑world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre‑trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre‑trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open‑world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.

Abstract:
World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task‑specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR‑World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR‑World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language‑ and video‑based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post‑training paradigm for enhancing the utility of generative models more broadly. Code, datasets, models, and video samples are available at the project website: https://thuml.github.io/RLVR‑World.

Abstract:
We present MAGI‑1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed‑length segments of consecutive frames. Trained to denoise per‑chunk noise that increases monotonically over time, MAGI‑1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image‑to‑video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI‑1 facilitates controllable generation via chunk‑wise prompting and supports real‑time, memory‑efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI‑1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI‑org/MAGI‑1 and https://github.com/SandAI‑org/MagiAttention. The product can be accessed at https://sand.ai.

Abstract:
We introduce DreamGen, a simple yet highly effective 4‑stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories ‑ synthetic robot data generated from video world models. DreamGen leverages state‑of‑the‑art image‑to‑video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo‑action sequences using either a latent action model or an inverse‑dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick‑and‑place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T‑Dreams.

Abstract:
Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program‑structured world models remains limited to natural language and grid‑world domains. We introduce a novel program synthesis method for effectively modeling complex, non‑gridworld domains by representing a world model as an exponentially‑weighted product of programmatic experts (PoE‑World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model‑based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe‑world.
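
As an illustration of the exponentially-weighted product of experts named above, the following is a minimal sketch of the generic combination rule p(x) ∝ ∏ p_i(x)^{w_i} over a small discrete outcome set. The function name, the dictionary representation, and the toy experts are assumptions made for this example; they are not taken from the PoE-World code, where the experts are programs synthesized by LLMs.

```python
import math


def product_of_experts(expert_probs, weights):
    """Combine discrete expert distributions as p(x) ∝ prod_i p_i(x) ** w_i.

    expert_probs: list of dicts mapping outcome -> probability (hypothetical format).
    weights: one non-negative weight per expert.
    """
    outcomes = list(expert_probs[0].keys())
    log_scores = {}
    for x in outcomes:
        total = 0.0
        for probs, w in zip(expert_probs, weights):
            if w == 0.0:
                continue  # a zero-weight expert has no influence
            p = probs.get(x, 0.0)
            if p <= 0.0:
                total = -math.inf  # any weighted expert can veto an outcome
                break
            total += w * math.log(p)
        log_scores[x] = total

    best = max(log_scores.values())
    if best == -math.inf:
        raise ValueError("every outcome is ruled out by some expert")

    # Subtract the max log-score before exponentiating for numerical stability.
    unnormalized = {x: math.exp(s - best) for x, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {x: u / z for x, u in unnormalized.items()}


# Toy usage: two experts that both favour "up" sharpen the combined belief.
expert_a = {"up": 0.5, "down": 0.25, "stay": 0.25}
expert_b = {"up": 0.5, "down": 0.25, "stay": 0.25}
combined = product_of_experts([expert_a, expert_b], [1.0, 1.0])
```

Because the experts multiply, an outcome must be plausible under every weighted expert to keep probability mass, which is what lets many small, narrow experts jointly constrain a prediction.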

Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB‑D frames (RGB‑D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U‑Net, and then a diffusion model predicts the future frame using the scene flow. FlowDreamer is trained end‑to‑end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms other baseline RGB‑D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

Abstract:
Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real‑time interaction with dynamic environments. We propose EnerVerse‑AC (EVAC), an action‑conditional world model that generates future visual observations based on an agent's predicted actions, enabling realistic and controllable robotic inference. Building on prior architectures, EVAC introduces a multi‑level action‑conditioning mechanism and ray map encoding for dynamic multi‑view image generation while expanding training data with diverse failure trajectories to improve generalization. As both a data engine and evaluator, EVAC augments human‑collected trajectories into diverse datasets and generates realistic, action‑conditioned video observations for policy testing, eliminating the need for physical robots or complex simulations. This approach significantly reduces costs while maintaining high fidelity in robotic manipulation evaluation. Extensive experiments validate the effectiveness of our method. Code, checkpoints, and datasets can be found at https://annaj2178.github.io/EnerverseAC.github.io.

Abstract:
Recent advances in creative AI have enabled the synthesis of high‑fidelity images and videos conditioned on language instructions. Building on these developments, text‑to‑video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action‑consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi‑dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.

Abstract:
The combination of embodied intelligence and robots has great prospects and is becoming increasingly common. In order to work more efficiently, accurately, reliably, and safely in industrial scenarios, robots should have at least general knowledge, working‑environment knowledge, and operating‑object knowledge. These requirements pose significant challenges to existing embodied intelligent robotics (EIR) techniques. Thus, this paper first briefly reviews the history of industrial robotics and analyzes the limitations of mainstream EIR frameworks. Then, a new knowledge‑driven technical framework of embodied intelligent industrial robotics (EIIR) is proposed for various industrial environments. It has five modules: a world model, a high‑level task planner, a low‑level skill controller, a simulator, and a physical system. The development of techniques related to each module is also thoroughly reviewed, and recent progress regarding their adaptation to industrial applications is discussed. A case study of a real‑world assembly system is given to demonstrate the applicability and potential of the newly proposed EIIR framework. Finally, the key challenges that EIIR encounters in industrial scenarios are summarized and future research directions are suggested. The authors believe that EIIR technology is shaping the next generation of industrial robotics and that EIIR‑based industrial systems supply a new technological paradigm for intelligent manufacturing. It is expected that this review could serve as a valuable reference for scholars and engineers who are interested in industrial embodied intelligence. Together, scholars can use this research to drive the rapid advancement and application of EIIR techniques. The authors will continue to track and contribute new studies on the project page https://github.com/jackyzengl/EIIR

Abstract:
Motion prediction, recently popularized as world models, refers to the anticipation of future agent states or scene evolution, which is rooted in human cognition, bridging perception and decision‑making. It enables intelligent systems, such as robots and self‑driving cars, to act safely in dynamic, human‑involved environments, and informs broader time‑series reasoning challenges. With advances in methods, representations, and datasets, the field has seen rapid progress, reflected in quickly evolving benchmark results. Yet, when state‑of‑the‑art methods are deployed in the real world, they often struggle to generalize to open‑world conditions and fall short of deployment standards. This reveals a gap between research benchmarks, which are often idealized or ill‑posed, and real‑world complexity. To address this gap, this survey revisits the generalization and deployability of motion prediction models, with an emphasis on applications in robotics, autonomous driving, and human motion. We first offer a comprehensive taxonomy of motion prediction methods, covering representations, modeling strategies, application domains, and evaluation protocols. We then study two key challenges: (1) how to bring motion prediction models up to realistic deployment standards, where motion prediction does not act in a vacuum but functions as one module of closed‑loop autonomy stacks, taking localization and perception as input and informing downstream planning and control; and (2) how to generalize motion prediction models from limited seen scenarios and datasets to open‑world settings. Throughout the paper, we highlight critical open challenges to guide future work, aiming to recalibrate the community's efforts, fostering progress that is not only measurable but also meaningful for real‑world applications. The project webpage can be found at https://trends‑in‑motion‑prediction‑2025.github.io/.

Abstract:
Model‑based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look‑up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual‑agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a selector is introduced to identify reference users by balancing similarity and diversity so that the recommender can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at https://github.com/ArronDZhang/DARLR.

Abstract:
In many real‑world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes induced by stochastic dynamics and rewards. Motivated by recent progress in world model approaches, where latent models approximate beliefs and support planning, we extend Distributional Reinforcement Learning (DistRL), which models the entire return distribution for fully observable domains, to Partially Observable Markov Decision Processes (POMDPs). Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p‑Wasserstein metric. We also propose a finite representation of these return distributions via psi‑vectors, generalizing the classical alpha‑vectors in POMDP solvers. Building on this, we develop Distributional Point‑Based Value Iteration (DPBVI), which integrates psi‑vectors into a standard point‑based backup procedure, bridging DistRL and POMDP planning. Our experiments demonstrate that DPBVI recovers classical Point‑Based Value Iteration (PBVI) in the risk‑neutral case, validating the distributional extension.
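
For context on the classical alpha-vectors that the psi-vectors are said to generalize, here is a minimal sketch of the standard risk-neutral POMDP value function, V(b) = max over alpha of the sum over s of b(s)·alpha(s). The function name and the toy numbers are assumptions for illustration; the distributional psi-vector backup from the paper is not reproduced here.

```python
def value_from_alpha_vectors(belief, alpha_vectors):
    """Classical piecewise-linear POMDP value: the best expected value over alpha-vectors.

    belief: probabilities over states, summing to 1.
    alpha_vectors: list of per-state value vectors, one per candidate plan.
    """
    return max(
        sum(b * a for b, a in zip(belief, alpha))
        for alpha in alpha_vectors
    )


# Toy usage with two states and three hypothetical alpha-vectors.
belief = [0.5, 0.5]
alphas = [[1.0, 0.0], [0.0, 2.0], [0.8, 0.8]]
value = value_from_alpha_vectors(belief, alphas)
```

Each alpha-vector stores one expected return per state; a distributional variant would instead have to store a return distribution per state, which is the role the abstract assigns to psi-vectors.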

Abstract:
Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta‑World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert‑level results in both locomotion and manipulation tasks. Our approach demonstrates improved stability over adversarial methods while maintaining expert‑level performance.

Abstract:
Recent advances in generative world models have enabled classical safe control methods, such as Hamilton‑Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high‑dimensional sensor observations. However, obtaining comprehensive coverage of all safety‑critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out‑of‑distribution (OOD) situations as safe. To address this, we introduce an uncertainty‑aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space (spanning both the latent representation and the epistemic uncertainty), we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision‑based control tasks with a Franka manipulator, we show that our uncertainty‑aware safety filter preemptively detects potentially unsafe scenarios and reliably proposes safe, in‑distribution actions. Video results can be found on the project website at https://cmu‑intentlab.github.io/UNISafe
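
The idea of calibrating an uncertainty threshold via conformal prediction can be sketched with the standard split-conformal quantile rule. This is a generic illustration under stated assumptions: the function names, the use of a held-out set of in-distribution uncertainty scores, and the miscoverage level `alpha` are hypothetical and are not taken from the UNISafe implementation.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold on in-distribution uncertainty scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score, so a
    fresh exchangeable in-distribution score exceeds it with probability <= alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite threshold is valid.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def is_ood(uncertainty, threshold):
    """Flag a world-model prediction as OOD when its uncertainty exceeds the threshold."""
    return uncertainty > threshold


# Toy usage with nine hypothetical calibration scores.
scores = [3.0, 1.0, 4.0, 2.0, 9.0, 5.0, 8.0, 6.0, 7.0]
threshold = conformal_threshold(scores, alpha=0.5)
```

The appeal of this rule is that the false-alarm rate on in-distribution inputs is bounded by `alpha` without assumptions on the shape of the score distribution, provided calibration and test scores are exchangeable.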

Abstract:
Agent self‑improvement, where the backbone Large Language Model (LLM) of the agent is trained on trajectories sampled autonomously from its own policy, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance reaches a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre‑trained web knowledge in LLMs. To improve the performance of self‑improvement, we propose a novel framework that introduces a co‑evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self‑instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look‑ahead simulation to guide action selection for the agent LLM. Experiments in real‑world web environments (Mind2Web‑Live, WebVoyager, and GAIA‑web) show a 10% performance gain over existing self‑evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful closed‑source models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability. Code is available at https://github.com/Tencent/SelfEvolvingAgent

Abstract:
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training‑free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable code to regulate LLM agents' policies. We further propose an RL‑free, model‑based agent "WALL‑E 2.0" through the model‑predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look‑ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. Together, they considerably improve learning efficiency in a new environment. On open‑world challenges in Mars (Minecraft‑like) and ALFWorld (embodied indoor environments), WALL‑E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%‑51.6% in success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.

Abstract:
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real‑world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision‑language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe‑real, with manually filtered images of real objects in patterns and (2) CAPTURe‑synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT‑4o, Intern‑VL2, Molmo, and Qwen2‑VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT‑4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe

Abstract:
Offline meta‑RL usually tackles generalization by inferring task beliefs from high‑quality samples or warmup explorations. The restricted form limits their generality and usability since these supervision signals are expensive and even infeasible to acquire in advance for unseen tasks. Learning directly from the raw text about decision tasks is a promising alternative to leverage a much broader source of supervision. In this paper, we propose Text‑to‑Decision Agent (T2DA), a simple and scalable framework that supervises offline meta‑RL with natural language. We first introduce a generalized world model to encode multi‑task decision data into a dynamics‑aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language‑decision pre‑training and aligning the text embeddings to comprehend the environment dynamics. After training the text‑conditioned generalist policy, the agent can directly realize zero‑shot text‑to‑decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta‑World benchmarks show that T2DA facilitates high‑capacity zero‑shot generalization and outperforms various types of baselines. Our code is available at https://github.com/NJU‑RL/T2DA.
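
The CLIP-inspired step of predicting which textual description goes with which decision embedding can be illustrated with a generic symmetric contrastive (InfoNCE) loss. The sketch below is written in plain Python for self-containment; the function names, the batch layout (row i of each input forms a matched pair), and the temperature value of 0.07 are assumptions for this example and are not drawn from the T2DA code.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(text_emb, decision_emb, temperature=0.07):
    """Symmetric CLIP-style loss over a batch of paired embeddings."""
    t = [_normalize(v) for v in text_emb]
    d = [_normalize(v) for v in decision_emb]
    logits = [
        [sum(a * b for a, b in zip(ti, dj)) / temperature for dj in d]
        for ti in t
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the text-to-decision and decision-to-text directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Toy usage: matched pairs give a much lower loss than shuffled pairs.
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = contrastive_loss(texts, texts)
shuffled_loss = contrastive_loss(texts, [texts[1], texts[2], texts[0]])
```

Minimizing such a loss pulls each text embedding toward its paired decision embedding and away from the other pairs in the batch, which is one way the semantic gap described above could be bridged.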

Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real‑world applications. Yet, they often struggle with long‑horizon planning, failing to find the optimal actions for complex tasks with many steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI‑text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent‑focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.

Abstract:
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general‑purpose machine‑learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self‑supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine‑grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large‑scale medical foundation models. Code & pre‑trained models are available at https://github.com/LeapLabTHU/CheXWorld.

Abstract:
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real‑time movement instructions for acquiring standard plane images, offer a promising solution for AI‑assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion‑aware world modeling framework for probe guidance that encodes anatomical knowledge and motion‑induced visual dynamics, while effectively leveraging past visual‑motion sequences to enhance guidance precision. EchoWorld employs a pre‑training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre‑trained model, we introduce a motion‑aware attention mechanism in the fine‑tuning stage that effectively integrates historical visual‑motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single‑frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.

Abstract:
Effective scene representation is critical for the visual grounding ability of representations, yet existing methods for 3D Visual Grounding are often constrained. They either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi‑dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM) framework, a novel scene representation framework that enriches robust geometric models with a spectrum of VLM‑derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi‑view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM‑Grounding, a new paradigm that shifts grounding from free‑form VLM queries to a structured reasoning process over the semantic‑rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach's superiority. On the ScanRefer benchmark, DSM‑Grounding achieves a state‑of‑the‑art 59.06% overall accuracy of IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F‑mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real‑world scenarios.
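
The IoU@0.5 accuracy quoted above can be illustrated with a minimal sketch. It assumes axis‑aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and counts a prediction as correct when its IoU with the ground‑truth box is at least the threshold. The function names and the inclusive comparison are illustrative assumptions, not taken from the paper or from the ScanRefer evaluation code.

```python
from typing import Sequence, Tuple

# Axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed layout.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    reaches the threshold (inclusive comparison is an assumption)."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For example, two 2x2x2 boxes offset by one unit along x overlap in a volume of 4 out of a union of 12, so their IoU is 1/3 and the prediction would not count at the 0.5 threshold.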

Abstract:
Conventional visuomotor imitation learning usually predicts future robot actions directly in the time domain. Such formulations often have limited physical scene awareness and weak long‑horizon memory. In contrast, world‑model‑based perception and memory‑augmented policies can improve world awareness but incur substantial computation overhead. In this work, we propose Wavelet Policy, a lightweight imitation learning framework that combines World Prior Memory (WPM) with wavelet‑based multi‑scale action modeling. Our key idea is to encode persistent physical scene structure from static background images into compact memory tokens, which are fused into world‑prior tokens and injected into the encoder during forward propagation. Based on this memory‑conditioned representation, we further perform wavelet‑domain decomposition over horizon‑aligned latent action tokens and adopt a Single‑Encoder Multiple‑Decoder (SE2MD) architecture to model latent components at different temporal scales. The resulting latent subbands are reconstructed through inverse wavelet transform and finally projected into executable action chunks. To facilitate efficient world prior learning, we introduce a world‑prior adaptation loss, encouraging the background encoder to retain persistent scene knowledge while remaining lightweight and stable. Extensive experiments on four simulated and six real‑world robotic manipulation tasks show that Wavelet Policy consistently outperforms strong baselines. These results demonstrate that combining scale‑domain action modeling with world‑prior memory provides an effective and efficient solution for long‑horizon embodied manipulation. We release the source code, data, and model checkpoints for the simulation tasks at https://github.com/lurenjia384/Wavelet_Policy.
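
To make the decompose‑then‑reconstruct idea concrete, here is a minimal sketch of a one‑level wavelet transform and its inverse on a short scalar sequence. The Haar wavelet, the scalar signal, and the function names are assumptions chosen for illustration; the abstract does not state which wavelet family is used, and the paper applies the transform to latent action tokens rather than to raw values.

```python
import math
from typing import List, Tuple


def haar_forward(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar transform.

    Returns (approximation, detail): the low-frequency and high-frequency
    subbands, each half the length of the input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Inverse of haar_forward: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


# Hypothetical 8-step action trajectory for a single joint.
actions = [0.0, 0.1, 0.4, 0.9, 1.0, 1.0, 0.7, 0.2]
low, high = haar_forward(actions)
reconstructed = haar_inverse(low, high)
```

The approximation subband carries the slow trend of the trajectory and the detail subband carries step‑to‑step changes, which is the kind of split that separate decoders per temporal scale could operate on.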

Abstract:
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high‑quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real‑world dynamics and agent‑environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real‑world experiments, we show that: (1) UWM enables effective pretraining on large‑scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action‑free video data through independent control of modality‑specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
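
One way to read the role of the two independent diffusion timesteps is as a small lookup table: each modality is either held clean (timestep 0, treated as given), held at full noise (timestep T, effectively marginalised out), or denoised from T down to 0 (the quantity being generated). The sketch below encodes that reading. The constant `T_MAX`, the role labels, and the mode names are illustrative assumptions, not the authors' API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

T_MAX = 1000  # hypothetical number of diffusion steps

CLEAN = "clean"      # modality is observed: timestep fixed at 0
NOISE = "noise"      # modality is ignored: timestep fixed at T_MAX
DENOISE = "denoise"  # modality is generated: timestep swept from T_MAX to 0


@dataclass(frozen=True)
class ModeSetting:
    action: str  # role of the action modality
    video: str   # role of the future-video modality


# Assumed mapping from inference mode to per-modality roles.
MODES: Dict[str, ModeSetting] = {
    "policy": ModeSetting(action=DENOISE, video=NOISE),
    "video_generator": ModeSetting(action=NOISE, video=DENOISE),
    "forward_dynamics": ModeSetting(action=CLEAN, video=DENOISE),
    "inverse_dynamics": ModeSetting(action=DENOISE, video=CLEAN),
}


def fixed_timestep(role: str, t_max: int = T_MAX) -> Optional[int]:
    """Timestep at which a modality is held, or None if it is being denoised."""
    if role == CLEAN:
        return 0
    if role == NOISE:
        return t_max
    if role == DENOISE:
        return None
    raise ValueError(f"unknown role: {role}")
```

Under the same reading, action‑free video can be used in training by holding the action modality at full noise, so that only the video branch receives a meaningful learning signal.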

Abstract:
Modern world models require costly and time‑consuming collection of large video datasets with action demonstrations by people or by environment‑specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi‑environment world model, demonstrates the ability to simulate many environments with shared behavior. Unfortunately, training that model requires expensive demonstrations. Therefore, we propose a training framework that uses only a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by what random exploration can reach. To address this limitation, we propose AutoExplore Agent, an exploration agent that relies entirely on the uncertainty of the world model, delivering diverse data from which the model can learn best. Our agent is fully independent of environment‑specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi‑environment model can quickly adapt to new environments, improving video fidelity and controllability. To automatically obtain large‑scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments, a dataset that we name RetroAct. To build our model, we first create an open implementation of Genie, GenieRedux, and then apply enhancements and adaptations in our version, GenieRedux‑G. Our code and data are available at https://github.com/insait‑institute/GenieRedux.

Abstract:
End‑to‑end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end‑to‑end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency‑efficient compared to image‑level world models and can be seamlessly supervised using off‑the‑shelf BEV‑space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed‑loop Bench2Drive benchmark based on the CARLA simulator, achieving state‑of‑the‑art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.
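
The evaluate‑by‑forecasting loop described above can be sketched in a few lines: roll a predictor forward, score each candidate trajectory against the predicted future, and keep the best one. In this toy version a constant‑velocity obstacle predictor stands in for the learned BEV world model, and the scoring rule (reject unsafe trajectories, then prefer progress toward the goal) is an assumption made for illustration; none of the names come from the WoTE code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def predict_obstacle(pos: Point, vel: Point, steps: int, dt: float = 0.5) -> List[Point]:
    """Stand-in for a world model: constant-velocity forecast of one obstacle."""
    return [(pos[0] + vel[0] * dt * k, pos[1] + vel[1] * dt * k)
            for k in range(1, steps + 1)]


def score_trajectory(traj: Sequence[Point], obstacle_future: Sequence[Point],
                     goal: Point, safe_dist: float = 2.0) -> float:
    """Higher is better. Trajectories that come too close to the predicted
    obstacle at any step are rejected; the rest are ranked by final distance
    to the goal."""
    min_gap = min(math.dist(p, o) for p, o in zip(traj, obstacle_future))
    if min_gap < safe_dist:
        return float("-inf")
    return -math.dist(traj[-1], goal)


def select_trajectory(candidates: Sequence[Sequence[Point]],
                      obstacle_future: Sequence[Point], goal: Point) -> Sequence[Point]:
    """Pick the candidate with the best predicted outcome."""
    return max(candidates, key=lambda t: score_trajectory(t, obstacle_future, goal))


# Hypothetical scene: an obstacle crosses the ego lane from below.
future = predict_obstacle(pos=(4.0, -4.0), vel=(0.0, 4.0), steps=3)
fast = [(2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]      # meets the obstacle at step 2
yielding = [(1.0, 0.0), (1.5, 0.0), (3.0, 0.0)]  # slows down, then proceeds
crawl = [(0.5, 0.0), (1.0, 0.0), (1.5, 0.0)]     # safe but makes little progress
best = select_trajectory([fast, yielding, crawl], future, goal=(6.0, 0.0))
```

Here the fastest candidate is rejected because the forecast places the obstacle on its path, and the yielding candidate wins over the slowest one because it ends closer to the goal.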

Abstract:
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per‑frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real‑world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI‑assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench‑2.0, a next‑generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench‑2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine‑grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench‑2.0 aims to set a new standard for the next generation of video generative models.

Abstract:
In this paper, we introduce Semi‑SMD, a novel metric depth estimation framework tailored for surrounding‑camera equipment in autonomous driving. The input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial‑temporal‑semantic fusion module to construct the fused visual features. Cross‑attention components for surrounding cameras and adjacent frames are used to refine metric scale information and match temporal features. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi‑camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, which improves its quality. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state‑of‑the‑art performance in surrounding‑camera‑based depth estimation quality. The source code will be available at https://github.com/xieyuser/Semi‑SMD.

Abstract:
World models aim to learn action‑controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action‑labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self‑supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.

Abstract:
Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model‑free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model‑based TD‑learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high‑dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/wertyuilife2/bmpc.

Abstract:
Zero‑Shot Composed Image Retrieval (ZS‑CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS‑CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction‑based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS‑CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image‑caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo‑word token without extra supervision. Our model shows strong generalization ability on six ZS‑CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state‑of‑the‑art results on ZS‑CIR. Our code is available at https://github.com/Pter61/predicir.

Abstract:
In recent years, data‑driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high‑fidelity, long‑duration videos up to one minute. MiLA utilizes a Coarse‑to‑Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state‑of‑the‑art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi‑mlab/mila.github.io.

Abstract:
Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample‑efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self‑supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data‑driven World Model to train Lyapunov functions from off‑policy trajectories. The method is validated on both standard and goal‑conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state‑of‑the‑art neural Lyapunov approximation baseline. The code is available at: https://github.com/CAV‑Research‑Lab/SACLA.git
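
For readers unfamiliar with the conditions a control Lyapunov function must satisfy, the sketch below checks them on sampled states for a toy one‑dimensional discrete‑time system: the candidate V must vanish at the goal, be positive elsewhere, and admit at least one control that strictly decreases it. The system, the candidate function, and the sampled sets are assumptions for illustration only and are unrelated to the neural approximation or the world model used in the paper.

```python
from typing import Callable, Sequence


def satisfies_clf_conditions(
    V: Callable[[float], float],
    step: Callable[[float, float], float],
    states: Sequence[float],
    controls: Sequence[float],
    goal: float = 0.0,
    tol: float = 1e-9,
) -> bool:
    """Check control-Lyapunov conditions on a finite sample of states.

    Passing this check is only evidence on the sampled points, not a proof
    over the whole state space.
    """
    if abs(V(goal)) > tol:
        return False  # V must be zero at the goal
    for x in states:
        if x == goal:
            continue
        if V(x) <= 0.0:
            return False  # V must be positive away from the goal
        if not any(V(step(x, u)) < V(x) for u in controls):
            return False  # some control must strictly decrease V
    return True


# Toy system x_next = x + u with a quadratic candidate V(x) = x^2.
quadratic = lambda x: x * x
integrator = lambda x, u: x + u
samples = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

ok = satisfies_clf_conditions(quadratic, integrator, samples, controls=[-0.5, 0.0, 0.5])
stuck = satisfies_clf_conditions(quadratic, integrator, samples, controls=[0.0])
```

With controls in both directions, every sampled state has a move that reduces V, so the check passes; with only the zero control, the state never moves and the decrease condition fails.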

Abstract:
With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real‑world application scenes to achieve large‑scale data generation for challenging scenes. In this paper, we propose a simulator‑conditioned scene generation engine based on a world model. By constructing a simulation system consistent with real‑world scenes, simulation data and labels for any scene can be collected and used as the conditions for data generation in the world model. The result is a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real‑world scenes. Quantitative results show that the generated images significantly improve the performance of downstream perception models. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at https://github.com/Li‑Zn‑H/SimWorld.

Abstract:
World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label‑efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high‑level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open‑world knowledge‑guided consistency projection (OK‑CP) module incorporates prompt representations for visually described objects and aligns language‑visual features through contrastive learning. In this way, the domain gap can be bridged by fine‑tuning the pre‑trained world models with limited samples. Finally, an end‑to‑end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at https://github.com/Cimy‑wang/FusDreamer.

Abstract:
We present UniFuture, a unified 4D Driving World Model designed to simulate the dynamic evolution of the 3D physical world. Unlike existing driving world models that focus solely on 2D pixel‑level video generation (lacking geometry) or static perception (lacking temporal dynamics), our approach bridges appearance and geometry to construct a holistic 4D representation. Specifically, we treat future RGB images and depth maps as coupled projections of the same 4D reality and model them jointly within a single framework. To achieve this, we introduce a Dual‑Latent Sharing (DLS) scheme, which maps visual and geometric modalities into a shared spatio‑temporal latent space, implicitly entangling texture with structure. Furthermore, we propose a Multi‑scale Latent Interaction (MLI) mechanism, which enforces bidirectional consistency: geometry constrains visual synthesis to prevent structural hallucinations, while visual semantics refine geometric estimation. During inference, UniFuture can forecast high‑fidelity, geometrically consistent 4D scene sequences (image‑depth pairs) from a single current frame. Extensive experiments on the nuScenes and Waymo datasets demonstrate that our method outperforms specialized models in both future generation and geometry perception, highlighting the efficacy of unified 4D modeling for autonomous driving. The code is available at https://github.com/dk‑liang/UniFuture.

Abstract:
We propose a new task to benchmark human‑in‑scene understanding for embodied agents: Human‑In‑Scene Question Answering (HIS‑QA). Given a human motion within a 3D scene, HIS‑QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human‑related questions within the scene. To support this new task, we present HIS‑Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision‑language models on HIS‑Bench reveals significant limitations in their ability to handle HIS‑QA tasks. To this end, we propose HIS‑GPT, the first foundation model for HIS understanding. HIS‑GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human‑scene interactions. Extensive experiments demonstrate that HIS‑GPT sets a new state‑of‑the‑art on HIS‑QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models. The codes and data: https://github.com/ZJHTerry18/HumanInScene.

Abstract:
We introduce specialized diffusion‑based generative models that capture the spatiotemporal dynamics of fine‑grained robotic surgical sub‑stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data‑driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of ~2K clips extracted from simulation videos, we categorize surgical actions into fine‑grained sub‑stitch classes including ideal and non‑ideal executions of needle positioning, targeting, driving, and withdrawal. We fine‑tune two state‑of‑the‑art video diffusion models, LTX‑Video and HunyuanVideo, to generate high‑fidelity surgical action sequences at ≥768x512 resolution and ≥49 frames. For training our models, we explore both Low‑Rank Adaptation (LoRA) and full‑model fine‑tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non‑ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: https://mkturkcan.github.io/suturingmodels/

Abstract:
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline‑to‑online latent distillation and flexible disentanglement constraints. To enable effective cross‑domain semantic knowledge transfer, we introduce an interpretable model‑based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action‑free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.

Abstract:
Traditional agentic workflows rely on external prompts to manage interactions with tools and the environment, which limits the autonomy of reasoning models. We position Large Agent Models (LAMs) that internalize the generation of Chain‑of‑Action (CoA), enabling the model to autonomously decide when and how to use external tools. Our proposed AutoCoA framework combines supervised fine‑tuning (SFT) and reinforcement learning (RL), allowing the model to seamlessly switch between reasoning and action while efficiently managing environment interactions. Main components include step‑level action triggering, trajectory‑level CoA optimization, and an internal world model to reduce real‑environment interaction costs. Evaluations on open‑domain QA tasks demonstrate that AutoCoA‑trained agent models significantly outperform ReAct‑based workflows in task completion, especially in tasks that require long‑term reasoning and multi‑step actions. Code and dataset are available at https://github.com/ADaM‑BJTU/AutoCoA

Abstract:
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this work, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real‑world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation, which integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, and offer insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real‑world simulation within a unified framework.

Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action‑controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large‑scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action‑labeled data, limiting their applicability to real‑world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action‑controlled data generation, we propose the first surgical vision world model. The proposed model can generate action‑controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc‑2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical‑Vision‑World‑Model

Abstract:
Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision‑Language Model (VLM)‑based agents has demonstrated promising perception and decision‑making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model‑based Navigation framework powered by Vision‑Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for the navigation policy. By decomposing the task according to a human‑like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two‑stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero‑shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
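
The SR and SPL figures quoted above are standard embodied‑navigation metrics. As a minimal sketch, assuming the usual definitions (SR is the fraction of successful episodes; SPL weights each success by the ratio of the shortest‑path length to the longer of the shortest path and the path actually taken), they can be computed as follows. The function and argument names are illustrative and not taken from the WMNav code.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool], shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length.

    shortest[i] is the geodesic shortest-path length for episode i (assumed
    to be positive) and taken[i] is the length of the path the agent
    actually travelled.
    """
    total = 0.0
    for ok, l_opt, l_agent in zip(successes, shortest, taken):
        if ok:
            total += l_opt / max(l_agent, l_opt)  # 1.0 only for an optimal path
    return total / len(successes)


# Hypothetical three-episode evaluation: optimal success, detour success, failure.
episode_success = [True, True, False]
shortest_lengths = [5.0, 10.0, 8.0]
agent_lengths = [5.0, 20.0, 30.0]
sr_value = success_rate(episode_success)
spl_value = spl(episode_success, shortest_lengths, agent_lengths)
```

In this example two of three episodes succeed, and the second takes a path twice as long as necessary, so SPL is (1 + 0.5 + 0) / 3 = 0.5 while SR is 2/3; SPL therefore penalises inefficient exploration that SR alone would hide.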

Abstract:
Long‑horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state‑action space. However, despite a lack of dense rewards, these tasks often have a multi‑stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi‑stage dense reward learning, a bi‑phasic training scheme, and world model learning into a carefully designed demonstration‑augmented RL framework that strongly mitigates the challenge of exploration in long‑horizon tasks. Our evaluations demonstrate that our method improves data‑efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state‑of‑the‑art approaches. We validate this across 16 sparse‑reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.

Abstract:
Generative models have fundamentally reshaped the landscape of decision‑making, reframing the problem from pure scalar reward maximization to high‑fidelity trajectory generation and distribution matching. This paradigm shift addresses intrinsic limitations in classical Reinforcement Learning (RL), particularly the limited expressivity of standard unimodal policy distributions in capturing complex, multi‑modal behaviors embedded in diverse datasets. However, current literature often treats these models as isolated algorithmic improvements, rarely synthesizing them into a single comprehensive framework. This survey proposes a principled taxonomy grounding generative decision‑making within the probabilistic framework of Control as Inference. By performing a variational factorization of the trajectory posterior, we conceptualize four distinct functional roles: Controllers for amortized policy inference, Modelers for dynamics priors, Optimizers for iterative trajectory refinement, and Evaluators for trajectory guidance and value assessment. Unlike existing architecture‑centric reviews, this function‑centric framework allows us to critically analyze representative generative families across distinct dimensions. Furthermore, we examine deployment in high‑stakes domains, specifically Embodied AI, Autonomous Driving, and AI for Science, highlighting systemic risks such as dynamics hallucination in world models and proxy exploitation. Finally, we chart the path toward Generalist Physical Intelligence, identifying pivotal challenges in inference efficiency, trustworthiness, and the emergence of Physical Foundation Models.

Abstract:
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi‑criteria, execution‑based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large‑scale reinforcement learning outperform others. However, even the best‑performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test‑time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text‑to‑world.github.io/.

Abstract:
World models (WMs) represent the frontier of sample‑efficient reinforcement learning, but their complexity leaves many promising improvements unrealized due to the significant expertise and effort required to identify and integrate them. Inspired by Rainbow, which showed that individually known improvements to DQN complement each other and can be effectively combined, we take on this challenge and ask whether the same principle applies to world model agents. We introduce Simulus, a modular token‑based WM agent that integrates: (1) a flexible tokenization framework supporting arbitrary combinations of observation and action modalities; (2) intrinsic motivation for epistemic uncertainty reduction; (3) prioritized world model replay; and (4) regression‑as‑classification for reward and return prediction. Simulus achieves state‑of‑the‑art sample efficiency for planning‑free WMs across three diverse benchmarks: visual Atari 100K, continuous‑control DMC Proprioception 500K, and symbolic Craftax‑1M. Notably, intrinsic motivation proves beneficial even under the tight interaction budgets of sample‑efficient RL, despite the risk of wasting scarce interactions on task‑irrelevant experience. Ablation studies reveal that each component contributes individually, and their combination yields synergistic gains. Our code and model weights are publicly available at https://github.com/leor‑c/Simulus.
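
Item (4) above, regression-as-classification, replaces a scalar regression target with a distribution over fixed bins. The abstract does not say which encoding Simulus uses, so the following is only a sketch of one common variant (two-hot encoding with a cross-entropy loss). The bin layout and the function names `two_hot`, `decode`, and `cross_entropy` are assumptions made for illustration.

```python
import bisect
import math


def two_hot(value, bins):
    """Spread a scalar over its two neighbouring bin centres (weights sum to 1).

    bins must be strictly increasing and contain at least two centres.
    """
    v = min(max(value, bins[0]), bins[-1])          # clamp into the supported range
    k = min(bisect.bisect_right(bins, v) - 1, len(bins) - 2)
    w_hi = (v - bins[k]) / (bins[k + 1] - bins[k])  # weight on the upper neighbour
    target = [0.0] * len(bins)
    target[k] = 1.0 - w_hi
    target[k + 1] = w_hi
    return target


def decode(probs, bins):
    """Recover a scalar as the expectation of the bin centres."""
    return sum(p * b for p, b in zip(probs, bins))


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))
```

With bins `[-1.0, 0.0, 1.0, 2.0]`, the value 0.25 is encoded as weight 0.75 on the bin at 0.0 and 0.25 on the bin at 1.0, and `decode` maps that target back to 0.25. A reward or return head would then be trained with `cross_entropy` against such targets instead of a squared-error loss.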

Abstract:
The Driving World Model (DWM), which focuses on predicting scene evolution during the driving process, has emerged as a promising paradigm in the pursuit of autonomous driving (AD). DWMs enable AD systems to better perceive, understand, and interact with dynamic driving environments. In this survey, we provide a comprehensive overview of the latest progress in DWM. First, we review the DWM ecosystem, which is constructed using mainstream simulators, high‑impact datasets, and various metrics that evaluate DWMs across multiple dimensions. We then categorize existing approaches based on the modalities of the predicted scenes, including video, point cloud, occupancy, latent feature, and traffic map, and summarize their specific applications in AD research. In addition, the performance of representative approaches across generating and driving tasks is presented. Finally, we discuss the potential limitations of current research and propose future directions. This survey provides valuable insights into the development and application of DWM, fostering its broader adoption in AD. The relevant papers are collected at https://github.com/LMD0311/Awesome‑World‑Model.

Abstract:
We propose Heterogeneous Masked Autoregression (HMA) for modeling action‑video dynamics to generate high‑quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre‑training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than the previous robotic video generation models with 15 times faster speed in the real world. After post‑training, this model can be used as a video simulator from low‑level action inputs for evaluating policies and generating synthetic data. See this link https://liruiw.github.io/hma for more information.

Abstract:
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment‑time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low‑level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open‑world reasoning capabilities. However, off‑the‑shelf VLMs struggle to understand the consequences of low‑level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open‑vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low‑level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation, natural language, and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering. Videos can be found on the project website: https://yilin‑wu98.github.io/forewarn/.

Abstract:
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large‑scale pre‑trained world models on top of this low‑dimensional sensor information. In this work, we explore pre‑training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in‑context. Pre‑training TrajWorld on UniTraj yields substantial gains in transition prediction, achieves a new state‑of‑the‑art for off‑policy evaluation, and also delivers superior online performance of model predictive control. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. Code and data are available at https://github.com/thuml/TrajWorld.

Abstract:
Generating reactive driving behaviors in complex urban environments remains challenging because road users' intentions are unknown. Model‑based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation in relevant research lies in scene‑level reconstruction representation learning, which may overlook key interactive vehicles and struggles to model the interactive features among vehicles and their long‑term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual‑level perspective and captures vehicles' interactive relations and their intentions via a trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM's imagination and effectively navigates urban driving scenes by leveraging intention‑aware latent states. The proposed method is trained and evaluated on simulation environments built upon challenging real‑world interactive scenarios. Compared with popular model‑free and state‑of‑the‑art model‑based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.

Abstract:
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's‑Eye View (BEV) representation to consolidate multi‑view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive‑nuScenes datasets to validate the effectiveness of our method. HERMES achieves state‑of‑the‑art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.

Abstract:
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision‑making. World models have emerged as a linchpin technology, offering high‑fidelity representations of the driving environment that integrate multi‑sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three‑tiered taxonomy: (i) Generation of Future Physical World, covering Image‑, BEV‑, OG‑, and PC‑based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule‑driven and learning‑based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (iii) Interaction between Prediction and Planning, achieving multi‑agent collaborative decision‑making through latent space diffusion and memory‑augmented architectures. The study further analyzes training paradigms, including self‑supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self‑supervised representation learning, multimodal fusion, and advanced simulation to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.

Abstract:
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics ‑‑ or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics‑IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics‑iq.github.io; code at https://github.com/google‑deepmind/physics‑IQ‑benchmark.

Abstract:
Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision‑making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real‑world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT‑based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next‑token prediction with random valid moves, OthelloGPT shows meaningful layer‑wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long‑term planning. We study the progression of linear probe accuracy and tile color using both SAEs and linear probes to compare their effectiveness at capturing what the model is learning. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT‑JS/OthelloSAE.

Authors: NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski

Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general‑purpose world model that can be fine‑tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre‑trained world foundation models, examples of post‑training of pre‑trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open‑source and our models open‑weight with permissive licenses available via https://github.com/nvidia‑cosmos/cosmos‑predict1.

Abstract:
Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video‑based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT‑style world model for autonomous driving, featuring several spatial‑temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high‑fidelity, long‑duration video generation. Specifically, we propose a next‑state prediction strategy to model temporal coherence between consecutive frames and apply a next‑token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long‑term drifting issues and enable precise control. Our work demonstrates the ability to produce high‑fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state‑of‑the‑art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.

Abstract:
Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre‑training representations or generative models, also referred to as world models, using large‑scale visual datasets. This paper presents an end‑to‑end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end‑to‑end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre‑trained on large‑scale robotic datasets, such as DROID, and can be adapted to real‑world scenarios with little fine‑tuning data. Thanks to large‑scale, end‑to‑end training and the synergy between vision and action, Seer significantly outperforms previous methods across both simulation and real‑world experiments. It achieves improvements of 13% on the LIBERO‑LONG benchmark, 21% on CALVIN ABC‑D, and 43% in real‑world tasks. Notably, Seer sets a new state‑of‑the‑art on the CALVIN ABC‑D benchmark, achieving an average length of 4.28, and exhibits superior generalization to novel objects, lighting conditions, and environments under high‑intensity disturbances in real‑world scenarios. Code and models are publicly available at https://github.com/OpenRobotLab/Seer/.

Abstract:
A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hallucinations that make them unsuitable for real‑world robotics applications. To overcome those challenges, we propose to rethink robot world models as learnable digital twins. We introduce DreMa, a new approach for constructing digital twins automatically using learned explicit representations of the real world and its dynamics, bridging the gap between traditional digital twins and world models. DreMa replicates the observed world and its structure by integrating Gaussian Splatting and physics simulators, allowing robots to imagine novel configurations of objects and to predict the future consequences of robot actions thanks to its compositionality. We leverage this capability to generate new data for imitation learning by applying equivariant transformations to a small set of demonstrations. Our evaluations across various settings demonstrate significant improvements in accuracy and robustness by incrementing actions and object distributions, reducing the data needed to learn a policy and improving the generalization of the agents. As a highlight, we show that a real Franka Emika Panda robot, powered by DreMa's imagination, can successfully learn novel physical tasks from just a single example per task variation (one‑shot policy learning). Our project page can be found at https://dreamtomanipulate.github.io/.

Abstract:
Humans possess the visual‑spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million‑scale video datasets also "think in space" from videos? We present a novel video‑based visual‑spatial intelligence benchmark (VSI‑Bench) of over 5,000 question‑answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual‑spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain‑of‑thought, self‑consistency, tree‑of‑thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question‑answering enhances MLLMs' spatial distance ability.

Abstract:
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. To incorporate sequential inputs, most existing methods fuse representations from previous frames to infer the current 3D occupancy. However, they fail to consider the continuity of driving scenarios and ignore the strong prior provided by the evolution of 3D scenes (e.g., only dynamic objects move). In this paper, we propose a world‑model‑based framework to exploit the scene evolution for perception. We reformulate 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input. We decompose the scene evolution into three factors: 1) ego motion alignment of static scenes; 2) local movements of dynamic objects; and 3) completion of newly‑observed scenes. We then employ a Gaussian world model (GaussianWorld) to explicitly exploit these priors and infer the scene evolution in the 3D Gaussian space considering the current RGB observation. We evaluate the effectiveness of our framework on the widely used nuScenes dataset. Our GaussianWorld improves the performance of the single‑frame counterpart by over 2% in mIoU without introducing additional computations. Code: https://github.com/zuosc19/GaussianWorld.

Abstract:
End‑to‑end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open‑loop and suffer from weak scalability, lack of high‑order interactions, and inefficient decision‑making. In this paper, we explore a closed‑loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe‑1) for unified perception, prediction, and planning. We formulate autonomous driving as a next‑token generation problem and use multi‑modal tokens to accomplish different tasks. Specifically, we use free‑form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position‑aware tokenizer to effectively encode action into discrete tokens. We train a multi‑modal transformer to autoregressively generate perception, prediction, and planning tokens in an end‑to‑end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe‑1 in various tasks including visual question‑answering, action‑conditioned video generation, and motion planning. Code: https://github.com/wzzheng/Doe.

Abstract:
Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general‑purpose large vision models. While they can only generate short videos each time, existing methods achieve long video generation by iteratively calling the VGMs, using the last‑frame output as the condition for the next‑round generation. However, the last frame only contains short‑term fine‑grained information about the scene, resulting in inconsistency in the long horizon. To address this, we propose an Omni World modeL (Owl‑1) to produce long‑term coherent and comprehensive conditions for consistent long video generation. As videos are observations of the underlying evolving world, we propose to model the long‑term developments in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable which can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics which in turn update the state variable. The interaction between evolving dynamics and persistent state enhances the diversity and consistency of the long videos. Extensive experiments show that Owl‑1 achieves comparable performance with SOTA methods on VBench‑I2V and VBench‑Long, validating its ability to generate high‑quality video observations. Code: https://github.com/huang‑yh/Owl.

Abstract:
Autonomous driving requires robust perception models trained on high‑quality, large‑scale multi‑view driving videos for tasks like 3D object detection, segmentation and trajectory prediction. While world models provide a cost‑effective solution for generating realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationships like occlusion and spatial consistency, and temporal consistency. To address these, we propose DrivePhysica, an innovative model designed to generate realistic multi‑view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in physical principles, we achieve state‑of‑the‑art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks. Our project homepage: https://metadrivescape.github.io/papers_project/DrivePhysica/page.html

Abstract:
We aim to develop a model‑based planning framework for world models that can be scaled with increasing model and data budgets for general‑purpose manipulation tasks with only language and vision inputs. To this end, we present FLow‑centric generative Planning (FLIP), a model‑based planning algorithm on visual space that features three key modules: 1. a multi‑modal flow generation model as the general‑purpose action proposal module; 2. a flow‑conditioned video generation model as the dynamics module; and 3. a vision‑language representation learning model as the value module. Given an initial image and language instruction as the goal, FLIP can progressively search for long‑horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long‑horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long‑horizon video generation. In addition, the synthesized flow and video plans can guide the training of low‑level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP can improve both the success rates and quality of long‑horizon video plan synthesis and has the interactive world model property, opening up wider applications for future works. Video demos are on our website: https://nus‑lins‑lab.github.io/flipweb/.

Abstract:
Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and text‑semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle this challenge, we propose a novel knowledge graph, namely Character Graph (CG), which comprehensively represents various story‑related knowledge, including the characters, the attributes related to characters, and the relationship between characters. We then introduce StoryWeaver, an image generator that achieves Customization via Character Graph (C‑CG), capable of consistent story visualization with rich text semantics. To further improve the multi‑character generation performance, we incorporate knowledge‑enhanced spatial guidance (KE‑SG) into StoryWeaver to precisely inject character semantics into generation. To validate the effectiveness of our proposed method, extensive experiments are conducted using a new benchmark called TBC‑Bench. The experiments confirm that our StoryWeaver excels not only in creating vivid visual story plots but also in accurately conveying character identities across various scenarios with considerable storage efficiency, e.g., achieving an average increase of +9.03% DINO‑I and +13.44% CLIP‑T. Furthermore, ablation experiments are conducted to verify the superiority of the proposed module. Codes and datasets are released at https://github.com/Aria‑Zhangjl/StoryWeaver.

Abstract:
Planning in complex environments requires an agent to efficiently query a world model to find a feasible sequence of actions from start to goal. Recent work has shown that Large Language Models (LLMs), with their rich prior knowledge and reasoning capabilities, can potentially help with planning by searching over promising states and adapting to feedback from the world. In this paper, we propose and study two fundamentally competing frameworks that leverage LLMs for query‑efficient planning. The first uses LLMs as a heuristic within a search‑based planner to select promising nodes to expand and propose promising actions. The second uses LLMs as a generative planner to propose an entire sequence of actions from start to goal, query a world model, and adapt based on feedback. We show that while both approaches improve upon comparable baselines, using an LLM as a generative planner results in significantly fewer interactions. Our key finding is that the LLM as a planner can more rapidly adapt its planning strategies based on immediate feedback than LLM as a heuristic. We present evaluations and ablations on Robotouille and PDDL planning benchmarks and discuss connections to existing theory on query‑efficient planning algorithms. Code is available at https://github.com/portal‑cornell/llms‑for‑planning

Abstract:
Recent work in Offline Reinforcement Learning (RL) has shown that a unified Transformer trained under a masked auto‑encoding objective can effectively capture the relationships between different modalities (e.g., states, actions, rewards) within given trajectory datasets. However, this information has not been fully exploited during the inference phase, where the agent needs to generate an optimal policy instead of just reconstructing masked components from unmasked ones. Given that a pretrained trajectory model can act as both a Policy Model and a World Model with appropriate mask patterns, we propose using Model Predictive Control (MPC) at test time to leverage the model's own predictive capability to guide its action selection. Empirical results on D4RL and RoboMimic show that our inference‑phase MPC significantly improves the decision‑making performance of a pretrained trajectory model without any additional parameter training. Furthermore, our framework can be adapted to Offline to Online (O2O) RL and Goal Reaching RL, resulting in more substantial performance gains when an additional online interaction budget is provided, and better generalization capabilities when different task targets are specified. Code is available: https://github.com/wkh923/m3pc.
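
The general idea of test-time MPC, scoring candidate action sequences with a predictive model and executing only the first action of the best one, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: `world_model` is a hypothetical 1-D dynamics function standing in for a pretrained trajectory model, and candidates are enumerated exhaustively over a small discrete action set rather than proposed by a policy.

```python
from itertools import product


def world_model(state, action):
    """Hypothetical dynamics: the action shifts the state; reward is higher near 0."""
    next_state = state + action
    reward = -abs(next_state)
    return next_state, reward


def mpc_action(state, actions=(-1, 0, 1), horizon=3):
    """Return the first action of the candidate sequence with the best predicted return."""
    best_return = float("-inf")
    best_first_action = actions[0]
    # Assumption: exhaustive enumeration replaces policy-based proposals here.
    for candidate in product(actions, repeat=horizon):
        simulated_state, total = state, 0
        for action in candidate:
            simulated_state, reward = world_model(simulated_state, action)
            total += reward
        if total > best_return:
            best_return, best_first_action = total, candidate[0]
    return best_first_action
```

In a receding-horizon loop, `mpc_action` would be called again at every environment step with the newly observed state.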

Abstract:
Autonomous driving systems struggle with complex scenarios due to limited access to diverse, extensive, and out‑of‑distribution driving data which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state‑of‑the‑art performance in high fidelity, consistency, and diversity with minute‑scale video generation. InfinityDrive introduces an efficient spatio‑temporal co‑modeling module paired with an extended temporal training strategy, enabling high‑resolution (576×1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (more than 2 minutes). Comprehensive experiments in multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next‑generation driving world model built for the evolving demands of autonomous driving. Our project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html

Abstract:
Identifying predictive world models for robots in novel environments from sparse online observations is essential for robot task planning and execution. However, existing methods that leverage differentiable programming to identify world models are incapable of jointly optimizing the geometry, appearance, and physical properties of the scene. In this work, we introduce a novel rigid object representation that allows the joint identification of these properties. Our method employs a novel differentiable point‑based geometry representation coupled with a grid‑based appearance field, which allows differentiable object collision detection and rendering. Combined with a differentiable physical simulator, we achieve end‑to‑end optimization of world models, given the sparse visual and tactile observations of a physical motion sequence. Through a series of world model identification tasks in simulated and real environments, we show that our method can learn both simulation‑ and rendering‑ready world models from only one robot action sequence. The code and additional videos are available at our project website: https://tianyi20.github.io/rigid‑world‑model.github.io/

Abstract:
The technical report introduces O1‑CODER, an attempt to replicate OpenAI's o1 model with a focus on coding tasks. It integrates reinforcement learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model's System‑2 thinking capabilities. The framework includes training a Test Case Generator (TCG) for standardized code testing, using MCTS to generate code data with reasoning processes, and iteratively fine‑tuning the policy model to initially produce pseudocode and then generate the full code. The report also addresses the opportunities and challenges in deploying o1‑like models in real‑world applications, suggesting transitioning to the System‑2 paradigm and highlighting the imperative for world model construction. Updated model progress and experimental results will be reported in subsequent versions. All source code, curated datasets, as well as the derived models are disclosed at https://github.com/ADaM‑BJTU/O1‑CODER .

Abstract:
This technical report summarizes the second‑place solution for the Predictive World Model Challenge held at the CVPR‑2024 Workshop on Foundation Models for Autonomous Systems. We introduce D^2‑World, a novel world model that effectively forecasts future point clouds through Decoupled Dynamic flow. Specifically, the past semantic occupancies are obtained via existing occupancy networks (e.g., BEVDet). Following this, the occupancy results serve as the input for a single‑stage world model, generating future occupancy in a non‑autoregressive manner. To further simplify the task, dynamic voxel decoupling is performed in the world model. The model generates future dynamic voxels by warping the existing observations through voxel flow, while the remaining static voxels can be easily obtained through pose transformation. As a result, our approach achieves state‑of‑the‑art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model. Code is available at https://github.com/zhanghm1995/D2‑World.

Abstract:
The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT‑4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision‑making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua‑fib‑lab/World‑Model.

Abstract:
Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open‑world model that generalizes to any part in any object based on any text query. Even when evaluated zero‑shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general‑category open‑world 3D part segmentation, we release a benchmark covering a wide range of objects and parts. Project website: https://ziqi‑ma.github.io/find3dsite/

Abstract:
Recent advancements in large‑scale multi‑task robot learning offer the potential for deploying robot fleets in household and industrial settings, enabling them to perform diverse tasks across various environments. However, AI‑enabled robots often face challenges with generalization and robustness when exposed to real‑world variability and uncertainty. We introduce Sirius‑Fleet, a multi‑task interactive robot fleet learning framework to address these challenges. Sirius‑Fleet monitors robot performance during deployment and involves humans to correct the robot's actions when necessary. We employ a visual world model to predict the outcomes of future actions and build anomaly predictors to predict whether they will likely result in anomalies. As the robot autonomy improves, the anomaly predictors automatically adapt their prediction criteria, leading to fewer requests for human intervention and gradually reducing human workload over time. Evaluations on large‑scale benchmarks demonstrate Sirius‑Fleet's effectiveness in improving multi‑task policy performance and monitoring accuracy. We demonstrate Sirius‑Fleet's performance in both RoboCasa in simulation and Mutex in the real world, two diverse, large‑scale multi‑task benchmarks. More information is available on the project website: https://ut‑austin‑rpl.github.io/sirius‑fleet

Abstract:
Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRL with LLMs to enable causally‑aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally‑aware method outperforming LLM‑based reasoners, especially for longer planning horizons.

Abstract:
Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high‑dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward‑free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft‑Q learning objective, reformulating the optimization process in the Q‑policy space to mitigate the instability associated with traditional optimization in the reward‑policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert‑level performance in tasks with high‑dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.

Abstract:
A longstanding goal of artificial general intelligence is highly capable generalists that can learn from diverse experiences and generalize to unseen tasks. The language and vision communities have seen remarkable progress toward this goal by scaling up transformer‑based models trained on massive datasets, while reinforcement learning (RL) agents still suffer from poor generalization capacity under such paradigms. To tackle this challenge, we propose Meta Decision Transformer (Meta‑DT), which leverages the sequential modeling ability of the transformer architecture and robust task representation learning via world model disentanglement to achieve efficient generalization in offline meta‑RL. We pretrain a context‑aware world model to learn a compact task representation, and inject it as a contextual condition to the causal transformer to guide task‑oriented sequence generation. Then, we subtly utilize history trajectories generated by the meta‑policy as a self‑guided prompt to exploit the architectural inductive bias. We select the trajectory segment that yields the largest prediction error on the pretrained world model to construct the prompt, aiming to encode task‑specific information complementary to the world model maximally. Notably, the proposed framework eliminates the requirement of any expert demonstration or domain knowledge at test time. Experimental results on MuJoCo and Meta‑World benchmarks across various dataset types show that Meta‑DT exhibits superior few‑shot and zero‑shot generalization capacity compared to strong baselines while being more practical with fewer prerequisites. Our code is available at https://github.com/NJU‑RL/Meta‑DT.
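
The prompt-selection step described above, choosing the trajectory segment with the largest world-model prediction error, might look like the sliding-window sketch below. This is a minimal illustration under assumptions, not Meta‑DT's implementation: the function name `select_prompt_segment`, the per-step error list, and the choice to aggregate errors by summing over a fixed-length window are all hypothetical.

```python
def select_prompt_segment(errors, segment_len):
    """Return (start, end) of the fixed-length window with the largest summed error.

    `errors[t]` is assumed to hold the world model's prediction error at step t.
    Ties are resolved in favor of the earliest window.
    """
    if segment_len <= 0 or segment_len > len(errors):
        raise ValueError("segment_len must be between 1 and len(errors)")
    window_sum = sum(errors[:segment_len])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(errors) - segment_len + 1):
        # Slide the window by one step: add the new element, drop the old one.
        window_sum += errors[start + segment_len - 1] - errors[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + segment_len
```

The returned indices would then slice the history trajectory to form the prompt, for example `trajectory[start:end]`.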

Abstract:
Offline reinforcement learning (RL) is a powerful approach for data‑driven decision‑making and control. Compared to model‑free methods, offline model‑based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte‑Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state‑of‑the‑art offline RL methods on twelve D4RL MuJoCo tasks and three challenging, stochastic tokamak control tasks. The codebase is available at: https://github.com/LucasCJYSDL/Offline‑RL‑Kit.

Abstract:
We propose DOME, a diffusion‑based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video‑based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality‑agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features: (1) High‑Fidelity and Long‑Duration Generation. We adopt a spatial‑temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial‑temporal information, enabling high‑fidelity details and the ability to generate predictions over long durations. (2) Fine‑grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model's ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state‑of‑the‑art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.

Abstract:
Model‑based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model‑free RL algorithms. However, learning a robust world model often requires complex and deep architectures, which are computationally expensive and challenging to train. Within the world model, sequence models play a critical role in accurate predictions, and various architectures have been explored, each with its own challenges. Currently, recurrent neural network (RNN)‑based world models struggle with vanishing gradients and capturing long‑term dependencies. Transformers, on the other hand, suffer from the quadratic memory and computational complexity of self‑attention mechanisms, scaling as O(n^2), where n is the sequence length. To address these challenges, we propose a state space model (SSM)‑based world model, Drama, specifically leveraging Mamba, that achieves O(n) memory and computational complexity while effectively capturing long‑term dependencies and enabling efficient training with longer sequences. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early training stages. Combining these techniques, Drama achieves a normalised score on the Atari100k benchmark that is competitive with other state‑of‑the‑art (SOTA) model‑based RL algorithms, using only a 7 million‑parameter world model. Drama is accessible and trainable on off‑the‑shelf hardware, such as a standard laptop. Our code is available at https://github.com/realwenlongwang/Drama.git.
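
The complexity contrast drawn above can be made concrete with a toy example. The sketch below is not Mamba or Drama: it is a fixed-parameter scalar linear recurrence (Mamba's parameters are input-dependent and its scan is hardware-aware), and the names `linear_ssm_scan` and `attention_score_count` as well as the parameters `a`, `b`, `c` are assumptions for illustration.

```python
def linear_ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence with a single carried state value:
    O(n) time for a sequence of length n.
    """
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = a * hidden + b * x
        outputs.append(c * hidden)
    return outputs


def attention_score_count(n):
    """Number of pairwise scores full self-attention forms for n tokens: n * n."""
    return n * n
```

Doubling the sequence length doubles the work of the scan but quadruples the number of attention scores, which is the scaling gap the abstract refers to.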

Abstract:
Learning a latent dynamics model provides a task‑agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model‑based reinforcement learning (RL) holds the potential to improve sample efficiency over model‑free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment's state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot‑Attention for Object‑centric Latent Dynamics (SOLD), a novel model‑based RL algorithm that learns object‑centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD‑MPC2, two state‑of‑the‑art model‑based RL algorithms, across a range of benchmark robotic environments that require relational reasoning and manipulation capabilities. Videos are available at https://slot‑latent‑dynamics.github.io/.

Abstract:
Zero‑shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high‑level goal selector, and a low‑level goal‑conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long‑term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal‑conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non‑myopic, zero‑shot imitation, as we demonstrate in complex, continuous benchmarks. The code is available at https://github.com/martius‑lab/zilot.

Abstract:
Can large language models (LLMs) directly serve as powerful world models for model‑based agents? While gaps between the prior knowledge of LLMs and the specified environment's dynamics do exist, our study reveals that these gaps can be bridged by aligning an LLM with its deployed environment, and that such "world alignment" can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient‑free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent‑explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent "WALL‑E" is built upon model‑predictive control (MPC). By optimizing look‑ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL‑E's reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open‑world challenges in Minecraft and ALFWorld, WALL‑E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL‑E exceeds baselines by 15‑30% in success rate while costing 8‑20 fewer replanning rounds and only 60‑80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% after only 6 iterations.

Abstract:
High‑quality video generation, encompassing text‑to‑video (T2V), image‑to‑video (I2V), and video‑to‑video (V2V) generation, is significant both for content creation, where it helps anyone express their creativity in new ways, and for world simulation, where it supports modeling and understanding the world. Models like SORA have advanced the generation of videos with higher resolution, more natural motion, better vision‑language alignment, and increased controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter‑rich DiT models, along with large‑scale data expansion and refined training strategies. However, despite the emergence of DiT‑based closed‑source and open‑source models, a comprehensive investigation into their capabilities and limitations remains lacking. Furthermore, the rapid development has made it challenging for recent benchmarks to fully cover SORA‑like models and recognize their significant advancements. Additionally, evaluation metrics often fail to align with human preferences.

Abstract:
Training visual reinforcement learning agents in a high‑dimensional open world presents significant challenges. While various model‑based methods have improved sample efficiency by learning interactive world models, these agents tend to be "short‑sighted", as they are typically trained on short snippets of imagined experiences. We argue that the primary challenge in open‑world decision‑making is improving the exploration efficiency across a vast state space, especially for tasks that demand consideration of long‑horizon payoffs. In this paper, we present LS‑Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long‑term feedback. The foundation of our approach is to build a long short‑term world model. To achieve this, we simulate goal‑conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long‑term values into behavior learning. Our method demonstrates significant improvements over state‑of‑the‑art techniques in MineDojo.

Abstract:
In recent years, end‑to‑end autonomous driving architectures have gained increasing attention due to their advantage in avoiding error accumulation. Most existing end‑to‑end autonomous driving methods are based on Imitation Learning (IL), which can quickly derive driving strategies by mimicking expert behaviors. However, IL often struggles to handle scenarios outside the training dataset, especially in high‑dynamic and interaction‑intensive traffic environments. In contrast, Reinforcement Learning (RL)‑based driving models can optimize driving decisions through interaction with the environment, improving adaptability and robustness. To leverage the strengths of both IL and RL, we propose RAMBLE, an end‑to‑end world model‑based RL method for driving decision‑making. RAMBLE extracts environmental context information from RGB images and LiDAR data through an asymmetrical variational autoencoder. A transformer‑based architecture is then used to capture the dynamic transitions of traffic participants. Next, an actor‑critic structure reinforcement learning algorithm is applied to derive driving strategies based on the latent features of the current state and dynamics. To accelerate policy convergence and ensure stable training, we introduce a training scheme that initializes the policy network using IL, and employs KL loss and soft update mechanisms to smoothly transition the model from IL to RL. RAMBLE achieves state‑of‑the‑art performance in route completion rate on the CARLA Leaderboard 1.0 and completes all 38 scenarios on the CARLA Leaderboard 2.0, demonstrating its effectiveness in handling complex and dynamic traffic scenarios. The model will be open‑sourced upon paper acceptance at https://github.com/SCP‑CN‑001/ramble to support further research and development in autonomous driving.

Abstract:
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high‑level task descriptions to long‑horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self‑refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end‑to‑end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self‑refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed‑loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome‑Env benchmark, showing advanced performance with improved scaling w.r.t. inference‑time computation. Code is available at https://github.com/Singularity0104/equilibrium‑planner.

Abstract:
A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation‑based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly‑Optimized World‑Action model, an offline model‑based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general‑purpose representation and decision‑making ability. Our method jointly optimizes a world‑action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q‑value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human‑level performance on pretrained games using only 10% subsampled offline data, outperforming existing state‑of‑the‑art large‑scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample‑efficiently transfer to novel games using only 5k offline fine‑tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release codes and model weights at https://github.com/CJReinforce/JOWA

Abstract:
Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high‑dimensional visual input is often data‑inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model‑based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real‑world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real‑world experiments demonstrate that WMP outperforms state‑of‑the‑art baselines in traversability and robustness. Videos and Code are available at: https://wmp‑loco.github.io/.

Abstract:
The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self‑delusion. In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework that models the environment's next states and the ego vehicle's possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision‑making is captured. Additionally, the self‑delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimental results on the recently released closed‑loop benchmark Waymax demonstrate that LatentDriver surpasses state‑of‑the‑art reinforcement learning and imitation learning methods, achieving expert‑level performance. The code and models will be made available at https://github.com/Sephirex‑X/LatentDriver.

Abstract:
Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP‑centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self‑revision schedules to help the agent excel in sparse‑reward, continuous action, goal‑based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state‑of‑the‑art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at https://github.com/NACLab/robust‑active‑inference.

Abstract:
Articulated objects are ubiquitous in daily life. In this paper, we present DexSim2Real^2, a novel framework for goal‑conditioned articulated object manipulation. The core of our framework is constructing an explicit world model of unseen articulated objects through active interactions, which enables sampling‑based model predictive control to plan trajectories achieving different goals without requiring demonstrations or RL. It first predicts an interaction using an affordance network trained on self‑supervised interaction data or videos of human manipulation. After executing the interactions on the real robot to move the object parts, we propose a novel modeling pipeline based on 3D AIGC to build a digital twin of the object in simulation from multiple frames of observations. For dexterous hands, we utilize eigengrasp to reduce the action dimension, enabling more efficient trajectory searching. Experiments validate the framework's effectiveness for precise manipulation using a suction gripper, a two‑finger gripper and two dexterous hands. The generalizability of the explicit world model also enables advanced manipulation strategies like manipulating with tools.

Abstract:
World models are increasingly pivotal in interpreting and simulating the rules and actions of complex environments. Genie, a recent model, excels at learning from visually diverse environments but relies on costly human‑collected data. We observe that their alternative method of using random agents is too limited to explore the environment. We propose to improve the model by employing reinforcement learning based agents for data generation. This approach produces diverse datasets that enhance the model's ability to adapt and perform well across various scenarios and realistic actions within the environment. In this paper, we first release the model GenieRedux, an implementation based on Genie. Additionally, we introduce GenieRedux‑G, a variant that uses the agent's readily available actions to factor out action prediction uncertainty during validation. Our evaluation, including a replication of the Coinrun case study, shows that GenieRedux‑G achieves superior visual fidelity and controllability using the trained agent exploration. The proposed approach is reproducible, scalable and adaptable to new types of environments. Our codebase is available at https://github.com/insait‑institute/GenieRedux .

Abstract:
Recently, world models have emerged as a promising paradigm for building intelligent agents by learning predictive models that estimate future environment states conditioned on observations and actions. In particular, JEPA‑style latent world models provide an efficient alternative to pixel space prediction by learning action‑conditioned dynamics in compact representation spaces. However, existing latent world models typically rely on one‑step prediction and must be recursively rolled out for long‑horizon planning, which leads to compounding errors and a mismatch between training objectives and downstream planning tasks. To address this limitation, we propose Variable‑length Latent World Models (VLWMs), a framework that learns to predict future latent states conditioned on action sequences of variable lengths. Instead of training only on one‑step transitions, VLWMs directly model temporally extended dynamics, allowing the same predictor to evaluate action plans over different horizons. We further introduce a curriculum training strategy that progressively expands the action horizon, stabilizing optimization from short‑range dynamics to long‑range prediction. At test time, we design planning methods tailored to VLWMs to better exploit their variable‑length predictive capabilities. Experiments on long‑horizon control tasks show that VLWMs significantly improve latent space world models, achieving 13% average improvement over the state‑of‑the‑art LeWM across different datasets, with especially large gains on tasks requiring extended planning. These results suggest that VLWM provides a simple yet effective paradigm for improving long‑horizon prediction and planning in latent world models.

Abstract:
Occluded tasks remain a bottleneck in robot manipulation. Existing solutions either deploy additional physical cameras requiring training‑inference camera parity, or rely on explicit 3D reconstruction with high computational cost. Moreover, both approaches rely on standard agent‑view and wrist‑view observations, while failing to capture occlusion information and future scene evolution. To this end, we propose UniviewVLA, a unified multiview Vision‑Language‑Action model with world modeling, which infers multiview scene evolution for action prediction from only standard two‑camera observations. We demonstrate that by leveraging generated multiview future views from the world model, UniviewVLA reveals occluded cues and models future scene evolution, improving action prediction and removing the need for extra hardware or explicit reconstruction. Besides, to accelerate inference while preserving prediction accuracy, UniviewVLA develops Motion‑Informative Token Compression, which compresses each generated view from 625 to 16 tokens and reduces per‑view latency from 6‑7s to 0.2‑0.3s. UniviewVLA also proposes training‑free Action‑Entropy View Selection, which dynamically identifies the most action‑informative view at different inference stages. Extensive experiments show that UniviewVLA achieves 95.8% on LIBERO and 4.60 on CALVIN ABCD to D, both standard occlusion‑free benchmarks. On customized occlusion‑focused tasks, it improves success rate from 40.0% to 73.3%, and average real‑robot success rate by 33.4 points, demonstrating stronger occlusion‑focused performance without sacrificing standard occlusion‑free benchmarks.

Abstract:
Social intelligence is a core competency for language agents, yet current research primarily focuses on static capability evaluation rather than how these skills are continuously shaped and accumulated. This gap calls for a shift toward sustainable learning paradigms. Currently, two methodological pain points exist: social interaction trajectories lack unified structured representations to form iterable learning signals, and capability improvement and retention are typically studied in isolation, hindering the assessment of continuous evolution. To bridge this gap, we propose the Social World Model. We decompose social interaction into five dimensions (scene setting, observation, mental state, action, and dialogue) to build a closed‑loop learning framework. In this setup, agents collect interaction experiences, convert them into preference signals for model updating, and redeploy the updated policy for continued learning. Additionally, we provide a reusable data synthesis mechanism and a lifelong learning benchmark, transforming social capabilities from an "object of evaluation" into an "object of sustainable training". Validating our framework on the ASCENT‑Bench, the interactively trained Qwen2.5‑7B model outperforms its baseline across all five core metrics. Notably, it matches the closed‑source Gemini 3 Flash in completion rate, exceeds it in pass rate, and achieves zero forgetting across three difficulty levels. Unlike prior works that merely report static comparisons or capability decay, this end‑to‑end approach provides a trainable, verifiable, and retainable pathway, demonstrating that small open‑source models can sustainably acquire competitive social coordination capabilities.

Abstract:
Sparse rewards pose a central challenge in reinforcement learning, since agents receive no informative signal until they reach their goal. Intrinsic‑reward methods address this issue by optimizing non‑stationary objectives such as novelty, prediction error, or skill diversity, thereby injecting a supervision signal into the problem. While effective, these methods often require that the extrinsic (sparse) reward can be evaluated ‑‑ either online or during offline relabeling of the stored transitions. This limitation is particularly vexing for multi‑task, meta‑, and continual reinforcement learning, where agents' interactions with the environment are usually reward‑free. In this work, we present a method to pre‑train transferable exploration policies that rapidly adapt to sparse rewards at downstream task time. Our objective maximizes state‑space covering for the occupancy measure, and can be framed in terms of entropy maximization. Its algorithmic implementation, ROVER, leverages recent advances on the operatorial formulation of RL to estimate occupancy with a learned resolvent world model, bypassing common hurdles associated with density and entropy estimation. ROVER further introduces a virtual "sink" state for unexplored regions, balancing coverage of known states with expansion into unseen ones and preventing cyclic expansion‑collapse behavior during learning. In tabular and pixel‑based sparse navigation tasks, ROVER produces more uniform aggregate coverage and stronger initializations for downstream tasks than standard reward‑free baselines.
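
The relation between a resolvent and an occupancy measure mentioned above can be illustrated in the tabular case. The following is a minimal sketch, not ROVER itself: it assumes a known row-stochastic transition matrix `P` induced by a fixed policy and an initial distribution `mu0` (both hypothetical inputs), computes the normalized discounted occupancy d = (1 - gamma) * mu0^T (I - gamma P)^{-1} through its Neumann series, and scores coverage by the entropy of d. ROVER, per the abstract, learns the resolvent rather than assuming `P` is given.

```python
import math

def discounted_occupancy(P, mu0, gamma=0.9, tol=1e-12):
    """Normalized discounted occupancy of a Markov chain.

    Evaluates (1 - gamma) * mu0^T (I - gamma P)^{-1} by summing the
    Neumann series sum_t gamma^t mu0^T P^t until the remaining mass
    falls below tol. P must be row-stochastic and mu0 must sum to 1.
    """
    n = len(mu0)
    term = list(mu0)          # gamma^t * mu0^T P^t, starting at t = 0
    total = [0.0] * n
    while sum(term) > tol:
        total = [a + b for a, b in zip(total, term)]
        term = [gamma * sum(term[i] * P[i][j] for i in range(n))
                for j in range(n)]
    return [(1.0 - gamma) * x for x in total]

def entropy(d):
    """Shannon entropy (in nats) of an occupancy vector."""
    return -sum(p * math.log(p) for p in d if p > 0.0)

# Two hypothetical policies on a 3-state chain, both starting in state 0.
P_stay = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # never moves
P_cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # visits all states
mu0 = [1.0, 0.0, 0.0]

h_stay = entropy(discounted_occupancy(P_stay, mu0))
h_cycle = entropy(discounted_occupancy(P_cycle, mu0))
```

Under these assumptions the cycling policy spreads occupancy over all three states and therefore has the higher entropy, which is the quantity an entropy-maximizing exploration objective would favor.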

Abstract:
Model‑based and model‑free reinforcement learning are traditionally viewed as separate paradigms: instead of learning a model of the transition kernel P, model‑free agents typically estimate value functions tied to a specific policy and reward. In this paper, we challenge this dichotomy by proving that value‑based agents trained on a sufficiently rich set of reward functions, e.g. using goal‑conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce P‑learning, an inverse analogue to Q‑learning that samples from an agent's Q‑values, policies and rewards to decode its internal model of the environment. We then provide sufficient conditions on the type and number of goals for which agents encode the true kernel P, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we empirically demonstrate that agents trained on a handful of reward functions encode accurate dynamics in Reacher, MountainCar and stochastic variants of FourRooms. Surprisingly, we find that policies trained exclusively on a Reacher agent's implicit world model are quasi‑optimal on out‑of‑distribution, velocity‑based goals despite position‑only training ‑‑ suggesting that agents contain hidden generalisation capabilities and providing a new lens into the connection between model‑based, model‑free, and goal‑conditioned RL.

Abstract:
Video world models are increasingly used in autonomous driving to forecast future scene evolution and provide future‑aware spatio‑temporal representations for downstream action prediction. In perception‑to‑action pipelines, these representations can directly influence ego‑vehicle waypoint planning, making the learned future dynamics a critical security‑sensitive component. Despite their promise, the training‑time security risks of autonomous‑driving video world models remain largely unexplored. We present BadDreamer, a transferable spatio‑temporal backdoor attack that targets the perception side of this pipeline. Unlike conventional backdoors that manipulate image labels, prompt outputs, or action supervision, BadDreamer poisons the learned transition dynamics of a video world model. It constructs trigger‑erasure sequences in which an oncoming yellow delivery rider is visible in the observed context frames but erased from the future frames. After fine‑tuning on a small fraction of such sequences, the compromised world model learns a hidden conditional association: when the physical trigger appears, it hallucinates a future where the rider disappears and the road appears clear. We further show that this corrupted future‑aware representation can transfer to the downstream action module without directly modifying ego‑trajectory labels, inducing unsafe non‑evasive waypoint predictions. Our experiments instantiate this attack on a representative open‑source perception‑to‑action pipeline, revealing a representation‑level safety risk in autonomous‑driving video world models and highlighting the need for backdoor‑aware validation beyond clean generation quality.

Abstract:
Vision‑Language‑Action (VLA) models enable general‑purpose robotic control via large‑scale multimodal pretraining, yet their effectiveness under few‑shot imitation learning remains limited. We conduct a systematic stress test of state‑of‑the‑art VLA models and show that performance degrades sharply as demonstrations are reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future‑oriented conditioning framework for data‑efficient VLA adaptation. FOCA combines explicit prediction of task‑grounded future interaction embeddings with implicit alignment to future goal observations, enabling long‑horizon reasoning in latent space without pixel‑level prediction. This formulation naturally supports action‑free co‑training with synthetic videos from video world models and can be interpreted as learning a future‑conditioned value‑like representation. Extensive experiments demonstrate FOCA achieves 95.7% success with 20 demonstrations on LIBERO, improves 7‑12% on RoboCasa, and delivers up to 26% absolute gains on real robots, establishing a new state of the art in few‑shot VLA adaptation.

Abstract:
Reliable spatial decision automation, such as autonomous driving and maritime surveillance, critically depends on robust visual perception. However, real‑world spatiotemporal data exhibits severe heterogeneity, often manifesting as extreme long‑tail distributions for safety‑critical scenarios. This data scarcity induces dataset shift that degrades detection performance and poses safety risks. While synthetic data generation offers a potential solution, existing generative approaches, such as diffusion models and Generative Adversarial Networks (GANs), often lack explicit spatial grounding and structural constraints, resulting in spatial and physical inconsistencies in generated scenes. To address these challenges, we introduce WMGen‑v1, an agentic text‑based world model framework for long‑tail spatial data generation. WMGen‑v1 employs a Large Vision‑Language Model (LVLM) to construct a structured scene representation from a single reference image, while a Large Language Model (LLM) performs guidance‑based scene expansion under physical plausibility and commonsense constraints. Subsequently, conditioned on the structured semantic representations produced by this reasoning process, a diffusion model generates diverse and physically grounded long‑tail training data. Experiments on internal industrial datasets, ROADWork, and LaRS benchmarks demonstrate that WMGen‑v1 outperforms baseline approaches. Notably, detectors trained solely on WMGen‑v1 synthetic data approach real‑only performance on aggregate dataset‑level metrics, highlighting its potential to alleviate long‑tail data scarcity for downstream spatial perception.

Abstract:
While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal‑LLM judges or require ad‑hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per‑frame embeddings produced by frozen image encoders. In aggregate, we call them GEOPHYS. First, we show that these signals correlate with human EEG responses to two forms of object‑permanence violations. Second, GEOPHYS robustly discriminates physically implausible videos from realistic ones, achieving state‑of‑the‑art physics‑violation detection: 98.3% on LikePhys and 93.3% on IntPhys2, whereas V‑JEPA 2, GPT‑4o, Gemini, and twelve modern video diffusion models perform near chance. Third, used as a best‑of‑N verifier for physical alignment during video generation, GEOPHYS lifts MAGI‑1 24B from 50.01% to 64.50% on PhysicsIQ at 1.5x lower wall‑clock and 4.65x lower memory than the V‑JEPA 2 world‑model verifier. Ultimately, GEOPHYS demonstrates that physical plausibility in videos can be assessed by leveraging the emergent geometric properties of temporal features extracted from image encoders.

Abstract:
Safe control is a prerequisite for real‑world embodied intelligence, for which safe reinforcement learning has emerged as a promising paradigm. However, existing safe reinforcement learning methods either require costly real‑world exploration or depend on hand‑crafted safety functions. Neither scales to vision‑language‑action models deployed in open‑world physical environments. We propose SafeDojo, the first model‑based safe reinforcement learning framework for vision‑language‑action policies designed to learn safe actions through world model‑based imagination. Specifically, SafeDojo performs online reinforcement learning on top of an interactive video world model. The world model generates action‑conditioned future predictions, from which a tailored ResNet success classifier estimates per‑step task progress from imagined frames and a lightweight safety head predicts per‑step safety costs from latent context together with the proposed action chunk, enabling simultaneous assessment of task execution and trajectory safety. The decoupled task‑reward and safety‑cost signals are balanced through a Lagrangian‑based constrained GRPO objective, enabling coordinated improvement of task success and safety under explicit constraints. On SafeLIBERO, SafeDojo achieves the best aggregate task success, safe success, and execution efficiency among inference‑time safety, model‑free RL, and model‑based RL baselines, with the best average safe‑success rate on both levels and an 8.25 percentage‑point improvement over the strongest baseline on Level I. Real‑world Franka deployment further shows the best average task and safe‑success rates across five tasks. Our results position world model‑based safe reinforcement learning as a scalable and generalizable path toward safe embodied intelligence.
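
How decoupled reward and cost signals can be balanced with a Lagrange multiplier is sketched below. This is a generic illustration under stated assumptions, not SafeDojo's objective: the function names, the per-rollout scalar `rewards` and `costs`, the `budget` on mean cost, and the learning rate are all hypothetical. Each rollout in a sampled group is scored as reward minus lambda times cost, scores are standardized within the group (the group-relative step used by GRPO-style methods), and lambda is raised by projected dual ascent whenever the mean cost exceeds the budget.

```python
import statistics

def group_advantages(rewards, costs, lam, eps=1e-8):
    """Group-relative advantages from a Lagrangian score r - lam * c.

    Scores are centered and scaled by the group's own mean and
    (population) standard deviation, so advantages sum to ~0.
    """
    scores = [r - lam * c for r, c in zip(rewards, costs)]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

def update_multiplier(lam, costs, budget, lr=0.1):
    """Projected dual ascent: grow lam when mean cost exceeds the budget,
    shrink it otherwise, and never let it go negative."""
    return max(0.0, lam + lr * (statistics.fmean(costs) - budget))

# Hypothetical group of four imagined rollouts.
rewards = [1.0, 0.0, 1.0, 0.0]   # task success signal per rollout
costs = [0.0, 0.0, 1.0, 1.0]     # safety cost per rollout
lam = 0.5

adv = group_advantages(rewards, costs, lam)
new_lam = update_multiplier(lam, costs, budget=0.25)
```

In this toy group the successful and safe rollout receives the largest advantage, and because the mean cost (0.5) is above the assumed budget (0.25), the multiplier increases, which weights safety more heavily in the next update.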

Abstract:
Video‑world‑model policies learn action‑relevant representations by predicting future observations. However, they condition on only a short observation window, which renders long‑horizon manipulation non‑Markovian when the correct action depends on earlier events that are no longer visible. We present MemoryVAM, an episodic memory mechanism for video‑world‑model policies. We employ a Recap‑Cue (RC) module, in which a Perceiver‑based Recap Compressor maps per‑frame CLIP embeddings into compact memory tokens, and a lightweight Cue Gate estimates task completion from memory and language. These tokens are injected into both the video backbone and the action decoder, aligning policy imagination with episode progress and conditioning actions on history. Our model trains the memory module with video prediction, a delta‑reconstruction auxiliary loss, and episode‑boundary supervision, requiring no per‑frame progress labels. The same mechanism applies to UNet and Diffusion Transformer (DiT) backbones by changing only the cross‑attention injection interface. On LIBERO‑Mem, our model improves average success from 5% to 42.5%. On real robots, it achieves 78.3% success on counting tasks, 80.0% on spatial recall, and 75.0% on sequential tracking. Project page: https://MemoryVAM.github.io/

Abstract:
Planning with world models is bottlenecked by compounding prediction errors and the difficulty of defining optimizable goals. Visual targets provide precise local gradients but poor distant guidance, while language is flexible yet limited by noisy cross‑modal alignment or dependence on large generative models unsuited for the high‑sampling nature of model‑based planning. To address these challenges, we introduce Latent Goal Prediction from Language (LAGO), a framework that predicts both sequences of intermediate goal states from language instructions and action‑conditioned rollouts, all within the same latent space. Rather than optimizing toward a single global objective, LAGO dynamically decomposes instructions into explicitly predicted, locally tractable latent subgoals. By updating these subgoals online and using a soft minimum trajectory cost during planning, LAGO enables an agent to follow coherent latent trajectories over long horizons. Evaluation across multiple environments and planning horizons shows that LAGO avoids the sharp degradation of prior methods. By achieving robust and precise long‑horizon planning purely from language, LAGO bridges the precision of visual goals with the flexibility of text‑guided control.
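
A soft minimum over per-step costs can be written with a log-sum-exp, as in the sketch below. The abstract does not give LAGO's exact cost, so everything here is an assumption for illustration: `rollout_cost` scores a predicted latent trajectory by the soft minimum of squared distances to a single latent subgoal, and `temperature` controls how closely the soft minimum tracks the hard minimum.

```python
import math

def soft_min(costs, temperature=1.0):
    """Smooth lower bound on min(costs): -T * log(sum_i exp(-c_i / T)).

    Shifting by the true minimum keeps the exponentials numerically stable.
    """
    m = min(costs)
    s = sum(math.exp(-(c - m) / temperature) for c in costs)
    return m - temperature * math.log(s)

def rollout_cost(latents, subgoal, temperature=1.0):
    """Soft-min over time of squared distance between each predicted
    latent state and a latent subgoal (hypothetical cost definition)."""
    dists = [sum((z - g) ** 2 for z, g in zip(state, subgoal))
             for state in latents]
    return soft_min(dists, temperature)

# Two hypothetical 2-D latent rollouts and one subgoal.
subgoal = [1.0, 1.0]
passes_near = [[0.0, 0.0], [0.9, 1.1], [2.0, 2.0]]
stays_far = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.0]]

near_cost = rollout_cost(passes_near, subgoal, temperature=0.1)
far_cost = rollout_cost(stays_far, subgoal, temperature=0.1)
```

With n costs, the value lies between min(costs) - T log n and min(costs), so a rollout is rewarded for reaching the subgoal at any step rather than only at the final one, while every step still contributes a gradient.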

Abstract:
World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human‑calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world‑state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first‑class objectives of world‑model design, so that a world model captures how the world will unfold rather than how the next frame appears.

Abstract:
Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA‑style world models advocate learning compact predictive states from high‑dimensional observations to facilitate the prediction of future states, but end‑to‑end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end‑to‑end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action‑aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward‑free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.

Abstract:
While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic‑Aware Rollout Diversification through DynDiff‑GRPO, which explicitly expands action‑space exploration to diversify trajectories, broaden state‑action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff‑GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open‑source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.

Abstract:
While latent world models enable the proactive predictions required for extreme parkour, their purely data‑driven nature forces them to redundantly encode left‑right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end‑to‑end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor‑critic networks. In real‑world tests, the robot leaps across a 2.13 m gap and climbs a 1.63 m platform, breaking records for quadruped parkour. Furthermore, the framework exhibits robust geometric generalization to unseen mirrored terrains and exceptional zero‑shot transferability across diverse outdoor environments. These results demonstrate that symmetry equivariance is an effective structural prior for pushing the physical boundaries of learned legged locomotion.
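
What left-right equivariance means for a policy can be shown with a small group-averaging construction. This sketch is not SWAP's architecture, which the abstract does not detail: it assumes a simplified layout in which observations and actions are vectors ordered as [left..., right...], so that the mirror operation is a swap of the two halves (a real quadruped reflection would also flip the sign of lateral components). Averaging an arbitrary map `f` with its mirrored counterpart yields a map that commutes with the mirror.

```python
def mirror(v):
    """Swap the left and right halves of a vector laid out as
    [left..., right...]. Applying it twice returns the original."""
    half = len(v) // 2
    return v[half:] + v[:half]

def symmetrize(f):
    """Return f_eq(x) = 0.5 * (f(x) + mirror(f(mirror(x)))).

    Because mirror is a linear involution, f_eq satisfies
    f_eq(mirror(x)) == mirror(f_eq(x)) for every x.
    """
    def f_eq(x):
        direct = f(x)
        reflected = mirror(f(mirror(x)))
        return [0.5 * (a + b) for a, b in zip(direct, reflected)]
    return f_eq

def raw_policy(x):
    """Arbitrary non-symmetric map, standing in for an unconstrained network."""
    return [x[0] + 2.0 * x[1], 3.0 * x[0] - x[1]]

equivariant_policy = symmetrize(raw_policy)

x = [1.0, 4.0]
lhs = equivariant_policy(mirror(x))   # act on the mirrored observation
rhs = mirror(equivariant_policy(x))   # mirror the action for the original
```

Here `lhs` and `rhs` coincide, whereas the same check fails for `raw_policy`. An equivariant model therefore needs to learn an interaction on only one side, which is the redundancy the abstract describes data-driven models as encoding twice.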

Abstract:
Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action‑conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene‑point trajectories from training videos and enforces cross‑frame coherence through latent contrastive learning, strengthening physically consistent instrument‑tissue dynamics. Drift Adaptation Training mitigates long‑horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long‑horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld‑Bench, featuring diverse procedure types, long‑range rollouts, and decoupled metrics for instrument‑motion accuracy and tissue‑response fidelity. Extensive experiments show that SurgVista consistently outperforms state‑of‑the‑art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

Abstract:
Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in‑context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update‑free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in‑context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in‑context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non‑temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL‑derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL‑driven framework successfully trains curious data‑collection policies that explore optimally.

Abstract:
Latent actions provide a compact interface between action‑free video and downstream decision‑making, yet existing Latent Action Models (LAMs) force every transition through a fixed‑capacity bottleneck. We identify a bottleneck trade‑off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable‑length latent actions trained by nested dropout, yielding prefix‑valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed‑capacity LAMs at every evaluated token budget under standard scarce‑label supervision and under a low‑return single‑task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent‑action interface at the same token budgets. The same model supports inference‑time token‑budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable‑length latent actions are an architecture‑free, drop‑in upgrade to the fixed‑capacity bottleneck in latent action models, latent‑action world models, and video‑pretrained action interfaces.
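
The nested-dropout idea behind prefix-valid codes can be illustrated with a minimal sketch. This is not FlexLAM's implementation: the function names, the uniform distribution over the kept prefix length, and zero-masking of the dropped tokens are all assumptions made for illustration.

```python
import random

def nested_dropout(tokens, rng):
    """Keep a random-length prefix of the latent tokens and zero out the rest."""
    # Assumption: the kept length k is drawn uniformly from 1..len(tokens).
    k = rng.randint(1, len(tokens))
    masked = [tok if i < k else 0.0 for i, tok in enumerate(tokens)]
    return masked, k

def truncate_to_budget(tokens, budget):
    """Inference-time budget adjustment: keep only the first `budget` tokens."""
    return [tok if i < budget else 0.0 for i, tok in enumerate(tokens)]

rng = random.Random(0)
latent_action = [0.9, -0.4, 0.2, 0.05]  # hypothetical 4-token latent action
masked, k = nested_dropout(latent_action, rng)
print(k, masked)
print(truncate_to_budget(latent_action, 2))
```

Because every training step sees only a prefix, the decoder must work with any prefix length, which is what would allow a token budget to be chosen at inference time without retraining.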

Abstract:
The growing reliance on pre‑trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model‑scanning solutions primarily rely on static, format‑specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well‑defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle‑aware approach for securing ML model execution, and instantiate this design in Re‑Moat, our reference implementation. We evaluate Re‑Moat across multiple ML frameworks using 77,974 real‑world model artifacts from the Hugging Face Hub, 31 Proofs‑of‑Concept (PoCs) from CVEs, and 334 models from a state‑of‑the‑art dataset, and compare it against state‑of‑the‑art model‑scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close‑to‑zero false‑positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

Abstract:
Action‑conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real‑world experimentation by generating action‑consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end‑effector occlusions and rapid wrist‑camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem‑World, a memory‑augmented multi‑view action‑conditioned world model. At its core, we present W‑VMem, a 4D wrist‑view‑centered surfel‑indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W‑VMem enables geometry‑aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel‑based rendering and scoring, providing informative and non‑redundant context for prediction. Extensive experiments show that Mem‑World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl‑World, improving the Pearson correlation with real‑world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long‑horizon tasks.

Abstract:
Ultrasound (US) is widely used for surgical navigation, yet real‑time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and action‑dependent US acquisition. Existing methods are one‑shot or short‑horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on‑screen feedback. We propose DreamReg, a belief‑driven world‑model framework that formulates 2D‑3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe‑motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on CAMUS and u‑RegPro datasets demonstrate improved robustness and competitive registration accuracy for real‑time guidance compared with state‑of‑the‑art methods.

Abstract:
Model‑based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training‑time attack surface: adversarially poisoned fine‑tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two‑stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low‑return behavior under planning while remaining close to clean dynamics, using first‑order bilevel optimization enabled by a transition‑gradient theorem. In the second stage, SWAAP realizes this target through stealth‑constrained gradient matching, modifying only a limited fraction of fine‑tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction‑error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre‑training detection of poisoned transitions, robust training during fine‑tuning, and test‑time monitoring of the resulting world model. Across diverse continuous‑control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non‑adaptive residual/CUSUM/TRIM‑style defenses. These results reveal a practical vulnerability in world‑model adaptation pipelines and highlight the need for robustness methods that protect both world‑model training data and learned dynamics.

Abstract:
Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA‑based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high‑magnitude, constraint‑satisfying gradient corrections) and social‑behavioral dynamics (diffuse, distribution‑matching corrections). We term this Objective Interference Collapse (OIC): we argue that joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel's representational subspace, in a manner not resolvable by loss weighting alone. We propose Dual‑Channel Grounded World Modeling (DCGWM), designed to structurally prevent OIC through a partitioned latent space (physical subspace Z_p, behavioral subspace Z_b) with inward‑only gradient flow. A Physical Grounding Channel updates only Z_p via VICReg‑style alignment to physical measurements; a Social‑Behavioral Grounding Channel updates only Z_b via alignment to trajectories from an emergent multi‑agent simulation. An Inter‑Channel Interface Module couples the subspaces at the task level without cross‑subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model. We present three theoretical results: the partition removes the gradient‑interference pathway implicated in OIC; each grounded subspace inherits anti‑collapse guarantees from its alignment objective; and generative isolation is necessary under a stated assumption on the generative objective's geometry. This manuscript establishes the problem formulation and architecture; experimental validation is ongoing and will be reported in a future revision.
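
The asymmetry between a hard hinge and a soft KL term can be sketched as follows. This is an illustrative reading of the description, not the paper's loss: the function names, the margin, the KL direction, and the weighting are assumptions.

```python
import math

def hinge_penalty(violation, margin=0.0):
    """Hard hinge: zero inside the margin, linear in the excess beyond it."""
    return max(0.0, violation - margin)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def asymmetric_adherence_loss(phys_violations, p_behavior, q_rollout,
                              margin=0.0, kl_weight=1.0):
    """Sum a hinge over physical violations and a weighted KL over behavior."""
    # Assumption: physical violations are non-negative constraint residuals.
    physical = sum(hinge_penalty(v, margin) for v in phys_violations)
    # Assumption: the KL runs from the reference behavior to the rollout.
    behavioral = kl_divergence(p_behavior, q_rollout)
    return physical + kl_weight * behavioral

loss = asymmetric_adherence_loss([0.0, 0.3], [0.5, 0.5], [0.5, 0.5])
print(loss)
```

The hinge contributes nothing until a constraint is actually violated, whereas the KL term applies a graded penalty to any distributional mismatch.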

Abstract:
Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action‑conditioned video world models offer a scalable alternative by simulating policy rollouts. However, three challenges remain: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3‑Eval, a self‑consistent video generation recipe that adapts a pre‑trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward‑inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward‑only model cannot penalize. Second, cross‑view consistency trains the model to inpaint each camera view from the other, keeping the multi‑camera observation coherent over long rollouts without any explicit memory mechanism. Third, test‑time consistency reuses the inverse dynamics mode at inference as a per‑action‑chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3‑Eval rollouts reproduce the failure modes that policies exhibit in real‑world rollouts, supporting fine‑grained diagnostic comparison rather than aggregate ranking alone. Across seven real‑world vision‑language‑action policies, SC3‑Eval attains a closed‑loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video‑model‑based baselines, and generalizes to new tasks.
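
For readers unfamiliar with the two evaluator metrics, the sketch below computes a Pearson correlation and a Mean Maximum Rank Violation (MMRV) between real and simulated success rates. The MMRV formula follows the definition commonly used in simulated-evaluation work, as we understand it; the abstract does not spell it out. The success rates are made-up placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mmrv(real, sim):
    """Mean over policies of the largest real-performance gap that the
    simulated scores rank in the wrong order (assumed definition)."""
    n = len(real)
    worst = []
    for i in range(n):
        violations = [
            abs(real[i] - real[j])
            if (sim[i] < sim[j]) != (real[i] < real[j]) else 0.0
            for j in range(n)
        ]
        worst.append(max(violations))
    return sum(worst) / n

# Placeholder success rates for four hypothetical policies.
real_success = [0.20, 0.45, 0.60, 0.80]
sim_success = [0.25, 0.40, 0.70, 0.75]
print(pearson(real_success, sim_success), mmrv(real_success, sim_success))
```

A higher Pearson correlation and a lower MMRV both indicate that the simulated evaluator orders policies the way real-world trials do.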

Abstract:
Action chunking has become a common interface for vision‑language‑action (VLA) models, enabling low‑frequency policy inference to drive high‑frequency robot execution. However, once an action chunk is committed, its open‑loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM‑Chunk, a test‑time scaling method that augments chunking‑based policies with a lightweight latent world model, without requiring additional policy fine‑tuning. At test time, DREAM‑Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM‑Chunk uses additional test‑time computation to cover multiple plausible stochastic futures and improve reactivity during long‑horizon chunk execution. On the Kinetix benchmark, DREAM‑Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM‑Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM‑Chunk improves the robustness of action‑chunking policies in stochastic dynamics.
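
The selection step can be sketched on a toy one-dimensional system. This is an illustration under stated assumptions, not the DREAM‑Chunk implementation: the additive `toy_world_model`, the absolute-difference matching score, and all names are hypothetical.

```python
def toy_world_model(state, action):
    """Hypothetical latent dynamics: the state moves by the action."""
    return state + action

def rollout(world_model, state, chunk):
    """Predicted states before each action and after the last one."""
    states = [state]
    for action in chunk:
        state = world_model(state, action)
        states.append(state)
    return states

def select_action(candidates, futures, observed_state, t):
    """After t executed steps, pick the chunk whose predicted state at step t
    is closest to the observation, and return its next action."""
    best = min(range(len(candidates)),
               key=lambda i: abs(futures[i][t] - observed_state))
    return candidates[best][t], best

candidates = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
futures = [rollout(toy_world_model, 0.0, c) for c in candidates]
action, idx = select_action(candidates, futures, observed_state=0.6, t=1)
print(idx, action)
```

Sampling more candidates gives the matching step more predicted futures to choose from, which is one way to read the reported benefit of larger candidate sample sizes.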

Abstract:
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single‑view setting and lack the multi‑view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye‑to‑hand, and wrist‑mounted) for policy learning, current multi‑view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross‑view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter‑view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion‑transformer world models via three core components: (1) Geometry‑Aware Cross‑View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D‑REPA, which distills 3D‑aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT‑based world foundation model, PAIWorld achieves state‑of‑the‑art multi‑view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot‑Challenge2026 leaderboard, while enabling downstream applications such as model‑based planning, world action models, and multi‑view policy post‑training.

Abstract:
Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego‑motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image‑based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego‑motion as a latent proxy for action. This disentanglement resolves the ambiguities between self‑motion and world‑motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher‑student distillation strategy that leverages the spatial "common sense" of off‑the‑shelf foundation models, leading to robust zero‑shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d‑wm.github.io.

Abstract:
The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video‑action‑language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large‑scale human‑driven interaction trajectories. In this paper, we introduce EgoCS‑400K, a large‑scale replay‑grounded egocentric Counter‑Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view‑angle changes, weapon usage, game events, and round‑level context, and render clean first‑person videos from the same trajectories. EgoCS‑400K contains over 400,000 first‑person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action‑conditioned future prediction, state‑ and event‑aware scene rollout, replay‑grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS‑400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real‑world embodied data.

Abstract:
Current Vision‑Language‑Action (VLA) models face a trade‑off between efficient action generation and explicit deliberation. Directly decoding actions from vision‑language backbone representations enables low‑latency control, whereas explicit reasoning through textual chains, pixel‑level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision‑language model (VLM). PearlVLA separates VLM meta‑query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan‑conditioned world query probes a lightweight frozen latent world model for an action‑free future observation latent, which is fed back to guide plan refinement. A future‑guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine‑grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low‑latency execution. We further introduce Causal Refinement‑Grouped Process‑Reward RL to optimize the latent refinement process with rewards from longer‑horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state‑of‑the‑art performance among existing methods.

Abstract:
Recent World‑Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine‑grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real‑world interaction. To address these limitations, we propose WAM‑RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co‑evolve, our approach enhances fine‑grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short‑horizon tasks, but fails to provide significant gains on long‑horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long‑horizon settings. Our work is the first to introduce reinforcement learning into the World‑Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.

Abstract:
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human‑centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real‑time audio‑visual autoregressive model that has 22B parameters and is capable of real‑time streaming generation and sub‑second interaction, with a record‑breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real‑time audio‑visual generation model specifically optimized for social‑interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self‑resampling, cross‑modal representation alignment, domain‑aware preference optimization, and reinforced online‑policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand‑second‑scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real‑time inference performance. We believe this work not only sets a new state‑of‑the‑art (SOTA) performance benchmark for high‑quality, low‑latency, and long‑horizon audio‑visual autoregressive models, but also points out the paradigm shift desired for next‑generation AI‑native social platforms.

Abstract:
Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free‑form language, HD‑maps, trajectories, and camera poses reside in incompatible representational spaces, and post‑hoc cross‑view fusion, where per‑camera latents fail to encode global 3‑D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent‑token level. We present DRIVE‑CHOREO, an LLM‑choreographed multi‑agent world model that recasts controllable multi‑view video generation as latent choreography. Three Qwen2.5‑VL agents (a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially‑anchored layout tokens, and an Auditor feeding cross‑view critiques back as auxiliary supervision) jointly author a single position‑aware token sequence. This sequence is co‑compressed with the multi‑view video via a view‑time permutation that enforces inter‑camera geometry within the convolutional receptive field of a 3‑D VAE. On nuScenes, DRIVE‑CHOREO sets new state‑of‑the‑art multi‑view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

Abstract:
Long‑form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine‑tuned, open‑frontier, closed‑frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed‑frontier systems saturate at a plot‑beat F1 in the band [0.78, 0.81] and collapse, losing about 0.20 F1, at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative‑structure metrics evaluated across horizons h ∈ {10, 20, 50, 100, 200}, with cross‑lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N‑VSSM, a Narrative Variational State‑Space Model that maintains a structured 256‑dimensional latent world state over more than 200 episodes via a Mamba‑2 backbone with an event‑conditioned posterior and an 8B decoder. N‑VSSM holds plot‑beat F1 >= 0.84 across all horizons at 4x lower compute than the closed‑frontier band. A learned Cultural Transfer Function lifts cross‑language fidelity by +0.20 to +0.23 Likert points. In a within‑subjects writer study (n = 12 professional authors, 240 trials), N‑VSSM is preferred over Claude Opus 4.5 on long‑arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

Abstract:
Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open‑source, browser‑based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four‑act narrative: from the foundational Nobel Prize‑winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped‑ion, neutral‑atom, and superconducting systems), into immersive three‑dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar‑chart comparisons grounded in real quantum device specifications. All three‑dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

Abstract:
We introduce Qwen‑RobotWorld, a language‑conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human‑to‑robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language‑guided planning signals for downstream robot control. This is achieved through a three‑part design: a) Double‑Stream MMDiT with MLLM Action Encoding, where a 60‑layer double‑stream diffusion transformer couples frozen Qwen2.5‑VL semantics with video‑VAE latents through layer‑wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video‑text corpus (200M+ frames) with action‑language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two‑stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: Qwen‑RobotWorld ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open‑source models on WorldModelBench and PBench. Additional zero‑shot analyses on the RoboTwin‑IF benchmark further support robust generalization and multi‑view consistency.

Abstract:
World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre‑training Paradigm governed by a Cross‑Embodiment Data Curriculum, which organizes open‑world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding‑window attention captures local dynamics, dilated sliding windows capture mid‑range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment‑Aware System Co‑Design to support low‑latency rollout generation on server and consumer‑grade hardware for real‑world observation‑action‑feedback loops. Experiments on embodied world‑model, long‑horizon, and action‑policy benchmarks show that Kairos achieves top‑level performance while offering a strong efficiency‑capability trade‑off. Together, these results position Kairos as a cohesive operational foundation for future self‑evolving physical intelligence.
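
The three temporal mechanisms named above can be illustrated with a small sketch. It assumes causal attention, toy window sizes, and a scalar gated recurrence; none of these details come from the abstract, and the function names are hypothetical.

```python
def sliding_window_mask(n, window):
    """Causal local attention: position i sees the last `window` positions."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def dilated_window_mask(n, window, dilation):
    """Causal dilated attention: position i sees every `dilation`-th
    earlier position, up to `window` positions in total."""
    return [[0 <= i - j < window * dilation and (i - j) % dilation == 0
             for j in range(n)] for i in range(n)]

def gated_linear_memory(queries, keys, values, gates):
    """Scalar sketch of gated linear attention: a fixed-size state is decayed
    by a gate, updated with key * value, then read out by the query."""
    state = 0.0
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        state = g * state + k * v
        outputs.append(q * state)
    return outputs

print(sliding_window_mask(4, 2)[3])
print(dilated_window_mask(6, 3, 2)[5])
print(gated_linear_memory([1, 1], [1, 1], [2, 3], [0.5, 0.5]))
```

The two masks limit each position to a bounded number of earlier positions, while the recurrent state keeps a constant-size summary of everything seen so far.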

Abstract:
Multivariate forecasting in physical systems requires models that predict coupled temporal variables while preserving meaningful state evolution. Deep forecasters can fit temporal correlations, and physics‑informed models can regularize predictions with scientific constraints, but these directions are often connected only at the decoded‑output level. As a result, the hidden predictive state that generates future trajectories may remain statistically useful but physically unstructured. We introduce Phys‑JEPA, a physics‑informed joint‑embedding predictive architecture for multivariate time‑series forecasting. Phys‑JEPA learns a latent world model in which predictive states are decomposed into physical and residual components, and physical consistency is imposed directly on latent states and latent transitions rather than only on decoded forecasts. This formulation uses known physical variables to organize the representation space while retaining residual capacity for unresolved dynamics. On Jena Climate 2009‑2016, Phys‑JEPA reduces aggregate MSE from 0.12482 to 0.12273 and temperature MSE from 0.01892 to 0.01831 at H=24. On Traffic, full Phys‑JEPA improves aggregate MSE over the supervised baseline across all tested horizons, reducing H=192 MSE from 0.800784 to 0.773873. On Electricity, the best variant depends on horizon: static latent consistency is strongest at H=24 and H=48, while full Phys‑JEPA gives the best aggregate and target‑variable MSE at H=192. These initial results suggest that moving physics‑informed learning from output space to latent predictive state space is a promising direction for interpretable temporal world models.
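
To put the reported MSE values in relative terms, the short calculation below converts each before/after pair quoted in the abstract into a percentage reduction. It assumes the first number in each pair is the baseline; the dictionary labels are ours. The reductions come out at roughly 1.7%, 3.2%, and 3.4%.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease from `baseline` to `improved`."""
    return (baseline - improved) / baseline

# (baseline MSE, Phys-JEPA MSE) pairs as quoted in the abstract.
reported = {
    "Jena aggregate, H=24": (0.12482, 0.12273),
    "Jena temperature, H=24": (0.01892, 0.01831),
    "Traffic aggregate, H=192": (0.800784, 0.773873),
}

for name, (base, new) in reported.items():
    print(f"{name}: {100 * relative_reduction(base, new):.2f}% lower MSE")
```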

Abstract:
World‑model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind‑Studio, a framework that synthesizes executable pygame‑style world models from state‑action‑next‑state trajectories using large language models. Mind‑Studio combines entropy‑selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K‑step lookahead fidelity protocol that compares generated world‑model rollouts against Real‑ALE rollouts from the same state. On Montezuma's Revenge, Mind‑Studio improves chosen‑action next‑state prediction from 0.3% for PoE‑World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch‑level fidelity than prior learned lookahead sources.

Abstract:
Vision‑Language‑Action models (VLAs) leverage large‑scale vision‑language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World‑Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel‑level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent‑action‑conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics‑aware robot control. LaWAM achieves state‑of‑the‑art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real‑world manipulation tasks while retaining low‑latency inference. LaWAM runs in 187 ms per action‑chunk prediction and achieves up to 24x lower wall‑clock latency than pixel‑space WAMs.

Abstract:
We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action‑conditioned joint‑embedding world model with compact Markovian latent states, enabling efficient gradient‑based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU‑accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent‑space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed‑loop execution. We evaluate our method on vision‑based control tasks, where it improves both goal‑reaching performance and safety over latent world‑model and safe‑planning baselines.
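
As a rough illustration of how conformal prediction can turn held‑out prediction errors into a calibrated bound, the sketch below applies standard split conformal prediction to scalar error norms and then tightens a scalar constraint by the resulting radius. It is not the SLS^2 implementation: the function names, the synthetic calibration errors, and the one‑dimensional constraint are assumptions chosen for illustration.

```python
import math
import random


def conformal_radius(calibration_errors, alpha):
    """Split-conformal bound: the ceil((n + 1)(1 - alpha))-th smallest error."""
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_errors)[rank - 1]


def tighten(limit, radius):
    """Shrink an upper bound so it still holds if the prediction is off by radius."""
    return limit - radius


random.seed(0)
# Hypothetical calibration set: norms of latent prediction errors on held-out data.
errors = [abs(random.gauss(0.0, 0.1)) for _ in range(199)]
radius = conformal_radius(errors, alpha=0.1)
safe_limit = tighten(1.0, radius)
```

Under the usual exchangeability assumption, a new error falls below `radius` with probability at least 1 − alpha, so planning against `safe_limit` in the predicted latent keeps the original limit satisfied at that confidence level.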

Abstract:
World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout‑conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non‑reactive. Conversely, pure action‑conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed‑loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real‑time foundation driving world renderer. CausalDrive operates solely on the initial front‑view frame, the ego‑vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text‑driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego‑actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context‑Forced DMD architecture. This combines continuous flow‑matching with a self‑correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed‑loop evaluation with significantly mitigated collision artifacts, (2) large‑scale Reinforcement Learning (RL) post‑training driven by a Video2Reward module, and (3) real‑time human‑in‑the‑loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

Abstract:
Accurate interactive camera control is essential for video‑based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out‑of‑distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non‑autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric‑scale camera control in autoregressive streaming video generation. Our method maintains a self‑refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on‑policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off‑policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second‑order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
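
The unproject‑then‑reproject step described above follows standard pinhole camera geometry, which the sketch below illustrates for a single pixel. The function names, the intrinsics tuple `K = (fx, fy, cx, cy)`, and the convention that `R` and `t` map source‑camera coordinates to target‑camera coordinates are assumptions made for this example, not details taken from GeoStream.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the source camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point, R, t):
    """Rigid transform x' = R x + t, with R given as three rows."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))


def project(point, fx, fy, cx, cy):
    """3D point in a camera frame -> pixel, or None if it lies behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def reproject(u, v, depth, K, R, t):
    """Move one source pixel into the target view via its 3D point."""
    fx, fy, cx, cy = K
    point = unproject(u, v, depth, fx, fy, cx, cy)
    return project(transform(point, R, t), fx, fy, cx, cy)


K = (100.0, 100.0, 64.0, 64.0)          # hypothetical intrinsics
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera moves 1 m forward: points get 1 m closer in the target frame.
moved = reproject(84.0, 64.0, 2.0, K, IDENTITY, (0.0, 0.0, -1.0))
```

In this example a point 2 m away and 20 pixels right of the principal point shifts to about 40 pixels right of it after the camera advances 1 m. Applying `reproject` to every pixel of an estimated depth map yields the kind of point reprojection that can serve as geometric conditioning.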

Abstract:
Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain‑of‑thought or continuous latent‑space trajectories to enhance multi‑step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real‑world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent‑space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality‑based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource‑constrained sequential decision problem and introduce a resource‑aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2‑5 points in accuracy while reducing memory usage by 24%.
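
One common way to write an orthogonality‑based diversity penalty is the squared Frobenius norm of H Hᵀ − I over unit‑normalized hypothesis vectors, which reduces to the sum of squared pairwise cosine similarities. The sketch below shows that form; it is an assumption for illustration, since the abstract does not specify the exact regularizer used in DLWM, and the function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def orthogonality_penalty(hypotheses):
    """Sum of squared cosine similarities over all ordered pairs i != j.

    For unit-norm rows H this equals ||H H^T - I||_F^2: zero when the
    hypotheses are mutually orthogonal, large when they collapse together.
    """
    unit = [normalize(h) for h in hypotheses]
    penalty = 0.0
    for i, a in enumerate(unit):
        for j, b in enumerate(unit):
            if i != j:
                cosine = sum(x * y for x, y in zip(a, b))
                penalty += cosine * cosine
    return penalty


diverse = orthogonality_penalty([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
collapsed = orthogonality_penalty([[1.0, 1.0, 0.0], [2.0, 2.0, 0.0]])
```

Adding such a term to the training loss pushes hypothesis vectors apart in direction while leaving their magnitudes unconstrained.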

Abstract:
World models have rapidly become one of the central abstractions in modern AI. Yet the term now refers to several different objects: action‑conditioned environment models, latent imagination models, future‑video predictors, interactive neural simulators, latent predictive representations, and synthetic‑data engines. Evaluation has broadened with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. The result is not only metric diversity but also a recurring problem of claim/evidence mismatch: papers frequently make a stronger claim about what their model is useful for than their evaluation can actually establish. This paper surveys the recent literature and argues that the central question is use‑dependent. When a model is presented as a world model for embodied decision‑making, a more decisive issue is not whether it generates visually compelling videos, but whether it supports reliable counterfactual reasoning, policy evaluation, planning, and policy optimization under intervention, policy‑induced distribution shift, and long‑horizon rollout. We organize the literature using an L0–L7 ladder that ranges from visual plausibility to policy optimization utility. In our interpretation, L0–L3 are most naturally read as diagnostics of generated artifacts, L4 is often the first genuinely interventional test, and L5–L7 provide the most direct evidence of decision usefulness. Based on this diagnosis, we propose a decision‑making‑centric evaluation framework and a benchmark protocol that foreground counterfactual action fidelity, closed‑loop rollout validity, reward/value prediction, policy‑ranking agreement, optimization lift, model exploitability, and uncertainty calibration.

Abstract:
This work introduces the Separable Neural Architecture (SNA), a function representational class combining neural approximation with tensor decomposition. The SNA decouples localized coordinate functions (atoms) from global interactions governed by a sparse, low‑rank interaction object. This architecture possesses a compact and smooth inductive bias well‑suited for solving partial differential equations (PDEs). When viewed as a Galerkin trial space under the variational SNA (VSNA) framework, the formulation satisfies classical variational guarantees under Lax‑Milgram: well‑posedness, quasi‑optimality, convergence, and stability. In high‑dimensional spatiotemporal–parametric PDEs, the VSNA mitigates the curse of dimensionality by scaling algebraically rather than exponentially. Exploiting an entirely factorized, tensor‑native alternating least squares (ALS) optimization framework reduces this cost to linear in dimension. The VSNA is validated across elliptic, hyperbolic, and parabolic systems, demonstrating close alignment with predicted algebraic and spectral scaling rates. We showcase the SNA as a "solve once, query anywhere" physical world model via two engineering case studies: a 7D parametric manufacturing simulation and an experimental thermal‑to‑property inversion pipeline for Inconel 718. The VSNA executes a 1,000,000‑query Monte Carlo sweep in 102 s on a standard laptop CPU, yielding a 150,000x speedup over a full‑grid finite element baseline hosted on an NVIDIA A100 GPU. It further enables real‑time generative inverse‑mode reconstructions under 100 ms. These results demonstrate that the SNA serves as a compact mathematical substrate for continuous parameter manifolds to enable real‑time inversion, optimization loops, and rapid uncertainty propagation.

Abstract:
Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor‑constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre‑trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor‑constrained agents.

Abstract:
Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM‑generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human‑curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real‑world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error‑free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
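
The distinction between immediate and latent failures can be made concrete with a toy state‑machine executor. The sketch below is not SIMMER's world model or executor: the four actions, the single hazard rule, and the convention of checking hazards on the final state are all hypothetical simplifications chosen for illustration.

```python
# Hypothetical toy kitchen domain: name -> (preconditions, facts added, facts removed).
ACTIONS = {
    "turn_on_stove": (set(), {"stove_on"}, set()),
    "place_pan": (set(), {"pan_on_stove"}, set()),
    "fry_egg": ({"stove_on", "pan_on_stove"}, {"egg_cooked"}, set()),
    "turn_off_stove": ({"stove_on"}, set(), {"stove_on"}),
}
# Facts that must not hold once the plan has finished.
HAZARDS = {"stove_on": "stove left on"}


def execute(plan, goal):
    """Run a plan symbolically and classify the outcome."""
    state = set()
    for step, name in enumerate(plan):
        preconditions, added, removed = ACTIONS[name]
        if not preconditions <= state:
            # Immediate failure: the executor halts and can report what is missing.
            missing = sorted(preconditions - state)
            return {"status": "immediate_failure", "step": step, "missing": missing}
        state = (state - removed) | added
    hazards = [message for fact, message in HAZARDS.items() if fact in state]
    if hazards:
        # Latent failure: every step executed, but the end state is unsafe.
        return {"status": "latent_failure", "hazards": hazards}
    if not goal <= state:
        return {"status": "goal_not_reached"}
    return {"status": "ok"}


goal = {"egg_cooked"}
immediate = execute(["fry_egg"], goal)
latent = execute(["turn_on_stove", "place_pan", "fry_egg"], goal)
clean = execute(["turn_on_stove", "place_pan", "fry_egg", "turn_off_stove"], goal)
```

The second plan reaches the goal without triggering any precondition check, which is why an execution‑success metric alone would not flag it.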

Abstract:
We introduce COMET (Causal Object‑centric Model for Efficient Tree search), a model‑based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot‑structured latent space. COMET pairs a frozen unsupervised object‑centric encoder with a transformer‑based world model, in which actions are bound to objects through a novel action‑slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object‑causal attention, modulating token interactions by learned per‑slot relevance scores so that decision‑making concentrates on task‑relevant entities. COMET adds an explicit object‑level inductive bias to MuZero‑style latent planning. Across eight visually and dynamically diverse tasks from the Object‑Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object‑centric and monolithic baselines.

Abstract:
Reactive capability is a key property of data‑driven behavior world model simulators for autonomous driving simulation systems. With this capability, simulated world agents can respond feasibly to autonomous vehicle (AV) behaviors that differ from the log. However, existing behavior simulation benchmarks do not directly measure reactive capability. They often let the simulator jointly control the AV and surrounding agents and evaluate realism through log similarity or open‑loop prediction metrics. In this work, we introduce ReactSim‑Bench for evaluating the reactive capability of behavior world model simulation in autonomous driving. We decouple the control of agents and the AV, using AV behaviors that differ from the log and require agents to respond as independent AV inputs. To obtain these AV behaviors, we construct a pipeline that uses an AV planner model to generate candidate behaviors and filters the data using rules and manual verification. Collision metrics, map‑based metrics, and kinematic feasibility metrics are used to evaluate the safety and rule compliance of reactive responses. We construct 2,636 test scenarios with three categories and conduct a systematic evaluation of state‑of‑the‑art models across multiple architectures, including Transformer‑based, diffusion‑based, and next‑token‑prediction‑based models. We further analyze how replan frequency affects performance and provide insights for future studies.

Abstract:
Contact‑rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long‑horizon planning in contact‑rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision‑tactile world models spanning 12 contact‑rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point‑cloud observations improve average planning success rates from 20.7% with wrist‑view observations and 22.0% with front‑view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross‑modal representation compatibility rather than modality scaling alone. Combining point‑cloud observations with tactile force‑field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long‑horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long‑horizon robustness in vision‑tactile world models for contact‑rich robotic manipulation.

Abstract:
Autonomous driving is shifting from isolated vehicle intelligence toward multi‑agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross‑agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle‑to‑everything (V2X) communication, collaborative perception, inter‑agent cognition, cooperative planning, end‑to‑end cooperative driving, and simulation and data engines for closed‑loop validation. The organizing question is how exchanged observations become aligned state, intent‑aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation‑model‑based coordination also lacks verified real‑time safety guarantees in open traffic. These gaps motivate key research priorities for multi‑agent embodied autonomous driving (MAEAD): verifiable shared‑state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

Abstract:
World models in robot learning predict future states from visual observations and actions, enabling agents to reason about the consequences of their controls. However, many action‑conditioned models are evaluated in settings where motion is dominated by immediate control, whereas aquatic surface vehicles and other real‑world objects continue moving under inertia and are displaced by hidden ambient drift, such as water currents or wind. We propose FlowMo‑WM, an end‑to‑end trainable visual world model that infers object‑centric motion state and a predictive long‑history context associated with hidden drift from image‑action histories without direct supervision of flow fields. FlowMo‑WM factorizes image‑action history into a short‑history latent state, trained to summarize object‑centric motion, and a longer‑history context, trained to summarize slowly varying exogenous influences. A zero‑context residual transition separates action‑conditioned base dynamics from context‑dependent drift effects during latent rollout. In simulated aquatic surface‑vehicle environments with diverse hidden flows, disturbances, and randomized vehicle dynamics, FlowMo‑WM improves long‑horizon rollout accuracy over representative action‑conditioned latent world models. Prediction‑time context ablations, in which the inferred context is zeroed or shuffled during rollout, show that the ambient context is important for stable prediction under hidden drift, while frozen linear probes characterize information encoded in the learned factors.

Abstract:
When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common‑sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern‑matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern‑matching than with abstract world models.

Abstract:
Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare‑class errors can affect free‑space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop‑caption embeddings, improves text‑space similarity without reliably improving closed‑set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training‑time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability‑weighted taxonomy, attribute‑factor, and scene‑level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare‑class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed‑set occupancy as reliability‑aware semantic auditors than as generic caption‑embedding targets.

Abstract:
We present MoVerse, a real‑time video world model that creates an interactively navigable scene from a single narrow‑field‑of‑view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high‑fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity‑aligned 360° panorama with topology‑aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry‑aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian‑conditioned video renderer translates scaffold renderings along user‑specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high‑quality conditional rendering and distill it into a causal autoregressive student for bounded‑latency streaming. This design combines the controllability and long‑range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real‑time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single‑image world creation with interactive video output.

Abstract:
Pretrained‑feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task‑relevant events. Long‑horizon manipulation requires progress signals that are relational, predicate‑level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA‑WM, an event‑aware world‑model framework that augments frozen visual‑feature dynamics with task‑specification‑grounded event prediction and verification. EA‑WM rolls out candidate futures in pretrained visual‑feature space, decodes them into structured event states, and scores them using task‑progress, semantic‑consistency, physical‑feasibility, and uncertainty terms. The verifier guides sampling‑based planning, gates candidate actions, and, in the contact‑sensitive LIBERO wine‑rack setting, selects among PPO‑generated proposals. Across navigation, deformable‑object, wall‑constrained, and language‑described manipulation studies, EA‑WM shows that event‑aware verification can make feature‑space world models more interpretable and better aligned with task progress.

Abstract:
Action‑conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real‑world rollouts. At compact, trainable scale, however, the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front‑camera latent and a sequence of ego‑actions, predicts future scene latents that a frozen decoder renders to 256 × 256 frames up to 8 seconds ahead, evaluated on 150 held‑out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V‑JEPA2 with temporal context reduces steering RMSE by 40% over the best single‑frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the x_0 objective, residual anchoring, and sampling matched to target uncertainty. In a Stable‑Diffusion‑VAE encode‑predict‑decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception‑based FID and KID reveal a clean perception‑distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression (4.8× better), and a deployable train‑derived calibration makes this practical without test‑time ground truth. The model is genuinely action‑controllable (steering drives scene displacement, Spearman ρ = 0.81, vs ‑0.18 for regression). We trace limited single‑pass motion to a shared‑present anchor and engineer a compact 1.7M‑parameter "jump" model that recovers full ground‑truth motion magnitude (1.02× GT), where single‑pass models capture less than half.

Abstract:
JEPA‑family world models use a static predictor whose weights do not adapt when test‑time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand‑side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI‑JEPA), and operator‑side modulation, where the same representation generates low‑rank weight deltas via LoRA applied to the predictor's weights (EPM‑JEPA). On a pre‑registered comparison (Moving MNIST, gravity shift), EPM‑JEPA (D_shift^n=50 = 0.7848 ± 0.0078, three seeds) differs from EI‑JEPA (0.8238) by delta = 4.74% (Outcome C: a null result), which is a valid outcome by our stated criterion. As a secondary, non‑pre‑registered observation, EPM‑JEPA improves 1.90% over a no‑memory baseline (0.8000), consistently across seeds, while EI‑JEPA underperforms the baseline, indicating the benefit is specific to weight‑level modulation. Our primary contribution is a mechanism analysis: the D_shift^n=50 trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM‑JEPA, a physics‑grounded successor addressing this dynamical‑peak limitation.
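
The difference between operand‑side injection and operator‑side modulation can be seen on a single linear layer. The sketch below is a simplified illustration, not the EI‑JEPA or EPM‑JEPA implementation: the function names are hypothetical, the residual is added directly to the layer input, and the low‑rank factors `A` and `B` are passed in by hand, whereas in the described setup they would be generated from the experience representation.

```python
def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def operand_injection(W, h, memory):
    """Operand side: shift the hidden state by the memory, weights unchanged."""
    return matvec(W, [hi + mi for hi, mi in zip(h, memory)])


def operator_modulation(W, h, A, B):
    """Operator side: apply (W + B A) h, a LoRA-form low-rank weight delta."""
    base = matvec(W, h)
    delta = matvec(B, matvec(A, h))  # rank is bounded by the number of rows of A
    return [b + d for b, d in zip(base, delta)]


W = [[1, 0], [0, 1]]     # frozen predictor weights (toy 2x2 layer)
memory = [3, -1]         # compressed experience vector
A = [[1, 1]]             # rank-1 factors: A is 1x2, B is 2x1
B = [[2], [0]]

h = [1, 2]
injected = operand_injection(W, h, memory)
modulated = operator_modulation(W, h, A, B)
```

With `h` set to zero, injection still outputs `W @ memory` while modulation outputs zero: the injected term is an input‑independent offset, whereas the weight delta changes how the layer responds to each input.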

Abstract:
World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout‑based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real‑world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long‑horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

Abstract:
Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM‑U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM‑U achieves this by constructing Local Epistemic World Models (LEWMs) – directed typed graphs that represent agents, state nodes, and the epistemic relationships among them – and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM‑U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory‑theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM‑U as a domain‑agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

Abstract:
We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world‑model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech‑language models, vision‑language‑action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill‑suited to accommodate this new architectural diversity. Here we present M, a universal serving system for efficient serving of composite AI models. M represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model‑agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M on representative models and find that it achieves, on average, 20% lower end‑to‑end latency than vLLM‑Omni for text‑to‑image workloads on BAGEL, while delivering up to 2.9x lower real‑time factor and 2.7x higher throughput for text‑to‑speech workloads on Qwen3‑Omni. M also outperforms the V‑JEPA 2‑AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

Abstract:
Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint‑Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non‑Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics‑Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per‑step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near‑infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non‑Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non‑Gaussian regimes, the only condition for near‑infinite temporal consistency.

Abstract:
Vision‑Language‑Action (VLA) models inherit semantic grounding from large‑scale pretraining and perform competently across in‑distribution manipulation tasks. This grounding, however, is built on static image‑text pairs, whereas manipulation is a continuous, contact‑rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World‑Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene‑evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory‑level motion hint alongside its semantic conditioning, and the scene‑evolution prior remains effective even when supplied by a video‑pretrained world model that has not been action‑post‑trained. World Pilot attains a state‑of‑the‑art Total success rate of 84.7% on the LIBERO‑Plus zero‑shot OOD benchmark and the highest success rate on every real‑robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world‑pilot.github.io/

Abstract:
ARC tests in‑context rule induction: given a few input‑output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual‑symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop‑OWM, an object‑centric world‑modeling architecture that learns these rules as composable transitions over structured states. It combines color‑prototype slots, demonstration‑conditioned task summaries, and a looped transition model with dense propagation and slot‑conditioned correction. On both ARC‑1 and ARC‑2, Loop‑OWM outperforms non‑looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual‑symbolic world states.

Abstract:
World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action‑head attention analysis and causal interventions. We find that the action decoder fails to focus on task‑relevant interaction regions and remains sensitive to perturbations in task‑irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low‑level action control. In this paper, we propose AGRA, an Action‑Grounded Representation Alignment objective that regularizes the world‑action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real‑world manipulation tasks. Experiments show that AGRA makes world model representations more action‑grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task‑irrelevant regions. As a result, AGRA consistently improves both in‑distribution performance and out‑of‑distribution generalization over the baseline world action model.

Abstract:
Pretrained video generators are promising visual world models that exhibit emergent task‑solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision‑making. Existing approaches either outsource this reasoning to language or vision‑language models, or rely on supervised fine‑tuning with paired task‑execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task‑solving ability in such models by combining self‑distillation with reinforcement learning. Given an unlabeled scene image, a vision‑language model generates a candidate task and a detailed step‑by‑step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption‑guided generation to instruction‑conditioned task solving without curated task‑video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks‑Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM‑based evaluation protocol and transfers competitively to robotic tasks.

Abstract:
Low‑Altitude Wireless Networks (LAWNs), composed of Unmanned Aerial Vehicles (UAVs) and other aerial platforms, provide integrated perception, communication, and computation services in low‑altitude airspace. However, deploying large generative models in this domain faces three major challenges: 1) Limited embodied action mapping; 2) Inadequate physical environment modeling; 3) Insufficient closed‑loop optimization. To address these challenges, this study proposes an Embodied Agentic UAV framework. Centered on a Vision‑Language‑Action (VLA) model as the execution core, the framework establishes an end‑to‑end embodied decision‑making pipeline from multimodal environmental perception to continuous control generation. In addition, a World Model (WM) is introduced to capture the coupling between UAV actions and environmental state evolution, thereby supporting environment prediction, policy verification, and dynamic optimization. Furthermore, memory and reflection mechanisms are incorporated to form an adaptive closed‑loop optimization paradigm of decision, execution, evaluation, and update, thereby enhancing the system's autonomous decision‑making capability and continual evolution ability in complex dynamic environments. Experimental results validate its effectiveness in enabling robust, predictive, and sustainable autonomous control in LAWNs.

Abstract:
Understanding and predicting how social beliefs evolve in response to events ‑‑ from policy changes to scientific breakthroughs ‑‑ remains a fundamental challenge in social science. Given LLMs' commonsense knowledge and social intelligence, we ask: Can LLMs model the dynamics of social beliefs following social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state‑transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without the need for explicit human annotations linking events to belief shifts, or for expensive census data. To evaluate SWM, we introduce a benchmark, SWM‑bench, derived from real‑world prediction markets, specifically Kalshi and Polymarket. SWM‑bench includes over 12k data points for social belief prediction tasks spanning diverse domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time‑series foundation models, achieving state‑of‑the‑art results on Kalshi data and demonstrating competitive performance on Polymarket data, while offering interpretable insights into the underlying mechanisms of social belief dynamics.

Abstract:
Dexterous manipulation with multi‑finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large‑scale data collection with known parameter values, simulation‑trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially‑observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re‑training or fine‑tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero‑shot transfer of our simulation‑trained policy and outperform state‑of‑the‑art offline reinforcement learning and world‑model‑augmented behavior cloning baselines. Please see our website at https://plume‑world‑model.github.io for videos.

Abstract:
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)‑like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi‑agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high‑fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball‑conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution‑aware attribution with two complementary execution‑surplus scores used for analysis and ranking: mean‑based and percentile‑based scores. To make the world model sample‑efficient under limited public data, we adapt a discrete‑token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best‑of‑20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.
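
The sample, roll forward, score loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_rollout_value` stands in for the world-model rollout plus learned value model, a pass is reduced to a single angle, and the two scores (observed value minus the sample mean, and the fraction of sampled variants the observed pass matches or beats) are only one plausible reading of "mean-based" and "percentile-based"; the abstract does not define them.

```python
import random
import statistics


def toy_rollout_value(angle):
    """Hypothetical stand-in for 'roll out with the world model, then score
    with the value model'. The value peaks at angle 0.3."""
    return 1.0 - (angle - 0.3) ** 2


def execution_surplus(observed_angle, noise_std=0.1, n_samples=1000, seed=0):
    """Compare an observed pass with noisy execution variants of itself."""
    rng = random.Random(seed)
    observed = toy_rollout_value(observed_angle)
    # Sample execution variants by perturbing the inferred kick parameter.
    sampled = [
        toy_rollout_value(observed_angle + rng.gauss(0.0, noise_std))
        for _ in range(n_samples)
    ]
    # Mean-based score: how far the observed value sits above the average variant.
    mean_score = observed - statistics.fmean(sampled)
    # Percentile-based score: share of variants the observed pass matches or beats.
    percentile_score = sum(v <= observed for v in sampled) / n_samples
    return mean_score, percentile_score


# A pass executed exactly at the toy optimum beats or ties every noisy variant.
mean_score, pct_score = execution_surplus(0.3)
```

Keeping the whole sampled distribution, rather than only its mean, is what lets both summaries be computed from the same rollouts.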

Abstract:
A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a failure mode that contradicts it. Across hundreds of structural causal models, on identified quantities a strong predictor and a Bayesian baseline both succeed, but on unidentified quantities (the couplings between counterfactual worlds) the predictor collapses to a point, on 28% of models to one no valid model can produce, while the truth is an admissible interval more data never narrows. The gap is structural: prediction cannot represent uncertainty over counterfactual couplings. We cast a world model as a single positive semidefinite coupling kernel K(T,T') over admissible worlds, whose diagonal is the ordinary posterior (what a predictor recovers) and whose off‑diagonal is the cross‑world coupling it cannot, which every counterfactual reads. The paper is the theory of that off‑diagonal. It is real: two states with identical posteriors differ on a cross‑world query, and the off‑diagonal is the coupling that fixes counterfactuals. It can be bounded: positive semidefiniteness is partial‑identifying information the marginals lack, and enforcing it bounds counterfactuals in polynomial time where the exact response‑type program is intractable. Logical structure sharpens it: ontology axioms tighten the bound by up to a third, propagating to couplings they never touch. It can be acquired: targeted scars, constraints learned from encountered infeasibilities, close the gap several times faster than untargeted ones. Its full reconstruction is approximate counting of the admissible worlds, tractable below the Sly‑Sun threshold and inapproximable above; we do not claim to beat the worst case.

Abstract:
Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine‑tuning remains challenging because actions are generated through a multi‑step denoising process. In this work, we propose MODIP, a framework for the offline‑to‑online fine‑tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high‑quality trajectories within the WM, and use them as supervised targets for fine‑tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy‑dependent state‑action value, reducing inference time. Additionally, MODIP trains critics with policy‑independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine‑tuning methods and strong model‑based baselines such as TD‑MPC2.

Abstract:
AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC‑WM) ‑ encoding heterogeneous supply networks into a 6‑dim graph‑latent space with physical conservation ‑ and Double‑Loop Learning that separates epistemic uncertainty (KL‑trust‑region‑bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi‑Sim, a 10‑node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti‑fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms ‑ uncertainty separation, knowledge‑boundary detection, and empirical Bayesian policy updating ‑ and discuss five limitation categories.

Abstract:
Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine‑tuning, autoregressive training, causal initialization, few‑step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume‑1.5 and Matrix‑Game‑3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long‑horizon rollout from self‑correcting error propagation, yet open‑source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full‑stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine‑tuning, then runs a few‑step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera‑controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1‑1.3B, Wan2.2‑5B, HunyuanVideo‑1.5‑8B, and LTX‑2.3‑22B, and also supports secondary fine‑tuning of existing bidirectional models. BiWM enables real‑world camera control where minWM loses controllability, integrates pluggable history compression (FramePack‑style and PackForcing‑style) for long rollouts, and offers an optional NVFP4 4‑bit training/inference pipeline. To counter DMD's mode‑seeking degradation, we add GAN and mass‑covering forward‑KL objectives that preserve scene dynamics. We open‑source BiWM for resource‑constrained research and high‑fidelity environment simulation.

Abstract:
Businesses are increasingly adopting AI‑enabled tools to improve productivity, reduce costs, and enhance products and services. However, the transformative potential of AI extends beyond automating predefined tasks: it lies in enabling intelligent systems to plan, optimize, and execute business initiatives from high‑level strategic objectives. This paper introduces the concept and architecture of a business world model (BWM), a world model specialized for business and organizational environments. Inspired by world models in artificial intelligence, cognitive science, and control theory, a BWM encodes business states, dynamics, constraints, objectives, and feasible action space to support autonomous decision‑making. We propose a business‑semantics‑centric formulation in which business states, dynamics and actions are linked to key business entities. Within this framework, agents can simulate alternative action sequences, estimate their effects on future business outcomes, and evaluate trade‑offs under uncertainty. The proposed architecture integrates semantic data representations, probabilistic machine learning models, deterministic business rules, and explicit action space into a coherent structure for planning and counterfactual reasoning. Although its individual components are not new, the contribution of BWM lies in organizing them as an executable internal simulator for business initiatives. This work establishes a conceptual foundation for autonomous business systems capable of moving from instruction‑based execution toward goal‑driven planning and execution.

Abstract:
World models are now built on substantially different computational substrates. Latent recurrent state‑space models such as PlaNet and the Dreamer family compress observations into recurrent states; token‑based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint‑embedding predictive architectures such as I‑JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re‑implemented from scratch for each architecture because existing hook‑and‑cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open‑source interpretability substrate organized around a capability‑typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement‑learning and self‑supervised world models are first‑class without either imitating the other. A single hook and cache layer exposes time‑indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.
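
To make the capability-typed adapter idea concrete, here is a minimal sketch. It is not the WorldModelLens API: the class names, method signatures, the `Capabilities` dataclass, the dictionary cache, and the meaning given to `sample` are all hypothetical, chosen only to mirror the four required methods and the optional heads named in the abstract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical descriptor: which optional heads a model provides."""
    decode: bool = False
    reward: bool = False
    continue_: bool = False  # trailing underscore: 'continue' is a keyword
    actor: bool = False
    critic: bool = False


class WorldModelAdapter(ABC):
    """The four required methods, as named in the abstract."""
    capabilities = Capabilities()

    @abstractmethod
    def encode(self, observation): ...

    @abstractmethod
    def transition(self, state, action): ...

    @abstractmethod
    def initial_state(self): ...

    @abstractmethod
    def sample(self, state): ...


class ToyCounterModel(WorldModelAdapter):
    """Toy model whose latent state is a single float."""
    capabilities = Capabilities(reward=True)

    def encode(self, observation):
        return float(observation)

    def transition(self, state, action):
        return state + action

    def initial_state(self):
        return 0.0

    def sample(self, state):
        return state  # deterministic toy: nothing to sample (assumed semantics)

    def reward(self, state):
        return -abs(state - 3.0)


def imagine(model, actions, cache):
    """Roll out in imagination and record time-indexed values in a cache."""
    state = model.initial_state()
    for t, action in enumerate(actions):
        state = model.transition(state, action)
        cache[("state", t)] = state
        if model.capabilities.reward:  # only call heads the model declares
            cache[("reward", t)] = model.reward(state)
    return state


cache = {}
final_state = imagine(ToyCounterModel(), [1.0, 1.0, 1.0], cache)
```

Because `imagine` checks the descriptor rather than the model class, the same analysis code would run unchanged on a model that declares no reward head.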

Abstract:
Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single‑stage tasks such as reaching or grasping, and struggle with multi‑stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi‑stage robotic manipulation. Our hierarchical approach utilizes a high‑level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low‑level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object‑centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi‑stage performance.

Abstract:
A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model‑based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state‑conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large‑scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data ‑ and the learned representations of the world model itself ‑ inherently encode the agent's action intuition. We introduce PRISM, a task‑agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA‑style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state‑conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision‑weighted Product‑of‑Gaussians update. This parameter‑free, closed‑form integration steers the sampling process, making the prior confident where it is and ceding control where it is not. PRISM improves success rates by 35 percentage points over vanilla world‑model‑based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
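
The precision-weighted Product-of-Gaussians update mentioned above has a standard closed form, sketched below for diagonal Gaussians. The diagonal-covariance assumption, the function name, and the toy numbers are illustrative and not taken from the paper.

```python
def fuse_gaussians(mean_a, var_a, mean_b, var_b):
    """Product of two diagonal Gaussians, computed per dimension.

    Precisions (inverse variances) add, and the fused mean is the
    precision-weighted average of the two means.
    """
    fused_mean, fused_var = [], []
    for m_a, v_a, m_b, v_b in zip(mean_a, var_a, mean_b, var_b):
        prec_a, prec_b = 1.0 / v_a, 1.0 / v_b
        prec = prec_a + prec_b
        fused_mean.append((prec_a * m_a + prec_b * m_b) / prec)
        fused_var.append(1.0 / prec)
    return fused_mean, fused_var


# Hypothetical planner sampling distribution over a 2-D action.
planner_mean, planner_var = [0.0, 0.0], [1.0, 1.0]
# Hypothetical learned prior: confident in dimension 0, uncertain in dimension 1.
prior_mean, prior_var = [1.0, 1.0], [0.01, 100.0]

mean, var = fuse_gaussians(planner_mean, planner_var, prior_mean, prior_var)
```

In dimension 0 the confident prior pulls the fused mean almost all the way to 1.0, while in dimension 1 the uncertain prior barely moves it from the planner's 0.0. This is the behavior the abstract describes as being confident where the prior is and ceding control where it is not.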

Abstract:
Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short‑term, long‑horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

Abstract:
Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action‑relevant structure in their latent spaces. We study this question through a unified probe‑based evaluation across diverse encoder families, including image‑only self‑supervision, video pretraining with and without latent prediction, reconstruction‑based autoencoders, diffusion models, and shortcut‑forcing dynamics models. Using a common inverse‑dynamics probing objective, we find that action‑relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near‑zero action recoverability, while video‑pretrained self‑supervised encoders consistently achieve the best Pareto trade‑off between visual fidelity and action prediction. Comparing V‑JEPA and VideoMAE further shows that most gains arise from natural‑video temporal context, with feature‑level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static‑environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse‑dynamics supervision substantially improves robustness to visual corruption, suggesting that action‑aware objectives regularize latent geometry beyond clean‑setting performance. Our results identify temporal predictive structure ‑‑ not reconstruction fidelity ‑‑ as the primary ingredient underlying action‑relevant video representations.

Abstract:
Representation learning is central to modern machine learning, enabling transitions from handcrafted features to learned embeddings, latent spaces, foundation models, world models, and digital twins. Yet most research examines how representations are optimized after a representational framework has been selected, while less attention is given to when a new level of representation becomes necessary. We introduce the Bootstrap Theory of Representational Emergence (TBER), a framework describing how new representations arise when existing ones become explanatorily insufficient. In this view, representational innovation is not only driven by more data, larger models, or greater computational power, but also by persistent explanatory gaps: situations in which a representation can still describe observations but can no longer make their organization or transformations intelligible. TBER identifies explanatory insufficiency as a positive signal for representational transition. A representation becomes insufficient not because it is necessarily false, but because its explanatory domain has been exceeded. The bootstrap dynamic follows a recursive sequence: observations reveal anomalies; anomalies expose insufficiencies; insufficiencies motivate new representations; and these new representations generate further observations and possible new insufficiencies. We formalize this process through five stages: stabilized observation, anomaly detection, recognition of explanatory insufficiency, representational emergence, and provisional stabilization. We discuss applications to representation learning, latent spaces, foundation models, world models, digital twins, adaptive biological systems, and scientific discovery. TBER suggests that future AI systems may benefit from mechanisms for detecting the explanatory limits of their own internal representations.

Abstract:
Interactive agents trained only against task return can achieve high scores while failing to represent the mechanisms that make their actions succeed. This makes brittle behavior difficult to diagnose and limits adaptation when environment dynamics change. Existing LLM reflection and policy‑code repair can revise behavior from failed trajectories, but questions and world‑understanding tests are usually used only after training. We introduce an Explicit Symbolic Behavioral Model (ESBM), a trainable behavioral model that couples task performance with evidence‑grounded question answering and executable mechanism prediction. An ESBM represents behavior through typed predicates, weighted rules, bounded options and mechanism memory; the mechanism layer predicts symbolic events, object changes, rewards and terminal consequences under action interventions. After each rollout, adaptive questions and active world‑model probes convert score failures, QA errors and transition‑prediction errors into constraints for local ESBM edits. Candidate models are selected by a multi‑criterion rule that jointly evaluates task score, answerability and active world‑model consistency. Under the tested Atari‑style protocols, ESBM learns high‑scoring policies while producing explicit answers and executable mechanism predictions, indicating that adaptive questions can serve as both training pressure and reusable benchmarks for mechanistic policy learning in this setting.

Abstract:
Robots performing long‑horizon visual manipulation observe high‑dimensional images, but successful plans depend on action‑relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task‑planning setting in which a robot receives only image transitions: the current image, executed high‑level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high‑level actions that reaches the goal. To address this problem, we introduce STRIPS‑WM, a framework for learning image‑grounded STRIPS‑style world models directly from visual transitions. STRIPS‑WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS‑WM improves image‑to‑plan success over the tested visual rollout, latent graph‑search and latent‑symbolic baselines.
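
For readers unfamiliar with STRIPS-style operators, the sketch below shows how preconditions and add/delete effects act on a set of true predicates, and how a plan can be found by breadth-first search. The predicate names, the two operators, and the choice of breadth-first search are hypothetical. In STRIPS-WM the predicates are learned latent binaries and the abstract does not say which classical planner is used.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    pre: frozenset     # predicates that must hold before the action
    add: frozenset     # predicates made true by the action
    delete: frozenset  # predicates made false by the action


def applicable(state, op):
    return op.pre <= state


def apply(state, op):
    return (state - op.delete) | op.add


def plan(start, goal, operators):
    """Breadth-first search over symbolic states; returns action names or None."""
    start, goal = frozenset(start), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal <= state:
            return actions
        for op in operators:
            if applicable(state, op):
                nxt = apply(state, op)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, actions + [op.name]))
    return None


# Hypothetical hand-written operators for a toy rearrangement task.
operators = [
    Operator("pick", frozenset({"hand_empty", "on_table"}),
             frozenset({"holding"}), frozenset({"hand_empty", "on_table"})),
    Operator("place", frozenset({"holding"}),
             frozenset({"on_shelf", "hand_empty"}), frozenset({"holding"})),
]

result = plan({"hand_empty", "on_table"}, {"on_shelf"}, operators)
```

Here `result` is `["pick", "place"]`: `place` is not applicable in the start state, so the search must first apply `pick` to make `holding` true.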

Abstract:
Generalist robot intelligence is often framed as a policy‑scaling problem: collect more robot demonstrations, train larger Vision‑Language‑Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment‑specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world‑model interfaces for physics‑grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross‑embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.

Abstract:
While Vision‑Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text‑oriented chain‑of‑thought. They often struggle to infer unobserved layouts, maintain cross‑view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action‑conditioned visual imagination. Specifically, Astra couples Astra‑VL, an RL‑trained VLM policy, with Astra‑WM, a Bagel‑based world simulator that generates novel‑view observations from context images and natural‑language camera motions. To provide reliable imagined evidence, Astra‑WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world‑simulator‑in‑the‑loop two‑phase RL curriculum to stabilize tool‑use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra‑WM improves simulator‑augmented Gemini‑3‑Flash on MMSI‑Bench from 45.1 to 49.5, while Astra‑VL improves the Qwen3‑VL backbone from 29.8 to 38.8 on MMSI‑Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world‑model‑augmented reasoning requires learning when, where, and how to imagine.

Abstract:
Vision‑Language‑Action (VLA) policies remain brittle in long‑horizon and high‑uncertainty control, where one‑pass action decoding provides limited inference‑time deliberation. Explicit chain‑of‑thought can increase reasoning depth, but introduces token latency and an indirect text‑to‑action interface. We propose MPCoT, a reward‑guided multi‑path latent reasoning framework that initializes M hypotheses, refines them for K weight‑tied steps, and softly aggregates them before action decoding. A training‑only path‑preference objective evaluates candidate action branches with expert‑action consistency, world‑model/VLM‑based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8‑step action interface, generates zero reasoning tokens, and exposes configurable inference controls (K,M). Under matched protocols on LIBERO and CALVIN, MPCoT improves long‑horizon performance, with ablations confirming depth‑width effects, confidence‑weighted aggregation, and reward‑guided path supervision.
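
The initialize, refine, and aggregate steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MPCoT architecture: the noisy initialization, the scalar tanh refinement step, the scorer, and all names are hypothetical. Only the overall shape (M hypotheses, K refinement steps that reuse the same parameters, softmax-weighted aggregation) follows the abstract.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def refine(h: Vector, weight: float, bias: float) -> Vector:
    """One refinement step; the same (weight, bias) is reused at every step."""
    return [math.tanh(weight * x + bias) for x in h]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over path scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_path_reason(context: Vector, num_paths: int, num_steps: int,
                      scorer: Callable[[Vector], float],
                      weight: float = 0.9, bias: float = 0.1,
                      seed: int = 0) -> Vector:
    """Return a soft aggregate of M refined latent hypotheses."""
    rng = random.Random(seed)
    # Initialize M hypotheses as noisy copies of the context (assumption).
    paths = [[x + rng.gauss(0.0, 0.1) for x in context]
             for _ in range(num_paths)]
    # Refine every hypothesis for K steps with tied parameters.
    for _ in range(num_steps):
        paths = [refine(h, weight, bias) for h in paths]
    # Softly aggregate: weight each path by the softmax of its score.
    weights = softmax([scorer(h) for h in paths])
    return [sum(w * h[i] for w, h in zip(weights, paths))
            for i in range(len(context))]


def toy_scorer(h: Vector) -> float:
    """Hypothetical path scorer: prefers low-norm latents."""
    return -sum(x * x for x in h)


LATENT = multi_path_reason([0.2, -0.4, 0.7], num_paths=4, num_steps=3,
                           scorer=toy_scorer)
```

Here num_paths and num_steps play the role of the inference controls (K, M): raising either increases compute without emitting any reasoning tokens, since the aggregate is passed straight to the action decoder.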

Abstract:
End‑to‑end Vision‑Language‑Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states, inherent in World Models, is critical for robust decision‑making under such partial observability. To evaluate this setting, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to test spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. We then propose WorldFly, a novel world‑model‑based VLA framework that employs a dual‑branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent's policy via spatial imagination. Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.

Abstract:
Latent world models (LWMs) have strengthened end‑to‑end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM‑based planners usually generate trajectories directly from entangled latent representations. This compact latent‑to‑planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving‑style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN‑S (PLANning with latent Style dynamics), a planner‑facing bridge that addresses this compactness‑controllability dilemma by decoding a style‑conditioned, four‑channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host‑side interfaces: attention‑level fusion for regression planners and reward‑level fusion for anchor‑score planners. We validate PLAN‑S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN‑S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule‑cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline‑challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN‑S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.
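
One plausible reading of reward-level fusion for an anchor-score planner is sketched below: each candidate anchor trajectory keeps its host score, and a cost accumulated from the cost map along the trajectory is subtracted before the best anchor is selected. This is an assumption made for illustration, not the authors' implementation. The grid indexing, the channel weights, the fusion weight, and every function name are hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]
Trajectory = Sequence[Tuple[int, int]]  # (row, col) waypoints on the cost grid


def trajectory_cost(cost_map: Sequence[Grid], trajectory: Trajectory,
                    channel_weights: Sequence[float]) -> float:
    """Sum the weighted per-channel costs at every waypoint."""
    total = 0.0
    for row, col in trajectory:
        for channel, weight in zip(cost_map, channel_weights):
            total += weight * channel[row][col]
    return total


def select_anchor(anchors: Sequence[Trajectory], base_scores: Sequence[float],
                  cost_map: Sequence[Grid], channel_weights: Sequence[float],
                  fusion_weight: float = 1.0) -> Tuple[int, List[float]]:
    """Fuse host scores with cost-map penalties and pick the best anchor."""
    fused = [score - fusion_weight * trajectory_cost(cost_map, anchor, channel_weights)
             for anchor, score in zip(anchors, base_scores)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


def zeros(rows: int, cols: int) -> Grid:
    return [[0.0] * cols for _ in range(rows)]


# Hypothetical four-channel cost map on a 3x3 grid.
COST_MAP = [zeros(3, 3) for _ in range(4)]
COST_MAP[0][1][1] = 5.0  # high cost in the centre cell of channel 0

ANCHORS = [
    [(0, 1), (1, 1), (2, 1)],  # drives through the costly centre cell
    [(0, 0), (1, 0), (2, 0)],  # stays in the left column
]
BASE_SCORES = [1.0, 0.8]       # host planner prefers the first anchor
BEST, FUSED = select_anchor(ANCHORS, BASE_SCORES, COST_MAP, [1.0, 1.0, 1.0, 1.0])
```

With these toy numbers the host alone would choose the first anchor, while the fused score selects the second, which is the kind of change in trajectory selection a cost pathway is meant to produce.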

Abstract:
Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, with the logistics industry serving as a key application scenario. Learning‑based policies offer a promising path beyond traditional perception‑planning‑control pipelines, but their scalability depends on how embodied data can be collected, organized, and reused. This research studies a data‑centric framework for industrial embodied intelligence by constructing a logistics data flywheel. Our framework converts daily operations into reusable data assets, uses World Models to generate reliable supervision for long‑tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, WM‑DAgger introduces a World‑Model‑based data aggregation framework that synthesizes out‑of‑distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large‑scale in‑the‑wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system‑level robot logs, can be aligned for policy learning and transformed into feedback for continual system improvement.

Abstract:
A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI‑driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention‑conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation‑guided, closed‑loop and experimentally actionable biomedical discovery.

Abstract:
Vision‑language‑action (VLA) policies operate in a closed loop in real‑world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open‑loop prediction along pre‑collected action trajectories. This prevents them from supporting closed‑loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL‑World, a chunk‑wise world model designed for policy‑in‑the‑loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL‑World generates multi‑view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world‑model prediction, PiL‑World enables closed‑loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL‑World conditions video generation on action‑derived visual control from head‑view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi‑view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL‑World on three real dual‑arm manipulation tasks. PiL‑World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real‑world rollouts and those estimated through closed‑loop world‑model evaluation from 63.2% to 12.0%.
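
The alternation between policy inference and world-model prediction can be sketched as a simple loop. The example below is a toy illustration under stated assumptions: observations are integers on a line, the policy and world model are hand-written stand-ins, and the gap metric is taken to be the absolute difference between two success rates. None of these names or choices come from PiL-World itself.

```python
from typing import Callable, List, Sequence

Observation = int
ActionChunk = List[int]


def closed_loop_rollout(policy: Callable[[Observation], ActionChunk],
                        world_model: Callable[[Observation, ActionChunk], Observation],
                        first_obs: Observation,
                        is_success: Callable[[Observation], bool],
                        max_chunks: int) -> bool:
    """Alternate policy inference and world-model prediction, chunk by chunk."""
    obs = first_obs
    for _ in range(max_chunks):
        chunk = policy(obs)            # policy acts on the *generated* observation
        obs = world_model(obs, chunk)  # world model imagines the outcome
        if is_success(obs):
            return True
    return False


def estimate_success_rate(policy, world_model, starts: Sequence[Observation],
                          is_success, max_chunks: int) -> float:
    """Fraction of imagined rollouts that reach success."""
    wins = sum(closed_loop_rollout(policy, world_model, s, is_success, max_chunks)
               for s in starts)
    return wins / len(starts)


def success_rate_gap(real_rate: float, estimated_rate: float) -> float:
    """Absolute gap between real and world-model-estimated success rates."""
    return abs(real_rate - estimated_rate)


# Hypothetical stand-ins: move toward position 0 in chunks of two unit steps.
def toy_policy(obs: Observation) -> ActionChunk:
    step = -1 if obs > 0 else 1
    return [step, step]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    return obs + sum(chunk)


def at_goal(obs: Observation) -> bool:
    return obs == 0


ESTIMATED = estimate_success_rate(toy_policy, toy_world_model,
                                  starts=[2, 4, 5, 6], is_success=at_goal,
                                  max_chunks=5)
```

Starting from an odd position the toy policy overshoots and oscillates, so that rollout fails. This mirrors why closed-loop evaluation matters: each decision depends on the observation produced by the previous chunk, which open-loop replay of pre-collected actions cannot capture.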

Abstract:
Bimanual dexterous tool use remains challenging for robots due to high‑dimensional hand configurations and complex hand‑tool‑object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action‑conditioned world models require slow online planning over high‑dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high‑level Future‑State Visuomotor Target Predictor with a low‑level Target‑Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high‑level predictor constructs structured hand‑tool‑object visuomotor embeddings and uses a horizon‑conditioned transformer to generate a multi‑step future target trajectory. The low‑level policy then tracks this trajectory with a target‑conditioned per‑link transformer. This hierarchy decouples coarse future reference generation from fine‑grained action control, and slow long‑horizon semantic prediction from high‑frequency execution. On OakInk2 bimanual tool‑use tasks, DexFuture achieves 90% of the privileged‑oracle performance, compared to 7% for a no‑reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM‑style Cross‑Entropy Method (CEM) planning with a future action‑conditioned world model.

Abstract:
Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end‑to‑end methods rely on direct state‑to‑action mappings, capturing correlations without explicitly modeling action‑conditioned dynamics. Conversely, continuous‑latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete‑WAM, a unified latent vision‑action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete‑WAM establishes a shared discrete diffusion framework with unified generative tasks, jointly formulating world modeling, world‑action policy, and hierarchical decision‑enabled policy, supporting compositional generalization across diverse driving scenarios. Experiments on large‑scale autonomous‑driving benchmarks show that Discrete‑WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision‑making.

Abstract:
Evaluating large language model (LLM) agents in multi‑turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre‑collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion‑based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step‑by‑step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy‑conditioned score function, ensuring that simulated trajectories accurately reflect its decision‑making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi‑turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.

Abstract:
Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model‑based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model‑based control, but representation learning. In particular, we show that combining predictive, model‑based representations with high‑capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model‑free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor‑critic architecture. This approach outperforms a recent world‑model‑based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall‑clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.

Abstract:
World models, learned generative models that predict how an environment evolves, have become a promising tool for sample‑efficient robot learning. Yet how robust they are to environmental variability remains poorly understood. To address this, we conduct a systematic study using vision‑based quadrotor navigation as a testbed problem, training DreamerV3‑based world models under varying levels of environmental randomness and evaluating them across all levels through cross‑environment validation, spanning both Self‑Supervised Learning (SSL) pretraining and Reinforcement Learning (RL) fine‑tuning. We then deploy all world models and associated navigation policies on a real quadrotor in unseen environments, including an open‑loop run where the model receives just 2.5 s of real sensory input before all sensors are cut off, leaving the system to navigate entirely in imagination over a 12 m traverse. Our results show that world model robustness during SSL pretraining is a strong predictor of sim‑to‑real transfer: every model that generalized well in cross‑environment SSL validation deployed successfully in the real world, passing through gaps as narrow as 0.67 m, whereas the model that dominated simulation policy evaluation failed on the real platform. We further identify (a) the discrete latent size and (b) the training‑sequence length as the dominant factors governing world model quality.

Abstract:
Trust in a decision‑making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision‑making processes are often highly opaque. Shielding is a prominent model‑based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees have become a customary representation for controllers and policies. However, since shields are inherently non‑deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human‑interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top‑down, case‑based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high‑level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety‑critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Our method facilitates explainability of the safety aspect in safe‑by‑shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.
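
As a rough illustration of how a shield and state-level risk categories can be derived from a world model, the sketch below uses a small deterministic transition table. It is not the paper's method. The fixed-point computation of the safe region is the standard safety-game construction, while the definitions of the four categories used here (all actions allowed, some blocked, none allowed, already violating) are assumptions chosen for the example, as are the toy states and action names.

```python
from typing import Dict, List, Set

Transitions = Dict[str, Dict[str, str]]  # state -> action -> next state


def safe_region(transitions: Transitions, unsafe: Set[str]) -> Set[str]:
    """Largest set of states from which some action always stays in the set."""
    region = {s for s in transitions if s not in unsafe}
    changed = True
    while changed:
        changed = False
        for state in list(region):
            if not any(nxt in region for nxt in transitions[state].values()):
                region.remove(state)  # no action keeps us safe from here
                changed = True
    return region


def allowed_actions(state: str, transitions: Transitions,
                    region: Set[str]) -> List[str]:
    """The shield: actions whose successor stays inside the safe region."""
    return sorted(a for a, nxt in transitions[state].items() if nxt in region)


def risk_category(state: str, transitions: Transitions,
                  unsafe: Set[str], region: Set[str]) -> str:
    """Assumed category definitions, for illustration only."""
    if state in unsafe:
        return "unsafe"
    allowed = allowed_actions(state, transitions, region)
    if not allowed:
        return "dangerous"  # violation can no longer be avoided
    if len(allowed) < len(transitions[state]):
        return "critical"   # the shield must block at least one action
    return "safe"


# Hypothetical world model: X is the unsafe state, D can only lead to X.
WORLD: Transitions = {
    "A": {"left": "A", "right": "B"},
    "B": {"left": "A", "right": "C", "jump": "D"},
    "C": {"left": "B", "right": "X"},
    "D": {"left": "X", "right": "X"},
    "X": {},
}
UNSAFE = {"X"}
REGION = safe_region(WORLD, UNSAFE)
CATEGORIES = {s: risk_category(s, WORLD, UNSAFE, REGION) for s in WORLD}
```

In this toy model, state B is classified as critical because "jump" leads to D, from which the unsafe state is unavoidable. That is the kind of case-based statement a localized explanation would surface.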

Abstract:
Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi‑step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world‑model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward‑looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain‑of‑thought supervised fine‑tuning in the 4B ablation with a 3‑5x lower decoded‑token budget and improves a comparable instruction‑tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

Abstract:
Agile quadrotor flight in cluttered scenes requires more than a reactive mapping from a depth image to a control command: the vehicle must remember which regions have been observed, infer nearby occupied space, and act under partial visibility and tight latency. In this paper, we present Mapping‑Aware Dreamer (MAD), a geometry‑aware world model for vision‑based quadrotor flight. Instead of using raw‑image reconstruction as the main self‑supervised objective, MAD learns recurrent latent dynamics that reconstruct robocentric occupancy and visibility grid maps together with proprioceptive states. This design forces the latent state to encode local geometry, visibility history, and ego‑motion in a form that is directly relevant to collision avoidance. MAD is trained in DiffAero using a GPU‑parallel map‑construction module that provides high‑throughput supervision for occupancy and visibility. The learned representation is used in three policy‑learning modes: imagination‑based MAD‑Dreamer and feature‑extractor variants based on PPO and SHAC. Across visual navigation and racing tasks, MAD‑based agents achieve higher success rates, faster flight, and better cross‑task transfer than corresponding vision‑only baselines. The model also produces interpretable map predictions and accurate ego‑motion estimates from depth observations. We further deploy the learned policy on a physical quadrotor with an Intel RealSense D435i and demonstrate safe indoor and outdoor flight under limited sensing, reaching 9.66 m/s in simulation and 5.05 m/s in real‑world forest experiments. These results show that mapping‑aware world models provide a practical middle ground between modular aerial navigation and end‑to‑end learning.

Abstract:
We present OSCAR, a precise action‑conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real‑world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. First, we build a large‑scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint‑training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. Second, to condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos‑Predict2.5‑2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate a significant correlation between our virtual policy evaluation in OSCAR and real‑world evaluation, paving the way for a future where robot policies can be evaluated purely in generated virtual worlds.

Abstract:
Recent years have seen remarkable progress in unified vision‑language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high‑quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text‑in‑image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text‑in‑image generation with diffusion models as a promising unified multimodal generation paradigm.

Abstract:
True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group‑level outcomes. Despite notable progress in individual‑level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non‑linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM‑Bench, the first multimodal benchmark for group‑level ToM, built around a causal chain spanning micro‑level BDI states (belief, desire, intention), meso‑level group tension and structural constraints, and macro‑level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven‑level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non‑linear collective dynamics.

Abstract:
We introduce CLAW, a fully end‑to‑end self‑supervised framework for learning a world model jointly with continuous latent action representations directly from action‑free videos. Our approach leverages adversarial latent regularization and diffusion‑based video generation to capture structured and semantically meaningful action representations while modeling rich, predictive environment dynamics, without relying on any action labels or annotations. By simultaneously training the Latent Action Model and world model, CLAW learns to reason about how inferred actions induce environment transitions from visual observations alone. We show that the resulting latent action world model supports both imitation learning from observation and goal‑directed planning. In imitation learning, latent actions extracted from raw videos enable behavior cloning. For planning, CLAW generates sequences of latent actions and maps them to executable actions to reach desired goals. Extensive experiments across diverse tasks and embodiments demonstrate that CLAW produces semantically meaningful latent action representations, supports effective action transfer, and enables planning and imitation from observation, outperforming existing methods.

Abstract:
Supervised fine‑tuning (SFT) improves end‑to‑end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end‑to‑end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine‑tuned LLMs. We find that: a) Supervised fine‑tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine‑tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.

Abstract:
Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's‑eye‑view occupancy grids, flatten the three‑dimensional environment onto a ground plane, discarding the above‑ground and multi‑level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility‑depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self‑rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's‑eye‑view spatial map for cross‑path consistency. Our central finding is emergent and unexpected: a single city‑blind model trained on Manhattan and Paris develops a cross‑city spatial signature, with city identity linearly decodable from its temporal latents far above single‑frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
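
The depth-residual formulation can be illustrated with a tiny example: the model predicts a per-direction change in distance, and the next isovist is the last isovist plus that change, so sharp depth discontinuities in the input carry over to the output. The sketch below is an assumption-laden toy, not the paper's model. The 2D list layout (rows as elevation bins, columns as azimuth bins), the clamping range, and the function names are all hypothetical, and the residual is computed from ground truth rather than predicted by a network.

```python
from typing import List

DepthMap = List[List[float]]  # rows: elevation bins, cols: azimuth bins (assumed)


def depth_residual(last: DepthMap, nxt: DepthMap) -> DepthMap:
    """Training target under a residual formulation: next minus last depth."""
    return [[n - l for l, n in zip(lrow, nrow)]
            for lrow, nrow in zip(last, nxt)]


def apply_residual(last: DepthMap, residual: DepthMap,
                   max_range: float) -> DepthMap:
    """Predicted next isovist: last depth plus residual, clamped to [0, max_range]."""
    return [[min(max(d + r, 0.0), max_range) for d, r in zip(drow, rrow)]
            for drow, rrow in zip(last, residual)]


# Hypothetical isovists in metres: a near facade (about 5 m) beside an open
# street canyon (about 40 m), giving a sharp edge between columns 1 and 2.
LAST = [[5.0, 5.0, 40.0, 40.0],
        [6.0, 6.0, 42.0, 42.0]]
NEXT = [[4.0, 4.0, 39.0, 39.0],
        [5.0, 5.0, 41.0, 41.0]]

RESIDUAL = depth_residual(LAST, NEXT)
PREDICTED = apply_residual(LAST, RESIDUAL, max_range=100.0)
EDGE_BEFORE = LAST[0][2] - LAST[0][1]
EDGE_AFTER = PREDICTED[0][2] - PREDICTED[0][1]
```

Because the residual here is smooth (every direction changes by the same small amount), the 35 m jump at the building edge is inherited unchanged from the previous isovist instead of having to be regenerated by the decoder.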

Abstract:
World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task‑incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human‑verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open‑domain physical prediction, and propose Privileged‑Future On‑Policy Self‑Distillation (PF‑OPSD). During training, PF‑OPSD uses ground‑truth future videos and answers only as teacher‑side privileged context to evaluate on‑policy concrete‑reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF‑OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF‑OPSD.

Abstract:
Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See–Infer–Intervene (SII) framework, where a device must see pre‑interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action‑conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart‑retail benchmark containing state manifests, pre‑interaction videos, candidate responses, action‑conditioned outcomes, and best‑action labels. When conditioned on ground‑truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held‑out target videos, outperforming a zero‑shot Qwen2.5‑VL‑7B baseline and training variants without balanced action supervision; end‑to‑end video‑only selection drops to 0.295, below the 5‑class balanced random baseline of 0.414, identifying video‑to‑state grounding as the dominant deployment‑time bottleneck. A preliminary staged real‑store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index‑level labels.

Abstract:
Humanoid loco‑manipulation requires stable whole‑body control under varying object masses and pickup/placement heights. This becomes particularly challenging in sim‑to‑real transfer, where object‑induced load variation and robot‑side dynamics mismatch interact during physical contact. Existing history‑based adapters often compress these factors into a single latent representation, which can weaken robustness under heavy‑load manipulation. We propose SplitAdapter: Load‑Aware Humanoid Loco‑Manipulation via Factorized Adaptation, which freezes a pretrained box manipulation policy and extends it with object/load and dynamics‑aware context encoders trained with split world‑model objectives, GRL‑based cross‑adversarial regularization, and hierarchical Feature‑wise Linear Modulation (FiLM). In sim‑to‑sim experiments and real‑world deployment, SplitAdapter improves Full‑task success over the base policy and world‑model FiLM baselines across object masses of 2, 4, and 6 kg and pickup/placement heights of 0, 30, and 60 cm, with the largest improvements under heavy‑load conditions.

Abstract:
Navigating a drone in unseen and cluttered environments requires reliable generalization to new scene layouts and an understanding of environmental structure relative to the robot's capabilities. Previous methods, which assume the same environment configuration, often rely heavily on human‑designed perception pipelines and predefined rules to guide the robot toward the target. This process is environment‑dependent and generalizes poorly across environments. Inspired by animal navigation behavior, we design a navigation framework that navigates with a reinforcement‑learning‑based policy on top of a world‑model‑based environment understanding to overcome these issues. In addition, a sparse reward function without hand‑crafted shaping terms is designed to avoid local minima traps and encourage yaw control behaviors. In simulation and on real drones, our method exhibits emergent capabilities for navigating complex, unseen environments and escaping local optima where other methods fail. In challenging maps, it achieves a 5.3% higher navigation success rate than the best baseline. Furthermore, the proposed framework achieves effective sim‑to‑real transfer without any tuning during deployment. The code will be publicly available.

Abstract:
As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long‑tail scenarios remains a critical bottleneck. In closed‑loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction‑based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid‑ and post‑trained from the Cosmos diffusion model to autoregressively generate action‑conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid‑ and post‑training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed‑loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next‑generation autonomous driving policies. We additionally show preliminary results indicating that a world‑action model (WAM) post‑trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA‑based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real‑time world model like OmniDreams to also serve as a backbone for policy architectures.

Abstract:
A latent world model built from an equivariant encoder E and an equivariant predictor f inherits a provable symmetry of its training loss: when the world's dynamics genuinely carries a group G acting on latents by an orthogonal representation ρ(g), the one‑step prediction relMSE is exactly invariant across the whole group, so fitting the dynamics on a restricted slice of orientations mathematically determines it on the entire orbit (jǔ yī fǎn sān, "infer three from one"). We verify this end‑to‑end at laptop scale (CPU/MPS, fully seeded). [A] The symmetry survives a real Muon/AdamW + EMA + VICReg run: the composed encode‑then‑predict residual is ~10^-6 after optimisation, not just at initialisation, and under any optimiser. [B] One‑step error is flat to five digits across the group, while a same‑hypothesis‑class non‑equivariant baseline fits the slice but breaks out‑of‑distribution (VN × 1.00 vs baseline × 13.8 in 2D, × 17.2 in 3D, × 157 over the full SE(3) ladder), with the equivariant model 4.5‑7.4× smaller. [C] The same isometry argument lifts to closed loop: under a matching equivariant planner the control trajectory at orientation g is exactly ρ(g) applied to the seen one, so closed‑loop error is invariant across the group, namely float‑floor‑exact in 2D/SO(2) on real PushT and statistically flat in 3D/SE(3) (disjoint 95% CIs). We stress‑test the prior against Sutton's Bitter Lesson: augmentation, brute‑force scale, and soft‑equivariance each close at most the across‑group task metric, never the float‑floor exactness. Because equivariance is closed under composition, the H‑fold rollout stays flat (× 1.00, ≤ 2 × 10^-7) at every horizon, while the baseline's residual compounds with H. Out of scope: task‑success sweeps, planner‑free invariance, and scaling.
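
The invariance argument can be checked on a toy case. The sketch below is a minimal linear example in 2D under SO(2), assuming latents are observed directly (no encoder); the matrices, sample points, and function names are hypothetical and chosen only to illustrate the claim. A predictor of the form a·I + b·J commutes with every rotation, so its relative MSE is the same at every orientation, whereas a generic matrix that fits the training slice exactly degrades once the inputs are rotated.

```python
import math

def matvec(m, v):
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def rel_mse(pred_w, true_w, points):
    """Sum of squared prediction errors divided by sum of squared targets."""
    num = den = 0.0
    for z in points:
        p, t = matvec(pred_w, z), matvec(true_w, z)
        num += (p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2
        den += t[0] ** 2 + t[1] ** 2
    return num / den

# True dynamics a*I + b*J: commutes with rotations, so the world carries SO(2).
true_w = ((0.9, -0.2), (0.2, 0.9))
# Equivariant predictor: same structural form, deliberately imperfect fit.
equi_w = ((0.8, -0.25), (0.25, 0.8))
# Non-equivariant baseline: its first column matches true_w, so it is exact
# on the training slice (points on the x-axis) but not elsewhere.
base_w = ((0.9, 0.3), (0.2, 0.4))

slice_points = [(0.5, 0.0), (1.0, 0.0), (1.5, 0.0)]
angles = [k * math.pi / 6 for k in range(12)]

equi_err, base_err = [], []
for theta in angles:
    rotated = [matvec(rot(theta), z) for z in slice_points]
    equi_err.append(rel_mse(equi_w, true_w, rotated))
    base_err.append(rel_mse(base_w, true_w, rotated))

spread = max(equi_err) - min(equi_err)
```

Here `spread` stays at floating-point rounding level, while `base_err` is zero at the seen orientation and grows away from it. This mirrors, in miniature, the flat-versus-compounding contrast reported in [B]; it says nothing about the nonlinear, learned setting.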

Authors: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Abstract:
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture‑of‑transformers architecture. By supporting highly flexible input‑output configurations, Cosmos 3 unifies critical modalities for Physical AI, effectively subsuming vision‑language models, video generators, world simulators, and world‑action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general‑purpose backbones for embodied agents. Our post‑trained Cosmos 3 models were ranked as the best open‑source Text‑to‑Image and Image‑to‑Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW‑1.1 License (https://openmdw.ai/license/1‑1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos‑lab/cosmos3.

Abstract:
I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase‑folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit‑masked self‑supervised learning, predicts expected stellar flux. A matched‑filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single‑transit injection‑recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification‑based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit‑like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero‑shot cross‑mission transfer. At PLATO's 25‑second cadence, detection reaches 100 ppm, approaching the Earth‑analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system via pip install exoveil with pretrained weights and a candidate catalogue.
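
To make the detection step concrete, here is a minimal sketch of a variance-weighted matched filter applied to prediction residuals (observed flux minus predicted flux). It assumes a unit-depth box template and independent Gaussian noise with known per-sample variance; the synthetic residuals, the noise levels, and the function name are illustrative assumptions and do not reproduce the EXOVEIL implementation.

```python
import math
import random

def matched_filter(residuals, variances, width):
    """Variance-weighted SNR of a unit-depth box dip starting at each offset."""
    snr = []
    for k in range(len(residuals) - width + 1):
        num = den = 0.0
        for i in range(k, k + width):
            w = 1.0 / variances[i]      # inverse-variance weight
            num += -residuals[i] * w    # template value is -1 inside the box
            den += w                    # template squared is 1 inside the box
        # Under pure noise this ratio has zero mean and unit variance.
        snr.append(num / math.sqrt(den))
    return snr

# Synthetic residuals: quiet noise of 100 ppm, one noisy stretch of 3000 ppm,
# and a single injected 1000 ppm transit lasting 12 samples.
rng = random.Random(0)
n, start, width, depth = 300, 140, 12, 1e-3
sigma = [3e-3 if 40 <= i < 60 else 1e-4 for i in range(n)]
residuals = [rng.gauss(0.0, s) for s in sigma]
for i in range(start, start + width):
    residuals[i] -= depth

snr = matched_filter(residuals, [s * s for s in sigma], width)
best = max(range(len(snr)), key=snr.__getitem__)
```

Because each sample is weighted by its inverse variance, the noisy stretch contributes no more to the statistic than the quiet regions do, and the peak of `snr` lands at or next to the injected transit. No phase folding is involved, which is why a single transit is enough for this kind of detector.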

Abstract:
Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi‑agent settings introduces two critical challenges: data scarcity (coordinated multi‑view recordings are prohibitively expensive to collect for general open‑domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi‑agent video world models to open‑domain environments directly from single‑view videos. First, we introduce Monocular World‑State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego‑motion and the visible subject's spatial trajectory. This camera‑trajectory decomposition naturally extracts synchronized multi‑agent motion data within a shared 3D space, completely bypassing the need for multi‑camera setups. Second, for precise visual control, we develop the Subject‑Aware World Generator to enable appearance‑driven simulation conditioned on per‑agent identity images. Finally, to ensure both views are grounded in the same physical reality, we propose World‑State Alignment (WSA), a per‑frame inter‑branch cross‑attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, so that the shared 3D environment and physical events remain well‑aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross‑view consistency and identity fidelity, establishing a highly scalable, physics‑driven paradigm for multi‑agent video world modeling.

Abstract:
World models are often evaluated by native frame cadence, but higher nominal frame rate can trade away long‑horizon scene stability. This article reports an independent proof of concept implemented using Overworld's Waypoint‑1.5 family and WorldEngine runtime on a Windows fallback stack with ONNX Runtime + DirectML and an FSR4 DX12 bridge. The tested coherence‑first branch generates higher‑context anchor frames at a 15 FPS presentation‑timeline cadence and reconstructs presentation to 30 FPS using latent‑delta motion guidance and synthesized depth. It is compared against a lower‑context cadence‑first baseline that generates about 30 FPS natively under the same seed, route, control script, target presentation duration, and local time‑scaling regime. Across forest, sword, desert, and snow scenes, the coherence‑first branch preserves path geometry, object identity, large silhouettes, and depth layering longer, while the baseline degrades earlier into brightness drift and geometric distortion. Lightweight temporal metrics and paired videos support the visual comparison, with LPIPS favoring the coherence‑first branch across all tested scenes. Here compute‑normalized means approximately matched same‑GPU, same‑timescale operating points, not exact FLOP parity or measured realtime throughput. A separate heavier sword‑scene probe suggests local non‑monotonicity: more context and denoising did not automatically improve quality. These results support coherence‑first allocation as a practical proof‑of‑concept strategy under limited inference budget, not as a finished realtime renderer.

Abstract:
Discrete visual tokens should provide a compact representation for both token‑based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation‑guided and geometry‑enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state‑related cues, we add adjacent‑frame depth and relative‑pose supervision during training and stabilize joint objectives with multi‑codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT‑style next‑token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.

Abstract:
With the rapid development of large language models and diffusion‑based content generation, world modeling has attracted increasing research attention, benefiting downstream domains such as game engines, embodied AI, and autonomous driving. By explicitly incorporating user actions into world state transitions, recent literature equips world modeling with interactivity in an action‑conditioned video or 3D generation paradigm, further enhancing controllability over world evolution and enabling users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we systematically review recent research trends, technical developments, and evaluation benchmarks, and propose potential future directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges: action‑conditioned controllability, long‑horizon interactions and memory, and action‑following responsiveness for real‑time interactivity. Furthermore, we thoroughly compare existing benchmarks and metrics in four specific application fields: open‑world exploration, game engines, autonomous driving, and robotics. Finally, we discuss several promising future directions toward next‑generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome‑Interactive‑World‑Model.

Abstract:
Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present τ_0‑World Model (τ_0‑WM), a unified video‑action world model that integrates policy learning, video prediction, and action evaluation within a single future‑predictive framework. Built on a shared video diffusion backbone, τ_0‑WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi‑view observations, language instructions, and robot state. Second, an action‑conditioned video simulator rolls out candidate action chunks into multi‑view futures and predicts dense task‑progress scores. The model is trained on approximately 27,300 hours of real‑robot teleoperation, UMI‑style interaction, egocentric human videos, and rollout or failure trajectories using modality‑specific supervision masks. At inference time, τ_0‑WM uses test‑time computation to sample action candidates, rank them with re‑denoising consistency, and invoke simulator‑based rectification for low‑quality candidates. On challenging long‑horizon and fine‑grained robotic manipulation tasks, τ_0‑WM shows superior performance over other relevant baselines.

Abstract:
Recent advancements in video‑based world models have demonstrated an unprecedented ability to synthesize high‑fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text‑video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long‑term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub‑dimensions for comprehensive characterization of long‑term memory. Our benchmark is built upon rigorously curated real‑captured long videos and is evaluated with rule‑based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state‑of‑the‑art video world models reveal critical systemic limitations of existing methods in long‑term state retention, providing a standardized benchmark and clear research direction to advance the field.

Abstract:
Offline meta‑reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta‑learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse‑reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information‑theoretic task representation learning with a Transformer‑based stochastic world model. Our approach extracts task‑defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination‑based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state‑of‑the‑art approaches, with superior stability and generalization under out‑of‑distribution and sparse‑reward settings.

Abstract:
Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long‑horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task‑relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event‑preserving sparse‑to‑dense framework that avoids dense frame‑by‑frame generation. SKIP first identifies task‑relevant keyframes by leveraging robot‑aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action‑conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts 4.16× faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by 89.0%. Importantly, SKIP‑generated videos are effective policy‑training data. Even when they fully replace real demonstrations, π_0.5 success drops only 1.3 pp in LIBERO simulation and 6.7 pp on the real robot, whereas fully dense frame‑by‑frame generation collapses by 48 to 58 pp.

Abstract:
Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task‑relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task‑dependent physical constraints into a unified planning geometry. By adding this optimal‑control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal‑conditioned image‑to‑video generation, video dynamics editing, and counterfactual generation.

Abstract:
A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations, capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. We find that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

Abstract:
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego‑robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high‑impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high‑impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion‑based WMs. However, optimizing high‑dimensional noise is challenging: the optimization must reason about nuanced, scene‑dependent target events in generated videos while avoiding out‑of‑distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision‑Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state‑of‑the‑art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high‑impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

Abstract:
World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi‑axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state‑space and recurrent approaches, transformer‑based models, diffusion‑based generators, physics‑informed networks, and language‑augmented multimodal systems; (iii) reasoning strategy, covering imagination‑based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive‑science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain‑of‑thought reasoning with world‑model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim‑to‑real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation‑scale interactive simulators, and safe deployment in safety‑critical domains.

Abstract:
Bridging the gap between visual realism and physical understanding is a core challenge for video‑based world models. We study the structural identifiability of continuous‑time physical laws from raw pixels, focusing on whether an encoder‑only pipeline can uniquely recover the parameters of second‑order linear ODEs. We prove that a level‑set slope‑coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance‑floor regularizer to stabilize the decoder‑free objective and prevent latent collapse. Validated on synthetic and real‑world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute‑intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
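
For readers unfamiliar with the damping regimes mentioned above, the following is a brief reminder in standard textbook form. The parameterization with coefficients b and k is an assumption made here for illustration; the abstract does not state which parameterization the paper uses.

```latex
% Assumed parameterization (not taken from the paper): damping b >= 0, stiffness k > 0.
\[
\ddot{x}(t) + b\,\dot{x}(t) + k\,x(t) = 0,
\qquad
\lambda_{\pm} = \frac{-b \pm \sqrt{b^{2} - 4k}}{2}
\]
\[
\begin{cases}
b^{2} < 4k & \text{underdamped (complex roots, oscillatory decay)} \\
b^{2} = 4k & \text{critically damped (one repeated real root)} \\
b^{2} > 4k & \text{overdamped (two distinct real roots)}
\end{cases}
\]
```

Under this reading, recovering the physical parameters means estimating (b, k) from the latent trajectory, and the sign of the discriminant determines which regime a clip belongs to.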

Abstract:
Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task‑relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action‑conditioned video generators, three‑ and four‑dimensional scene predictors, physics‑informed simulators, and predictive modules inside vision‑language‑action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot‑learning pipeline. We operationally define a world model as an action‑conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction‑action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search‑based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post‑training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task‑specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed‑loop use.

Abstract:
Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end‑to‑end policies that are hard to explain and constrain and that require domain‑specific datasets and fine‑tuning. We propose a planner‑executor agent for PX4‑based drones that decouples high‑level mission planning from low‑level control. A large language model performs single‑pass task planning, while execution is handled through a structured ROS 2 tool‑calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision‑language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution‑time action failures. We position our approach within three common design patterns for foundation‑model‑based robotics systems and demonstrate its feasibility in PX4 software‑in‑the‑loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
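
The pinhole depth projection step mentioned above can be illustrated with a minimal sketch. It assumes a standard pinhole camera model, a metric depth value at the detection centre, and camera-frame output; the names `Intrinsics`, `bbox_center`, and `backproject` are hypothetical and are not taken from the PEACE code base.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def bbox_center(x_min: float, y_min: float,
                x_max: float, y_max: float) -> Tuple[float, float]:
    """Return the pixel centre of a 2D detection box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def backproject(u: float, v: float, depth: float,
                K: Intrinsics) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Uses the standard pinhole relations x = (u - cx) * z / fx and
    y = (v - cy) * z / fy, with z equal to the depth.
    """
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)


if __name__ == "__main__":
    K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    u, v = bbox_center(400, 200, 440, 280)
    print(backproject(u, v, 2.0, K))
```

A real pipeline would additionally transform the camera-frame point into the vehicle or world frame using the drone pose, which is omitted here.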

Abstract:
Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision‑language‑action models, and world‑model‑based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black‑box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state‑estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black‑box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical‑action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms.

Abstract:
Recent progress in generalizable embodied control has been driven by large‑scale pretraining of Vision‑Language‑Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real‑world manipulation. Yet, embodiment differences and the frequent absence of task‑aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action‑related information they derive: (i) latent action representations that encode inter‑frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image‑plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training‑ready episodes, grounding video‑derived supervision into robot‑executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real‑world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA‑Survey.

Abstract:
Transformer‑based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder‑only Transformer (GPT‑J) on two structurally equivalent multi‑hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention‑head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single‑layer RoPE‑based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real‑world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.

Abstract:
End‑to‑end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world‑model‑based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning‑relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse‑dynamics‑guided future prediction framework for world‑model‑based end‑to‑end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition‑aware trajectory features and recover planning‑relevant motion deltas that explain how the latent world evolves over time. These inverse‑dynamics‑derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed‑loop refinement module further improves long‑horizon consistency by reusing the optimized trajectory for another round of future‑aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state‑of‑the‑art performance among comparable methods.

Abstract:
In cooperative multi‑agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single‑agent settings, their application to MARL remains limited by an inability to handle teammate‑induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer‑style recurrent state‑space model (RSSM) into environment and teammate components, and learns an auxiliary Theory‑of‑Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero‑shot and few‑shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human‑compatible AI.

Abstract:
Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine‑grained spatio‑temporal consistency under long‑horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame‑level implicit modeling, and propose a fine‑grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long‑horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine‑grained access to global history and Anchored Local Memory for stable and high‑quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state‑of‑the‑art methods. By ensuring precise and efficient long‑term memory and achieving superior extrapolation capabilities, DecMem enables minute‑level controllable long video generation with high fidelity and consistency.

Abstract:
Diffusion‑based robot navigation world models are typically trained with parallel supervision but run autoregressively at inference time during path planning. This mismatch causes a distribution shift between training and inference, which destabilizes long‑horizon prediction. We propose AR Forcing, an autoregressive training strategy that integrates the standard diffusion loss into the autoregressive training loop. At each step, the model uses its own predictions to update the context and optimizes the single‑step noise prediction objective, thereby explicitly exposing the model to the inference state distribution during training. Our method does not require additional discriminators or distribution‑matching losses, retains the original diffusion framework and sampler, and is easy to integrate. Experiments on multi‑domain navigation datasets (RECON, SCAND, HuRoN, TartanDrive) show that, compared with strong baselines, AR Forcing improves the consistency of generated images during long‑horizon navigation and the accuracy of predicted trajectories, enhancing the robustness of the model in complex known and unknown environments. We will release the code soon.

Abstract:
Interactive video world models generate video chunk by chunk in response to user‑controlled camera movements, enabling applications such as real‑time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training‑free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory‑dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early‑step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware‑software co‑designed 3D block sparse attention with fused Triton kernels. Evaluated on HY‑WorldPlay and Matrix‑Game‑3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

Abstract:
Joint‑Embedding Predictive Architectures (JEPAs) learn compact latent world models by predicting future embeddings, but no single coordinate of the latent is designated to encode task progression. We carve the JEPA latent into two orthogonal subspaces with disjoint roles: a low‑dimensional progression subspace shaped by a cosine‑margin triplet loss, and a high‑dimensional content subspace regularised by the existing SIGReg objective of LeWM. We prove that the two anti‑collapse forces act on disjoint coordinates, so they compose additively rather than competing on the same dimensions. Our method, SD‑JEPA, improves over the LeWM baseline on the majority of its control benchmarks at matched compute, and outperforms the strongest non‑LeWM JEPA baseline on Push‑T; a subspace‑ablation falsifier confirms the split is the load‑bearing ingredient. Beyond planning, the resulting 1‑D angular progression coordinate functions as a scene‑aware compass on the latent. It advances with task progress, regresses when the agent backtracks, and under controlled perturbations both spikes and relocalises to a semantically appropriate new task‑phase sector, separating the moment of surprise from its meaning in a way that prediction‑error scalars cannot. Three quantitative tests back this up: |Δθ_t| outperforms the standard latent‑prediction‑error surprise at localising semantic events on 40 held‑out cube episodes by up to +0.18 pooled AUROC (97.5% per‑episode win rate at ±1‑step tolerance); a within‑episode linear probe across all four environments (40 episodes per env) shows the 8‑dimensional progression subspace (4.2% of the latent) explains 72‑95% of task‑progress variance.
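
To make the phrase "cosine‑margin triplet loss" concrete, here is a minimal sketch of one common form of such a loss. The exact formulation, the margin value, and how anchors, positives, and negatives are sampled in SD‑JEPA are not given in the abstract, so everything below (including the function names) is an assumption for illustration only.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_margin_triplet(anchor: Sequence[float],
                          positive: Sequence[float],
                          negative: Sequence[float],
                          margin: float = 0.2) -> float:
    """Hinge loss that is zero once the anchor is closer (in cosine
    similarity) to the positive than to the negative by at least `margin`.

    The margin of 0.2 is an arbitrary illustrative value.
    """
    return max(0.0, margin + cosine(anchor, negative) - cosine(anchor, positive))


if __name__ == "__main__":
    # Anchor aligned with the positive and orthogonal to the negative.
    print(cosine_margin_triplet((1.0, 0.0), (1.0, 0.0), (0.0, 1.0)))
```

In a progression subspace, one plausible choice is to take the positive from a nearby task phase and the negative from a distant one, but that sampling scheme is a guess rather than something stated in the abstract.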

Abstract:
Text‑agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient‑free framework that turns offline trajectories into executable Python world models through counterexample‑guided code repair. Instead of predicting the next observation with a black‑box model, PatchWorld induces symbolic belief‑state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld‑Simple achieves the highest code‑based planning score among evaluated methods, reaching 76.4% macro success in live one‑step lookahead while invoking no LLM calls inside the world‑model prediction module itself. We further find that a human‑specified residual‑memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action‑discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU‑KnowComp/PatchWorld.

Abstract:
Reinforcement learning is a promising approach for improving the capabilities of vision‑language‑action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long‑horizon manipulation. In this work, we present Feat2Go, a fine‑grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch‑level similarity to subgoal states and partitioning episodes into semantic stages with trend‑based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single‑arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLA‑OFT from 17.5% to 82.9% average out‑of‑distribution success while retaining 96.9% in‑distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain‑randomized task settings, outperforming prior reinforcement learning methods.

Abstract:
World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation‑predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query‑level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed‑form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

Abstract:
Data‑driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics ‑‑ realistic temporal deformations of static objects under various physical conditions ‑‑ remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small‑scale datasets. We propose that these restrictions can be overcome by learning a data‑driven kinematic state parameterization for object‑centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer‑based encoder‑decoder model on a curated large‑scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low‑dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: https://chen‑geng.com/neurok

Abstract:
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real‑world generalization due to the sim‑to‑real gap. We present YoCausal, a two‑level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real‑world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow‑of‑time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non‑causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state‑of‑the‑art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human‑level causal cognition.

Abstract:
World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language‑based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess‑World‑Model, a large‑scale state‑tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held‑out real‑game split, we include an out‑of‑distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state‑track, while input‑dependent linear RNNs require expressive state‑transition matrices to do so. We therefore benchmark a causal Transformer, block‑diagonal SLiCE, Mamba‑3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real‑game performance saturates above 18 million parameters, but the random‑uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state‑transition mechanisms reduce performance on the out‑of‑distribution split for all three recurrent models. Together, these results establish Chess‑World‑Model as a practical large‑scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

Abstract:
Modern societies possess more information than ever before, yet they do not converge toward a single shared understanding. The same events, facts, laws, technologies, or risks can be interpreted as evidence of freedom, danger, exclusion, injustice, responsibility, or unrealized possibility. Existing discussions often treat such disagreement as a conflict of values, preferences, or beliefs. This paper argues that disagreement is already a late‑stage phenomenon. The central premise is simple but not trivial: observation is not yet inference. Not every observation becomes inferentially relevant, and not every possible object in an observation sequence becomes an estimation target. A possible target becomes admissible only when a state representation can be constructed that is approximately sufficient for prediction, evaluation, or action with respect to that target. This paper develops a world‑model theory of cognitive diversity and alignment by reconstructing recognition as the construction of such approximate sufficient statistics under finite informational, representational, observational, and action constraints. It formulates this position as the Multi‑Phase Inference Assumption (MIA) and defines its core internal mechanism as the Multi‑Phase Inference Mechanism (MIM). The framework introduces alignment maps and transformation loss to analyze how heterogeneous world models communicate without being collapsed into a single representation. World‑model alignment is therefore processability, not agreement: the design of AI systems that help heterogeneous forms of intelligence remain mutually processable while preserving their distinct error‑detection capacities.

Abstract:
Vision‑language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce \wmw, an evaluation framework for auditing the language‑expressed physical commitments of VLMs. Instead of scoring only I, q ↦ a, we ask models to produce a typed trace I, q ↦ (s_0, Δs, s_1, a): an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer‑trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release \tracebank, a controlled trace resource with \nSeed schema‑ and recomputation‑validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical‑reasoning examples. \wmw reveals failures that answer‑only evaluation misses: 35% of correct answers from mid‑tier models are backed by physically invalid traces. Verifier‑guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace‑level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final‑answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.
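
The idea of auditing a typed trace (s_0, Δs, s_1, a) can be sketched as follows. This is a toy illustration under strong assumptions: states are flat dictionaries of numeric quantities, the transition is additive, and the answer is checked by a caller-supplied rule. The names `Trace` and `audit` are hypothetical and do not come from the released verifier; only the error labels "transition" and "faithfulness" are borrowed from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Trace:
    """Toy typed trace: initial state, state change, resulting state, answer."""
    s0: State
    delta: State
    s1: State
    answer: str


def audit(trace: Trace,
          answer_from_state: Callable[[State], str],
          tol: float = 1e-6) -> List[str]:
    """Return a list of error labels; an empty list means the trace passed.

    "transition": s0 + delta does not reproduce s1 for some quantity.
    "faithfulness": the answer disagrees with what s1 implies.
    """
    errors: List[str] = []
    keys = set(trace.s0) | set(trace.delta) | set(trace.s1)
    for key in sorted(keys):
        expected = trace.s0.get(key, 0.0) + trace.delta.get(key, 0.0)
        if abs(expected - trace.s1.get(key, 0.0)) > tol:
            errors.append("transition")
            break
    if answer_from_state(trace.s1) != trace.answer:
        errors.append("faithfulness")
    return errors


def ball_rule(state: State) -> str:
    """Hypothetical rule mapping a final state to the expected answer."""
    return "on_ground" if state["height_m"] <= 0.0 else "in_air"


if __name__ == "__main__":
    # The answer matches the final state, but the stated change does not
    # lead from s0 to s1, so the trace is physically inconsistent.
    bad = Trace({"height_m": 2.0}, {"height_m": -1.0}, {"height_m": 0.0}, "on_ground")
    print(audit(bad, ball_rule))
```

The example at the bottom corresponds to the case the abstract highlights: a correct final answer backed by an invalid trace, which answer-only scoring would not detect.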

Abstract:
Action‑conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action‑conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference‑free physical consistency; Action‑Following Fidelity, which measures whether predictions respect task‑relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure‑inducing actions. To support this evaluation, we curate a human‑annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector‑conditioned robotic world models, text‑conditioned generative world models, open‑weight systems, closed‑source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action‑conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

Abstract:
Model‑based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero‑sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic‑based simplification bounding the global policy‑value gap by the local critic's loss; and (3) an Error‑MDP duality, proving that finding the worst‑case policy is formally dual to a standard RL problem where the reward is the one‑step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by 1.5‑2.2× and enables policies trained purely in simulation to match near‑optimal real‑world performance.

Abstract:
What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE‑based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry: direction accuracy is 0.677 ± 0.029 versus 0.547 for a randomly initialized encoder, and position RSA is 0.192 ± 0.047 versus 0.029 for random encoders (a 6.6× improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co‑improve (Spearman r=‑0.61, p=0.004), consistent with the shared‑driver account. We confirm this through a double knockout: standard KL regularization (beta=0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near‑chance by step 50,000, exactly as the shared‑driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.
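
For readers unfamiliar with representational similarity analysis (RSA), the following is a minimal sketch of one common variant: the Spearman correlation between pairwise distances in latent space and pairwise distances between physical positions. The abstract does not state which distance or correlation the authors use, so the Euclidean distance, the function names, and the toy data here are assumptions for illustration only.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rsa_score(latents, positions) -> float:
    """Spearman correlation between latent-space and physical-space distances."""
    return pearson(ranks(pairwise_distances(latents)),
                   ranks(pairwise_distances(positions)))
```

Under this definition, a latent space that is a scaled copy of the physical layout scores 1, while a latent space whose distances are unrelated to physical distances scores near 0.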

Abstract:
World models for interactive video generation have largely focused on single‑agent settings, where future observations are generated from a single control signal. However, many generated environments require multi‑agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi‑agent design: agents should remain independently controllable, permutation‑symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi‑agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter‑free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation‑equivalent, enabling scalable agent identity without learned per‑slot identities or a fixed agent ordering. To avoid dense all‑to‑all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross‑agent attention cost from quadratic to linear in the number of agents. For real‑time rollout, we distill a full‑context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action‑responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter‑agent consistency over slot‑based and dense‑attention baselines, while generalizing from two to four players without additional training.
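
The geometric idea behind representing agents as vertices of a regular simplex can be sketched as follows. This is not the paper's Simplex Rotary Agent Encoding: how the vertices are mapped into rotary angle space is not described in the abstract. The sketch only uses a standard construction (centring the standard basis vectors and normalising) to show why such vertices are distinct yet interchangeable: every pair of vertices has the same inner product, -1/(n-1). The function names are hypothetical.

```python
import math
from typing import List


def simplex_vertices(n: int) -> List[List[float]]:
    """Unit-norm vertices of a regular simplex with n vertices, embedded in R^n.

    Construction: subtract the centroid from each standard basis vector,
    then normalise to unit length.
    """
    if n < 2:
        raise ValueError("need at least two agents")
    vertices = []
    for i in range(n):
        v = [(1.0 if j == i else 0.0) - 1.0 / n for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        vertices.append([x / norm for x in v])
    return vertices


def dot(a: List[float], b: List[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Because all pairwise inner products are equal, relabelling the agents permutes the vertices without changing the geometry, which is the symmetry property the abstract emphasises.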

Abstract:
Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID's health‑and‑wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer‑wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout‑based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self‑reported valence and arousal. The world model serves both as an in‑silico simulator for offline policy training and as a stress‑testing tool before deployment. A recommender policy initialized by behaviour cloning is fine‑tuned offline with Direct Preference Optimization (DPO) against a configurable multi‑objective utility function. Under a strict cold‑start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.

Abstract:
World models have enabled interactive exploration of game environments and robotic manipulation, but physical engineering remains beyond their reach: real materials exhibit nonlinear constitutive laws, carry history‑dependent internal state, undergo inertial dynamics, and may possess hierarchical structures spanning multiple length scales. We present LEIA (Learned Environment for Interactive Architected materials), a world model that lets engineers apply boundary conditions step by step and observe the resulting deformation and stress fields in real time. LEIA handles large three‑dimensional unstructured meshes and generates autoregressive responses to user‑specified loading. We introduce MicroPlate, a benchmark of architected plates spanning two regimes of microstructure modeling: architected lattices that resolve microstructure explicitly through three‑dimensional geometry, and a homogeneous plate where microstructural change is modeled implicitly through internal degrees of freedom. MicroPlate is used to assess LEIA alongside four baseline methods across both regimes. Finally, we demonstrate that LEIA enables efficient candidate generation and ranking for fast surrogate‑guided search for de novo designs of architected materials, with stress‑accurate candidate ranking validated by finite element ground truth.

Abstract:
Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi‑horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per‑trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label‑free baselines including deep ensembles, learned error heads, gradient‑magnitude indicators, and locally‑adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing‑equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same‑hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference‑solver fallback, deferring uncertain trajectories and roughly halving the surrogate's residual error at the default operating point. The recipe applies without modification across reaction‑diffusion, compressible Euler, and rigid‑body collision dynamics.
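
The second operating point (Mode 2) can be pictured as a simple gate. The sketch below is a minimal illustration under stated assumptions: `surrogate`, `reference_solver`, and `error_score` are hypothetical callables, and in particular `error_score` is only a placeholder, since the abstract does not say how the error map is derived from the surrogate's forward passes or how the threshold is chosen.

```python
from typing import Callable, List, Sequence, Tuple

State = List[float]


def gated_rollout(
    initial_states: Sequence[State],
    surrogate: Callable[[State], State],
    reference_solver: Callable[[State], State],
    error_score: Callable[[State, State], float],
    threshold: float,
) -> Tuple[List[State], List[int]]:
    """Predict each trajectory with the surrogate, deferring flagged ones to the solver.

    Returns the chosen outputs and the indices that were deferred.
    """
    outputs: List[State] = []
    deferred: List[int] = []
    for index, state in enumerate(initial_states):
        prediction = surrogate(state)
        if error_score(state, prediction) > threshold:
            # Uncertain trajectory: pay for the slow, trusted solver.
            outputs.append(reference_solver(state))
            deferred.append(index)
        else:
            # Confident trajectory: keep the fast surrogate prediction.
            outputs.append(prediction)
    return outputs, deferred
```

Setting the threshold to infinity recovers Mode 1 (surrogate only), so the threshold trades throughput against residual error.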

Abstract:
Whether large language models (LLMs) construct internal spatial world models from pure‑text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six‑level capability hierarchy (L0‑L5) spanning atomic spatial facts to generative world‑graph construction, together with four diagnostic axes probing frame of reference, reading‑direction bias, reasoning‑effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured‑text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured‑output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure‑text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text‑only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure‑text spatial reasoning as a multi‑axis world‑modeling problem and motivate multimodal and scratchpad‑augmented reasoning as future directions.

Abstract:
Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training‑free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self‑scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion‑relevant regions with a dynamic spatiotemporal mask, and use it for best‑of‑N search, gradient‑based self‑refinement, or both. Across text‑to‑video and image‑to‑video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM‑based scoring and external world‑model baselines in several settings. With TurboWan2.2, Proprio improves Physics‑IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2‑hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio‑selected or refined videos for physical plausibility in roughly two‑thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.

Abstract:
Predicting how a cell will change its transcriptional state under a developmental signal or a genetic perturbation is the computational core of in‑silico biology and the AI Virtual Cell program. Existing approaches either fit static control‑to‑treated maps that discard time, or solve multi‑step ODE / Schrödinger‑bridge problems on each dataset independently. We introduce Chreode, a one‑step cell world model that predicts action‑conditioned cell‑state transitions through a structured residual transition operator. It shifts distributional evolution from inference time to training time, enabling single‑pass generation while preserving a Waddington‑inspired decomposition into downhill landscape flow, rotational in‑tangent dynamics, and stochastic spread. The model is pretrained with a shared scVI encoder and a DiT‑based dynamics backbone on a 2.4M‑cell mouse embryonic atlas spanning 7 datasets. As a fine‑tuning initialization, Chreode improves per‑target Sinkhorn distance on Weinreb hematopoiesis and Veres islet differentiation over matched scratch models, PI‑SDE, and PRESCIENT. As a transferable gene‑state embedding for GEARS, the pretrained dynamics representation reduces shared‑vocabulary DE20 mean squared error on Norman Perturb‑seq from 0.2121 to 0.1858, a 12.4% relative improvement, without changing the GEARS training procedure. We interpret this transfer to perturbation prediction as evidence that pretrained developmental‑trajectory dynamics encode differentiation primitives transferable to CRISPR‑induced state shifts, since both involve cell‑state transitions in a shared latent geometry. The pretrained backbone additionally produces zero‑shot clonal fate scores on Weinreb that are competitive with strong dynamic‑OT baselines.

Abstract:
Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action‑labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as‑is while training an embodiment‑specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment‑agnostic, different video models can be interchanged easily without re‑training the IDM, and the IDM can be independently trained with readily available self‑play data. We present a closed‑loop, video‑to‑action policy that combines an action‑free video world model with a carefully‑designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data‑efficient and scalable to high‑dimensional action spaces. Our policy, which we coin the Video‑to‑Embodied Robot Action Model (VERA), achieves strong performance across simulated and real‑world benchmarks, including zero‑shot Panda arm manipulation and 16‑DoF Allegro‑hand dexterous cube re‑orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment‑specific IDMs. Our results show that decoupled video planning plus faithful video‑to‑action translation is a viable alternative route towards zero‑shot, cross‑embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

Abstract:
Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What‑If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four‑part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state‑of‑the‑art models, no system exceeds 52% on the paired score, and open‑source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action‑conditioned simulation or model‑based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

Abstract:
We introduce GE‑Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed‑loop video world simulator for robotic manipulation. Building on the action‑conditioned video generation framework of Genie Envisioner, GE‑Sim 2.0 is re‑trained on thousands of hours of real‑world robot data spanning teleoperation, contact‑rich interaction, and on‑robot policy deployment, substantially improving action‑following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next‑chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine‑verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25‑frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long‑horizon evaluation. GE‑Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed‑source general video generators, and policies trained against its rollouts and rewards translate into measurable real‑world gains, establishing GE‑Sim 2.0 as a practical platform for scalable evaluation and closed‑loop learning of manipulation policies.

Abstract:
Reactive control is often considered insufficient for multi‑objective tasks because conflicting objectives give rise to local minima. We argue this limitation is not inherent but arises from static encodings that fail to reflect how objectives currently interact. We exploit the interaction structure encoded in a graph‑based world model by extending it with nullspace projections: conflicts are resolved where they arise by projecting lower‑priority gradients into the nullspace of higher‑priority ones, with priorities determined continuously from the current state. We demonstrate this in two domains where conflicts between objectives are central: navigation around non‑convex obstacles, where static potential fields fundamentally fail, and planar pushing of non‑convex objects, where our method achieves 100% success across 100 configurations versus 0% for the steepest‑descent baseline and ~55% for diffusion policy, without demonstrations or retraining. The same formulation transfers directly to a real robot with additional perceptual and kinematic constraints, accommodating them through the same mechanism.
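
The core operation, projecting a lower‑priority gradient into the nullspace of a higher‑priority one, can be sketched for the simplest case of two objectives with fixed priorities. This is the textbook projection g_low - (g_low · g_high / ||g_high||²) g_high, not the paper's full method: the graph‑based world model and the state‑dependent priorities are omitted, and the function names are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_into_nullspace(g_low: List[float], g_high: List[float],
                           eps: float = 1e-12) -> List[float]:
    """Remove from g_low its component along g_high.

    The result is orthogonal to g_high, so following it cannot, to first
    order, change the higher-priority objective.
    """
    hh = dot(g_high, g_high)
    if hh < eps:
        # A vanishing high-priority gradient imposes no constraint.
        return list(g_low)
    scale = dot(g_low, g_high) / hh
    return [l - scale * h for l, h in zip(g_low, g_high)]


def combined_direction(g_high: List[float], g_low: List[float]) -> List[float]:
    """High-priority gradient plus the non-conflicting part of the low-priority one."""
    projected = project_into_nullspace(g_low, g_high)
    return [h + p for h, p in zip(g_high, projected)]
```

For example, with `g_high = [1.0, 0.0]` and a partly opposing `g_low = [-1.0, 1.0]`, the projected low‑priority gradient is `[0.0, 1.0]`: the opposing component is removed and only the part that does not interfere survives.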

Abstract:
A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent variables from nonlinear observations, a property known as linear identifiability, in a broad class of worlds where latents evolve under stationary, additive‑noise transitions. Our main result is that among all such worlds, the Gaussian is the unique latent distribution for which this guarantee holds. The forward direction rests on a spectral decomposition in which each degree of nonlinearity is strictly penalized by alignment, making the linear map the optimum; the converse rules out every non‑Gaussian alternative. We further prove an approximate identifiability result where the guarantee degrades gracefully, and show that linear, orthogonal identifiability enables optimal latent‑space planning. We validate the theory with experiments ranging from 2D examples to 1024‑dimensional latents, including distributional ablations and pixel‑based robotic control. Our theory turns an empirically successful recipe into a mathematical guarantee, providing the foundation for building World Models that provably recover the structure of the world.

Abstract:
Model‑based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long‑horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non‑search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model‑Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi‑task offline pretraining, online learning, and offline‑to‑online fine‑tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large‑scale datasets, observing consistent and monotonic performance gains with increasing model capacity.

Abstract:
World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task‑irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward‑free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC‑WM, a framework for turning foundation‑model embeddings into compact, task‑sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC‑WM linearly projects high‑dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task‑centric dynamics. Theoretically, we show that TC‑WM suffices to identify the underlying task‑centric latent factors up to a simple transformation. Empirically, TC‑WM enables test‑time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world‑modeling quality and more precise control than state‑of‑the‑art approaches.

Abstract:
Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV‑cache mechanisms capable of non‑local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory‑oriented data, event‑aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera‑annotated training mixture combining VLM‑filtered real videos, generated hard dynamics, synthetic camera loops, and memory‑interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node‑structured curriculum, including node‑drop, noisy memory, frontier continuation, and reference‑cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM‑RoPE, an elegant camera‑phase RoPE extension, unlocks spatiotemporal retrieval at a single‑attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO‑Bench and recovery tasks. Furthermore, general image‑to‑video evaluations confirm this curriculum avoids catastrophic forgetting. We will open‑source our code, data, and models.

Abstract:
Training data for olfaction is scattered across disparate, non‑standardized datasets, which limits the ability to build representative world models. Olfactory navigation is a highly dynamic and non‑stationary task that benefits from real‑time continual learning. We introduce an adaptive framework called Grow‑Prune‑Freeze (GPF) networks, which enables an agent to learn continually by growing, pruning, and freezing early layers of its policy in response to world complexity. Grounding GPFs in non‑linear random matrix theory, we show that the work of Pennington & Worah (2017) can be extended from single hidden layers to n‑layer continual‑learning models, and that the eigenvalue composition of network weights is preserved as successive layers are added. We show that GPFs based on Expected SARSA achieve a 94% success rate on turbulent plume navigation, a partially observable, non‑stationary task representative of the "big world" challenges that motivate adaptive learning in robotics, and provide supporting methodology for applying GPFs in other world models. Further experiments provide evidence that GPFs may generalize well to other machine learning tasks such as reinforcement learning in Atari, image classification, and autoregressive language modeling. We open‑source all code and data to encourage improvements and further research in olfactory robotics.

Abstract:
Reinforcement learning offers a promising approach for scan‑order optimisation in laser additive manufacturing, where sequential scan decisions critically influence thermal accumulation, residual stress, distortion, and final part quality. A central challenge in applying RL to this domain lies in reward and world‑model fidelity: full finite‑element analysis is computationally prohibitive for dense in‑the‑loop evaluation, while cheap thermo‑inspired proxy metrics, though efficient, may capture only partial aspects of the true thermo‑mechanical objectives. This paper investigates a bilevel Proxy‑FEA diagnostic framework for reward and world‑model diagnosis in reinforcement‑learning‑guided scan‑order optimisation. The lower level employs lightweight scan‑path and thermo‑inspired proxies for rapid candidate generation and preliminary policy‑side screening, while the upper level utilises sparse Abaqus FEA simulations to provide simulation‑based reference labels. The framework is examined on a simplified whole‑track heating LDED32 stripe benchmark comprising ten representative scan strategies. Final‑cooling residual Mises stress, U3 vertical distortion, and PEEQ plasticity metrics reveal an observed stress‑distortion trade‑off rather than a single monotonic quality objective. Within the evaluated set, the center_out strategy emerges as a robust compromise candidate, while raster_left_to_right and edge_in form opposing endpoints of the trade‑off. Proxy‑FEA alignment analysis shows that current cheap path‑based metrics predominantly capture distortion‑related (U3) behaviour and exhibit only weak correlation with the sparse FEA reference labels. These findings highlight that proxy‑only reward designs risk misalignment in future RL training and underscore the value of sparse FEA reference signals for diagnostic‑guided reward and world‑model refinement prior to large‑scale policy optimisation.

Abstract:
Physical world knowledge resides mainly in videos. Equipping Vision‑Language‑Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA models to internalize physical dynamics and long‑term causality by predicting future video from past observations. However, naive next‑frame prediction faces two challenges: (1) unlike semantically distinct text tokens, video tokens are low‑entropy and redundant, causing prediction to degenerate into trivial extrapolation; and (2) world modeling poses a temporal dilemma, in that dense prediction captures instantaneous dynamics but cannot efficiently model long‑horizon causality. To learn world knowledge effectively, we introduce X‑Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real‑time action control. At its core lies a long‑horizon chunk‑wise auto‑regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra‑chunk frames for instantaneous dynamics and sparse inter‑chunk transitions for long‑term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long‑horizon training. To capture long‑term causality effectively, we present temporal importance sampling, which concentrates supervision on safety‑critical chunks identified by ego‑motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion‑based multi‑view renderer, improving visual appearance. Comprehensive experiments demonstrate that X‑Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world‑knowledge‑driven autonomous systems.

Abstract:
We propose Drift‑Resistant Navigation World Model, a generative model that mitigates both perceptual drift and geometric drift in conventional rollout‑based navigation world models. Existing methods recursively feed generated content into subsequent steps, causing noise accumulation and degraded predictions, i.e., perceptual drift. Meanwhile, their predictions often deviate from the agent's motion, resulting in geometric drift. We address both types of drift by redesigning world‑model prediction as an anchor‑guided rollout. Instead of rolling out every frame sequentially, we first predict sparse future anchors that serve as stable long‑range targets, and then generate intermediate frames within each chunk conditioned on both past context and future anchors. Importantly, these sparse anchors also provide geometric constraints, supported by bidirectional epipolar geometry, to localize where corresponding content should appear in the intermediate frames. Experiments on four benchmarks demonstrate consistent improvements over strong baselines in long‑horizon visual quality, geometric consistency, and multi‑view coherence. These gains further translate into improved downstream planning performance under the same planners, highlighting the importance of drift‑resistant, geometry‑aware prediction for reliable navigation world models.

Abstract:
Recent progress in video diffusion models has enabled extensive simulation of the physical world, but simulation of hand‑object interaction remains less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that use video diffusion and 3D reconstruction focus on navigation; dexterous manipulation has received limited attention, even though it has extensive applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real‑time interactivity as well as long‑term spatial consistency and memory. We propose a two‑stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and video in a unified feature space, using a Gaussian heatmap hand encoding for a more accurate hand representation. Then, we conduct rollout‑based autoregressive training with an updated spatial cache as an attention sink for spatial memory, which improves long‑term consistency and 3D‑aware simulation of dexterous manipulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.

Abstract:
Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action‑conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent‑space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics, Group‑Action Consistency (GAC) and Group‑Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state‑of‑the‑art video world models without degrading perceptual quality.
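
The identity, inverse, and composition conditions named above can be illustrated on SE(2) with a minimal sketch. The abstract does not give the form of the regularizers or of GAC and GAR, so everything below is an assumption for illustration: states are explicit poses rather than learned latents, and `exact_step`, `drifting_step`, and `consistency_residuals` are hypothetical names, not the authors' code.

```python
import math

IDENTITY = (0.0, 0.0, 0.0)  # SE(2) element stored as (x, y, theta)

def wrap(theta):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))

def compose(a, b):
    """SE(2) product a*b: apply motion b in the frame reached by a."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(at + bt))

def inverse(a):
    """Group inverse, so that compose(a, inverse(a)) is the identity."""
    ax, ay, at = a
    c, s = math.cos(at), math.sin(at)
    return (-(c * ax + s * ay), s * ax - c * ay, wrap(-at))

def dist(p, q):
    """Distance between two poses, with the angle difference wrapped."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + wrap(p[2] - q[2]) ** 2)

def consistency_residuals(step, state, a, b):
    """Identity, inverse, and composition residuals of a transition function `step`."""
    identity = dist(step(state, IDENTITY), state)
    inv = dist(step(step(state, a), inverse(a)), state)
    comp = dist(step(step(state, a), b), step(state, compose(a, b)))
    return identity, inv, comp

def exact_step(state, action):
    """A transition that realizes the group action exactly."""
    return compose(state, action)

def drifting_step(state, action):
    """A toy 'unfaithful' transition: adds a constant bias in x at every step."""
    x, y, th = compose(state, action)
    return (x + 0.05, y, th)

state = (1.0, 2.0, 0.3)
a = (0.5, 0.0, math.pi / 6)
b = (0.2, -0.1, -math.pi / 4)
exact = consistency_residuals(exact_step, state, a, b)
drift = consistency_residuals(drifting_step, state, a, b)
```

In this toy setting all three residuals vanish for `exact_step`, while the biased transition violates each condition. A latent-space regularizer in the spirit of the abstract would penalize the analogous residuals between latent states.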

Abstract:
CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream ‑‑ stdout, errors, files, logs, and traces ‑‑ records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO‑style training updates action tokens with sparse outcome‑level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy‑gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross‑entropy Hybrid Objective), a hybrid objective that combines the standard policy‑gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench‑2.0: Qwen3‑8B improves from 2.70% to 5.17%, and Qwen3‑14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held‑out rollouts, it sharply reduces environment‑token cross‑entropy while GRPO alone barely changes it. From base Qwen3‑8B, ECHO matches expert‑SFT‑then‑GRPO performance on held‑out terminal tasks without expert demonstrations, and recovers roughly half of the expert‑SFT initialization benefit on TerminalBench‑2.0. In some settings, the environment prediction loss alone enables verifier‑free self‑improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on‑policy supervision signal already present in every rollout.
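
A minimal sketch of such a hybrid objective for a single rollout, assuming per-token log-probabilities from one forward pass and a mask separating action tokens from environment tokens. The weighting `env_weight`, the function name, and the simplified policy-gradient term (no clipping, KL term, or group normalization) are assumptions for illustration; the abstract does not specify them.

```python
def echo_style_loss(token_logprobs, is_action, advantage, env_weight=0.1):
    """Policy-gradient loss on action tokens plus cross-entropy on environment tokens.

    token_logprobs: log-probability the policy assigns to each token in the rollout.
    is_action:      True for tokens emitted by the policy, False for terminal output.
    advantage:      scalar outcome-level advantage for this rollout.
    env_weight:     hypothetical coefficient on the auxiliary term.
    """
    action_lp = [lp for lp, act in zip(token_logprobs, is_action) if act]
    env_lp = [lp for lp, act in zip(token_logprobs, is_action) if not act]
    pg_loss = -advantage * sum(action_lp) / len(action_lp) if action_lp else 0.0
    env_ce = -sum(env_lp) / len(env_lp) if env_lp else 0.0
    return pg_loss + env_weight * env_ce, pg_loss, env_ce

logprobs = [-1.0, -2.0, -0.5, -1.5]   # toy values
mask = [True, True, False, False]     # two action tokens, two environment tokens

# A rollout with zero advantage contributes no policy-gradient term,
# but the environment-prediction term is still non-zero.
zero_adv = echo_style_loss(logprobs, mask, advantage=0.0)
pos_adv = echo_style_loss(logprobs, mask, advantage=1.0)
```

The zero-advantage case shows why the auxiliary term gives dense supervision: the environment cross-entropy is computed on every rollout regardless of the outcome reward.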

Abstract:
Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game‑specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference‑time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post‑training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post‑training pipeline combining Supervised Fine‑Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5‑3B‑Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution‑level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5‑3B‑Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.

Abstract:
Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. Recent interpretability work identified a Physics Emergence Zone (PEZ), a group of middle transformer layers in VideoMAE where physical plausibility is represented separately from other visual features. However, it remained unclear whether this structure could be used to directly control the model's physics reasoning. We present physics steering, a training‑free method that uses the weight vector of a linear probe at a PEZ layer as a Concept Activation Vector (CAV) and injects it into hidden states during inference. This shifts the model's physical expectations without changing any model weights. On the IntPhys benchmark, this intervention reliably shifts the model's plausibility judgment in either direction, depending on the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, suggesting that the relevant physics representation is localized there. We further find that physics is encoded separately from motion direction, and that different intuitive physics principles occupy distinct directions within this representation space. Together, these results show that physical reasoning in VideoMAE is not only readable, but also directly steerable.
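
The injection step can be sketched as adding a scaled probe direction to the hidden states at one layer. The abstract states only that the probe's weight vector is used as a CAV and injected into hidden states; normalizing the vector, adding it to every token, mean-pooling before the probe, and the names `steer` and `probe_score` are assumptions made for this toy example.

```python
import math

def steer(hidden, probe_w, alpha):
    """Add alpha times the unit-norm probe direction to every token's hidden state."""
    norm = math.sqrt(sum(w * w for w in probe_w))
    unit = [w / norm for w in probe_w]
    return [[h + alpha * u for h, u in zip(token, unit)] for token in hidden]

def probe_score(hidden, probe_w, bias=0.0):
    """Linear probe applied to the mean-pooled hidden state."""
    n = len(hidden)
    pooled = [sum(token[i] for token in hidden) / n for i in range(len(probe_w))]
    return sum(p * w for p, w in zip(pooled, probe_w)) + bias

hidden = [[0.0, 0.0], [1.0, 1.0]]   # two tokens, two features (toy values)
probe_w = [3.0, 4.0]                # hypothetical probe weights with norm 5

base = probe_score(hidden, probe_w)
up = probe_score(steer(hidden, probe_w, 2.0), probe_w)
down = probe_score(steer(hidden, probe_w, -2.0), probe_w)
```

Under these assumptions the probe score moves by `alpha` times the norm of the probe weights, so the sign of `alpha` sets the direction of the shift while the model weights stay untouched.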

Abstract:
Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non‑trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero‑shot without task‑specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real‑world scenarios, achieving state‑of‑the‑art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task‑specific systems, taking a step towards a general‑purpose visual world model.

Abstract:
Radio Environment Maps (REMs) have the potential to serve as an important enabler for intelligent modeling and control in emerging AI‑native 6G networks. Despite significant progress, most REM construction methods remain passive, relying on interpolation or static uncertainty models and lacking an explicit mechanism to reason about how future measurements will affect reconstruction quality under a limited measurement budget. In this paper, we formulate REM construction as a sequential decision‑making problem and propose a world‑model‑inspired framework for active Received Signal Strength Indicator (RSSI) map reconstruction. By learning an internal representation of the radio environment and employing a dreaming mechanism to simulate the impact of candidate measurements, the proposed approach actively selects measurement locations under a limited budget. Experimental results on real indoor RSSI data demonstrate that the proposed method significantly outperforms Gaussian Process‑based interpolation in the few‑shot regime, achieving up to a fivefold reduction in Root Mean Square Error (RMSE) with the same number of measurements. These results highlight the potential of world models as a powerful paradigm for sample‑efficient radio environment mapping and intelligent model‑based sensing in 6G and beyond networks.

Abstract:
Radiologist eye‑tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial‑completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state‑of‑the‑art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM‑ACR Pneumothorax, as well as the highest zero‑shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose‑built LogitGaze‑Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

Abstract:
Laboratory workflows in pharmaceutical and biomedical research encode substantial tacit knowledge ‑‑ expert judgment about failure conditions, decision branching logic, and contextual dependencies ‑‑ that remains inaccessible to protocol documents, sensor streams, and existing biomedical ontologies. We present a repeatable structured expert elicitation methodology and federated Semantic Knowledge Graph (SKG) architecture for capturing and querying this knowledge, demonstrated through deployment at the Biochemical and Cellular Pharmacology Department of Genentech. Knowledge is elicited via the Protocol Intelligence Co‑pilot, a purpose‑built AI interview agent that applies structured elicitation lenses to surface tacit procedural knowledge with expert‑assigned confidence scores, producing graph representations across three tiers: program‑level decision milestones, assay protocol knowledge, and physical execution infrastructure. Separately constructed subgraphs, exemplified by immunoassay (ELISA), quantitative mass spectrometry (LC‑MS/PRM), and laboratory automation, are aligned through a shared upper ontology and queried as a single federated graph. Evaluation demonstrates seven query types structurally unavailable from any individual data source, including a cross‑subgraph traversal that identifies automation‑masked silent failures ‑‑ conditions where execution logs report success while scientific validity is compromised. Critically, the MASKED_BY graph relationship encodes a class of laboratory risk invisible to current informatics platforms ‑‑ the structural gap that prevents existing systems from reasoning about scientific validity. This architecture provides the semantic world model that AI laboratory agents currently lack: a queryable representation of where workflows fail silently, where human judgment is irreplaceable, and which execution assets mask rather than detect failure.

Abstract:
Data‑driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non‑rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real‑world settings. This reliance on synthetic data can limit their applicability when the sim‑to‑real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real‑world videos. Specifically, we propose to learn a particle‑based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real‑world videos without requiring particle‑level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real‑world dataset consisting of about 500 videos capturing diverse object interactions.

Abstract:
Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention‑based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high‑fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open‑source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, in particular, viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently across changes in multiple conditions. The dataset and code are available at our project page.

Abstract:
Despite the growing use of world models as decision‑making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed‑loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack‑search framework for adversarial evaluation of world‑model agents. WMAttack formulates robustness evaluation as a finite‑budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self‑Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation‑Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation‑similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite‑budget search when it shifts probability mass toward high‑utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

Abstract:
Model‑based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient‑penalized latent dynamics regularizer for DreamerV3 that applies a row‑wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous‑latent analog of finite‑difference smoothing of transition laws in discrete embedded‑state MDPs, and estimate it efficiently using Hutchinson‑style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher‑complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high‑return behavior earlier and exhibits more consistent late‑stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld‑mbrl.
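
The Hutchinson-style estimate mentioned above rests on the identity E_v ||J v||^2 = ||J||_F^2 for probe vectors v with identity covariance. The sketch below illustrates that identity only: it estimates the full squared Frobenius norm of a toy transition's Jacobian, uses central finite differences in place of automatic differentiation, and does not reproduce the row-wise penalty or its application to the posterior latent distribution, none of which the abstract specifies. The names `jvp_fd`, `jacobian_penalty`, and `linear` are hypothetical.

```python
import random

def jvp_fd(f, x, v, eps=1e-4):
    """Central finite-difference approximation of the Jacobian-vector product J(x) v."""
    plus = f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def jacobian_penalty(f, x, n_probes=8, rng=None, probes=None):
    """Hutchinson-style estimate of ||J(x)||_F^2 as the mean of ||J(x) v||^2 over probes v."""
    rng = rng or random.Random(0)
    if probes is None:
        # Rademacher probes: each entry is +1 or -1 with equal probability.
        probes = [[rng.choice((-1.0, 1.0)) for _ in x] for _ in range(n_probes)]
    total = 0.0
    for v in probes:
        jv = jvp_fd(f, x, v)
        total += sum(c * c for c in jv)
    return total / len(probes)

def linear(x):
    """Toy transition f(x) = A x with A = [[1, 2], [3, 4]], so ||A||_F^2 = 30."""
    return [1.0 * x[0] + 2.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]

# Averaging over all four sign vectors in 2D recovers the Frobenius norm exactly.
all_signs = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
exact = jacobian_penalty(linear, [0.5, -0.2], probes=all_signs)

# With random probes the estimate is unbiased but noisy.
estimate = jacobian_penalty(linear, [0.5, -0.2], n_probes=64, rng=random.Random(0))
```

The appeal of the estimator is cost: each probe needs one Jacobian-vector product rather than the full Jacobian, which matters when the latent dimension is large.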

Abstract:
Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point‑level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM‑4D, a geometry‑grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single‑stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence‑consistent video rollouts into executable robot trajectories, enabling direct deployment in both real‑world and simulated manipulation. GEM‑4D achieves state‑of‑the‑art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real‑world manipulation success from 61% to 81%. Additional results are available at the project page: https://anonymous‑submission‑20.github.io/gem.github.io/.

Abstract:
Cloud‑hosted LLM driver agents provide useful semantic judgments, but their inference latency exceeds stepwise vehicle‑control windows. Learned world models predict futures, but they usually keep future generation and action selection inside large coupled loops. We present SteinsGateDrive, a latency‑decoupled planner‑runtime architecture in which the worldline metaphor from the eponymous story names one plausible consequence of an intervention: the LLM selects counterfactual driving futures before the final control instant, and a runtime reuses the selected forecast only while safety contracts remain valid. The generator builds three world‑line roles: alpha nominal ego‑conditioned futures, beta interaction counterfactuals around nearby vehicles, and gamma hazard‑stress futures such as braking, cut‑ins, or blocked corridors. The selected branch becomes a typed StrategicForecast with horizon, validity/abort conditions, fallback, and authority. On a within‑subject, matched‑seed normal‑highway protocol with 10 seeds and 20 steps, GPT‑5.4 mini reduces effective lag from +3.07 s at 1‑second horizon to ‑0.01 s at 4‑second horizon while preserving the measured no‑collision safety boundary. The architecture's safety contribution comes from the atom‑predicate runtime check, not from the drift score, which functions as a refresh‑frequency knob.

Abstract:
While large vision‑language‑action (VLA) models and generative world models (WM) have advanced long‑horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning‑based action generation. Low‑quality actions may cause physical failures during execution or lead to misleading world‑model rollouts with redundant rendering costs. To address this issue, we propose Pre‑VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world‑model imagination. Pre‑VLA leverages an efficient multimodal backbone with modality‑aware pooling and a lightweight dual‑branch head to predict both safety confidence and critic‑derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre‑VLA with a multi‑task objective combining Focal classification, advantage regression, and soft‑threshold calibration. During deployment, a dual‑mode preemptive resampling scheduler filters low‑quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre‑VLA improves the average closed‑loop success rate across four suites from 30.79% to 37.62% over RynnVLA‑002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world‑model rollouts.
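
For the classification term, the standard binary focal loss is one way to handle the class imbalance mentioned above. The sketch below shows that standard form only; the values of `gamma` and `alpha` are common defaults rather than Pre-VLA's settings, and the advantage-regression and soft-threshold calibration terms are not reproduced because the abstract does not define them.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted positive-class probability p and label y in {0, 1}.

    p must lie strictly between 0 and 1; callers should clamp it beforehand.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: dominates the loss
plain = focal_loss(0.9, 1, gamma=0.0, alpha=0.5)  # reduces to half the cross-entropy
```

Setting `gamma` to zero recovers weighted cross-entropy, which makes the role of the modulating factor `(1 - p_t) ** gamma` easy to check.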

Abstract:
Latent world models can contain the state needed for control, yet their terminal‑cost interface can expose the planner to the wrong decision‑relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability‑relevant variables correctly. We propose trajectory reachability metrics (TRM), a post‑hoc terminal‑ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon‑aware supervision: the metric is trained on broad, balanced temporal separations to match the long‑horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full‑horizon TRM reaches 97.0%; shuffled temporal‑label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short‑horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY‑probe rowspace accounts for less than 1% of terminal‑goal latent MSE but carries most candidate‑quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM‑style task‑state metrics improve SCSA ranking and selected final distance more cleanly than closed‑loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner‑facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.

Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising framework for end‑to‑end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel‑level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high‑level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future‑aware reasoning. We further design a two‑stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed‑loop driving performance, outperforming both action supervised methods and image‑reconstruction‑based world model approaches.

Abstract:
Long‑horizon clinical simulation ‑‑ predicting how a patient's physiology evolves over years under specified interventions ‑‑ is central to chronic‑disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general‑purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action‑conditioned latent world‑model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint‑embedding state encoder with a wide action encoder that admits both structured intervention indicators and free‑text communication embeddings, and trains a recurrent latent transition module under a six‑term objective: next‑observation supervision, next‑latent prediction, SIGReg latent regularisation, and three physiology‑aware shape priors (slope, continuity, large‑jump penalty). A closed‑loop rollout‑prefix protocol matches training to deployment, so the model is optimised against the same multi‑step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232‑patient nephrology cohort, the CKD instantiation achieves a dynamic‑50% history rollout test mean absolute error (MAE) of 7.384 and root‑mean‑square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT‑5.5 structured‑prompting baseline (‑7.28% MAE, ‑7.35% RMSE), with the gain dominated by the dialogue portion of patient‑‑health‑coach communication. The framework is not CKD‑specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.

Abstract:
World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one‑off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable‑worldmodel (swm), an open‑source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high‑performance Lance‑based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well‑tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in‑silico evaluation of dynamics understanding, control performance, representation quality, and out‑of‑distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

Abstract:
Current end‑to‑end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto‑regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual‑reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language‑aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state‑of‑the‑art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user‑defined language instructions.

Abstract:
Reliable confidence estimates are important for safely deploying vision‑based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test‑time distribution shifts. We identify a critical perception‑dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly‑Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control‑stream statistics. Based on these fused scores, a lightweight temperature‑scaling calibrator leverages test‑time augmentation to selectively reduce overconfidence under shift while preserving nominal‑condition performance. Experiments on a physical DonkeyCar under four real‑world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
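
A minimal sketch of anomaly-conditioned temperature scaling for a binary safety predictor. The abstract does not give the fusion rule or the mapping from anomaly score to temperature, and it also uses test-time augmentation, which is omitted here. Taking the maximum of the two scores, the linear map with `gain`, and the function name `calibrated_confidence` are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibrated_confidence(logit, perceptual_score, dynamics_score, gain=2.0):
    """Soften a safety logit with a temperature that grows with the fused anomaly score."""
    fused = max(perceptual_score, dynamics_score)   # hypothetical fusion rule
    temperature = 1.0 + gain * max(0.0, fused)      # T >= 1, so confidence is only reduced
    return sigmoid(logit / temperature), temperature

# Nominal conditions: both scores are zero, so the prediction is left unchanged.
nominal, t_nominal = calibrated_confidence(3.0, 0.0, 0.0)

# Dynamics-only anomaly: the image looks normal, but the dynamics score still raises T.
shifted, t_shifted = calibrated_confidence(3.0, 0.0, 1.0)
```

The second call illustrates the perception-dynamics gap described above: a calibrator driven by reconstruction error alone would leave the overconfident prediction untouched in this case.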

Abstract:
Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment‑specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action‑centric view requires shared action spaces, heuristic retargeting, or large‑scale multi‑embodiment co‑training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo‑JEPA, a cross‑embodiment imitation framework that decouples demonstration intent from embodiment‑specific execution. Built on a JEPA‑based world model, Demo‑JEPA translates source visual demonstrations into target‑compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo‑JEPA avoids action‑level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real‑world manipulation tasks show that Demo‑JEPA matches specialized in‑domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

Abstract:
Vision‑language‑action (VLA) policies have advanced language‑conditioned robotic manipulation by transferring semantic priors from pretrained vision‑language models to action generation. However, standard action‑imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose GaussianDream, a feed‑forward 3D Gaussian world‑model plug‑in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current‑frame 3D spatial structure and short‑horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene‑flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test‑time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state‑of‑the‑art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human‑50, and 50.0% on real‑robot tasks. Compared with existing 3D‑enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video‑based world‑model approaches.

Abstract:
World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction‑agnostic scene regularities and the ego captures robot‑centric instruction‑conditioned dynamics. This world‑ego entanglement leads to a degradation in long‑horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World‑Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world‑ego boundary from three perspectives, i.e., motion‑, semantic‑, and intention‑based views, and analyze three disentanglement strategies with post‑, pre‑, and full disentanglement. Further, we instantiate this paradigm as the World‑Ego Model (WEM), a unified embodied world model that couples an implicit separate world‑ego planner with a cascade‑parallel mixture‑of‑experts (CP‑MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long‑horizon world modeling with hybrid navigation‑manipulation tasks, providing 125K video clips (over 4.5M frames) with fine‑grained action annotations and 300 multi‑turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state‑of‑the‑art performance on HTEWorld while remaining competitive on existing manipulation‑only benchmarks.

Abstract:
Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual‑text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5‑Omni‑based model equipped with an Emotion World Module (EWM), an action‑free representation‑level module for short‑horizon latent affective prediction. EWM contains three modules: 1) Cross‑Modal Temporal Imagination predicts future video/audio representations from past tokens with multi‑step rollout. 2) MAMA (Modality‑Aware Multi‑step Attention) Belief Aggregation compresses imagined tokens into modality‑aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past‑conditioned self‑supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross‑modal rollout, and belief aggregation. These results suggest predictive belief‑state modeling is a practical alternative for affective computing.

Abstract:
End‑to‑end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single‑domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain‑specific retraining. This gap highlights a key challenge in multi‑domain learning: domain‑specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory‑driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain‑invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain‑induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end‑to‑end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real‑world deployment. We will make our code publicly available.

Abstract:
In the field of Vision‑Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real‑world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages a large language model (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high‑fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large‑scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next‑generation embodied navigation models.

Abstract:
This paper systematically diagnoses the training failure modes of Token‑Choice sparse Mixture‑of‑Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non‑zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one‑third of layers degenerate into a single‑expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross‑attention routers exhibit preliminary self‑recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U‑shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate‑shared expert‑routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense‑to‑MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token‑Choice paradigm and outline a three‑step evolutionary roadmap from visual unification to a world model.
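
The bfloat16 failure mode in item (5) can be reproduced without any hardware. The sketch below emulates bfloat16 storage in pure Python and shows a small update vanishing when it is added to a weight near 1.0. It is an assumption‑laden illustration: the helper names are hypothetical, round‑to‑nearest‑even is used rather than any particular hardware's behavior, and keeping a higher‑precision master copy is a common mitigation that may or may not match the paper's own solution.

```python
import struct


def to_bfloat16(x):
    """Round a finite float to the nearest bfloat16 value (returned as a float).

    bfloat16 keeps the top 16 bits of an IEEE float32, so only 8 bits of
    significand precision survive. NaN and infinity are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train_pure_bf16(weight, update, steps):
    """Apply the update directly to a weight that is stored in bfloat16."""
    for _ in range(steps):
        weight = to_bfloat16(weight + update)
    return weight


def train_with_master_copy(weight, update, steps):
    """Accumulate in a higher-precision master copy, cast to bfloat16 at the end.

    The Python float stands in for an fp32 master weight (assumption).
    """
    master = weight
    for _ in range(steps):
        master += update
    return to_bfloat16(master)


stuck = train_pure_bf16(1.0, 1e-3, steps=10)
moved = train_with_master_copy(1.0, 1e-3, steps=10)
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7 = 0.0078125, so an update of 0.001 is rounded away on every step, while ten accumulated updates are large enough to register once the master copy is cast down.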

Abstract:
Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval‑augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference‑world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi‑synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source‑conflict policy, and disentangling hallucinations into fine‑grained error categories. We evaluate frontier and open‑weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near‑solved for frontier models, while multi‑step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

Abstract:
Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task‑relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task‑level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX‑Kontext under the same robotic data setting, and find that image editing produces more reliable task‑level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one‑shot sparse visual planning framework that progressively generates a sequence of task‑relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow‑based spatial guidance. A goal‑conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed‑training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

Abstract:
World simulators can provide safe and scalable environments for training Physical AI systems before real‑world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two‑stage post‑training. In the first stage, we improve video‑to‑video continuation with flow matching fine‑tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video‑quality benchmarks and a dedicated physical‑faithfulness benchmark with per‑law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state‑of‑the‑art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical‑faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post‑training large video generation models with continuation and physics‑preference signals can make them more effective world simulators for Physical AI.
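
The second stage relies on Direct Preference Optimization. As a hedged reference, the sketch below implements the standard per‑pair DPO loss from summed log‑probabilities. The argument names, the default `beta`, and the toy numbers are assumptions, and nothing here reflects how PhyWorld scores or constructs its physics preference pairs.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(logp_win - ref_logp_win)
                                - (logp_lose - ref_logp_lose)])
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy pair: the policy has shifted probability toward the physically
# plausible ("win") sample relative to the frozen reference model.
loss_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss equals log 2 when the policy and reference agree, and falls as the policy raises the preferred sample's likelihood relative to the rejected one.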

Abstract:
Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8‑layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked‑single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end‑to‑end algorithmic account of how a transformer solves a combinatorial reasoning task.

Abstract:
World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision‑making in reinforcement learning. Yet, existing architectures face a fundamental memory trade‑off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state‑space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade‑off, we suggest decoupling future‑past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion‑based framework that integrates heterogeneous memory models through a contrastive product‑of‑experts formulation. Our approach instantiates three complementary roles: a short‑term memory expert that captures fine local dynamics, a long‑term memory expert that stores episodic history in external diffusion weights via lightweight test‑time finetuning, and a spatial long‑term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real‑world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory‑augmented diffusion world models.

Abstract:
Modern action‑conditioned video world models achieve strong short‑horizon visual realism, yet remain unreliable on rare, interaction‑critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under‑samples these high‑impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL‑constrained adversarial curriculum in which a policy is trained to expose high‑error trajectories of a diffusion‑based world model while remaining close to the behavior distribution. The world model is continuously fine‑tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near‑distribution training signal without drifting into out‑of‑distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re‑ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach, PROWL, in the MineRL framework and evaluate it on held‑out out‑of‑distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward‑hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world‑model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

Abstract:
Data assimilation (DA) addresses the problem of sequentially estimating the state of a dynamical system from noisy and incomplete observations. In this work, we employ a diffusion model as a world model to simulate and predict the system's dynamics. Recently, score‑based diffusion models have learned global diffusion priors that effectively model (stochastic) dynamics, revealing strong potential for data assimilation. In this paper, we investigate how information from noisy observations can be incorporated to enable continuous correction and refinement of the predicted system state when using a diffusion prior. Motivated by particle filtering methods, we represent the posterior distribution using a set of particles. After receiving noisy observations, the diffusion model is guided using the observation likelihood to steer the generation process toward observation‑consistent states. Nevertheless, such guidance does not guarantee sampling from the true posterior. We therefore employ a Sequential Monte Carlo approach over the diffusion trajectory, viewed as a path measure, to reweight and resample particles, thereby correcting the generation process and ensuring convergence toward the desired posterior distribution. This leads to an unbiased particle filtering method that rigorously fuses observational data with diffusion model simulations.
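
The reweight‑and‑resample step that particle methods build on can be sketched on a one‑dimensional toy problem. This is a plain bootstrap‑style assimilation step under stated assumptions: the prior particles are drawn from a standard normal that stands in for diffusion‑model samples, the observation noise is Gaussian, and all names are hypothetical. It does not implement Sequential Monte Carlo over the diffusion trajectory as described in the abstract.

```python
import math
import random


def gaussian_log_likelihood(observation, state, obs_std):
    """Log-likelihood of the observation given a state, up to a constant."""
    return -0.5 * ((observation - state) / obs_std) ** 2


def assimilate(particles, observation, obs_std, rng):
    """Reweight particles by the observation likelihood, then resample."""
    log_weights = [gaussian_log_likelihood(observation, x, obs_std)
                   for x in particles]
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Multinomial resampling: draw N particles with probability equal to weight.
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)
# Stand-in prior: samples a generative world model might produce (assumption).
prior = [rng.gauss(0.0, 1.0) for _ in range(5000)]
posterior = assimilate(prior, observation=2.0, obs_std=0.5, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

For this conjugate Gaussian toy case the exact posterior mean is 2.0 × 1 / (1 + 0.25) = 1.6, so the particle estimate can be checked against a known answer.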

Abstract:
Inspired by the emergent behaviors in large language models that generalize human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real‑world objects by learning directly from point clouds or RGB‑D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Conveniently, its fully differentiable structure enables future integration with policy learning and neural dynamics.

Abstract:
The ability to navigate and interact with complex environments is central to real‑world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory‑driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo‑Cortex, a self‑evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection‑adaptation loop. By abstracting success patterns and failure pitfalls into natural‑language heuristics, Robo‑Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual‑Grain Cognitive Memory system, comprising a Short‑term Reflective Memory (SRM) for real‑time local progress analysis, and a Long‑term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision‑making, we introduce a multimodal Imagine‑then‑Verify loop, where a world model simulates potential outcomes and a VLM‑based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo‑Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real‑world robotic experiments further support the effectiveness of Robo‑Cortex in physical settings.

Abstract:
Modern interactive video world models have achieved impressive visual fidelity, yet lack fine‑grained multi‑entity control and cross‑entity, cross‑world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene‑level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per‑latent‑frame (0.25 s) natural‑language conditioning that supports simultaneous multi‑entity control and concept‑level cross‑entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame‑local text cross‑attention, and enable real‑time long‑horizon streaming through ODE‑initialized Self‑Forcing distillation with a RoPE‑decoupled sliding KV‑cache. We surpass the Action‑Index baseline on cross‑entity transfer (89% vs. 43%) and out‑of‑vocabulary prompts (90% vs. 0%), and our 2‑step student sustains 19.7 FPS at 480p with stable FVD over 2‑hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per‑entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation‑elden‑ring‑scenes, containing manually collected Elden Ring player‑boss combat clips with structured action‑oriented metadata. Larger‑scale Elden Ring and KOF data will be released with the full project.

Abstract:
Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights that simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large‑scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier‑free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x‑prediction in RAE latent space. By simply re‑parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state‑of‑the‑art gFID of 1.06 in just 80 epochs on ImageNet‑256. On FDr^k, RAEv2 achieves a state‑of‑the‑art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post‑training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text‑to‑image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.
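
Two of the definitions above are simple enough to write down directly. The sketch below shows the summed last‑k‑layer representation and the EP_FID@k metric as stated in the abstract. The function names and the toy values are hypothetical and are not measurements from the paper; real encoder features would be tensors rather than Python lists.

```python
def sum_last_k_layers(layer_outputs, k):
    """Element-wise sum of the last k per-layer feature vectors."""
    selected = layer_outputs[-k:]
    return [sum(values) for values in zip(*selected)]


def ep_fid_at_k(fid_by_epoch, k):
    """EP_FID@k: first epoch whose unguided gFID is <= k, or None if never."""
    for epoch in sorted(fid_by_epoch):
        if fid_by_epoch[epoch] <= k:
            return epoch
    return None


# Made-up numbers purely for illustration.
toy_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_curve = {10: 5.0, 20: 3.0, 30: 1.8, 40: 1.5}

representation = sum_last_k_layers(toy_layers, k=2)  # uses the last two layers
epochs_needed = ep_fid_at_k(toy_curve, k=2)          # first epoch with gFID <= 2
```

Setting `k=1` in `sum_last_k_layers` recovers the final‑layer representation, which is the special case the generalized formulation extends.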

Abstract:
World models built on recurrent state‑space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port‑Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action‑controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics‑aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy‑guided Actor‑Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower‑variance alignment between imagined and real rewards, all while reducing latent phase space volume by 4.18‑8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.

Abstract:
Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule‑sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world‑level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule‑sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule‑set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open‑endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross‑environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

Abstract:
This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed‑forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross‑view, cross‑temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high‑fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two‑stage training framework of bidirectional pretraining followed by causal fine‑tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high‑quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross‑frame consistency, and visual fidelity, providing a solid foundation for closed‑loop simulation, data synthesis, and end‑to‑end training in autonomous driving.

Abstract:
In video generation models, particularly world models, training large‑scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed‑mode datasets. Existing bucket‑based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self‑attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes AdaptiveLoad, an integrated optimization framework consisting of two core components: (1) A dual‑constraint adaptive load balancing system, which eliminates long‑sequence bottlenecks by simultaneously limiting memory consumption and computational load (B × S^p ≤ M_comp); (2) A fused LayerNorm‑Modulate CUDA kernel, which utilizes a D‑tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39% to 18.9%, improves peak VRAM utilization efficiency by 22.7%, and achieves an overall training throughput increase of 27.2%.
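The dual constraint can be illustrated with a minimal sketch of per-bucket batch sizing. The abstract only states the compute constraint B × S^p ≤ M_comp; the linear memory constraint B × S ≤ M_mem, the default p = 2, and all function and parameter names below are assumptions made for illustration, not AdaptiveLoad's actual implementation.

```python
def max_batch_size(seq_len: int, mem_budget: float, comp_budget: float, p: float = 2.0) -> int:
    """Largest batch size B that satisfies both budgets for one bucket.

    Assumed constraints (hypothetical form):
      memory:  B * seq_len      <= mem_budget
      compute: B * seq_len ** p <= comp_budget
    """
    by_memory = int(mem_budget // seq_len)
    by_compute = int(comp_budget // (seq_len ** p))
    # Always schedule at least one sample, even if a single sequence exceeds a budget.
    return max(1, min(by_memory, by_compute))


# Short sequences are limited by memory, long ones by the superlinear compute term.
for s in (256, 1024, 8192):
    print(s, max_batch_size(s, mem_budget=65536, comp_budget=2 ** 24))
```

Under an equal-token rule alone, the 1024-token bucket would receive 64 samples; the compute term cuts that to 16, which is the kind of long-sequence bottleneck the abstract describes.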

Abstract:
World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action‑conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision‑only prediction, offline embodied applications, and simulator‑based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision‑only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator‑only evaluation to a diverse suite of simulated and real‑world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross‑platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world‑arena.ai.

Abstract:
Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed‑step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous‑time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non‑autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state‑of‑the‑art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.

Abstract:
Electrocardiogram (ECG)‑based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action‑conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post‑intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty‑aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug‑response scenarios and real‑world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert‑informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention‑aware clinical decision support.

Abstract:
Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real‑time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end‑to‑end framework, RoboFlow4D directly predicts multi‑frame 3D flows from visual observations and textual instructions, providing explicit flow‑based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation‑planning‑execution closed loop. Through slow‑fast collaboration between flow prediction and action control, RoboFlow4D enables real‑time and resource‑efficient manipulation. Extensive experiments in both simulation and real‑world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow‑guided planning for embodied intelligence.

Abstract:
Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone‑embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large‑scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude‑aware dual world model framework for drone‑embodied tracking. AaDWorlds consists of an altitude‑aware perception module and dual world models that imagine future states under both high‑ and low‑altitude regimes. By combining pseudo altitude‑aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude‑mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed‑loop tracking performance across all evaluation metrics.

Abstract:
Joint‑Embedding Predictive Architectures (JEPA) are a promising framework for self‑supervised video representation learning, yet the behavior of auxiliary objectives in small‑scale Video‑JEPA training is not well characterized. We report a small‑scale empirical study of 18 auxiliary objective variants for Video‑JEPA across two pretraining regimes: single‑dataset (UCF‑101) and mixed‑dataset (UCF‑101 + Something‑Something V2 + ImageNet‑100). We evaluate frozen representations on three complementary benchmarks: Diving‑48 (fine‑grained motion), Something‑Something V2 (temporal reasoning), and ImageNet‑100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade‑offs: gains on one downstream capability often coincide with degradation on another. We then study FWM‑HW‑LD (Factorized World‑Model with Hard‑Region‑Weighted Latent Dynamics), a training‑time objective that separates the latent representation into appearance and dynamics subspaces and applies hard‑region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed‑dataset setting, FWM‑HW‑LD improves ImageNet‑100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving‑48. These results indicate that latent factorization is a useful direction for studying auxiliary‑objective trade‑offs in Video‑JEPA.

Abstract:
The combination of exponentially large action spaces, stochastic dynamics, and long‑horizon decision‑making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high‑level policy in a Semi‑Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model‑based hierarchical framework for sequential stochastic combinatorial decision‑making that directly addresses this issue. Our method combines a latent‑space tree‑search planner with an SMDP‑aware world model for variable‑duration decisions. A multi‑timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal‑conditioned budget policy jointly with the world model to support context‑aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.

Abstract:
Clinical decision‑making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment‑confounder feedback and irregular or informative observation. This Review focuses on intervention‑aware disease trajectory modeling in clinical AI: methods that estimate patient‑specific longitudinal disease evolution and assess trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data‑generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time‑varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point‑process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off‑policy robustness, and target‑trial validation. This synthesis advances benchmark prediction to decision‑grade clinical evidence, enabling treatment‑sensitive individualized futures, pre‑deployment policy stress‑testing, and safer closed‑loop learning health systems that adapt/abstain when evidence is insufficient.

Abstract:
Planning from raw visual input remains a significant challenge for current Vision‑Language Models (VLMs) when the complexity of the input exceeds their one‑step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well‑trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training‑free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, but at a cost: too many TWI operations significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

Abstract:
Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world‑model learning under prior misalignment, where an agent must induce state‑dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed‑loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class‑stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior‑misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule‑property labels with unrelated words. Experiments show that Alice substantially improves executable world‑model learning under prior misalignment, and ablations show that both class refinement and class‑aware exploration contribute.

Abstract:
Modern Vision‑Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld‑VLM, a VLM‑side distillation framework that transfers geometric structure from frozen camera‑conditioned video world models into VLMs. GeoWorld‑VLM fine‑tunes only the image encoder and multimodal projector, aligning post‑projector image features with intermediate world‑model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world‑model teacher converts static visual input into a synthetic multi‑view spatial signal. Training combines spatial answer supervision, teacher‑student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld‑VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld‑VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld‑VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world‑model‑guided visual alignment generalizes across model structures and spatial reasoning datasets.

Abstract:
AI‑native 6G visions increasingly invoke wireless foundation models, large multimodal models, and wireless world models as the natural endpoint of AI‑native networking, drawing an analogy to recent developments in large language models (LLMs). We argue that this analogy is structurally incomplete. The success of LLMs is based on a broad, reusable, and largely self‑contained tokenized data substrate, whereas the wireless domain lacks an equivalent data foundation. Unlike text, code, or images, wireless data such as CSI tensors, IQ samples, or scheduler logs are not self‑contained: their meaning is configuration‑dependent, simulator‑conditioned, task‑disaggregated, and weakly grounded in operational feedback, all structural bottlenecks that undermine current pre‑ and post‑training recipes. We therefore argue that monolithic models, including mixture‑of‑experts (MoE) and wireless world models, are not the most realistic near‑term path toward deployable AI‑native networks. Instead, emerging evidence points toward composable and agentic network architectures, where general reasoning models orchestrate specialized signal processing models, classical algorithms, digital twins, standards‑aware retrieval, and safety checks through explicit programmable interfaces.

Abstract:
Semantic communication has emerged as a promising paradigm for enabling goal‑oriented networking. However, most existing semantic communication solutions are tailored to one‑shot tasks and optimize instantaneous performance. Hence, they cannot be used to support closed‑loop dynamic systems with physical artificial intelligence (AI), in which the transmitted semantics affect not only the current inference outcome but also future control actions, state evolution, and ultimately long‑horizon task performance. To address this gap, this paper investigates goal‑oriented semantic communications for physical AI systems with closed‑loop sensing‑communication‑inference‑control. In particular, the problem of semantic communications is formulated as a long‑term return‑per‑bit maximization under wireless bit‑budget constraints while capturing both control efficiency and communication efficiency. To solve this problem, a novel causal information value (CIV) metric is introduced to evaluate the marginal contribution of each semantic token to the expected long‑term return by transmission interventions. Then, a world‑model‑enabled causal digital twin (WM‑CDT) framework is proposed to capture the dynamics of closed‑loop physical AI systems and enable counterfactual reasoning for long‑horizon imagined rollouts. Based on these imagined rollouts, an actor‑critic policy is trained for long‑horizon agent control with high data efficiency, while the semantic token selector is trained through CIV‑per‑bit evaluation. Extensive simulations on an AirSim‑Sionna‑based unmanned aerial vehicle (UAV) navigation simulator show that the proposed WM‑CDT framework achieves significant improvement in return‑per‑kbit and navigation success rate compared to existing reinforcement learning solutions.

Abstract:
Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment‑specific actuation. In this work, we propose SCAR, a joint inverse‑forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment‑ and environment‑specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment‑specific raw actions, yielding improved cross‑embodiment low‑data adaptation and cross‑task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.

Abstract:
Reinforcement learning (RL) allows vision‑language‑action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post‑training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO‑based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall‑clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor‑update compute is spent uniformly across the trajectory, including phases the policy already handles after pre‑training and supervised fine‑tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop‑in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success‑failure action variance, a rollout‑derived proxy for per‑phase gradient variance, and samples a fixed chunk budget with online‑updated phase‑level keep probabilities. We formalize per‑phase gradient variance as the quantity that determines where gradient computation is useful and show that success‑failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38× wall‑clock speedup, 4.8× faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.
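The chunk-selection step can be sketched as weighted sampling without replacement under a fixed budget. This is an illustrative sketch only: the abstract does not specify the sampler, so the Efraimidis-Spirakis key trick is a swapped-in standard technique, and the function name, the phase labels, and the keep probabilities are hypothetical.

```python
import random


def sample_chunk_mask(chunk_phases, keep_prob, budget, seed=0):
    """Pick `budget` chunks, favouring chunks whose phase has a high keep probability.

    chunk_phases: phase label of each chunk in one trajectory (hypothetical labels).
    keep_prob:    phase label -> keep probability in (0, 1].
    Returns a boolean mask; only chunks marked True would be backpropagated through.
    """
    rng = random.Random(seed)
    keys = []
    for idx, phase in enumerate(chunk_phases):
        weight = max(keep_prob[phase], 1e-8)
        u = 1.0 - rng.random()  # uniform in (0, 1]
        # Efraimidis-Spirakis: the largest u ** (1 / weight) keys form a weighted sample.
        keys.append((u ** (1.0 / weight), idx))
    keys.sort(reverse=True)
    kept = {idx for _, idx in keys[:budget]}
    return [i in kept for i in range(len(chunk_phases))]


phases = ["reach"] * 5 + ["grasp"] * 5
mask = sample_chunk_mask(phases, {"reach": 0.1, "grasp": 0.9}, budget=3)
print(mask)
```

Fixing the budget rather than thresholding each chunk independently keeps the cost of every actor update constant, which matches the abstract's description of a fixed chunk budget per trajectory.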

Abstract:
Model‑Based Reinforcement Learning yields sample efficiency via latent imagination, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model's manifold discovery outpaces the policy's sparse‑reward optimization. We propose Mind Dreamer (MD), a framework that instantiates Active Causal Intervention to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Expected Free Energy. Instead of initializing from historical data, it draws initial states from an adversarial generator, s_0 ∼ p_gen(·), creating non‑continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. We derive a Relay Value Function and a Relay Uncertainty Function to resolve the credit assignment paradox across these spatial ruptures. Treating synthesized anchors as interventional intermediary states, these potentials propagate pragmatic and epistemic value through Bellman‑style backups. Notably, we prove that uncertainty propagation across discontinuities necessitates a quadratic discount γ^2, establishing a formal epistemic horizon. Theoretically, MD approximates a variance‑minimizing importance sampler that expands the manifold's spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67× average speedup over DreamerV3 on DeepMind Control Suite, reaching 8.8× in sparse‑reward tasks.
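One way to see why a quadratic discount appears is the standard variance-scaling identity applied to a one-step backup. The sketch below assumes that uncertainty is measured as a variance and that the one-step reward is deterministic; it is a textbook observation offered as intuition, not the paper's proof, and the symbols G and U are introduced here for illustration.

```latex
% Assumptions: G denotes a return, U(s) = \mathrm{Var}[G \mid s], and r is deterministic.
\begin{align}
  G_t &= r_t + \gamma\, G_{t+1}, \\
  \mathrm{Var}[G_t] &= \mathrm{Var}[r_t + \gamma\, G_{t+1}] = \gamma^{2}\, \mathrm{Var}[G_{t+1}], \\
  U(s_t) &= \gamma^{2}\, \mathbb{E}\big[\, U(s_{t+1}) \,\big] \quad \text{(deterministic transitions)}.
\end{align}
```

Because γ^2 < γ for 0 < γ < 1, uncertainty decays faster than value along a backup chain, which is consistent with the abstract's notion of a finite epistemic horizon.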

Abstract:
We study event‑graph substrates: a class of world models that represent agent state as an append‑only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal‑ancestor traversal, and evaluate a 1,400‑line CLEVRER‑DSL interpreter atop a domain‑agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS‑DR symbolic oracle on all four per‑question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin‑EventLog, a 500‑specification Park‑canonical Smallville counterfactual benchmark on which the substrate exceeds Llama‑3.1‑8B with full context by 18.80 points joint accuracy.
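A toy version of the append-only log, the fork operation, and the causal-ancestor traversal can be sketched as follows. Everything here is hypothetical: the `EventLog` class, the `causedBy` predicate, and the truncate-then-append fork are simplifications for illustration, not the paper's runtime or its intervention vocabulary.

```python
from dataclasses import dataclass
from typing import Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class EventLog:
    """Immutable, append-only log of triples (plain strings stand in for RDF terms)."""
    triples: Tuple[Triple, ...] = ()

    def append(self, triple: Triple) -> "EventLog":
        return EventLog(self.triples + (triple,))

    def fork(self, upto: int, intervention: Triple) -> "EventLog":
        """Counterfactual branch: keep the first `upto` triples, then apply an intervention."""
        return EventLog(self.triples[:upto] + (intervention,))


def ancestors(log: EventLog, event: str) -> set:
    """Collect all causal ancestors of `event` by following `causedBy` edges."""
    parents = {}
    for subj, pred, obj in log.triples:
        if pred == "causedBy":
            parents.setdefault(subj, []).append(obj)
    seen, stack = set(), [event]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


log = EventLog().append(("e2", "causedBy", "e1")).append(("e3", "causedBy", "e2"))
branch = log.fork(1, ("e3", "causedBy", "e0"))
print(ancestors(log, "e3"), ancestors(branch, "e3"))
```

The same `ancestors` call answers an explanatory query on the factual log and a counterfactual query on the fork, and forking never mutates the original log.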

Abstract:
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
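The informal definition can be made concrete with a small check over a finite policy set. This sketch assumes each policy is summarized by a scalar expected return under the learned model and under the true environment; the paper's formal definition may differ, and the function name and the numbers are hypothetical.

```python
from itertools import permutations


def exploitable_pairs(model_return, true_return):
    """Ordered pairs (a, b) where the model strictly prefers a to b
    while the true environment strictly prefers b to a."""
    return [
        (a, b)
        for a, b in permutations(model_return, 2)
        if model_return[a] > model_return[b] and true_return[a] < true_return[b]
    ]


# Hypothetical expected returns for three policies.
model_return = {"pi1": 1.0, "pi2": 0.5, "pi3": 0.2}
true_return = {"pi1": 0.4, "pi2": 0.6, "pi3": 0.1}
print(exploitable_pairs(model_return, true_return))  # [('pi1', 'pi2')]
```

The model is exploitable on this policy set because planning in it would select pi1 over pi2, the reverse of the true ordering; an empty result would mean no strict preference reversal exists.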

Abstract:
Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal‑entorhinal (HPC‑MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high‑dimensional dynamics remain poorly understood. We propose a brain‑inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC‑MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity‑driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain‑inspired, self‑supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

Abstract:
Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade‑off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two‑stage training with pre‑trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade‑off via content‑structure disentanglement. Our key insight is that disentanglement and latent action learning are co‑evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high‑level action abstraction and high‑fidelity generation, advancing the frontier of self‑supervised world model learning.

Abstract:
World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose the feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open‑loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real‑time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent‑space observer and admits convergence guarantees under mild conditions. We further introduce action‑aware guidance to better translate corrected predictions into control by emphasizing action‑controllable components while suppressing irrelevant variations. Experiments on LIBERO‑Plus, Robomimic, and real‑world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out‑of‑distribution (OOD) success rate by 30%. These results show that incorporating real‑time feedback at inference time provides a simple yet powerful alternative to static world modeling.
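The observer interpretation can be illustrated with a scalar toy example in which a frozen model has a constant bias and a feedback state absorbs it online. The dynamics, the bias, the gain, and all names are assumptions chosen for illustration; the paper's feedback state and update rule are not specified in the abstract.

```python
def frozen_model(z, a):
    """Hypothetical learned dynamics; parameters are never updated at inference time."""
    return 0.9 * z + a


def true_dynamics(z, a):
    """Hypothetical environment: the model is missing a constant +0.5 drift."""
    return 0.9 * z + a + 0.5


def run_feedback_loop(steps=10, gain=0.5):
    z, correction, errors = 0.0, 0.0, []
    for _ in range(steps):
        a = 0.1
        predicted = frozen_model(z, a) + correction  # corrected prediction
        z_next = true_dynamics(z, a)                 # observed after acting
        error = z_next - predicted
        correction += gain * error                   # observer-style update of the feedback state
        errors.append(abs(error))
        z = z_next
    return errors


errors = run_feedback_loop()
print(errors[0], errors[-1])
```

With this gain the residual error halves at every step, so the prediction error shrinks without touching the model's parameters; in this toy setting any gain strictly between 0 and 2 converges.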

Abstract:
Self‑supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top‑1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched‑capacity frontier video foundation models, V‑JEPA 2.1, V‑JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine‑grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent‑prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine‑grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V‑JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine‑tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.

Abstract:
Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space ‑‑ and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in‑the‑wild exocentric data for egocentric world model training. We show that training whole‑body action‑conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in‑the‑wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented‑reality guidance.

Abstract:
Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics‑blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch‑based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction‑Aware JEPA (IA‑JEPA), which utilizes a self‑supervised motion‑centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA‑JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch‑masked baselines. Crucially, we demonstrate that IA‑JEPA breaks the "static bias" of standard self‑supervision by inducing a higher‑entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy (R^2=0.43). We show that this interaction bias generalizes to real‑world human actions (Something‑Something V2) and zero‑shot physical puzzles (PHYRE‑Lite). Our results provide a scalable, fully self‑supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

Abstract:
Physiological time series signals reflect complex, multi‑scale dynamical processes of the human body. Existing modeling studies focus on static tasks such as classification, event forecasting, or short‑horizon next‑step prediction, while long‑horizon signal‑level forecasting and the predictive nature of physiological signals remain underexplored. We introduce NormWear‑2, a world model that encodes both multivariate physiological signals and clinical intervention variables into a shared latent space and models their joint temporal evolution as a dynamical system. Our approach combines inference from prior pre‑trained knowledge (intuition) with instant non‑parametric latent state transition adaptation (insight), enabling coherent forecasting across multiple temporal scales, conditioned on heterogeneous clinical interventions. During the pretraining phase, we find that chaos‑theoretic balancing of dynamical regime diversity yields more robust representations, with a smaller balanced corpus outperforming one twice its size and capturing bifurcation regimes. We evaluate the world model performance across diverse real‑world physiological datasets spanning heterogeneous temporal resolutions and intervention regimes, covering daily life, point‑of‑care, and clinical settings, including fitness planning, hemodialysis, diabetes management, and surgical monitoring. These evaluation datasets comprise records from 8,026 subjects, spanning study durations from 3.2 hours for high‑resolution signal data to 2.3 years for longitudinal clinical biomarker tracking. NormWear‑2 achieves the best overall forecasting performance across time, frequency, and latent representation domains, with significant improvements over state‑of‑the‑art time series foundation models, while maintaining competitive downstream representation quality, providing a step toward general‑purpose world models for physiological signals.

Abstract:
Real‑time interactive video generation requires low‑latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk‑wise 4‑step regime by distilling bidirectional base models into few‑step AR students, but they remain limited by coarse response granularity and non‑negligible sampling latency. In this paper, we study a more aggressive setting: frame‑wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few‑step AR student as the key bottleneck: existing strategies are either target‑misaligned, incapable of few‑step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few‑step AR initialization. The core idea is that causal CD learns the same AR‑conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF‑ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4‑step chunk‑wise Causal Forcing under the frame‑wise 2‑step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first‑frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action‑conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu‑ml/Causal‑Forcing and https://github.com/shengshu‑ai/minWM .

Abstract:
Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object‑centric world models capture scene dynamics using object‑level representations, which can be used for downstream applications such as action planning. However, most object‑centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot‑MPC, an object‑centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot‑MPC leverages vision encoders to learn slot‑based representations, which encode individual objects in the scene, and uses these structured representations to learn an action‑conditioned object‑centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient‑based MPC to directly optimize actions, which is computationally more efficient than relying on gradient‑free, sampling‑based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot‑MPC improves both task performance and planning efficiency compared to non‑object‑centric world model baselines. In the considered offline setting with limited state‑action coverage, we find that gradient‑based MPC performs better than gradient‑free, sampling‑based MPC. Our results demonstrate that explicitly structured, object‑centric representations provide a strong inductive bias for controllable and generalizable decision‑making. Code and additional results are available at https://slot‑mpc.github.io.
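The idea of gradient‑based MPC through a differentiable dynamics model can be sketched on a toy problem. The example below is an illustration only, not Slot‑MPC: it swaps the learned slot‑based dynamics for a known linear model, and the function names, cost, step size, and regularization weight are all assumptions. The gradient of the plan cost with respect to every action is obtained by backpropagating through the rollout, then used for plain gradient descent on the action sequence.

```python
import numpy as np

def rollout(A, B, s0, actions):
    """Unroll the (assumed linear) dynamics s_{t+1} = A s_t + B a_t."""
    states = [s0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return states

def plan_cost(A, B, s0, actions, goal, reg):
    """Squared distance of the final state to the goal plus an action penalty."""
    s_final = rollout(A, B, s0, actions)[-1]
    return float(np.sum((s_final - goal) ** 2) + reg * np.sum(actions ** 2))

def plan_gradient(A, B, s0, actions, goal, reg):
    """Backpropagate the cost through the rollout (adjoint recursion)."""
    s_final = rollout(A, B, s0, actions)[-1]
    adjoint = 2.0 * (s_final - goal)          # d cost / d s_H
    grad = np.zeros_like(actions)
    for t in reversed(range(len(actions))):
        grad[t] = B.T @ adjoint + 2.0 * reg * actions[t]
        adjoint = A.T @ adjoint               # propagate to the previous state
    return grad

def gradient_mpc(A, B, s0, goal, horizon=10, iters=300, lr=1.0, reg=1e-4):
    """Optimize an action sequence by gradient descent on the plan cost."""
    actions = np.zeros((horizon, B.shape[1]))
    for _ in range(iters):
        actions -= lr * plan_gradient(A, B, s0, actions, goal, reg)
    return actions
```

A sampling‑based planner would instead evaluate many candidate action sequences per iteration; here each iteration costs one forward rollout and one backward pass. In a receding‑horizon loop, only the first action of the returned plan would be executed before replanning.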

Abstract:
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action‑conditioned patient dynamics. We introduce SepsisAgent, a world model‑augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid–vasopressor interventions, and follows a propose–simulate–refine workflow before committing to a prescription. We first show that world‑model access alone yields inconsistent LLM decision performance, motivating agent‑specific training. We then train SepsisAgent through a three‑stage curriculum: patient‑dynamics supervised fine‑tuning, propose–simulate–refine behavior cloning, and world‑model‑based agentic reinforcement learning. On MIMIC‑IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM‑based baselines in off‑policy value while achieving the best safety profile under guideline adherence and unsafe‑action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

Abstract:
Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception‑planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception‑free driving world models achieve impressive driving performance, their real‑world reasoning ability for planning is built solely on next‑frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high‑quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real‑world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state‑of‑the‑art (SOTA) performance of EponaV2 among perception‑free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.

Abstract:
World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video‑based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework that constructs physics‑based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, the visual review agent provides visual feedback, and the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video‑based models in physical accuracy, instruction fidelity, and visual quality, and that it could be applied to various scenarios including driving simulation and embodied robot tasks.

Abstract:
In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

Abstract:
Diffusion world models have recently become competitive for online model‑based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive, while the latest latent diffusion approach improves efficiency yet underperforms. The latter also relies on separately trained latents rather than the end‑to‑end world‑model objectives that have driven much of modern MBRL progress. In particular, JEPA‑style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion‑style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end‑to‑end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss with a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive‑compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with separately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM and achieves over 3× faster world‑model sampling and 2.5× faster training. JEDI also exhibits a markedly different task‑level performance profile from the pixel baseline, suggesting that end‑to‑end predictive latents change more than compute alone.

Abstract:
Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM‑based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural‑language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world‑model alignment rather than superficial coordination, we propose a framework for measuring world‑model alignment defined over per‑agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief‑sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world‑model alignment, and identify where current models fall on this spectrum.
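Two of the alignment questions above can be made concrete with a small sketch. The abstract does not give the metric definitions, so everything below is an assumption: world graphs are modelled as sets of `(subject, relation, object)` triples, information novelty as the fraction of a message's facts that the partner does not already hold, and observation convergence as the Jaccard overlap of the two graphs. The function names are hypothetical.

```python
def information_novelty(message_facts, partner_graph):
    """Fraction of facts in a message that are absent from the partner's graph."""
    facts = set(message_facts)
    if not facts:
        return 0.0
    return len(facts - set(partner_graph)) / len(facts)

def graph_overlap(graph_a, graph_b):
    """Jaccard overlap of two world graphs; 1.0 means identical beliefs."""
    a, b = set(graph_a), set(graph_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy world graphs for two partially observing agents (illustrative data).
agent_a = {("cup", "on", "table"), ("plate", "in", "sink")}
agent_b = {("cup", "on", "table"), ("book", "on", "shelf")}

# Agent A tells B about the cup (already known) and the plate (new to B).
message = [("cup", "on", "table"), ("plate", "in", "sink")]
novelty = information_novelty(message, agent_b)

# After B integrates the message, the private graphs overlap more.
before = graph_overlap(agent_a, agent_b)
after = graph_overlap(agent_a, agent_b | set(message))
```

Tracking `graph_overlap` after each dialogue turn gives one possible convergence curve, and a low `information_novelty` flags messages that repeat what the partner already knows.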

Abstract:
Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf‑like families of local causal predictive‑state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature‑atlas case studies ‑‑ ocean‑temperature impacts on marine populations, GLP‑1 weight‑loss evidence, and resveratrol/red‑wine health‑benefit claims ‑‑ illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded‑counterfactual case studies ‑‑ a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC‑derived figure data and model code, the canonical Sachs protein‑signaling study with single‑cell perturbation data, and a Nature singing‑mouse study with MAPseq projection matrices ‑‑ show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the

Abstract:
Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi‑view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling and complex multi‑body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scenes. By initializing the SfM‑derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high‑quality, synchronized camera rigs and exports training‑ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset‑generation framework enables reproducible and principled benchmarking of dynamic NVS methods.

Abstract:
Post‑training Vision‑Language‑Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real‑world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task‑specific data to fine‑tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero‑shot inference. We propose RAW‑Dream (Reinforcing VLAs in task‑Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW‑Dream utilizes a world model pre‑trained on diverse task‑free behaviors for predicting future rollouts, and an off‑the‑shelf Vision‑Language Model (VLM) for reward generation. Because both components are task‑agnostic, VLAs can be readily finetuned for any new task entirely within this zero‑shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual‑noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real‑world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task‑dependent data, offering a highly scalable roadmap for VLA adaptation.

Abstract:
When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non‑identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) θ‑level non‑identifiability, where conclusions diverge under the same world model W because inference settings differ; and (ii) W‑level non‑identifiability, where repeated use of an inference setting θ biases data exposure and update rules, causing the learned world model W itself to diverge. We introduce an inference profile θ= (R, E, S, D), consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation o and the same W. We further explain why disagreements tend to project onto a small number of bases ‑‑ abstract versus concrete, externalizability, and order versus freedom ‑‑ as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent‑state estimation, and regularization‑exploration trade‑offs, and illustrate the framework through a case study on AI regulation debates.

Abstract:
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant‑specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world‑models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning‑focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment‑shift evaluation to show that offline‑trained world models can perform well in‑distribution but degrade as dynamics change, whereas discovery‑based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
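The contrast between internalized and discovered dynamics can be sketched with a toy cascade predictor. This is an illustration under stated assumptions, not the paper's agents or CascadeBench: the rule format (`when`/`then` pairs), the record fields, and the function name are all hypothetical. The point is that the same predictor gives the right cascade after a rule change only if it reads the active configuration rather than a stale copy.

```python
def predict_cascade(record, change, rules, max_passes=10):
    """Apply a change, then fire configured rules until nothing else changes.

    Returns only the fields whose values differ from the original record.
    """
    state = dict(record)
    state.update(change)
    for _ in range(max_passes):
        fired = False
        for rule in rules:
            cond_field, cond_value = rule["when"]
            out_field, out_value = rule["then"]
            if state.get(cond_field) == cond_value and state.get(out_field) != out_value:
                state[out_field] = out_value
                fired = True
        if not fired:
            break
    return {k: v for k, v in state.items() if record.get(k) != v}

ticket = {"status": "open", "priority": "low", "notify": False}
update = {"status": "escalated"}

# Rules as seen in historical data (what an offline model would have absorbed).
stale_rules = [{"when": ("status", "escalated"), "then": ("priority", "high")}]

# Rules in the currently deployed instance, read at runtime.
active_rules = [
    {"when": ("status", "escalated"), "then": ("priority", "critical")},
    {"when": ("priority", "critical"), "then": ("notify", True)},
]

stale_prediction = predict_cascade(ticket, update, stale_rules)
grounded_prediction = predict_cascade(ticket, update, active_rules)
```

Here `stale_prediction` misses the notification side effect introduced by the new configuration, while `grounded_prediction` follows the two‑step cascade defined by the active rules.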

Abstract:
Vision‑Language‑Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation‑to‑action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet‑scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade‑offs, and identifies open challenges and future opportunities for this rapidly evolving field.

Abstract:
While world models learn compact representations of complex environments, they lack a physics‑grounded metric to assess the structural fidelity of their latent spaces. We identify the wavelet scaling exponent α as a critical diagnostic, proposing that optimal representations satisfy variance equipartition (α ≈ 1/2), mirroring Kolmogorov's inertial range. We establish α = 1/2 as a sharp transition boundary for the classical simulability of amplitude‑encoded quantum kernels. Using tensor‑network theory, we prove latents with α > 1/2 reside in an area‑law phase admitting efficient classical emulation, while α < 1/2 triggers a volume‑law phase where the Matrix Product State bond dimension χ grows exponentially with qubit count n. Analyzing pre‑trained VideoMAE latents reveals a dichotomy: spatial tokens approach the equipartition limit (α ≈ 0.423), but permutation‑invariant feature channels exhibit unstructured disorder (α ≈ -0.123). This forces real‑world latents deep into the volume‑law phase, providing a data‑driven necessary condition for simulation hardness. Finally, we apply Weingarten calculus to derive the exact variance of the scrambled transition probability under a 2‑design ensemble. We prove this variance scales strictly as Var[X] = Θ(d^-2). We confirm this numerically with a log‑log slope of -1.881 (R^2 = 0.999), identifying a formidable shot‑noise wall demanding a measurement budget of M = Ω(d^2) that constrains quantum machine learning scalability.

Abstract:
A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict‑then‑plan pipelines. We formalize this perspective as World‑Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World‑Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test‑time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long‑horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety‑related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world‑action generation is a principled path toward truly actionable world models.

Abstract:
Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel‑view synthesis or future‑frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi‑hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D‑Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D‑Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D‑Belief on 2D visual quality for scene memory and unobserved‑scene imagination, object‑ and scene‑level 3D imagination using our proposed 3D‑CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D‑Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state‑of‑the‑art methods.

Abstract:
World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine‑then‑Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade‑off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine‑grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end‑to‑end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio‑temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from predicted visual evolution. To enable adaptive coordination, a Process‑Adaptive Gating Mechanism is proposed to automatically determine the timing and location of switching between them. This allows the world model to drive the reactive expert to expand the exploration space and the predictive expert to perform precise interactions across different stages of a task. For evaluation, we construct three training‑unseen test environments across six real‑world robotic tasks, covering variations in background, position, and object semantics. Notably, HarmoWAM achieves strong zero‑shot generalization across these scenarios, significantly outperforming prior state‑of‑the‑art VLA models and WAMs by margins of 33% and 29%, respectively.

Abstract:
With the growing prevalence of always‑on hardware such as smart glasses, body cameras, and home security systems, life‑logging visual sensing is becoming inevitable, forming the backbone of persistent, always‑on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt‑driven tools to next‑generation AI systems that continuously perceive and react to the physical world. Although life‑logging video streams can substantially improve utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always‑on AI technologies. Existing privacy protections are either attack‑specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy‑utility trade‑off in life‑logging video streams is a foundational challenge for next‑generation AI systems that demands further investigation. We call for novel pipeline‑aware privacy‑preserving designs that jointly optimize utility and privacy for long‑horizon life‑logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.

Abstract:
Recent advances in vision‑language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long‑horizon and high‑risk interactions. Existing mobile world models provide either text‑based or image‑based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test‑time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world‑model data, then train world models across four modalities: delta text, full text, diffusion‑based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in‑distribution fidelity and provides effective multimodal supervision for data construction, while text‑based feedback is more robust for online out‑of‑distribution (OOD) execution. Second, world‑model‑generated trajectories can provide transferable interaction experience in the training process and improve agents' end‑to‑end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self‑reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post‑hoc verifiers.

Abstract:
Robotic imitation learning typically assumes access to optimal demonstrations, yet real‑world data collection often yields suboptimal, exploratory, or even failed trajectories. Discarding such data wastes valuable information about environment dynamics and failure modes, which can instead be leveraged to improve decision‑making. While 3D policies reduce reliance on high‑quality demonstrations through strong spatial generalization, they still require large‑scale data to achieve high task success. To address this, we propose DALI‑R, a Data‑Asymmetric Latent Imagination and Reranking framework for 3D robotic imitation learning from mixed‑quality trajectories. It learns a Latent World Model over 3D point clouds for imagined rollouts and a Task Completion Scorer that reranks candidate action chunks, improving decision‑making without additional high‑quality demonstrations. We instantiate DALI‑R with both diffusion and efficient flow‑matching policies and evaluate it on Adroit and MetaWorld benchmarks. Across the two evaluated 3D base policies, DALI‑R achieves an average 6.8% improvement in success rate while incurring less than 0.7× additional inference overhead.

Abstract:
Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network‑efficient streaming of a discrete world model state, where a stride‑16 VQ‑U‑Net tokenizer (codebook size 8,192) maps each 288x512 frame to an 18x32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed‑length coding. We consider a keyframe‑delta protocol under strict per‑message payload budgets and packet loss, and propose a fully online, label‑free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming‑drift threshold. The adaptive algorithm consistently improves the rate‑distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200‑byte budget) dynamic‑only embedding distortion drops from 0.0712 to 0.0661 (7.2%), and at 0.036 Mb/s (400‑byte budget) from 0.0427 to 0.0407 (4.8%). Under 10% delta packet loss at 200 bytes, dynamic‑only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next‑token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic‑position perplexity improves from 206.0 to 193.1 (6.3%), and at 0.036 Mb/s from 158.9 to 155.6 (2.1%). These results support discrete token‑state streaming as a practical systems layer for bandwidth‑aware synchronization and improved downstream token‑dynamics utility under vehicular networking constraints.
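
The frame-size figure and the prioritization idea above can be checked with a small sketch. This is only an illustration under stated assumptions, not the authors' protocol: the helper names (fixed_length_bytes, cosine_distance, select_delta_updates, needs_keyframe), the toy codebook, and the choice to measure Hamming drift as a fraction of mismatched tokens are all hypothetical. Only the codebook size, grid shape, and the 936 bytes/frame figure come from the abstract.

```python
import math

CODEBOOK_SIZE = 8192      # from the abstract
GRID_H, GRID_W = 18, 32   # 288x512 frame at stride 16 (from the abstract)


def fixed_length_bytes(num_tokens, codebook_size):
    """Bytes needed to send num_tokens token IDs with fixed-length coding."""
    bits_per_token = math.ceil(math.log2(codebook_size))
    return math.ceil(num_tokens * bits_per_token / 8)


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_delta_updates(prev_ids, curr_ids, codebook, max_updates):
    """Hypothetical prioritization: among changed grid positions, keep the
    max_updates positions whose old and new codebook embeddings differ most."""
    changed = [i for i, (p, c) in enumerate(zip(prev_ids, curr_ids)) if p != c]
    changed.sort(
        key=lambda i: cosine_distance(codebook[prev_ids[i]], codebook[curr_ids[i]]),
        reverse=True,
    )
    return changed[:max_updates]


def needs_keyframe(receiver_ids, curr_ids, drift_threshold):
    """Assumed trigger: send a keyframe when the fraction of mismatched
    tokens between receiver state and current state exceeds the threshold."""
    mismatches = sum(1 for r, c in zip(receiver_ids, curr_ids) if r != c)
    return mismatches / len(curr_ids) > drift_threshold


frame_bytes = fixed_length_bytes(GRID_H * GRID_W, CODEBOOK_SIZE)

# Toy 4-entry codebook with 2-D embeddings (illustrative values only).
toy_codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
picked = select_delta_updates([0, 0, 0, 0], [0, 2, 3, 1], toy_codebook, max_updates=2)
```

With a codebook of 8,192 entries each token needs 13 bits, so 576 tokens occupy 7,488 bits, which matches the 936 bytes/frame stated in the abstract.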

Abstract:
Detecting orbital anomalies, such as maneuvers, atmospheric decay, and attitude upsets, across the rapidly growing population of low‑Earth‑orbit (LEO) satellites is a prerequisite for collision avoidance, decay forecasting, and conjunction screening. The bottleneck is not modeling capacity but labels: there is no public ground‑truth corpus of orbital anomalies, manual review does not scale to approximately 10^4 active satellites, and pure rule‑based detectors trade recall for precision so aggressively that they are blind to most behavioral anomalies. We present a multi‑tier labeling cascade that composes three weak supervision sources of increasing fidelity: a fast physics rule set (rule_v1), an Interacting Multiple Model Unscented Kalman Filter (IMM‑UKF) bank, and a supplemental‑element calibration step (supGP), to produce labels at a scale unavailable from any single source. Applied to 232M Two‑Line Element (TLE) records spanning 60 years, the cascade yields 8.6M labeled sequences of length 50 (430M timesteps) over 11 features that include explicit time encoding and full mean‑element state. On overlapping satellites, IMM‑UKF surfaces 42.6x more anomalies than rule_v1 alone. We train a 6.5M‑parameter Transformer in two stages, achieving a maneuver recall of 55.4% and decay recall of 62.8% on a held‑out test set. An ablation on the time‑delta feature alone yields a 107% relative improvement in decay recall. We frame the resulting model as a high‑recall triage classifier whose role is to surface candidate events for downstream filtering, not to issue final attributions, and discuss the path toward a Neural‑ODE‑based orbital world model.

Abstract:
Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine‑tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks (including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour) that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open‑loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning‑heavy tasks such as jigsaw and 3D mental rotation.

Abstract:
While extremely powerful and versatile at various tasks, the thinking capabilities of large language models (LLMs) are often put under scrutiny as they sometimes fail to solve problems that humans can systematically solve. However, recent literature focuses on breaking LLM reasoning with increasingly complex problems, and whether an LLM is robust in simple logical reasoning remains underexplored. This paper proposes Absurd World, a benchmarking framework, to test LLMs against altered realism, where scenarios are logically coherent, and humans can easily solve the tasks. Absurd World breaks a real‑world model into symbols, actions, sequences, and events, which are automatically altered to create absurd worlds where the logic to solve the tasks remains the same. It evaluates a large collection of models with simple and advanced prompting techniques, and proves that it is an effective tool to determine LLMs' ability to think logically, ignoring the patterns learned from the real world. One can use this framework to extensively test an LLM against a real‑world problem to verify whether the LLM's reasoning capability is robust against variations of the task.

Abstract:
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi‑turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight‑space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi‑agent harness for ARC‑AGI‑3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25‑game ARC‑AGI‑3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol‑matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

Abstract:
Real‑world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi‑task pretraining, in which tasks are co‑available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co‑trainable tasks. Sparse Mixture‑of‑Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality‑level computation from task‑level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality‑specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed‑capacity MoE by compressing accumulated expert knowledge into low‑rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

Abstract:
OpenIIR runs hundreds of LLM‑driven personas as parameterised, reproducible IR research experiments. Researchers configure agents across four kinds of multi‑agent study (deliberative panels, social platforms, curated recommender feeds, and evolutionary co‑evolution between content producers and credibility detectors) under many priors, rounds, and constraints. Persona budgets, retrieval policies, ranker choices, intervention timings, and mutation rates are declared up front, and the same study can be re‑run under different settings to compare outcomes side by side. Every run produces structured outputs (argument graphs, exposure logs, fitness traces, transcripts) that a downstream evaluator can consume directly, and a new study is a plug‑in of 200 to 400 lines over a shared core (agent runtime, world‑model store, retrieval primitives, claim extractor, persona ontology). The contributions are: (i) the shared core; (ii) a type interface for pluggable scenarios; (iii) four released types with reference runs (Panel, Social‑Media, Curated‑Feed, Multi‑Generational); and (iv) six modular extensions sketched against open IR research questions.

Abstract:
The Model Context Protocol (MCP) has unified the interface between Large Language Models (LLMs) and external tools, yet a fundamental gap remains in how agents conceptualize the environments within which they operate. Current paradigms are bifurcated: task‑level planning often ignores execution‑time dynamics, while reactive execution lacks long‑horizon foresight. We present MCP‑Cosmos, a framework that infuses generative World Models (WM) into the MCP ecosystem to enable predictive task automation. By unifying three disparate technologies, namely MCP, World Model, and Agent, we demonstrate that a "Bring Your Own World Model" (BYOWM) strategy allows agents to simulate state transitions and refine plans in a latent space before execution. We conducted experiments using two strategies, namely ReAct and SPIRAL, with 2 planning models and 3 representative world models over 20+ MCP‑Bench tasks. We observed improvements in agents' environment‑interaction KPIs such as tool success rate and tool parameter accuracy. The framework also offers new metrics such as Execution Quality to generate new insights about the effectiveness of world models compared to the baseline.

Abstract:
A growing body of work pursues AI scientists capable of end‑to‑end autonomous scientific discovery. This position paper argues that although they already function as co‑scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post‑training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single‑turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI‑generated hypotheses, and application driven by scientific need rather than tool affordance.

Abstract:
Molecular optimization in drug discovery aims to discover molecules with improved target properties, but practical lead optimization often requires more than high predicted scores. A useful candidate should also be actionable: it should be reachable from known molecules through valid local structural transformations, so that it can be interpreted as a plausible revision within an evolving chemical series. Existing de novo and single‑molecule optimization methods do not explicitly model such reachability, especially when both the target molecules and the intermediate molecules connecting them to known compounds are unknown. In this work, we formulate actionable molecular optimization as sequential expansion of a molecule‑transfer graph, where nodes are molecules and edges encode valid local transformations. We propose MolWorld, a molecule world model‑guided framework that treats the current molecule‑transfer graph as an evolving search state. At each iteration, MolWorld selects local anchor contexts, generates candidate molecules conditioned on these contexts, evaluates their properties, and uses a learned world model to update the evolving molecule world by retaining admissible candidates and inserting them into the molecule‑transfer graph. The expanded molecule world then guides subsequent optimization. Experiments on property optimization and docking‑based tasks show that MolWorld discovers high‑property molecules while maintaining substantially stronger structural connectivity, supporting actionable and sequential molecular design.

Abstract:
Modern vision‑based world models can represent observations as compact yet expressive latent manifolds, but fast goal‑oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse‑dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal‑Conditioned Inverse Dynamics Model (GC‑IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact‑rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment‑protocol settings while reducing per‑decision cost by 100‑130x. A broader sweep over test‑time planners (CEM, MPPI, iCEM, and gradient‑based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test‑time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference.

Abstract:
Developing generalist systems that retain human‑like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert‑random‑normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.
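
The abstract does not define the "expert‑random‑normalized score", so the following is a sketch under an assumption: that it follows the usual normalization pattern (agent minus random, divided by expert minus random), aggregated by the median over environments. The function names and all numbers below are hypothetical and are not results from the paper.

```python
from statistics import median


def normalized_score(agent, random_score, expert):
    """Assumed definition: 0.0 matches the random policy, 1.0 matches the expert."""
    return (agent - random_score) / (expert - random_score)


def median_normalized(results):
    """Median of per-environment normalized scores.

    results maps an environment name to (agent, random, expert) raw scores.
    """
    return median(normalized_score(*scores) for scores in results.values())


# Hypothetical raw scores, for illustration only.
example_results = {
    "env_a": (60.0, 0.0, 100.0),
    "env_b": (15.0, 10.0, 20.0),
    "env_c": (9.0, 0.0, 10.0),
}
example_median = median_normalized(example_results)
```

Using the median rather than the mean keeps a single environment with an extreme ratio from dominating the aggregate.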

Abstract:
Action‑conditioned world models (ACWMs) have shown strong promise for video prediction and decision‑making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task‑specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM‑Phys, a new benchmark for evaluating action‑conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM‑Phys contains training and evaluation data spanning rigid‑body dynamics, kinematics, deformable‑object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in‑distribution and out‑of‑distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM‑Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM‑DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low‑dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high‑dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross‑attention improves high‑dimensional action conditioning, causal VAEs outperform frame‑wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.

Abstract:
Learning predictive world models from visual observations is a core problem in embodied AI, with applications to model‑based reinforcement learning and robotic planning. Existing latent world models typically generate future states with unconstrained neural transition functions, while modern video generation systems often prioritize perceptual plausibility or introduce physical structure through auxiliary losses, external guidance, or separate dynamics modules. As a result, long‑horizon rollouts can remain weakly grounded in the physical principles that govern real dynamics, leading to compounding error, energy drift, and physically inconsistent futures. We propose Least Action World Models (LaWM), a latent world‑modeling framework that operationalizes the Principle of Least Action in learned visual latent space: future rollouts are governed by a learned Lagrangian action functional rather than produced only by an unconstrained transition predictor. Our main technical realization is a latent variational integrator: LaWM encodes observations into learned generalized coordinates, learns a latent discrete Lagrangian over consecutive latent states, constructs a discrete action functional, and advances prediction by solving the corresponding discrete integration condition. Thus, physical structure is not merely used to score, regularize, or constrain a completed trajectory; it defines the latent transition rule itself. Because the transition is induced by a discrete variational principle, LaWM provides a structure‑preserving bias for long‑horizon visual prediction. Across physics‑clean synthetic dynamics and embodied robot interaction benchmarks, LaWM improves physical invariance, background consistency, motion smoothness, and appearance and geometric prediction metrics over video‑generation and world‑model baselines.
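
To make the idea of a transition rule defined by a discrete variational principle concrete, here is a minimal sketch with a hand-written Lagrangian rather than a learned one. It assumes the textbook discrete Lagrangian L_d(q_k, q_{k+1}) = h * [ (m/2) * ((q_{k+1} - q_k)/h)^2 - V(q_k) ]. The discrete Euler-Lagrange condition D2 L_d(q_{k-1}, q_k) + D1 L_d(q_k, q_{k+1}) = 0 then reduces to q_{k+1} = 2 q_k - q_{k-1} - (h^2/m) V'(q_k), the Störmer-Verlet update. The function name and the harmonic-oscillator setup are illustrative and are not taken from LaWM.

```python
import math


def variational_rollout(q0, q1, grad_potential, mass, step, num_steps):
    """Advance a 1-D trajectory by solving the discrete Euler-Lagrange
    condition for the discrete Lagrangian described above.

    The resulting update is q_next = 2*q_curr - q_prev - (step^2/mass) * V'(q_curr).
    """
    trajectory = [q0, q1]
    for _ in range(num_steps):
        q_prev, q_curr = trajectory[-2], trajectory[-1]
        q_next = 2.0 * q_curr - q_prev - (step ** 2 / mass) * grad_potential(q_curr)
        trajectory.append(q_next)
    return trajectory


# Harmonic oscillator with V(q) = q^2 / 2, so V'(q) = q (illustrative choice).
trajectory = variational_rollout(
    q0=1.0,
    q1=math.cos(0.1),
    grad_potential=lambda q: q,
    mass=1.0,
    step=0.1,
    num_steps=1000,
)
max_amplitude = max(abs(q) for q in trajectory)
```

In this toy case the oscillation amplitude stays close to its initial value over the whole rollout instead of drifting, which is the kind of structure-preserving behavior the abstract attributes to variational transitions.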

Abstract:
Large language models (LLMs) are increasingly deployed in multi‑agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi‑agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi‑agent LLM consensus systems. We formalize the problem as a sequential decision‑making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world‑model‑based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious‑prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language‑based multi‑agent systems.

Abstract:
Vision‑language‑action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world‑model‑augmented VLAs typically pass the per‑frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per‑frame representation and the latent action coupling under‑examined. We introduce OneWM‑VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow‑matching objective rather than connecting them through a separate decoder. Empirically, we find that per‑frame visual bandwidth can be reduced to a single token without compromising long‑horizon performance under our setup. Trained with 14.71M LoRA parameters on a π_0 (2B) backbone, OneWM‑VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO‑Long (vs. 85.2% for π_0), and reaches 60.0% on the long‑horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for π_0).

Abstract:
Inter‑brain synchrony (IBS) observed in real‑time dyadic interactions, including parent‑infant exchanges, suggests that two agents can align their internal representations through interaction. Yet computational accounts of how such alignment can arise between agents that have only local sensory access and asymmetric internal knowledge remain underdeveloped. We propose a constructive model of parent‑infant homeostatic co‑regulation that integrates a POMDP formulation of active interoceptive inference with the Metropolis‑Hastings Naming Game (MHNG) derived from the Collective Predictive Coding (CPC) hypothesis. In our model, the parent and infant agents agree on homeostatic regulatory actions for the infant's visceral state through a shared communicative variable generated by a locally computable Metropolis‑Hastings probability. The parent observes the infant through body‑generated exteroceptive cues, whereas the infant directly senses its own visceral state through interoception. This difference in access modality is implemented as asymmetric generative‑model knowledge: the parent knows how actions transform visceral states but must learn what the infant's bodily cues indicate, whereas the infant perceives its visceral state directly but must learn how actions affect it. We quantify the degree of representational alignment using the Jensen‑Shannon divergence between the two agents' latent representations. Notably, this synchrony emerged far earlier than the generative‑model convergence and was maintained despite heterogeneous generative‑model knowledge, indicating that it does not require fully shared world models. These findings support CPC as a candidate computational framework for explaining how dynamic representational synchrony relevant to IBS can emerge through local interactions.
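
The alignment measure named above, the Jensen‑Shannon divergence, can be sketched for two discrete distributions as follows. This is a generic illustration, not the paper's code: the function names are hypothetical, the inputs are assumed to be probability vectors of equal length, and base‑2 logarithms are used so that the value lies between 0 (identical) and 1 (disjoint support).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical latent-state distributions for the two agents.
parent_belief = [0.7, 0.2, 0.1]
infant_belief = [0.6, 0.3, 0.1]
alignment_gap = js_divergence(parent_belief, infant_belief)
```

Unlike the KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a state, which makes it convenient for comparing two agents without treating either as the reference.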

Abstract:
Generative models have achieved success in producing apparently coherent 2D videos, but they remain limited in the physical world because they lack 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro‑scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST‑Gen4D, a 4D generation framework with a 4D spatiotemporal cognition‑based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpt these representations into a global appearance graph and a local dynamic graph, and fuse them via semantic‑bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive the future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as a condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model promotes the structural rationality and topological consistency of 4D generation. Moreover, we propose the ST‑4D datasets by aggregating public 4D datasets and a self‑built subset. Extensive experiments demonstrate the superiority of our ST‑Gen4D across 3D and 4D generation tasks.

Abstract:
The integration of Vision‑Language‑Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long‑horizon error accumulation. During closed‑loop rollouts, these models are highly sensitive to initial‑state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long‑horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure‑Guided Style Augmentation to disentangle the visual textures of interactive environments from task‑relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement‑learning post‑training for VLA models.

Abstract:
Marketing decisions reflect the interaction of latent consumer heterogeneity, time‑varying internal states, and explicit interventions, a structure that current prediction‑ and language‑oriented models do not capture in a unified manner. We propose a Three‑in‑One world‑model architecture in which a Deep Boltzmann Machine (DBM) learns a frozen belief representation from demographics, time, and lagged actions and outcomes, with lightweight task‑specific adapters attached on top. The same belief supports three tasks within a single framework: (i) energy‑based consistency evaluation through the DBM's free energy, (ii) outcome prediction through adapters, and (iii) counterfactual inference by holding the belief fixed and varying only the action input given to the adapter. Using a controlled simulation in which the latent price sensitivity, promotion responsiveness, and base preference of each consumer are known, we show that the adapters match a strong MLP baseline on visit‑ and purchase‑AUC while recovering heterogeneous treatment effects substantially better than S‑, T‑, X‑, and DR‑learner meta‑learners and a Causal Forest baseline built on the same raw features, with the largest gap on a confounded price‑promotion intervention. Complementing this, free‑energy clamps systematically penalize counterfactual purchase trajectories that lack prior promotional exposure, and the penalty itself depends on the latent base preference in the expected direction. These results indicate that DBM beliefs disentangle latent traits in a form that survives counterfactual queries, providing an integrated world‑model substrate for marketing intervention.

Abstract:
Current end‑to‑end autonomous driving planners are fundamentally reactive: they condition on historical and present observations to predict future actions. We argue that autonomous agents should instead imagine future scenes before deciding, just as human drivers mentally simulate "what will happen next" before acting. We introduce ForeSight, a foundation‑world‑model‑centric planning framework that reframes autonomous driving as anticipatory decision‑making. Rather than treating world models as auxiliary components, ForeSight makes future scene imagination the primary driver of action prediction. Our approach operates in two stages: (1) generating plausible future visual worlds via a pretrained world model, and (2) planning actions conditioned on these imagined futures. This paradigm shift from "what should I do now?" to "what will happen, and how should I respond?" enables genuinely anticipatory rather than reactive planning. By grounding decisions in anticipated contexts rather than present observations alone, ForeSight navigates dynamic, interactive scenarios more effectively. Extensive experiments on NAVSIM and nuScenes demonstrate that explicit future imagination significantly outperforms previous state‑of‑the‑art alternatives, validating our foresight‑driven approach.

Abstract:
In model‑based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states. When an action and an outcome frequently co‑occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may become executable only after its prerequisites are met, or non‑executable when they are destroyed. We term such events structure‑changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi‑step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance‑Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game‑based simulated environments demonstrate the effectiveness of our method by achieving lower multi‑step prediction error, better generalization to novel configurations, and improved interpretability.
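
To make the idea of tracking executability through prerequisite dependencies concrete, here is a minimal hand-written sketch. It is not the AGWM method, which learns the structure from data. The class name `AffordanceGraph`, the crafting-style actions, and the fact names are all hypothetical.

```python
from typing import Dict, Iterable, Set

class AffordanceGraph:
    """Tracks which actions are executable given prerequisite facts."""

    def __init__(self, prerequisites: Dict[str, Set[str]]):
        # Maps each action to the set of facts that must hold before it can run.
        self.prerequisites = prerequisites
        self.facts: Set[str] = set()

    def executable(self, action: str) -> bool:
        # An action is executable when all of its prerequisite facts hold.
        return self.prerequisites.get(action, set()) <= self.facts

    def apply(self, action: str, adds: Iterable[str] = (), removes: Iterable[str] = ()) -> bool:
        """Apply an action; return False and leave the state unchanged if it is not executable."""
        if not self.executable(action):
            return False
        self.facts |= set(adds)      # structure-changing event: prerequisites created
        self.facts -= set(removes)   # structure-changing event: prerequisites destroyed
        return True

# Hypothetical dependency chain: wood -> pickaxe -> stone.
graph = AffordanceGraph({
    "collect_wood": set(),
    "make_pickaxe": {"has_wood"},
    "mine_stone": {"has_pickaxe"},
})
print(graph.executable("mine_stone"))   # prerequisites not yet met
graph.apply("collect_wood", adds={"has_wood"})
graph.apply("make_pickaxe", adds={"has_pickaxe"}, removes={"has_wood"})
print(graph.executable("mine_stone"))   # prerequisites now met
```

In a multi-step rollout, checking `executable` before each imagined step prevents a later step from being conditioned on an action that could not have happened.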

Abstract:
World model‑based policy evaluation is a practical proxy for testing real‑world robot control by rolling out candidate actions in action‑conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation‑aligned semantic latent spaces. We systematically evaluate these latent spaces for action‑conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on the BridgeV2 dataset, and show effective world model training in high‑dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel‑level scores, semantic encoders such as V‑JEPA 2.1 (strongest overall on policy), Web‑DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy‑relevant robotics diffusion world models.

Abstract:
Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth‑o1, an observation‑native atmospheric world model that overcomes these structural limitations. Rather than relying on conventional atmospheric dynamical modeling systems or traditional data assimilation, Earth‑o1 directly learns the continuous, three‑dimensional physical evolution of the Earth system from ungridded observational data. By integrating diverse sensor inputs into a unified, grid‑free dynamical field, the model autonomously advances the atmospheric state in space and time. We show that this fundamentally distinct paradigm enables direct, real‑time forecasting and cross‑sensor inference without the overhead of explicit numerical solvers. In hindcast evaluations, Earth‑o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS). These results establish that continuous, observation‑driven world models, a new class of fully observation‑native geophysical simulators, can match the fidelity of established physical frameworks, providing a scalable data‑driven foundation for a digital twin of the Earth.

Abstract:
Tool‑using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM‑based judges, which either do not scale or lack reliability for complex, long‑horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine‑checkable compliance benchmarks from natural‑language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace‑level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity, which is used to automatically derive challenging tasks with accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains, scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer and enforce constraints more strongly than those of existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool‑using agents.
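
A trace-level compliance check can be pictured as a predicate over the ordered list of tool calls. The sketch below covers only one simple rule family, "tool B may be called only after tool A", and does not reproduce MANTRA's generated checks or its SMT-based validation. The function name `check_precedence` and the tool names are hypothetical.

```python
from typing import List, Tuple

def check_precedence(trace: List[str], rules: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the violated (before, after) rules for a trace of tool-call names.

    A rule (before, after) is violated if `after` is called at a point in the
    trace where `before` has not yet been called.
    """
    seen = set()
    violations = []
    for tool in trace:
        for before, after in rules:
            if tool == after and before not in seen and (before, after) not in violations:
                violations.append((before, after))
        seen.add(tool)
    return violations

# Hypothetical manual rule: identity must be verified before a refund is issued.
rules = [("verify_identity", "issue_refund")]
print(check_precedence(["lookup_order", "issue_refund"], rules))
print(check_precedence(["verify_identity", "lookup_order", "issue_refund"], rules))
```

Because each violated rule is reported individually, the result points at the specific procedural step an agent skipped, not only at the fact that the run failed.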

Abstract:
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate‑based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero‑shot super‑resolution. Furthermore, like most latent action models, NOVA can be distilled into a context‑dependent video generator via an action‑matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter‑frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at ~40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.

Abstract:
Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world‑action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine‑grained robot‑object interaction dynamics in the generated rollouts. To bridge this gap, we present EA‑WM, an Event‑Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end‑effector actions as abstract, low‑dimensional tokens, EA‑WM projects actions and kinematic states directly into the target camera view as Structured Kinematic‑to‑Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event‑aware bidirectional fusion blocks that modulate cross‑branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA‑WM achieves state‑of‑the‑art performance, outperforming existing baselines by a significant margin.

Abstract:
Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG‑Causal‑RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077‑dimensional partial observation, a 478‑action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand‑specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM‑predicted intervention effects, and per‑factor credit traces, making causal credit assignment, leave‑one‑out cross‑archetype transfer, and policy auditability first‑class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal‑world‑model PPO variant, and an architecture‑matched scalar control. We propose Causal Graph‑Factored Advantage PPO (CGFA‑PPO) as a reference causal agent that uses SCM parents of win probability as factor‑aligned critic targets with an intervention‑calibration loss. All comparisons use paired seeds, paired‑bootstrap confidence intervals, and Holm‑Bonferroni correction within pre‑registered families. Masked PPO and CGFA‑PPO reach competitive in‑distribution win rates and exceed the random baseline; per‑factor calibration trajectories and leave‑one‑out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference‑baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG‑Causal‑RL gives causal‑RL, world‑model, and LLM‑agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM‑grounded policy auditability.
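
The Holm-Bonferroni correction mentioned in the abstract above is a standard step-down procedure for controlling the family-wise error rate. A minimal sketch follows. The p-values are made up for illustration and are not results from the benchmark.

```python
from typing import List

def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether its null is rejected under Holm-Bonferroni."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject

# Illustrative p-values for one family of four comparisons.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Unlike the plain Bonferroni correction, which compares every p-value with alpha / m, the threshold relaxes as hypotheses are rejected, so the procedure is never less powerful.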

Abstract:
This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world‑model‑generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition‑Video Alignment, and Temporal Consistency. Beyond that, participants also need to localize physical anomaly timestamps for fine‑grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text‑2D, image‑to‑4D, and video‑to‑4D) and spanning 26 categories. These categories explicitly cover physics‑relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real‑world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality‑control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method‑level insights from submitted solutions.
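
The abstract does not define TimeStamp_IOU, so the sketch below shows only the generic intersection-over-union of two time intervals, which is one plausible building block for scoring anomaly localization. The function name and the interval convention of (start, end) in seconds are assumptions.

```python
from typing import Tuple

def interval_iou(pred: Tuple[float, float], truth: Tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals given as (start, end)."""
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Predicted anomaly from 1 s to 3 s, annotated anomaly from 2 s to 4 s.
print(interval_iou((1.0, 3.0), (2.0, 4.0)))
```

How per-video IoU values are aggregated and combined with SRCC/PLCC is specific to the challenge protocol and is not reconstructed here.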

Abstract:
We evaluate an initial coding‑agent system for ARC‑AGI‑3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL‑like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world‑model interfaces, verifier programs, and a plan executor, but no hand‑coded game‑specific logic. We report results on the 25 public ARC‑AGI‑3 games. Each recorded playthrough uses a fresh agent instance with no access to previous playthrough‑specific files or conversation state. Most games have a single recorded playthrough; for a few games, we report multiple independent fresh‑agent playthroughs to expose run‑to‑run variability. The agent fully solved 7 games, achieved a Relative Human Action Efficiency (RHAE) greater than 75% on 6 games, and obtained a mean per‑game RHAE of 32.58%. Because the system uses no game‑specific code, it can serve as a game‑general baseline for ARC‑AGI‑3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier‑driven executable world models are a promising approach for ARC‑AGI‑3 agents.

Abstract:
Neural representations carry rich geometric structure, but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold M_h to representations and a behavior manifold M_y to output probability distributions. We then test the link M_h ↔ M_y via interventions: we find that steering along M_h, which we term manifold steering, yields behavioral trajectories that follow M_y, while linear steering, which assumes a Euclidean geometry, cuts through off‑manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in‑context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.

Abstract:
Safe L2/L3 driving automation requires anticipating human‑in‑the‑loop reactions during shared‑control transitions. While most driving world models forecast the external environment, in‑cabin intelligence remains strictly recognition‑oriented and lacks multi‑step rollout capabilities for driver dynamics. We introduce Driver‑WM, a driver‑centric latent world model that rolls out in‑cabin dynamics causally conditioned on out‑cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision‑language features, Driver‑WM adopts a dual‑stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi‑task assistive driving benchmark demonstrate that Driver‑WM yields robust long‑horizon geometric forecasting for reactive high‑motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external‑to‑internal conditioning allows for controlled test‑time interventions to systematically analyze mechanism responses.

Abstract:
We report a systematic failure mode in predictive representation learning. Across 2695 neural network configurations trained to predict linear‑Gaussian dynamics, the optimal encoder tracks the environment rather than the system it is meant to model. The mean causal fidelity, defined as the fraction of encoder sensitivity allocated to system degrees of freedom, is 0.49, and only 2.5% of configurations exceed 0.70. The failure intensifies with dimension: at N=100, the optimal encoder becomes causally blind (fidelity ~10^‑8) while achieving 92% lower prediction error than the causal representation. We prove this is not an optimization artifact but a structural property of the predictive objective: when environment modes are slower or less noisy than system modes, every minimizer of the population risk encodes the former. The set of dynamics exhibiting this predictive‑causal gap is open and of positive measure in parameter space. In a nonlinear Duffing‑GRU sweep, unconstrained predictors learn environment‑dominant representations in 55% of tasks (95% CI 41% to 68%) versus 24% under operational grounding (p=2.3e‑3); the median out‑of‑distribution MSE inflation under environment shift is 1.82x versus 1.00x. Operational grounding, which restricts the loss to system observables, partially suppresses the gap, but causal fidelity is never recovered without an explicit system‑environment boundary. The results identify the predictive‑causal gap as a structural limit of learning, with implications for self‑supervised representation learning, world models, and the scaling paradigm.

Abstract:
Transformer‑based pre‑trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large‑scale pre‑training, these models do not capture complete compositional context and certainly not the full human‑analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable, and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non‑transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory, and knowledge‑based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency, especially in mission‑critical use cases. Collectively, our results demonstrate that it is possible to create models that are trustworthy and reliable for mission‑critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.

Abstract:
State‑of‑the‑art model‑based Reinforcement Learning (RL) approaches either use gradient‑free, population‑based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient‑free optimization methods, which can be computationally expensive for high‑dimensional control tasks. While gradient‑based methods are a promising alternative, recent works have empirically shown that they often perform worse than their gradient‑free counterparts. We propose Dream‑MPC, a novel approach that generates a few candidate trajectories from a rolled‑out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization, and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream‑MPC can significantly improve the performance of the underlying policy and can outperform gradient‑free MPC and state‑of‑the‑art baselines. Code and videos are available at https://dream‑mpc.github.io.
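
The core step of optimizing an action sequence by gradient ascent through a model can be sketched on a toy problem. This is not Dream‑MPC itself: the one-dimensional dynamics `s' = s + a`, the quadratic reward, and all function names are assumptions chosen so that the gradient can be written by hand, and the sketch omits the learned model, the policy prior, uncertainty regularization, and action reuse.

```python
from typing import List

GOAL = 1.0          # hypothetical target state
ACTION_COST = 0.1   # hypothetical penalty weight on action magnitude

def rollout_return(s0: float, actions: List[float]) -> float:
    """Sum of rewards along a rollout of the toy model s' = s + a."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a
        total += -(s - GOAL) ** 2 - ACTION_COST * a ** 2
    return total

def return_gradient(s0: float, actions: List[float]) -> List[float]:
    """Analytic gradient of rollout_return with respect to each action."""
    states, s = [], s0
    for a in actions:
        s = s + a
        states.append(s)
    grads, tail = [], 0.0
    # Action k shifts every later state, so sum state-error terms from the end backwards.
    for k in reversed(range(len(actions))):
        tail += -2.0 * (states[k] - GOAL)
        grads.append(tail - 2.0 * ACTION_COST * actions[k])
    grads.reverse()
    return grads

def optimize_actions(s0: float, actions: List[float], steps: int = 50, lr: float = 0.05) -> List[float]:
    """Refine a candidate action sequence by plain gradient ascent on the return."""
    actions = list(actions)
    for _ in range(steps):
        grads = return_gradient(s0, actions)
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions

initial = [0.0, 0.0, 0.0]  # stands in for a candidate proposed by a policy
refined = optimize_actions(0.0, initial)
print(rollout_return(0.0, initial), rollout_return(0.0, refined))
```

With a learned neural model, the hand-written gradient would typically be replaced by automatic differentiation through the rollout.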

Abstract:
Structural causal models provide a unified semantics for interventions and counterfactuals, but most identifiability results rely on restrictive assumptions like global monotonicity, which are often violated in embodied interaction, where the same exogenous perturbation can induce opposite responses under different contact contexts. We ask what structure still suffices once global monotonicity is dropped. We introduce non‑monotone triangular structural causal models (NM‑TM‑SCM), which retain triangular recursion but replace global monotonicity with mechanism‑wise invertibility and context‑independent inverse transport. We prove that these conditions are equivalent to exogenous isomorphism and imply complete counterfactual identifiability, and we give a counterexample showing that local invertibility alone is insufficient. We instantiate the theory in CausalInverter, with triangular invertible layers, orientation gates, and transport‑stability regularization. On synthetic non‑monotonic mechanisms, the structural bias yields systematic counterfactual gains as non‑monotonicity increases. On MuJoCo Door, our model achieves perfect event‑level counterfactual recovery, lowers continuous angle error relative to a Transformer baseline, and delivers substantially more stable recovery than Transformer and conditional‑flow predictors. On MuJoCo Push, where non‑monotonicity is weaker, the same low‑data predictors remain competitive or better, consistent with a bias‑variance boundary. These results identify a broader identifiable regime between globally monotone triangular models and unconstrained black‑box world models.

Abstract:
Sessions is one of the major features introduced in the MPI‑4 standard. It offers an alternative to the traditional world communicator model by allowing applications to construct communicators from process sets, thereby eliminating the dependency on MPI_COMM_WORLD. The Sessions model was proposed as a more scalable solution for exascale systems, where MPI_COMM_WORLD was viewed as a potential scalability bottleneck. However, supporting Sessions is a significant challenge for established codebases like MPICH due to the deep integration of the world model in traditional MPI implementations. Although MPICH added support for the MPI‑4 standard upon its release, it still internally relied on a global world communicator. This approach enabled applications written using the Sessions model to function, but it did not fulfill the full design intent of Sessions, which was meant to decouple MPI from MPI_COMM_WORLD. We describe the MPICH effort to support true MPI Sessions, including a major internal refactoring. We detail the architectural changes required to support true Sessions and evaluate the scalability of the resulting implementation. Our results demonstrate that true Sessions can offer significant scalability benefits by adopting explicit hierarchical designs.

Abstract:
Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large‑scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld‑Bench, a comprehensive benchmark for training and testing world models on interaction‑related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high‑quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld‑Bench model leaderboard is publicly available at iWorld‑Bench.com.

Abstract:
Existing robot video world models are typically trained with low‑level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long‑horizon autoregressive prediction. We present RoboAlign‑R1, a framework that combines reward‑aligned post‑training with stabilized long‑horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video‑instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign‑Judge, to provide fine‑grained six‑dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement‑learning‑based post‑training. To reduce long‑horizon rollout drift, we further introduce Sliding Window Re‑encoding (SWR), a training‑free inference strategy that periodically refreshes the generation context. Under our in‑domain evaluation protocol, RoboAlign‑R1 improves the aggregate six‑dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM‑based cross‑check and a blinded human study. Meanwhile, SWR improves long‑horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward‑aligned post‑training and stabilized long‑horizon decoding improve task consistency, physical realism, and long‑horizon prediction quality in robot video world models.

Abstract:
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse‑reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity‑driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning "what the agent thinks" with "what the agent sees" is key to solving complex or sparse agentic tasks.

Abstract:
What does it mean to understand the world? Contemporary world models often operationalize understanding as accurate future prediction in latent or observation space. Developmental cognitive science, however, suggests a different view: human understanding emerges through the construction of internal theories of how the world works, even before mature language is acquired. Inspired by this theory‑building view of cognition, we introduce Learning‑to‑Theorize, a learning paradigm for inferring explicit explanatory theories of the world from raw, non‑textual observations. We instantiate this paradigm with the Neural Theorizer (NEO), a probabilistic neural model that induces latent programs as a learned Language of Thought and executes them through a shared transition model. In NEO, a theory is represented as an executable, compositional program whose learned primitives can be systematically recombined to explain novel phenomena. Experiments show that this formulation enables explanation‑driven generalization, allowing observations to be understood in terms of the programs that generate them.

Abstract:
Artificial intelligence‑generated content (AIGC) has emerged as a transformative paradigm for automating the creation of diverse and customized content, giving rise to rapidly growing computational workloads in cloud data centers. It is imperative for AIGC service providers (ASPs) to strategically schedule AIGC workloads to reduce data center energy costs while guaranteeing high‑quality content generation. However, the distinctive characteristics of AIGC services pose critical challenges, including model heterogeneity across ASPs, implicit service quality evaluation, and complex inference process control. To tackle these challenges, we propose a joint energy management and coordinated AIGC workload scheduling framework, which introduces an explicit mathematical characterization of service quality to promote both job transfer among ASPs and fine‑grained inference process configuration. Moreover, various energy resources within data centers are jointly considered to enhance power usage flexibility. Subsequently, a system utility maximization problem is formulated to balance AIGC service revenue with operational penalties and costs. Nevertheless, the strong coupling among job scheduling decisions induces severe reward sparsity, which limits the effectiveness of existing deep reinforcement learning (DRL) algorithms. To address this issue, we develop a diffusion model‑aided reward shaping approach to synthesize complementary reward signals through a multi‑step denoising process. This approach is seamlessly integrated with DRL to enable efficient learning of scheduling policies under sparse environmental feedback. Experiments based on real‑world models and datasets demonstrate that our scheme effectively accommodates electricity price fluctuations and AIGC model heterogeneity, while achieving superior learning convergence and system utility compared with benchmark methods.

Abstract:
Stories hold a reader's attention because they have causes, secrets, and consequences. Shadow‑Loom is an experimental open‑source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl's ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi‑World Networks; and a narrative physics that scores the same graph against four structural reader‑states (mystery, dramatic irony, suspense, and surprise) in the tradition of Sternberg's curiosity/suspense/surprise triad, with suspense formalised in the structural‑affect line of work on story comprehension and computational suspense. Large language models are used only at the boundary: extraction, rendering, and audit; identification, intervention, and counterfactual reasoning are carried out in typed code over the graph. The system is offered as a research artefact rather than as a benchmarked NLP model; code, fixtures, and pipeline are released open source.

Abstract:
World models enable long‑horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination‑driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one‑step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long‑tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision‑critical trajectories can systematically hijack planning. To exploit this vulnerability, we propose TRAP, a backdoor attack framework for world models that targets imagined trajectory ranking. TRAP combines a tail‑aware ranking loss to focus optimization on decision‑critical trajectories with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Under trigger conditions, TRAP alters the relative ranking of imagined trajectories to redirect planning outcomes, while largely maintaining the normal ranking structure on clean inputs. Experiments on DreamerV3 and TD‑MPC2 across diverse tasks show that TRAP consistently induces sustained behavioral deviations and significant performance degradation, highlighting the need for dedicated security evaluation of world‑model‑based agents.

Abstract:
Emerging multi‑modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose M^2‑REPA, the first representation alignment method tailored for multi‑modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain‑specific priors, acting as complementary "experts." Specifically, we first decouple modality‑specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi‑modal representation alignment loss that enforces feature‑to‑expert matching, and a modality‑specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long‑term consistency.

Abstract:
World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi‑view information essential for embodied spatial reasoning. Bridging this gap is non‑trivial, primarily due to challenges from severe scarcity of paired multi‑view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video‑to‑video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D‑aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross‑embodiment robotic arms with diverse backgrounds, guaranteeing broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy; by leveraging confidence disparities across image regions, this method selectively regularizes the diffusion process to ensure strict spatiotemporal consistency. Finally, to guarantee manipulation fidelity, we incorporate an interaction‑aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state‑of‑the‑art performance, serving as a robust world model that synthesizes high‑fidelity, view‑consistent videos to empower downstream robotic planning and learning.

Abstract:
A world model matters to an agent only through the state it constructs. That state must preserve some information, discard other information, and support some future function: prediction, control, planning, memory, grounding, or counterfactual reasoning. This paper treats world‑model research as latent state design under sufficiency constraints. We propose a functional taxonomy that groups methods by what their latent state is for, rather than by architecture or application domain: predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate. These roles expose distinctions that architecture‑based groupings hide, including the gap between predictive sufficiency and control sufficiency, and the gap between passive video prediction and counterfactual action modeling. The taxonomy supports an evaluation framework that judges a model by the sufficiency constraint its latent state was built to satisfy. We compare methods along seven axes: representation, prediction, planning, controllability, causal/counterfactual support, memory, and uncertainty. We use the resulting matrix as a diagnostic for what a latent state preserves, discards, and enables. The conclusion that follows is that an actionable world model is the one whose state construction matches the task, not the one that preserves the most information.

Abstract:
Foundational models have advanced social robotics, enabling richer perception and communicative interaction with users. However, current systems still struggle with multi‑turn engagement, social‑relationship reasoning, and contextually grounded dialogue at scale. We present ARIS (Agentic and Relationship Intelligence System), an agentic AI framework that unifies multimodal reasoning, a graph‑based Social World Model, and retrieval‑augmented generation (RAG) within a single modular architecture for social robots. We evaluate ARIS with the Pepper robot in a robot‑mediated dyadic conversational setting, comparing it against a large language model baseline. A user study (N=23) shows that ARIS yields significantly higher perceived intelligence, animacy, anthropomorphism, and likeability. Our contributions are threefold: (1) a Social World Model that explicitly maps and updates social relationships between users through a knowledge graph, enabling social reasoning and re‑identification across encounters; (2) an efficient RAG‑based conversational pipeline that maintains bounded latency as dialogue histories grow to thousands of exchanges while preserving response relevance; and (3) system integration and empirical validation of these components within a modular agentic architecture that coordinates speech, vision, and physical action through structured APIs. The implementation of ARIS will be released as open source upon publication.

Abstract:
World models have recently re‑emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model‑based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video‑generative models that emphasize visual future synthesis, 3D scene‑centric models that emphasize spatial reconstruction, and JEPA‑like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action‑controllable, and long‑horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian‑inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long‑horizon stability, while also noting practical challenges in real‑world robotic scenes involving friction, contact, non‑conservative forces, and deformable objects.
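
To make the phrase "Hamiltonian‑inspired dynamics with control, dissipation, and residual terms" concrete, here is a toy sketch on a one‑dimensional phase space (q, p). It is not the paper's model: the quadratic Hamiltonian H = p²/2 + k·q²/2, the linear damping term `gamma * p`, the additive control `u`, and the symplectic Euler integrator are all assumptions made for illustration, and the learned encoder, decoder, and residual term are omitted.

```python
from typing import Iterable, List, Tuple


def energy(q: float, p: float, k: float = 1.0) -> float:
    """Assumed toy Hamiltonian: H(q, p) = p^2 / 2 + k * q^2 / 2."""
    return 0.5 * p * p + 0.5 * k * q * q


def step(q: float, p: float, u: float = 0.0, dt: float = 0.01,
         k: float = 1.0, gamma: float = 0.0) -> Tuple[float, float]:
    """One symplectic Euler step with damping and control.

    dp/dt = -dH/dq - gamma * p + u   (conservative force, dissipation, control)
    dq/dt =  dH/dp = p
    Momentum is updated first; position then uses the new momentum.
    """
    p_next = p + dt * (-k * q - gamma * p + u)
    q_next = q + dt * p_next
    return q_next, p_next


def rollout(q: float, p: float, controls: Iterable[float],
            **kwargs: float) -> List[Tuple[float, float]]:
    """Roll the latent state forward under a sequence of controls."""
    trajectory = [(q, p)]
    for u in controls:
        q, p = step(q, p, u, **kwargs)
        trajectory.append((q, p))
    return trajectory
```

With `gamma=0` and no control, the energy of this toy system stays close to its initial value over long rollouts, which is the kind of long‑horizon stability the abstract attributes to Hamiltonian structure; with `gamma>0` the energy decays, mimicking friction.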

Abstract:
World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, and data generation, and have advanced rapidly with the rise of foundation models and large‑scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot‑learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination‑based generation to controllable, structured, and foundation‑scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.

Abstract:
Visual‑Language‑Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world‑action models introduce future prediction through video rollouts, yet pixel‑space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being‑H0.7, a latent world‑action model that brings future‑aware reasoning into VLA‑style policies without generating future frames. Being‑H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future‑informed dual‑branch design: a deployable prior branch infers latent states from the current context, while a training‑only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future‑aware, action‑useful structure from current observations alone. At inference, Being‑H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real‑world tasks show that Being‑H0.7 achieves state‑of‑the‑art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

Abstract:
Modern visual world modeling systems increasingly rely on high‑capacity architectures and large‑scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S^2VAE, a geometry‑first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point‑level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry‑aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high‑compression regimes. Our results highlight latent geometry as a first‑class design choice for physically grounded visual and world models.

Abstract:
Learned driving agents often degrade when deployed in unseen environments. This paper studies a deliberately bounded instance of that problem in the CARLA simulator: zero‑shot transfer of a closed‑loop fixed‑route driving agent from Town05 and Town06 to unseen Town03 and Town04. The study isolates structural town shift by keeping weather fixed to ClearNoon and removing traffic and pedestrians. We build on a Dreamer‑style latent world‑model agent and add two training‑only auxiliary losses: multi‑horizon prediction of future visual‑semantic embeddings along imagined rollouts and town‑adversarial supervision on a semantic projection of the recurrent latent state. A causal context feature conditions the semantic rollout predictor, while the actor and critic retain the standard control feature. The policy receives no navigation command, route polyline, goal pose, or map input; the reference route is used only by the environment for reward, progress, success, and termination. Across the evaluated held‑out towns, the proposed model achieves the highest mean success rate among the included Dreamer‑family methods. Secondary safety and lane‑keeping metrics are mixed across towns. These results support a bounded conclusion: in this controlled fixed‑weather CARLA setting, semantic rollout supervision combined with town‑adversarial regularization improves mean held‑out‑town route completion.

Abstract:
This paper presents an expert‑guided active‑inference‑inspired framework for adaptive UAV swarm trajectory planning. The proposed method converts multi‑UAV trajectory design from a repeated combinatorial optimization problem into a hierarchical probabilistic inference problem. In the offline phase, a genetic‑algorithm planner with repulsive‑force collision avoidance (GA‑RF) generates expert demonstrations, which are abstracted into Mission, Route, and Motion dictionaries. These dictionaries are used to learn a probabilistic world model that captures how expert mission allocations induce route orders and how route orders induce motion‑level behaviors. During online operation, the UAV swarm evaluates candidate actions by forming posterior beliefs over symbolic states and minimizing KL‑divergence‑based abnormality indicators with respect to expert‑derived reference distributions. This enables mission allocation, route insertion, motion adaptation, and collision‑aware replanning without rerunning the offline optimizer. Bayesian state estimators, including EKF and PF modules, are integrated at the motion level to improve trajectory correction under uncertainty. Simulation results show that the proposed framework preserves expert‑like planning structure while producing smoother and more stable behavior than modified Q‑learning. Additional validation using real‑flight UAV trajectory data demonstrates that the learned world model can correct symbolic predictions under noisy and non‑smooth observations, supporting its applicability to adaptive UAV swarm autonomy.
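
The online selection rule, choosing the candidate action whose posterior belief deviates least from an expert‑derived reference, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the direction of the divergence, KL(belief ‖ reference), the discrete symbolic state space, and the action names in the usage example are assumptions.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; q is floored at eps."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def select_action(beliefs: Dict[str, Sequence[float]],
                  reference: Sequence[float]) -> str:
    """Pick the action with the smallest abnormality indicator.

    beliefs:   posterior over symbolic states predicted for each candidate
               action (hypothetical structure).
    reference: expert-derived distribution over the same symbolic states.
    """
    return min(beliefs, key=lambda a: kl_divergence(beliefs[a], reference))
```

For example, with `reference = [0.6, 0.3, 0.1]`, a candidate whose belief is `[0.7, 0.2, 0.1]` is preferred over one whose belief is `[0.1, 0.1, 0.8]`, because the former stays much closer to expert behavior.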

Abstract:
Understanding how human health changes over time, and why responses to interventions vary between individuals, remains a central challenge in medicine. Here we present HealthFormer, a decoder‑only transformer that models the human physiological trajectory generatively, by training on data from the Human Phenotype Project, a multi‑visit cohort of over 15,000 deeply phenotyped individuals. We tokenise each participant's health trajectory across 667 measurements spanning seven domains: blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable‑derived physiology, and behaviour and medication exposure. We train HealthFormer to forecast individual physiological trajectories across these domains, and from this single generative objective a range of clinically relevant tasks can be expressed as queries on the model. We show that, without task‑specific training, HealthFormer transfers to four independent cohorts and improves prediction for 27 of 30 incident‑disease and mortality endpoints, exceeding established clinical risk scores in every comparison. We further show that the model can simulate interventions in silico: in a held‑out personalised‑nutrition trial, intervention‑conditioned predictions recover individual six‑month biomarker changes (e.g., Pearson r = 0.78 for diastolic blood pressure). Across 41 randomised intervention‑outcome comparisons drawn from published trials, our results show that the predicted direction of effect agrees in every case, and the predicted mean falls within the reported 95% confidence interval in 30 cases. We position HealthFormer as an initial health world model, from which forecasting, risk stratification, and intervention‑conditioned simulation arise as queries, providing a basis for clinical digital twins.

Abstract:
As one of the mainstream models of artificial intelligence, world models allow agents to learn the representation of the environment for efficient prediction and planning. However, classical world models based on flat tensors face several key problems, including noise sensitivity, error accumulation and weak reasoning. To address these limitations, many recent studies use graph structure to decompose the environment into entity nodes and interactive edges, and model virtual environments in a structured space. This paper systematically formalizes and unifies these emerging graph‑based works under the concept of graph world models (GWMs). To the best of our knowledge, GWMs have not yet been explicitly defined and surveyed as a unified research paradigm. Furthermore, we propose a taxonomy based on relational inductive biases (RIB), categorizing GWMs by the specific structural priors they inject: (1) spatial RIB for topological abstraction; (2) physical RIB for dynamic simulation; and (3) logical RIB for causal and semantic reasoning. For each model category, we outline the key design principles, summarize representative models, and conduct comparative analyses. We further discuss open challenges and future directions, including dynamic graph adaptation, probabilistic relational dynamics, multi‑granularity inductive biases, and the need for dedicated benchmarks and evaluation metrics for GWMs.

Abstract:
We present BAss (BDD‑based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, and preferred interpretations, as well as two‑valued and stable models of an ADF. Our approach is inspired by the recently discovered equivalence between Boolean Networks (BNs) and ADFs by Heyninck et al. (2024) and Azpeitia et al. (2024), significantly extending current BDD‑based tools bioLQM, AEON, and adf‑bdd. We conducted experiments on a large‑scale collection of real‑world models from both the BN and ADF communities. Our results show that BAss dramatically outperforms previous BDD‑based tools and is competitive (even significantly better in some cases) with state‑of‑the‑art SAT/ASP‑based methods, particularly in scenarios involving large solution spaces. Notably, BAss is able to enumerate all fixed points or minimal trap spaces of certain biological networks beyond the reach of existing tools, thereby enabling new analysis and case studies in systems biology. These results highlight the practical relevance of symbolic reasoning for complex real‑world applications, particularly in systems biology and formal argumentation.

Abstract:
This paper revisits camera pose estimation through the lens of self‑supervised pretraining, focusing on inverse‑dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse‑ and forward‑dynamics models to learn latent action representations, similar to Genie, from large‑scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world models or as proxies of robot action parameters in policy networks. Our method, dubbed LA‑Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high‑quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed‑forward efficiency. Extensive experiments on driving benchmarks show that LA‑Pose achieves performance competitive with, and in some cases superior to, state‑of‑the‑art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA‑Pose achieves over 10% higher pose accuracy than recent feed‑forward methods. To our knowledge, this work is the first to demonstrate the power of inverse‑dynamics self‑supervised learning for pose estimation.

Abstract:
Large language models can now generate substantial code and draft research text, but research‑software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM‑specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model falls out of alignment. We propose Comet‑H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper‑writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow‑up work forward with a half‑life, and re‑checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand‑weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long‑horizon follow‑ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research‑software repositories across two dozen domains. We study A3 in depth, a Python static‑analysis tool built entirely within the loop, which reaches F1 = 0.768 on a 90‑case benchmark, compared with a next‑best baseline of 0.364. Across approximately 400 commits, we find that audit‑and‑contraction passes dominate the later phases of every successful trajectory.
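
The prompt‑selection rule described above (prompts as arms, workspace deficits as context, a hand‑weighted linear score, and follow‑ups that fade with a half‑life) can be sketched in a few lines. Everything specific below is hypothetical: the prompt families, deficit names, weights, and the way the decayed follow‑up weight is added to the linear score are illustrative assumptions, not values from the paper.

```python
from typing import Dict, Tuple

# Hypothetical hand-set weights: prompt family -> {deficit name: weight}.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "implement": {"missing_code": 1.0, "failing_tests": 0.5},
    "evaluate": {"missing_benchmarks": 1.0, "failing_tests": 0.8},
    "ground": {"unsupported_claims": 1.2},
}


def decayed(weight: float, age_steps: int, half_life: float) -> float:
    """Exponential decay: the weight halves every `half_life` steps."""
    return weight * 0.5 ** (age_steps / half_life)


def score(prompt: str, deficits: Dict[str, float],
          followups: Dict[str, Tuple[float, int]],
          half_life: float = 4.0) -> float:
    """Linear score over deficits plus any decayed follow-up bonus.

    followups maps a prompt family to (initial_weight, age_in_steps).
    """
    linear = sum(w * deficits.get(name, 0.0)
                 for name, w in WEIGHTS[prompt].items())
    carry = 0.0
    if prompt in followups:
        weight, age = followups[prompt]
        carry = decayed(weight, age, half_life)
    return linear + carry


def select_prompt(deficits: Dict[str, float],
                  followups: Dict[str, Tuple[float, int]],
                  half_life: float = 4.0) -> str:
    """Greedy arm choice: the prompt family with the highest score."""
    return max(WEIGHTS, key=lambda p: score(p, deficits, followups, half_life))
```

In this sketch every choice can be read directly off the deficits and the weights, which is the sense in which a hand‑weighted scorer is legible. A fresh follow‑up can temporarily outrank a larger deficit, but once it has aged by a few half‑lives it no longer influences the choice.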

Abstract:
Robotic manipulation requires reasoning about future spatial‑temporal interactions and geometric constraints, yet existing Vision‑Language‑Action (VLA) policies often leave predictive representation weakly coupled with action execution, causing failures in tasks requiring precise spatial‑temporal coordination. We propose STARRY, a world‑model‑enhanced action‑generation policy that aligns spatial‑temporal prediction and action generation by jointly denoising future spatial‑temporal latents and actions through a unified diffusion process. To bridge 2D visual tokens and 3D metric control, STARRY introduces Geometry‑Aware Selective Attention Modulation (GASAM), which converts predicted depth and end‑effector geometry into token‑aligned weights for selective action‑attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings across 50 bimanual tasks. Real‑world experiments show that STARRY improves average success from 42.5% to 70.8% compared with π_0.5. These results demonstrate the effectiveness of action‑centric spatial‑temporal world modeling for spatially and temporally demanding robotic manipulation.

Abstract:
Large Language Model (LLM)‑based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL‑Comp, a neuro‑symbolic AI agent architecture designed to address this challenge by grounding actions of the agent. AGEL‑Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub‑goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction‑abduction learning cycle, enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the Retro Quest simulation environment that probes compositional generalization scenarios to evaluate our AGEL agent. Our findings indicate that our AGEL model outperforms pure LLM‑based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.

Abstract:
Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter‑efficient fine‑tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio‑temporal dynamics. Extensive evaluations across three public datasets and in‑house clinical data confirm DepthPilot's robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between "visually realistic" and "clinically interpretable". Moreover, DepthPilot‑generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.

Abstract:
World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high‑dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search‑based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high‑level actions to sequences of low‑level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high‑level action. We instantiate this framework for a human‑like embodiment, defining the high‑level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near‑term goal position for a leaf joint (pelvis, head, hands). Waypoints are low‑dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low‑level joint space (3.8× lower mean joint error to the goal pose), while remaining more compute‑efficient and generalizing to environments unseen by the policy.
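
The composition described above, a policy that expands one high‑level action into low‑level actions which a frozen world model then rolls out, can be sketched generically. The one‑dimensional toy world, the equal‑step `toy_policy`, and the exhaustive `plan` search over a handful of candidate waypoints are illustrative assumptions, not the paper's embodiment or planner.

```python
from typing import Callable, List, Sequence

WorldModel = Callable[[float, float], float]        # (obs, low-level action) -> next obs
Policy = Callable[[float, float], Sequence[float]]  # (obs, waypoint) -> low-level actions


def lift(world_model: WorldModel, policy: Policy) -> Callable[[float, float], List[float]]:
    """Compose a policy with a frozen world model into a lifted model."""
    def lifted(obs: float, waypoint: float) -> List[float]:
        frames = []
        for action in policy(obs, waypoint):
            obs = world_model(obs, action)  # the world model itself is unchanged
            frames.append(obs)
        return frames
    return lifted


def plan(lifted: Callable[[float, float], List[float]], obs: float,
         candidates: Sequence[float], goal: float) -> float:
    """Search the low-dimensional waypoint space instead of joint space."""
    return min(candidates, key=lambda w: abs(lifted(obs, w)[-1] - goal))


def toy_world_model(position: float, delta: float) -> float:
    """Toy 1D dynamics (assumption): position moves by the commanded delta."""
    return position + delta


def toy_policy(position: float, waypoint: float, steps: int = 4) -> List[float]:
    """Toy policy (assumption): reach the waypoint in equal-sized steps."""
    return [(waypoint - position) / steps] * steps
```

Here `plan` evaluates one rollout per candidate waypoint, so the search cost depends on the number of waypoints considered rather than on the dimensionality of the low‑level action sequence.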

Abstract:
End‑to‑end autonomous driving planners typically generate trajectories from current observations alone. However, real‑world driving is highly dynamic, and such reactive planning cannot anticipate future scene evolution, often leading to myopic decisions and safety‑critical failures. We propose ProDrive, a world‑model‑based proactive planning framework that enables ego‑environment co‑evolution for autonomous driving. ProDrive jointly trains a query‑centric trajectory planner and a bird's‑eye‑view (BEV) world model end‑to‑end: the planner generates diverse candidate trajectories and planning‑aware ego tokens, while the world model predicts future scene evolution conditioned on them. By injecting planner features into the world model and evaluating all candidates in parallel, ProDrive preserves end‑to‑end gradient flow and allows future outcome assessment to directly shape planning. This bidirectional coupling enables proactive planning beyond current‑observation‑driven decision‑making. Experiments on NAVSIM v1 show that ProDrive outperforms strong baselines in both safety and planning efficiency, while ablations validate the effectiveness of the proposed ego‑environment coupling design.

Abstract:
Lifetime prediction of reactor pressure vessel (RPV) steel requires bridging atomistic degradation mechanisms with service‑scale spatial and temporal regimes, from Angstroms and picoseconds to meters and decades. Existing engineering‑scale models provide long‑range reach but rely on fitted degradation laws, while recent atomistic kinetic Monte Carlo (AKMC) advances still fail to achieve year‑and‑meter‑scale coverage. We present AtomWorld, an atomistic world‑modeling framework for RPV steel lifetime simulation co‑designed with leadership‑scale supercomputing through three tightly coupled layers: (1) algorithm: AtomWorld recasts classical AKMC as an atomistic world model that learns consequence‑aware state transitions over the ab initio energy landscape; (2) HPC: it co‑designs this formulation with modern supercomputers, yielding a compute‑dense, synchronization‑light, and communication‑efficient execution pipeline; and (3) application: it extends atomistic world modeling to engineering‑scale simulation through a physically grounded voxel‑parallel framework, offering a scalable pathway from local atomistic dynamics to engineering‑scale degradation evolution. We demonstrate a paradigm shift in atomistic simulation: AtomWorld enables atomistic simulation of RPV steel across year‑and‑meter scales for the first time, extending direct atomistic modeling to ten‑quintillion‑atom systems and achieving a time‑to‑solution of 1.71 days for one simulated service year. These capabilities are sustained across five leadership supercomputers with 92‑97% scaling efficiency and peak performance up to 1.27 EFLOP/s, corresponding to 48% of the Lineshine peak FP64 performance.

Abstract:
Short‑term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion‑aware human‑computer interaction [1‑3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics [4‑5]. This paper investigates whether facial expression‑derived emotion embeddings can provide auxiliary conditional signals for short‑term pose prediction. To further evaluate multimodal conditioning in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15‑step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive rollout prediction using a recurrent sequence model based on a two‑layer LSTM architecture. Experiments were conducted on two small‑scale pose‑emotion video datasets: controlled motion sequences with minimal facial expression changes, and natural emotion‑driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly improves performance on emotion‑driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression‑derived emotion embeddings into emotion‑conditional short‑term pose prediction based on a lightweight predictive world model architecture is a feasible approach.

Abstract:
The proliferation of agentic artificial intelligence has outpaced the conceptual tools needed to characterize agency in computational systems. Prevailing definitions mainly rely on autonomy and goal‑directedness. Here, we argue for a minimal notion open to principled inspection given three criteria: intentionality as action grounded in beliefs and desires, rationality as normatively coherent action entailed by a world model, and explainability as action causally traceable to internal states; we subsequently instantiate these as a partially observable Markov decision process under a variational framework wherein posterior beliefs, prior preferences, and the minimization of expected free energy jointly constitute an agentic action chain. Using a canonical T‑maze paradigm, we evidence how empowerment, formulated as the channel capacity between actions and anticipated observations, serves as an operational metric that distinguishes zero‑, intermediate‑, and high‑agency phenotypes through structural manipulations of the generative model. We conclude by arguing that as agents engage in epistemic foraging to resolve ambiguity, the governance controls that remain effective must shift systematically from external constraints to the internal modulation of prior preferences, offering a principled, variational bridge from computational phenotyping to AI governance strategy.
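
Empowerment as channel capacity can be made concrete with the standard Blahut-Arimoto iteration. The sketch below is illustrative only: the function name, the two toy channels, and the choice of Blahut-Arimoto itself are assumptions, not the paper's T‑maze setup or its estimator.

```python
import math


def channel_capacity(p_o_given_a, iters=200):
    """Blahut-Arimoto estimate of max over p(a) of I(A; O), in bits.

    p_o_given_a[a][o] is the probability of observation o after action a.
    """
    n_a = len(p_o_given_a)
    n_o = len(p_o_given_a[0])
    p_a = [1.0 / n_a] * n_a  # start from a uniform action distribution

    def marginal(p_a):
        return [sum(p_a[a] * p_o_given_a[a][o] for a in range(n_a))
                for o in range(n_o)]

    def kl_to_marginal(a, p_o):
        # D( p(o|a) || p(o) ) in nats; zero-probability terms contribute 0.
        return sum(p * math.log(p / p_o[o])
                   for o, p in enumerate(p_o_given_a[a]) if p > 0)

    for _ in range(iters):
        p_o = marginal(p_a)
        # Re-weight each action by how distinguishable its outcomes are.
        weights = [p_a[a] * math.exp(kl_to_marginal(a, p_o))
                   for a in range(n_a)]
        total = sum(weights)
        p_a = [w / total for w in weights]

    p_o = marginal(p_a)
    nats = sum(p_a[a] * kl_to_marginal(a, p_o) for a in range(n_a))
    return nats / math.log(2)


# Toy channels (hypothetical): actions fully determine vs. never affect outcomes.
full_control = channel_capacity([[1.0, 0.0], [0.0, 1.0]])
no_control = channel_capacity([[0.5, 0.5], [0.5, 0.5]])
```

In these toy cases, two perfectly distinguishable actions give one bit of empowerment, while actions that never change the observation distribution give zero.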

Abstract:
We identify and formalize a novel security risk: Context‑Fragmented Violations (CFVs), a class of policy breaches where individual agent actions appear locally safe and reasonable, yet collectively violate organizational policies because critical policy facts are siloed in different departments' private contexts. Existing prompt‑based alignment mechanisms and monolithic interceptors are poorly matched to violations that span contextual islands. We propose Distributed Sentinel, a distributed zero‑trust enforcement architecture that introduces the Semantic Taint Token (STT) Protocol. Through lightweight sidecar proxies, our system propagates security state across organizational boundaries without exposing raw cross‑domain data, enabling Counterfactual Graph Simulation for cross‑domain policy verification. We construct PhantomEcosystem, a comprehensive benchmark comprising 9 categories of realistic cross‑agent violation scenarios with adversarially balanced safe controls. On this benchmark, Distributed Sentinel achieves F1 = 0.95 with 106ms end‑to‑end latency (16ms verification + 90ms entity extraction on A100), compared to 0.85 F1 for prompt‑based filtering and 0.65 for rule‑based DLP. To empirically validate the need for external enforcement, we evaluate eight frontier LLMs in execution‑oriented multi‑agent workflows with per‑agent domain world models. All models exhibit substantial violation rates (14‑98%), with cross‑domain data flows showing systematically higher violation rates than same‑domain flows. These results indicate that self‑avoidance is unreliable and that multi‑agent security benefits from a centralized enforcement layer operating above individual agents.

Abstract:
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one‑step local transition operators; L2 Simulator, which composes them into multi‑step, action‑conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing‑law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model‑based reinforcement learning, video generation, web and GUI agents, multi‑agent social simulation, and AI‑driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level‑regime pairs, propose decision‑centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next‑step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
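
The step from an L1 Predictor to an L2 Simulator amounts to composing a one-step transition operator into an action-conditioned rollout. The sketch below shows only that composition; `rollout` and `toy_step` are hypothetical names, and enforcing domain laws along the rollout is deliberately left out.

```python
def rollout(step, state, actions):
    """Compose a one-step predictor into a multi-step, action-conditioned rollout."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


def toy_step(state, action):
    # Hypothetical 1D dynamics: the action is an acceleration.
    position, velocity = state
    velocity = velocity + action
    return (position + velocity, velocity)


trajectory = rollout(toy_step, (0, 0), [1, 0, -1])
```

Because every step consumes the previous prediction, errors in the one-step operator compound over the horizon, and that compounding is one reason multi-step simulation is treated as a separate capability level.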

Abstract:
Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute‑level text, failing to orchestrate complex, sequential multi‑agent interactions. To address this semantic‑spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM‑driven Spatio‑Temporal MMDiT equipped with a history‑prefix anchoring strategy to ensure long‑horizon interaction consistency. Furthermore, we introduce OccInteract‑85k, a novel dataset uniquely annotated with multi‑level language instructions, ranging from static layouts to intricate multi‑agent behaviors, alongside a novel VLM‑based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state‑of‑the‑art generation quality and unprecedented instruction‑following capabilities, successfully shifting the paradigm from appearance synthesis to language‑driven behavior orchestration.

Abstract:
Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities ‑ including vision, language, and robotic actions ‑ into a unified token space, modeling them via a single transformer‑based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, allowing it to automatically determine success when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl‑World, and WorldGym, on LIBERO, RoboTwin, and multiple real‑robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.

Abstract:
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self‑supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow‑motion video dataset to date from noisy in‑the‑wild sources. Such slow‑motion footage, typically filmed by high‑speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed‑conditioned video generation, which produces motion at a specified playback speed, and temporal super‑resolution, which transforms low‑FPS, blurry videos into high‑FPS sequences with fine‑grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world‑models that understand how events unfold over time.

Abstract:
Post‑training is essential for turning pretrained generalist robot policies into reliable task‑specific controllers, but existing human‑in‑the‑loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action‑conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human‑in‑the‑World‑Model (Hi‑WM), a post‑training framework that uses a learned world model as a reusable corrective substrate for failure‑targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure‑prone, a human intervenes directly in the model to provide short corrective actions. Hi‑WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post‑training. We evaluate Hi‑WM on three real‑world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi‑WM improves real‑world success by 37.9 points on average over the base policy and by 19.0 points over a world‑model closed‑loop baseline, while world‑model evaluation correlates strongly with real‑world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post‑training.
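
The caching, rollback, and branching behaviour described above can be sketched as a small state cache. This is a hedged illustration rather than the Hi‑WM implementation: `RolloutCache`, `toy_world_model`, and the integer state are all hypothetical.

```python
import copy


class RolloutCache:
    """Caches world-model states so a failure state can be revisited and branched."""

    def __init__(self, initial_state):
        self._states = [copy.deepcopy(initial_state)]

    @property
    def state(self):
        # Latest state of this rollout.
        return self._states[-1]

    def step(self, world_model, action):
        # Advance from the latest cached state and cache the result.
        nxt = world_model(self._states[-1], action)
        self._states.append(copy.deepcopy(nxt))
        return nxt

    def rollback(self, t):
        # Discard everything after step t and return the restored state.
        del self._states[t + 1:]
        return self.state

    def branch(self, t):
        # Start an independent continuation from the cached state at step t.
        return RolloutCache(self._states[t])

    def __len__(self):
        return len(self._states)


def toy_world_model(state, action):
    # Hypothetical dynamics: the state is a single integer.
    return state + action


cache = RolloutCache(0)
for a in [1, 1, -5]:  # policy rollout; the last action stands in for a failure
    cache.step(toy_world_model, a)

# Reuse the state just before the failure (step 2) for two corrections.
fix_a = cache.branch(2)
fix_a.step(toy_world_model, 1)
fix_b = cache.branch(2)
fix_b.step(toy_world_model, 2)
```

Branching twice from the same cached step yields two corrective continuations without re-running the rollout that led to the failure, which is the reuse the abstract attributes to working inside the model rather than on the robot.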

Abstract:
Interactive video generation models such as Genie, YUME, HY‑World, and Matrix‑Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross‑model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM‑based judgments, but none supplies the standardized test conditions ‑‑ identical scenes, identical action sequences, and a unified control interface ‑‑ needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image‑to‑Video world models. WorldMark contributes: (1) a unified action‑mapping layer that translates a shared WASD‑style action vocabulary into each model's native control format, enabling apples‑to‑apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first‑ and third‑person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20‑60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side‑by‑side battles and watch the live leaderboard.

Authors: Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

Abstract:
Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single‑embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open‑H‑Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T‑H is the first open foundation vision‑language‑action model for medical robotics, which is the only evaluated model to achieve full end‑to‑end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29‑step ex vivo suturing sequence. We also train Cosmos‑H‑Surgical‑Simulator, the first action‑conditioned world model to enable multi‑embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large‑scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

Abstract:
Large language models have shown remarkable ability in serial code generation, but they still struggle with parallel code for which training data is comparatively scarce. A common remedy is to use coding agents that interact with external tools, but tool calls can be costly and sometimes impractical, e.g., for partially written code. We propose Parallel‑Code World Models (PCWMs), reasoning LLMs that aim to predict tool outcomes directly from parallel source code. To train PCWMs, we design a novel exploration and data generation pipeline that samples diverse parallel‑coding problems and candidate implementations across multiple domains, then executes them via tools to record data races and performance profiles. From these, we synthesize hindsight reasoning traces that causally connect source code to observed tool outcomes. Fine‑tuning on the resulting data yields noticeable gains, with a 7B‑parameter world model improving from 64.3% to 72.8% accuracy in race‑outcome prediction, while an 8B‑parameter model improves in a performance profiling task from 49.3% to 58.6% accuracy. Furthermore, when open‑weight models were tasked with fixing data races, world‑model feedback improved their race‑fixing rates relative to self‑feedback by 2.7%‑9.1% using our 7B‑parameter world model and by 6.1%‑11.1% using our 14B‑parameter world model. Our results suggest that reasoning models have the potential to serve as practical substitutes for external tool calls in parallel‑coding agents.

Abstract:
Safety‑critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible‑but‑false hypotheses under near‑identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world‑model‑generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual‑exclusivity violation, while separating video versus question consistency. Experiments across open‑source and proprietary video LLMs reveal a large and persistent gap between standard per‑instance QA metrics and quadruple‑level contrastive consistency, with unreliable none‑of‑the‑above rejection as a key bottleneck. Finally, we introduce C‑TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance‑level QA and contrastive consistency.

Abstract:
Real‑time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high‑fidelity, controllable multi‑camera generation, but their inference cost remains a bottleneck for interactive deployment. Existing diffusion caching methods are designed for offline video generation with multiple denoising steps and do not transfer to this scenario: few‑step distilled models have no inter‑step redundancy left for these methods to reuse, and sequence‑level parallelization techniques require future conditioning that closed‑loop interactive generation does not provide. We present X‑Cache, a training‑free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X‑Cache maintains per‑block residual caches that persist across chunks, and applies a dual‑metric gating mechanism over a structure‑ and action‑aware block‑input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X‑Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X‑Cache on X‑world, a production multi‑camera action‑conditioned driving world model built on multi‑block causal DiT with few‑step denoising and rolling KV cache. X‑Cache achieves a 71% block skip rate with a 2.6x wall‑clock speedup while maintaining minimal degradation.
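
The per-block reuse decision can be sketched as follows. Everything specific here is an assumption made for illustration: the two-number `fingerprint`, the thresholds, and the toy blocks are hypothetical, and the abstract does not specify X‑Cache's actual metrics.

```python
def fingerprint(x):
    # Hypothetical two-metric summary of a block input: mean and mean magnitude.
    n = len(x)
    return (sum(x) / n, sum(abs(v) for v in x) / n)


def run_chunk(blocks, x, cache, is_kv_update, thresholds=(0.05, 0.05)):
    """Run one generation chunk through residual blocks with cross-chunk reuse.

    blocks: list of functions mapping an input vector to a residual vector.
    cache:  dict block index -> (fingerprint, residual), persisting across chunks.
    """
    skipped = 0
    for i, block in enumerate(blocks):
        fp = fingerprint(x)
        entry = cache.get(i)
        # Reuse only if BOTH metrics are close to the cached fingerprint,
        # and never on chunks that write into the persistent KV cache.
        reuse = (
            not is_kv_update
            and entry is not None
            and all(abs(a - b) <= t for a, b, t in zip(fp, entry[0], thresholds))
        )
        if reuse:
            residual = entry[1]
            skipped += 1
        else:
            residual = block(x)
            cache[i] = (fp, residual)
        x = [xi + ri for xi, ri in zip(x, residual)]
    return x, skipped


# Toy residual blocks standing in for transformer blocks.
blocks = [lambda x: [0.1 * v for v in x], lambda x: [0.2 * v for v in x]]
cache = {}
x0 = [1.0, -2.0, 3.0]
out1, skipped1 = run_chunk(blocks, x0, cache, is_kv_update=False)  # cold cache
out2, skipped2 = run_chunk(blocks, x0, cache, is_kv_update=False)  # reuse
out3, skipped3 = run_chunk(blocks, x0, cache, is_kv_update=True)   # forced recompute
```

Forcing full computation on KV update chunks means any approximation introduced by reuse stays confined to the chunks that do not write to the persistent cache.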

Abstract:
Industrial robotic manipulation demands reliable long‑horizon execution across embodiments, tasks, and changing object distributions. While Vision‑Language‑Action models have demonstrated strong generalization, they remain fundamentally reactive. By optimizing the next action given the current observation without evaluating potential futures, they are brittle to the compounding failure modes of long‑horizon tasks. Cortex 2.0 shifts from reactive control to plan‑and‑act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, then committing only to the highest‑scoring candidate. We evaluate Cortex 2.0 on a single‑arm and dual‑arm manipulation platform across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state‑of‑the‑art Vision‑Language‑Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact‑rich manipulation, where reactive policies fail. These results demonstrate that world‑model‑based planning can operate reliably in complex industrial environments.

Abstract:
Autonomous mechanical thrombectomy (MT) presents substantial challenges due to highly variable vascular geometries and the requirements for accurate, real‑time control. While reinforcement learning (RL) has emerged as a promising paradigm for the automation of endovascular navigation, existing approaches often show limited robustness when faced with diverse patient anatomies or extended navigation horizons. In this work, we investigate a world‑model‑based framework for autonomous endovascular navigation built on TD‑MPC2, a model‑based RL method that integrates planning and learned dynamics. We evaluate a TD‑MPC2 agent trained on multiple navigation tasks across held‑out patient‑specific vasculatures and benchmark its performance against the state‑of‑the‑art Soft Actor‑Critic (SAC) algorithm agent. Both approaches are further validated in vitro using patient‑specific vascular phantoms under fluoroscopic guidance. In simulation, TD‑MPC2 demonstrates a significantly higher mean success rate than SAC (58% vs. 36%, p < 0.001), and mean tip contact forces of 0.15 N, well below the proposed 1.5 N vessel rupture threshold. Mean success rates for TD‑MPC2 (68%) were comparable to SAC (60%) in vitro, but TD‑MPC2 achieved superior path ratios (p = 0.017) at the cost of longer procedure times (p < 0.001). Together, these results provide the first demonstration of autonomous MT navigation validated across both held‑out in silico data and fluoroscopy‑guided in vitro experiments, highlighting the promise of world models for safe and generalizable AI‑assisted endovascular interventions.

Abstract:
Large Language Models (LLMs) show promise for generating Register‑Transfer Level (RTL) code from natural language specifications, but single‑shot generation achieves only 60‑65% functional correctness on standard benchmarks. Multi‑agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic‑neural reasoning with adaptive multi‑agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168‑dim state (an alternative world‑model MPC planner is also evaluated); (2) a hybrid symbolic‑neural architecture that solves K‑map and truth‑table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge‑augmented generation from a 321‑pattern base plus 971 open‑source reference implementations with focus‑aware retrieval; and (4) hierarchical specification decomposition into dependency‑ordered sub‑modules with interface synchronization. On VerilogEval‑Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15‑98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self‑reported) and ahead of MAGE (95.9%). On a 302‑problem non‑agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36‑60 percentage‑point lift per category over the published single‑shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE‑RTL despite using roughly 30x fewer per‑problem attempts. A RISC‑V SoC case study demonstrates hierarchical decomposition generating 8/8 lint‑passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.

Abstract:
World models derived from large‑scale video generative pre‑training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high‑fidelity RGB video prediction, which can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion‑based policy head to enable robust end‑to‑end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state‑of‑the‑art RGB‑based world models. Furthermore, real‑world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.

Abstract:
Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety‑critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black‑box Simulator, conditioned on a context signal ξ_t. We develop a sample‑based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score‑based density \hat{p}(u \mid ξ_t) that endows the action space with a Riemannian geometry guiding the Planner's gradient descent. The barrier curvature κ(ξ_t), the minimum curvature of the conditional negative log‑density -\ln \hat{p}(\cdot \mid ξ_t), governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on κ(ξ_t), both of which improve with richer context. Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.

Abstract:
High‑fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed‑loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what‑if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high‑information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline‑optimized strategies, achieving high‑fidelity reconstruction under sparsity across diverse continuum fields.

Abstract:
Recent advances in large‑scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM‑Bench, a manipulation‑centric benchmark for embodiment‑grounded evaluation of video world models. RoboWM‑Bench converts generated human‑hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real‑to‑sim scene reconstruction and diverse manipulation tasks, RoboWM‑Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM‑Bench, we evaluate state‑of‑the‑art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non‑physical geometric distortions, particularly in complex and long‑horizon interactions. These findings provide a more fine‑grained view of current model capabilities and underscore the value of embodiment‑aware evaluation for guiding physically grounded world modeling in robotic manipulation.

Abstract:
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action‑conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single‑agent scenarios and fail to capture the complex interactions inherent in real‑world multi‑agent systems. We present MultiWorld, a unified framework for multi‑agent multi‑view world modeling that enables accurate control of multiple agents while maintaining multi‑view consistency. We introduce the Multi‑Agent Condition Module to achieve precise multi‑agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi‑player game environments and multi‑robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action‑following ability, and multi‑view consistency. Project page: https://multi‑world.github.io/

Abstract:
We introduce Sonata, a compact latent world model for six‑axis trunk IMU representation learning under clinical data scarcity. Clinical cohorts typically comprise tens to hundreds of patients, making web‑scale masked‑reconstruction objectives poorly matched to the problem. Sonata is a 3.77 M‑parameter hybrid model, pre‑trained on a harmonised corpus of nine public datasets (739 subjects, 190k windows) with a latent world‑model objective that predicts future state rather than reconstructing raw sensor traces. In a controlled comparison against a matched autoregressive forecasting baseline (MAE) on the same backbone, Sonata yields consistently stronger frozen‑probe clinical discrimination, prospective fall‑risk prediction, and cross‑cohort transfer across a 14‑arm evaluation suite, while producing higher‑rank, more structured latent representations. At 3.77 M parameters the model is compatible with on‑device wearable inference, offering a step toward general kinematic world models for neurological assessment.

Abstract:
Recent studies reveal striking representational alignment between artificial neural networks (ANNs) and biological brains, leading to proposals that all sufficiently capable systems converge on universal representations of reality. Here, we argue that this claim of Universality is premature. We introduce the Umwelt Representation Hypothesis (URH), proposing that alignment arises not from convergence toward a single global optimum, but from overlap in ecological constraints under which systems develop. We review empirical evidence showing that representational differences between species, individuals, and ANNs are systematic and adaptive, which is difficult to reconcile with Universality. Finally, we reframe ANN model comparison as a method for mapping clusters of alignment in ecological constraint space rather than searching for a single optimal world model.

Abstract:
Vibe coding produces correct, executable code at speed, but leaves no record of the structural commitments, dependencies, or evidence behind it. Reviewers cannot determine what invariants were assumed, what changed, or why a regression occurred. This is not a generation failure but a control failure: the dominant artifact of AI‑assisted development (code plus chat history) performs dimension collapse, flattening complex system topology into low‑dimensional text and making systems opaque and fragile under change. We propose Agentic Consensus: a paradigm in which the consensus layer C, an operable world model represented as a typed property graph, replaces code as the primary artifact of engineering. Executable artifacts are derived from C and kept in correspondence via synchronization operators Phi (realize) and Psi (rehydrate). Evidence links directly to structural claims in C, making every commitment auditable and under‑specification explicit as measurable consensus entropy rather than a silent guess. Evaluation must move beyond code correctness toward alignment fidelity, consensus entropy, and intervention distance. We propose benchmark task families designed to measure whether consensus‑based workflows reduce human intervention compared to chat‑driven baselines.

Abstract:
World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego‑vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure‑centric world models offer a fundamentally complementary capability: the bird's‑eye, multi‑sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio‑temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long‑term behavioral distributions including rare safety‑critical events, while vehicle‑borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure‑centric World Models (I‑WM) in three phases: (I) generative scene understanding with quality‑aware uncertainty propagation, (II) physics‑informed predictive dynamics with multi‑agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual‑layer architecture, annotation‑free perception as a multi‑modal data engine feeding end‑to‑end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I‑WM relative to LeCun's JEPA, Li Fei‑Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I‑VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi‑LiDAR pipelines and identifies open‑source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.

Abstract:
Vision‑Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video‑LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub‑goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual‑Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub‑goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark‑Centric World Model to retrospectively predict object‑centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real‑world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a 24.7% gain on long‑horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

Abstract:
Multi‑turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state‑of‑the‑art models. Existing alignment‑based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world‑model‑based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per‑turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi‑turn jailbreak benchmarks (XGuard‑Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06‑1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
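The CUSUM component described above can be illustrated with a minimal sketch of the standard one‑sided CUSUM rule. This is not SAFEDREAM's implementation: the function name `cusum_alarm`, the `drift` and `threshold` values, and the per‑turn risk scores are all hypothetical, chosen only to show how weak per‑turn signals accumulate into an alarm.

```python
def cusum_alarm(risk_scores, drift=0.5, threshold=0.5):
    """One-sided CUSUM over per-turn risk scores (illustrative values only).

    Each turn adds (score - drift) to a running statistic that is clipped
    at zero, so low-risk turns cannot build up "negative credit".
    Returns the 0-based index of the first turn whose statistic exceeds
    the threshold, or None if no alarm is raised.
    """
    statistic = 0.0
    for turn, score in enumerate(risk_scores):
        statistic = max(0.0, statistic + score - drift)
        if statistic > threshold:
            return turn
    return None


# Hypothetical conversations: each score is a weak per-turn risk signal.
escalating = [0.6, 0.7, 0.8, 0.9]   # no single turn is decisive on its own
benign = [0.2, 0.3, 0.1, 0.4]       # every turn stays below the drift term

alarm_turn = cusum_alarm(escalating)
no_alarm = cusum_alarm(benign)
```

In this toy setting the alarm fires once the accumulated excess over `drift` crosses `threshold`, while the benign conversation never raises one, which is the behaviour a per‑turn detector evaluating each turn independently cannot provide.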

Abstract:
Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while observations in turn affect the quality of sensing decisions. This circular information structure makes physical awareness a fundamentally challenging decision problem with partial observations. While the past decade has witnessed the unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observation, such as POMDPs, remain largely open: real‑world exploration is excessively costly, while sim‑to‑real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World‑model with 4D‑informed Retrieval for Exploration), an awareness‑centric generative world model that provides a sensor‑native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action‑conditioned observation process. This is done by combining 4D‑informed evidence retrieval, action‑conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry‑aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.

Abstract:
This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human‑like" cognitive capability in their world models. Evaluating these claims requires proper grounding in the first principles of Cognitive Architecture Theory (CAT). We present a unified conceptual framework for world models that incorporates all the cognitive functions associated with CAT (i.e., memory, perception, language, reasoning, imagining, motivation, and meta‑cognition) and identify gaps in the research as a guide for future work. In particular, we find that motivation (especially intrinsic motivation) and meta‑cognition remain drastically under‑researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions that prior taxonomies have not.

Abstract:
We present the Global Neural World Model (GNWM), a self‑stabilizing framework that achieves topological quantization through balanced continuous entropy constraints. Operating as a continuous, action‑conditioned Joint‑Embedding Predictive Architecture (JEPA), the GNWM maps environments onto a discrete 2D grid, enforcing translational equivariance without pixel‑level reconstruction. Our results show this architecture prevents manifold drift during autoregressive rollouts by using grid "snapping" as a native error‑correction mechanism. Furthermore, by training via maximum entropy exploration (random walks), the model learns generalized transition dynamics rather than memorizing specific expert trajectories. We validate the GNWM across passive observation, active agent control, and abstract sequence regimes, demonstrating its capacity to act not just as a spatial physics simulator, but as a causal discovery model capable of organizing continuous, predictable concepts into structured topological maps.

Abstract:
Deploying generative World‑Action Models for manipulation is severely bottlenecked by redundant pixel‑level reconstruction, \mathcal{O}(T) memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual‑State Test‑Time Training (TTT) Memory that guarantees a strict \mathcal{O}(1) footprint for long‑horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about 50%. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics‑grounded trajectories during training. Extensive experiments validate that CLWM achieves state‑of‑the‑art performance in complex dual‑arm simulation and unprecedented zero‑shot sim‑to‑real transfer on physical robots, outperforming baselines explicitly finetuned on real‑world data.

Abstract:
Video‑generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving/composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT‑based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety‑critical embodied deployment.

Abstract:
LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of the robustness of their code understanding, using a standard program‑output prediction task. Our results reveal a stark divergence in model behavior: while open‑source reasoning models (DeepSeek‑R1 family) maintain stable, albeit somewhat lower, accuracies (38% to 67%) under code transformations and input perturbations, the frontier model GPT‑5.2 exhibits significant brittleness. Despite achieving a near‑perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non‑exception behaviors. Our findings both point to limitations in the way all models understand code, and establish the value of using perturbation to evaluate code models.

Abstract:
Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object‑search task as a high‑dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Building on a carefully designed perception module, we propose a novel online POMDP solver, the growing neural process filtered k‑center clustering tree (GNPF‑kCT), to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for the growing state space, a neural process network to filter useless primitive actions, and k‑center clustering hypersphere discretization for efficient refinement of high‑dimensional action spaces. A modified upper‑confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid‑world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP‑based baselines and state‑of‑the‑art (SOTA) non‑POMDP‑based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real‑world tests in office environments confirm the practical applicability of our approach. Project page: https://sites.google.com/view/gnpfkct.

Abstract:
Ad hoc wireless networks exhibit complex, innate, and coupled dynamics (node mobility, energy depletion, and topology change) that are difficult to model analytically. Model‑free deep reinforcement learning requires sustained online interaction, whereas existing model‑based approaches use flat state representations that lose per‑node structure. We therefore propose G‑RSSM, a graph‑structured recurrent state‑space model that maintains per‑node latent states with cross‑node multi‑head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster‑head selection policy is trained entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN, and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only for N=50. We thus present the first multi‑physics graph‑structured world model applied to combinatorial per‑node decision making in size‑agnostic wireless ad hoc networks.

Abstract:
World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text‑based environments, world models are typically evaluated and trained with single‑step metrics such as Exact Match, aiming to improve the similarity between predicted and real‑world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior‑aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step‑level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world‑model‑predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR‑based training improves long‑term alignment in several settings, with the clearest gains in WebShop and less movement in near‑ceiling regimes, while preserving or improving single‑step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference‑time lookahead planning.
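The step‑level idea behind BehR can be sketched as follows. The abstract does not give the exact functional form, so this sketch assumes one plausible choice, the negative absolute difference of log‑likelihoods; the function name `behavior_consistency`, the toy reference agent, and the state strings are all hypothetical.

```python
import math


def behavior_consistency(ref_agent, real_state, predicted_state, logged_action):
    """Illustrative behaviour-consistency score (assumed form, not the paper's).

    Compares how likely the frozen reference agent finds the logged next
    action in the real state versus the world-model-predicted state.
    The score is 0 when both likelihoods match and grows more negative
    as they diverge.
    """
    p_real = ref_agent(real_state)[logged_action]
    p_pred = ref_agent(predicted_state)[logged_action]
    return -abs(math.log(p_real) - math.log(p_pred))


def toy_reference_agent(state):
    """Hypothetical frozen agent: returns a distribution over two actions."""
    if "buy" in state:
        return {"click[buy]": 0.8, "search": 0.2}
    return {"click[buy]": 0.1, "search": 0.9}


real = "product page with buy button"
faithful_prediction = "product page with buy button"
degraded_prediction = "empty results page"

score_faithful = behavior_consistency(
    toy_reference_agent, real, faithful_prediction, "click[buy]")
score_degraded = behavior_consistency(
    toy_reference_agent, real, degraded_prediction, "click[buy]")
```

A predicted state that preserves the cues the agent acts on scores 0, while one that drops them is penalised even if it shares many tokens with the real state, which is the gap that single‑step metrics such as Exact Match do not capture.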

Abstract:
Vision‑and‑Language Navigation for Unmanned Aerial Vehicles (UAV‑VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high‑level human commands and execute long‑horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision‑Language Models (VLMs), Vision‑Language‑Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically‑grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real‑world deployment: the simulation‑to‑reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource‑constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward‑looking research roadmap to guide future inquiry into key frontiers such as multi‑agent swarm coordination and air‑ground collaborative robotics.

Abstract:
At its core, robotic manipulation is a problem of vision‑to‑geometry mapping (f(v) \rightarrow G). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision‑geometry backbone, rather than the widely adopted vision‑language or video models. Conventional VLA and video‑predictive models rely on backbones pretrained on large‑scale 2D image‑text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision‑Geometry‑Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision‑to‑geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top‑tier VLA baselines including π_0.5 and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero‑shot generalization to unseen viewpoints in real‑world deployments, consistently outperforming π_0.5. These results highlight that operating on native 3D representations, rather than translating through language or 2D video priors, is a highly promising direction for achieving generalizable physical intelligence.

Abstract:
Marker‑based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real‑world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real‑world human interactions. Yet, existing benchmarks often lack realistic multi‑person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi‑person scenarios with intricate motions, frequent inter‑person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi‑view RGB and depth sequences, accurate camera calibration, ground‑truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL‑X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state‑of‑the‑art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine‑tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

Abstract:
In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image in advance of task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision‑language‑aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, as reflected by the similarity of their embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM‑based VLAs in semantic generalization. On the proposed WISER benchmark, GWM‑MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.
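The scoring step described above, ranking action proposals by how close their predicted outcome lies to the instruction in a shared embedding space, can be sketched in a few lines. This assumes cosine similarity as the similarity measure and uses made‑up 2‑D vectors in place of real world‑model and text‑encoder outputs; `cosine` and `select_action` are hypothetical names, not part of GWM‑MPC.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_action(predicted_embeddings, instruction_embedding):
    """Score each action proposal and return (best_index, scores).

    predicted_embeddings[i] stands in for the world model's predicted
    outcome embedding of proposal i; instruction_embedding stands in for
    the embedding of the language instruction in the same aligned space.
    """
    scores = [cosine(e, instruction_embedding) for e in predicted_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 2-D embeddings for three action proposals (illustrative values only).
proposals = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
instruction = [0.0, 2.0]

best_index, similarity_scores = select_action(proposals, instruction)
```

Because the goal enters only through `instruction_embedding`, swapping the instruction changes which proposal wins without requiring a goal image.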

Abstract:
For decades, medical robotic solutions have mostly been based on the concept of tele‑manipulation. While their design was extremely intelligent, allowing for better access, improved dexterity, reduced tremor, and improved imaging, their own intelligence was limited. They therefore left cognition and decision making to the surgeon. As medical robotics advances towards high‑level autonomy, the scientific community needs to explore the required pathway towards partial and full autonomy. Here, we introduce the concept of Dyadic Partnership (DP), a new paradigm in which robots and clinicians engage in intelligent, expert interaction and collaboration. The Dyadic Partners would discuss and agree on decisions and actions during their dynamic and interactive collaboration, relying also on intuitive advanced media using generative AI, such as a world model, and advanced multi‑modal visualization. This article outlines the foundational components needed to enable such systems, including foundation models for clinical intelligence, multi‑modal intent recognition, co‑learning frameworks, advanced visualization, and explainable, trust‑aware interaction. We further discuss key challenges such as data scarcity, lack of standardization, and ethical acceptance. Dyadic Partnership is positioned as a powerful yet achievable and acceptable milestone, offering a promising pathway toward safer, more intuitive collaboration and a gradual transition to full autonomy across diverse clinical settings.

Abstract:
We present 3D‑Anchored Lookahead Planning (3D‑ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D‑consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D‑ALP maintains a persistent camera‑to‑world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5‑step sequential reach task requiring spatial memory (Experiment E3), 3D‑ALP achieves a 0.650 ± 0.109 success rate on memory‑required steps versus 0.006 ± 0.008 for a greedy reactive baseline (Δ=+0.645), while step 5 success reaches 0.822 against 0.000 for greedy. An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT‑MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.
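
For readers unfamiliar with UCT, the following minimal sketch shows the textbook selection rule that MCTS variants typically build on. It is not the modified procedure from the paper, whose fixes for continuous manipulation are not described in the abstract. The function names and the default exploration constant are assumptions.

```python
import math
from typing import Sequence, Tuple


def uct_score(
    total_value: float,
    visits: int,
    parent_visits: int,
    exploration: float = math.sqrt(2.0),
) -> float:
    """Textbook UCT score: mean value plus an exploration bonus.

    Unvisited children get +inf so that each is tried at least once.
    """
    if visits == 0:
        return math.inf
    mean_value = total_value / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus


def select_child(
    children: Sequence[Tuple[float, int]],
    parent_visits: int,
    exploration: float = math.sqrt(2.0),
) -> int:
    """Return the index of the child (total_value, visits) with the highest UCT score."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(
            children[i][0], children[i][1], parent_visits, exploration
        ),
    )


# Toy usage: a well-explored strong child versus a rarely visited weak one.
toy_children = [(9.0, 10), (0.5, 1)]
greedy_choice = select_child(toy_children, parent_visits=11, exploration=0.0)
uct_choice = select_child(toy_children, parent_visits=11)
```

With the exploration term switched off the search exploits the well-visited child, while the default constant favors the rarely visited one. This is the exploration-exploitation trade-off that tree search relies on.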

Abstract:
Semiconductor supply chains face unprecedented resilience challenges amidst global geopolitical turbulence. Conventional Large Language Model (LLM) planners, when confronting such non‑stationary "Policy Black Swan" events, frequently suffer from Decision Paralysis or a severe Grounding Gap due to the absence of physical environmental modeling. This paper introduces ReflectiChain, a cognitive agentic framework tailored for resilient macroeconomic supply chain planning. The core innovation lies in the integration of Latent Trajectory Rehearsal powered by a generative world model, which couples reflection‑in‑action (System 2 deliberation) with delayed reflection‑on‑action. Furthermore, we leverage a Retrospective Agentic RL mechanism to enable autonomous policy evolution during the deployment phase (test‑time). Evaluations conducted on our high‑fidelity benchmark, Semi‑Sim, demonstrate that under extreme scenarios such as export bans and material shortages, ReflectiChain achieves a 250% improvement in average step rewards over the strongest LLM baselines. It successfully restores the Operability Ratio (OR) from a deficient 13.3% to over 88.5% while ensuring robust gradient convergence. Ablation studies further underscore that the synergy between physical grounding constraints and double‑loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long‑horizon strategic planning.

Abstract:
Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi‑step planning and spatial abstraction. Across comprehensive experiments with Gemini‑2.5‑Flash, GPT‑5‑mini, Claude‑Haiku‑4.5, and DeepSeek‑Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain‑of‑thought prompting, Gemini achieves 80‑86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16‑34% with visual grid formats, which is a 2‑5x difference, suggesting representation‑dependent rather than format‑invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96‑99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze‑solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation‑specific and prompting‑dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.

Abstract:
Young children demonstrate early abilities to understand their physical world, estimating depth, motion, object coherence, interactions, and many other aspects of physical scene understanding. Children are both data‑efficient and flexible cognitive systems, creating competence despite extremely limited training data, while generalizing to myriad untrained tasks, a major challenge even for today's best AI systems. Here we introduce a novel computational hypothesis for these abilities, the Zero‑shot Visual World Model (ZWM). ZWM is based on three principles: a sparse temporally‑factored predictor that decouples appearance from dynamics; zero‑shot estimation through approximate causal inference; and composition of inferences to build more complex abilities. We show that ZWM can be learned from the first‑person experience of a single child, rapidly generating competence across multiple physical understanding benchmarks. It also broadly recapitulates behavioral signatures of child development and builds brain‑like internal representations. Our work presents a blueprint for efficient and flexible learning from human‑scale data, advancing both a computational account for children's early physical understanding and a path toward data‑efficient AI systems.

Abstract:
Standard Chain‑of‑Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state‑space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object‑Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_{\text{state}}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_{\text{control}}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three‑stage training pipeline combining Supervised Fine‑Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome‑based rewards from the final plan to implicitly optimize the underlying object‑oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom‑30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.

Abstract:
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM‑based auto‑labeling is often noisy because the primary data sources lack accurate human action labels, chain‑of‑thought (CoT), and spatial annotations; these errors are amplified during long‑horizon spatial instruction following. These issues stem from insufficient coverage of minute‑long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world‑model synthesis can hallucinate objects, skip steps, or fail to respect real‑world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think‑aloud capture pipeline for egocentric data. It uses a say‑before‑act protocol to record step‑by‑step goals and spoken reasoning with word‑level timestamps, then calibrates physical properties with metric‑scale spatial estimators, a memory‑bank walkthrough for scene context, and clip‑level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long‑horizon generation over minute‑long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open‑world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long‑horizon planning and reasoning, step‑wise reasoning, instruction following, and spatial grounding.

Abstract:
World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision‑making requires reasoning about latent disease burden, imperfect and policy‑dependent surveillance signals, and intervention effects that are mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy‑relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time‑lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.
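
One way to write down the three properties (i) to (iii) is as a controlled, partially observed dynamical system. The notation below is an assumption made for illustration and does not come from the paper: $s_t$ is the latent epidemic state, $o_t$ the surveillance observation, $a_t$ the intervention, and $\pi$ a decision rule.

```latex
\begin{aligned}
s_{t+1} &\sim p(s_{t+1} \mid s_t, a_t)
  && \text{(latent state evolves under the current intervention)} \\
o_t &\sim p(o_t \mid s_t, a_{t-1})
  && \text{(noisy observation that depends on past policy)} \\
a_t &= \pi(o_{1:t}, a_{1:t-1})
  && \text{(intervention chosen from the observed history only)}
\end{aligned}
```

Under this reading, counterfactual analysis amounts to holding the history $(o_{1:t}, a_{1:t-1})$ fixed and rolling the transition model forward under alternative action sequences $a_{t:t+H}$.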

Abstract:
Recent advances in robot foundation models trained on large‑scale human teleoperation data have enabled robots to perform increasingly complex real‑world tasks. However, scaling these systems remains difficult because collecting task‑specific demonstrations is expensive and labor‑intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World‑Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video‑action alignment, while two‑stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow‑matching‑based dual‑stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross‑modal consistency during generation. Across both simulated and real‑world settings, VAG produces aligned video‑action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world‑action model for embodied data synthesis.

Abstract:
Vision‑Language‑Action (VLA) models have recently achieved notable progress in end‑to‑end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA‑World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA‑World first uses an action‑derived feasible trajectory to guide the generation of the next‑frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self‑generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes‑GR‑20K, a generative reasoning dataset derived from nuScenes, and employ a three‑stage training strategy that includes pretraining, supervised fine‑tuning, and reinforcement learning. Extensive experiments demonstrate that VLA‑World consistently surpasses state‑of‑the‑art VLA and world‑model baselines on both planning and future‑generation benchmarks. Project page: https://vlaworld.github.io

Abstract:
Model‑based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy‑only, discarding value information, or reward‑based, which becomes myopic when the diffusion horizon is short. We introduce Advantage‑Guided Diffusion for MBRL (AGD‑MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long‑term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state‑action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD‑MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD‑style architectures by guiding the state components while leaving action generation policy‑conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD‑MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser‑style reward guide, and model‑free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage‑aware guidance is a simple, effective remedy for short‑horizon myopia in diffusion‑model MBRL.
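
The abstract does not give the functional form of the two guides, so the sketch below only illustrates the general idea of reweighting samples with weights that increase in the advantage. The sigmoid and exponential forms, the `temperature` parameter, and all function names are assumptions, and the sketch reweights finished samples rather than steering a reverse diffusion process.

```python
import math
from typing import List, Sequence


def sigmoid_weight(advantage: float, temperature: float = 1.0) -> float:
    """Bounded weight in (0, 1), increasing in the advantage (stable form)."""
    z = advantage / temperature
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def exponential_weight(advantage: float, temperature: float = 1.0) -> float:
    """Unbounded weight, increasing in the advantage."""
    return math.exp(advantage / temperature)


def normalize(weights: Sequence[float]) -> List[float]:
    """Turn non-negative weights into sampling probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage: three trajectories with estimated advantages.
toy_advantages = [-1.0, 0.0, 2.0]
sag_probs = normalize([sigmoid_weight(a) for a in toy_advantages])
eag_probs = normalize([exponential_weight(a) for a in toy_advantages])
```

In both cases the trajectory with the highest advantage receives the largest sampling probability. The exponential form concentrates mass more sharply, whereas the sigmoid form saturates for large advantages.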

Abstract:
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory‑enabled long‑term temporal consistency and high‑resolution real‑time generation, limiting their applicability in real‑world scenarios. To address this, we present Matrix‑Game 3.0, a memory‑augmented interactive world model designed for 720p real‑time longform video generation. Building upon Matrix‑Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial‑scale infinite data engine that integrates Unreal Engine‑based synthetic data, large‑scale automated collection from AAA games, and real‑world video augmentation to produce high‑quality Video‑Pose‑Action‑Prompt quadruplet data at scale. Second, we propose a training framework for long‑horizon consistency: by modeling prediction residuals and re‑injecting imperfect generated frames during training, the base model learns self‑correction; meanwhile, camera‑aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi‑segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real‑time inference. Experimental results show that Matrix‑Game 3.0 achieves up to 40 FPS real‑time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute‑long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial‑scale deployable world models.

Abstract:
Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline‑to‑online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model‑based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty‑penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine‑tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior‑driven initialization to task‑specific adaptation. We show that the uncertainty‑penalized objective provides a lower bound on the true return and derive a finite‑sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
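
The generate-then-filter step can be sketched as follows. The sketch assumes that each imagined trajectory carries per-step rewards and per-step epistemic uncertainty estimates (for example from ensemble disagreement, which is an assumption). The `Trajectory` class, the linear penalty, and the thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Trajectory:
    rewards: List[float]        # per-step predicted rewards
    uncertainties: List[float]  # per-step epistemic uncertainty estimates


def penalized_return(traj: Trajectory, penalty: float) -> float:
    """Return with a linear uncertainty penalty subtracted at every step."""
    return sum(r - penalty * u for r, u in zip(traj.rewards, traj.uncertainties))


def filter_trajectories(
    trajectories: Sequence[Trajectory],
    min_return: float,
    max_mean_uncertainty: float,
) -> List[Trajectory]:
    """Keep trajectories with high return and low mean uncertainty."""
    kept = []
    for traj in trajectories:
        if not traj.uncertainties:
            continue  # skip empty trajectories instead of dividing by zero
        mean_uncertainty = sum(traj.uncertainties) / len(traj.uncertainties)
        if sum(traj.rewards) >= min_return and mean_uncertainty <= max_mean_uncertainty:
            kept.append(traj)
    return kept


# Toy usage: one reliable, one uncertain, and one low-return trajectory.
reliable = Trajectory(rewards=[1.0, 1.0], uncertainties=[0.1, 0.1])
uncertain = Trajectory(rewards=[2.0, 2.0], uncertainties=[0.9, 0.9])
low_return = Trajectory(rewards=[0.1, 0.1], uncertainties=[0.0, 0.0])
toy_kept = filter_trajectories(
    [reliable, uncertain, low_return], min_return=1.0, max_mean_uncertainty=0.5
)
```

Only the first trajectory survives: the second has high return but is too uncertain, and the third is certain but not rewarding enough.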

Abstract:
Multi‑agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information) rather than decision quality, we introduce SeqComm‑DFL, which unifies sequential communication with decision‑focused learning for task performance. Our approach features value‑aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The guidance potential is determined by their prosocial ordering. We extend Optimal Model Design to communication‑augmented world models with QMIX factorization, enabling efficient end‑to‑end training via implicit differentiation. We prove information‑theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi‑Agent Challenge (SMAC) benchmarks, SeqComm‑DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

Abstract:
World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware‑locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero‑shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot's engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero‑shot control across a range of embodiments. We introduce, for the first time, a world model that enables zero‑shot generalization to new morphologies for locomotion. We carefully study the limitations of our method: QWM operates as a distribution‑bounded interpolator within the quadrupedal morphology family rather than a universal physics engine. Nevertheless, this work represents a significant step toward morphology‑conditioned world models for legged locomotion.

Abstract:
Recent years have seen remarkable progress in autonomous driving, yet generalization to long‑tail and open‑world scenarios remains a major bottleneck for large‑scale deployment. To address this challenge, some works use LLMs and VLMs for vision‑language understanding and reasoning, enabling vehicles to interpret rare and safety‑critical situations when generating actions. Others study generative world models to capture the spatio‑temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM‑based multimodal understanding with generative world models for end‑to‑end closed‑loop driving. Given multi‑view camera inputs and natural‑language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio‑temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large‑scale pretraining. We further propose a progressive three‑stage training strategy, from vision pretraining to multi‑step long‑horizon driving, to improve stability and performance. LMGenDrive supports both low‑latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed‑loop benchmarks, with clear gains in instruction following, spatio‑temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision‑making systems.

Abstract:
The generation of planning domains from natural language descriptions remains an open problem even with the advent of large language models and reasoning models. Recent work suggests that while LLMs have the ability to assist with domain generation, they are still far from producing high quality domains that can be deployed in practice. To this end, we investigate the ability of an agentic language model feedback framework to generate planning domains from natural language descriptions that have been augmented with a minimal amount of symbolic information. In particular, we evaluate the quality of the generated domains under various forms of symbolic feedback, including landmarks, and output from the VAL plan validator. Using these feedback mechanisms, we experiment using heuristic search over model space to optimize domain quality.

Abstract:
Mobile traffic prediction is a fundamental yet challenging problem for wireless network planning and optimization. Existing models focus on learning static long‑term temporal patterns in mobile traffic series, which limits their ability to capture the dynamics between mobile traffic and network parameter adjustments. In this paper, we propose MobiWM, a world model for mobile networks. Taking mobile traffic as the system state, MobiWM models the dynamics between the states and network parameter actions, including power, azimuth, mechanical tilt, and electrical tilt through a predictive backbone. It fuses multimodal environmental contexts, comprising both image and sequential data, with encoded actions, leveraging shared spatial semantics to enhance spatial understanding. Leveraging the capacity of world models to capture real‑world operational dynamics, MobiWM supports unlimited‑horizon rollout over continuous network‑adjustment action trajectories, providing operators with an explorable counterfactual simulation environment for network planning and optimization. Extensive experiments on variable‑parameter mobile traffic data covering 31,900 cells across 9 districts demonstrate that MobiWM achieves the best distributional fidelity across all evaluation scenarios, significantly outperforming existing traffic prediction baselines and representative world models. A downstream RL‑based case study further validates MobiWM as a simulation environment for network optimization, establishing a new paradigm for digital twin‑driven wireless network management.

Abstract:
Vision‑language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look‑ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher‑student framework that converts world‑model‑generated futures into persistent semantic‑spatial structure and planning‑derived supervision. Its world‑model‑driven teacher builds semantic‑spatial memory from generated videos, grounds task‑relevant targets and obstacles, and produces trajectory pseudo‑labels through explicit planning. A lightweight student with a multi‑hypothesis trajectory head is then trained to predict navigation trajectories directly from vision‑language inputs. On Target‑Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open‑source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action‑ready imagined evidence than in synthesizing structured supervision for navigation learning.

Abstract:
In this work, CausalVAE is introduced as a plug‑in structural module for latent world models and is attached to diverse encoder‑transition backbones. Across the reported benchmarks, competitive factual prediction is preserved and intervention‑aware counterfactual retrieval is improved after the plug‑in is added, suggesting stronger robustness under distribution shift and interventions. The largest gains are observed on the Physics benchmark: when averaged over 8 paired baselines, CF‑H@1 is improved by +102.5%. In a representative GNN‑NLL setting on Physics, CF‑H@1 is increased from 11.0 to 41.0 (+272.7%). Through causal analysis, learned structural dependencies are shown to recover meaningful first‑order physical interaction trends, supporting the interpretability of the learned latent causal structure.

Abstract:
Model‑based reinforcement learning (MBRL) improves sample efficiency by optimizing policies inside imagined rollouts, but long‑horizon planning degrades when model errors compound and imagined trajectories drift off the training manifold. We introduce GIRL (Generative Imagination Reinforcement Learning), a latent world‑model framework that addresses this failure mode with two key components. First, a cross‑modal grounding signal derived from a frozen foundation model (DINOv2) anchors the latent transition prior to a semantically consistent embedding space, penalizing inconsistent or implausible predictions. Second, an uncertainty‑adaptive trust‑region bottleneck interprets the KL regularizer as the Lagrange multiplier of a constrained optimization problem, restricting imagination drift within a learned region calibrated by Expected Information Gain and a Relative Performance Loss signal. We re‑derive a value‑gap bound using the Performance Difference Lemma and Integral Probability Metrics, yielding a bound that remains informative as the discount factor approaches one and connects the objective to real‑environment regret. Experiments across three benchmark suites, including DeepMind Control, Adroit Hand Manipulation, and Meta‑World with visual distractors, show that GIRL reduces latent rollout drift by 38 to 61 percent across tasks relative to DreamerV3, improves asymptotic return, and requires fewer environment interactions on long‑horizon tasks. GIRL also outperforms TD‑MPC2 on sparse‑reward and high‑contact settings under standard evaluation metrics. A distilled‑prior variant reduces inference overhead and improves computational efficiency relative to the full model.

Abstract:
Building world models with spatial consistency and real‑time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO‑WORLD, a novel real‑time framework capable of recovering and generating high‑fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: an Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation, ensuring global consistency during long‑horizon navigation; and an Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real‑world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over‑reliance on synthetic data. Extensive experiments demonstrate that INSPATIO‑WORLD significantly outperforms existing state‑of‑the‑art (SOTA) models in spatial consistency and interaction precision, ranking first among real‑time interactive methods on the WorldScore‑Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

Abstract:
The integration of machine learning tools into telecom networks has led to two prevailing paradigms, namely, language‑based systems, such as Large Language Models (LLMs), and physics‑based systems, such as Digital Twins (DTs). While LLM‑based approaches enable flexible interaction and automation, they lack explicit representations of network dynamics. DTs, in contrast, offer a high‑fidelity network simulation, but remain scenario‑specific and are not designed for learning or decision‑making under uncertainty. This gap becomes critical for 6G systems, where decisions must take into account the evolving network states, uncertainty, and the cascading effects of control actions across multiple layers. In this article, we introduce the Telecom World Model (TWM) concept, an architecture for learned, action‑conditioned, uncertainty‑aware modeling of telecom system dynamics. We decompose the problem into two interacting worlds, a controllable system world consisting of operator‑configurable settings and an external world that captures propagation, mobility, traffic, and failures. We propose a three‑layer architecture, comprising a field world model for spatial environment prediction, a control/dynamics world model for action‑conditioned Key Performance Indicator (KPI) trajectory prediction, and a telecom foundation model layer for intent translation and orchestration. We showcase a comparative analysis between existing paradigms, which demonstrates that TWM jointly provides telecom state grounding, fast action‑conditioned roll‑outs, calibrated uncertainty, multi‑timescale dynamics, model‑based planning, and LLM‑integrated guardrails. Furthermore, we present a proof‑of‑concept on network slicing to validate the proposed architecture, showing that the full three‑layer pipeline outperforms single‑world baselines and accurately predicts KPI trajectories.

Abstract:
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel‑grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low‑dimensional tokens, we translate 7‑DoF robot actions into interpretable action images: multi‑view action videos that are grounded in 2D pixels and explicitly track robot‑arm motion. This pixel‑grounded action representation allows the video backbone itself to act as a zero‑shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video‑action joint generation, action‑conditioned video generation, and action labeling under a shared representation. On RLBench and real‑world evaluations, our model achieves the strongest zero‑shot success rates and improves video‑action joint generation quality over prior video‑space world models, suggesting that interpretable action images are a promising route to policy learning.

Abstract:
Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next‑Token Prediction (NTP) focuses on one‑step‑ahead supervision, Multi‑Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE‑MTP), which anchors predictions to ground‑truth hidden state trajectories. Experiments on synthetic graphs and real‑world Manhattan Taxi Ride show that LSE‑MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.

Abstract:
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real‑world model capabilities. To address this widening gap, we introduce Video‑MME‑v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri‑level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi‑point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per‑question accuracy, we propose a group‑based non‑linear evaluation strategy that enforces both consistency across related queries and coherence in multi‑step reasoning. It penalizes fragmented or guess‑based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video‑MME‑v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human‑hours and up to 5 rounds of quality assurance, Video‑MME‑v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini‑3‑Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high‑level reasoning. We further find that thinking‑based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video‑MME‑v2 establishes a demanding new testbed for the development of next‑generation video MLLMs.

Abstract:
Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three‑dimensional spatio‑temporal representation to a one‑dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi‑hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real‑world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
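
The multi‑hypothesis training described above, where many futures are generated and only the best is supervised, can be illustrated with a small best‑of‑K loss. This is a minimal sketch under stated assumptions, not DeltaWorld's implementation: the function names, the squared‑error distance, and the toy two‑dimensional features are all hypothetical, and a real model would produce the candidate delta tokens with a learned generator.

```python
def apply_delta(features, delta):
    """Next-frame features = current features + predicted delta (assumed additive)."""
    return [f + d for f, d in zip(features, delta)]


def squared_error(pred, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def best_of_k_loss(current, candidate_deltas, target):
    """Return (loss, index) of the candidate future closest to the target.

    Only the winning hypothesis contributes to the loss, so in a training
    loop the gradient would flow through that single candidate.
    """
    losses = [
        squared_error(apply_delta(current, delta), target)
        for delta in candidate_deltas
    ]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner


# Toy example: three hypothetical delta tokens for one frame transition.
current = [0.5, 0.5]
candidate_deltas = [[0.0, 0.0], [0.5, 0.5], [1.0, -0.5]]
target = [1.0, 1.0]
loss, winner = best_of_k_loss(current, candidate_deltas, target)
```

Because losing hypotheses receive no penalty, they are free to cover other plausible futures, which is the usual motivation for this kind of winner‑takes‑all objective.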

Abstract:
Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non‑native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co‑occur in real‑world use. In this study, we use the Trans‑EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed‑ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open‑ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real‑world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.

Abstract:
Generalization is a central challenge in autonomous driving, as real‑world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world‑model‑based planning methods have shown strong capabilities in scene understanding and multi‑modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video‑trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well‑pretrained large‑scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT‑based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long‑duration rollout consistency. DriveVA achieves a closed‑loop PDM score of 90.9 on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero‑shot capability and cross‑domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and by 52.5% and 52.4% on Bench2drive, built on CARLA v2, compared with the state‑of‑the‑art world‑model‑based planner.

Abstract:
Video‑based numerical reasoning provides a premier arena for testing whether Vision‑Language Models (VLMs) truly "understand" real‑world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi‑step numerical logic within the inherent complexity of real‑world multimedia content. We introduce VidNum‑1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human‑annotated video‑question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum‑1.4K is structured into a three‑level hierarchy that evolves from direct visual perception to video‑based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state‑of‑the‑art VLMs reveal a striking reasoning gap: while Gemini‑3.1‑pro barely reaches a 60% accuracy threshold, representative open‑source families struggle heavily in the 25% to 45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum‑1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.

Abstract:
Current AI safety relies on behavioral monitoring and post‑training alignment, yet empirical measurement shows these approaches produce no detectable pre‑commitment signal in a majority of instruction‑tuned models tested. We present an energy‑based governance framework connecting transformer inference dynamics to constraint‑satisfaction models of neural computation, and apply it to a seven‑model cohort across five geometric regimes. Using trajectory tension ($\rho = \|a\| / \|v\|$), we identify a 57‑token pre‑commitment window in Phi‑3‑mini‑4k‑instruct under greedy decoding on arithmetic constraint probes. This result is model‑specific, task‑specific, and configuration‑specific, demonstrating that pre‑commitment signals can exist but are not universal. We introduce a five‑regime taxonomy of inference behavior: Authority Band, Late Signal, Inverted, Flat, and Scaffold‑Selective. Energy asymmetry ($\sum \rho_{\text{misaligned}} / \sum \rho_{\text{aligned}}$) serves as a unifying metric of structural rigidity across these regimes. Across seven models, only one configuration exhibits a predictive signal prior to commitment; all others show silent failure, late detection, inverted dynamics, or flat geometry. We further demonstrate that factual hallucination produces no predictive signal across 72 test conditions, consistent with spurious attractor settling in the absence of a trained world‑model constraint. These results establish that rule violation and hallucination are distinct failure modes with different detection requirements. Internal geometry monitoring is effective only where resistance exists; detection of factual confabulation requires external verification mechanisms. This work provides a measurable framework for inference‑layer governability and introduces a taxonomy for evaluating deployment risk in autonomous AI systems.
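
The two quantities named in the abstract, trajectory tension and energy asymmetry, can be sketched numerically. The abstract does not define a and v, so the code below assumes they are the second and first finite differences of a sequence of hidden states; that reading, the function names, and the toy 2D trajectories are all assumptions made for illustration, not the paper's procedure.

```python
import math


def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in vec))


def diff(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def trajectory_tension(states, eps=1e-12):
    """Per-step tension rho = ||a|| / ||v|| along a state trajectory.

    Assumption: v is the first difference of consecutive states and a is
    the difference of consecutive velocities, paired with the later one.
    """
    velocities = [diff(states[i + 1], states[i]) for i in range(len(states) - 1)]
    tensions = []
    for i in range(len(velocities) - 1):
        accel = diff(velocities[i + 1], velocities[i])
        tensions.append(norm(accel) / max(norm(velocities[i + 1]), eps))
    return tensions


def energy_asymmetry(misaligned_states, aligned_states, eps=1e-12):
    """Ratio of summed tension on a misaligned run to that on an aligned run."""
    numerator = sum(trajectory_tension(misaligned_states))
    denominator = max(sum(trajectory_tension(aligned_states)), eps)
    return numerator / denominator


# Toy trajectories: a sharply turning path versus a nearly straight one.
turning = [[0, 0], [1, 0], [1, 1], [0, 1]]
nearly_straight = [[0, 0], [1, 0], [2, 0], [3, 1]]
ratio = energy_asymmetry(turning, nearly_straight)
```

Under this reading, a trajectory that keeps changing direction accumulates more tension than one that moves steadily, so a ratio well above one indicates that the first run bends far more than the second.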

Abstract:
Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero‑shot when deployed in new environments. However, learned world models often struggle with long‑horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long‑horizon reasoning while substantially reducing inference‑time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world‑model architectures and domains. We demonstrate that this hierarchical approach enables zero‑shot control on real‑world non‑greedy robotic tasks, achieving a 70% success rate on pick‑&‑place using only a final goal specification, compared to 0% for a single‑level world model. In addition, across physics‑based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 4x less planning‑time compute.

Abstract:
Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder‑32B‑Thinking, trained on the data from the Error‑driven Chain‑of‑Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi‑turn dialogue with environmental error feedback, explicitly modeling the error‑correction process. ICWM is trained on domain‑specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self‑verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% on CAD‑Coder and 38.0% on KernelBench) shows InCoder‑32B‑Thinking achieves top‑tier open‑source results across all domains.

Abstract:
Achieving quadruped robot locomotion across diverse and dynamic terrains presents significant challenges, primarily due to the discrepancies between simulation environments and real‑world conditions. Traditional sim‑to‑real transfer methods often rely on manual feature design or costly real‑world fine‑tuning. To address these limitations, this paper proposes the DreamTIP framework, which incorporates Task‑Invariant Properties learning within the Dreamer world model architecture to enhance sim‑to‑real transfer capabilities. Guided by large language models, DreamTIP identifies and leverages Task‑Invariant Properties, such as contact stability and terrain clearance, which exhibit robustness to dynamic variations and strong transferability across tasks. These properties are integrated into the world model as auxiliary prediction targets, enabling the policy to learn representations that are insensitive to underlying dynamic changes. Furthermore, an efficient adaptation strategy is designed, employing a mixed replay buffer and regularization constraints to rapidly calibrate to real‑world dynamics while effectively mitigating representation collapse and catastrophic forgetting. Extensive experiments on complex terrains, including Stair, Climb, Tilt, and Crawl, demonstrate that DreamTIP significantly outperforms state‑of‑the‑art baselines in both simulated and real‑world environments. Our method achieves an average performance improvement of 28.1% across eight distinct simulated transfer tasks. In the real‑world Climb task, the baseline method achieved only a 10% success rate, whereas our method attained a 100% success rate. These results indicate that incorporating Task‑Invariant Properties into Dreamer learning offers a novel solution for achieving robust and transferable robot locomotion.

Abstract:
Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the "distillation ceiling" of synthetic teacher supervision. To transcend these limitations, we propose UI‑Oceanus, a framework that shifts the learning focus from mimicking high‑level trajectories to mastering interaction physics via ground‑truth environmental feedback. Through a systematic investigation of self‑supervised objectives, we identify that forward dynamics, defined as the generative prediction of future interface states, acts as the primary driver for scalability and significantly outweighs inverse inference. UI‑Oceanus leverages this insight by converting low‑cost autonomous exploration, which is verified directly by system execution, into high‑density generative supervision to construct a robust internal world model. Experimental evaluations across a series of models demonstrate the decisive superiority of our approach: models utilizing Continual Pre‑Training (CPT) on synthetic dynamics outperform non‑CPT baselines with an average success rate improvement of 7% on offline benchmarks, which amplifies to a 16.8% gain in real‑world online navigation. Furthermore, we observe that navigation performance scales with synthetic data volume. These results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with robust cross‑domain adaptability and compositional generalization.

Abstract:
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single‑agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi‑subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action‑controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action‑following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

Abstract:
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high‑quality 3D assets are scarce, making 3D synthesis under‑constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D‑native foundation model that unifies text‑to‑2D and text‑to‑3D generation within a single autoregressive framework. Our key insight is that cross‑modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X‑to‑X training paradigm that coordinates diverse cross‑modal tasks over heterogeneous paired datasets without requiring fully aligned text‑image‑3D triplets. By traversing semantic‑visual‑geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi‑view geometric consistency. Experiments show that Omni123 significantly improves text‑guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

Abstract:
General‑purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action‑labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self‑improve. The key idea is to decompose action‑conditioned state prediction into two factors, state plausibility and action reachability, and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action‑free data and the lower dimensionality of action‑relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under‑explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
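
The cycle‑consistency check among subgoals, inferred actions, and forward rollouts can be sketched on a toy grid world. Everything here is a hypothetical stand‑in rather than WAV itself: the dictionary state, the hand‑written inverse model that reads only the position features, and the deliberately buggy forward model are assumptions chosen to show how a mismatch exposes a world‑model error. The sketch covers only the action‑reachability side, not state plausibility.

```python
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def sparse_inverse_model(state, subgoal):
    """Infer the action from the (x, y) features only, ignoring the rest."""
    step = (subgoal["x"] - state["x"], subgoal["y"] - state["y"])
    for name, delta in ACTIONS.items():
        if delta == step:
            return name
    return None  # no single action reaches this subgoal


def flawed_world_model(state, action):
    """Toy forward model with a deliberate bug: 'up' is predicted as a no-op."""
    dx, dy = ACTIONS[action]
    if action == "up":
        dx, dy = 0, 0
    return {**state, "x": state["x"] + dx, "y": state["y"] + dy}


def verify_cycle(world_model, state, subgoal):
    """Label a (state, subgoal) pair by running subgoal -> action -> rollout."""
    action = sparse_inverse_model(state, subgoal)
    if action is None:
        return "unreachable"
    predicted = world_model(state, action)
    matches = (predicted["x"], predicted["y"]) == (subgoal["x"], subgoal["y"])
    return "consistent" if matches else "inconsistent"


start = {"x": 0, "y": 0, "color": "red"}
results = {
    "right": verify_cycle(flawed_world_model, start, {"x": 1, "y": 0}),
    "up": verify_cycle(flawed_world_model, start, {"x": 0, "y": 1}),
    "far": verify_cycle(flawed_world_model, start, {"x": 3, "y": 0}),
}
```

Pairs labeled "inconsistent" mark transitions where the forward model disagrees with the inverse model, which is the kind of signal that could be used to collect corrective data in under‑explored regions.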

Abstract:
Recently, world‑action models (WAM) have emerged to bridge vision‑language‑action (VLA) models and world models, unifying their reasoning and instruction‑following capabilities and spatio‑temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer‑Policy, a unified driving world‑action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi‑view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry‑aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer‑Policy achieves strong performance on both closed‑loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world‑model‑based approaches while producing higher‑quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.

Abstract:
World models, learned internal simulators of environment dynamics, are rapidly becoming foundational to autonomous decision‑making in robotics, autonomous vehicles, and agentic AI. By predicting future states in compressed latent spaces, they enable sample‑efficient planning and long‑horizon imagination without direct environment interaction. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety‑critical deployments. At the alignment layer, world model‑equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking. At the human layer, authoritative world model predictions foster automation bias, miscalibrated trust, and planning hallucination. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five‑profile attacker taxonomy; and develops a unified threat model drawing on MITRE ATLAS and the OWASP LLM Top 10. We provide an empirical proof‑of‑concept demonstrating trajectory‑persistent adversarial attacks on a GRU‑based RSSM ($\mathcal{A}_1 = 2.26\times$ amplification, ‑59.5% reward reduction under adversarial fine‑tuning), validate architecture‑dependence via a stochastic RSSM proxy ($\mathcal{A}_1 = 0.65\times$), and probe a real DreamerV3 checkpoint (non‑zero action drift confirmed). We propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human‑factors design, arguing that world models require the same rigour as flight‑control software or medical devices.

Abstract:
Vision‑based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird's Eye View) or sparse query models, the Gaussian‑centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian‑centric pre‑training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries through self‑supervised reconstruction of multi‑view semantic and depth images. Equipped with fine‑grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian‑flow‑guided latent prediction for downstream occupancy perception and forecasting tasks, and ego‑planning‑guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM achieves significant performance gains across Gaussian‑centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.

Abstract:
Large language models (LLMs) are increasingly embedded in computer science education through AI‑assisted programming tools, yet such workflows often exhibit objective drift, in which locally plausible outputs diverge from stated task specifications. Existing instructional responses frequently emphasize tool‑specific prompting practices, limiting durability as AI platforms evolve. This paper adopts a human‑centered stance, treating human‑in‑the‑loop (HITL) control as a stable educational problem rather than a transitional step toward AI autonomy. Drawing on systems engineering and control‑theoretic concepts, we frame objectives and world models as operational artifacts that students configure to stabilize AI‑assisted work. We propose a pilot undergraduate CS laboratory curriculum that explicitly separates planning from execution and trains students to specify acceptance criteria and architectural constraints prior to code generation. In selected labs, the curriculum also introduces deliberate, concept‑aligned drift to support diagnosis and recovery from specification violations. We report a sensitivity power analysis for a three‑arm pilot design comparing unstructured AI use, structured planning, and structured planning with injected drift, establishing detectable effect sizes under realistic section‑level constraints. The contribution is a theory‑driven, methodologically explicit foundation for HITL pedagogy that renders control competencies teachable across evolving AI tools.

Abstract:
The ability to transform a flat sheet into a complex three‑dimensional structure is a fundamental test of physical intelligence. Unlike cloth manipulation, origami is governed by strict geometric axioms and hard kinematic constraints, where a single invalid crease or collision can invalidate the entire folding sequence. As a result, origami demands long‑horizon constructive reasoning that jointly satisfies precise physical laws and high‑level semantic intent. Existing approaches fall into two disjoint paradigms: optimization‑based methods enforce physical validity but require dense, precisely specified inputs, making them unsuitable for sparse natural language descriptions, while generative foundation models excel at semantic and perceptual synthesis yet fail to produce long‑horizon, physics‑consistent folding processes. Consequently, generating valid origami folding sequences directly from text remains an open challenge. To address this gap, we introduce Learn2Fold, a neuro‑symbolic framework that formulates origami folding as conditional program induction over a crease‑pattern graph. Our key insight is to decouple semantic proposal from physical verification. A large language model generates candidate folding programs from abstract text prompts, while a learned graph‑structured world model serves as a differentiable surrogate simulator that predicts physical feasibility and failure modes before execution. Integrated within a lookahead planning loop, Learn2Fold enables robust generation of physically valid folding sequences for complex and out‑of‑distribution patterns, demonstrating that effective spatial intelligence arises from the synergy between symbolic reasoning and grounded physical simulation.

Abstract:
4D generation, or dynamic 3D content generation, integrates spatial, temporal, and view dimensions to model realistic dynamic scenes, playing a foundational role in advancing world models and physical AI. However, maintaining long‑chain consistency across both frames and viewpoints through the unique spatio‑camera‑motion (SCM) attention mechanism introduces substantial computational and memory overhead, often leading to out‑of‑memory (OOM) failures and prohibitive generation times. To address these challenges, we propose Turbo4DGen, an ultra‑fast acceleration framework for diffusion‑based multi‑view 4D content generation. Turbo4DGen introduces a spatiotemporal cache mechanism that persistently reuses intermediate attention across denoising steps, combined with dynamically semantic‑aware attention pruning and an adaptive SCM chain bypass scheduler, to drastically reduce redundant SCM attention computation. Our experimental results show that Turbo4DGen achieves an average 9.7× speedup without quality degradation on the ObjaverseDy and Consistent4D datasets. To the best of our knowledge, Turbo4DGen is the first dedicated acceleration framework for 4D generation.

Abstract:
Multi‑agent traffic simulation is central to developing and testing autonomous driving systems. Recent data‑driven simulators have achieved promising results, but rely heavily on supervised learning from labeled trajectories or semantic annotations, making it costly to scale their performance. Meanwhile, large amounts of unlabeled sensor data can be collected at scale but remain largely unused by existing traffic simulation frameworks. This raises a key question: How can a method harness unlabeled data to improve traffic simulation performance? In this work, we propose AutoWorld, a traffic simulation framework that employs a world model learned from unlabeled occupancy representations of LiDAR data. Given world model samples, AutoWorld constructs a coarse‑to‑fine predictive scene context as input to a multi‑agent motion generation model. To promote sample diversity, AutoWorld uses a cascaded Determinantal Point Process framework to guide the sampling processes of both the world model and the motion model. Furthermore, we designed a motion‑aware latent supervision objective that enhances AutoWorld's representation of scene dynamics. Experiments on the WOSAC benchmark show that AutoWorld ranks first on the leaderboard according to the primary Realism Meta Metric (RMM). We further show that simulation performance consistently improves with the inclusion of unlabeled LiDAR data, and study the efficacy of each component with ablations. Our method paves the way for scaling traffic simulation realism without additional labeling. Our project page contains additional visualizations and released code.
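
To illustrate how a Determinantal Point Process favours diverse samples, the sketch below runs a generic greedy selection that repeatedly adds the candidate maximizing the determinant of the selected kernel submatrix. It is not AutoWorld's cascaded framework: the RBF similarity kernel, the greedy approximation, and the toy 2D candidates are assumptions chosen to keep the example self‑contained.

```python
import math


def rbf_kernel(points, length_scale=1.0):
    """Similarity matrix: near-identical points get entries close to 1."""
    def sim(p, q):
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
        return math.exp(-sq_dist / (2 * length_scale ** 2))
    return [[sim(p, q) for q in points] for p in points]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def greedy_dpp_select(kernel, k):
    """Greedily pick k indices, each maximizing the submatrix determinant."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected


# Two tight clusters of hypothetical samples; a diverse pick takes one from each.
candidates = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
chosen = greedy_dpp_select(rbf_kernel(candidates), k=2)
```

The determinant shrinks toward zero when two selected items are nearly identical, so the second pick skips the near‑duplicate of the first and moves to the other cluster.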

Abstract:
This paper presents the World‑Action Model (WAM), an action‑regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action‑relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model‑based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine‑tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.

Abstract:
Data‑driven autonomous driving simulation has long been constrained by its heavy reliance on pre‑recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open‑ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model‑driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego‑actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large‑scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents a more than 80x improvement in stable generation length over previous state‑of‑the‑art occupancy world models. OccSim is powered by two modules: a W‑DiT‑based static occupancy world model and a Layout Generator. W‑DiT handles the ultra‑long‑horizon generation of static environments by explicitly introducing known rigid transformations in its architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre‑train 4D semantic occupancy forecasting models to achieve up to 67% zero‑shot performance on unseen data, outperforming the previous asset‑based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero‑shot performance increases to about 74%, while the improvement over asset‑based simulators expands to 22.1%.

Abstract:
Vision‑Language‑Action (VLA) models and world models have recently emerged as promising paradigms for general‑purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real‑world deployment. Existing benchmarks are largely simulator‑centric: they provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real‑world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real‑world execution. ManipArena comprises 20 diverse tasks and 10,812 expert trajectories, emphasizes reasoning‑oriented manipulation that requires semantic and spatial reasoning, supports multi‑level generalization through controlled out‑of‑distribution settings, and incorporates long‑horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low‑level motor signals, and synchronized real‑to‑sim environments constructed via high‑quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.

Abstract:
The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long‑horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video‑based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general‑purpose, real‑time, and robust world simulators.

Abstract:
Most existing vision‑language‑action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand‑crafted heuristics for task termination. This limitation is particularly severe in long‑horizon tasks involving cascaded sub‑goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named \vla. Our technical contributions are twofold: (1) robust progress estimation: we pre‑train a progress estimator on large‑scale, unsupervised video‑text robotic datasets; this estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero‑shot generalization to unseen real‑world samples; and (2) differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress‑piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real‑world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.

Abstract:
Learning human‑object manipulation presents significant challenges due to the fine‑grained and contact‑rich nature of the motions involved. Traditional physics‑based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real‑world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human‑object interactions as videos conditioned on an input image, a text prompt, and per‑frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human‑object interactions, LOME demonstrates not only high action‑following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand‑object interactions, e.g., liquid flowing from a bottle into a mug after executing a "pouring" action. Extensive experiments demonstrate that our video‑based framework significantly outperforms state‑of‑the‑art image‑based and video‑based action‑conditioned methods and Image/Text‑to‑Video (I/T2V) generative models in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.

Abstract:
Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world‑model‑based approaches typically predict future scenes first and plan afterwards, resulting in open‑loop imagination that may drift from the actual decision process. In this paper, we present Uni‑World VLA, a unified vision‑language‑action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed‑loop interaction between world modeling and control, enabling more adaptive decision‑making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long‑horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed‑loop planning performance while producing high‑fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.

Abstract:
Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long‑horizon planning, or employ world models that suffer from poor action initialization in high‑dimensional spaces. We present PiJEPA, a two‑stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction‑conditioned visual navigation. In the first stage, we finetune an Octo‑based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V‑JEPA‑2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy‑derived distribution to warm‑start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high‑quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V‑JEPA‑2, across both the policy and world model components. Experiments on real‑world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal‑reaching accuracy and instruction‑following fidelity.
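
As an illustration of warm-starting MPPI from a policy prior, the sketch below runs a single MPPI update on a one-dimensional toy problem. Everything in it is an assumption for illustration (the scalar action, the quadratic `goal_cost`, the sample count and temperature); PiJEPA itself plans over action sequences using a learned JEPA world model, which is not reproduced here.

```python
import math
import random
from typing import Callable


def mppi_update(mean: float, std: float, cost: Callable[[float], float],
                n_samples: int, temperature: float, rng: random.Random) -> float:
    """One MPPI step: sample actions, weight by exp(-cost / temperature), average."""
    samples = [rng.gauss(mean, std) for _ in range(n_samples)]
    costs = [cost(a) for a in samples]
    c_min = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, samples)) / total


def goal_cost(action: float) -> float:
    """Hypothetical 1-D cost: squared distance of the action from a goal at 2.0."""
    return (action - 2.0) ** 2


rng = random.Random(0)
# Warm start: sampling distribution centred on a (hypothetical) policy prior near the goal.
warm = mppi_update(mean=1.5, std=0.5, cost=goal_cost,
                   n_samples=256, temperature=0.1, rng=rng)
# Uninformed start: zero-mean Gaussian with the same spread.
cold = mppi_update(mean=0.0, std=0.5, cost=goal_cost,
                   n_samples=256, temperature=0.1, rng=rng)
```

In this toy setting the warm-started distribution already covers the low-cost region, so a single update moves its mean closer to the goal, whereas the uninformed start has to rely on rare samples in the tail of its distribution.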

Abstract:
Action‑conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short‑term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post‑training scheme that trains the world model on its own autoregressive rollouts rather than on ground‑truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable‑length futures from the same rollout state, reinforcing higher‑fidelity predictions over lower‑fidelity ones. Third, we develop efficient, multi‑view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for a dense, low‑variance training signal. Fourth, we show that our approach establishes a new state of the art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

Abstract:
Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective‑driven self‑adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision‑making by replacing the policy‑optimisation method and revisiting the discrete action formulation. We study both compact and larger, finer‑grained discrete motion sets and compare a single‑head policy over atomic actions with a factorised multi‑head policy over action components. We evaluate curriculum learning and optional depth‑based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision‑free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer‑grained, factorised action representation yields the strongest overall completeness‑efficiency trade‑off.
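
The difference between a single-head policy over atomic actions and a factorised multi-head policy can be sketched by counting output logits. The action components below (`TURNS`, `STEPS`) and their values are hypothetical and are not the motion sets used in the paper.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical action components, chosen only for illustration.
TURNS: List[int] = [-30, 0, 30]          # turn angle in degrees
STEPS: List[float] = [0.0, 0.25, 0.5]    # forward step in metres


def atomic_action_set() -> List[Tuple[int, float]]:
    """Single-head view: every (turn, step) combination is one atomic action."""
    return list(product(TURNS, STEPS))


def decode_factorised(turn_index: int, step_index: int) -> Tuple[int, float]:
    """Multi-head view: each head picks one component; the pair forms the action."""
    return TURNS[turn_index], STEPS[step_index]


single_head_logits = len(atomic_action_set())   # grows as the product: 3 * 3
multi_head_logits = len(TURNS) + len(STEPS)     # grows as the sum: 3 + 3
```

Both views cover the same set of motions, but the factorised form needs fewer outputs, and the gap widens as the motion set becomes finer-grained.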

Abstract:
Vision‑Language‑Action (VLA) models aim to control robots for manipulation from visual observations and natural‑language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long‑horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA‑VLA, a fully native pre‑trained large diffusion VLA model that unifies multi‑modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order‑free refinement, improving long‑horizon consistency while grounding actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real‑world tasks show state‑of‑the‑art performance, achieving 98.0% average success on LIBERO and 4.78 average length on CALVIN.

Abstract:
Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single‑cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu‑Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non‑sequential nature of single‑cell transcriptomic data, Lingshu‑Cell captures complex transcriptome‑wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu‑Cell accurately reproduces transcriptomic distributions, marker‑gene expression patterns and cell‑subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu‑Cell can predict whole‑transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine‑induced responses in human PBMCs. Together, these results establish Lingshu‑Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.

Abstract:
Integrating AI into the physical layer is a cornerstone of 6G networks. However, current data‑driven approaches struggle to generalize across dynamic environments because they lack an intrinsic understanding of electromagnetic wave propagation. We introduce the Wireless World Model (WWM), a multi‑modal foundation framework that predicts the spatiotemporal evolution of wireless channels by internalizing the causal relationship between 3D geometry and signal dynamics. Pre‑trained on a massive ray‑traced multi‑modal dataset, WWM overcomes the data authenticity gap, as further validated on real‑world measurement data. Using a joint‑embedding predictive architecture with a multi‑modal mixture‑of‑experts Transformer, WWM fuses channel state information, 3D point clouds, and user trajectories into a unified representation. Across the five key downstream tasks it supports, WWM achieves remarkable performance in seen environments, unseen generalization scenarios, and real‑world measurements, consistently outperforming SOTA uni‑modal foundation models and task‑specific models. This paves the way for physics‑aware 6G intelligence that adapts to the physical world.

Abstract:
We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while maintaining visual interpretability. Training RL policies on real‑world driving data incurs prohibitive costs and safety risks. While existing pixel‑level diffusion world models enable safe imagination‑based training, they suffer from multi‑step diffusion inference latency (2s/frame) that prevents high‑frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi‑resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine‑grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state‑of‑the‑art performance and demonstrating that latent‑space RL is effective for autonomous driving.

Abstract:
We introduce Latent‑WAM, an efficient end‑to‑end autonomous driving framework that achieves strong trajectory planning through spatially‑aware and dynamics‑informed latent world representations. Existing world‑model‑based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub‑optimal planning under constrained data and compute budgets. Latent‑WAM addresses these limitations with two core modules: a Spatial‑Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi‑view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state‑of‑the‑art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD‑Score on HUGSIM, surpassing the best prior perception‑free method by 3.2 EPDMS with significantly less training data and a compact 104M‑parameter model.

Abstract:
Generative world models offer a compelling foundation for augmented‑reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per‑frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion‑based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region‑specific edits while preserving other regions, and the correction stage subsequently aligns safety‑critical regions with real‑world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real‑world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.

Abstract:
Computer‑use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general‑purpose agents is bottlenecked by the scarcity of continuous, high‑quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA‑Suite, a large‑scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer‑use agents. At its core is VideoCUA, which provides approximately 10,000 human‑demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi‑layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA‑Suite further provides two complementary resources: UI‑Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large‑scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA‑Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video‑based reward modeling, and visual world models. All data and models are publicly released.
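
The frame and duration figures above can be checked with simple arithmetic. The sketch below assumes the 30 fps rate stated for VideoCUA also applies when converting ScaleCUA's screenshot count to an equivalent video duration; the helper names are illustrative.

```python
def frames_to_hours(frames: int, fps: float) -> float:
    """Convert a frame count to hours of video at a given frame rate."""
    return frames / fps / 3600.0


def hours_to_frames(hours: float, fps: float) -> int:
    """Convert hours of video to a frame count at a given frame rate."""
    return round(hours * 3600.0 * fps)


FPS = 30  # stated for VideoCUA; using it for ScaleCUA is an assumption

scalecua_hours = frames_to_hours(2_000_000, FPS)   # roughly 18.5 hours
videocua_frames = hours_to_frames(55, FPS)         # 5,940,000 frames
```

Under that assumption, 2 million frames correspond to roughly 18.5 hours (consistent with "less than 20 hours"), and 55 hours correspond to 5.94 million frames (consistent with "approximately 6 million").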

Abstract:
Existing automated research systems operate as stateless, linear pipelines, generating outputs without maintaining any persistent understanding of the research landscape they navigate. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify, challenge, or refine each other's findings. We present AI‑Supervisor, a multi‑agent orchestration framework where specialized agents provide end‑to‑end AI research supervision driven by human interests (from literature review through gap discovery, method development, evaluation, and paper writing) via autonomous exploration and self‑correcting updates of research knowledge. Unlike sequential pipelines, AI‑Supervisor maintains a continuously evolving Research World Model, implemented as a Knowledge Graph, that captures methods, benchmarks, known limitations, and unexplored gaps, serving as shared memory across all agents and enabling agents to explore and build upon a structured understanding of the research landscape. The framework introduces three architectural contributions: (1) structured gap discovery that decomposes methods into core modules, validates their performance across benchmarks, and maps the specific gaps each module creates; (2) self‑correcting discovery loops that probe why modules succeed on certain problems and fail on others, whether benchmarks carry hidden biases, and whether evaluation protocols remain adequate for emerging challenges; and (3) self‑improving development loops governed by cross‑domain mechanism search that iteratively targets failing modules by finding solutions from other scientific fields. All agents operate under a consensus mechanism where independent findings are corroborated before being committed to the Research World Model.

Abstract:
Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long‑horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in hallucinations or trial‑and‑error. This paper introduces Environment Maps: a persistent, agent‑agnostic representation that mitigates these failures by consolidating heterogeneous evidence, such as screen recordings and execution traces, into a structured graph. The representation consists of four core components: (1) Contexts (abstracted locations), (2) Actions (parameterized affordances), (3) Workflows (observed trajectories), and (4) Tacit Knowledge (domain definitions and reusable procedures). We evaluate this framework on the WebArena benchmark across five domains. Agents equipped with environment maps achieve a 28.2% success rate, nearly doubling the performance of baselines limited to session‑bound context (14.2%) and outperforming agents that have access to the raw trajectory data used to generate the environment maps (23.3%). By providing a structured interface between the model and the environment, Environment Maps establish a persistent foundation for long‑horizon planning that is human‑interpretable, editable, and incrementally refinable.
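
The four components listed above can be pictured as a small graph data structure. The sketch below is one possible encoding, not the paper's schema: the class and field names (for example `target_context`) and the example entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Action:
    """A parameterized affordance available in a context."""
    name: str
    parameters: List[str]
    target_context: str  # context reached after the action (assumed field)


@dataclass
class Context:
    """An abstracted location, such as a page or dialog."""
    name: str
    actions: List[Action] = field(default_factory=list)


@dataclass
class EnvironmentMap:
    contexts: Dict[str, Context] = field(default_factory=dict)
    workflows: List[List[str]] = field(default_factory=list)      # observed action-name sequences
    tacit_knowledge: Dict[str, str] = field(default_factory=dict)  # domain definitions and procedures

    def add_action(self, context: str, action: Action) -> None:
        """Register an action, creating source and target contexts if needed."""
        self.contexts.setdefault(context, Context(context)).actions.append(action)
        self.contexts.setdefault(action.target_context, Context(action.target_context))

    def reachable(self, start: str) -> List[str]:
        """Contexts reachable from `start` by following actions (breadth-first)."""
        seen, queue = [start], [start]
        while queue:
            current = queue.pop(0)
            for action in self.contexts[current].actions:
                if action.target_context not in seen:
                    seen.append(action.target_context)
                    queue.append(action.target_context)
        return seen


env_map = EnvironmentMap()
env_map.add_action("login_page", Action("submit_login", ["username", "password"], "dashboard"))
env_map.add_action("dashboard", Action("open_orders", [], "orders_list"))
env_map.workflows.append(["submit_login", "open_orders"])
env_map.tacit_knowledge["order"] = "A purchase record shown in the orders list."
reachable = env_map.reachable("login_page")
```

Keeping the map as plain data is what makes it persistent across sessions, inspectable by a human, and incrementally editable, which are the properties the abstract emphasizes.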

Abstract:
Deploying safety‑critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language‑ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate‑then‑act" to "describe‑then‑act." DILLO is trained via cross‑modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent‑conditioned Large Language Model student learns to predict semantic outcomes. This creates a text‑only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high‑fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

Abstract:
We study whether a risk‑sensitive objective from asset‑pricing theory, recursive utility, improves reinforcement learning for portfolio allocation. The Bellman equation under recursive utility involves a certainty equivalent (CE) of future value that has no closed form under observed returns; we approximate it by K‑sample Monte Carlo and train actor‑critic agents (PPO, A2C) on the resulting value target and an approximate advantage estimate (AAE) that generalizes the Bellman residual to multiple steps with state‑dependent weights. This formulation applies only to critic‑based algorithms. On 10 chronological train/test splits of South Korean ETF data, the recursive‑utility agent improves on the discounted (naive) baseline in Sharpe ratio, max drawdown, and cumulative return. Derivations, world model and metrics, and full result tables are in the appendices.
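
A K-sample Monte Carlo estimate of a certainty equivalent can be sketched as follows. The abstract does not state the functional form, so this sketch assumes the power-form certainty equivalent commonly paired with recursive (Epstein-Zin style) utility, CE = (E[V^(1-gamma)])^(1/(1-gamma)); the function name and sample values are illustrative.

```python
import math
from typing import Sequence


def certainty_equivalent(values: Sequence[float], gamma: float) -> float:
    """K-sample Monte Carlo estimate of a power-form certainty equivalent.

    values: K sampled next-period values, all strictly positive.
    gamma:  relative risk aversion (gamma = 0 recovers the plain sample mean).
    """
    k = len(values)
    if abs(gamma - 1.0) < 1e-12:
        # Limiting case gamma -> 1: the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / k)
    power = 1.0 - gamma
    return (sum(v ** power for v in values) / k) ** (1.0 / power)


samples = [0.9, 1.0, 1.1, 1.2]   # hypothetical sampled future values
risk_neutral = certainty_equivalent(samples, gamma=0.0)
risk_averse = certainty_equivalent(samples, gamma=5.0)
```

With positive risk aversion the estimate falls below the sample mean, which is how the objective penalizes dispersion in future value rather than only its expectation.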

Abstract:
Embodied agents for creative tasks like photography must bridge the semantic gap between high‑level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM‑driven, chain‑of‑thought (CoT) reasoning, allowing an analytical solver to compute a high‑quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial‑and‑error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.

Abstract:
This paper asks whether a bounded neural architecture can exhibit a meaningful division of labor between intuition and deliberation on a classic 64‑item syllogistic reasoning benchmark. More broadly, the benchmark is relevant to ongoing debates about world models and multi‑stage reasoning in AI. It provides a controlled setting for testing whether a learned system can develop structured internal computation rather than only one‑shot associative prediction. Experiment 1 evaluates a direct neural baseline for predicting full 9‑way human response distributions under 5‑fold cross‑validation. Experiment 2 introduces a bounded dual‑path architecture with separate intuition and deliberation pathways, motivated by computational mental‑model theory (Khemlani & Johnson‑Laird, 2022). Under cross‑validation, bounded intuition reaches an aggregate correlation of r = 0.7272, whereas bounded deliberation reaches r = 0.8152, and the deliberation advantage is significant across folds (p = 0.0101). The largest held‑out gains occur for NVC, Eca, and Oca, suggesting improved handling of rejection responses and c‑a conclusions. A canonical 80:20 interpretability run and a five‑seed stability sweep further indicate that the deliberation pathway develops sparse, differentiated internal structure, including an Oac‑leaning state, a dominant workhorse state, and several weakly used or unused states whose exact indices vary across runs. These findings are consistent with reasoning‑like internal organization under bounded conditions, while stopping short of any claim that the model reproduces full sequential processes of model construction, counterexample search, and conclusion revision.

Abstract:
Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference‑time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion‑planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference‑time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end‑to‑end gradient computation through imagined rollouts for inference‑time policy optimization (ITPO). We evaluate our algorithm on D4RL continuous‑control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference‑time information to optimize the policy parameters yields consistent gains over strong offline RL baselines. Inference‑time adaptation, however, is expensive: rollout generation and backpropagation dominate per‑step compute. We study this tradeoff explicitly, showing that a suitably tilted version of the one‑step MeanFlow sampler recovers much of the gains at a fraction of the cost.

Abstract:
Recent progress in latent world models (e.g., V‑JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low‑level extrapolation, making it difficult to capture long‑horizon semantics and reducing downstream utility. Vision‑language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute‑driven sparse sampling, a language‑output bottleneck that compresses fine‑grained interaction states into text‑oriented representations, and a data‑regime mismatch when adapting to small action‑conditioned datasets. We propose a VLM‑guided JEPA‑style latent world modeling framework that combines dense‑frame dynamics modeling with long‑horizon semantic guidance via a dual‑temporal pathway: a dense JEPA branch for fine‑grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge‑rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi‑layer VLM representations into guidance features compatible with latent prediction. Experiments on hand‑manipulation trajectory prediction show that our method outperforms both a strong VLM‑only baseline and a JEPA‑predictor baseline, and yields more robust long‑horizon rollout behavior.

Abstract:
Video‑based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text‑video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni‑WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni‑WorldBench comprises two key components: Omni‑WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni‑Metrics, an agent‑based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni‑WorldBench will be publicly released to foster progress in interactive 4D world modeling.

Abstract:
Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision‑language‑action (VLA) models, which repurpose large‑scale vision‑language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web‑scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state‑of‑the‑art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO‑Plus and RoboTwin 2.0‑Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot‑VA reaching 74.2% success rate on RoboTwin 2.0‑Plus and Cosmos‑Policy achieving 82.2% on LIBERO‑Plus. While VLAs such as π_0.5 can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video‑based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.

Abstract:
Single‑image 3D generation lies at the core of vision‑to‑graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part‑aware models such as PartCrafter, which still require a labor‑intensive user‑specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single‑image 3D generation as learning an adaptive part‑whole hierarchy in the flexible 3D latent space. We present a novel part‑to‑whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot‑gating mechanism dynamically determines the slot‑wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class‑agnostic prototype bank, enabling powerful cross‑category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross‑category transfer and part‑count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape‑prior sharing as well as slot‑gating for structural adaptation.

Abstract:
World models learn to simulate environment dynamics from experience, enabling sample‑efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions, which shift hidden states along probe‑derived directions, produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi‑baseline token ablation experiments consistently identify object‑containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

Abstract:
This paper presents ARYA, a composable, physics‑constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety. We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system‑of‑system‑of‑systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always‑on cognitive daemon that executes a continuous sense‑decide‑act‑learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub‑20‑second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self‑improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact. We present formal alignment between ARYA's architecture and canonical world model requirements, and report state‑of‑the‑art performance on 6 of 9 competitive benchmarks in head‑to‑head comparison with GPT‑5.2, Opus 4.6, and V‑JEPA‑2, all with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.

Abstract:
Generating safety‑critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism‑adversarial trade‑off. We present CounterScene, a framework that endows closed‑loop generative BEV world models with structured counterfactual reasoning for safety‑critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict‑aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter‑agent dependencies. Building on this structure, stage‑adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long‑horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero‑shot to nuPlan with state‑of‑the‑art realism.

Abstract:
Diffusion policies excel at visuomotor control but often fail catastrophically under severe out‑of‑distribution (OOD) disturbances, such as unexpected object displacements or visual corruptions. To address this vulnerability, we introduce the Dream Diffusion Policy (DDP), a framework that deeply integrates a diffusion world model into the policy's training objective via a shared 3D visual encoder. This co‑optimization endows the policy with robust state‑prediction capabilities. When encountering sudden OOD anomalies during inference, DDP detects the real‑imagination discrepancy and actively abandons the corrupted visual stream. Instead, it relies on its internal "imagination" (autoregressively forecasted latent dynamics) to safely bypass the disruption, generating imagined trajectories before smoothly realigning with physical reality. Extensive evaluations demonstrate DDP's exceptional resilience. Notably, DDP achieves a 73.8% OOD success rate on MetaWorld (vs. 23.9% without predictive imagination) and an 83.3% success rate under severe real‑world spatial shifts (vs. 3.3% without predictive imagination). Furthermore, as a stress test, DDP maintains a 76.7% real‑world success rate even when relying entirely on open‑loop imagination post‑initialization.

Abstract:
While current embodied policies exhibit remarkable manipulation skills, their execution remains unsatisfactorily slow as they inherit the slow pacing of human demonstrations. Existing acceleration methods typically require policy retraining or costly online interactions, limiting their scalability for large‑scale foundation models. In this paper, we propose Speedup Patch (SuP), a lightweight, policy‑agnostic framework that enables plug‑and‑play acceleration using solely offline data. SuP introduces an external scheduler that adaptively downsamples action chunks provided by embodied policies to eliminate redundancies. Specifically, we formalize the optimization of our scheduler as a Constrained Markov Decision Process (CMDP) aimed at maximizing efficiency without compromising task performance. Since direct success evaluation is infeasible in offline settings, SuP introduces world‑model‑based state deviation as a surrogate metric to enforce safety constraints. By leveraging a learned world model as a virtual evaluator to predict counterfactual trajectories, the scheduler can be optimized via offline reinforcement learning. Empirical results on simulation benchmarks (Libero, Bigym) and real‑world tasks validate that SuP achieves an overall 1.8x execution speedup for diverse policies while maintaining their original success rates.

Abstract:
Vision‑Language‑Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real‑world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel‑level world modeling, multi‑view consistency, and compounding errors under sparse rewards. Building on recent advances across large multimodal models and model‑based RL, we propose VLA‑MBPO, a practical framework to tackle these problems in VLA finetuning. Our approach has three key design choices: (i) adapting unified multimodal models (UMMs) for data‑efficient world modeling; (ii) an interleaved view decoding mechanism to enforce multi‑view consistency; and (iii) chunk‑level branched rollout to mitigate error compounding. Theoretical analysis and experiments across simulation and real‑world tasks demonstrate that VLA‑MBPO significantly improves policy performance and sample efficiency, underscoring its robustness and scalability for real‑world robotic deployment.

Abstract:
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand‑object interactions, and goal‑directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand‑centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal‑directed world simulator that generates coherent, first‑person video rollouts from minimal static inputs: a single egocentric image, a high‑level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory‑level reward‑guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real‑world smart‑glasses experiments.

Abstract:
We introduce a self‑supervised framework for learning predictive and structured representations of wireless channels by modeling the temporal evolution of channel state information (CSI) in a compact latent space. Our method casts the problem as a world modeling task and leverages the Joint Embedding Predictive Architecture (JEPA) to learn action‑conditioned latent dynamics from CSI trajectories. To promote geometric consistency and compositionality, we parameterize transitions using homomorphic updates derived from Lie algebra, yielding a structured latent space that reflects spatial layout and user motion. Evaluations on the DICHASUS dataset show that our approach outperforms strong baselines in preserving topology and forecasting future embeddings across unseen environments. The resulting latent space enables metrically faithful channel charts, offering a scalable foundation for downstream applications such as mobility‑aware scheduling, localization, and wireless scene understanding.

Abstract:
Scalable and reliable evaluation is increasingly critical in the end‑to‑end era of autonomous driving, where vision‑language‑action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real‑world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real‑world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X‑World, an action‑conditioned multi‑camera generative world model that simulates future observations directly in video space. Given synchronized multi‑view camera history and a future action sequence, X‑World generates future multi‑camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X‑World further supports optional controls over dynamic traffic agents and static road elements, and retains a text‑prompt interface for appearance‑level control (e.g., weather and time of day). Beyond world simulation, X‑World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X‑World is a multi‑view latent video generator designed to explicitly encourage cross‑view geometric consistency and temporal coherence under diverse control signals. Experiments show that X‑World achieves high‑quality multi‑view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X‑World a practical foundation for scalable and reproducible evaluation.

Abstract:
Given the remarkable ability of 2D foundation image models to generate high‑fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state‑of‑the‑art image generation models and Vision‑Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi‑agent architecture: a VLM‑based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM‑backed two‑step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D‑consistent worlds.

Abstract:
Cloud robotics enables robots to offload high‑dimensional motion planning and reasoning to remote servers. However, for continuous manipulation tasks requiring high‑frequency control, network latency and jitter can severely destabilize the system, causing command starvation and unsafe physical execution. To address this, we propose Speculative Policy Orchestration (SPO), a latency‑resilient cloud‑edge framework. SPO utilizes a cloud‑hosted world model to pre‑compute and stream future kinematic waypoints to a local edge buffer, decoupling execution frequency from network round‑trip time. To mitigate unsafe execution caused by predictive drift, the edge node employs an ε‑tube verifier that strictly bounds kinematic execution errors. The framework is coupled with an Adaptive Horizon Scaling mechanism that dynamically expands or shrinks the speculative pre‑fetch depth based on real‑time tracking error. We evaluate SPO on continuous RLBench manipulation tasks under emulated network delays. Results show that even when deployed with learned models of modest accuracy, SPO reduces network‑induced idle time by over 60% compared to blocking remote inference. Furthermore, SPO discards approximately 60% fewer cloud predictions than static caching baselines. Ultimately, SPO enables fluid, real‑time cloud‑robotic control while maintaining bounded physical safety.
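
The edge-side logic described above (an ε‑tube check on each speculative waypoint, plus a pre‑fetch depth that grows or shrinks with tracking error) can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean error measure, the halving/increment rule, the 0.5·ε threshold, and the names `verify` and `adapt_horizon` are all assumptions made for illustration.

```python
import math


def verify(predicted_state, measured_state, eps):
    """Check whether the measured state lies inside the eps-tube around the prediction.

    Returns (ok, error), where error is the Euclidean distance between the
    cloud-predicted waypoint and the state actually measured at the edge.
    The Euclidean metric is an assumption of this sketch.
    """
    error = math.dist(predicted_state, measured_state)
    return error <= eps, error


def adapt_horizon(horizon, error, eps, h_min=1, h_max=16):
    """Hypothetical adaptive horizon rule (not taken from the paper).

    - error outside the tube: halve the speculative pre-fetch depth
    - error well inside the tube (<= eps / 2): grow the depth by one
    - otherwise: keep the current depth
    """
    if error > eps:
        return max(h_min, horizon // 2)
    if error <= 0.5 * eps:
        return min(h_max, horizon + 1)
    return horizon


if __name__ == "__main__":
    horizon = 4
    ok, err = verify((0.0, 0.0), (0.03, 0.04), eps=0.1)
    horizon = adapt_horizon(horizon, err, eps=0.1)
    print(ok, horizon)
```

In a scheme like this, a failed check would also discard the remaining buffered waypoints and fall back to a fresh cloud query; that part is omitted here.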

Abstract:
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi‑term losses, exponential moving averages, pre‑trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end‑to‑end from raw pixels using only two loss terms: a next‑embedding prediction loss and a regularizer enforcing Gaussian‑distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end‑to‑end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation‑model‑based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

Abstract:
Traffic microsimulators are widely used to evaluate road network performance under various "what‑if" conditions. However, the behavior models controlling the actions of the actors are overly simplistic and fail to capture realistic actor‑actor interactions. Deep learning‑based methods have been applied to model vehicles and pedestrians as "agents" responding to their surrounding "environment" (including lanes, signals, and neighboring agents). Although effective in learning actor‑actor interaction, these approaches fail to generate physically consistent trajectories over long time periods, and they do not explicitly address the complex dynamics that arise at traffic intersections, which are critical locations in urban networks. Inspired by the World Model paradigm, we have developed an actor‑centric generative model using a transformer‑based architecture that captures actor‑actor interaction while also understanding the geometry of the traffic intersection, generating physically grounded trajectories based on learned behavior. Moreover, we test the model in a live "simulation‑in‑the‑loop" setting, where we generate the initial conditions of the actors using SUMO and then let the model control the dynamics of the actors. We let the simulation run for 40000 timesteps (4000 seconds), testing the performance of the model over a long time range and evaluating the trajectories on traffic‑engineering metrics. Experimental results demonstrate that the proposed framework effectively captures complex actor‑actor interactions and generates long‑horizon, physically consistent trajectories, while requiring significantly fewer training samples than traditional agent‑centric generative approaches. Our model outperforms the baseline on traffic‑related as well as aggregate metrics, beating it by more than 10x on KL divergence.

Abstract:
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid‑body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement‑learning post‑training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out‑of‑bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment‑specific artifacts in generated rollouts and improves downstream task execution success.
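
As a rough illustration of the kind of reward described above, the sketch below scores an action sequence (as an IDM might decode it from generated frames) by penalizing finite‑difference velocity, acceleration, and jerk, plus any excursion outside per‑dimension bounds. The weights, the mean‑squared form of the penalties, the bound handling, and the function names are assumptions for illustration, not the reward actually used by EVA.

```python
def finite_diff(seq, dt):
    """First-order finite difference of a sequence of equal-length vectors."""
    return [[(b - a) / dt for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]


def mean_sq(seq):
    """Mean of squared entries over all vectors; 0.0 for an empty sequence."""
    vals = [x * x for row in seq for x in row]
    return sum(vals) / len(vals) if vals else 0.0


def executability_reward(actions, dt, lower, upper,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Hypothetical smoothness-plus-bounds reward (higher is better, max 0).

    actions: list of action vectors, one per generated frame transition
    lower/upper: per-dimension embodiment limits (assumed box constraints)
    """
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    # Total distance by which any action component leaves its allowed range.
    violation = sum(max(lo - x, 0.0, x - hi)
                    for row in actions
                    for x, lo, hi in zip(row, lower, upper))
    penalty = (w_vel * mean_sq(vel) + w_acc * mean_sq(acc)
               + w_jerk * mean_sq(jerk) + w_bound * violation)
    return -penalty


if __name__ == "__main__":
    smooth = [[0.0], [0.1], [0.2], [0.3]]
    jittery = [[0.0], [0.3], [0.0], [0.3]]
    for name, traj in (("smooth", smooth), ("jittery", jittery)):
        print(name, executability_reward(traj, dt=1.0, lower=[-1.0], upper=[1.0]))
```

A jittery rollout accumulates large acceleration and jerk terms and therefore scores lower than a smooth one, which matches the intuition that visual artifacts tend to decode into unstable actions.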

Abstract:
The rapid evolution toward 6G and beyond communication systems is accelerating the convergence of digital twins and world models at the network edge. Traditional digital twins provide high‑fidelity representations of physical systems and support monitoring, analysis, and offline optimization. However, in highly dynamic edge environments, they face limitations in autonomy, adaptability, and scalability. This paper presents a systematic survey of the transition from digital twins to world models and discusses its role in enabling edge general intelligence (EGI). First, the paper clarifies the conceptual differences between digital twins and world models and highlights the shift from physics‑based, centralized, and system‑centric replicas to data‑driven, decentralized, and agent‑centric internal models. This discussion helps readers gain a clear understanding of how this transition enables more adaptive, autonomous, and resource‑efficient intelligence at the network edge. The paper reviews the design principles, architectures, and key components of world models, including perception, latent state representation, dynamics learning, imagination‑based planning, and memory. In addition, it examines the integration of world models and digital twins in wireless EGI systems and surveys emerging applications in integrated sensing and communications, semantic communication, air‑ground networks, and low‑altitude wireless networks. Finally, this survey provides a systematic roadmap and practical insights for designing world‑model‑driven edge intelligence systems in wireless and edge computing environments. It also outlines key research challenges and future directions toward scalable, reliable, and interoperable world models for edge‑native agentic AI.

Abstract:
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large‑scale Vision‑Language Models (VLMs). While VLMs show promise as zero‑shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real‑world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine‑tuning VLMs via real‑world interaction is prohibitively expensive, unsafe, and sample‑inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine‑tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero‑shot VLM to collect exploratory interaction data. We demonstrate that this sub‑optimal data is sufficient to train an action‑conditioned video generation model, which implicitly captures complex real‑world physics. Subsequently, the VLM planner is fine‑tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task‑specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large‑scale real‑world data collection. Our project page is https://psi‑lab.ai/DreamPlan/.

Abstract:
Next‑token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. In order to understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two‑dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process solely depends on a sufficient vector determined by the walker's position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world's geometry. We train decoder‑only transformers on prefixes sampled from the exact distribution of these walks and compare their hidden activations to the analytically derived sufficient vectors. Across models and layers, the learned representations align strongly with the ground‑truth predictive vectors and are often low‑dimensional. This provides a concrete example in which world‑model‑like representations can be directly traced back to the predictive geometry of the data itself. Although demonstrated in a simplified toy system, the analysis suggests that geometric representations supporting optimal prediction may provide a useful lens for studying how neural networks internalize grammatical and other structural constraints.
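
To make the "sufficient vector" idea concrete, the sketch below computes the exact next‑step distribution of a lattice walk that must hit a target after a fixed number of steps. It assumes the four nearest‑neighbour moves (the abstract does not specify the step set) and uses the standard rotation u = x + y, v = x − y, under which a 2D walk factorizes into two independent 1D walks. The output depends only on the displacement to the target and the steps remaining, which is the dependence the abstract describes. Function names are illustrative.

```python
from fractions import Fraction
from math import comb

# Assumed step set: the four nearest-neighbour moves on the square lattice.
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def count_walks(n, dx, dy):
    """Number of n-step nearest-neighbour walks with net displacement (dx, dy).

    With u = dx + dy and v = dx - dy, every step changes u and v by +/-1
    independently, so the count factorizes into two binomial coefficients.
    """
    if n < 0:
        return 0
    u, v = dx + dy, dx - dy
    if abs(u) > n or abs(v) > n or (n + u) % 2:
        return 0
    return comb(n, (n + u) // 2) * comb(n, (n + v) // 2)


def next_step_distribution(pos, target, steps_left):
    """Exact probability of each next move for a walk conditioned to end at target."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    total = count_walks(steps_left, dx, dy)
    if total == 0:
        raise ValueError("target is unreachable in the remaining steps")
    return {
        s: Fraction(count_walks(steps_left - 1, dx - s[0], dy - s[1]), total)
        for s in STEPS
    }


if __name__ == "__main__":
    for step, p in next_step_distribution((0, 0), (1, 2), 5).items():
        print(step, p)
```

Two walkers at different absolute positions but with the same displacement to the target and the same remaining horizon receive identical distributions, so (displacement, steps left) is all an optimal predictor needs to track.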

Abstract:
Real‑world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain‑data repetition for a given pretraining compute budget. Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.

Abstract:
While recent Vision‑Language‑Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre‑execution prompts or focus exclusively on human speech. This leaves a significant gap in real‑time, sound‑centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low‑frequency updates or system latency. This problem is exacerbated by action chunking with open‑loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision‑Sound‑Language‑Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi‑sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near‑future audio codes; and (iv) a flow‑matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX‑Sound for pretraining, alongside HEAR‑Bench, the first sound‑centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound‑centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi‑sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.

Abstract:
Autonomous driving requires safe planning, but most learning‑based planners lack explicit self‑correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self‑correction that models planning as motion‑token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self‑correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self‑correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion‑token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model‑based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed‑loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state‑of‑the‑art planning scores on nuPlan.
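
The propose, evaluate, and correct loop described above can be sketched as follows. This is a toy illustration, not the authors' implementation: `plan_step`, the callables it takes, and the `max_corrections` budget are all hypothetical, and the policy and critic are stand-ins for learned models.

```python
def plan_step(propose, predicts_collision, max_corrections=8):
    """Run one planning step of a propose-evaluate-correct loop.

    propose(trace) -> token: policy conditioned on the unsafe tokens so far.
    predicts_collision(token) -> bool: stand-in for a learned collision critic.
    Returns (token, trace); token is None if the budget runs out.
    """
    trace = []  # self-correction trace: unsafe tokens proposed at this step
    for _ in range(max_corrections + 1):
        token = propose(tuple(trace))
        if not predicts_collision(token):
            return token, trace
        trace.append(token)
    return None, trace  # the caller needs a fallback in this case

# Toy stand-ins (assumptions for illustration only).
def propose(trace):
    # Pick the smallest token id that has not been rejected yet.
    return min(t for t in range(5) if t not in trace)

def predicts_collision(token):
    # Pretend tokens 0 and 1 lead to a collision.
    return token < 2

token, trace = plan_step(propose, predicts_collision)
print(token, trace)  # 2 [0, 1]
```

The explicit correction budget is a design choice of this sketch; the abstract only states that the loop repeats until a safe token is proposed or the safety criterion is met.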

Abstract:
Robot learning requires adaptation methods that improve reliably from limited, mixed‑quality interaction data. This is especially challenging in long‑horizon, contact‑rich tasks, where end‑to‑end policy finetuning remains inefficient and brittle. World models offer a compelling alternative: by predicting the outcomes of candidate action sequences, they enable online planning through counterfactual reasoning. However, training action‑conditioned robotic world models directly in the real world requires diverse data at impractical scale. We introduce Simulation Distillation (SimDist), a framework that uses physics simulators as a scalable source of action‑conditioned robot experience. During pretraining, SimDist distills structural priors from the simulator into a world model that enables planning from raw real‑world observations. During real‑world adaptation, SimDist transfers the encoder, reward model, and value function learned in simulation, and updates only the latent dynamics model using real‑world prediction losses. This reduces adaptation to supervised system identification while preserving dense, long‑horizon planning signals for online improvement. Across contact‑rich manipulation and quadruped locomotion tasks, SimDist rapidly improves with experience, while prior adaptation methods struggle to make progress or degrade during online finetuning. Project website and code: https://sim‑dist.github.io

Abstract:
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city‑scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval‑augmented conditioning on nearby street‑view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle‑mounted captures at sparse intervals. We address these challenges through cross‑temporal pairing, a large‑scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street‑view images. We further introduce a Virtual Lookahead Sink to stabilize long‑horizon generation by continuously re‑grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long‑horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text‑prompted scenario variations.

Abstract:
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross‑task transfer. We present RS‑WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text‑guided future scene forecasting, and we build RSWBench‑1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS‑WorldModel is trained in three stages: (1) Geo‑Aware Generative Pre‑training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task‑specific rewards. With only 2B parameters, RS‑WorldModel surpasses open‑source models up to 120× larger on most spatiotemporal change question‑answering metrics. It achieves an FID of 43.13 on text‑guided future scene forecasting, outperforming all open‑source baselines as well as the closed‑source Gemini‑2.5‑Flash Image (Nano Banana).

Abstract:
End‑to‑end autonomous driving policies based on Imitation Learning (IL) often struggle in closed‑loop execution due to the misalignment between inadequate open‑loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering‑based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo‑simulation‑based RL method for closed‑loop end‑to‑end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo‑simulation that operates in vector space, enabling efficient, rendering‑free trial‑and‑error training. To bridge the gap between static datasets and dynamic closed‑loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state‑of‑the‑art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety‑critical occlusion scenarios.

Abstract:
Autonomous driving systems depend on models that can reason about high‑level scene contexts and accurately predict the dynamics of their surrounding environment. Vision‑Language Models (VLMs) have recently emerged as promising tools for decision‑making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end‑to‑end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego‑motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context‑based decision making and prediction, we propose WorldVLM, a hybrid architecture that unifies VLMs and WMs. In our design, the high‑level VLM generates behavior commands to guide the driving WM, enabling interpretable and context‑aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.

Abstract:
Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next‑token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

Abstract:
Ophthalmic decision‑making depends on subtle lesion‑scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation‑stable latent ocular state shared across modalities, unifying fine‑grained parsing, structure‑preserving cross‑modality translation and quality‑robust enhancement within a single framework. Longitudinal supervision further enables time‑conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis‑oriented simulation in medicine.

Abstract:
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human‑Object Interaction (HOI) world models are critical for predicting physically grounded first‑person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high‑DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact‑consistent interactions from action signals alone. To ensure physical accuracy without future‑state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics‑informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics‑informed design.

Abstract:
While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure‑learning paradigms rely on either costly and unsafe real‑world data collection or simulator‑based perturbations, which introduce a severe sim‑to‑real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory‑level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real‑world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure‑correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high‑fidelity dataset of over 120k paired samples. Using this dataset, we fine‑tune a vision‑language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real‑world robotic experiments show our approach achieves state‑of‑the‑art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero‑shot closed‑loop failure recovery in physical deployments.

Abstract:
Diffusion‑based image‑to‑video (I2V) models increasingly exhibit world‑model‑like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory‑control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low‑dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model's state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white‑box and black‑box attack settings. Experimental results show that even under low‑dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white‑box setting and over 80% in the black‑box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.

Abstract:
Backpropagation dominates modern machine learning, yet it is not the only principled method for optimizing dynamical systems. We propose Kalman World Models (KWM), a class of learned state‑space models trained via recursive Bayesian filtering rather than reverse‑mode automatic differentiation. Instead of gradient descent updates, we replace parameter learning with Kalman‑style gain adaptation. Training becomes online filtering; error signals become innovations. We further extend this framework to transformer‑based large language models (LLMs), where internal activations are treated as latent dynamical states corrected via innovation terms. This yields a gradient‑free training and adaptation paradigm grounded in control theory. We derive stability conditions, analyze computational complexity, and provide empirical results on sequence modeling tasks demonstrating competitive performance with improved robustness and continual adaptation properties.
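
For readers unfamiliar with the filtering vocabulary used here (gain, innovation), the following is a sketch of one predict and update step of a textbook scalar Kalman filter. It is not the KWM training procedure, which the abstract does not detail. The model `x' = a*x + noise(q)`, `y = h*x + noise(r)` and all parameter names are assumptions for illustration.

```python
def kalman_step(x, p, y, a=1.0, h=1.0, q=0.0, r=1.0):
    """One predict/update step of a scalar Kalman filter.

    x, p: previous state estimate and its variance.
    y: new observation.
    a, h: state-transition and observation coefficients.
    q, r: process and observation noise variances.
    Returns (new_x, new_p, innovation, gain).
    """
    # Predict the next state and its uncertainty.
    x_pred = a * x
    p_pred = a * a * p + q
    # Innovation: the part of the observation the prediction missed.
    innovation = y - h * x_pred
    s = h * h * p_pred + r       # innovation variance
    gain = p_pred * h / s        # Kalman gain
    # Correct the prediction in proportion to the gain.
    new_x = x_pred + gain * innovation
    new_p = (1.0 - gain * h) * p_pred
    return new_x, new_p, innovation, gain

print(kalman_step(x=0.0, p=1.0, y=2.0))  # (1.0, 0.5, 2.0, 0.5)
```

With equal prior and observation variances the gain is 0.5, so the estimate moves halfway toward the observation and the variance halves.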

Abstract:
VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel‑based visual interaction in novel scenarios. We introduce ICPRL (In‑Context Physical Reinforcement Learning), a framework inspired by In‑Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in‑context. Our approach trains a vision‑grounded policy model via multi‑turn Group Relative Policy Optimization (GRPO) over diverse multi‑episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial‑and‑error sequences, without requiring any weight updates. This adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of potential actions. At inference, the policy proposes candidate actions, while the world model predicts outcomes to guide a root‑node PUCT search to select the most promising action. Evaluated on the diverse physics‑based puzzle‑solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements across both its (I) policy‑only and (II) world‑model‑augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that our framework facilitates genuine in‑context acquisition of the environment's physical dynamics from interactive experience.

Abstract:
In this paper, we generate conceptual engineering designs of electric vertical take‑off and landing (eVTOL) aircraft. We follow the paradigm of simulation‑based inference (SBI), whereby we look to learn a posterior distribution over the full eVTOL design space. To learn this distribution, we sample over discrete aircraft configurations (topologies) and their corresponding set of continuous parameters. Therefore, we introduce a hierarchical probabilistic model consisting of two diffusion models. The first model leverages recent work on Riemannian Diffusion Language Modeling (RDLM) and Unified World Models (UWMs) to enable us to sample topologies from a discrete and continuous space. For the second model we introduce a masked diffusion approach to sample the corresponding parameters conditioned on the topology. Our approach rediscovers known trends and governing physical laws in aircraft design, while significantly accelerating design generation.

Abstract:
The pursuit of world‑model‑based artificial intelligence has predominantly relied on projecting high‑dimensional observations into parameterized latent spaces, wherein transition dynamics are subsequently learned. However, this conventional paradigm is mathematically flawed: it merely displaces the manifold learning problem into the latent space. When the underlying data distribution shifts, the latent manifold shifts accordingly, forcing the predictive operator to implicitly relearn the new topological structure. Furthermore, by classical approximation theory, positive operators like dot product attention inevitably suffer from the saturation phenomenon, permanently bottlenecking their predictive capacity and leaving them vulnerable to the curse of dimensionality. In this paper, we formulate a mathematically rigorous paradigm for world model construction by redefining the core predictive mechanism. Inspired by Ryan O'Dowd's foundational work, we introduce the Spherical Kernel Operator (SKO), a framework that replaces standard attention. By projecting the unknown data manifold onto a unified ambient hypersphere and utilizing a localized sequence of ultraspherical (Gegenbauer) polynomials, SKO performs direct integral reconstruction of the target function. Because this localized spherical polynomial kernel is not strictly positive, it bypasses the saturation phenomenon, yielding approximation error bounds that depend strictly on the intrinsic manifold dimension q, rather than the ambient dimension. Furthermore, by formalizing its unnormalized output as an authentic measure support estimator, SKO mathematically decouples the true environmental transition dynamics from the biased observation frequency of the agent. Empirical evaluations confirm that SKO significantly accelerates convergence and outperforms standard attention baselines in autoregressive language modeling.

Abstract:
A surgical world model capable of generating realistic surgical action videos with precise control over tool‑tissue interactions can address fundamental challenges in surgical AI and simulation ‑‑ from data scarcity and rare event synthesis to bridging the sim‑to‑real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) ‑‑ a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool‑action context, a reference surgical scene, tissue affordance mask, and 2D tool‑tip trajectories. We design a conditional video diffusion approach that reformulates video‑to‑video diffusion into trajectory‑conditioned surgical action synthesis. The backbone diffusion model is fine‑tuned on a custom‑curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state‑of‑the‑art temporal consistency (CD‑FVD: 199.19 vs. 546.82) and strong visual quality on held‑out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW‑generated videos improves action recognition (clipping F1‑score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool‑tissue interaction videos from simulator‑derived trajectory points toward a visually faithful simulation engine.

Abstract:
World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT‑World, a geometry world model that side‑steps video generation entirely and instead forecasts the temporal evolution of frozen geometry‑foundation‑model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high‑dimensional (d=1024) feature space: (i) standard velocity‑prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean‑target (z‑prediction) parameterization that yields a substantially higher signal‑to‑noise ratio, and the second with a two‑stage latent flow‑forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT‑World significantly outperforms the strongest baselines in depth forecasting while running 3.6‑5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.

Abstract:
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real‑world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual‑tower 4D world model that employs bidirectional cross‑modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high‑fidelity 4D simulator, we present the first unified framework for world‑model‑based policy optimization: (1) Test‑Time Policy Augmentation (TTPA) for pre‑execution verification, (2) Imitative‑Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open‑Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self‑correction. Comprehensive experiments demonstrate RoboStereo achieves state‑of‑the‑art generation quality, with our unified framework delivering >97% average relative improvement on fine‑grained manipulation tasks.

Abstract:
Recent world‑model‑based Vision‑Language‑Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long‑horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high‑level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low‑level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two‑stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low‑level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv‑WidowX and 94.8% on LIBERO. Real‑world deployments further demonstrate reliable task completion and robust generalization across both basic pick‑and‑place and complex long‑horizon tasks.

Abstract:
Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant ‑‑ or even detrimental ‑‑ to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straightened latent trajectories, we jointly learn an encoder and a predictor. We show that reducing curvature this way makes the Euclidean distance in latent space a better proxy for the geodesic distance and improves the conditioning of the planning objective. We demonstrate empirically that temporal straightening makes gradient‑based planning more stable and yields significantly higher success rates across a suite of goal‑reaching tasks.
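
As an illustration of what a curvature regularizer on latent trajectories can look like, the sketch below penalizes the turning angle between consecutive latent displacements, using the mean of `1 - cos(angle)`. The exact form used in the paper is not given in the abstract, so this particular penalty and the name `curvature_penalty` are assumptions.

```python
import math

def curvature_penalty(latents, eps=1e-8):
    """Mean (1 - cosine) between consecutive displacement vectors.

    latents: sequence of at least three equal-length latent vectors z_0..z_T.
    Approximately 0 for a straight trajectory; larger for sharper turns.
    """
    if len(latents) < 3:
        raise ValueError("need at least three latent states")
    # Displacements v_t = z_{t+1} - z_t.
    vel = [[b - a for a, b in zip(p, q)] for p, q in zip(latents, latents[1:])]
    terms = []
    for u, w in zip(vel, vel[1:]):
        dot = sum(x * y for x, y in zip(u, w))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in w))
        terms.append(1.0 - dot / (norm + eps))  # eps guards zero-length steps
    return sum(terms) / len(terms)

straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(curvature_penalty(straight), curvature_penalty(corner))
```

In a training setup this term would be added to the prediction loss with some weight, and computed on encoder outputs with an automatic-differentiation library rather than plain Python lists.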

Abstract:
Generating safety‑critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long‑tail risky situations are rarely observed in real‑world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after‑the‑fact label and struggle to maintain geometric consistency in multi‑view driving scenes. We present RiskMV‑DPO, a general and systematic pipeline for physically‑informed, risk‑controllable multi‑view scenario generation. By integrating target risk levels with physically‑grounded risk modeling, we autonomously synthesize diverse and high‑stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion‑based video generator. To ensure spatial‑temporal coherence and geometric fidelity, we introduce a geometry‑appearance alignment module and a region‑aware direct preference optimization (RA‑DPO) strategy with motion‑aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV‑DPO can freely generate a wide spectrum of diverse long‑tail scenarios while maintaining state‑of‑the‑art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk‑controllable synthesis, providing a scalable toolchain for the safety‑oriented development of embodied intelligence.

Abstract:
Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model‑free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model‑based continual RL algorithm that extends DreamerV3 with a memory‑efficient, distribution‑matching replay buffer. Unlike standard fixed‑size FIFO buffers, ARROW maintains two complementary buffers: a short‑term buffer for recent experiences and a long‑term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model‑free and model‑based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model‑based RL and bio‑inspired approaches for continual reinforcement learning, warranting further research.
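
One way to picture a pair of complementary buffers is a FIFO queue for recent experience alongside a reservoir-sampled store that keeps a uniform sample of everything seen so far. The sketch below uses reservoir sampling as a stand-in for the long-term buffer; this is an assumption, since the abstract does not say how ARROW's sampling works, and the class and method names are hypothetical.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Short-term FIFO buffer plus a reservoir-sampled long-term buffer."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # drops the oldest item when full
        self.long = []
        self.long_capacity = long_capacity
        self.seen = 0                              # total items added so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling: each item seen so far stays in the
            # long-term buffer with probability long_capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, k):
        """Draw k items (with replacement) from the union of both buffers."""
        pool = list(self.short) + self.long
        if not pool:
            raise ValueError("buffer is empty")
        return [self.rng.choice(pool) for _ in range(k)]

buf = DualReplayBuffer(short_capacity=5, long_capacity=10)
for step in range(100):
    buf.add(step)
print(list(buf.short), sorted(buf.long))
```

Memory stays bounded by `short_capacity + long_capacity` regardless of how many tasks are seen, while the long-term store continues to cover early experience.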

Abstract:
Learning predictive world models from raw visual observations is a central challenge in reinforcement learning (RL), especially for robotics and continuous control. Conventional model‑based RL frameworks directly condition future predictions on absolute actions, which makes optimization unstable: the optimal action distributions are task‑dependent, unknown a priori, and often lead to oscillatory or inefficient control. To address this, we introduce the Residual‑Action World Model (ResWM), a new framework that reformulates the control variable from absolute actions to residual actions ‑‑ incremental adjustments relative to the previous step. This design aligns with the inherent smoothness of real‑world control, reduces the effective search space, and stabilizes long‑horizon planning. To further strengthen the representation, we propose an Observation Difference Encoder that explicitly models the changes between adjacent frames, yielding compact latent dynamics that are naturally coupled with residual actions. ResWM is integrated into a Dreamer‑style latent dynamics model with minimal modifications and no extra hyperparameters. Both imagination rollouts and policy optimization are conducted in the residual‑action space, enabling smoother exploration, lower control variance, and more reliable planning. Empirical results on the DeepMind Control Suite demonstrate that ResWM achieves consistent improvements in sample efficiency, asymptotic returns, and control smoothness, significantly surpassing strong baselines such as Dreamer and TD‑MPC. Beyond performance, ResWM produces more stable and energy‑efficient action trajectories, a property critical for robotic systems deployed in real‑world environments. These findings suggest that residual action modeling provides a simple yet powerful principle for bridging algorithmic advances in RL with the practical requirements of robotics.
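
The residual‑action reformulation above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the ResWM code: the function name `rollout_residual_actions` is hypothetical, and clipping the result to the actuator bounds is an assumption that the abstract does not mention.

```python
import numpy as np


def rollout_residual_actions(initial_action, residuals, low, high):
    """Turn a sequence of residual actions into absolute actions.

    Each step applies a_t = clip(a_{t-1} + delta_t, low, high).
    The clipping step is an assumption made for this sketch.
    """
    action = np.asarray(initial_action, dtype=float)
    trajectory = []
    for delta in residuals:
        action = np.clip(action + np.asarray(delta, dtype=float), low, high)
        trajectory.append(action.copy())
    return np.stack(trajectory)


# One-dimensional example: the second step would reach 1.2 and is clipped to 1.0.
absolute = rollout_residual_actions(
    initial_action=[0.0],
    residuals=[[0.5], [0.7], [-0.3]],
    low=-1.0,
    high=1.0,
)
```

Because the policy outputs increments rather than absolute commands, small residuals translate directly into smooth action trajectories, which is the intuition the abstract gives for lower control variance.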

Abstract:
Diffusion policies have been shown to be very efficient at learning complex, multi‑modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, which can lead to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or by learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier‑based framework that steers a pre‑trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self‑supervised process: it uses attention‑based multiple instance learning to automatically estimate which observation‑action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self‑labeled data. During inference, this predictor provides a real‑time gradient to guide the policy toward more robust actions. We validated our proposed PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.

Abstract:
Degradation prognosis for lithium‑ion cells requires forecasting the state‑of‑health (SOH) trajectory over future cycles. Existing data‑driven approaches can produce trajectory outputs through direct regression, but lack a mechanism to propagate degradation dynamics forward in time. This paper formulates battery degradation prognosis as a world model problem, encoding raw voltage, current, and temperature time‑series from each cycle into a latent state and propagating it forward via a learned dynamics transition to produce a future trajectory spanning 80 cycles. To investigate whether electrochemical knowledge improves the learned dynamics, a Single Particle Model (SPM) constraint is incorporated into the training loss. Three configurations are evaluated on the Severson LiFePO4 (LFP) dataset of 138 cells. Iterative rollout halves the trajectory forecast error compared to direct regression from the same encoder. The SPM constraint improves prediction at the degradation knee where the resistance to SOH relationship is most applicable, without changing aggregate accuracy.

Abstract:
World Models (WMs) have emerged as a promising approach for post‑training Vision‑Language‑Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM‑based post‑training methods rely on pixel‑space supervision, making policies sensitive to pixel‑level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post‑training framework that aligns VLA actions directly with WM video‑dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post‑training performance is tied to rollout quality, yet current WMs struggle with arbitrary‑length video generation as they are mostly trained on fixed‑length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM‑based skill‑decomposition pipeline that segments high‑level instructions into low‑level prompts. Our pipeline produces RoboCasa‑Skill and LIBERO‑Skill, supporting skill‑compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T‑N1.6 and Cosmos Policy achieves state‑of‑the‑art results on RoboCasa and LIBERO, and improves real‑world performance by 6.7%, enhancing embodied agent generalization.

Abstract:
Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line‑by‑line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers ‑‑ obtained via fine‑tuning large LLMs or pre‑training smaller models from scratch ‑‑ can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.

Abstract:
Bird's‑eye‑view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego‑centric representation critical for downstream planning and control. However, real‑world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug‑and‑play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift‑Splat‑Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few‑shot fine‑tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.

Abstract:
This work proposes a new formulation to the long‑standing problem of convex decomposition through learning feature fields, enabling the first feed‑forward model for open‑world convex decomposition. Our method produces high‑quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self‑supervised, purely‑geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self‑supervised learning on large datasets resulting in the first learned open‑world model for convex decomposition. Experiments show that our decompositions are higher‑quality than alternatives and generalize across open‑world objects as well as across representations to meshes, CAD models, and even Gaussian splats. https://research.nvidia.com/labs/sil/projects/learning‑convex‑decomp/

Abstract:
Emerging generative world models and vision‑language‑action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long‑horizon forecasting, and capability‑rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high‑dimensional multi‑sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent‑space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross‑cutting internal mechanics (i.e., structural isomorphism, long‑horizon temporal stability, semantic and reasoning alignment, value‑aligned objectives and post‑training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed‑loop metric suite and a resource‑aware deliberation cost, designed to reduce the open‑loop / closed‑loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision‑ready, verifiable, and resource‑efficient automated driving.

Abstract:
LM‑based agents excel when given high‑level action APIs but struggle to ground language into low‑level control. Prior work has LLMs generate skills or reward functions for RL, but these one‑shot approaches lack feedback to correct specification errors. We introduce SCALAR, a bidirectional framework coupling LLM planning with RL through a learned skill library. The LLM proposes skills with preconditions and effects; RL trains policies for each skill and feeds back execution results to iteratively refine specifications, improving robustness to initial errors. Pivotal Trajectory Analysis corrects LLM priors by analyzing RL trajectories; Frontier Checkpointing optionally saves environment states at skill boundaries to improve sample efficiency. On Craftax, SCALAR achieves 88.2% diamond collection, a 1.9x improvement over the best baseline, and reaches the Gnomish Mines 9.1% of the time where prior methods fail entirely.

Abstract:
Action‑conditioned video models offer a promising path to building general‑purpose robot simulators that can improve directly from data. Yet, despite training on large‑scale robot datasets, current state‑of‑the‑art video models still struggle to predict physically consistent robot‑object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high‑fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success‑biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self‑play, enabling naturally scalable data collection while capturing complex, long‑tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high‑quality, physically consistent predictions for contact‑rich interactions that are not captured by world models trained on human‑collected data. We further demonstrate the versatility of PlayWorld in enabling fine‑grained failure prediction and policy evaluation, with up to 40% improvements over human‑collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.

Abstract:
Action‑conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate‑sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent‑space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction‑consistent pixel‑level predictions and support stable long‑horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state‑of‑the‑art imitation policies. Through extensive real‑world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world‑model‑generated data perform comparably to those trained on the same amount of real‑world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real‑world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.

Abstract:
Vision‑Language‑Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The execution of complex multi‑step behaviors in VLA models can be improved by robust instruction grounding, a critical component for effective control. However, current paradigms predominantly rely on coarse, high‑level task instructions during supervised fine‑tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long‑horizon tasks. Therefore, bridging this instruction gap and providing scalable post‑training for VLA models is urgent. To tackle this problem, we propose AtomVLA, the first subtask‑aware VLA framework integrated with a scalable offline post‑training pipeline. Our framework leverages a large language model to decompose high‑level demonstrations into fine‑grained atomic subtasks. This approach utilizes a pretrained predictive world model to score candidate action chunks against subtask goals in the latent space, mitigating error accumulation while significantly improving long‑horizon robustness. Furthermore, this approach enables highly efficient Group Relative Policy Optimization without the prohibitive expenses associated with online rollouts on physical robots. Extensive simulations validate that AtomVLA maintains strong robustness under perturbations. When evaluated against fundamental baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO‑PRO benchmark. Finally, experiments conducted in the real world using the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long‑horizon tasks. All datasets, checkpoints, and code will be released to the public domain following the acceptance of this work for future research.

Abstract:
When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" ‑‑ and what determines this boundary? We study world model‑based self‑monitoring under continuous observation drift across four MuJoCo environments, three detector families (z‑score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families ‑‑ including variance and percentile detectors with no temporal smoothing ‑‑ establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2$ = 0.89‑0.97), but cross‑environment prediction fails ($R^2$ = 0.45), revealing that the missing variable is environment‑specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three‑way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self‑monitoring boundaries in RL agents.
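
To make the z‑score detector family mentioned above concrete, here is a minimal sketch under stated assumptions. It is not the paper's detector: the function name `first_detection`, the windowed mean, and the default threshold are hypothetical choices. The detector calibrates a mean and standard deviation on prediction errors from clean observations, then fires at the first step where the windowed mean error exceeds the calibrated mean by more than `z_threshold` standard deviations.

```python
from collections import deque
from statistics import mean, pstdev


def first_detection(errors, baseline, z_threshold=3.0, window=2):
    """Return the first index at which the smoothed z-score exceeds the threshold, else None."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        raise ValueError("baseline errors must not be constant")
    recent = deque(maxlen=window)  # sliding window of the latest prediction errors
    for t, error in enumerate(errors):
        recent.append(error)
        if len(recent) == window:
            z = (mean(recent) - mu) / sigma
            if z > z_threshold:
                return t
    return None


# Prediction errors of the world model on clean observations (made-up numbers).
clean_errors = [0.9, 1.0, 1.1, 1.0, 0.9, 1.1]
# Errors under slow drift: the detector stays silent until the drift is large enough.
drifting = first_detection([1.0, 1.05, 1.0, 1.6, 1.7, 1.8], clean_errors)
quiet = first_detection([1.0, 1.05, 0.95, 1.0], clean_errors)
```

In this toy setting, small deviations are absorbed as normal variation and the detector only fires once the error leaves the calibrated noise floor, mirroring the threshold behaviour the abstract describes.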

Abstract:
Recent advances in Vision‑Language‑Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token‑level MoE mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token‑based expert specialization and scene‑level decision‑making. To address this, we propose SAMoE‑VLA, a scene‑adaptive Vision‑Language‑Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's‑eye‑view (BEV) features that encapsulate traffic scene context, enabling scenario‑dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world‑knowledge, perception, language, and action, we introduce a Conditional Cross‑Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open‑loop planning dataset and LangAuto closed‑loop benchmark demonstrate that SAMoE‑VLA achieves state‑of‑the‑art performance, outperforming prior VLA‑based and world‑model‑based approaches with fewer parameters. Our code will be released soon.

Abstract:
As large language models (LLMs) evolve into autonomous agents capable of acting in open‑ended environments, ensuring behavioral alignment with human values becomes a critical safety concern. Existing benchmarks, focused on static, single‑turn prompts, fail to capture the interactive and multi‑modal nature of real‑world conflicts. We introduce ConflictBench, a benchmark for evaluating human‑AI conflict through 150 multi‑turn scenarios derived from prior alignment queries. ConflictBench integrates a text‑based simulation engine with a visually grounded world model, enabling agents to perceive, plan, and act under dynamic conditions. Empirical results show that while agents often act safely when human harm is immediate, they frequently prioritize self‑preservation or adopt deceptive strategies in delayed or low‑risk settings. A regret test further reveals that aligned decisions are often reversed under escalating pressure, especially with visual input. These findings underscore the need for interaction‑level, multi‑modal evaluation to surface alignment failures that remain hidden in conventional benchmarks.

Abstract:
Accurate intraoperative navigation is essential for robot‑assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision‑only autonomy framework that performs long‑horizon bronchoscopic navigation using preoperative CT‑derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long‑short agents: a short‑term reactive agent for continuous low‑latency motion control, and a long‑term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world‑model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high‑fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor‑free autonomous bronchoscopic navigation.

Abstract:
Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce Symmetry Exploration, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian‑based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian‑based world model that learns from the collected data, using a novel self‑supervised contrastive objective to identify the invariant physical state from raw, view‑dependent pixel observations. Our framework, DreamSAC, trained on this actively curated data, significantly outperforms state‑of‑the‑art baselines in 3D physics simulations on tasks requiring extrapolation.

Abstract:
Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long‑horizon exploration. Yet, despite rapid advances in learning‑based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth‑limited communication, and energy scarcity are not independent challenges; they interact within the closed perception‑planning‑control loop and often amplify one another over time. This Review develops a constraint‑coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief‑aware planning, hybrid control, multi‑robot coordination, and foundation‑model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross‑layer coupling. To unify these observations, we introduce a cross‑layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics‑grounded world models, certifiable learning‑enabled control, communication‑aware coordination, and deployment‑aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance‑driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.

Abstract:
Data‑efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large‑scale real‑world interaction. Although world‑model‑based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State‑Space Model (RSSM) and propose a kinematics‑aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry‑aware supervision regularizes the RSSM latent state to capture task‑relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long‑horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model‑free and pixel‑based world‑model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM‑based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.

Abstract:
Model‑based reinforcement learning (MBRL) agents operating in high‑dimensional observation spaces, such as Dreamer, rely on learning abstract representations for effective planning and control. Existing approaches typically employ reconstruction‑based objectives in the observation space, which can render representations sensitive to task‑irrelevant details. Recent alternatives trade reconstruction for auxiliary action prediction heads or view augmentation strategies, but perform worse in the Crafter environment than reconstruction‑based methods. We close this gap between Dreamer and reconstruction‑free models by introducing a JEPA‑style predictor defined on continuous, deterministic representations. Our method matches Dreamer's performance on Crafter, demonstrating effective world model learning on this benchmark without reconstruction objectives.

Abstract:
Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces comprising high‑dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history‑informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non‑conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push‑T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out‑of‑distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth of the trainable parameters of the next‑best learning‑based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real‑world environments where reliability is non‑negotiable.
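
The use of uncertainty estimates as non‑conformity scores can be illustrated with standard split conformal prediction. The sketch below is an assumption about how such a runtime monitor might be calibrated, not the authors' implementation; the function names `conformal_threshold` and `monitor` and the example scores are hypothetical. Given n calibration scores from failure‑free rollouts and a miscoverage level alpha, the threshold is the ceil((n + 1)(1 - alpha))‑th smallest calibration score, and a step is flagged when its score exceeds that threshold.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: nothing can be flagged.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def monitor(scores, threshold):
    """Flag every step whose non-conformity score exceeds the threshold."""
    return [score > threshold for score in scores]


# Made-up uncertainty scores from nominal (failure-free) rollouts.
calibration = list(range(1, 20))  # n = 19
threshold = conformal_threshold(calibration, alpha=0.1)
flags = monitor([5, 18, 19, 30], threshold)
```

Under the usual exchangeability assumption, a nominal step exceeds this threshold with probability at most alpha, which is what lets the monitor bound its false‑alarm rate.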

Abstract:
Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the system, a persistent state operating independent of the model's context window, that is continually updated by user actions and queried throughout the generation roll‑out. Unlike conventional diffusion game engines that operate as next‑frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via an editable memory representation, and it naturally extends to real‑time multiplayer rollouts with coherent viewpoints and consistent cross‑player interactions.

Abstract:
Situated reasoning often relies on active exploration, yet in many real‑world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what‑if questions? We introduce WanderDream, the first large‑scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream‑Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real‑world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream‑QA contains 158K question‑answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration‑based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream‑Gen, (3) that imagination substantially facilitates reasoning on WanderDream‑QA, and (4) that WanderDream data exhibit remarkable transferability to real‑world scenarios. The source code and all data will be released.

Abstract:
Diffusion‑based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long‑horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single‑modal diffusion transfer poorly to world models due to two world‑model‑specific obstacles: token heterogeneity from multi‑modal coupling and spatial variation, and non‑uniform temporal dynamics where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose WorldCache, a caching framework tailored to diffusion world models. We introduce Curvature‑guided Heterogeneous Token Prediction, which uses a physics‑grounded curvature score to estimate token predictability and applies a Hermite‑guided damped predictor for chaotic tokens with abrupt direction changes. We also design Chaotic‑prioritized Adaptive Skipping, which accumulates a curvature‑normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to 3.7× end‑to‑end speedups while maintaining 98% rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource‑constrained scenarios. Our code is released at https://github.com/FofGofx/WorldCache.

Abstract:
Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short‑horizon frame transitions and capture low‑level motion while overlooking longer‑term temporal structure. In contrast, actionless videos often contain temporally extended and high‑level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long‑term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low‑level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high‑level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.

Abstract:
World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision‑time planning remains computationally prohibitive for real‑time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource‑intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action‑conditioned world model that uses the CompACT tokenizer achieves competitive planning performance with orders‑of‑magnitude faster planning, offering a practical step toward real‑world deployment of world models.

Abstract:
Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time‑resolved multi‑channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame‑wise classification alone. We introduce BLINK, a trajectory‑based recurrent state‑space model that serves as a cell world model for NK‑tumor interactions. BLINK learns latent interaction dynamics from partially observed NK‑tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long‑term time‑lapse NK‑tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single‑cell level.

Abstract:
"Dreaming" lets agents learn from imagined experiences, enabling more robust and sample‑efficient learning of world models. In this work, we consider innovations to the state‑of‑the‑art Dreamer model using probabilistic methods that enable: (1) the parallel exploration of many latent states; and (2) maintaining distinct hypotheses for mutually exclusive futures while retaining the desirable gradient properties of continuous latents. Evaluating on the MPE SimpleTag domain, our method outperforms standard Dreamer with a 4.5% score improvement and 28% lower variance in episode returns. We also discuss limitations and directions for future work, including how optimal hyperparameters (e.g. particle count K) scale with environmental complexity, and methods to capture epistemic uncertainty in world models.

Abstract:
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world‑like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co‑occurrence‑based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held‑out R^2 values of 0.71‑0.87 for city coordinates and 0.48‑0.52 for historical birth years. Semantic‑neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate‑related vocabulary. These findings suggest that ordinary word co‑occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world‑shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.

Abstract:
Decision transformer based sequential policies have emerged as a powerful paradigm in offline reinforcement learning (RL), yet their efficacy remains constrained by the quality of static datasets and inherent architectural limitations. Specifically, these models often struggle to effectively integrate suboptimal experiences and fail to explicitly plan for an optimal policy. To bridge this gap, we propose Imaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi‑optimal value function from the offline data. These components are utilized to identify suboptimal trajectories and augment them with reliable, imagined optimal rollouts generated via Model Predictive Control (MPC). A Transformer‑based sequential policy is then trained on this enriched dataset, complemented by a value‑guided objective that promotes the distillation of the optimal policy. By replacing the conventional, manually‑tuned return‑to‑go with the learned quasi‑optimal value function, IPD improves both decision‑making stability and performance during inference. Empirical evaluations on the D4RL benchmark demonstrate that IPD significantly outperforms several state‑of‑the‑art value‑based and transformer‑based offline RL methods across diverse tasks.

Abstract:
As learning‑based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Biologically inspired, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model‑based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out‑of‑distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task‑level performance signals and internal training metrics, allowing convergence to be assessed without external supervision or domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high‑fidelity simulation, and a real‑world model vehicle. Relevant metrics and their interpretation are presented and discussed, along with the resulting trade‑offs. The results sketch out how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self‑reflection and improvement during operation, just like their biological counterparts.
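
The residual-triggered adaptation described above can be illustrated with a small sketch. The abstract does not specify the detection rule, so everything here is an assumption made for illustration: the `ResidualMonitor` class, the mean-plus-k-standard-deviations threshold, and the sliding window are hypothetical and are not the authors' method.

```python
from collections import deque
from statistics import mean, stdev


class ResidualMonitor:
    """Flags a possible out-of-distribution event from world-model residuals.

    Hypothetical rule: trigger when the mean residual over a full sliding
    window exceeds mean + k * stdev of residuals seen during calibration.
    """

    def __init__(self, calibration_residuals, k=3.0, window=5):
        self.threshold = mean(calibration_residuals) + k * stdev(calibration_residuals)
        self.recent = deque(maxlen=window)

    def update(self, residual):
        """Record one prediction residual; return True if finetuning should start."""
        self.recent.append(residual)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and mean(self.recent) > self.threshold


# Residuals measured while the world model is still accurate (made-up numbers).
monitor = ResidualMonitor([0.10, 0.12, 0.09, 0.11, 0.10])

# Nominal operation, followed by a sudden change in the environment.
nominal_flags = [monitor.update(r) for r in [0.10] * 5]
shifted_flags = [monitor.update(r) for r in [1.0] * 5]
```

Averaging over a window rather than reacting to a single residual is one simple way to avoid triggering on isolated noisy predictions.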

Abstract:
World models are central to LLM agents that must evaluate actions over long horizons. Yet much existing work focuses on environments governed by physical dynamics or spatial structure, whereas many high‑impact domains, including supply chains, procurement networks, and business processes, evolve through discrete events, timing constraints, and causal dependencies. These settings call for discrete‑event world models. Existing approaches to constructing world models often fall near two extremes: hand‑engineered simulators provide consistency and reproducibility, but are costly to build and adapt; neural models are flexible, but can suffer from compounding inconsistency over long‑horizon rollouts. We seek a principled middle ground by synthesizing discrete‑event world models online from natural‑language specifications, retaining the reliability of explicit simulators while gaining the adaptability of neural models. We adopt the DEVS formalism and introduce a staged LLM‑based generation pipeline that separates structural inference over component interactions from component‑level event and timing logic. For evaluation, we develop benchmark suites in which simulators emit structured event traces, which are then validated against specification‑derived temporal, causal, and semantic constraints. This enables reproducible verification and localized diagnostics. Together, these contributions produce world models that remain consistent over long‑horizon rollouts, can be verified from observable behavior, and can be synthesized efficiently on demand during online execution.
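
To make the idea of validating structured event traces against temporal and causal constraints concrete, here is a minimal sketch. It is a generic event-queue simulator, not a full DEVS implementation (there are no atomic models, time-advance functions, or couplings), and the event names, delays, and the `check_precedence` helper are hypothetical.

```python
import heapq


def simulate(initial_events, handlers, horizon):
    """Run a tiny discrete-event simulation and return the event trace.

    initial_events: list of (time, name) tuples.
    handlers: maps an event name to a list of (delay, follow_up_name) pairs.
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []
    while queue:
        time, name = heapq.heappop(queue)
        if time > horizon:
            break
        trace.append((time, name))
        for delay, follow_up in handlers.get(name, []):
            heapq.heappush(queue, (time + delay, follow_up))
    return trace


def check_precedence(trace, cause, effect, min_delay):
    """Check that each effect follows an earlier unmatched cause by >= min_delay."""
    pending = []
    for time, name in trace:
        if name == cause:
            pending.append(time)
        elif name == effect:
            if not pending or time - pending.pop(0) < min_delay:
                return False
    return True


# Hypothetical supply-chain rules: shipping takes 2 time units, delivery 3 more.
HANDLERS = {
    "order_placed": [(2.0, "order_shipped")],
    "order_shipped": [(3.0, "order_delivered")],
}
trace = simulate([(0.0, "order_placed"), (1.0, "order_placed")], HANDLERS, horizon=10.0)
shipping_ok = check_precedence(trace, "order_placed", "order_shipped", min_delay=2.0)
```

Because the check reads only the emitted trace, it works the same whether the simulator was written by hand or generated, which is the property the abstract relies on for verification from observable behavior.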

Abstract:
Recent video diffusion models have achieved impressive capabilities as large‑scale generative world models. However, these models often struggle with fine‑grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics‑consistent 4D world representations from video diffusion models. Phys4D adopts a three‑stage training paradigm that progressively lifts appearance‑driven video diffusion models into physics‑consistent 4D world representations. We first bootstrap robust geometry and motion representations through large‑scale pseudo‑supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics‑grounded supervised fine‑tuning using simulation‑generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation‑grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine‑grained physical consistency beyond appearance‑based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long‑horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine‑grained spatiotemporal and physical consistency compared to appearance‑driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational‑brioche‑7657e7.netlify.app/

Abstract:
Interactive world models continually generate video by responding to a user's actions, enabling open‑ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to down‑stream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long‑horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine‑grained, geometry‑aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

Abstract:
Offline meta‑reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context‑based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self‑supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task‑conditioned temporal consistency, yielding task representations that capture task‑dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual‑DeepMind Control, and Meta‑World benchmarks.

Abstract:
Capturing temporal dependencies is critical for model‑based reinforcement learning (MBRL) in partially observable, high‑dimensional domains. We introduce NE‑Dreamer, a decoder‑free MBRL agent that leverages a temporal transformer to predict next‑step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE‑Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE‑Dreamer matches or exceeds the performance of DreamerV3 and leading decoder‑free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE‑Dreamer achieves substantial gains. These results establish next‑embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.

Abstract:
As artificial agents become increasingly capable, what internal structure is necessary for an agent to act competently under uncertainty? Classical results show that optimal control can be implemented using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that strong task performance (low average‑case regret) forces world models, belief‑like memory and, under task mixtures, persistent variables resembling core primitives associated with emotion, along with informational modularity under block‑structured tasks. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high‑margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of predictive state and belief‑like memory, addressing an open question in prior world‑model recovery work.

Abstract:
Precision oncology is currently limited by the small‑N, large‑P paradox, where high‑dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro‑Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with an LLM‑based reasoning layer. Utilising a zero‑leakage forensic pipeline on the Sanger GDSC dataset (N = 83), we achieve a robust predictive correlation (r = 0.447, p = 2.30e‑05). We identify a Symbolic Scaffold effect, where the explicit modelling of clinical context (MSI status) provides a 3.6 percent gain in fidelity in data‑sparse regimes. Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape, identifying a hierarchical dominance of the APC/Wnt‑axis over the p53 apoptotic pathway. Validated against human clinical profiles (TCGA‑COAD proxy, p = 0.0357), our framework provides a transparent, invertible, and biologically grounded path towards explainable AI in oncology.

Abstract:
World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation‑like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social‑JEPA‑5C57.

Abstract:
Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera‑guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global‑geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial‑stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine‑grained details from the memory bank. These components enable WorldStereo to generate multi‑view‑consistent videos under precise camera control, facilitating high‑quality 3D reconstruction. Furthermore, the flexible control branch‑based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera‑guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high‑fidelity 3D results. Models will be released.

Abstract:
While Vision‑Language‑Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain‑of‑Thought (CoT) leads to semantic‑perceptual decoupling and perceptual‑symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics‑agnostic representation. To address this, we propose the Latent Spatio‑Temporal VLA (LaST‑VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio‑Temporal CoT. By implementing a dual‑feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The model is trained with a progressive SFT strategy that transitions from feature alignment to trajectory generation, and is then refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST‑VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial‑temporal reasoning on SURDS and NuDynamics benchmarks.

Abstract:
Mathematical modeling plays a vital role in epidemiology, offering insights into the spread and control of infectious diseases. The compartmental models developed by Kermack and McKendrick, particularly the SI (Susceptible‑Infected) and SIR (Susceptible‑Infected‑Recovered) models, form the basis of many epidemic studies. While some simple cases permit analytical solutions, most real‑world models require numerical methods such as Euler's method, the fourth‑order Runge‑Kutta (RK4) method, and Predictor‑Corrector (P‑C) methods. These methods are typically implemented in scientific computing software like Python, MATLAB, and R. However, the computational efficiency and run‑time performance of these software tools in solving epidemiological models have not been comprehensively compared in the literature. This study addresses this gap by solving the SI and SIR models using Euler's method, RK4, and P‑C methods in Python, MATLAB, and R. Execution times are recorded for each implementation to evaluate computational efficiency. Additionally, for the SI model, where an exact analytical solution exists, R^2 values are computed to assess numerical accuracy. For the SIR model, a high‑accuracy reference solution is obtained by solving the system using MATLAB's ODE45 solver, and the SIR solutions computed via the RK4 method in MATLAB are compared against this reference. The results provide a comparative perspective on the accuracy and run‑time performance across different software and numerical methods, offering practical guidance for researchers and practitioners in selecting suitable tools for epidemic modeling.
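
The accuracy comparison for the SI model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: it uses the normalized form di/dt = beta * i * (1 - i) with its logistic exact solution, and the parameter values, step size, and function names are chosen here for the example.

```python
import math


def si_rhs(i, beta):
    """SI model for the infected fraction i: di/dt = beta * i * (1 - i)."""
    return beta * i * (1.0 - i)


def si_exact(t, i0, beta):
    """Closed-form (logistic) solution of the normalized SI model."""
    return i0 / (i0 + (1.0 - i0) * math.exp(-beta * t))


def euler(i0, beta, h, steps):
    """Explicit Euler integration with fixed step size h."""
    values = [i0]
    for _ in range(steps):
        i = values[-1]
        values.append(i + h * si_rhs(i, beta))
    return values


def rk4(i0, beta, h, steps):
    """Classical fourth-order Runge-Kutta integration with fixed step size h."""
    values = [i0]
    for _ in range(steps):
        i = values[-1]
        k1 = si_rhs(i, beta)
        k2 = si_rhs(i + 0.5 * h * k1, beta)
        k3 = si_rhs(i + 0.5 * h * k2, beta)
        k4 = si_rhs(i + h * k3, beta)
        values.append(i + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return values


def r_squared(reference, approx):
    """Coefficient of determination of approx against the reference values."""
    ref_mean = sum(reference) / len(reference)
    ss_res = sum((r - a) ** 2 for r, a in zip(reference, approx))
    ss_tot = sum((r - ref_mean) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Assumed example parameters: 1% initially infected, beta = 0.5, step h = 0.5.
I0, BETA, H, STEPS = 0.01, 0.5, 0.5, 60
exact = [si_exact(k * H, I0, BETA) for k in range(STEPS + 1)]
r2_euler = r_squared(exact, euler(I0, BETA, H, STEPS))
r2_rk4 = r_squared(exact, rk4(I0, BETA, H, STEPS))
```

With the same step size, RK4 tracks the exact curve more closely than Euler, so `r2_rk4` comes out closer to 1 than `r2_euler`. Wrapping each solver call in `time.perf_counter()` readings would give the kind of execution-time measurement the study describes.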

Abstract:
World models aim to capture the states and dynamics of an environment in a compact latent space. Moreover, using Boolean state representations is particularly useful for search heuristics and symbolic reasoning and planning. Existing approaches keep latents informative via decoder‑based reconstruction, or instead via contrastive or reward signals. In this work, we introduce Discrete World Models via Regularization (DWMR): a reconstruction‑free and contrastive‑free method for unsupervised Boolean world‑model learning. In particular, we introduce a novel world‑modeling loss that couples latent prediction with specialized regularizers. Such regularizers maximize the entropy and independence of the representation bits through variance, correlation, and coskewness penalties, while simultaneously enforcing a locality prior for sparse action changes. To enable effective optimization, we also introduce a novel training scheme improving robustness to discrete roll‑outs. Experiments on two benchmarks with underlying combinatorial structure show that DWMR learns more accurate representations and transitions than reconstruction‑based alternatives. Finally, DWMR can also be paired with an auxiliary reconstruction decoder, and this combination yields additional gains.

Abstract:
Recent advances in video generation have spurred the development of world models capable of simulating 3D‑consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real‑time, action‑controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state‑of‑the‑art techniques, including causal distillation and diffusion forcing, to achieve real‑time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single‑player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion‑based world models.

Abstract:
A major challenge for world models in multi‑agent systems is to understand interdependent agent dynamics, predict interactive multi‑agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this paper, MetaMind, a general and cognitive world model for multi‑agent systems that leverages a novel meta‑theory of mind (Meta‑ToM) framework, is proposed. Through MetaMind, each agent learns not only to predict and plan over its own beliefs, but also to inversely reason goals and beliefs from its own behavior trajectories. This self‑reflective, bidirectional inference loop enables each agent to learn a metacognitive ability in a self‑supervised manner. Then, MetaMind is shown to generalize the metacognitive ability from first‑person to third‑person through analogical reasoning. Thus, in multi‑agent systems, each agent with MetaMind can actively reason about goals and beliefs of other agents from limited, observable behavior trajectories in a zero‑shot manner, and then adapt to emergent collective intention without an explicit communication mechanism. Extended simulation results on diverse multi‑agent tasks demonstrate that MetaMind can achieve superior task performance and outperform baselines in few‑shot multi‑agent generalization.

Abstract:
NeuroHex is a brain‑inspired hexagonal coordinate system designed to support highly efficient world models and reference frames for online adaptive AI systems. Inspired by the hexadirectional firing structure of grid cells in the human brain, NeuroHex adopts a cubic isometric hexagonal coordinate formulation that provides full 60° rotational symmetry and low‑cost translation, rotation and distance computation. We develop a mathematical framework that incorporates ring indexing, quantized angular encoding, and a hierarchical library of foundational, simple, and complex geometric shape primitives. These constructs allow low‑overhead point‑in‑shape tests and spatial matching operations that are expensive in Cartesian coordinate systems. To support realistic settings, we also develop a novel tool (OSM2Hex) that can process OpenStreetMap (OSM) data sets and convert them into the NeuroHex coordinate system. The OSM2Hex spatial abstraction processing pipeline can achieve a reduction of 90‑99% in geometric complexity while maintaining the relevant spatial structure map for navigation. Our initial results, based on actual city and neighborhood scale data sets, demonstrate that NeuroHex offers a highly efficient substrate for building dynamic world models to enable adaptive spatial reasoning in autonomous energy‑efficient AI systems with continuous online‑adaptive learning (COAL) capability.
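
The low-cost operations mentioned above are easy to see in standard cube coordinates for hexagonal grids, where each cell is a triple (x, y, z) with x + y + z = 0. The sketch below shows that textbook arithmetic only; it is not NeuroHex's actual formulation, and the `Hex` type and function names are assumptions made for illustration.

```python
from typing import NamedTuple


class Hex(NamedTuple):
    """A hexagonal cell in cube coordinates; the invariant is x + y + z == 0."""
    x: int
    y: int
    z: int


def make_hex(x, y):
    """Build a cell from two coordinates; the third follows from the invariant."""
    return Hex(x, y, -x - y)


def translate(a, b):
    """Translation is component-wise integer addition."""
    return Hex(a.x + b.x, a.y + b.y, a.z + b.z)


def distance(a, b):
    """Number of hex steps between two cells."""
    return max(abs(a.x - b.x), abs(a.y - b.y), abs(a.z - b.z))


def rotate60(h):
    """Rotate one 60-degree step about the origin: permute and negate."""
    return Hex(-h.z, -h.x, -h.y)


def ring(h):
    """Ring index of a cell: its hex distance from the origin."""
    return distance(h, Hex(0, 0, 0))


start = make_hex(2, -3)  # Hex(2, -3, 1), which lies on ring 3
turned = start
for _ in range(6):       # six 60-degree steps make a full turn
    turned = rotate60(turned)
```

Rotation needs no trigonometry and distance needs no square root, which is the kind of saving over Cartesian coordinates the abstract points to.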

Abstract:
The next generation of autonomous agents must not only learn efficiently but also act reliably and adapt their behavior in open worlds. Standard approaches typically assume fixed tasks and environments with little or no novelty, which limits world models' ability to support agents that must evolve their policies as conditions change. This paper outlines a vision for foundation world models: persistent, compositional representations that unify reinforcement learning, reactive/program synthesis, and abstraction mechanisms. We propose an agenda built around four components: (i) learnable reward models from specifications to support optimization with clear objectives; (ii) adaptive formal verification integrated throughout learning; (iii) online abstraction calibration to quantify the reliability of the model's predictions; and (iv) test‑time synthesis and world‑model generation guided by verifiers. Together, these components enable agents to synthesize verifiable programs, derive new policies from a small number of interactions, and maintain correctness while adapting to novelty. The resulting framework positions foundation world models as a substrate for learning, reasoning, and adaptation, laying the groundwork for agents that not only act well but can explain and justify the behavior they adopt.

Abstract:
The evolution of video generation toward complex, multi‑shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single‑shot paradigms, lacking the comprehensive story assets and cross‑shot metrics required to assess long‑form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi‑Shot Video generation. We propose a hybrid evaluation framework that synergizes the high‑level semantic reasoning of Large Multimodal Models (LMMs) with the fine‑grained perceptual rigor of domain‑specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models, despite strong visual fidelity, primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state‑of‑the‑art Spearman's rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine‑tuning a lightweight model on its pipeline‑refined reasoning traces yields human‑aligned performance comparable to commercial models like Gemini‑2.5‑Flash.
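
For readers unfamiliar with the agreement measure quoted above, Spearman's rank correlation compares how two score lists order the same items. The sketch below uses the textbook no-ties formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)); the scores are made-up numbers, not benchmark data, and the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman's rho for two equally long score lists without ties."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Made-up automatic scores and human ratings for five hypothetical videos.
metric_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
human_ratings = [1.0, 3.0, 4.0, 2.0, 5.0]
rho = spearman(metric_scores, human_ratings)
```

Here the two lists disagree only on the order of the top two videos, so `rho` is high but below 1. Real evaluations often contain tied scores, which this simplified formula does not handle.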

Abstract:
With advances in imitation learning (IL) and large‑scale driving datasets, end‑to‑end autonomous driving (E2E‑AD) has made great progress recently. Currently, IL‑based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long‑tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E‑AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk‑aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low‑risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk‑aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low‑risk candidate actions at test time, we introduce a self‑evaluation distillation method to distill risk‑avoidance capabilities from the well‑trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state‑of‑the‑art methods in both in‑distribution and out‑of‑distribution scenarios, while providing superior decision interpretability.
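
The predict-then-select loop described above can be shown with a toy example. Everything in it is an assumption made for illustration: the constant-acceleration `rollout` merely stands in for a learned world model, and the gap-based `risk` score and tie-breaking rule are hypothetical, not RaWMPC's actual components.

```python
def rollout(gap, speed, accel, dt=0.5, steps=6):
    """Toy stand-in for a world model: predict the gap to a stopped obstacle.

    Returns the predicted gap after each step under constant acceleration,
    with speed clipped at zero so the vehicle never reverses.
    """
    gaps = []
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)
        gap -= speed * dt
        gaps.append(gap)
    return gaps


def risk(gaps, safe_gap=2.0):
    """Hypothetical risk score: how far the smallest gap falls below safe_gap."""
    return max(0.0, safe_gap - min(gaps))


def select_action(gap, speed, candidates):
    """Pick the lowest-risk candidate; break ties with the gentlest action."""
    return min(candidates, key=lambda a: (risk(rollout(gap, speed, a)), abs(a)))


CANDIDATES = [-3.0, -1.0, 0.0, 1.0]  # candidate accelerations (made-up values)
near_choice = select_action(gap=10.0, speed=5.0, candidates=CANDIDATES)
far_choice = select_action(gap=100.0, speed=5.0, candidates=CANDIDATES)
```

With the obstacle close, only hard braking keeps the predicted gap above the margin, so it is selected; with the obstacle far away every candidate has zero risk and the tie-break keeps the current speed.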

Abstract:
Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello playing neural‑networks test world‑model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed‑variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed‑game data do not partition their capacity into isolated sub‑models; instead, they converge on a mostly shared board‑state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game‑agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
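
To make the "equivalent up to a single orthogonal rotation" claim concrete, here is a minimal 2-D sketch of fitting a rotation between two paired sets of vectors (the two-dimensional case of orthogonal Procrustes). The points and the angle are invented; real model activations are high-dimensional and would need an SVD-based solver instead.

```python
import math


def rotate(p, theta):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def fit_rotation(xs, ys):
    """Least-squares angle theta such that rotate(x, theta) ~ y for each pair."""
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(xs, ys))
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(xs, ys))
    return math.atan2(cross, dot)


# Hypothetical "representations" of the same board features in two variants.
variant_a = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
true_theta = 0.7
variant_b = [rotate(p, true_theta) for p in variant_a]

theta = fit_rotation(variant_a, variant_b)
aligned = [rotate(p, theta) for p in variant_a]
max_error = max(math.dist(a, b) for a, b in zip(aligned, variant_b))
print(round(theta, 3), max_error < 1e-9)
```

If one fitted rotation aligns the two sets with near-zero residual, the representations carry the same information in a different basis.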

Abstract:
The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data‑driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW‑Bench, a benchmark centered on multi‑frame reasoning and generation scenarios. CoW‑Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.

Abstract:
Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function γ: S × A → S. Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over γ. In contrast, recent Transformer‑based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action‑sequence prediction, bypassing explicit transition modeling. While effective on in‑distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long‑horizon settings due to the absence of explicit world‑state evolution. In this work, we formulate generalized planning as a transition‑model learning problem, in which a neural model explicitly approximates the successor‑state function γ̂ ≈ γ and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size‑invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out‑of‑distribution satisficing‑plan success than direct action‑sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.
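
The idea of planning by rolling out an explicit successor-state function can be illustrated with a toy sketch. Everything here is an assumption for illustration: `gamma_hat` is a hand-written function standing in for the learned neural model, the corridor-and-key domain is invented, and breadth-first search replaces the paper's autoregressive state prediction.

```python
from collections import deque


def gamma_hat(state, action):
    """Hypothetical successor-state function; returns None if inapplicable.

    state is (position, has_key) on a corridor with cells 0..4.
    """
    pos, has_key = state
    if action == "right" and pos < 4:
        return (pos + 1, has_key)
    if action == "left" and pos > 0:
        return (pos - 1, has_key)
    if action == "pick" and pos == 0 and not has_key:
        return (pos, True)
    return None


def plan(start, is_goal, actions=("left", "right", "pick")):
    """Breadth-first rollout of state trajectories through gamma_hat."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        for action in actions:
            nxt = gamma_hat(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


# Fetch the key at cell 0, then reach cell 4.
steps = plan((2, False), lambda s: s == (4, True))
print(steps)
```

Because every intermediate state is tracked explicitly, the planner cannot drift away from the true world state as long as the successor function is accurate.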

Abstract:
A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine‑tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine‑tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard‑mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard‑negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal‑edit negatives ‑‑ cases where a single word changes the physical outcome ‑‑ and achieves a higher AUC‑ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold‑path actions against all valid environment actions during task execution. Under out‑of‑distribution stress conditions, CWM maintains a significantly better safety margin (‑2.39) than SFT (‑3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.
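
For reference, a minimal sketch of an InfoNCE loss over one valid action and a set of negatives is shown below. The scores and temperature are invented, and the scalar-score interface is an assumption; the paper's actual scorer is a fine-tuned LLM.

```python
import math


def info_nce(pos_score, neg_scores, temperature=1.0):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[0]


# Hypothetical scores: one valid action against two kinds of negatives.
valid = 2.0
hard_negatives = [1.8, 0.1]    # e.g. a one-word edit that breaks feasibility
easy_negatives = [-3.0, -4.0]  # clearly unrelated actions

loss_hard = info_nce(valid, hard_negatives)
loss_easy = info_nce(valid, easy_negatives)
print(round(loss_hard, 3), round(loss_easy, 3))
```

Hard negatives score close to the valid action, so they dominate the loss and supply most of the training signal, which is the motivation for mining them.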

Abstract:
Can an LLM learn how an optimizer behaves ‑‑ and use that knowledge to control it? We extend Code World Models (CWMs), LLM‑synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of (1+1)‑RLS_k, the LLM synthesizes a simulator of the optimizer's dynamics; greedy planning over this simulator then selects the mutation strength k at each step. On LeadingOnes and OneMax, CWM‑greedy performs within 6% of the theoretically optimal policy ‑‑ without ever seeing optimal‑policy trajectories. On Jump_k, where a deceptive valley causes all adaptive baselines to fail (0% success rate), CWM‑greedy achieves 100% success rate ‑‑ without any collection policy using oracle knowledge of the gap parameter. On the NK‑Landscape, where no closed‑form model exists, CWM‑greedy outperforms all baselines across fifteen independently generated instances (36.94 vs. 36.32; p<0.001) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs. 500 online episodes), success rate (100% vs. 58%), and generalization (k=3: 78% vs. 0%). Robustness experiments confirm stable synthesis across 5 independent runs.
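
Greedy selection of the mutation strength can be sketched on OneMax, where the transition model has a closed form. This is a hand-derived model used for illustration, not an LLM-synthesized one; it assumes that RLS_k flips exactly k distinct bits chosen uniformly at random and keeps the offspring only if fitness does not decrease.

```python
from math import comb


def expected_gain(n, f, k):
    """Expected one-step fitness gain on OneMax with n bits and f ones.

    If i of the k flipped bits were zeros, fitness changes by 2*i - k;
    only strictly positive changes contribute under elitist selection.
    """
    total = 0.0
    for i in range(k + 1):
        gain = 2 * i - k
        if gain > 0:
            # Hypergeometric probability of hitting exactly i zero bits.
            prob = comb(n - f, i) * comb(f, k - i) / comb(n, k)
            total += gain * prob
    return total


def greedy_k(n, f, ks=(1, 2, 3, 4, 5)):
    """Pick the mutation strength with the largest expected one-step gain."""
    return max(ks, key=lambda k: expected_gain(n, f, k))


n = 100
k_far = greedy_k(n, 50)   # far from the optimum
k_near = greedy_k(n, 99)  # one bit away from the optimum
print(k_far, k_near)
```

Far from the optimum, flipping several bits pays off; close to it, single-bit flips are the only moves with positive expected gain.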

Abstract:
Existing action‑conditioned video generation models (video world models) are limited to single‑agent perspectives, failing to capture the multi‑agent interactions of real‑world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi‑view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single‑player settings, our system supports coordinated multi‑agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single‑player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory‑efficient Self Forcing variant that enables a longer‑horizon teacher. Results show our architecture and training design outperform existing baselines. Through open‑sourcing our system and models, we hope to lay the groundwork for a new generation of multi‑agent world models.

Abstract:
A key challenge in artificial intelligence and neuroscience is understanding how neural systems learn representations that capture the underlying dynamics of the world. Most world models represent the transition function with unstructured neural networks, limiting interpretability, sample efficiency, and generalization to unseen states or action compositions. We address these issues with a generalizable world model grounded in Vector Symbolic Architecture (VSA) principles as geometric priors. Our approach utilizes learnable Fourier Holographic Reduced Representation (FHRR) encoders to map states and actions into a high dimensional complex vector space with learned group structure and models transitions with element‑wise complex multiplication. We formalize the framework's group theoretic foundation and show how training such structured representations to be approximately invariant enables strong multi‑step composition directly in latent space and generalization performances over various experiments. On a discrete grid world environment, our model achieves 87.5% zero shot accuracy to unseen state‑action pairs, obtains 53.6% higher accuracy on 20‑timestep horizon rollouts, and demonstrates 4x higher robustness to noise relative to an MLP baseline. These results highlight how training to have latent group structure yields generalizable, data‑efficient, and interpretable world models, providing a principled pathway toward structured models for real‑world planning and reasoning.
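
The transition-by-element-wise-complex-multiplication idea can be illustrated with fixed phasor codes on a cyclic 1-D grid. This is a sketch under strong assumptions: the frequencies are hand-picked rather than learned, and the environment is a ring of eight cells, not the paper's grid world.

```python
import cmath

N = 8                 # number of cells on the ring (assumed)
FREQS = [1, 2, 3, 5]  # hand-picked integer frequencies (assumed)


def encode(step):
    """Phasor code for a position or a displacement of `step` cells."""
    return [cmath.exp(2j * cmath.pi * f * step / N) for f in FREQS]


def bind(u, v):
    """FHRR binding: element-wise complex multiplication."""
    return [a * b for a, b in zip(u, v)]


def similarity(u, v):
    """Mean real part of u * conj(v); equals 1.0 for identical phasor codes."""
    return sum((a * b.conjugate()).real for a, b in zip(u, v)) / len(u)


state = encode(3)
right = encode(1)

# Two applications of the same action compose directly in latent space.
two_steps = bind(bind(state, right), right)
print(round(similarity(two_steps, encode(5)), 6))
print(round(similarity(two_steps, encode(4)), 6))
```

Because binding adds phases, applying an action d times equals binding once with the code for displacement d, which is the group structure that makes multi-step rollouts cheap.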

Abstract:
Vision‑language‑action models must enable agents to execute long‑horizon tasks under partial observability. However, most existing approaches remain observation‑driven, relying on short context windows or repeated queries to vision‑language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long‑horizon manipulation fundamentally requires persistent, action‑conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill‑suited for multi‑stage control. This paper introduces RB‑VLA, a belief‑centric architecture trained with self‑supervised world‑model objectives that maintains a compact latent state encoding task‑relevant history, dynamics, and object interactions. Queried once per task, the VLM provides high‑level intent, while the belief tracks task progress and enables phase‑aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed‑loop execution. RB‑VLA outperforms prior VLAs on long‑horizon benchmarks, achieving 52.5 percent and 37.5 percent higher success rates on multi‑stage pick‑and‑place and stacking tasks, respectively, compared to pi_0. It also reduces inference latency by up to five times relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show the belief module is the primary driver of performance, increasing success rates from 32.5 percent without belief to 77.5 percent with belief.

Abstract:
Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality‑specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon‑approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon‑approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self‑consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms eimg below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.

Abstract:
Effective robotic manipulation requires policies that can anticipate physical outcomes and adapt to real‑world environments. In this work, we introduce a unified framework, World‑Model‑Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy), to enhance robotic manipulation under dynamic conditions with minimal human involvement. Our core insight is that world models provide strong supervision signals, enabling online adaptive learning in dynamic environments, which can be complemented by force‑torque feedback to mitigate dynamic force shifts. Our AdaWorldPolicy integrates a world model, an action expert, and a force predictor, all implemented as interconnected Flow Matching Diffusion Transformers (DiT). They are interconnected via the multi‑modal self‑attention layers, enabling deep feature exchange for joint learning while preserving their distinct modularity characteristics. We further propose a novel Online Adaptive Learning (AdaOL) strategy that dynamically switches between an Action Generation mode and a Future Imagination mode to drive reactive updates across all three modules. This creates a powerful closed‑loop mechanism that adapts to both visual and physical domain shifts with minimal overhead. Across a suite of simulated and real‑robot benchmarks, our AdaWorldPolicy achieves state‑of‑the‑art performance, with dynamical adaptive capacity to out‑of‑distribution scenarios.

Abstract:
The ability to plan with temporal abstractions is central to intelligent decision‑making. Rather than reasoning over primitive actions, we study agents that compose pre‑trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent alone can solve. Such compositional planning remains elusive as compounding errors in long‑horizon predictions make it challenging to estimate the visitation distribution induced by sequencing policies. Motivated by the geometric policy composition framework introduced in arXiv:2206.08736, we address these challenges by learning predictive models of multi‑step dynamics ‑‑ so‑called jumpy world models ‑‑ that capture state occupancies induced by pre‑trained policies across multiple timescales in an off‑policy manner. Building on Temporal Difference Flows (arXiv:2503.09817), we enhance these models with a novel consistency objective that aligns predictions across timescales, improving long‑horizon predictive accuracy. We further demonstrate how to combine these generative predictions to estimate the value of executing arbitrary sequences of policies over varying timescales. Empirically, we find that compositional planning with jumpy world models significantly improves zero‑shot performance across a wide range of base policies on challenging manipulation and navigation tasks, yielding, on average, a 200% relative improvement over planning with primitive actions on long‑horizon tasks.

Abstract:
As drone‑based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text‑guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO‑World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO‑World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its enhanced accuracy. Furthermore, the model demonstrates superior lightweight performance, with the parameter count reduced from 4 million to 3.8 million and FLOPs decreasing from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone‑based applications.
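
The reported F1 scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the figures quoted above; the relative-reduction helper is added for convenience.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100 * (before - after) / before


baseline_f1 = f1(40.6, 30.8)   # original YOLO-World, in percent
proposed_f1 = f1(41.6, 31.0)   # proposed model, in percent

param_cut = relative_reduction(4.0, 3.8)    # millions of parameters
flops_cut = relative_reduction(15.7, 15.2)  # billions of FLOPs

print(round(baseline_f1, 1), round(proposed_f1, 1))
print(round(param_cut, 1), round(flops_cut, 1))
```

Both recomputed F1 values match the abstract after rounding, and the model-size figures correspond to reductions of about 5% in parameters and about 3% in FLOPs.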

Abstract:
Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic‑guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi‑step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co‑Evolving World Model and build K‑Search based on this method. By replacing static search heuristics with a co‑evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high‑level algorithmic planning from low‑level program instantiation, enabling the system to navigate non‑monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K‑Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K‑Search significantly outperforms state‑of‑the‑art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K‑Search achieves state‑of‑the‑art performance on H100, reaching 1030us and surpassing both prior evolution and human‑designed solutions.

Abstract:
The end‑to‑end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory‑conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine‑grained Reasoning Network (FRNet). SWNet constructs trajectory‑conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction‑centric representations for downstream reasoning. FRNet then evaluates agent‑specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety‑critical events across future timesteps. SafeDrive achieves state‑of‑the‑art performance on both open‑loop and closed‑loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.

Abstract:
Generative world models (WMs) are increasingly used to synthesize controllable, sensor‑conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present Physical‑Conditioned World Model Attack (PhysCond‑WMA), the first white‑box world model attack that perturbs physical‑condition channels, such as HDMap embeddings and 3D‑box features, to induce semantic, logic, or decision‑level distortion while preserving perceptual fidelity. PhysCond‑WMA is optimized in two stages: (1) a quality‑preserving guidance stage that constrains reverse‑diffusion loss below a calibrated threshold, and (2) a momentum‑guided denoising stage that accumulates target‑aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experimental results demonstrate that our approach remains effective while increasing FID by about 9% on average and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: using attacked videos for training decreases 3D detection performance by about 4% and worsens open‑loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, motivating more comprehensive security checkers.

Abstract:
How does the brain predict physical outcomes while acting in the world? Machine learning world models compress visual input into latent spaces, discarding the spatial structure that characterizes sensory cortex. We propose isomorphic world models: architectures preserving sensory topology so that physics prediction becomes geometric propagation rather than abstract state transition. We implement this using neural fields with motor‑gated channels, where activity evolves through local lateral connectivity and motor commands multiplicatively modulate specific populations. Three experiments support this approach: (1) local connectivity is sufficient to learn ballistic physics, with predictions traversing intermediate locations rather than "teleporting"; (2) policies trained entirely in imagination transfer to real physics at nearly twice the rate of latent‑space alternatives; and (3) motor‑gated channels spontaneously develop body‑selective encoding through visuomotor prediction alone. These findings suggest intuitive physics and body schema may share a common origin in spatially structured neural dynamics.

Abstract:
Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We propose a Hierarchical Modular Multi‑Agent Reinforcement Learning (HM‑MARL) framework for autonomous two‑device navigation in vitro, enabling efficient and generalizable navigation. HM‑MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi‑agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor‑Critic RL. The framework was validated in both in silico and in vitro testbeds to assess generalization and real‑world feasibility. In silico, a single‑vasculature model achieved 92‑100% success rates on individual anatomies, while a multi‑vasculature model achieved 56‑80% across multiple patient anatomies. In vitro, both HM‑MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA but failed on the left‑side vessel superhuman challenge due to the anatomy and catheter type used in navigation. This study presents the first demonstration of in vitro autonomous navigation in MT vasculature. While HM‑MARL enables generalization across anatomies, the simulation‑to‑real transition introduces challenges. Future work will refine RL strategies using world models and validate performance on unseen in vitro data, advancing autonomous MT towards clinical translation.

Abstract:
World models learned from high‑dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel‑level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO‑WM), display a degradation in test time robustness due to their sensitivity to "slow features". These include visual variations such as background changes and distractors that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control‑relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test‑time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO‑WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.

Abstract:
The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk‑Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked‑in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state‑of‑the‑art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.

Abstract:
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human‑like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy ‑‑ the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open‑ended platform that uses LLMs with humans‑in‑the‑loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision‑language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world‑model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human‑like general intelligence in machines.

Abstract:
Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact‑preserving workflows. This challenge is particularly acute for computer‑using scenarios, where real execution does not support counterfactual exploration, making large‑scale trial‑and‑error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer‑Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two‑stage factorization of UI dynamics: it first predicts a textual description of agent‑relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer‑using environments. We evaluate CUWM via test‑time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world‑model‑guided test‑time scaling improves decision quality and execution robustness.

Abstract:
Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for constructing symbolic causal world models entirely online by integrating continuous model learning and repair into the agent's decision loop. The framework leverages Meta‑Interpretive Learning and predicate invention to find semantically meaningful and reusable abstractions, allowing an agent to construct a hierarchy of disentangled, high‑quality concepts from its observations. We demonstrate that our lifted inference approach scales to domains with complex relational dynamics, where propositional methods suffer from combinatorial explosion, while achieving sample‑efficiency orders of magnitude higher than the established PPO neural‑network‑based baseline.

Abstract:
Learning to manipulate cloth is both a paradigmatic problem for robotic research and a problem of immediate relevance to a variety of applications ranging from assistive care to the service industry. The complex physics of the deformable object makes this problem of cloth manipulation nontrivial. In order to create a general manipulation strategy that addresses a variety of shapes, sizes, fold and wrinkle patterns, in addition to the usual problems of appearance variations, it becomes important to carefully consider model structure and its implications for generalisation performance. In this paper, we present an approach to in‑air cloth manipulation that uses a variation of a recently proposed reinforcement learning architecture, DreamerV2. Our implementation modifies this architecture to utilise surface normals input, in addition to modifying the replay buffer and data augmentation procedures. Taken together, these modifications represent an enhancement to the world model used by the robot, addressing the physical complexity of the object being manipulated by the robot. We present evaluations both in simulation and in a zero‑shot deployment of the trained policies in a physical robot setup, performing in‑air unfolding of a variety of different cloth types, demonstrating the generalisation benefits of our proposed architecture.

Abstract:
Learning latent actions from action‑free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next‑step factor value. This factorized structure enables more accurate modeling of complex multi‑entity dynamics and improves video generation quality in action‑free video settings compared to monolithic models. Based on experiments on both simulation and real‑world multi‑entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

Abstract:
Autonomous inspection robots for monitoring industrial sites can reduce costs and risks associated with human‑led inspection. However, accurate readings can be challenging due to occlusions, limited viewpoints, or unexpected environmental conditions. We propose a hybrid framework that combines supervised failure classification with anomaly detection, enabling classification of inspection tasks as a success, known failure, or anomaly (i.e., out‑of‑distribution) case. Our approach uses a world model backbone with compressed video inputs. This policy‑agnostic, distribution‑free framework determines classifications based on two decision functions set by conformal prediction (CP) thresholds before a human observer does. We evaluate the framework on gauge inspection feeds collected from office and industrial sites and demonstrate real‑time deployment on a Boston Dynamics Spot. Experiments show over 90% accuracy in distinguishing between successes, failures, and OOD cases, with classifications occurring earlier than a human observer. These results highlight the potential for robust, anticipatory failure detection in autonomous inspection tasks or as a feedback signal for model training to assess and improve the quality of training data. Project website: https://autoinspection‑classification.github.io
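
To make the role of the conformal prediction thresholds concrete, the following is a minimal sketch of the standard split-conformal quantile and a three-way decision rule built on it. The two nonconformity scores (`anomaly_score`, `failure_score`), the order in which they are checked, and all names are assumptions for illustration; the abstract does not specify how the paper defines its two decision functions.

```python
import math
from typing import Sequence


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    if n == 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need calibration scores and alpha in (0, 1)")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def classify(
    anomaly_score: float,
    failure_score: float,
    anomaly_thr: float,
    failure_thr: float,
) -> str:
    """Hypothetical rule: flag out-of-distribution first, then known failure, else success."""
    if anomaly_score > anomaly_thr:
        return "anomaly"
    if failure_score > failure_thr:
        return "failure"
    return "success"


# Toy calibration set of 19 nonconformity scores (made-up values).
calibration = [float(i) for i in range(1, 20)]
anomaly_thr = conformal_threshold(calibration, alpha=0.1)
```

Under exchangeability of calibration and test scores, a threshold chosen this way is exceeded by a new in-distribution score with probability at most `alpha`, which is the distribution-free property such frameworks rely on.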

Abstract:
Vision‑language models (VLMs) show promise for high‑level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out‑of‑view states, causing world‑state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM‑DEWM, a cognitive architecture that decouples VLM reasoning from world‑state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising action proposal, world belief, and causal assumption, which is validated against DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM‑DEWM on multi‑station assembly, large‑scale facility exploration, and real‑robot recovery under induced failures. Compared to baseline memory‑augmented VLM systems, VLM‑DEWM improves state‑tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM‑DEWM as a verifiable and resilient solution for long‑horizon robotic operations in dynamic manufacturing environments.

Abstract:
Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback‑driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi‑agent collaboration process that enables an action model to consult a world model as a web‑environment expert for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state transition dynamics to enhance candidate action proposal. To achieve risk‑aware resilient task execution, we introduce a two‑stage deduction chain. A world model, specialized in environmental state transitions, simulates action outcomes, which a judge model then scrutinizes to trigger action corrective feedback when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online‑Mind2Web.

Abstract:
Cold‑start personalization requires inferring user preferences through interaction when no user‑specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi‑turn settings its terminal reward fails to exploit the factored, per‑criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold‑start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training‑free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3‑5x fewer interactions. When two users give different answers to the same question, Pep changes its follow‑up 39‑62% of the time versus 0‑28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold‑start elicitation is the capability to exploit the factored structure of preference data.
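
The split into an offline-learned prior and training-free online Bayesian inference can be illustrated with a small discrete model. This is a sketch under strong assumptions and not Pep's actual belief model: the latent user types, the `PRIOR` and `P_CARES` tables, the yes/no question format, and the preference dimension names are all hypothetical.

```python
import math
from typing import Dict, List, Set

# Hypothetical "world model" learned offline: a prior over latent user types and,
# per type, the probability that such a user cares about each preference dimension.
PRIOR: List[float] = [0.5, 0.5]
P_CARES: List[Dict[str, float]] = [
    {"brevity": 0.9, "citations": 0.2, "tone": 0.5},  # type 0
    {"brevity": 0.1, "citations": 0.8, "tone": 0.5},  # type 1
]


def entropy(p: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def likelihood(user_type: int, dim: str, answer: bool) -> float:
    p = P_CARES[user_type][dim]
    return p if answer else 1 - p


def update(belief: List[float], dim: str, answer: bool) -> List[float]:
    """Bayes rule: posterior over user types after a yes/no answer on `dim`."""
    joint = [b * likelihood(t, dim, answer) for t, b in enumerate(belief)]
    z = sum(joint)
    return [j / z for j in joint]


def expected_info_gain(belief: List[float], dim: str) -> float:
    """Expected reduction in entropy over user types from asking about `dim`."""
    gain = entropy(belief)
    for answer in (True, False):
        p_answer = sum(b * likelihood(t, dim, answer) for t, b in enumerate(belief))
        if p_answer > 0:
            gain -= p_answer * entropy(update(belief, dim, answer))
    return gain


def next_question(belief: List[float], asked: Set[str]) -> str:
    remaining = [d for d in P_CARES[0] if d not in asked]
    return max(remaining, key=lambda d: expected_info_gain(belief, d))


def predict_profile(belief: List[float]) -> Dict[str, float]:
    """Posterior-weighted probability for every dimension, including unasked ones."""
    return {d: sum(b * P_CARES[t][d] for t, b in enumerate(belief)) for d in P_CARES[0]}


first_question = next_question(PRIOR, set())
belief = update(PRIOR, "brevity", True)
profile = predict_profile(belief)
```

In this toy setup, a single answer about one dimension shifts the predicted probability for a dimension that was never asked about, which is the mechanism by which correlations learned offline reduce the number of questions needed online.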

Abstract:
Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision‑making policies in complex environments. StarCraft II (SC2), with its massive state‑action space and partial observability, is a challenging testbed. However, existing LLM‑based SC2 agents primarily focus on improving the policy itself and overlook integrating a learnable, action‑conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2's hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2‑Dynamics‑50k, the first instruction‑tuning dataset for SC2 dynamics prediction. We further develop a multi‑dimensional offline evaluation framework for predicted structured observations. Offline results show StarWM's substantial gains over zero‑shot baselines, including nearly 60% improvements in resource prediction accuracy and self‑side macro‑situation consistency. Finally, we propose StarWM‑Agent, a world‑model‑augmented decision system that integrates StarWM into a Generate‑Simulate‑Refine decision loop for foresight‑driven policy refinement. Online evaluation against SC2's built‑in AI demonstrates consistent improvements, yielding win‑rate gains of 30%, 15%, and 30% against Hard (LV5), Harder (LV6), and VeryHard (LV7), respectively, alongside improved macro‑management stability and tactical risk assessment.

Abstract:
Web agents require massive trajectories to generalize, yet real‑world training is constrained by network latency, rate limits, and safety risks. We introduce the WebWorld series, the first open‑web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a scalable data pipeline to train on 1M+ open‑web interactions, supporting reasoning, multi‑format data, and long‑horizon simulations of 30+ steps. For intrinsic evaluation, we introduce WebWorld‑Bench with dual metrics spanning nine dimensions, where WebWorld achieves simulation performance comparable to Gemini‑3‑Pro. For extrinsic evaluation, Qwen3‑14B trained on WebWorld‑synthesized trajectories improves by +9.2% on WebArena, reaching performance comparable to GPT‑4o. WebWorld enables effective inference‑time search, outperforming GPT‑5 as a world model. Beyond web simulation, WebWorld exhibits cross‑domain generalization to code, GUI, and game environments, providing a replicable recipe for world model construction.

Abstract:
Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision‑Language‑Action (VLA) models, but its requirement for massive real‑world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed‑loop imagined rollouts inevitably suffer from hallucination and long‑horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world‑model‑based reinforcement learning framework for post‑training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action‑conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe‑Initialized Rollouts, and maintains policy‑simulator alignment through World Model‑Policy co‑evolution. Extensive experiments on LIBERO benchmarks and real‑world robotic manipulation demonstrate that WoVR enables stable long‑horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real‑robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.

Abstract:
This paper investigates compact large language model (LLM) deployment and world‑model‑assisted inference offloading in mobile edge computing (MEC) networks. We first propose an edge compact LLM deployment (ECLD) framework that jointly applies structured pruning, low‑bit quantization, and knowledge distillation to construct edge‑deployable LLM variants, and we evaluate these models using four complementary metrics: accessibility, energy consumption, hallucination rate, and generalization accuracy. Building on the resulting compact models, we formulate an MEC offloading optimization problem that minimizes the long‑term average inference latency subject to per‑device energy budgets and LLM‑specific quality‑of‑service constraints on effective accuracy and hallucination. To solve this problem under unknown and time‑varying network dynamics, we develop a world model‑proximal policy optimization (PPO) algorithm, which augments an on‑policy PPO algorithm with a learned recurrent world model that provides improved value targets and short imagination rollouts. Extensive experiments on Llama‑3.1‑8B, Qwen3‑8B, and Mistral‑12B show that ECLD compresses base models by about 70‑80% in storage (e.g., from 15.3 GB to 3.3 GB for Llama‑3.1‑8B) and reduces per‑query energy consumption by up to 50%, while largely preserving accuracy and often lowering hallucination compared with quantization‑only or pruning‑only baselines. The experiments also show that world model‑PPO speeds up convergence by about 50%, improves the final reward by 15.8% over vanilla PPO, and reduces average inference latency by 12‑30% across different user populations, while satisfying the accuracy and hallucination constraints and approaching the generation quality of always‑offloading with much of the efficiency of local execution.

Abstract:
Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow‑intent‑conditioned world model that represents bin states as item‑aligned instance masks and uses a latent diffusion transformer to predict the post‑stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post‑stow layouts compared with heuristic baselines. We further evaluate the predicted post‑stow layouts in two downstream tasks, in which replacing the real post‑stow masks with FOREST predictions causes only modest performance loss in load‑quality assessment and multi‑stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.

Abstract:
An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with n states and m actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non‑constant reward function conveys exactly n log m bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is n log m bits. This bound holds across a broad class of objectives, including finite‑horizon, infinite‑horizon discounted, and time‑averaged reward maximization. These findings provide a precise information‑theoretic lower bound on the "implicit world model" necessary for optimality.
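
One way to read the n log m figure, offered as a sanity check and not as the paper's proof: a deterministic policy assigns one of m actions to each of n states, so there are m^n such policies. Writing E for the environment and π* for the optimal policy, and assuming π* is a deterministic function of E (which requires a fixed tie-breaking rule, an assumption not stated in the abstract), the mutual information reduces to the entropy of the policy:

```latex
I(E;\pi^*) \;=\; H(\pi^*) - H(\pi^* \mid E) \;=\; H(\pi^*) \;\le\; \log\!\left(m^{n}\right) \;=\; n \log m .
```

Under this reading, the stated result corresponds to the case where the upper bound is attained, that is, where the prior over environments makes every one of the m^n deterministic policies equally likely to be optimal.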

Abstract:
Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long‑term planning. At the same time, scalability demands learning such models in a self‑supervised manner; joint‑embedding predictive architecture (JEPA) enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD‑LiST‑JEPA, a self‑supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR‑based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof‑of‑concept experiments show better OCF performance with a pretrained encoder after JEPA‑based world model learning.

Abstract:
Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data‑efficient training. To address this challenge, we present a novel model‑based multi‑agent reinforcement learning framework that unifies joint state‑action representation learning with imaginative roll‑outs. We design a world model trained with variational auto‑encoders and augment the model using the state‑action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll‑outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action‑value function. By coupling imagined trajectories with SALE‑based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long‑term planning and optimization under limited real‑environment interactions. Empirical studies on well‑established multi‑agent benchmarks, including StarCraft II Micro‑Management, Multi‑Agent MuJoCo, and Level‑Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state‑action learned embeddings within a multi‑agent model‑based paradigm.

Abstract:
The opioid epidemic remains one of the most severe public health crises in the United States, yet evaluating policy interventions before implementation is difficult: multiple policies interact within a dynamic system where targeting one risk pathway may inadvertently amplify another. We argue that effective opioid policy evaluation requires three capabilities (forecasting future outcomes under current policies, counterfactual reasoning about alternative past decisions, and optimization over candidate interventions) and propose to unify them through world modeling. We introduce Policy4OOD, a knowledge‑guided spatio‑temporal world model that addresses three core challenges: what policies prescribe, where effects manifest, and when effects unfold. Policy4OOD jointly encodes policy knowledge graphs, state‑level spatial dependencies, and socioeconomic time series into a policy‑conditioned Transformer that forecasts future opioid outcomes. Once trained, the world model serves as a simulator: forecasting requires only a forward pass, counterfactual analysis substitutes alternative policy encodings in the historical sequence, and policy optimization employs Monte Carlo Tree Search over the learned simulator. To support this framework, we construct a state‑level monthly dataset (2019‑2024) integrating opioid mortality, socioeconomic indicators, and structured policy encodings. Experiments demonstrate that spatial dependencies and structured policy knowledge significantly improve forecasting accuracy, validating each architectural component and the potential of world modeling for data‑driven public health decision support.

Abstract:
Determining whether neural models internalize physical laws as world models, rather than exploiting statistical shortcuts, remains challenging, especially under out‑of‑distribution (OOD) shifts. Standard evaluations often test latent capability via downstream adaptation (e.g., fine‑tuning or high‑capacity probes), but such interventions can change the representations being measured and thus confound what was learned during self‑supervised learning (SSL). We propose a non‑invasive evaluation protocol, PhyIP. We test whether physical quantities are linearly decodable from frozen representations, motivated by the linear representation hypothesis. Across fluid dynamics and orbital mechanics, we find that when SSL achieves low error, latent structure becomes linearly accessible. PhyIP recovers internal energy and Newtonian inverse‑square scaling on OOD tests (e.g., ρ > 0.90). In contrast, adaptation‑based evaluations can collapse this structure (ρ ≈ 0.05). These findings suggest that adaptation‑based evaluation can obscure latent structures and that low‑capacity probes offer a more accurate evaluation of physical world models.
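
The idea of testing linear decodability from frozen representations can be sketched as an ordinary least-squares probe followed by a correlation check on held-out points. This is a generic illustration, not the PhyIP protocol itself: the two-dimensional "frozen features", the synthetic target, and the use of Pearson correlation as the score are assumptions (the abstract does not define ρ).

```python
from typing import List, Sequence


def fit_linear_probe(
    features: Sequence[Sequence[float]], targets: Sequence[float]
) -> List[float]:
    """Ordinary least squares with a bias term, solved via the normal equations."""
    rows = [list(x) + [1.0] for x in features]  # append a bias column
    d = len(rows[0])
    # Augmented system [X^T X | X^T y].
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(d)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(d)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(system[r][col]))
        if abs(system[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        system[col], system[pivot] = system[pivot], system[col]
        for r in range(d):
            if r != col:
                factor = system[r][col] / system[col][col]
                system[r] = [a - factor * b for a, b in zip(system[r], system[col])]
    return [system[i][d] / system[i][i] for i in range(d)]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    # zip stops at the last feature; the final weight is the bias.
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Synthetic stand-in for frozen encoder outputs and a physical quantity
# that happens to be linear in them: y = 2*x0 - x1 + 0.5.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
train_y = [2 * a - b + 0.5 for a, b in train_x]
test_x = [(3.0, 2.0), (4.0, 0.0), (-1.0, 5.0)]  # outside the training range
test_y = [2 * a - b + 0.5 for a, b in test_x]

weights = fit_linear_probe(train_x, train_y)
score = pearson([predict(weights, x) for x in test_x], test_y)
```

Only the probe weights are fitted; the features are never updated, which is what keeps the measurement from altering the representation being evaluated.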

Abstract:
Recent robot foundation models largely rely on large‑scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to foundation‑level due to coarse data usage and fragmented datasets. We introduce LDA‑1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI‑30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel‑space appearance modeling. Complementing this representation, LDA‑1B employs a multi‑modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B‑parameter scale. Experiments in simulation and the real world show LDA‑1B outperforms prior methods (e.g., π_0.5) by up to 21%, 48%, and 23% on contact‑rich, dexterous, and long‑horizon tasks, respectively. Notably, LDA‑1B enables data‑efficient fine‑tuning, gaining 10% by leveraging 30% low‑quality trajectories, which are typically considered harmful and discarded.

Abstract:
The goal of this paper is to improve the performance and reliability of vision‑language‑action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator (specifically, an action‑conditioned video generation model) can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many different physical interactions (particularly failure cases) and struggle to accurately model small yet critical physical details in contact‑rich object manipulation. We propose a simple iterative improvement algorithm that uses real‑world roll‑out data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state‑of‑the‑art VLA model on multiple downstream tasks. We achieve a 39.2% absolute success rate improvement over the base policy and 11.6% improvement from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla‑w

Abstract:
Humanoid robots show promise for complex whole‑body tasks in unstructured environments. Although Human‑Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non‑holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high‑order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine‑tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi‑object long‑horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.

Abstract:
We study budget‑constrained tool‑augmented agents, where a large language model must solve multi‑step tasks by invoking external tools under a strict monetary budget. We formalize this setting as sequential decision making in context space with priced and stochastic tool executions, making direct planning intractable due to massive state‑action spaces, high variance of outcomes and prohibitive exploration cost. To address these challenges, we propose INTENT, an inference‑time planning framework that leverages an intention‑aware hierarchical world model to anticipate future tool usage, estimate risk‑calibrated cost, and guide decisions online. Across cost‑augmented StableToolBench, INTENT strictly enforces hard budget feasibility while substantially improving task success over baselines, and remains robust under dynamic market shifts such as tool price changes and varying budgets.

Abstract:
World models are becoming central to robotic planning and control as they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural‑language prediction, which are difficult to ground in robot actions and suffer from compounding errors over long horizons. Classic task and motion planning models world transitions in logical space, enabling robot‑executable and robust long‑horizon reasoning. However, they typically operate independently of visual perception, preventing synchronized symbolic and visual state prediction. We propose a Hierarchical World Model (H‑WM) that jointly predicts logical and visual state transitions within a unified framework. H‑WM combines a high‑level logical world model with a low‑level visual world model, integrating the long‑horizon robustness of symbolic reasoning with visual grounding. The hierarchical outputs provide stable intermediate guidance for long‑horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. Experiments across multiple vision‑language‑action (VLA) control policies demonstrate the effectiveness and generality of H‑WM's guidance.

Abstract:
Despite the sustained scaling of model capacity and data acquisition, Vision‑Language‑Action (VLA) models remain brittle in contact‑rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on‑policy RL in the physical world is constrained by safety risk, hardware cost, and environment reset. To bridge this gap, we present RISE, a scalable framework of robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi‑view futures via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for policy improvement. Such compositional design allows state and value to be tailored by best‑suited yet distinct architectures and objectives. These components are integrated into a closed‑loop self‑improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imaginary space without costly physical interaction. Across three challenging real‑world tasks, RISE yields significant improvement over prior art, with more than +35% absolute performance increase in dynamic brick sorting, +45% for backpack packing, and +35% for box closing, respectively.

Abstract:
Developing world models that understand complex physical interactions is essential for advancing robotic planning and simulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact‑rich dynamic motion. To address these challenges, we propose ContactGaussian‑WM, a differentiable physics‑grounded rigid‑body world model capable of learning intricate physical laws directly from sparse and contact‑rich video sequences. Our framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end‑to‑end differentiable learning framework that differentiates through a closed‑form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real‑world evaluations demonstrate that ContactGaussian‑WM outperforms state‑of‑the‑art methods in learning complex scenarios, exhibiting robust generalization capabilities. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real‑time MPC.

Abstract:
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision‑Language Models (VLMs) provide high‑level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video‑conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few‑step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.

Abstract:
Full models of the world require complex knowledge of immense detail. While pre‑trained large models have been hypothesized to contain similar knowledge due to extensive pre‑training on vast amounts of internet scale data, using them directly in a search procedure is inefficient and inaccurate. Conversely, partial models focus on making high quality predictions for a subset of states and actions: those linked through affordances that achieve user intents (Khetarpal et al., 2020). Can we posit large models as partial world models? We provide a formal answer to this question, proving that agents achieving task‑agnostic, language‑conditioned intents necessarily possess predictive partial‑world models informed by affordances. In the multi‑task setting, we introduce distribution‑robust affordances and show that partial models can be extracted to significantly improve search efficiency. Empirical evaluations in tabletop robotics tasks demonstrate that our affordance‑aware partial models reduce the search branching factor and achieve higher rewards compared to full world models.

Abstract:
This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language‑specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent "social world models", i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal‑mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs' performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like‑minded work. Results from stringent hypothesis‑driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected "social world models" rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.

Abstract:
Pretraining Vision‑Language‑Action (VLA) policies on internet‑scale video is appealing, yet current latent‑action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action‑relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA‑JEPA, a JEPA‑style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage‑free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation ‑‑ future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA‑JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two‑stage recipe ‑‑ JEPA pretraining followed by action‑head fine‑tuning ‑‑ without the multi‑stage complexity of prior latent‑action pipelines. Experiments on LIBERO, LIBERO‑Plus, SimplerEnv and real‑world manipulation tasks show that VLA‑JEPA achieves consistent gains in generalization and robustness over existing methods.

Abstract:
Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse‑reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward‑biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)‑style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher‑reward outcomes. This fully gradient‑based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug‑and‑play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state‑of‑the‑art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
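
To make the idea of building optimism into model learning concrete, the following sketch shows one possible shape for a reward-biased dynamics loss: a standard prediction term minus a weighted reward term. The function name, the unit-variance Gaussian likelihood, and the weight `alpha` are assumptions made for illustration; the abstract does not specify the exact loss used in OWMs.

```python
def optimistic_model_loss(pred_next, true_next, imagined_reward, alpha=0.1):
    """Hypothetical reward-biased dynamics loss.

    pred_next / true_next: sequences of floats (predicted and observed next state).
    imagined_reward: reward the model assigns to its own imagined transition.
    alpha: strength of the optimism bias (alpha = 0 recovers the plain loss).
    """
    # Unit-variance Gaussian negative log-likelihood, up to an additive constant.
    nll = 0.5 * sum((p - t) ** 2 for p, t in zip(pred_next, true_next))
    # Subtracting the reward term favors models whose imagined outcomes score higher.
    return nll - alpha * imagined_reward

plain = optimistic_model_loss([1.0, 2.0], [0.0, 0.0], imagined_reward=3.0, alpha=0.0)
biased = optimistic_model_loss([1.0, 2.0], [0.0, 0.0], imagined_reward=3.0, alpha=0.1)
print(plain, biased)
```

Because the bias is an additive term in the loss, it can be minimized with ordinary gradient descent, which is consistent with the abstract's point that no uncertainty estimates or constrained optimization are needed.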

Abstract:
World‑model‑based imagine‑then‑act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image‑based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary‑view RGBD generation: given only a single‑view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back‑projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi‑view, cross‑modality generation, we explicitly design cross‑view and cross‑modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill‑posed because multiple actions can explain the same transition. We address this with a test‑time action optimization strategy that backpropagates through the generative model to infer a trajectory‑level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

Abstract:
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long‑term stability. We study egocentric interaction generation from a single scene image under free‑space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free‑space gestures and contact‑heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary‑length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion‑invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per‑pixel Plücker‑ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary‑length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long‑horizon interactive generation.
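
For readers unfamiliar with Plücker-ray embeddings, the sketch below computes the six-dimensional Plücker coordinates of the ray through a single pixel under a pinhole camera. The function name, the argument layout, and the ordering (direction first, then moment) are assumptions; conventions differ between codebases, and this is not taken from the Hand2World implementation.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_ray(u, v, fx, fy, cx, cy, R, origin):
    """Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    fx, fy, cx, cy: pinhole intrinsics.
    R: 3x3 camera-to-world rotation (list of rows); origin: camera center in world frame.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalize to unit length.
    d_world = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment encodes where the ray sits in space, not just where it points.
    m = cross(origin, d)
    return d + m

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, I3, (1.0, 0.0, 0.0))
print(ray)
```

Evaluating this for every pixel gives a per-pixel map that depends only on camera geometry, which is what allows camera motion to be supplied as a signal separate from the hand conditioning.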

Abstract:
This work presents WorldCompass, a novel Reinforcement Learning (RL) post‑training framework for long‑horizon, interactive video‑based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip‑Level Rollout Strategy: We generate and evaluate multiple samples for a single target clip, which significantly boosts rollout efficiency and provides fine‑grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction‑following accuracy and visual quality, which provide direct supervision and effectively suppress reward‑hacking behaviors. 3) Efficient RL Algorithm: We employ a negative‑aware fine‑tuning strategy coupled with various efficiency optimizations to enhance model capacity efficiently and effectively. Evaluations on the SoTA open‑source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

Abstract:
While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action‑conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision‑making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub‑dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi‑dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception‑functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark and its public leaderboard are released at https://world‑arena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.

Abstract:
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication‑specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable‑worldmodel (SWM), a modular, tested, and documented world‑model research ecosystem that provides efficient data‑collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero‑shot robustness in DINO‑WM.

Abstract:
Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in‑depth analysis of test‑time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test‑time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world‑model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test‑time imagination for efficient and reliable spatial reasoning.

Abstract:
Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen‑space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry‑aware encoding that injects camera‑ray directions directly into video transformer self‑attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model‑native inductive bias for retrieving 3D‑consistent content across temporal gaps. We further propose Geometry‑Aware Frame‑Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop‑closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long‑term consistency while reducing computational costs.

Abstract:
Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution‑based world modeling enables internal verification within the model, offering an alternative to natural language chain‑of‑thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long‑horizon state tracking. On real‑code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token‑intensive execution traces, leading to token‑budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string‑valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long‑horizon behavior, we use a controlled permutation‑tracking benchmark that isolates state propagation under action execution. We show that long‑horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground‑truth commands, a Transformer‑based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long‑horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.

Abstract:
Contemporary visuo‑motor dexterity models often rely on expressive policy classes with diffusion and transformer backbones to achieve strong performance. However, these architectures require significant data and computational resources, and remain far from reliable, particularly for multi‑fingered dexterity. Importantly, they model skills as reactive mappings and rely on fixed‑horizon action chunking, creating a rigid trade‑off between temporal coherence and reactivity. To address these issues, we first introduce Unified Behavioral Models (UBMs), a framework to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co‑evolve. As such, UBMs ensure temporal coherence by construction rather than heuristic averaging. Unlike world models that attempt to predict the impact of arbitrary robot actions on the environment, UBMs target behavioral dynamics that encode how demonstrated robot behavior is related to desired impacts on the environment. A UBM can be viewed as a pseudo planner: given an initial condition, it computes the desired robot behavior over the entire skill horizon, while simultaneously "imagining" the resulting flow of visual features. To operationalize UBMs, we propose Koopman‑UBM (K‑UBM), a first instantiation of UBMs as a structured latent linear system. K‑UBM is computationally efficient, enabling reactivity and adaptation via an online replanning strategy: the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and four real‑world tasks, our approach matches or exceeds the performance of state‑of‑the‑art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
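
The two mechanisms described above, a latent linear rollout and a divergence-triggered replanning monitor, can be sketched in a few lines. The names `rollout` and `needs_replan`, the Euclidean divergence measure, and the toy operator `K` are assumptions for illustration only; the abstract does not state how the latent space or the divergence is defined.

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rollout(K, z0, horizon):
    """Unroll the linear latent dynamics z_{t+1} = K z_t for `horizon` steps."""
    traj = [list(z0)]
    for _ in range(horizon):
        traj.append(matvec(K, traj[-1]))
    return traj

def needs_replan(predicted_flow, observed_flow, threshold):
    """Runtime monitor: request replanning when prediction and observation diverge."""
    return math.dist(predicted_flow, observed_flow) > threshold

K = [[0.5, 0.0], [0.0, 2.0]]           # toy latent operator (hypothetical)
plan = rollout(K, [1.0, 1.0], horizon=2)
print(plan)
print(needs_replan([0.0, 0.0], [3.0, 4.0], threshold=4.9))
```

Since each step is a single matrix-vector product, recomputing the whole plan from a new initial condition is cheap, which is what makes frequent replanning practical in this kind of design.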

Abstract:
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's‑eye view. We introduce Cross‑View World Models (XVWM), trained with a cross‑view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross‑view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view‑invariant representations of the environment's 3D structure. We train on synchronized multi‑view gameplay data from Aimlabs, an aim‑training platform providing precisely aligned multi‑camera recordings with high‑frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi‑view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective‑taking in multi‑agent settings.

Abstract:
A long‑standing question in physical reasoning is whether video‑based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task‑specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large‑scale video encoders. Using layerwise probing, subspace geometry, patch‑level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder‑based video transformers. Across architectures, we identify a sharp intermediate‑depth transition ‑‑ which we call the Physics Emergence Zone ‑‑ at which physical variables become accessible. Physics‑related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high‑dimensional population structure with circular geometry, requiring coordinated multi‑feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.

Abstract:
Classical sabermetrics has profoundly shaped baseball analytics by summarizing long histories of play into compact statistics. While these metrics are invaluable for valuation and retrospective analysis, they do not define a generative model of how baseball games unfold pitch by pitch, leaving most existing approaches limited to single‑step prediction or post‑hoc analysis. In this work, we present Neural Sabermetrics with World Model, a Large Language Model (LLM) based play‑by‑play world model for baseball. We cast baseball games as long auto‑regressive sequences of events and continuously pretrain a single LLM on more than ten years of Major League Baseball (MLB) tracking data, comprising over seven million pitch sequences and approximately three billion tokens. The resulting model is capable of predicting multiple aspects of game evolution within a unified framework. We evaluate our model on both in‑distribution regular‑season data and out‑of‑distribution postseason games and compare against strong neural baselines from prior work. Despite using a single backbone model, our approach outperforms existing baselines, correctly predicting (1) approximately 64% of next pitches within a plate appearance and (2) 78% of batter swing decisions, suggesting that LLMs can serve as effective world models for sports.

Abstract:
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. To this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post‑training on small‑scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real‑time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model‑based planning. Systematic evaluation on multiple challenging out‑of‑distribution (OOD) benchmarks verifies the significance of our method for simulating open‑world, contact‑rich tasks, paving the way for general‑purpose robot world models.

Abstract:
Can general‑purpose AI architectures go beyond prediction to discover the physical laws governing the universe? True intelligence relies on "world models" ‑‑ causal abstractions that allow an agent to not only predict future states but understand the underlying governing dynamics. While previous "AI Physicist" approaches have successfully recovered such laws, they typically rely on strong, domain‑specific priors that effectively "bake in" the physics. Conversely, Vafa et al. recently showed that generic Transformers fail to acquire these world models, achieving high predictive accuracy without capturing the underlying physical laws. We bridge this gap by systematically introducing three minimal inductive biases. We show that ensuring spatial smoothness (by formulating prediction as continuous regression) and stability (by training with noisy contexts to mitigate error accumulation) enables generic Transformers to surpass prior failures and learn a coherent Keplerian world model, successfully fitting ellipses to planetary trajectories. However, true physical insight requires a third bias: temporal locality. By restricting the attention window to the immediate past ‑‑ imposing the simple assumption that future states depend only on the local state rather than a complex history ‑‑ we force the model to abandon curve‑fitting and discover Newtonian force representations. Our results demonstrate that simple architectural choices determine whether an AI becomes a curve‑fitter or a physicist, marking a critical step toward automated scientific discovery.
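
The temporal-locality bias described above amounts to a causal attention mask with a short window. The sketch below builds such a mask; the function name and the boolean list-of-lists representation are illustrative assumptions, and the window size used in the paper is not given in the abstract.

```python
def local_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Only the current step and the `window - 1` steps before it are visible,
    so predictions cannot depend on the distant past.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

mask = local_causal_mask(seq_len=4, window=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

With `window` equal to `seq_len` this reduces to the usual causal mask, so the same function covers both the unrestricted and the restricted setting.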

Abstract:
Reinforcement learning (RL) can refine Vision‑Language‑Action (VLA) policies beyond behavior cloning, but real‑world RL remains expensive due to extensive rollouts, resets, supervision, and safety risks. Action‑conditioned video world models offer an option to train in virtual environments, yet they exhibit imprecise action following, particularly on subtle near‑success failures. They also lack native reward signals for RL, and computing rewards from inaccurate visual predictions remains unreliable. We introduce World‑VLA‑Loop, structured around two foundational designs and a higher‑level co‑evolving paradigm. We first curate SANS, which deliberately mixes successful and near‑success trajectories to improve action‑outcome alignment. Then, we train a state‑aware video world model that jointly predicts future frames and binary rewards from diffusion latents. This couples reward estimation to the generator rather than to a separate module and, in turn, benefits visual prediction. Because VLA behavior shifts during RL, a fixed simulator can become misaligned with the updated policy; World‑VLA‑Loop therefore closes the loop by using the refined world model for iterative VLA post‑training while feeding rollouts from each improved policy back to augment and fine‑tune the world model. Across simulation and real‑robot experiments, World‑VLA‑Loop substantially improves VLA performance while reducing reliance on costly physical interaction.

Abstract:
World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non‑rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally expensive to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks yet struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large‑scale diffusion models via a novel decoupled first‑order gradient (FoG) method: a full‑scale world model generates accurate forward trajectories, while a lightweight latent‑space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high‑fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push‑T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego‑centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data‑driven world models is a promising pathway for solving hard‑to‑model RL tasks in image space without reliance on hand‑crafted physics simulators.

Abstract:
We introduce multi‑task Visuo‑Tactile World Models (VT‑WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT‑WM better understands robot‑object interactions in contact‑rich tasks, avoiding common failure modes of vision‑only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact‑rich manipulation tasks, VT‑WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero‑shot real‑robot experiments, VT‑WM achieves up to 35% higher success rates, with the largest gains in multi‑step, contact‑rich tasks. Finally, VT‑WM demonstrates significant downstream versatility, effectively adapting its learned contact dynamics to a novel task and achieving reliable planning success with only a limited set of demonstrations.

Abstract:
Large language models (LLMs) have achieved strong performance in language‑centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world‑modeling capabilities in LLM‑based agents. We propose Reinforcement World Model Learning (RWML), a self‑supervised method that learns action‑conditioned world models for LLM‑based agents on textual states using sim‑to‑real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre‑trained embedding space. Unlike next‑state token prediction, which prioritizes token‑level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM‑as‑a‑judge. We evaluate our method on ALFWorld and τ²‑Bench and observe significant gains over the base model, despite being entirely self‑supervised. When combined with task‑success rewards, our method outperforms direct task‑success reward RL by 6.9 and 5.7 points on ALFWorld and τ²‑Bench respectively, while matching the performance of expert‑data training.
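
One plausible reading of a sim-to-real gap reward in embedding space is a similarity score between the embedded simulated next state and the embedded realized next state. The sketch below uses cosine similarity with a toy bag-of-words embedding so that it runs without external models; `toy_embed`, `cosine`, and `gap_reward` are hypothetical names, the bag-of-words embedding is only a stand-in for the pre-trained embedding the abstract mentions, and the paper's actual reward may be defined differently.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: word counts. A real system would use a pre-trained encoder."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)  # Counter returns 0 for missing keys
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def gap_reward(simulated_next, realized_next, embed=toy_embed):
    """Higher reward when the simulated next state is close to the realized one."""
    return cosine(embed(simulated_next), embed(realized_next))

print(gap_reward("the drawer is open", "open is the drawer"))
print(gap_reward("drawer open", "lamp off"))
```

Unlike an exact token match, a similarity of this kind does not penalize a prediction for differing from the observation only in wording or word order, which is the contrast with next-state token prediction drawn in the abstract.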

Abstract:
Urban traffic management demands systems that simultaneously predict future conditions, detect anomalies, and take safe corrective actions ‑‑ all while providing reliability guarantees. We present STREAM‑RL, a unified framework that introduces three novel algorithmic contributions: (1) PU‑GAT+, an Uncertainty‑Guided Adaptive Conformal Forecaster that uses prediction uncertainty to dynamically reweight graph attention via confidence‑monotonic attention, achieving distribution‑free coverage guarantees; (2) CRFN‑BY, a Conformal Residual Flow Network that models uncertainty‑normalized residuals via normalizing flows with Benjamini‑Yekutieli FDR control under arbitrary dependence; and (3) LyCon‑WRL+, an Uncertainty‑Guided Safe World‑Model RL agent with Lyapunov stability certificates, certified Lipschitz bounds, and uncertainty‑propagated imagination rollouts. To our knowledge, this is the first framework to propagate calibrated uncertainty from forecasting through anomaly detection to safe policy learning with end‑to‑end theoretical guarantees. Experiments on multiple real‑world traffic trajectory datasets demonstrate that STREAM‑RL achieves 91.4% coverage efficiency, controls FDR at 4.1% under verified dependence, and improves safety rate to 95.2% compared to 69% for standard PPO while achieving higher reward, with 23ms end‑to‑end inference latency.
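
The Benjamini-Yekutieli procedure named above is a standard multiple-testing correction that controls the false discovery rate under arbitrary dependence between tests. The sketch below implements the textbook step-up rule on a list of p-values; the function name and the sample p-values are illustrative, and how STREAM-RL derives its p-values from flow residuals is not described in the abstract.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level q.

    Step-up rule: find the largest rank k such that p_(k) <= k * q / (m * c(m)),
    where c(m) = 1 + 1/2 + ... + 1/m, then reject the k smallest p-values.
    """
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction for dependence
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

flags = benjamini_yekutieli([0.9, 0.001, 0.041, 0.008, 0.039], q=0.05)
print(flags)
```

The harmonic factor `c_m` is what distinguishes this rule from the Benjamini-Hochberg procedure; it makes the thresholds stricter in exchange for validity without independence assumptions.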

Abstract:
Emerging networked systems such as industrial IoT and real‑time cyber‑physical infrastructures demand intelligent scheduling strategies capable of adapting to dynamic traffic, deadlines, and interference constraints. In this work, we present a novel Digital Twin‑enabled scheduling framework, inspired by the Dual Mind World Model (DMWM) architecture, for learning‑informed and imagination‑driven network control. Unlike conventional rule‑based or purely data‑driven policies, the proposed DMWM combines short‑horizon predictive planning with symbolic model‑based rollout, enabling the scheduler to anticipate future network states and adjust transmission decisions accordingly. We implement the framework in a configurable simulation testbed and benchmark its performance against traditional heuristics and reinforcement learning baselines under varied traffic conditions. Our results show that DMWM achieves superior performance in bursty, interference‑limited, and deadline‑sensitive environments, while maintaining interpretability and sample efficiency. The proposed design bridges the gap between network‑level reasoning and low‑overhead learning, marking a step toward scalable and adaptive NDT‑based network optimization.

Abstract:
Planning in interactive environments is challenging under partial observability: task‑critical preconditions (e.g., object locations or container states) may be unknown at decision time, yet grounding them through interaction is costly. Learned world models can cheaply predict missing facts, but prediction errors can silently induce infeasible commitments. We present Active Epistemic Control (AEC), an epistemic‑categorical planning layer that integrates model‑based belief management with categorical feasibility checks. AEC maintains a strict separation between a grounded fact store used for commitment and a belief store used only for pruning candidate plans. At each step, it either queries the environment to ground an unresolved predicate when uncertainty is high or predictions are ambiguous, or simulates the predicate to filter hypotheses when confidence is sufficient. Final commitment is gated by grounded precondition coverage and an SQ‑BCP pullback‑style compatibility check, so simulated beliefs affect efficiency but cannot directly certify feasibility. Experiments on ALFWorld and ScienceWorld show that AEC achieves competitive success with fewer replanning rounds than strong LLM‑agent baselines.
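The query-or-simulate rule and the commitment gate described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the class, function names, and the 0.9 confidence threshold are hypothetical, and the SQ‑BCP compatibility check is omitted entirely.

```python
from dataclasses import dataclass, field


@dataclass
class EpistemicState:
    grounded: dict = field(default_factory=dict)  # facts confirmed by interaction
    beliefs: dict = field(default_factory=dict)   # model predictions, used for pruning only


def resolve(state, predicate, predicted_value, confidence, query_env, threshold=0.9):
    """Query the environment when confidence is low; otherwise simulate the predicate."""
    if confidence < threshold:
        # Costly but reliable: the answer goes into the grounded store.
        state.grounded[predicate] = query_env(predicate)
        return "query"
    # Cheap but fallible: the prediction only ever enters the belief store.
    state.beliefs[predicate] = predicted_value
    return "simulate"


def can_commit(state, preconditions):
    """Commit only if every precondition is grounded and true; beliefs are ignored."""
    return all(state.grounded.get(p) is True for p in preconditions)


# Hypothetical usage: a confident prediction alone never unlocks commitment.
state = EpistemicState()
first = resolve(state, "drawer_open", True, 0.95, query_env=lambda p: True)
blocked = can_commit(state, ["drawer_open"])   # belief only, so still not committable
second = resolve(state, "drawer_open", True, 0.40, query_env=lambda p: True)
allowed = can_commit(state, ["drawer_open"])   # now grounded
```

Keeping the two stores as separate dictionaries makes the property in the abstract structural: `can_commit` never reads `beliefs`, so a wrong prediction can waste effort but cannot certify a plan.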

Abstract:
Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large‑scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate‑space actions and pixel‑space videos, sensitivity to camera viewpoint, and non‑unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate‑space actions into pixel‑aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet‑style pathway, which aligns the action control signals with predicted videos, adds view‑specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow‑based motion loss that focuses on learning dynamic and task‑relevant regions. Experiments on single‑arm (DROID) and dual‑arm (AgiBot‑G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state‑of‑the‑art methods. We further demonstrate the potential of BridgeV2W on downstream real‑world tasks, including policy evaluation and goal‑conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io.

Abstract:
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long‑horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre‑trained teacher models and sequence‑level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long‑horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle‑consistency objective, thereby eliminating the need for teacher‑based distillation. Specifically, LIVE first performs a forward rollout from ground‑truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long‑horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state‑of‑the‑art performance on long‑horizon benchmarks, generating stable, high‑quality videos far beyond training rollout lengths.

Abstract:
World models offer a principled framework for simulating future states under interventions, but realizing such models in complex, high‑stakes domains like medicine remains challenging. Recent large language models (LLMs) have achieved strong performance on static medical reasoning tasks, raising the question of whether they can function as dynamic medical world models capable of simulating disease progression and treatment outcomes over time. In this work, we show that LLMs only incorporating medical knowledge struggle to maintain consistent patient states under sequential interventions, leading to error accumulation in long‑horizon clinical simulation. To address this limitation, we introduce EHRWorld, a patient‑centric medical world model trained under a causal sequential paradigm, together with EHRWorld‑110K, a large‑scale longitudinal clinical dataset derived from real‑world electronic health records. Extensive evaluations demonstrate that EHRWorld significantly outperforms naive LLM‑based baselines, achieving more stable long‑horizon simulation, improved modeling of clinically sensitive events, and favorable reasoning efficiency, highlighting the necessity of training on causally grounded, temporally evolving clinical data for reliable and robust medical world modeling.

Abstract:
Adaptive cognition requires structured internal models of objects and their relations. Predictive neural networks are often proposed to learn such world models, but how these are instantiated and how they support prediction remain unclear. We investigate this in a minimal in‑silico setting. A recurrent neural network samples tokens sequentially from 2D continuous token scenes and is trained to predict the upcoming token from the current input and a saccade‑like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in‑context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out‑of‑distribution bindings can be learned as well. Together, these findings show how structured representations relying on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

Abstract:
Deciding whether an agent possesses a model of its surrounding world is a fundamental step toward understanding its capabilities and limitations. In [10], it was shown that, within a particular framework, every almost optimal and general agent necessarily contains sufficient knowledge of its environment to allow an approximate reconstruction of it by querying the agent as a black box. This result relied on the assumptions that the agent is deterministic and that the environment is fully observable. In this work, we remove both assumptions by extending the theorem to stochastic agents operating in partially observable environments. Fundamentally, this shows that stochastic agents cannot avoid learning their environment through the usage of randomization. We also strengthen the result by weakening the notion of generality, proving that less powerful agents already contain a model of the world in which they operate.

Abstract:
Operating in environments alongside humans requires robots to make decisions under uncertainty. In addition to exogenous dynamics, they must reason over others' hidden mental‑models and mental‑states. While Interactive POMDPs and Bayesian Theory of Mind formulations are principled, exact nested‑belief inference is intractable, and hand‑specified models are brittle in open‑world settings. We address both by learning structured mental‑models and an estimator of others' mental‑states. Building on the Influence‑Based Abstraction, we instantiate an Influence‑Augmented Local Model to decompose socially‑aware robot tasks into local dynamics, social influences, and exogenous factors. We propose (a) a neuro‑symbolic world model instantiating a factored, discrete Dynamic Bayesian Network, and (b) a perspective‑shift operator modeled as an amortized Schrödinger Bridge over the learned local dynamics that transports factored egocentric beliefs into other‑centric beliefs. We show that this architecture enables agents to synthesize socially‑aware policies in model‑based reinforcement learning, via decision‑time mental‑state planning (a Schrödinger Bridge in belief space), with preliminary results in a MiniGrid social navigation task.

Abstract:
Building agents that can perform new skills by composing existing skills is a long‑standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model‑free hierarchical reinforcement learning algorithms require large amounts of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns, in a sample‑efficient way, an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object‑Centric Atari games, that our method can learn more skills using less data than baseline methods and possesses learning and generalization capabilities that the baselines do not have.

Abstract:
Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software‑based simulator, are limited by the amount of expert data available and the sim‑to‑real gap for manipulation. With the recent emergence of world models learned from real‑world video‑action data, we ask whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real‑robot performance. We propose World‑Gymnast, which performs RL finetuning of a vision‑language‑action (VLA) policy by rolling out the policy in an action‑conditioned video world model and rewarding the rollouts with a vision‑language model (VLM). On the Bridge robot setup, World‑Gymnast outperforms SFT by as much as 18x and outperforms a software simulator by as much as 2x. More importantly, World‑Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test‑time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.

Abstract:
We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice‑model‑assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed‑model deployment regime, we prove that tabular Q‑learning with model‑imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q‑function, where $\varepsilon$ summarizes partial‑model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity‑buffer DQN baseline in stationary settings; (ii) positive effects under in‑family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm–Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4–2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
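The idea of imputing the delayed reward component at decision time can be sketched with a single tabular Q‑learning update. This is a minimal sketch under stated assumptions: the state and action names, the reward split, and the hyperparameters are hypothetical, and `imputed_delayed` stands in for whatever the calibrated choice model would predict (for example, expected cancellation loss).

```python
from collections import defaultdict


def q_update(q, state, action, next_state, actions,
             immediate_reward, imputed_delayed, gamma=0.95, lr=0.1):
    """One tabular Q-learning step whose target uses a model-imputed delayed reward."""
    # The delayed component is not yet observed, so the model's estimate is used.
    reward = immediate_reward + imputed_delayed
    # Standard bootstrapped target: reward plus discounted best next-state value.
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical booking: 100 in revenue now, with an imputed expected loss of 20
# from later cancellation or modification.
q = defaultdict(float)
new_value = q_update(q, "s0", "accept", "s1", ["accept", "reject"],
                     immediate_reward=100.0, imputed_delayed=-20.0)
```

One informal way to read the $O(\varepsilon/(1-\gamma))$ term: if each imputed target is off by at most $\varepsilon$, discounting and summing that error over future steps gives at most $\varepsilon(1 + \gamma + \gamma^2 + \dots) = \varepsilon/(1-\gamma)$.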

Abstract:
World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single‑modality generation, typically focusing on either multi‑camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single‑stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR‑specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi‑camera images. To ensure cross‑modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state‑of‑the‑art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream

Abstract:
Large‑scale video generative models have shown emerging capabilities as zero‑shot visual planners, yet video‑generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP‑WM), a planning method that grounds video‑generated plans into feasible action sequences using a learned action‑conditioned world model. At test‑time, GVP‑WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video‑guided latent collocation. In particular, we formulate grounding as a goal‑conditioned latent‑space trajectory optimization problem that jointly optimizes latent states and actions under world‑model dynamics, while preserving semantic alignment with the video‑generated plan. Empirically, GVP‑WM recovers feasible long‑horizon plans from zero‑shot image‑to‑video‑generated and motion‑blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.

Abstract:
With the widespread deployment of Computer‑using Agents (CUAs) in complex real‑world environments, prevalent long‑term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short‑term risks (e.g., clicking on a phishing link), they cannot proactively avoid long‑term risks: seemingly reasonable actions can lead to high‑risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk‑to‑decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short‑ and long‑term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short‑term and long‑term risks, thereby identifying and pruning actions that lead to high‑risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidance through step‑level interventions and task‑level re‑planning. Extensive experiments show that SafePred significantly reduces high‑risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.

Abstract:
While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top‑tier venues, we introduce a unified three‑axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long‑horizon spatial tasks; (2) GNN‑LLM integration (Task axis) is a promising approach for structured spatial reasoning; and (3) world models (Scale axis) are essential for safe deployment across micro‑to‑macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross‑domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially‑aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.

Abstract:
World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, the current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task‑specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.

Abstract:
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train‑ and inference‑time. However, current approaches face a critical trade‑off: text‑based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision‑Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre‑training on structured web code enables high‑fidelity visual generation. We introduce gWorld (8B, 32B), the first open‑weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code‑based training data. In extensive evaluation across 4 in‑ and 2 out‑of‑distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open‑weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

Abstract:
A fundamental challenge in multi‑task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit substantial heterogeneity in both observations and dynamics. Model‑based reinforcement learning offers a promising path to improved sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, resulting in poor reconstruction and prediction accuracy. We introduce Mixture‑of‑World Models (MoW), a scalable architecture that combines modular variational autoencoders for task‑adaptive visual compression, a hybrid Transformer‑based dynamics model with task‑conditioned experts and a shared backbone, and a gradient‑based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, a single MoW agent trained once on 26 Atari games achieves a mean human‑normalized score of 110.4%, competitive with the score of 114.2% achieved by STORM, an ensemble of 26 task‑specific models, while using 50% fewer parameters. On Meta‑World, MoW achieves a 74.5% average success rate within 300 thousand environment steps, establishing a new state of the art. These results demonstrate that MoW provides a scalable and parameter‑efficient foundation for generalist world models.
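For readers unfamiliar with the metric, the human‑normalized score used on Atari benchmarks rescales each game's raw score so that random play maps to 0 and the human reference maps to 1 (reported as 0% and 100%). A minimal sketch follows; the game names and numbers are invented for illustration and are not results from the paper.

```python
def human_normalized_score(agent, random_score, human_score):
    """Rescale a raw score: random play -> 0.0, human reference -> 1.0."""
    return (agent - random_score) / (human_score - random_score)


def mean_human_normalized(per_game):
    """Average the normalized score over games; values are (agent, random, human)."""
    scores = [human_normalized_score(*triple) for triple in per_game.values()]
    return sum(scores) / len(scores)


# Invented raw scores for two hypothetical games.
example = {
    "game_a": (60.0, 10.0, 110.0),  # halfway between random and human: 0.5
    "game_b": (30.0, 0.0, 20.0),    # above the human reference: 1.5
}
mean_score = mean_human_normalized(example)  # 1.0, i.e. 100%
```

Because the mean is taken after normalization, a few games with scores far above the human reference can dominate it, which is why the median is often reported alongside.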

Abstract:
Large language model (LLM) agents trained using reinforcement learning have achieved superhuman performance in low‑cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance for these complex and high‑cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high‑cost actions such as extreme off‑policy learning and sample inefficiency in long‑horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.

Abstract:
Real‑time generative game engines represent a paradigm shift in interactive simulation, promising to replace traditional graphics pipelines with neural world models. However, existing approaches are fundamentally constrained by the "Memory Wall," restricting practical deployments to low resolutions (e.g., 64 × 64). This paper bridges the gap between generative models and high‑resolution neural simulations by introducing a scalable Hardware‑Algorithm Co‑Design framework. We identify that high‑resolution generation suffers from a critical resource mismatch: the World Model is compute‑bound while the Decoder is memory‑bound. To address this, we propose a heterogeneous architecture that intelligently decouples these components across a cluster of AI accelerators. Our system features three core innovations: (1) an asymmetric resource allocation strategy that optimizes throughput under sequence parallelism constraints; (2) a memory‑centric operator fusion scheme that minimizes off‑chip bandwidth usage; and (3) a manifold‑aware latent extrapolation mechanism that exploits temporal redundancy to mask latency. We validate our approach on a cluster of programmable AI accelerators, enabling real‑time generation at 720 × 480 resolution, a 50× increase in pixel throughput over prior baselines. Evaluated on both continuous 3D racing and discrete 2D platformer benchmarks, our system delivers fluid 26.4 FPS and 48.3 FPS respectively, with an amortized effective latency of 2.7 ms. This work demonstrates that resolving the "Memory Wall" via architectural co‑design is not merely an optimization, but a prerequisite for enabling high‑fidelity, responsive neural gameplay.

Abstract:
As wireless communication networks grow in scale and complexity, diverse resource allocation tasks become increasingly critical. Multi‑Agent Reinforcement Learning (MARL) provides a promising solution for distributed control, yet it often requires costly real‑world interactions and lacks generalization across diverse tasks. Meanwhile, recent advances in Diffusion Models (DMs) have demonstrated strong capabilities in modeling complex dynamics and supporting high‑fidelity simulation. Motivated by these challenges and opportunities, we propose a Communication‑based Diffusion World Model (NetWorld) to enable few‑shot generalization across heterogeneous MARL tasks in wireless networks. To improve applicability to large‑scale distributed networks, NetWorld adopts the Distributed Training with Decentralized Execution (DTDE) paradigm and is organized into a two‑stage framework: (i) pre‑training a classifier‑guided conditional diffusion world model on multi‑task offline datasets, and (ii) performing trajectory planning entirely within this world model to avoid additional online interaction. Cross‑task heterogeneity is handled via shared latent processing for observations, two‑hot discretization for task‑specific actions and rewards, and an inverse dynamics model for action recovery. We further introduce a lightweight Mean Field (MF) communication mechanism to reduce non‑stationarity and promote coordinated behaviors with low overhead. Experiments on three representative tasks demonstrate improved performance and sample efficiency over MARL baselines, indicating strong scalability and practical potential for wireless network optimization.
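Two‑hot discretization, mentioned above for actions and rewards, represents a scalar as weights on the two neighbouring bin centres so that the weighted sum recovers the original value. The sketch below shows the generic encoding, assuming a sorted list of bin centres; the bin layout is illustrative and NetWorld's actual bins are not given in the abstract.

```python
import bisect


def two_hot(x, bins):
    """Encode scalar x as weights over sorted bin centres (needs at least two bins)."""
    # Clip so that values outside the bin range map onto the end bins.
    x = min(max(x, bins[0]), bins[-1])
    # Index of the first bin centre strictly greater than x, capped at the last bin.
    hi = min(bisect.bisect_right(bins, x), len(bins) - 1)
    lo = hi - 1
    # Linear interpolation weight toward the upper neighbour.
    w_hi = (x - bins[lo]) / (bins[hi] - bins[lo])
    weights = [0.0] * len(bins)
    weights[lo] = 1.0 - w_hi
    weights[hi] = w_hi
    return weights


def decode(weights, bins):
    """Recover the scalar as the weighted sum of bin centres."""
    return sum(w * b for w, b in zip(weights, bins))


bins = [-1.0, 0.0, 1.0, 2.0]      # illustrative bin centres
encoded = two_hot(0.25, bins)     # weight 0.75 on the 0.0 bin, 0.25 on the 1.0 bin
recovered = decode(encoded, bins)
```

A network can then be trained with a cross‑entropy loss against these soft targets, which lets heterogeneous tasks with differently scaled rewards share one categorical output head.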

Abstract:
World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving long‑horizon control tasks from visual input. Our method treats states as optimization variables ("virtual states") with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high‑dimensional vision‑based world models, we modify the gradient structure to descend towards valid plans while only requiring action‑input gradients. Our planner, which we call GRASP (Gradient RelAxed Stochastic Planner), can be viewed as a stochastic version of a non‑condensed or collocation‑based optimal controller. We provide theoretical justification and experiments on video‑based world models, where our resulting planner outperforms existing planning algorithms like the cross‑entropy method (CEM) and vanilla gradient‑based optimization (GD) on long‑horizon experiments, both in success rate and time to convergence.
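The "virtual states" idea, treating states as free optimization variables tied together by soft dynamics penalties, can be illustrated on a toy problem. This is a deliberately simplified sketch and not GRASP itself: it assumes known 1‑D dynamics f(s, a) = s + a in place of a learned video world model, uses plain gradient descent, and omits the state noise and the modified gradient structure the abstract describes. All names and hyperparameters are hypothetical.

```python
def plan_with_virtual_states(s0, goal, horizon=5, steps=500, lr=0.05):
    """Toy collocation: optimize states and actions jointly under soft dynamics."""
    states = [float(s0)] * (horizon + 1)  # states[0] is fixed; states[1:] are free
    actions = [0.0] * horizon

    def loss():
        # Soft dynamics penalty plus a terminal goal penalty.
        dyn = sum((states[t + 1] - states[t] - actions[t]) ** 2 for t in range(horizon))
        return dyn + (states[-1] - goal) ** 2

    initial_loss = loss()
    for _ in range(steps):
        g_states = [0.0] * (horizon + 1)
        g_actions = [0.0] * horizon
        # Each residual touches only its own timestep, so these gradient terms
        # are independent across t and could be computed in parallel.
        for t in range(horizon):
            r = states[t + 1] - states[t] - actions[t]
            g_states[t + 1] += 2 * r
            g_states[t] -= 2 * r
            g_actions[t] -= 2 * r
        g_states[-1] += 2 * (states[-1] - goal)
        for t in range(1, horizon + 1):   # skip the fixed initial state
            states[t] -= lr * g_states[t]
        for t in range(horizon):
            actions[t] -= lr * g_actions[t]
    return states, actions, initial_loss, loss()


states, actions, before, after = plan_with_virtual_states(s0=0.0, goal=1.0)
```

In contrast to shooting methods, which backpropagate through the whole rollout, every term here depends on one transition only, which is what makes the formulation parallelizable and less sensitive to long gradient chains.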

Abstract:
We present DISK, a training‑free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego‑trajectory via dual‑branch controllers with cross‑modal skip decisions, preserving motion‑appearance consistency without retraining. We extend higher‑order latent‑difference skip testing to the autoregressive chain‑of‑forward regime and propagate controller statistics through rollout loops for long‑horizon stability. When integrated into closed‑loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long‑horizon video‑and‑trajectory prediction at substantially reduced cost.

Abstract:
Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real‑world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end‑to‑end neural framework that integrates 3D Gaussian perception with physics‑based dynamic modeling to generate interactive, physically realistic 4D videos from multi‑view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi‑object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics‑grounded world models.

Abstract:
The scalability of embodied intelligence is fundamentally constrained by the scarcity of real‑world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics‑driven world modeling and simulation framework that enables efficient generation of high‑fidelity embodied training data using only multi‑view environment videos and off‑the‑shelf assets. Our approach reconstructs real‑world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine‑grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero‑shot performance on downstream tasks, matching or even surpassing results obtained with real‑world data, highlighting the potential of reconstruction‑driven world modeling for scalable and practical embodied intelligence training.

Abstract:
The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general‑purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model‑based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on‑policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state‑of‑the‑art open‑source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

Abstract:
Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface‑level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow‑based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW‑bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high‑fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release a GitHub repository for setting up and evaluating WoW.

Abstract:
Large language models (LLMs) trained with next‑word‑prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next‑token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB‑Structure, a world model for structured EHR that grounds a joint‑embedding prediction architecture (JEPA) with next‑token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large‑scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient‑years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB‑Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB‑v1‑1.7B‑Structure.

Abstract:
Partial differential equation (PDE) simulations are fundamental to engineering and physics but are often computationally prohibitive for real‑time applications. While generative AI offers a promising avenue for surrogate modeling, standard video generation architectures lack the specific control and data compatibility required for physical simulations. This paper introduces a geometry aware world model architecture, derived from a video generation architecture (LongVideoGAN), designed to learn transient physics. We introduce two key architecture elements: (1) a twofold conditioning mechanism incorporating global physical parameters and local geometric masks, and (2) an architectural adaptation to support arbitrary channel dimensions, moving beyond standard RGB constraints. We evaluate this approach on a 2D transient computational fluid dynamics (CFD) problem involving convective heat transfer from buoyancy‑driven flow coupled to a heat flow in a solid structure. We demonstrate that the conditioned model successfully reproduces complex temporal dynamics and spatial correlations of the training data. Furthermore, we assess the model's generalization capabilities on unseen geometric configurations, highlighting both its potential for controlled simulation synthesis and current limitations in spatial precision for out‑of‑distribution samples.

Abstract:
End‑to‑end autonomous driving increasingly leverages self‑supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive‑JEPA, a framework that integrates Video Joint‑Embedding Predictive Architecture (V‑JEPA) with multimodal trajectory distillation for end‑to‑end driving. First, we adapt V‑JEPA for end‑to‑end driving, pretraining a ViT encoder on large‑scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal‑centric planner that distills diverse simulator‑generated trajectories alongside human trajectories, with a momentum‑aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V‑JEPA representation combined with a simple transformer‑based decoder outperforms prior methods by 3 PDMS in the perception‑free setting. The complete Drive‑JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state‑of‑the‑art.

Abstract:
Reinforcement learning (RL) is widely used for humanoid control, with on‑policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large‑scale parallel simulation and, in some cases, zero‑shot deployment to real robots. However, the low sample efficiency of on‑policy algorithms limits safe adaptation to new environments. Although off‑policy RL and model‑based RL have shown improved sample efficiency, the gap between large‑scale pretraining and efficient finetuning on humanoids still exists. In this paper, we find that off‑policy Soft Actor‑Critic (SAC), with large‑batch update and a high Update‑To‑Data (UTD) ratio, reliably supports large‑scale pretraining of humanoid locomotion policies, achieving zero‑shot deployment on real robots. For adaptation, we demonstrate that these SAC‑pretrained policies can be finetuned in new environments and out‑of‑distribution tasks using model‑based methods. Data collection in the new environment executes a deterministic policy while stochastic exploration is instead confined to a physics‑informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall‑clock efficiency of large‑scale simulation during pretraining with the sample efficiency of model‑based learning during fine‑tuning. For code and videos, see https://lift‑humanoid.github.io

Abstract:
Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real‑world dynamics. Existing physics‑based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video‑based benchmark specifically designed for concept‑specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low‑level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video‑based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real‑world interactions. Through its concept‑specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world‑model‑driven learning.

Abstract:
Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi‑agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self‑Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM‑based AHD from trial‑and‑error evolution toward state‑aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

Abstract:
Gradient‑regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state‑action spaces to model not only the distribution over scalar state‑action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one‑step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample‑based and employs Max‑sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev‑augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade‑off underlying contraction in gradient‑aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.

Abstract:
In a recent preprint [arXiv:2601.14134v1], Rubin argues that the arrow of time originates from the monotonic growth of the volume of extra dimensions. While the identification of a geometric origin for time's arrow is compelling in the case of brane‑world models, we point out a possible tension between the proposed volume growth and the observational stability of the effective four‑dimensional Newton's gravitational constant, G, that may arise in Kaluza‑Klein (KK) theory. In standard KK approaches, such volume growth induces a time‑variation of G that exceeds Big Bang Nucleosynthesis (BBN) and Lunar Laser Ranging (LLR) bounds by many orders of magnitude. To resolve this tension while preserving the author's key insight in the Kaluza‑Klein case, we propose an extension: the "shape‑dynamic arrow of time". By utilizing the scale‑invariant monotonicity of Perelman's nu‑entropy under normalized Ricci flow, we demonstrate how an arrow of time can emerge from the geometric smoothing of extra dimensions at fixed volume, thereby satisfying observational constraints on fundamental constants.
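
The tension described above can be made concrete with the standard dimensional‑reduction relation of Kaluza‑Klein theory. The symbols below ($G_4$ for the effective four‑dimensional constant, $G_{4+n}$ for the higher‑dimensional one, $V_n$ for the volume of the $n$ extra dimensions) are notation assumed for this sketch, not taken from the abstract, and $G_{4+n}$ is assumed constant.

```latex
G_4 = \frac{G_{4+n}}{V_n}
\quad\Longrightarrow\quad
\frac{\dot{G}_4}{G_4} = -\frac{\dot{V}_n}{V_n}
```

Under these assumptions, any monotonic growth of $V_n$ translates directly into a drift of $G_4$, whereas evolution at fixed volume ($\dot{V}_n = 0$) leaves $G_4$ unchanged at this level.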

Abstract:
With the development of foundation models (FMs), agentic AI systems are getting more attention, yet their inherent issues like hallucination and poor reasoning, coupled with the frequent ad‑hoc nature of system design, lead to unreliable and brittle applications. Existing efforts to characterise agentic design patterns often lack a rigorous systems‑theoretic foundation, resulting in high‑level or convenience‑based taxonomies that are difficult to implement. This paper addresses this gap by introducing a principled methodology for engineering robust AI agents. We propose two primary contributions: first, a novel system‑theoretic framework that deconstructs an agentic AI system into five core, interacting functional subsystems: Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, and Inter‑Agent Communication. Second, derived from this architecture and directly mapped to a comprehensive taxonomy of agentic challenges, we present a collection of 12 agentic design patterns. These patterns, categorised as Foundational, Cognitive & Decisional, Execution & Interaction, and Adaptive & Learning, offer reusable, structural solutions to recurring problems in agent design. The utility of the framework is demonstrated by a case study on the ReAct framework, showing how the proposed patterns can rectify systemic architectural deficiencies. This work provides a foundational language and a structured methodology to standardise agentic design communication among researchers and engineers, leading to more modular, understandable, and reliable autonomous systems.

Abstract:
Scenes are continuously undergoing dynamic changes in the real world. However, existing human‑scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn‑HSI, the first cognitive architecture for dynamic human‑scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with a Dynamic Scene‑Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context‑aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with a Human‑Scene Interaction Diffusion Model, which generates high‑fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend the existing static human‑scene interaction datasets to construct a dynamic benchmark, Dyn‑Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn‑HSI, showing that our method consistently outperforms existing approaches and generates high‑quality human‑scene interaction motions in both static and dynamic settings.

Abstract:
Building world models is essential for planning in real‑world domains such as businesses. Since such domains have rich semantics, we can leverage world knowledge to effectively model complex action effects and causal relationships from limited data. In this work, we propose CASSANDRA, a neurosymbolic world modeling approach that leverages an LLM as a knowledge prior to construct lightweight transition models for planning. CASSANDRA integrates two components: (1) LLM‑synthesized code to model deterministic features, and (2) LLM‑guided structure learning of a probabilistic graphical model to capture causal relationships among stochastic variables. We evaluate CASSANDRA in (i) a small‑scale coffee‑shop simulator and (ii) a complex theme park business simulator, where we demonstrate significant improvements in transition prediction and planning over baselines.

Abstract:
The vision‑language‑action (VLA) paradigm has enabled powerful robotic control by leveraging vision‑language models, but its reliance on large‑scale, high‑quality robot data limits its generalization. Generative world models offer a promising alternative for general‑purpose embodied AI, yet a critical gap remains between their pixel‑level plans and physically executable actions. To this end, we propose the Tool‑Centric Inverse Dynamics Model (TC‑IDM). By focusing on the tool's imagined trajectory as synthesized by the world model, TC‑IDM establishes a robust intermediate representation that bridges the gap between visual planning and physical control. TC‑IDM extracts the tool's point cloud trajectories via segmentation and 3D motion estimation from generated videos. Considering diverse tool attributes, our architecture employs decoupled action heads to project these planned trajectories into 6‑DoF end‑effector motions and corresponding control signals. This plan‑and‑translate paradigm not only supports a wide range of end‑effectors but also significantly improves viewpoint invariance. Furthermore, it exhibits strong generalization capabilities across long‑horizon and out‑of‑distribution tasks, including interacting with deformable objects. In real‑world evaluations, the world model with TC‑IDM achieves an average success rate of 61.11 percent, with 77.7 percent on simple tasks and 38.46 percent on zero‑shot deformable object tasks. It substantially outperforms end‑to‑end VLA‑style baselines and other inverse dynamics models.

Abstract:
Offline Reinforcement Learning (ORL) holds immense promise for safety‑critical domains like industrial robotics, where real‑time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates high degrees of conservatism that can restrain potential policy improvements. We present MoReBRAC, a model‑based framework that addresses this limitation through uncertainty‑aware latent synthesis. Instead of relying solely on the fixed data, MoReBRAC utilizes a dual‑recurrent world model to synthesize high‑fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi‑layered filtering process guarantees that only transitions residing within high‑confidence regions of the learned dynamics are utilized. Our results on D4RL Gym‑MuJoCo benchmarks reveal significant performance gains, particularly in "random" and "suboptimal" data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade‑offs encountered when learning from near‑optimal datasets.
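
As a rough illustration of the MC dropout stage of such a filtering pipeline, the sketch below keeps only candidate inputs whose predictions are stable across stochastic forward passes. It is not MoReBRAC's implementation: the toy linear model, the function names, and the `max_std` threshold are assumptions chosen for the example, and the VAE and sensitivity stages are omitted.

```python
import random
import statistics


def mc_dropout_predict(weights, features, drop_prob, rng):
    """One stochastic forward pass of a toy linear model with input dropout."""
    keep = 1.0 - drop_prob
    total = 0.0
    for w, x in zip(weights, features):
        if rng.random() < keep:
            total += w * x / keep  # inverted-dropout scaling keeps the mean unchanged
    return total


def filter_by_uncertainty(candidates, weights, n_passes, drop_prob, max_std, rng):
    """Keep candidates whose prediction spread over n_passes stays within max_std."""
    kept = []
    for features in candidates:
        preds = [mc_dropout_predict(weights, features, drop_prob, rng)
                 for _ in range(n_passes)]
        # Population standard deviation across passes is the uncertainty proxy.
        if statistics.pstdev(preds) <= max_std:
            kept.append(features)
    return kept


candidates = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical synthetic inputs
kept = filter_by_uncertainty(candidates, [1.0, 1.0], 50, 0.5, 1.0, random.Random(0))
```

In this toy setup the all‑zero input is unaffected by dropout and is kept, while the large‑magnitude input produces widely varying predictions and is discarded.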

Abstract:
Humanoid robot loco‑manipulation remains constrained by the semantic‑physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM‑driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion prior fusion mechanism leverages a pre‑trained multi‑expert policy library as transferable knowledge, enabling efficient online adaptation via a two‑stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing symbol grounding. Experiments on Humanoid‑Bench show MetaWorld outperforms world model‑based RL in task completion and motion coherence. Our code will be available at https://anonymous.4open.science/r/metaworld‑2BF4/

Abstract:
Large Language Models (LLMs) generate fluent text, yet whether they truly understand the world or merely produce plausible texts about it remains contested. We propose an architectural principle, the mouth is not the brain, that explicitly separates world models from language models. Our architecture comprises three components: a DBM that captures domain structure as an energy‑based world model, an adapter that projects latent belief states into embedding space, and a frozen GPT‑2 that provides linguistic competence without domain knowledge. We instantiate this framework in the consumer review domain using Amazon smartphone reviews. Experiments demonstrate that (1) world model conditioning achieves lower cross‑entropy loss and higher semantic similarity than architectural baselines including direct projection and full fine‑tuning, while qualitative analysis reveals that soft prompt conditioning resolves a trade‑off that prompt‑based approaches cannot: simple prompts lack expressiveness while detailed prompts cause output collapse in small LLMs; (2) the DBM's energy function distinguishes coherent from incoherent market configurations, assigning higher energy to implausible brand‑price combinations; and (3) interventions on specific attributes propagate causally to generated text with intervened outputs exhibiting distributions statistically consistent with naturally occurring samples sharing the target configuration. These findings suggest that even small‑scale language models can achieve consistent, controllable generation when connected to an appropriate world model, providing empirical support for separating linguistic competence from world understanding.

Abstract:
Large‑scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state‑centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data‑driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning‑prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general‑purpose world simulators.

Abstract:
Adapting to unforeseen novelties in open‑world environments remains a major challenge for autonomous systems. While hybrid planning and reinforcement learning (RL) approaches show promise, they often suffer from sample inefficiency, slow adaptation, and catastrophic forgetting. We present a neuro‑symbolic framework integrating hierarchical abstractions, task and motion planning (TAMP), and reinforcement learning to enable rapid adaptation in robotics. Our architecture combines symbolic goal‑oriented learning and world model‑based exploration to facilitate rapid adaptation to environmental changes. Validated in robotic manipulation and autonomous driving, our approach achieves faster convergence, improved sample efficiency, and superior robustness over state‑of‑the‑art hybrid methods, demonstrating its potential for real‑world deployment.

Abstract:
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post‑training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos‑Predict2) into an effective robot policy through a single stage of post‑training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test‑time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state‑of‑the‑art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real‑world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model‑based policies, and state‑of‑the‑art vision‑language‑action models fine‑tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model‑based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos‑policy/

Abstract:
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common‑sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law‑consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center‑of‑mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics‑aware multimodal models. Our data will be released upon acceptance.

Abstract:
A world model is an AI system that simulates how an environment evolves under actions, enabling planning through imagined futures rather than reactive perception. Current world models, however, suffer from visual conflation: the mistaken assumption that high‑fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety‑critical decision‑making. This survey argues that visual realism is an unreliable proxy for world understanding. Instead, effective world models must encode causal structure, respect domain‑specific constraints, and remain stable over long horizons. We propose a reframing of world models as actionable simulators rather than visual engines, emphasizing structured 4D interfaces, constraint‑aware dynamics, and closed‑loop evaluation. Using medical decision‑making as an epistemic stress test, where trial‑and‑error is impossible and errors are irreversible, we demonstrate that a world model's value is determined not by how realistic its rollouts appear, but by its ability to support counterfactual reasoning, intervention planning, and robust long‑horizon foresight.

Abstract:
What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture‑agnostic method that transforms any pretrained video diffusion model into an action‑conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet‑scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3‑DoF mobile robots to 25‑DoF humanoids, where predicting egocentric joint‑angle‑driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine‑tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state‑of‑the‑art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.

Abstract:
Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for self‑supervised learning by predicting latent representations rather than reconstructing high‑entropy observations. However, existing formulations rely on deterministic regression objectives, which mask probabilistic semantics and limit their applicability in stochastic control. In this work, we introduce Variational JEPA (VJEPA), a probabilistic generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations can serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We further propose Bayesian JEPA (BJEPA), an extension that factorizes the predictive belief into a learned dynamics expert and a modular prior expert, enabling zero‑shot task transfer and constraint (e.g. goal, physics) satisfaction via a Product of Experts. Empirically, through a noisy environment experiment, we demonstrate that VJEPA and BJEPA successfully filter out high‑variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g. constructing credible intervals via sampling) while remaining likelihood‑free regarding observations, VJEPA provides a foundational framework for scalable, robust, uncertainty‑aware planning in high‑dimensional, noisy environments.
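
To illustrate how a Product of Experts can combine a dynamics expert with a prior expert, the sketch below multiplies two one‑dimensional Gaussian beliefs. The Gaussian form, the scalar latent, and the function name `product_of_gaussians` are assumptions for the example; the abstract does not specify the distributions BJEPA uses.

```python
def product_of_gaussians(mean_a, var_a, mean_b, var_b):
    """Return the mean and variance of the normalized product of two 1-D Gaussians.

    Precisions (inverse variances) add, and the combined mean is the
    precision-weighted average of the two expert means.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    prec_a = 1.0 / var_a
    prec_b = 1.0 / var_b
    var = 1.0 / (prec_a + prec_b)
    mean = var * (prec_a * mean_a + prec_b * mean_b)
    return mean, var


# Hypothetical dynamics expert N(0, 1) combined with a goal prior N(2, 1).
mean, var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
```

With two equally confident experts, the combined belief sits halfway between them (mean 1.0) and is sharper than either one (variance 0.5). A tighter prior would pull the result further toward the constraint.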

Abstract:
Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations, generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource‑intensive training or fine‑tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open‑ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB‑ALFRED and EB‑Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross‑model and cross‑environment transferability.

Abstract:
This paper proposes an Active Inference‑based framework for autonomous trajectory design in UAV swarms. The method integrates probabilistic reasoning and self‑learning to enable distributed mission allocation, route ordering, and motion planning. Expert trajectories generated using a Genetic Algorithm with Repulsion Forces (GA‑RF) are employed to train a hierarchical World Model capturing swarm behavior across mission, route, and motion levels. During online operation, UAVs infer actions by minimizing divergence between current beliefs and model‑predicted states, enabling adaptive responses to dynamic environments. Simulation results show faster convergence, higher stability, and safer navigation than Q‑Learning, demonstrating the scalability and cognitive grounding of the proposed framework for intelligent UAV swarm control.

Abstract:
Recently, video‑based world models that learn to simulate the dynamics have gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially for contact‑rich manipulation tasks, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework that employs reinforcement learning to align video‑based embodied world models with physical realism, task completion capability, embodiment plausibility, and visual quality. Specifically, we first construct a large‑scale (~235K) video preference dataset and employ it to train a hierarchical reward model designed to capture multi‑dimensional rewards consistent with human preferences. We further propose a practical alignment algorithm that post‑trains flow‑based world models using this reward through a computationally efficient PPO‑style algorithm. Comprehensive experiments and theoretical analysis demonstrate that ReWorld significantly improves the physical fidelity, logical coherence, embodiment, and visual quality of generated rollouts, outperforming previous methods.

Abstract:
Navigation is a fundamental capability for mobile robots. While the current trend is to use learning‑based approaches to replace traditional geometry‑based methods, existing end‑to‑end learning‑based policies often struggle with 3D spatial reasoning and lack a comprehensive understanding of physical world dynamics. Integrating world models, which predict future observations conditioned on given actions, with iterative optimization planning offers a promising solution due to their capacity for imagination and flexibility. However, current navigation world models, typically built on pure transformer architectures, often rely on multi‑step diffusion processes and autoregressive frame‑by‑frame generation. These mechanisms result in prohibitive computational latency, rendering real‑time deployment impossible. To address this bottleneck, we propose a lightweight navigation world model that adopts a one‑step generation paradigm and a 3D U‑Net backbone equipped with efficient spatial‑temporal attention. This design drastically reduces inference latency, enabling high‑frequency control while achieving superior predictive performance. We also integrate this model into an optimization‑based planning framework utilizing anchor‑based initialization to handle multi‑modal goal navigation tasks. Extensive closed‑loop experiments in both simulation and real‑world environments demonstrate our system's superior efficiency and robustness compared to state‑of‑the‑art baselines.

Abstract:
Numerous offline and model‑based reinforcement learning systems incorporate world models to emulate the inherent environments. A world model is particularly important in scenarios where direct interaction with the real environment is costly, dangerous, or impractical. The efficacy and interpretability of such world models are notably contingent upon the quality of the underlying training data. In this context, we introduce Action Shapley as an agnostic metric for the judicious and unbiased selection of training data. To facilitate the computation of Action Shapley, we present a randomized dynamic algorithm specifically designed to mitigate the exponential complexity inherent in traditional Shapley value computations. Through empirical validation across five data‑constrained real‑world case studies, the algorithm demonstrates a computational efficiency improvement exceeding 80% in comparison to conventional exponential‑time computations. Furthermore, our Action Shapley‑based training data selection policy consistently outperforms ad‑hoc training data selection.
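
To make the cost issue concrete: an exact Shapley value averages a data point's marginal contribution over all orderings, which grows exponentially with the number of points. The sketch below uses generic Monte Carlo permutation sampling, a standard approximation that is swapped in here for illustration and is not the paper's randomized dynamic algorithm. The names `monte_carlo_shapley` and `toy_value`, and the additive toy value function, are hypothetical.

```python
import random


def monte_carlo_shapley(items, value_fn, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {item: 0.0 for item in items}
    for _ in range(num_permutations):
        order = list(items)
        rng.shuffle(order)
        coalition = []
        prev_value = value_fn(coalition)
        for item in order:
            coalition.append(item)
            new_value = value_fn(coalition)
            totals[item] += new_value - prev_value  # marginal contribution of item
            prev_value = new_value
    return {item: total / num_permutations for item, total in totals.items()}


def toy_value(coalition):
    """Hypothetical stand-in for 'model quality when trained on this subset'."""
    gains = {"a": 1.0, "b": 2.0, "c": 0.0}
    return sum(gains[x] for x in coalition)


scores = monte_carlo_shapley(["a", "b", "c"], toy_value)
print(scores)
```

In a real setting `value_fn` would train and evaluate a world model on the chosen subset, so the number of calls to it dominates the cost.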

Abstract:
State‑of‑the‑art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre‑training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference‑time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA‑2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling of test‑time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image‑conditioned, multiframe‑conditioned, and text‑conditioned generation settings, with validation from a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
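
The simplest form of reward-guided search over candidates is best-of-N selection: draw several candidates, score each with a reward, and keep the highest-scoring one. The sketch below shows only that pattern, under stated assumptions. It does not reproduce the trajectory steering described in the abstract, and `toy_sampler` and `toy_reward` are hypothetical stand-ins for a video generator and a world-model reward.

```python
import random


def best_of_n(sample_fn, reward_fn, num_candidates, seed=0):
    """Draw candidates, score each one, and return the highest-reward candidate."""
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(num_candidates)]
    rewards = [reward_fn(c) for c in candidates]
    best_index = max(range(num_candidates), key=rewards.__getitem__)
    return candidates[best_index], rewards[best_index], rewards


def toy_sampler(rng):
    """Hypothetical generator: a scalar stands in for a generated video."""
    return rng.gauss(0.0, 1.0)


def toy_reward(x):
    """Hypothetical plausibility score: higher when x is close to 1.0."""
    return -abs(x - 1.0)


best, best_reward, all_rewards = best_of_n(toy_sampler, toy_reward, num_candidates=8)
print(best, best_reward)
```

With a fixed seed, raising `num_candidates` can only keep or improve the selected reward, which is the sense in which extra test-time compute buys better samples.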

Abstract:
Creativity in artificial intelligence is most often addressed through evaluative frameworks that aim to measure novelty, diversity, or usefulness in generated outputs. While such approaches have provided valuable insights into the behavior of modern generative models, they largely treat creativity as a property to be assessed rather than as a phenomenon to be explicitly modeled. In parallel, recent advances in large‑scale generative systems, particularly multimodal architectures, have demonstrated increasingly sophisticated forms of pattern recombination, raising questions about the nature and limits of machine creativity. This paper proposes a generative perspective on creativity in AI, framing it as an emergent property of domain‑limited generative models embedded within bounded informational environments. Rather than introducing new evaluative criteria, we focus on the structural and contextual conditions under which creative behaviors arise. We introduce a conceptual decomposition of creativity into four interacting components (pattern‑based generation, induced world models, contextual grounding, and arbitrarity) and examine how these components manifest in multimodal generative systems. By grounding creativity in the interaction between generative dynamics and domain‑specific representations, this work aims to provide a technical framework for studying creativity as an emergent phenomenon in AI systems, rather than as a post hoc evaluative label.

Abstract:
Video generation models have emerged as high‑fidelity models of the physical world, capable of synthesizing high‑quality videos capturing fine‑grained interactions between agents and their environments conditioned on multi‑modal user inputs. Their impressive capabilities address many of the long‑standing challenges faced by physics‑based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable‑body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics‑based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine‑grained and expressive way. They thus overcome the limited expressiveness of language‑only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost‑effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety‑critical settings.

Abstract:
Post‑training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention‑requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real‑world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure‑Aware Offline‑to‑Online Reinforcement Learning (FARL), a new paradigm minimizing failures during real‑world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world‑model‑based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real‑world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post‑training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real‑world RL post‑training. Videos and code are available at https://failure‑aware‑rl.github.io.

Abstract:
Offline multi‑agent reinforcement learning (MARL) aims to solve cooperative decision‑making problems in multi‑agent systems using pre‑collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model‑based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non‑stationarity, and complexity of multi‑agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local‑to‑global (LOGO) world model, a novel framework that leverages local predictions, which are easier to estimate, to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent‑wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state‑action space. To ensure reliable policy learning, we further introduce an uncertainty‑aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble‑based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state‑of‑the‑art baselines on standard offline MARL benchmarks, establishing a new model‑based baseline for generalizable offline multi‑agent learning.
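
One way to picture uncertainty-aware sampling is to down-weight synthetic transitions whose predictions are less certain. The sketch below assumes an exponential weighting `exp(-u / temperature)`. That functional form, the `temperature` parameter, and the function names are illustrative assumptions and are not stated in the abstract.

```python
import math
import random


def uncertainty_weights(uncertainties, temperature=1.0):
    """Map per-transition uncertainties to normalized sampling probabilities."""
    scores = [math.exp(-u / temperature) for u in uncertainties]
    total = sum(scores)
    return [s / total for s in scores]


def sample_batch(transitions, uncertainties, batch_size, temperature=1.0, seed=0):
    """Sample synthetic transitions with replacement, favoring low uncertainty."""
    rng = random.Random(seed)
    weights = uncertainty_weights(uncertainties, temperature)
    return rng.choices(transitions, weights=weights, k=batch_size)


# Hypothetical synthetic transitions with increasing prediction uncertainty.
transitions = ["t0", "t1", "t2"]
uncertainties = [0.1, 1.0, 3.0]
print(uncertainty_weights(uncertainties))
print(sample_batch(transitions, uncertainties, batch_size=5))
```

A larger `temperature` flattens the weights toward uniform sampling, while a smaller one concentrates training on the most reliable synthetic data.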

Abstract:
In this paper, we introduce ObjectZero, a novel reinforcement learning (RL) algorithm that leverages the power of object‑level representations to model dynamic environments more effectively. Unlike traditional approaches that process the world as a single undifferentiated input, our method employs Graph Neural Networks (GNNs) to capture intricate interactions among multiple objects. These objects, which can be manipulated and interact with each other, serve as the foundation for our model's understanding of the environment. We trained the algorithm in a complex setting teeming with diverse, interactive objects, demonstrating its ability to effectively learn and predict object dynamics. Our results highlight that a structured world model operating on object‑centric representations can be successfully integrated into a model‑based RL algorithm utilizing Monte Carlo Tree Search as a planning module.

Abstract:
We present Akasha 2, a state‑of‑the‑art multimodal architecture that integrates Hamiltonian State Space Duality (H‑SSD) with Visual‑Language Joint Embedding Predictive Architecture (VL‑JEPA). The system leverages the Mamba‑3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE‑HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra‑low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics‑inspired inductive biases into neural architectures yields significant improvements: state‑of‑the‑art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3‑18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.

Abstract:
Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero‑shot generalization to complex, real‑world scenarios, including tool manipulation and multi‑object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics‑aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.

Abstract:
Existing traffic simulation frameworks for autonomous vehicles typically rely on imitation learning or game‑theoretic approaches that solve for Nash or coarse correlated equilibria, implicitly assuming perfectly rational agents. However, human drivers exhibit bounded rationality, making approximately optimal decisions under cognitive and perceptual constraints. We propose EvoQRE, a principled framework for modeling safety‑critical traffic interactions as general‑sum Markov games solved via Quantal Response Equilibrium (QRE) and evolutionary game dynamics. EvoQRE integrates a pre‑trained generative world model with entropy‑regularized replicator dynamics, capturing stochastic human behavior while maintaining equilibrium structure. We provide rigorous theoretical results, proving that the proposed dynamics converge to Logit‑QRE under a two‑timescale stochastic approximation with an explicit convergence rate of O(log k / k^{1/3}) under weak monotonicity assumptions. We further extend QRE to continuous action spaces using mixture‑based and energy‑based policy representations. Experiments on the Waymo Open Motion Dataset and nuPlan benchmark demonstrate that EvoQRE achieves state‑of‑the‑art realism, improved safety metrics, and controllable generation of diverse safety‑critical scenarios through interpretable rationality parameters.
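
For intuition about the Logit‑QRE named above: each player chooses actions with probability proportional to exp(rationality × expected payoff), and the equilibrium is a fixed point of these mutual soft responses. The sketch below finds such a fixed point for a small two-player matrix game by damped iteration. This is a plain fixed-point iteration chosen for illustration, not the paper's entropy-regularized replicator dynamics, and the payoff matrices and function names are hypothetical.

```python
import math


def softmax(values, rationality):
    """Logit response: probabilities proportional to exp(rationality * value)."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(rationality * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_payoffs(payoff, opponent_strategy):
    """payoff[a][b] is the player's payoff for own action a vs opponent action b."""
    return [sum(p * q for p, q in zip(row, opponent_strategy)) for row in payoff]


def logit_qre(payoff_row, payoff_col, rationality, steps=2000, step_size=0.05):
    """Damped fixed-point iteration toward a logit quantal response equilibrium."""
    x = [1.0 / len(payoff_row)] * len(payoff_row)  # row player's mixed strategy
    y = [1.0 / len(payoff_col)] * len(payoff_col)  # column player's mixed strategy
    for _ in range(steps):
        bx = softmax(expected_payoffs(payoff_row, y), rationality)
        by = softmax(expected_payoffs(payoff_col, x), rationality)
        x = [(1 - step_size) * xi + step_size * bi for xi, bi in zip(x, bx)]
        y = [(1 - step_size) * yi + step_size * bi for yi, bi in zip(y, by)]
    return x, y


# Hypothetical game: the row player's action 0 always pays 2, action 1 pays 0;
# the column player is indifferent between its two actions.
row_payoff = [[2.0, 2.0], [0.0, 0.0]]
col_payoff = [[0.0, 0.0], [0.0, 0.0]]
x, y = logit_qre(row_payoff, col_payoff, rationality=1.0)
print(x, y)
```

The rationality parameter interpolates between uniformly random play (0) and best response (large values), so even a strictly better action is not chosen with certainty at finite rationality.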

Abstract:
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in‑the‑wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in‑the‑wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in‑the‑wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to that of action‑conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

Abstract:
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real‑world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW‑World‑Eval (Wow,wo,val). Building upon 609 robot manipulation samples, Wow‑wo‑val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow‑wo‑val, models achieve only 17.27 on long‑horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to approximately 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.

Abstract:
Mobile GUI agents have shown strong potential in real‑world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long‑horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post‑action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world‑model‑based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post‑action states through a learning process that transforms digital images into key task‑related sketches, and uses a novel order‑invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes the action‑selection process by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state‑of‑the‑art performance and improves task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.

Abstract:
Agents built on vision‑language models increasingly face tasks that demand anticipating future states rather than relying on short‑horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.

Abstract:
Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean‑pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief‑State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL‑X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control‑relevant structure. Inspired by belief‑state world models developed for model‑based reinforcement learning, SBWM adapts stochastic latent transitions and rollout‑centric training to the domain of human motion. In contrast to RSSM‑based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long‑horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.

Abstract:
Early artificial intelligence paradigms exhibited separated cognitive functions: Neural Networks focused on "perception‑representation," Reinforcement Learning on "decision‑making‑behavior," and Symbolic AI on "knowledge‑reasoning." With Transformer‑based large models and world models, these paradigms are converging into cognitive agents with closed‑loop "perception‑decision‑action" capabilities. Humans solve complex problems under limited cognitive resources through temporalized sequential reasoning. Language relies on problem space search for deep semantic reasoning. While early large language models (LLMs) could generate fluent text, they lacked robust semantic reasoning capabilities. Prompting techniques like Chain‑of‑Thought (CoT) and Tree‑of‑Thought (ToT) extended reasoning paths by making intermediate steps explicit. Recent models like DeepSeek‑R1 enhanced performance through explicit reasoning trajectories. However, these methods have limitations in search completeness and efficiency. This highlights the need for "Time‑Scaling": the systematic extension and optimization of an agent's ability to unfold reasoning over time. Time‑Scaling refers to architectural design utilizing extended temporal pathways, enabling deeper problem space exploration, dynamic strategy adjustment, and enhanced metacognitive control, paralleling human sequential reasoning under cognitive constraints. It represents a critical frontier for enhancing deep reasoning and problem‑solving without proportional increases in static model parameters. Advancing intelligent agent capabilities requires placing Time‑Scaling principles at the forefront, positioning explicit temporal reasoning management as foundational.

Abstract:
As the application of Embodied AI Agents in avatars, wearable devices, and robotic systems continues to deepen, their core research challenges have gradually shifted from physical environment interaction to the accurate understanding of social interactions. Traditional physical world models (PWMs) focus on quantifiable physical attributes such as space and motion, failing to meet the needs of social intelligence modeling. In contrast, the Mental World Model (MWM), as a structured representation of humans' internal mental states, has become the critical cognitive foundation for embodied agents to achieve natural human‑machine collaboration and dynamic social adaptation. However, current MWM research faces significant bottlenecks, such as a fragmented conceptual framework with vague boundaries between MWM and PWM, disjointed reasoning mechanisms for the technical pathways and applicable scenarios of different Theory of Mind (ToM) reasoning paradigms, and a detachment between evaluation and practice. To address these issues, this review systematically synthesizes over 100 authoritative studies to provide a comprehensive overview of MWM research for embodied AI. Its core contributions are threefold. First, it constructs a complete theoretical framework for MWM for the first time; specifically, it distinguishes the essential differences between MWMs and PWMs. Second, it systematically defines the key components of MWM through two paradigms for mental element representation. Third, it comprehensively analyzes two core ToM reasoning paradigms with 19 ToM methods. Finally, it also clarifies the integration trend of neuro‑symbolic hybrid architectures and synthesizes 26 ToM evaluation benchmarks. This work aims to promote the integration of embodied agents into human society and advance the in‑depth development of human‑machine collaborative interaction.

Abstract:
AI agents ‑‑ systems that combine foundation models with reasoning, planning, memory, and tool use ‑‑ are rapidly becoming a practical interface between natural‑language intent and real‑world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain‑of‑thought‑style decomposition, self‑reflection and verification, and constraint‑aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi‑step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single‑agent vs. multi‑agent; centralized vs. decentralized coordination), and deployment settings (offline analysis vs. online interactive assistance; safety‑critical vs. open‑ended tasks). We discuss key design trade‑offs ‑‑ latency vs. accuracy, autonomy vs. controllability, and capability vs. reliability ‑‑ and highlight how evaluation is complicated by non‑determinism, long‑horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.

Abstract:
This paper addresses the topic of robustness under sensing noise, ambiguous instructions, and human‑robot interaction. We take a radically different tack to the issue of reliable embodied AI: instead of focusing on formal verification methods aimed at achieving model predictability and robustness, we emphasise the dynamic, ambiguous and subjective nature of human‑robot interactions that requires embodied AI systems to perceive, interpret, and respond to human intentions in a manner that is consistent, comprehensible and aligned with human expectations. We argue that when embodied agents operate in human environments that are inherently social, multimodal, and fluid, reliability is contextually determined and only has meaning in relation to the goals and expectations of humans involved in the interaction. This calls for a fundamentally different approach to achieving reliable embodied AI that is centred on building and updating an accessible "explicit world model" representing the common ground between human and AI, that is used to align robot behaviours with human expectations.

Abstract:
Current attempts at Reinforcement Learning for autonomous controllers are data‑demanding, while the results underperform, are unstable, fail to grasp and anchor on the concept of safety, and over‑concentrate on noise features due to the nature of pixel reconstruction. Current Self‑Supervised Learning approaches that learn on high‑dimensional representations by leveraging the Joint Embedding Predictive Architecture (JEPA) are an interesting and effective alternative, as the idea mimics the natural ability of the human brain to acquire new skills using imagination and minimal samples of observations. This study introduces Hanoi‑World, a JEPA‑based world model that uses a recurrent neural network (RNN) for long‑horizon planning with effective inference time. Experiments conducted on the Highway‑Env package across different environments showcase an effective capability for safety‑aware driving planning, with a considerable collision rate in comparison with SOTA baselines.

Abstract:
Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast‑growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety‑critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent‑level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real‑world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet‑scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state‑of‑the‑art models reveals clear trade‑offs: general models look better but break physics, while driving‑specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data‑driven decision‑making.

Abstract:
Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. This paper presents a unified four‑stage framework that systematically characterizes AI integration across the digital twin lifecycle, spanning modeling, mirroring, intervention, and autonomous management. Synthesizing existing technologies and practices, the framework describes how AI methodologies are embedded at each stage: (1) modeling the physical twin through physics‑based and physics‑informed AI approaches, (2) mirroring the physical system into a digital twin with real‑time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics‑based modeling and data‑driven learning, highlighting the shift from traditional numerical solvers to physics‑informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self‑improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross‑domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI‑driven digital twin systems.

Abstract:
Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self‑motion, interwoven with the dynamics of external objects. These streams obey smooth, time‑parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re‑learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self‑motion and external object motion are unified as one‑parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state‑of‑the‑art diffusion‑based and memory‑augmented world modeling architectures ‑‑ particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry‑guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.

Abstract:
Evaluating recommender systems remains challenging due to the gap between offline metrics and real user behavior, as well as the scarcity of interaction data. Recent work explores large language model (LLM) agents as synthetic users, yet they typically rely on few‑shot prompting, which yields a shallow understanding of the environment and limits their ability to faithfully reproduce user actions. We introduce AlignUSER, a framework that learns world‑model‑driven agents from human interactions. Given rollout sequences of actions and states, we formalize world modeling as a next state prediction task that helps the agent internalize the environment. To align actions with human personas, we generate counterfactual trajectories around demonstrations and prompt the LLM to compare its decisions with human choices, identify suboptimal actions, and extract lessons. The learned policy is then used to drive agent interactions with the recommender system. We evaluate AlignUSER across multiple datasets and demonstrate closer alignment with genuine humans than prior work, both at the micro and macro levels.

Abstract:
Automated negotiations in insurance and business‑to‑business (B2B) commerce encounter substantial challenges. Current systems force a trade‑off between convenience and privacy by routing sensitive financial data through centralized servers, increasing security risks, and diminishing user trust. This study introduces a device‑native autonomous Agentic AI system for privacy‑preserving negotiations. The proposed system operates exclusively on user hardware, enabling real‑time bargaining while maintaining sensitive constraints locally. It integrates zero‑knowledge proofs to ensure privacy and employs distilled world models to support advanced on‑device reasoning. The architecture incorporates six technical components within an Agentic AI workflow. Agents autonomously plan negotiation strategies, conduct secure multi‑party bargaining, and generate cryptographic audit trails without exposing user data to external servers. The system is evaluated in insurance and B2B procurement scenarios across diverse device configurations. Results show an average success rate of 87 %, a 2.4x reduction in latency relative to cloud baselines, and strong privacy preservation through zero‑knowledge proofs. User studies show 27 % higher trust scores when decision trails are available. These findings establish a foundation for trustworthy autonomous agents in privacy‑sensitive financial domains.

Abstract:
Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint‑Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self‑supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal‑conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi‑distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.
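
As a hedged illustration of why such a shaped representation space can help planning, the sketch below runs greedy planning on a toy 2D grid: at each step it picks the action whose predicted next embedding is closest to the goal embedding. The `encode` and `predict` functions, the action set, and the Euclidean distance are hypothetical stand‑ins chosen for illustration; they are not the models or the training method described in the abstract.

```python
from math import dist

# Hypothetical action set: unit moves on a 2D grid.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def encode(state):
    # Stand-in encoder (assumption): identity embedding of a 2D position.
    return (float(state[0]), float(state[1]))


def predict(embedding, action):
    # Stand-in latent predictor (assumption): shift by the action's displacement.
    dx, dy = ACTIONS[action]
    return (embedding[0] + dx, embedding[1] + dy)


def plan_greedy(start, goal, max_steps=20):
    """Greedily descend the embedding distance to the goal."""
    z, z_goal = encode(start), encode(goal)
    plan = []
    for _ in range(max_steps):
        if dist(z, z_goal) == 0.0:
            break
        # Choose the action whose predicted embedding is nearest the goal embedding.
        action = min(ACTIONS, key=lambda a: dist(predict(z, a), z_goal))
        z = predict(z, action)
        plan.append(action)
    return plan


plan = plan_greedy((0, 0), (2, 1))
```

In this toy the distance is a faithful cost‑to‑go by construction, so the greedy rule reaches the goal; when the embedding distance does not track the cost‑to‑go, the same rule could stall, which is the situation the shaping constraint is meant to reduce.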

Abstract:
Classic problem‑space theory models problem solving as a navigation through a structured space of states, operators, goals, and constraints. Systems Engineering (SE) employs analogous constructs (functional analysis, operational analysis, scenarios, trade studies), yet still lacks a rigorous systems‑theoretic representation of the problem space itself. In current practice, reasoning often proceeds directly from stakeholder goals to prescriptive artifacts. This makes foundational assumptions about the operational environment, admissible interactions, and contextual conditions implicit or prematurely embedded in architectures or requirements. This paper addresses that gap by formalizing the problem space as an explicit semantic world model containing theoretical constructs that are defined prior to requirements and solution commitments. These constructs along with the developed axioms, theorems and corollary establish a rigorous criterion for unambiguous boundary semantics, context‑dependent interaction traceability to successful stakeholder goal satisfaction, and sufficiency of problem‑space specification over which disciplined reasoning can occur independent of solution design. It offers a clear distinction between what is true of the problem domain and what is chosen as a solution. The paper concludes by discussing the significance of the theory on practitioners and provides a dialogue‑based hypothetical case study between a stakeholder and an engineer, demonstrating how the theory guides problem framing before designing any prescriptive artifacts.

Abstract:
In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel‑trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi‑view 4D data or by cumbersome training pre‑processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in‑the‑wild monocular videos. Specifically, NeoVerse features pose‑free feed‑forward 4D reconstruction, online monocular degradation pattern simulation, and other well‑aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state‑of‑the‑art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse‑4d.github.io.

Abstract:
World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real‑time interaction, long‑horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real‑time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long‑term world memory within a closed‑loop system. TeleWorld introduces a novel generation‑reconstruction‑guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio‑temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long‑horizon generation with low latency, we employ an autoregressive diffusion‑based video model enhanced with Macro‑from‑Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame‑level to segment‑level, alongside efficient Distribution Matching Distillation (DMD), enabling real‑time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long‑term consistency, and real‑time generation efficiency, positioning it as a practical step toward interactive, memory‑enabled world models for multimodal generation and embodied intelligence.

Abstract:
Real‑world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision‑language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real‑time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language‑defined safety semantics into a lightweight latent classifier, LSRE enables real‑time semantic risk assessment at 10 Hz without per‑frame VLM queries. Experiments on six semantic‑failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen semantic‑similar test cases, indicating that language‑guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.

Abstract:
Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black‑box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM‑SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM‑based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM‑SAR consistently outperforms existing deep learning and LLM‑based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
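
The numerical decision structure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the agent outputs are taken to be scalar scores, the inconsistency score is assumed to be a rescaled absolute gap, and the logistic‑regression weights are hand‑set hypothetical values rather than fitted ones. None of these specifics come from the abstract.

```python
import math


def inconsistency_score(literal_eval, normative_expectation):
    # Assumed form: both inputs lie in [-1, 1]; the gap is rescaled to [0, 1].
    return abs(literal_eval - normative_expectation) / 2.0


def sarcasm_probability(literal_eval, normative_expectation, intention_score,
                        w_inconsistency=3.0, w_intention=2.0, bias=-2.5):
    # Hypothetical hand-set weights; in practice they would be fit by
    # logistic regression on labelled examples.
    d = inconsistency_score(literal_eval, normative_expectation)
    logit = w_inconsistency * d + w_intention * intention_score + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Praise for a situation that normally warrants complaint, with a strong
# inferred sarcastic intention (all scores are made-up inputs).
p_sarcastic = sarcasm_probability(0.9, -0.8, 0.9)
# Praise that matches the normative expectation, with weak sarcastic intention.
p_sincere = sarcasm_probability(0.9, 0.8, 0.1)
```

Because the final step is a two‑feature logistic model, each signal's contribution to the prediction can be read directly from its weight, which is one way to interpret the "interpretable numerical decision structure" the abstract refers to.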

Abstract:
World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision‑making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical‑world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion‑Why‑How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion‑driven social behaviors while maintaining comparable performance to general world models on basic tasks.

Abstract:
Vision‑Language‑Action (VLA) models have shown remarkable generalization by mapping web‑scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact‑rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low‑dimensional tactile signals, they fail to capture the high‑resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high‑resolution tactile images serve as micro‑vision inputs coupled with wrist‑camera local vision and third‑person macro vision. To reconcile these multi‑scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third‑person views. To further deepen the model's understanding of fine‑grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear‑prone nature of tactile sensors, we construct a hybrid large‑scale dataset sourced from both high‑fidelity digital twin and real‑world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact‑rich manipulation tasks, it outperforms state‑of‑the‑art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch‑aware robotic agents.

Abstract:
Transformer‑based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy‑intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT‑style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs learn their own representation of the stochastic environment of Texas Hold'em Poker.

Abstract:
Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal‑conditioned policies often struggle with long‑horizon manipulation due to their reliance on single‑step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal‑conditioned manipulation policy that integrates a goal‑conditioned visual world model with multi‑scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long‑horizon structure. To translate this visual plan into robust execution, we introduce Multi‑Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine‑grained closed‑loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end‑to‑end cross‑attention, enabling coherent long‑horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero‑shot generalization to novel objects, spatial layouts, and environments. We further enable reward‑free online adaptation through hindsight goal relabeling with LoRA‑based finetuning, allowing rapid autonomous improvement without external supervision. Real‑robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out‑of‑distribution tasks within minutes of autonomous interaction, validating that goal‑conditioned world models with multi‑scale temporal control provide structured guidance necessary for robust long‑horizon manipulation. Project page: https://act2goal.github.io/
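
Hindsight goal relabeling, in its generic form, can be sketched as follows: a rollout that did not reach its commanded goal is reused as a successful demonstration for the state it actually reached, so no reward signal is needed. The function name and data layout are assumptions for illustration, and the LoRA‑based finetuning step mentioned above is not shown.

```python
def hindsight_relabel(observations, actions):
    """Turn one rollout into goal-conditioned training tuples.

    observations: list of T+1 observations (o_0 ... o_T)
    actions:      list of T actions (a_0 ... a_{T-1})
    Returns (observation, goal, action) tuples whose goal is the final
    observation actually reached, regardless of the goal originally commanded.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    achieved_goal = observations[-1]
    # zip stops at the last action, pairing each o_t with the a_t taken from it.
    return [(obs, achieved_goal, act) for obs, act in zip(observations, actions)]


# Placeholder strings stand in for image frames and motor commands.
rollout_obs = ["frame0", "frame1", "frame2"]
rollout_act = ["reach", "grasp"]
samples = hindsight_relabel(rollout_obs, rollout_act)
```

Each relabeled tuple is a valid supervised example for a goal‑conditioned policy, since the recorded action did lead toward the relabeled goal.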

Abstract:
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long‑tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high‑fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW‑Video, our powerful world model that generates high‑fidelity forecasting with expressive latent representations, and DriveLaW‑Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW‑Video, with both components optimized by a three‑stage progressive training strategy. The power of our unified paradigm is demonstrated by new state‑of‑the‑art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best‑performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

Abstract:
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large scale vision language action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from Cosmos‑H‑Surgical, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action description specifically for surgical robots. Then we built Cosmos‑H‑Surgical based on the most advanced physical AI world model and SATA. It's able to generate diverse, generalizable and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudokinematics from synthetic surgical videos, producing synthetic paired video action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data efficient surgical robot policies.

Abstract:
Symbolic world models (e.g., PDDL domains or executable simulators) are central to model‑based planning, but training LLMs to generate such world models is limited by the lack of large‑scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior‑level errors arising from interactive execution. In this paper, we propose Agent2World, a tool‑augmented multi‑agent framework that achieves strong inference‑time world‑model generation and also serves as a data engine for supervised fine‑tuning, by grounding generation in multi‑agent feedback. Agent2World follows a three‑stage pipeline: (i) a Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) a Model Developer agent implements executable world models; and (iii) a specialized Testing Team conducts adaptive unit testing and simulation‑based validation. Agent2World demonstrates superior inference‑time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state‑of‑the‑art results. Beyond inference, the Testing Team serves as an interactive environment for the Model Developer, providing behavior‑aware adaptive feedback that yields multi‑turn training trajectories. The model fine‑tuned on these trajectories substantially improves world‑model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.

Abstract:
Ad‑hoc teamwork (AHT) requires agents to infer the behavior of previously unseen teammates and adapt their policy accordingly. Conventional approaches often rely on fixed probabilistic models or classifiers, which can be brittle under partial observability and limited interaction. Large language models (LLMs) offer a flexible alternative: by mapping short behavioral traces into high‑level hypotheses, they can serve as world models over teammate behavior. We introduce Collab, a language‑based framework that classifies partner types using a behavior rubric derived from trajectory features, and extend it to ReCollab, which incorporates retrieval‑augmented generation (RAG) to stabilize inference with exemplar trajectories. In the cooperative Overcooked environment, Collab effectively distinguishes teammate types, while ReCollab consistently improves adaptation across layouts, achieving Pareto‑optimal trade‑offs between classification accuracy and episodic return. These findings demonstrate the potential of LLMs as behavioral world models for AHT and highlight the importance of retrieval grounding in challenging coordination settings.

Abstract:
Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents. One of the core abilities is autonomous navigation in large‑scale three‑dimensional environments. Existing navigation policies, however, are typically optimized for low‑level objectives such as obstacle avoidance and trajectory smoothness, lacking the ability to incorporate high‑level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, thereby enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4‑DoF UAV trajectories and introduces a physics‑inspired module: Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long‑distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long‑distance visual forecasting and improves UAV navigation success rates in large‑scale environments.

Abstract:
Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav‑World, an end‑to‑end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion‑based video generator with a vision‑language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action‑conditioned multi‑step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task‑relevant futures, mitigating cumulative errors common in decoupled "envision‑then‑plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision‑action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real‑world testing, AstraNav‑World demonstrated exceptional zero‑shot capabilities, adapting to previously unseen scenarios without any real‑world fine‑tuning. These results suggest that AstraNav‑World captures transferable spatial understanding and planning‑relevant navigation dynamics, rather than merely overfitting to simulation‑specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general‑purpose embodied agents that operate robustly in open‑ended real‑world settings.

Abstract:
Zero‑shot object navigation (ZSON) requires robots to locate target objects in unseen environments without task‑specific fine‑tuning or pre‑built maps, a capability crucial for service and household robotics. Existing methods perform well in simulation but struggle in realistic, cluttered environments where heavy occlusions and latent hazards make large portions of the scene unobserved. These approaches typically act on a single inferred scene, making them prone to overcommitment and unsafe behavior under uncertainty. To address these challenges, we propose Schrödinger's Navigator, a belief‑aware framework that explicitly reasons over multiple trajectory‑conditioned imagined 3D futures at inference time. A trajectory‑conditioned 3D world model generates hypothetical observations along candidate paths, maintaining a superposition of plausible scene realizations. An adaptive, occluder‑aware trajectory sampling strategy focuses imagination on uncertain regions, while a Future‑Aware Value Map (FAVM) aggregates imagined futures to guide robust, proactive action selection. Evaluations in simulation and on a physical Go2 quadruped robot demonstrate that Schrödinger's Navigator outperforms strong ZSON baselines, achieving more robust self‑localization, object localization, and safe navigation under severe occlusions and latent hazards. These results highlight the effectiveness of reasoning over imagined 3D futures as a scalable and generalizable strategy for zero‑shot navigation in uncertain real‑world environments.

Abstract:
This technical note considers the sampling of outcomes that provide the greatest amount of information about the structure of underlying world models. This generalisation furnishes a principled approach to structure learning under a plausible set of generative models or hypotheses. In active inference, policies ‑ i.e., combinations of actions ‑ are selected based on their expected free energy, which comprises expected information gain and value. Information gain corresponds to the KL divergence between predictive posteriors with, and without, the consequences of action. Posteriors over models can be evaluated quickly and efficiently using Bayesian Model Reduction, based upon accumulated posterior beliefs about model parameters. The ensuing information gain can then be used to select actions that disambiguate among alternative models, in the spirit of optimal experimental design. We illustrate this kind of active selection or reasoning using partially observed discrete models; namely, a 'three‑ball' paradigm used previously to describe artificial insight and 'aha moments' via (synthetic) introspection or sleep. We focus on the sample efficiency afforded by seeking outcomes that resolve the greatest uncertainty about the world model, under which outcomes are generated.
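As a rough illustration of the information-gain term described above (a minimal sketch, not the note's implementation), the expected information gain about which candidate model generated the data can be computed as the expected KL divergence between the posterior and prior over models. The function name, the likelihood tables, and the two actions "A" and "B" are hypothetical.

```python
import numpy as np

def expected_information_gain(prior, likelihoods):
    """Expected KL[posterior over models || prior over models], in nats.

    prior:       shape (M,), prior probability of each candidate model.
    likelihoods: shape (M, O), p(outcome | model) under one fixed action.
    This equals the mutual information between model and outcome.
    """
    prior = np.asarray(prior, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    joint = prior[:, None] * likelihoods            # p(model, outcome)
    marginal = joint.sum(axis=0)                    # p(outcome)
    safe_marginal = np.where(marginal > 0, marginal, 1.0)
    # Terms with zero joint probability contribute nothing (0 log 0 = 0).
    ratio = np.where(joint > 0, likelihoods / safe_marginal, 1.0)
    return float(np.sum(joint * np.log(ratio)))

# Hypothetical example: two candidate models, two available actions.
prior = [0.5, 0.5]
actions = {
    "A": [[0.5, 0.5], [0.5, 0.5]],  # both models predict the same outcomes
    "B": [[1.0, 0.0], [0.0, 1.0]],  # the models predict different outcomes
}
gains = {name: expected_information_gain(prior, lik) for name, lik in actions.items()}
best_action = max(gains, key=gains.get)
print(gains, best_action)
```

Under these assumed tables, the action whose outcomes differ between models is the one that disambiguates them, which is the selection principle the note describes.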

Abstract:
We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine "understanding" in a simple yet strategic environment. As a running example, we focus on Rock‑Paper‑Scissors (RPS), which, despite its apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition. Our system positions the LLM as an Observer whose task is to identify which strategies are being played and to articulate the reasoning behind this judgment. The purpose is not to test knowledge of Rock‑Paper‑Scissors itself, but to probe whether the model can exhibit mind‑like reasoning about sequential behavior. To support systematic evaluation, we provide a benchmark consisting of both static strategies and lightweight dynamic strategies specified by well‑prompted rules. We quantify alignment between the Observer's predictions and the ground‑truth distributions induced by actual strategy pairs using three complementary signals: Cross‑Entropy, Brier score, and Expected Value (EV) discrepancy. These metrics are further integrated into a unified score, the Union Loss, which balances calibration, sensitivity, and payoff alignment. Together with a Strategy Identification Rate (SIR) metric, our framework captures not only predictive accuracy but also whether the model can stably identify the latent strategies in play. The demo emphasizes interactivity, transparency, and reproducibility. Users can adjust LLM distributions in real time, visualize losses as they evolve, and directly inspect reasoning snippets to identify where and why failures occur. In doing so, our system provides a practical and interpretable proxy for mind‑like inference in sequential games, offering insights into both the strengths and limitations of current LLM reasoning.
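The three alignment signals named above can be sketched as follows. This is an illustrative sketch only: it assumes distributions over round outcomes (win, draw, loss) with payoffs +1, 0, and -1, and the equal-weight combination is a hypothetical stand-in, since the weighting used by the Union Loss is not stated here.

```python
import math

PAYOFFS = (1.0, 0.0, -1.0)  # assumed payoffs for (win, draw, loss)

def cross_entropy(truth, pred, eps=1e-12):
    """Cross-entropy H(truth, pred) in nats; eps guards against log(0)."""
    return -sum(q * math.log(max(p, eps)) for q, p in zip(truth, pred))

def brier_score(truth, pred):
    """Sum of squared differences between predicted and true probabilities."""
    return sum((p - q) ** 2 for q, p in zip(truth, pred))

def ev_discrepancy(truth, pred, payoffs=PAYOFFS):
    """Absolute gap between the expected payoffs of the two distributions."""
    expected = lambda dist: sum(d * v for d, v in zip(dist, payoffs))
    return abs(expected(pred) - expected(truth))

def combined_loss(truth, pred, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three signals (weights are assumptions)."""
    w_ce, w_brier, w_ev = weights
    return (w_ce * cross_entropy(truth, pred)
            + w_brier * brier_score(truth, pred)
            + w_ev * ev_discrepancy(truth, pred))

truth = (0.5, 0.25, 0.25)        # ground-truth outcome distribution (made up)
observer = (1 / 3, 1 / 3, 1 / 3)  # an Observer that predicts uniform outcomes
print(cross_entropy(truth, observer), brier_score(truth, observer),
      ev_discrepancy(truth, observer), combined_loss(truth, observer))
```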

Abstract:
Latent World Models enhance scene representation through temporal self‑supervised learning, presenting a perception annotation‑free paradigm for end‑to‑end autonomous driving. However, the reconstruction‑oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning‑oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local‑aware interactive refinement mechanism, augmented by reinforcement learning fine‑tuning (RFT) to enhance safety‑critical policy performance. Specifically, WorldRFT integrates a vision‑geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local‑aware iterative refinement to derive a planning‑oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision‑aware rewards to fine‑tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state‑of‑the‑art (SOTA) performance on both open‑loop nuScenes and closed‑loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (from 0.30% to 0.05%). On NavSim, using camera‑only sensor input, it attains competitive performance with the LiDAR‑based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
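For readers unfamiliar with GRPO, the group-relative normalization it is commonly built on can be sketched as below. This is a minimal sketch under stated assumptions: it shows only the normalization of rewards within a group of sampled trajectories, and does not reproduce the trajectory Gaussianization or collision-aware reward design of WorldRFT. The reward values are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories.

    Each trajectory's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for three trajectories sampled from the same scene.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Trajectories that score above the group average receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.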

Abstract:
Model‑based reinforcement learning (MBRL) can reduce interaction cost for autonomous driving by learning a predictive world model, but it typically still depends on task‑specific rewards that are difficult to design and often brittle under distribution shift. This paper presents InDRiVE, a DreamerV3‑style MBRL agent that performs reward‑free pretraining in CARLA using only intrinsic motivation derived from latent ensemble disagreement. Disagreement acts as a proxy for epistemic uncertainty and drives the agent toward under‑explored driving situations, while an imagination‑based actor‑critic learns a planner‑free exploration policy directly from the learned world model. After intrinsic pretraining, we evaluate zero‑shot transfer by freezing all parameters and deploying the pretrained exploration policy in unseen towns and routes. We then study few‑shot adaptation by training a task policy with limited extrinsic feedback for downstream objectives (lane following and collision avoidance). Experiments in CARLA across towns, routes, and traffic densities show that disagreement‑based pretraining yields stronger zero‑shot robustness and robust few‑shot collision avoidance under town shift and matched interaction budgets, supporting the use of intrinsic disagreement as a practical reward‑free pretraining signal for reusable driving world models.

Abstract:
Agentic reinforcement learning increasingly relies on experience‑driven scaling, yet real‑world environments remain non‑adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text‑based environments, which provide a controlled setting to reinterpret language modeling as next‑state prediction under interaction. We introduce a three‑level framework for evaluating LLM‑based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm‑starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.

Abstract:
Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie‑out, remain out of reach even for strong agentic systems. The task requires multi‑document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie‑out as an instance of a real‑world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie‑out automation and, more broadly, as a foundation for applied legal intelligence.

Abstract:
We present ChronoDreamer, an action‑conditioned world model for contact‑rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial‑temporal transformer trained with MaskGIT‑style masked prediction. Contact is encoded as depth‑weighted Gaussian splat images that render 3D forces into a camera‑aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision‑language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non‑contact motion and generates plausible contact predictions, while the LLM‑based judge distinguishes collision from non‑collision trajectories.

Abstract:
Large Language Models (LLMs) demonstrate strong few‑shot generalization through in‑context learning, yet their reasoning in dynamic and stochastic environments remains opaque. Prior studies mainly focus on static tasks and overlook the online adaptation required when beliefs must be continuously updated, which is a key capability for LLMs acting as world models or agents. We introduce a Bayesian filtering framework to evaluate online inference in LLMs. Our probabilistic probe suite spans both multivariate discrete distributions, such as dice rolls, and continuous distributions, such as Gaussian processes, where ground‑truth parameters shift over time. We find that while LLM belief updates resemble Bayesian posteriors, they are more accurately characterized by an exponential forgetting filter with a model‑specific discount factor smaller than one. This reveals systematic discounting of older evidence that varies significantly across model architectures. Although inherent priors are often miscalibrated, the updating mechanism itself remains structured and principled. We further validate these findings in a simulated agent task and propose prompting strategies that effectively recalibrate priors with minimal computational cost.
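An exponential forgetting filter of the kind referred to above can be sketched for a categorical variable such as dice rolls. This is an illustrative sketch, not the paper's probe code: it assumes a symmetric Dirichlet prior that is not itself discounted, and the discount value 0.5 is arbitrary. With a discount of 1 it reduces to the ordinary Dirichlet-categorical posterior mean.

```python
import numpy as np

def forgetting_filter(observations, num_outcomes, gamma=0.9, prior_count=1.0):
    """Track a categorical belief while discounting older evidence.

    observations: sequence of integer outcomes in [0, num_outcomes).
    gamma:        discount applied to accumulated counts before each update;
                  gamma = 1 gives the standard Bayesian posterior mean.
    Returns an array of shape (T, num_outcomes) with the belief after each step.
    """
    counts = np.zeros(num_outcomes)
    beliefs = []
    for obs in observations:
        counts = gamma * counts          # forget a fraction of past evidence
        counts[obs] += 1.0               # add the new observation
        total = counts.sum() + prior_count * num_outcomes
        beliefs.append((counts + prior_count) / total)
    return np.array(beliefs)

observations = [0, 0, 1]
bayes = forgetting_filter(observations, num_outcomes=2, gamma=1.0)
forgetful = forgetting_filter(observations, num_outcomes=2, gamma=0.5)
print(bayes[-1], forgetful[-1])
```

In this toy run the forgetting filter ends up favoring the most recent outcome, whereas the undiscounted posterior still favors the outcome seen more often overall.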

Abstract:
We present STORM (Search‑Guided Generative World Models), a novel framework for spatio‑temporal reasoning in robotic manipulation that unifies diffusion‑based action generation, conditional video prediction, and search‑based planning. Unlike prior Vision‑Language‑Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight‑driven decision‑making. A diffusion‑based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state‑of‑the‑art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward‑augmented video prediction substantially improves spatio‑temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re‑planning and failure recovery behavior, highlighting the advantages of search‑guided generative world models for long‑horizon robotic manipulation.

Abstract:
Recent advances in Vision‑Language‑Action (VLA) and world‑model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine‑grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data generation framework for articulated manipulation which is instantiated from a single real scan, a demonstration, and a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes synchronized multi‑view RGB temporally aligned with action commands and state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results demonstrate that fine‑tuning VLA policies on AOMGen data increases the success rate from 0% to 88.7%, and the policies are tested on unseen objects and layouts.

Abstract:
Long‑horizon robotic tasks are hard due to continuous state‑action spaces and sparse feedback. Symbolic world models help by decomposing tasks into discrete predicates that capture object properties and relations. Existing methods learn predicates either top‑down, by prompting foundation models without data grounding, or bottom‑up, from demonstrations without high‑level priors. We introduce UniPred, a bilevel learning framework that unifies both. UniPred uses large language models (LLMs) to propose predicate effect distributions that supervise neural predicate learning from low‑level data, while learned feedback iteratively refines the LLM hypotheses. Leveraging strong visual foundation model features, UniPred learns robust predicate classifiers in cluttered scenes. We further propose a predicate evaluation method that supports symbolic models beyond STRIPS assumptions. Across five simulated and one real‑robot domains, UniPred achieves 2‑4 times higher success rates than top‑down methods and 3‑4 times faster learning than bottom‑up approaches, advancing scalable and flexible symbolic world modeling for robotics.

Abstract:
Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene‑action‑conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human‑scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action‑conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed‑camera real‑world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion‑based interactive digital twins and enables embodied simulation from egocentric actions.

Abstract:
This study investigates quantum computing approaches for solving the windfarm layout optimization (WFLO) problem formulated as a quadratic unconstrained binary optimization (QUBO) problem. We investigate two encoding methods that require fewer than one qubit per grid point: the previously developed Pauli correlation encoding (PCE) and a novel single‑qubit operator encoding (SQOE). These methods are tested on three windfarm configurations ‑ two from prior WFLO scaling studies and a new real‑world model based on an existing windfarm in Wales. The improved encoding methods allow us to solve WFLO problems on 9×9 grids using up to 20 qubits on a quantum computer simulator. The results show that both encoding methods perform competitively and demonstrate favorable scaling characteristics across the tested systems.
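To make the QUBO formulation concrete, the sketch below minimizes x^T Q x over binary vectors by classical brute force on a toy three-site layout. It is not the PCE or SQOE encoding from the study; the matrix entries are invented, with negative diagonal terms standing in for the power gained by placing a turbine and positive off-diagonal terms standing in for wake interference between neighboring sites.

```python
import itertools
import numpy as np

def solve_qubo_brute_force(Q):
    """Return the binary vector x minimizing x^T Q x, and the minimum value.

    Enumerates all 2^n assignments, so it is only usable for very small n.
    """
    n = Q.shape[0]
    best_x, best_value = None, float("inf")
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        value = float(x @ Q @ x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value

# Toy problem (assumed values): three sites in a row, adjacent sites interfere.
Q = np.array([
    [-1.0,  1.5,  0.0],
    [ 1.5, -1.0,  1.5],
    [ 0.0,  1.5, -1.0],
])
layout, energy = solve_qubo_brute_force(Q)
print(layout, energy)
```

Here x[i] = 1 means a turbine is placed at site i; the exhaustive search is the baseline that qubit-efficient encodings aim to replace for grids too large to enumerate.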

Abstract:
Real‑time sequential control agents are often bottlenecked by inference latency. Even modest per‑step planning delays can destabilize control and degrade overall performance. We propose a speculation‑and‑correction framework that adapts the predict‑then‑verify philosophy of speculative execution to model‑based control with TD‑MPC2. At each step, a pretrained world model and latent‑space MPC planner generate a short‑horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two‑tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid‑Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end‑to‑end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch‑aware correction for robust latency reduction.
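The control flow of the speculation-and-correction loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the planner and corrector are toy stand-ins for the TD-MPC2 planner and the learned corrector, the mismatch is taken to be a Euclidean distance in latent space, and a single threshold (chosen arbitrarily) separates correction from replanning.

```python
from collections import deque
import numpy as np

def speculative_step(real_latent, queue, plan_fn, correct_fn, replan_threshold=1.0):
    """Choose the next action, replanning only when the queued plan is stale.

    queue holds (action, predicted_latent) pairs, where predicted_latent is the
    latent state the planner expected at the moment the action is executed.
    Returns (action, replanned).
    """
    if queue:
        action, predicted_latent = queue.popleft()
        mismatch = float(np.linalg.norm(real_latent - predicted_latent))
        if mismatch <= replan_threshold:
            # Small or moderate mismatch: apply a cheap residual correction.
            return correct_fn(action, real_latent, predicted_latent), False
        queue.clear()  # large mismatch: discard the stale speculative actions
    queue.extend(plan_fn(real_latent))  # full (expensive) replanning
    action, _ = queue.popleft()
    return action, True

def toy_plan(latent, horizon=3):
    # Hypothetical planner: always proposes action 1.0, predicting +1 per step.
    return [(1.0, latent + k) for k in range(horizon)]

def toy_correct(action, real_latent, predicted_latent):
    # Hypothetical corrector: nudge the action toward closing the latent gap.
    return action + 0.5 * float(np.sum(predicted_latent - real_latent))

queue = deque()
log = [speculative_step(np.array([z]), queue, toy_plan, toy_correct)
       for z in (0.0, 1.1, 5.0)]
print(log, len(queue))
```

In this toy run the first step plans from scratch, the second reuses the queued action with a small correction, and the third detects a large mismatch and replans.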

Abstract:
Fine‑grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire‑WM, a Physics‑informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross‑task Collaborative Training strategy (CC‑Train) that alleviates the issue of limited information in mask‑based modeling. Through parameter sharing and gradient coordination, CC‑Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine‑grained multimodal fire dataset demonstrate the superior accuracy of PhysFire‑WM in fire spread prediction. Validation underscores the importance of physical priors and cross‑task collaboration, providing new insights for applying physics‑informed world models to disaster prediction.

Abstract:
Autonomous robotic systems require spatio‑temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision‑Language Models (VLMs) provide open‑world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open‑World Knowledge), a training‑free and backbone‑agnostic framework for unified 4D scene understanding that integrates VLM‑derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object‑level proposals that guide SAM2‑based segmentation. Each segmented region is encoded through our proposed Spatio‑Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state‑of‑the‑art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

Abstract:
Efficient AI inference on AMD's Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first‑generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for converting AI models automatically into optimized firmware targeting the AIE‑ML generation devices, also with forward compatibility for the newer AIE‑MLv2 architecture. At the single‑kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that can scale across the 2D AIE‑ML fabric and exploit its dedicated memory tiles to stay entirely on‑chip throughout the model execution. As a demonstration, we designed a generalized and highly efficient linear‑layer implementation with intrinsic support for fused bias addition and ReLU activation. Also, as our framework necessitates the generation of multi‑layer implementations, our approach systematically derives deterministic, compact, and topology‑optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework seamlessly accepts quantized models imported from high‑level tools such as hls4ml or PyTorch while preserving bit‑exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single‑kernel baseline, utilizing 296 of 304 AIE tiles (97.4%) of the device with entirely on‑chip data movement. With evaluations across real‑world model topologies, we demonstrate that AIE4ML delivers GPU‑class throughput under microsecond latency constraints, making it a practical companion for ultra‑low‑latency environments such as trigger systems in particle physics experiments.

Abstract:
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context‑dependent reasoning. Inspired by this capability, we introduce R4, a training‑free framework for retrieval‑augmented reasoning in 4D spatio‑temporal space that equips vision‑language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object‑level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval‑augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio‑temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.

Abstract:
Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object‑centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2D rigid body dynamics with static obstacles, where we train for next‑step predictions autoregressively. For long‑horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non‑equivariant baseline models. The approach complements recent soft‑equivariance ideas and aligns with the view that simple, well‑chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand‑crafted physics and unstructured deep nets, delivering sample‑efficient dynamics models for multi‑object scenes.

Abstract:
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi‑Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC‑AGI, Sudoku), Embodied Navigation (real‑world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine‑grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo‑3, Sora‑2, Wan‑2.2) and image models (Nano‑banana, Nano‑banana Pro, GPT‑4o‑image, Qwen‑image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC‑AGI) and struggle with long‑horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning‑aware generative world models.

Abstract:
Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross‑Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code‑generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information‑theoretic bounds showing non‑gamifiability ‑ adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a theoretically grounded approach to AI control for code generation tasks, though practical deployment requires addressing the high false positive rates observed in initial evaluations.

Abstract:
Circuits in the brain commonly exhibit modular architectures that factorise complex tasks, resulting in the ability to compositionally generalise and reduce catastrophic forgetting. In contrast, artificial neural networks (ANNs) appear to mix all processing, because modular solutions are difficult to find as they are vanishing subspaces in the space of possible solutions. Here, we draw inspiration from fault‑tolerant computation and the Poisson‑like firing of real neurons to show that activity‑dependent neural noise, combined with nonlinear neural responses, drives the emergence of solutions that reflect an accurate understanding of modular tasks, corresponding to acquisition of a correct world model. We find that noise‑driven modularisation can be recapitulated by a deterministic regulariser that multiplicatively combines weights and activations, revealing rich phenomenology not captured in linear networks or by standard regularisation methods. Though the emergence of modular structure requires sufficiently many training samples (exponential in the number of modular task dimensions), we show that pre‑modularised ANNs exhibit superior noise‑robustness and the ability to generalise and extrapolate well beyond training data, compared to ANNs without such inductive biases. Together, our work demonstrates a regulariser and architectures that could encourage modularity emergence to yield functional benefits.

Abstract:
Modeling dexterous hand‑object interactions is challenging as it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine‑grained dexterity. We, therefore, introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non‑dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; therefore, we incorporate an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, or full‑body actions in future‑state prediction and demonstrates strong zero‑shot transfer to unseen skills on a Franka Panda arm with an Allegro gripper, surpassing Diffusion Policy by over 50% on average across grasping, placing, and reaching tasks.

Abstract:
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large‑scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture‑of‑Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser‑style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision‑language‑action models, inverse dynamics models, video generation models, and video‑action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three‑phase training pipeline and a six‑layer data pyramid, thereby extracting pixel‑level "delta action" and enabling large‑scale action pretraining. Experiments show that Motus achieves superior performance against state‑of‑the‑art methods in both simulation (a +15% improvement over X‑VLA and a +45% improvement over Pi0.5) and real‑world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

Abstract:
Patch foraging involves the deliberate and planned process of determining the optimal time to depart from a resource‑rich region and investigate potentially more beneficial alternatives. The Marginal Value Theorem (MVT) is frequently used to characterize this process, offering an optimality model for such foraging behaviors. Although this model has been widely used to make predictions in behavioral ecology, discovering the computational mechanisms that facilitate the emergence of optimal patch‑foraging decisions in biological foragers remains under investigation. Here, we show that artificial foragers equipped with learned world models naturally converge to MVT‑aligned strategies. Using a model‑based reinforcement learning agent that acquires a parsimonious predictive representation of its environment, we demonstrate that anticipatory capabilities, rather than reward maximization alone, drive efficient patch‑leaving behavior. Compared with standard model‑free RL agents, these model‑based agents exhibit decision patterns similar to many of their biological counterparts, suggesting that predictive world models can serve as a foundation for more explainable and biologically grounded decision‑making in AI systems. Overall, our findings highlight the value of ecological optimality principles for advancing interpretable and adaptive AI.
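
The optimality condition behind the MVT can be made concrete with a small numerical sketch. The saturating gain curve, its parameters, and the function names below are illustrative assumptions, not taken from the paper; the code only shows the classical rule that a forager should leave a patch when the instantaneous intake rate drops to the long-run average rate, travel time included.

```python
import math

def gain(t, a=10.0, k=0.5):
    """Cumulative energy gained after t time units in a depleting patch (assumed form)."""
    return a * (1.0 - math.exp(-k * t))

def gain_rate(t, a=10.0, k=0.5):
    """Instantaneous intake rate g'(t), which falls as the patch depletes."""
    return a * k * math.exp(-k * t)

def mvt_leave_time(travel_time, lo=1e-9, hi=100.0, iters=200):
    """Bisection on g'(t) - g(t) / (t + travel_time), the MVT optimality condition."""
    def excess(t):
        return gain_rate(t) - gain(t) / (t + travel_time)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid  # still earning above the long-run average: stay longer
        else:
            hi = mid  # earning below the long-run average: should have left earlier
    return 0.5 * (lo + hi)

t_near = mvt_leave_time(travel_time=1.0)  # patches are close together
t_far = mvt_leave_time(travel_time=4.0)   # patches are far apart
```

Under these assumptions the sketch reproduces the standard qualitative MVT prediction: the longer the travel time between patches, the longer the optimal residence time in each patch.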

Abstract:
While end‑to‑end autonomous driving has achieved remarkable progress in geometric control, current systems remain constrained by a command‑following paradigm that relies on simple navigational instructions. Transitioning to genuinely intelligent agents requires the capability to interpret and fulfill high‑level, abstract human intentions. However, this advancement is hindered by the lack of dedicated benchmarks and semantic‑aware evaluation metrics. In this paper, we formally define the task of Intention‑Driven End‑to‑End Autonomous Driving and present Intention‑Drive, a comprehensive benchmark designed to bridge this gap. We construct a large‑scale dataset featuring complex natural language intentions paired with high‑fidelity sensor data. To overcome the limitations of conventional trajectory‑based metrics, we introduce the Imagined Future Alignment (IFA), a novel evaluation protocol leveraging generative world models to assess the semantic fulfillment of human goals beyond mere geometric accuracy. Furthermore, we explore the solution space by proposing two distinct paradigms: an end‑to‑end vision‑language planner and a hierarchical agent‑based framework. The experiments reveal a critical dichotomy where existing models exhibit satisfactory driving stability but struggle significantly with intention fulfillment. Notably, the proposed frameworks demonstrate superior alignment with human intentions.

Abstract:
Developing artificial agents that unify representation, memory, adaptation, and prediction remains a fundamental challenge in artificial intelligence. Here we introduce a geometric framework in which cognitive computation emerges from Riemannian gradient flow on a learned latent manifold. The learned metric encodes representational constraints and computational preferences, while anisotropies in the geometry naturally generate multiple timescales of behaviour, yielding both rapid reactive responses and slower adaptive dynamics without explicit memory modules or recurrent mechanisms. We instantiate this framework through Riemannian representation and dynamics models and evaluate them in partially observable reinforcement‑learning environments. Across observation masking, sensory blackouts, dynamics perturbations, and predictive latent‑modelling tasks, the proposed approach consistently outperforms feedforward baselines, achieves robustness comparable to recurrent architectures, and produces highly predictable latent trajectories with low long‑horizon rollout error. These results suggest that learned latent geometry can serve simultaneously as a substrate for representation, memory, adaptation, and prediction. More broadly, the framework provides a principled connection between dynamical systems, representation learning, and world‑model‑based intelligence.

Abstract:
Autonomous AI agents on embedded platforms require real‑time, risk‑aware scheduling under resource and thermal constraints. Classical heuristics struggle with workload irregularity, tabular regressors discard structural information, and model‑free reinforcement learning (RL) risks overheating. We introduce GraphPerf‑RT, a graph neural network surrogate achieving deep learning accuracy at heuristic speeds (2‑7ms). GraphPerf‑RT is, to our knowledge, the first to unify task DAG topology, CFG‑derived code semantics, and runtime context (per‑core DVFS, thermal state, utilization) in a heterogeneous graph with typed edges encoding precedence, placement, and contention. Evidential regression with Normal‑Inverse‑Gamma priors provides calibrated uncertainty; we validate on makespan prediction for risk‑aware scheduling. Experiments on three ARM platforms (Jetson TX2, Orin NX, RUBIK Pi) achieve R^2 = 0.81 on log‑transformed makespan with Spearman rho = 0.95 and conservative uncertainty calibration (PICP = 99.9% at 95% confidence). Integration with four RL methods demonstrates that multi‑agent model‑based RL with GraphPerf‑RT as the world model achieves 66% makespan reduction and 82% energy reduction versus model‑free baselines, with zero thermal violations.
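
The calibration figure quoted above is a prediction interval coverage probability (PICP): the fraction of true targets that fall inside their predicted intervals. The sketch below shows how such a number is computed; the data values and the function name are hypothetical and do not come from the paper.

```python
def picp(y_true, lower, upper):
    """Fraction of targets that fall inside their predicted interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

# Hypothetical makespan targets with their predicted interval bounds
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.2, 3.0]
upper = [1.5, 2.5, 4.0, 5.0]

coverage = picp(y_true, lower, upper)  # the third target lies below its interval
```

A PICP above the nominal level, such as 99.9% at a nominal 95%, means the intervals are wider than strictly needed, which is why the abstract describes the calibration as conservative.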

Abstract:
In autonomous driving, end‑to‑end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT‑driven pipeline that enhances end‑to‑end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto‑think Switch examines the current scene and decides whether additional reasoning is required to yield a higher‑quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT‑guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.

Abstract:
Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA‑based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

Abstract:
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full‑spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects ‑‑ Generation, Reconstruction, Action‑Following, Downstream Task, and Human Preference ‑‑ jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry‑stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens‑26K, a large‑scale dataset of human‑annotated videos with numerical scores and textual rationales, and develop WorldLens‑Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity ‑‑ standardizing how future models are judged not only by how real they look, but by how real they behave.

Abstract:
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in‑distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine‑tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out‑of‑distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi‑view consistency, while integrating generative image‑editing and multi‑view completion to synthesize realistic variations of real‑world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real‑world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.

Abstract:
Recent Vision‑Language‑Action (VLA) models for autonomous driving explore inference‑time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain‑of‑thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent‑CoT‑Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action‑aligned latent space. Instead of natural language, the model reasons by interleaving (1) action‑proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground‑truth future rollouts of the scene. We then post‑train with closed‑loop reinforcement learning to strengthen reasoning capabilities. On a large‑scale end‑to‑end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non‑reasoning and text‑reasoning baselines.

Abstract:
Inspired by how humans combine direct interaction with action‑free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action‑conditioned trajectories, which limits effectiveness when action labels are scarce. We introduce a family of latent‑action world models that jointly use action‑conditioned and action‑free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large‑scale unlabeled trajectories while requiring only a small set of action‑labeled ones. We use the latent‑action world model to learn a latent‑action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action‑conditioned data, and action‑free training, which is rarely used with subsequent RL. On the DeepMind Control Suite, our approach achieves strong performance while using about an order of magnitude fewer action‑labeled samples than purely action‑conditioned baselines. These results show that latent actions enable training on both passive and interactive data, which makes world models learn more efficiently.

Abstract:
World models paired with model predictive control (MPC) can be trained offline on large‑scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient‑based planning offers a computationally efficient alternative. However, the performance of gradient‑based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient‑based planning. We begin with the observation that although a world model is trained on a next‑state prediction objective, it is used at test‑time to instead estimate a sequence of actions. The goal of our work is to close this train‑test gap. To that end, we propose train‑time data synthesis techniques that enable significantly improved gradient‑based planning with existing world models. At test time, our approach outperforms or matches the classical gradient‑free cross‑entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
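
To illustrate what gradient-based planning means here, the sketch below optimizes an action sequence by gradient descent through a model. It is not the paper's method: the one-dimensional additive dynamics stand in for a learned world model, and the cost, step size, and function names are assumptions chosen so the gradient can be written by hand.

```python
def rollout(s0, actions):
    """Toy known dynamics s_{t+1} = s_t + a_t, standing in for a learned world model."""
    s = s0
    for a in actions:
        s = s + a
    return s

def plan_by_gradient(s0, goal, horizon=5, lam=0.1, lr=0.05, steps=500):
    """Gradient descent on J = (s_H - goal)^2 + lam * sum(a_t^2) over the action sequence."""
    actions = [0.0] * horizon
    for _ in range(steps):
        err = rollout(s0, actions) - goal
        # dJ/da_t = 2 * err + 2 * lam * a_t, since every action shifts s_H by one unit
        actions = [a - lr * (2.0 * err + 2.0 * lam * a) for a in actions]
    return actions

plan = plan_by_gradient(s0=0.0, goal=1.0)
```

For this toy problem the optimum is known in closed form, with every action equal to (goal - s0) / (horizon + lam), so the planner's output can be checked directly. With a learned model the gradient would come from automatic differentiation rather than a hand-derived formula.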

Abstract:
Autonomous driving (AD) systems struggle in long‑tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision‑language‑action (VLA)‑based methods cannot leverage unlabeled videos for visual causal learning, while world model‑based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding‑Generation‑Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre‑trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi‑frame observations and language instructions as input, it produces interpretable chain‑of‑thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four‑stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state‑of‑the‑art performance in perception, reasoning, and decision‑making, with superior generalization to challenging long‑tail situations.

Abstract:
Verifying closed‑loop vision‑based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual‑objective loss function that combines pixel‑level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star‑based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision‑based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent‑variable baseline.
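
The statistical bound mentioned above can be illustrated with split conformal prediction in its simplest form. This is a generic sketch, not the paper's pipeline: the deviation values and the function name are hypothetical, and the scores are assumed to be exchangeable with future ones.

```python
import math

def conformal_bound(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this confidence level
    return sorted(calibration_scores)[rank - 1]

# Hypothetical trajectory deviations between the surrogate model and the real system
deviations = [0.12, 0.05, 0.30, 0.08, 0.21, 0.17, 0.09, 0.14, 0.26]
bound = conformal_bound(deviations, alpha=0.25)
```

Under the exchangeability assumption, a new deviation falls at or below `bound` with probability at least 1 - alpha, without any assumption on the distribution of the deviations.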

Abstract:
Model‑based planning in robotic domains is challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, which over‑smooths distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in compounding errors during long‑horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM‑WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM‑WM uses a context‑aware Mixture‑of‑Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, preventing mode collapse. By modeling the mode transitions in system dynamics, PRISM‑WM reduces rollout drift. Experiments on continuous control benchmarks, including high‑dimensional humanoids and multi‑task settings, demonstrate that PRISM‑WM provides a high‑fidelity substrate for trajectory optimization algorithms (e.g., TD‑MPC), indicating its potential as a foundational model for model‑based agents.

Abstract:
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, so the task requires reasoning over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model to anticipate future audio observations, enabling the system to reason about long‑term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system through two manipulation tasks that require perceiving in‑the‑wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multi‑modal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns.

Abstract:
Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called "empowerment" which maximizes mutual information between actions and their outcomes. "Empowerment" may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of children's causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations, and design effective causal interventions.
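
The mutual information between actions and outcomes that underlies empowerment can be computed directly for a small discrete example. The two channels and the function name below are illustrative assumptions. Empowerment is usually defined as the maximum of this quantity over action distributions; the sketch fixes a uniform distribution over actions for simplicity.

```python
import math

def mutual_information(p_outcome_given_action):
    """I(A; O) in bits for a uniform distribution over actions."""
    n_actions = len(p_outcome_given_action)
    n_outcomes = len(p_outcome_given_action[0])
    p_a = 1.0 / n_actions
    # Marginal outcome distribution p(o) = sum_a p(a) p(o | a)
    p_o = [sum(p_a * row[o] for row in p_outcome_given_action) for o in range(n_outcomes)]
    mi = 0.0
    for row in p_outcome_given_action:
        for o, p in enumerate(row):
            if p > 0.0:
                mi += p_a * p * math.log2(p / p_o[o])
    return mi

controllable = [[1.0, 0.0], [0.0, 1.0]]    # each action reliably yields its own outcome
uncontrollable = [[0.5, 0.5], [0.5, 0.5]]  # outcomes ignore the action

mi_controllable = mutual_information(controllable)
mi_uncontrollable = mutual_information(uncontrollable)
```

In this toy setting the controllable channel carries one bit of information about the chosen action, while the uncontrollable one carries none. This matches the intuition that an agent has more empowerment where its actions reliably determine what happens.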

Abstract:
World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video‑generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long‑horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics‑based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high‑level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid‑body dynamics and collision constraints. We validate EToT on a suite of short‑ and long‑horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied‑tree‑of‑thoughts.github.io .

Abstract:
Clinical decision‑making in oncology requires predicting dynamic disease evolution, a task current static AI predictors cannot perform. While world models (WMs) offer a paradigm for generative prediction, existing medical applications remain limited. Existing methods often rely on stochastic diffusion models, focusing on visual reconstruction rather than causal, physiological transitions. Furthermore, in the medical domain, models like MeWM typically ignore patient‑specific temporal and clinical contexts and lack a feedback mechanism to link predictions to treatment decisions. To address these gaps, we introduce CLARITY, a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient‑specific data (clinical context) to model treatment‑conditioned progression as a smooth, interpretable trajectory, and thus generates physiologically faithful, individualized treatment plans. Finally, CLARITY introduces a novel prediction‑to‑decision framework, translating latent rollouts into transparent, actionable recommendations. CLARITY demonstrates state‑of‑the‑art performance in treatment planning. On the MU‑Glioma‑Post dataset, our approach outperforms the recent MeWM by 12% and significantly surpasses all other medical‑specific large language models.

Abstract:
Despite advancements in Multi‑modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed‑loop process of active exploration, visual imagination via a world model, and evidence‑grounded reasoning. To address the lack of fine‑grained reward supervision in long‑horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree‑structured sampling and step‑level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human‑like active spatial mental simulation for MLLMs.

Abstract:
DreamerV3 is a state‑of‑the‑art online model‑based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov‑Arnold Networks (KANs) have emerged as a promising alternative to Multi‑Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN‑Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX‑based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop‑in replacement for the Reward and Continue predictors yields performance on par with the original MLP‑based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN‑based world models.

Abstract:
World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, hindering the model's capacity to perform loop closures within imagined trajectories. In this work, we investigate the effective memory span of transformer‑based world models through an analysis of several memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory encoding and memory injection mechanisms, motivating their roles in extending the world model's memory through the lens of residual stream dynamics. Using a state recall evaluation task, we measure the memory recall of each mechanism and analyze its respective trade‑offs. Our findings show that memory mechanisms improve the effective memory span in vision transformers and provide a path to completing loop closures within a world model's imagination.

Abstract:
Designing state encoders for reinforcement learning (RL) with multiple information sources ‑‑ such as sensor measurements, time‑series signals, image observations, and textual instructions ‑‑ remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source‑specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules ‑‑ such as their representation quality ‑‑ limiting sample efficiency in multi‑source RL settings. To address this, we propose an LLM‑driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language‑model priors and intermediate‑output signals to guide sample‑efficient search for high‑performing composite state encoders. On a mixed‑autonomy traffic control task, our approach discovers higher‑performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM‑based GENIUS framework.

Abstract:
Scalable embodied intelligence is constrained by the scarcity of diverse, long‑horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND‑V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long‑horizon robotic manipulation. Inspired by cognitive science, MIND‑V bridges high‑level reasoning with pixel‑level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre‑trained vision‑language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain‑invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND‑V employs Staged Visual Future Rollouts, a test‑time optimization strategy to enhance long‑horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post‑training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V‑JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND‑V's SOTA performance in long‑horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.

Abstract:
Recent advances in generative video models have led to significant breakthroughs in high‑fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction‑guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate ‑ generating future video frames that are misaligned with physical reality ‑ which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state‑of‑the‑art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous‑scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel‑space approaches. Third, we map the dense latent‑space uncertainty to interpretable pixel‑level uncertainty in the RGB space for intuitive visualization, providing high‑resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large‑scale robot learning datasets (Bridge and DROID) and real‑world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out‑of‑distribution detection.

Abstract:
We introduce FieldSeer I, a geometry‑aware world model that forecasts electromagnetic field dynamics from partial observations in 2‑D TE waveguides. The model assimilates a short prefix of observed fields, conditions on a scalar source action and structure/material map, and generates closed‑loop rollouts in the physical domain. Training in a symmetric‑log domain ensures numerical stability. Evaluated on a reproducible FDTD benchmark (200 unique simulations, structure‑wise split), FieldSeer I achieves higher suffix fidelity than GRU and deterministic baselines across three practical settings: (i) software‑in‑the‑loop filtering (64x64, P=80‑>Q=80), (ii) offline single‑file rollouts (80x140, P=240‑>Q=40), and (iii) offline multi‑structure rollouts (80x140, P=180‑>Q=100). Crucially, it enables edit‑after‑prefix geometry modifications without re‑assimilation. Results demonstrate that geometry‑conditioned world models provide a practical path toward interactive digital twins for photonic design.
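
The abstract does not give the exact form of the symmetric-log domain. A minimal sketch, assuming the commonly used definition symlog(x) = sign(x) ln(1 + |x|) with inverse symexp(y) = sign(y) (exp(|y|) - 1), is shown below; the function names are illustrative, not the authors' code.

```python
import math


def symlog(x: float) -> float:
    """Compress magnitude logarithmically while preserving sign (assumed definition)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: map a symmetric-log value back to the physical domain."""
    return math.copysign(math.expm1(abs(y)), y)


if __name__ == "__main__":
    for field_value in (-1000.0, -0.5, 0.0, 0.5, 1000.0):
        encoded = symlog(field_value)
        print(field_value, encoded, symexp(encoded))
```

Under this assumption, large field amplitudes are compressed for training while values near zero stay approximately linear, and rollouts can be mapped back to the physical domain with `symexp`.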

Abstract:
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision‑Language‑Action (VLA) models and world models is severely hampered by the scarcity of large‑scale, diverse training data. A promising solution is to "robotize" web‑scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms to egocentric videos, which cannot handle complex full‑body motions and scene occlusions in third‑person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X‑Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video‑to‑video structure and finetunes it for the human‑to‑humanoid translation task. This finetuning requires paired human‑humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego‑Exo4D videos, generating and releasing a new large‑scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

Abstract:
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real‑world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross‑modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open‑ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task‑aware adaptability that supports multi‑task learning and cross‑environment generalization. To address these limitations, we propose BiTAgent, a task‑aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM‑generated feedback refines the MLLM's semantic space via dense text‑conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task‑Aware Dynamic Joint Learning, Task‑Aware Behavior Learning, and MLLM‑WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi‑task and cross‑environment settings demonstrate superior stability and generalization over state‑of‑the‑art baselines, marking a step toward open‑ended embodied learning.

Abstract:
End‑to‑End autonomous driving (E2E‑AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high‑quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi‑dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high‑quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation ‑ candidate generation ‑ multi‑objective trade‑off". In particular, the proposed Future‑aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego‑conditioned "what‑if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM‑oriented Evaluator (VLoE) leverages the reasoning capability of a large vision‑language model to conduct multi‑objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human‑aligned decision making. Extensive experiments on the NAVSIM‑v1 and NAVSIM‑v2 benchmarks demonstrate that MindDrive achieves state‑of‑the‑art performance across multi‑dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

Abstract:
We study how to exploit dense simulator‑defined rewards in vision‑based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time‑to‑collision) can be converted into dense rewards that stabilize and accelerate model‑based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision‑free overtaking. We propose reward‑privileged world model distillation, a two‑stage framework in which a teacher DreamerV3‑style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's‑eye‑view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane‑following and overtaking benchmarks, sparse‑reward students outperform both dense‑reward teachers and sparse‑from‑scratch baselines. On unseen lane‑following routes, reward‑privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near‑perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment‑aligned objectives.

Abstract:
A truly interactive world model requires three key ingredients: real‑time long‑horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long‑term memory mechanisms often degrade real‑time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory‑aware, long‑duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video‑diffusion distillation techniques, our model represents long‑horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera‑aware memory structure supports implicit 3D‑consistent content retrieval and enforces long‑term coherence with minimal computational overhead. In parallel, we fine‑tune a bidirectional teacher video model to generate sequences beyond its original 5‑second training horizon, and transform it into a causal student generator using a new memory‑efficient self‑forcing paradigm that enables full‑context distillation over long‑duration teacher as well as long student self‑rollouts. Implemented as a 14B‑parameter model and trained on a curated Unreal Engine‑rendered dataset, RELIC achieves real‑time generation at 16 FPS while demonstrating more accurate action following, more stable long‑horizon streaming, and more robust spatial‑memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.

Abstract:
Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi‑scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task‑specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape‑R, a framework leveraging the world model to serve as a versatile, general‑purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model‑based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real‑world state transition dynamics. Extensive experiments demonstrate that RoboScape‑R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out‑of‑domain scenarios.

Abstract:
World Foundation Models (WFMs) offer remarkable visual dynamics simulation capabilities, yet their application to precise robotic control remains limited by the gap between generative realism and control‑oriented precision. While existing approaches use WFMs as synthetic data generators, they suffer from high computational costs and underutilization of pre‑trained VLA policies. We introduce AdaPower (Adapt and Empower), a lightweight adaptation framework that transforms general‑purpose WFMs into specialist world models through two novel components: Temporal‑Spatial Test‑Time Training (TS‑TTT) for inference‑time adaptation and Memory Persistence (MP) for long‑horizon consistency. Integrated within a Model Predictive Control framework, our adapted world model empowers pre‑trained VLAs, achieving over 41% improvement in task success rates on LIBERO benchmarks without policy retraining, while preserving computational efficiency and generalist capabilities.

Abstract:
Interpreting natural‑language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context‑dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial‑Aware World Model (SA‑WM) that learns to reason ahead by distilling the current scene into a command‑aware latent state and rolling out a sequence of future latent states, providing forward‑looking cues for disambiguation. Complementing this, a hypergraph‑guided decoder hierarchically fuses these states with the multimodal input, capturing higher‑order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi‑source VG dataset in AD, featuring semantic annotations generated by a Retrieval‑Augmented Generation (RAG) and Chain‑of‑Thought (CoT)‑prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state‑of‑the‑art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long‑text, multi‑agent, ambiguity) and retains superior performance even when trained on 50% of the data.

Abstract:
Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model‑free approaches. Conventional policy networks struggle to process full‑resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model‑based RL framework built on top of the DreamerV3 algorithm, integrating a Multi‑Layer Perceptron Variational Autoencoder (MLP‑VAE) within a world model to encode high‑dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination‑based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and a higher success rate compared to model‑free baselines such as SAC, DDPG, and TD3. Notably, the DreamerV3‑based agent attains a 100% success rate across all evaluated environments when using the full set of TurtleBot3 LIDAR readings (360 readings), while model‑free methods plateau below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high‑dimensional sensory data.

Abstract:
In this work we study how explicit world‑modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world‑model quality affect the model's performance after reinforcement learning post‑training? We compare standard next‑token prediction to two explicit world‑modeling strategies ‑‑ (i) state‑prediction pretraining and (ii) a joint state‑prediction + next‑token objective ‑‑ and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post‑training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world‑modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post‑training for sequence‑planning tasks.
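
For readers unfamiliar with GRPO, the core step is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below illustrates that normalization under stated assumptions (population standard deviation, a small `eps` for stability, 0/1 solve rewards); it is not the authors' training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no learned
    value function is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group: 1.0 if a sampled move sequence solves the cube, else 0.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason harder states with sparse successes are sensitive to the quality of the pretrained model.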

Abstract:
This paper investigates the dynamical properties of tokens in pre‑trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous‑time limit of the pre‑trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real‑world models. Furthermore, we investigate how different forms of positional encoding ‑‑ specifically absolute and rotary ‑‑ affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.

Abstract:
Video world models have attracted significant attention for their ability to produce high‑fidelity future visual observations conditioned on past observations and navigation actions. Temporally‑ and spatially‑consistent, long‑term world modeling has been a long‑standing problem, unresolved with even recent state‑of‑the‑art models, due to the prohibitively expensive computational costs for long‑context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long‑term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long‑term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long‑term consistency, and we verify that WorldPack notably outperforms strong state‑of‑the‑art models.

Abstract:
Recent advances in general‑purpose AI systems with attention‑based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and, ultimately, to human intelligence. This Perspective provides a cross‑domain comparison between the brain and AI that goes beyond the traditional focus on visual processing, adopting the emerging perspective of world‑model‑based computation. Here, we identify shared computational mechanisms in the attention‑based neocortex and the non‑attentional cerebellum: both predict future world events from past inputs and construct internal world models through prediction‑error learning. These predictive world models are repurposed for seemingly distinct functions ‑‑ understanding in sensory processing and generation in motor processing ‑‑ enabling the brain to achieve multi‑domain capabilities and human‑like adaptive intelligence. Notably, attention‑based AI has independently converged on a similar learning paradigm and world‑model‑based computation. We conclude that these shared mechanisms in both biological and artificial systems constitute a core computational foundation for realizing diverse functions including high‑level intelligence, despite their relatively uniform circuit structures. Our theoretical insights bridge neuroscience and AI, advancing our understanding of the computational essence of intelligence.

Abstract:
World models have gained significant attention as a promising approach for autonomous driving. By emulating human‑like perception and decision‑making processes, these models can predict and adapt to dynamic environments. Existing methods typically map high‑dimensional observations into compact latent spaces and learn optimal policies within these latent representations. However, prior work usually jointly learns ego‑vehicle dynamics and environmental transition dynamics from the image input, leading to inefficiencies and a lack of robustness to variations in vehicle dynamics. To address these issues, we propose the Vehicle Dynamics embedded Dreamer (VDD) method, which decouples the modeling of ego‑vehicle dynamics from environmental transition dynamics. This separation allows the world model to generalize effectively across vehicles with diverse parameters. Additionally, we introduce two strategies to further enhance the robustness of the learned policy: Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT). Comprehensive experiments in simulated environments demonstrate that the proposed model significantly improves both driving performance and robustness to variations in vehicle dynamics, outperforming existing approaches.

Abstract:
The rapid shift from stateless large language models (LLMs) to autonomous, goal‑driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi‑step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self‑reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real‑world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity‑driven design decision, ensuring autonomy is applied only when its benefits justify the costs.

Abstract:
World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. While realistic world models often have high computational demands, this can often be alleviated by exploiting the fact that real‑world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process by deriving sub‑transducers operating on distinct input‑output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay groundwork for bridging the computational efficiency required for real‑world inference and the structural transparency demanded by AI safety.

Abstract:
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out‑of‑the‑box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high‑variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit‑free, two‑object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low‑rank adaptor fine‑tuned on only 100 single‑ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching 65% of terrestrial gravity). This specialist adaptor also generalizes zero‑shot to two‑ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
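
The two-object ratio follows from free fall from rest, h = (1/2) g t^2, so t^2 = 2h/g and the squared fall times of two objects dropped from heights h_1 and h_2 satisfy t_1^2/t_2^2 = h_1/h_2. The sketch below checks numerically that the ratio is unchanged by the value of g or by a uniform time rescaling (for example, a wrong frame-rate assumption). The helper names and the value 9.81 m/s^2 for terrestrial gravity are assumptions made for illustration.

```python
import math

G_EARTH = 9.81  # m/s^2, assumed reference value for terrestrial gravity


def fall_time(height_m: float, g: float) -> float:
    """Time to fall from rest through height_m under constant acceleration g."""
    return math.sqrt(2.0 * height_m / g)


def timing_ratio(h1: float, h2: float, g: float, time_scale: float = 1.0) -> float:
    """Squared ratio of fall times; time_scale models an unknown frame-rate factor."""
    t1 = time_scale * fall_time(h1, g)
    t2 = time_scale * fall_time(h2, g)
    return (t1 / t2) ** 2


if __name__ == "__main__":
    # The ratio equals h1 / h2 regardless of g or the time scale.
    print(timing_ratio(2.0, 1.0, g=G_EARTH))
    print(timing_ratio(2.0, 1.0, g=1.81, time_scale=0.5))
    # Fraction of terrestrial gravity reached by the reported 6.43 m/s^2.
    print(6.43 / G_EARTH)
```

Since g and the time scale cancel, a generator that violates the ratio cannot be excused by an unknown frame rate or an unknown effective gravity; the two objects are simply not falling under a common acceleration.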

Abstract:
Recent advances in video world modeling have enabled large‑scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self‑supervised post‑training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle‑consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward‑aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post‑training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine‑tuning in outdoor environments.

Abstract:
Robots in uncertain real‑world environments must perform both goal‑directed and exploratory actions. However, most deep learning‑based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal‑directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low‑cost action selection. We evaluate the framework on object‑manipulation tasks with a real‑world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal‑directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
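
The abstract states that action sequences are compressed into abstract actions via vector quantization but gives no further detail. As a rough illustration of the lookup step only (not the training procedure, and not the authors' implementation), the sketch below maps an encoded action sequence to its nearest codebook entry; the codebook values and names are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to vector in squared L2 distance."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


if __name__ == "__main__":
    # Hypothetical 2-D codebook with three abstract actions.
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
    encoded_action_sequence = [0.9, 1.2]
    print(quantize(encoded_action_sequence, codebook))
```

Because the codebook is finite, action selection can in principle enumerate the abstract actions and evaluate each with the abstract world model, rather than searching over raw action sequences.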

Abstract:
Embodied navigation for long‑horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long‑term planning in unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision‑Language Model (VLM) that unifies high‑level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub‑goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short‑term environmental dynamics and long‑term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception‑planning/prediction‑action. We demonstrate through extensive experiments on the R2R‑CE and RxR‑CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.

Abstract:
Large‑scale video‑text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text‑supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel‑level reconstruction struggles with convergence and its low‑level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these issues, we disentangle the traditional encoder‑decoder design into an Encoder‑Predictor‑Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo‑Next, a two‑stage pretraining scheme that builds a semantically consistent yet detail‑preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latent to be linearly projected to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image‑level semantic priors to enhance semantics and convergence, thus bridging pixel‑level fidelity with high‑level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo‑Next achieves state‑of‑the‑art results across benchmarks and provides a scalable path toward general video representation learning.

Abstract:
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real‑world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio‑visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control. This work presents the first formal framework for Audio‑Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio‑visual observations. To address the lack of suitable training data, we construct AVW‑4k, a dataset comprising 30 hours of binaural audio‑visual trajectories with action annotations across 76 indoor environments. We propose AV‑CDiT, an Audio‑Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three‑stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV‑CDiT achieves high‑fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in continuous audio‑visual navigation tasks, where AVWM significantly enhances the agent's performance.

Abstract:
Virtual cell modeling aims to predict cellular responses to perturbations. Existing virtual cell models rely heavily on large‑scale single‑cell datasets, learning explicit mappings between gene expression and perturbations. Although recent models attempt to incorporate multi‑source biological information, their generalization remains constrained by data quality, coverage, and batch effects. More critically, these models often function as black boxes, offering predictions without interpretability or consistency with biological principles, which undermines their credibility in scientific research. To address these challenges, we present VCWorld, a cell‑level white‑box simulator that integrates structured biological knowledge with the iterative reasoning capabilities of large language models to instantiate a biological world model. VCWorld operates in a data‑efficient manner to reproduce perturbation‑induced signaling cascades and generates interpretable, stepwise predictions alongside explicit mechanistic hypotheses. In drug perturbation benchmarks, VCWorld achieves state‑of‑the‑art predictive performance, and the inferred mechanistic pathways are consistent with publicly available biological evidence.

Abstract:
Vision‑and‑Language Navigation (VLN) requires agents to follow language instructions while acting in continuous real‑world spaces. Prior image imagination based VLN work shows benefits for discrete panoramas but lacks online, action‑conditioned predictions and does not produce explicit planning values; moreover, many methods replace the planner with long‑horizon objectives that are brittle and slow. To bridge this gap, we propose VISTAv2, a generative world model that rolls out egocentric future views conditioned on past observations, candidate action sequences, and instructions, and projects them into an online value map for planning. Unlike prior approaches, VISTAv2 does not replace the planner. The online value map is fused at score level with the base objective, providing reachability and risk‑aware guidance. Concretely, we employ an action‑aware Conditional Diffusion Transformer video predictor to synthesize short‑horizon futures, align them with the natural language instruction via a vision‑language scorer, and fuse multiple rollouts in a differentiable imagination‑to‑value head to output an imagined egocentric value map. For efficiency, rollouts occur in VAE latent space with a distilled sampler and sparse decoding, enabling inference on a single consumer GPU. Evaluated on MP3D and RoboTHOR, VISTAv2 improves over strong baselines, and ablations show that action‑conditioned imagination, instruction‑guided value fusion, and the online value‑map planner are all critical, suggesting that VISTAv2 offers a practical and interpretable route to robust VLN.

Abstract:
The paradigm of learning‑based robotics holds immense promise, yet its translation to real‑world applications is critically hindered by the sample inefficiency and brittleness of conventional model‑free reinforcement learning algorithms. In this work, we address these challenges by introducing DREAMer‑VXS, a model‑based framework for Autonomous Ground Vehicle (AGV) exploration that learns to plan from imagined latent trajectories. Our approach centers on learning a comprehensive world model from partial and high‑dimensional LiDAR observations. This world model is composed of a Convolutional Variational Autoencoder (VAE), which learns a compact representation of the environment's structure, and a Recurrent State‑Space Model (RSSM), which models complex temporal dynamics. By leveraging this learned model as a high‑speed simulator, the agent can train its navigation policy almost entirely in imagination. This methodology decouples policy learning from real‑world interaction, culminating in a 90% reduction in required environmental interactions to achieve expert‑level performance when compared to state‑of‑the‑art model‑free SAC baselines. The agent's behavior is guided by an actor‑critic policy optimized with a composite reward function that balances task objectives with an intrinsic curiosity bonus, promoting systematic exploration of unknown spaces. We demonstrate through extensive simulated experiments that DREAMer‑VXS not only learns orders of magnitude faster but also develops more generalizable and robust policies, achieving a 45% increase in exploration efficiency in unseen environments and superior resilience to dynamic obstacles.

Abstract:
Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi‑turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reasoning. To address these issues, we explore world‑model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing the model to shape thinking directly through its doing, and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over‑relying on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.
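
The two mechanisms named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption: the linear annealing schedule, the definition of "action efficacy" as the fraction of actions that changed the environment state, and the function names `max_turns_at` and `rescaled_reward` are all hypothetical.

```python
import math


def max_turns_at(step, total_steps, start_turns, end_turns):
    """Turn budget at a given training step.

    Assumption: the budget shrinks linearly from start_turns to end_turns.
    The abstract only says the maximum is reduced progressively.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    budget = start_turns - frac * (start_turns - end_turns)
    return max(end_turns, math.ceil(budget))


def rescaled_reward(outcome_reward, effective_actions, total_actions):
    """Scale the outcome reward by a hypothetical efficacy ratio.

    Assumption: efficacy = effective_actions / total_actions, where an
    action counts as effective if it changed the environment state.
    """
    if total_actions <= 0:
        return 0.0
    if not 0 <= effective_actions <= total_actions:
        raise ValueError("effective_actions must lie in [0, total_actions]")
    return outcome_reward * effective_actions / total_actions


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, max_turns_at(step, 100, start_turns=9, end_turns=1))
    print(rescaled_reward(1.0, effective_actions=3, total_actions=4))
```

Under these assumptions, a successful episode that wastes actions earns less than an equally successful but more purposeful one, and the shrinking budget eventually forces single‑turn solutions.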

Abstract:
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.

Abstract:
To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment (that is, how the environment behaves) rather than just task instructions specifying "what to do". Understanding this dynamics‑descriptive language is important for human‑agent interaction and agent behavior. Recent work addresses this problem using a model‑based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions. For instance, they assume that the latency induced by inference‑time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language‑conditioned world model while dropping these assumptions. We propose a model‑based reinforcement learning approach, where a language‑conditioned world model is trained through interaction with the environment, and a policy is learned from this model, without planning or expert demonstrations. We introduce the Language‑aware Encoder for Dreamer World Model (LED‑WM), built on top of DreamerV3. LED‑WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED‑WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and MESSENGER‑WM. To highlight how the policy can leverage the trained world model before real‑world deployment, we demonstrate the policy can be improved through fine‑tuning on synthetic test trajectories generated by the world model.

Abstract:
Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first‑person (egocentric) and third‑person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in‑context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In‑Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross‑view synchronization. To further support our task, we curate EgoExo‑8K, a large‑scale dataset containing synchronized egocentric‑exocentric triplets from both synthetic and real‑world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric‑exocentric video translation.

Abstract:
This paper introduces a novel architecture for trajectory‑conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi‑frame future occupancy in an end‑to‑end manner directly from raw image features. Inspired by the success of attention‑based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite‑capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state‑of‑the‑art performance on the nuScenes benchmark for 1‑3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Abstract:
Digital twins of urban environments play a critical role in advancing autonomous vehicle (AV) research by enabling simulation, validation, and integration with emerging generative world models. While existing tools have demonstrated value, many publicly available solutions are tightly coupled to specific simulators, difficult to extend, or introduce significant technical overhead. For example, CARLA, the most widely used open‑source AV simulator, provides a digital twin framework implemented entirely as an Unreal Engine C++ plugin, limiting flexibility and rapid prototyping. In this work, we propose OpenTwinMap, an open‑source, Python‑based framework for generating high‑fidelity 3D urban digital twins. The completed framework will ingest LiDAR scans and OpenStreetMap (OSM) data to produce semantically segmented static environment assets, including road networks, terrain, and urban structures, which can be exported into Unreal Engine for AV simulation. OpenTwinMap emphasizes extensibility and parallelization, lowering the barrier for researchers to adapt and scale the pipeline to diverse urban contexts. We describe the current capabilities of OpenTwinMap, which include preprocessing of OSM and LiDAR data, basic road mesh and terrain generation, and preliminary support for CARLA integration.

Abstract:
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments (humans and different robots) are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small‑data problem by introducing a unifying, symbolic representation, a compact 3D "trace‑space" of scene‑level trajectories, that enables learning from cross‑embodiment, cross‑environment, and cross‑task videos. We present TraceGen, a world model that predicts future motion in trace‑space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation‑trace‑language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50‑600x faster inference than state‑of‑the‑art video‑based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel‑space generation.

Abstract:
In recent years, dual‑arm manipulation has become an area of strong interest in robotics, with end‑to‑end learning emerging as the predominant strategy for solving bimanual tasks. A critical limitation of such learning‑based approaches, however, is their difficulty in generalizing to novel scenarios, especially within cluttered environments. This paper presents an alternative paradigm: a sampling‑based optimization framework that utilizes a GPU‑accelerated physics simulator as its world model. We demonstrate that this approach can solve complex bimanual manipulation tasks in the presence of static obstacles. Our contribution is a customized Model Predictive Path Integral Control (MPPI) algorithm, guided by carefully designed task‑specific cost functions, that uses GPU‑accelerated MuJoCo for efficiently evaluating robot‑object interaction. We apply this method to solve significantly more challenging versions of tasks from the PerAct^2 benchmark, such as requiring the point‑to‑point transfer of a ball through an obstacle course. Furthermore, we establish that our method achieves real‑time performance on commodity GPUs and facilitates successful sim‑to‑real transfer by leveraging unique features within MuJoCo. The paper concludes with a statistical analysis of the sample complexity and robustness, quantifying the performance of our approach. The project website is available at: https://sites.google.com/view/bimanualakslabunitartu .
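
As a rough illustration of the sampling‑based optimization described above, here is a minimal sketch of one generic MPPI update. It is not the paper's customized algorithm: a toy 1D integrator stands in for the GPU‑accelerated MuJoCo simulator, the quadratic goal cost is a placeholder for the task‑specific cost functions, and all function names and parameters are hypothetical.

```python
import math
import random


def mppi_weights(costs, temperature):
    """Softmin weights over rollout costs (lower cost -> larger weight)."""
    best = min(costs)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def rollout_cost(x0, controls, goal, dt):
    """Simulate the toy model x <- x + u * dt and sum squared goal error."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + u * dt
        cost += (x - goal) ** 2
    return cost


def mppi_step(x0, nominal, goal, dt, num_samples, sigma, temperature, rng):
    """One MPPI update: perturb the plan, score rollouts, average the noise."""
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [
        rollout_cost(x0, [u + e for u, e in zip(nominal, eps)], goal, dt)
        for eps in noises
    ]
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    x, goal, dt = 0.0, 1.0, 0.1
    plan = [0.0] * 10
    for _ in range(40):
        plan = mppi_step(x, plan, goal, dt, 256, 1.0, 1.0, rng)
        x = x + plan[0] * dt       # apply the first control
        plan = plan[1:] + [0.0]    # shift the plan (receding horizon)
    print(f"final position: {x:.3f}")
```

The rollouts are independent of one another, which is what makes this family of methods a natural fit for a batched, GPU‑accelerated simulator.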

Abstract:
Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long‑horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame‑by‑frame autoregressive generation of long‑horizon LiDAR scenes. LaGen is able to take a single‑frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high‑fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object‑level content, as well as a noise modulation module to mitigate error accumulation during long‑horizon generation. We construct a protocol based on nuScenes for evaluating long‑horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state‑of‑the‑art LiDAR generation and prediction models, especially on the later frames.

Abstract:
World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high‑quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM‑centric vision foundation models. A key breakthrough empowering them is the semi‑autoregressive (block‑diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM‑style KV Cache management, enabling efficient, variable‑length, and high‑quality generation. Therefore, Inferix is specifically designed as a next‑generation inference engine to enable immersive world synthesis through optimized semi‑autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high‑concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real‑time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV‑Bench, a new fine‑grained evaluation benchmark tailored for minute‑long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

Abstract:
End‑to‑end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long‑tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep‑seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post‑training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off‑road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed‑loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.

Abstract:
Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end‑to‑end systems and world‑model‑based planners predict rich multi‑modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP‑World, a prior‑free multi‑modal planning framework that couples masked action planning with a path‑weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving‑intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor‑based approaches and achieves state‑of‑the‑art performance among world‑model‑based methods, while avoiding reinforcement learning and maintaining real‑time inference latency.
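
The path‑weighted training signal described above, an expectation of the per‑mode semantic loss under the trajectory probabilities, can be sketched as follows. This is a minimal illustration, not the authors' implementation: obtaining the probabilities from a softmax over mode logits is an assumption, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def path_weighted_loss(mode_logits, per_mode_losses):
    """Expected loss over modes: sum_k p_k * L_k.

    Assumption: p = softmax(mode_logits). The abstract only states that
    trajectory probabilities act as discrete path weights.
    """
    if len(mode_logits) != len(per_mode_losses):
        raise ValueError("need one loss per mode")
    probs = softmax(mode_logits)
    return sum(p * loss for p, loss in zip(probs, per_mode_losses))


def selected_mode_loss(mode_logits, per_mode_losses):
    """Contrast case: keep only the loss of the single most likely mode."""
    best = max(range(len(mode_logits)), key=lambda k: mode_logits[k])
    return per_mode_losses[best]


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]   # three equally likely candidate trajectories
    losses = [1.0, 2.0, 3.0]   # semantic loss of each mode's rollout
    print(path_weighted_loss(logits, losses))   # averages over all modes
    print(selected_mode_loss(logits, losses))   # uses the first mode only
```

With the expectation, every candidate trajectory contributes to the training signal in proportion to its probability, whereas single‑mode selection discards the remaining rollouts.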

Abstract:
Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio‑temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end‑to‑end optimization. To overcome these limitations, we introduce WPT, a World‑to‑Policy Transfer training paradigm that enables online distillation under the guidance of an end‑to‑end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real‑time deployability. Extensive experiments on both open‑loop and closed‑loop benchmarks show that our WPT achieves state‑of‑the‑art performance with a simple policy architecture: it attains a 0.11 collision rate (open‑loop) and achieves a 79.23 driving score (closed‑loop) surpassing both world‑model‑based and imitation‑learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference, while retaining most of the gains.

Abstract:
World models are emerging as a foundational paradigm for scalable, data‑efficient embodied AI. In this work, we present GigaWorld‑0, a unified world model framework designed explicitly as a data engine for Vision‑Language‑Action (VLA) learning. GigaWorld‑0 integrates two synergistic components: GigaWorld‑0‑Video, which leverages large‑scale video generation to produce diverse, texture‑rich, and temporally coherent embodied sequences under fine‑grained control of appearance, camera viewpoint, and action semantics; and GigaWorld‑0‑3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction‑aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8‑precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld‑0 generates high‑quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain‑0) trained on GigaWorld‑0‑generated data achieve strong real‑world performance, significantly improving generalization and task success on physical robots without any real‑world interaction during training.

Abstract:
General‑purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single‑task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large‑scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language‑conditioned multitask world model that is first pretrained on demonstrations to acquire task‑aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data‑efficiency than a set of strong baselines, exhibits strong open‑loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

Abstract:
Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they exhibit motion drift in complex environments with multiple interacting subjects, where dynamic subjects fail to follow realistic motion patterns during scene evolution. Second, they suffer from error accumulation in long‑horizon interactions, where autoregressive generation gradually drifts from earlier scene states and causes structural and semantic inconsistencies. In this paper, we propose MagicWorld, an interactive video world model built upon an autoregressive framework. To address motion drift, we incorporate a flow‑guided motion preservation constraint that mitigates motion degradation in dynamic subjects, encouraging realistic motion patterns and stable interactions during scene evolution. To mitigate error accumulation in long‑horizon interactions, we design two complementary strategies, including a history cache retrieval strategy and an enhanced interactive training strategy. The former reinforces historical scene states by retrieving past generations during interaction, while the latter adopts multi‑shot aggregated distillation with dual‑reward weighting for interactive training, enhancing long‑term stability and reducing error accumulation. In addition, we construct RealWM120K, a real‑world dataset with diverse city‑walk videos and multimodal annotations to support dynamic perception and long‑horizon world modeling. Experimental results demonstrate that MagicWorld improves motion realism and alleviates error accumulation during long‑horizon interactions.

Abstract:
Vision‑and‑Language Navigation (VLN), which requires agents to autonomously navigate complex environments via visual images and natural language instructions, remains highly challenging. Recent research on enhancing language‑guided navigation reasoning using pre‑trained large language models (LLMs) has shown promising prospects. However, the reasoning of such methods is limited to the linguistic modality, lacking visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision‑making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions as inputs to jointly predict subsequent visual states, enabling cross‑modal reasoning. Via a Hierarchical Prediction‑Feedback (HPN) mechanism, MWM collaborates with navigation policies: the first layer generates actions using current vision‑and‑language features; MWM then infers post‑action visual states to guide the second layer's fine‑grained decisions. This forms a dynamic bidirectional promotion mechanism where MWM reasoning optimizes navigation policies, while policy decisions feed back to improve MWM's reasoning accuracy. Experiments on R2R and REVERIE datasets show UNeMo outperforms state‑of‑the‑art methods by 2.1% and 0.7% in navigation accuracy for unseen scenes, validating its effectiveness.

Abstract:
While video‑generation‑based embodied world models have gained increasing attention, their reliance on large‑scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long‑horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM). By restricting video generation to fixed shorter horizons, our approach 1) enables fine‑grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision‑Language Model (VLM) planner and a Start‑Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed‑loop control and supports compositional generalization of primitive‑level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine‑grained physical interaction and high‑level reasoning, paving the way toward scalable, interpretable, and general‑purpose embodied intelligence.

Abstract:
In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU‑QA, a new Visual Question‑Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU‑QA, we conduct the first comprehensive study of state‑of‑the‑art Vision‑Language Models (VLMs) under foresight‑oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU‑QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU‑QA can effectively enhance foresight reasoning: even small VLMs fine‑tuned on FSU‑QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU‑QA as a principled foundation for developing next‑generation models capable of truly anticipating and understanding future events.

Abstract:
Autonomous inspection in hazardous environments requires AI agents that can interpret high‑level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data and compute intensive. To address this, we propose a task‑specific latent dynamics model that learns state‑specific action‑induced shifts in a shared latent space using only goal‑state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain‑specific latent dynamics models for spatial alignment in autonomous inspection.
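
The idea of rolling out candidate actions in a latent space and picking the one that best approaches the goal can be sketched as below. This is a deliberately simplified illustration: the abstract describes learned, state‑specific shifts, whereas this sketch uses a fixed, state‑independent table, and the 2D latent, the action names, and all function names are hypothetical.

```python
import math

# Hypothetical stand-in for learned action-induced shifts: one fixed
# displacement per discrete action in a 2D latent space.
ACTION_SHIFTS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
    "stay": (0.0, 0.0),
}


def predict_next(latent, action):
    """Latent dynamics sketch: next latent = current latent + action shift."""
    dx, dy = ACTION_SHIFTS[action]
    return (latent[0] + dx, latent[1] + dy)


def choose_action(latent, goal):
    """Roll out every candidate action once; keep the one nearest the goal."""
    return min(
        ACTION_SHIFTS,
        key=lambda a: math.dist(predict_next(latent, a), goal),
    )


def plan(latent, goal, max_steps=10, tol=1e-9):
    """Greedy one-step lookahead until the goal latent is reached."""
    actions = []
    for _ in range(max_steps):
        if math.dist(latent, goal) <= tol:
            break
        action = choose_action(latent, goal)
        latent = predict_next(latent, action)
        actions.append(action)
    return actions


if __name__ == "__main__":
    print(plan((0.0, 0.0), (2.0, 0.0)))
```

In the centering task from the abstract, the goal latent would correspond to the detected object sitting at the center of the camera view.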

Abstract:
Current control algorithms for aerial robots struggle with robustness in dynamic environments and adverse conditions. Model‑based reinforcement learning (RL) has shown strong potential in handling these challenges while remaining sample‑efficient. Additionally, Dreamer has demonstrated that online model‑based RL can be achieved using a recurrent world model trained on replay buffer data. However, applying Dreamer to aerial systems has been quite challenging due to its sample inefficiency and poor generalization of dynamics models. Our work explores a physics‑informed approach to world model learning and improves policy performance. The world model treats the quadcopter as a free‑body system and predicts the net forces and moments acting on it, which are then passed through a 6‑DOF Runge‑Kutta integrator (RK4) to predict future state rollouts. In this paper, we compare this physics‑informed method to a standard RNN‑based world model. Although both models perform well on the training data, we observed that they fail to generalize to new trajectories, leading to rapid divergence in state rollouts, preventing policy convergence.
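
The rollout step described above (predicted net forces passed through an RK4 integrator) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it covers only translational point-mass dynamics and omits attitude and moments, and `gravity_only` is a hypothetical stand-in for the learned force predictor. All function names here are assumptions.

```python
from typing import Callable, List

State = List[float]  # [x, y, z, vx, vy, vz]


def rk4_step(deriv: Callable[[State], State], state: State, dt: float) -> State:
    """Advance `state` by one classical fourth-order Runge-Kutta step."""
    def shifted(scale: float, k: State) -> State:
        # state + scale * k, element-wise
        return [s + scale * ki for s, ki in zip(state, k)]

    k1 = deriv(state)
    k2 = deriv(shifted(dt / 2.0, k1))
    k3 = deriv(shifted(dt / 2.0, k2))
    k4 = deriv(shifted(dt, k3))
    return [
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def make_deriv(net_force: Callable[[State], List[float]], mass: float) -> Callable[[State], State]:
    """Build the state derivative from a force model (Newton's second law)."""
    def deriv(state: State) -> State:
        vx, vy, vz = state[3:]
        fx, fy, fz = net_force(state)
        return [vx, vy, vz, fx / mass, fy / mass, fz / mass]
    return deriv


MASS_KG = 1.0  # hypothetical vehicle mass


def gravity_only(state: State) -> List[float]:
    """Hypothetical stand-in for a learned net-force prediction."""
    return [0.0, 0.0, -9.81 * MASS_KG]


def rollout(state: State, deriv: Callable[[State], State], dt: float, steps: int) -> State:
    """Integrate forward for `steps` RK4 steps and return the final state."""
    for _ in range(steps):
        state = rk4_step(deriv, state, dt)
    return state
```

In a learned setting, `gravity_only` would be replaced by the network's force output, so any error in the predicted forces is integrated and compounds over the rollout horizon.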

Abstract:
While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target‑Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target‑Bench provides 450 robot‑collected scenarios spanning 47 semantic categories, with SLAM‑based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos with a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target‑approaching capability and directional consistency. Our evaluation shows that the best off‑the‑shelf model achieves an overall score of only 0.341, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine‑tuning on a relatively small real‑world robot dataset can significantly improve task‑level planning performance.

Abstract:
We introduce RynnVLA‑002, a unified Vision‑Language‑Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA‑002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA‑002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA‑002 in both simulation and real‑world robot tasks. RynnVLA‑002 achieves 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real‑world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.

Abstract:
World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel‑space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state‑of‑the‑art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation‑based world models.

Abstract:
Healthcare requires AI that is predictive, reliable, and data‑efficient. However, recent generative models lack the physical foundation and temporal reasoning required for clinical decision support. As scaling language models shows diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action‑conditioned representations that reflect the physical and causal structure of care. This paper reviews World Models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection‑transition modeling, and Joint Embedding Predictive Architecture (JEPA)‑style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action‑conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action‑conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1 to L2, with fewer instances of L3 and rare L4. We identify cross‑cutting gaps that limit clinical reliability: under‑specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory‑level uncertainty calibration. This review outlines a research agenda for clinically robust prediction‑first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundation for safe decision support in healthcare.

Abstract:
Vision‑Language‑Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post‑training strategy to overcome these limits, yet current VLA‑RL methods, including group‑based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self‑Referential Policy Optimization (SRPO), a novel VLA‑RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self‑reference. This allows us to assign a progress‑wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain‑specific fine‑tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state‑of‑the‑art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO‑Plus benchmark.
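
As a quick check of the arithmetic above, the quoted 103% figure is the relative gain of the final success rate over the supervised baseline. The helper name below is an illustrative assumption, not something from the paper.

```python
def relative_improvement_pct(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Success rates quoted in the abstract (percent).
libero_gain = relative_improvement_pct(48.9, 99.2)
print(f"{libero_gain:.1f}%")
```

The same formula cannot be checked for the LIBERO‑Plus figure, since the abstract does not state the baseline for that benchmark.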

Abstract:
Chest X‑ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X‑WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity‑guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X‑WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few‑shot fine‑tuning. X‑WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.

Abstract:
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi‑view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high‑quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generation of high‑quality, wide‑scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized through optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high‑fidelity and consistent novel views.

Abstract:
Observations have shown that planets similar to Neptune are rarely found orbiting Sun‑like stars with periods up to ~4 days, defining the so‑called Neptune desert region. Therefore, the detection of each individual planet in this region holds a high value, providing detailed insights into how such a population came to form and evolve. Here we report the detection of TOI‑333b, a Neptune desert planet with a mass, radius, and bulk density of 20.1 \pm 2.4 M_\oplus, 4.26 \pm 0.11 R_\oplus, and 1.42 \pm 0.21 g cm^{-3}, respectively. The planet orbits an F7V star every 3.78 d, whose mass, radius, and effective temperature are 1.2 \pm 0.1 M_\odot, 1.10 \pm 0.03 R_\odot, and 6241^{+73}_{-62} K, respectively. TOI‑333b is likely younger than 1 Gyr, which is supported by the presence of the doublet Li line around 6707.856 Å and its comparison to Li abundances in open clusters with well constrained ages. The planet is expected to host only 8.5^{+10.9}_{-8.3}% gas‑to‑core mass ratio for a H/He envelope. On the other hand, irradiated ocean world models predict 20^{+11}_{-10}% H_2O mass fraction with a core fraction of 35^{+20}_{-23}%. Therefore, we expect that TOI‑333b's internal composition may be dominated by a pure rocky composition with almost no H/He envelope, or a rocky world with almost equal mass fraction of water. Finally, TOI‑333b is more massive and larger than 77% and 82% of its Neptune desert counterparts, respectively, while its host ranks among the hottest known for Neptune desert planets, making this system a unique laboratory to study the evolution of such planets around hot stars.
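
The quoted bulk density can be cross-checked from the mass and radius, since density in Earth units is simply M / R^3. The sketch below assumes a mean Earth density of 5.514 g cm^-3 (a commonly cited value, not taken from the abstract) and ignores the measurement uncertainties; the function name is illustrative.

```python
EARTH_MEAN_DENSITY_G_CM3 = 5.514  # assumed reference value for Earth


def bulk_density_g_cm3(mass_earth: float, radius_earth: float) -> float:
    """Bulk density of a planet given mass and radius in Earth units."""
    if radius_earth <= 0:
        raise ValueError("radius must be positive")
    # rho / rho_earth = (M / M_earth) / (R / R_earth)^3
    return EARTH_MEAN_DENSITY_G_CM3 * mass_earth / radius_earth ** 3


# Central values quoted in the abstract for TOI-333b.
rho = bulk_density_g_cm3(20.1, 4.26)
print(f"{rho:.2f} g cm^-3")
```

With the central values this gives about 1.43 g cm^-3, consistent with the reported 1.42 \pm 0.21 g cm^-3.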

Abstract:
Large language models (LLMs) and vision language models (VLMs), such as DeepSeek R1, OpenAI o3, and Gemini 2.5 Pro, have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. However, spatial reasoning, a fundamental component of human cognition that includes mental rotation, navigation, and spatial relationship comprehension, remains a significant challenge for current advanced VLMs. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model. To test this hypothesis and systematically probe current VLM spatial reasoning mechanisms, we introduce SpatiaLite, a fully synthetic benchmark that jointly measures spatial reasoning accuracy and reasoning efficiency. Comprehensive experiments reveal three key findings. First, advanced VLMs predominantly rely on linguistic representations for reasoning and imagination, resulting in significant deficiencies on visual‑centric tasks that demand perceptual spatial relations and 3D geometry transformations such as mental rotation or projection prediction. Second, advanced VLMs exhibit severe inefficiency in their current spatial reasoning mechanisms, with token usage growing rapidly as transformation complexity increases. Third, we propose an Imagery Driven Framework (IDF) for data synthesis and training, which can implicitly construct an internal world model that is critical for spatial reasoning in VLMs. Building on SpatiaLite, this work delineates the spatial reasoning limits and patterns of advanced VLMs, identifies key shortcomings, and informs future advances.

Abstract:
We propose a Kardashev‑inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI‑0) to full artificial general intelligence (AAI‑4) and beyond. Unlike narrative ladders, our scale is multi‑axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self‑Revision, Sociality/Coordination, Embodiment, World‑Model Fidelity, Economic Throughput) aggregated by a composite AAI‑Index (a weighted geometric mean). We introduce a measurable Self‑Improvement Coefficient κ (capability growth per unit of agent‑initiated resources) and two closure properties (maintenance and expansion) that convert "self‑improving AI" into falsifiable criteria. We specify OWA‑Bench, an open‑world agency benchmark suite that evaluates long‑horizon, tool‑using, persistent agents. We define level gates for AAI‑0 through AAI‑4 using thresholds on the axes, κ, and closure proofs. Synthetic experiments illustrate how present‑day systems map onto the scale and how the delegability frontier (quality vs. autonomy) advances with self‑improvement. We also prove a theorem that an AAI‑3 agent becomes AAI‑5 over time under sufficient conditions, formalizing the intuition that a "baby AGI" becomes a superintelligence.
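
The composite index described above is a weighted geometric mean of per-axis scores. A minimal sketch of that aggregation follows; the score range, the example weights, and the function name are assumptions for illustration, since the abstract does not specify them.

```python
import math
from typing import Sequence


def weighted_geometric_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted geometric mean: exp(sum(w_i * ln x_i) / sum(w_i))."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    # A zero score on any weighted axis drives the whole index to zero.
    if any(s == 0 and w > 0 for s, w in zip(scores, weights)):
        return 0.0
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights) if w > 0)
    return math.exp(log_sum / sum(weights))


# Hypothetical per-axis scores in (0, 1] for three of the ten axes.
index = weighted_geometric_mean([0.8, 0.5, 0.9], [2.0, 1.0, 1.0])
```

Unlike an arithmetic mean, a geometric mean penalizes a system that is very weak on any single axis, which fits a scale where missing capabilities should not be averaged away.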

Abstract:
How do large language models solve spatial navigation tasks? We investigate this by training GPT‑2 models on three spatial learning paradigms in grid environments: passive exploration (the Foraging model, predicting steps in random walks), goal‑directed planning (generating optimal shortest paths) on structured Hamiltonian paths (SP‑Hamiltonian), and a hybrid model fine‑tuned with exploratory data (SP‑Random Walk). Using behavioural, representational and mechanistic analyses, we uncover two fundamentally different learned algorithms. The Foraging model develops a robust, map‑like representation of space, akin to a 'cognitive map'. Causal interventions reveal that it learns to consolidate spatial information into a self‑sufficient coordinate system, evidenced by a sharp phase transition where its reliance on historical direction tokens vanishes by the middle layers of the network. The model also adopts an adaptive, hierarchical reasoning system, switching between a low‑level heuristic for short contexts and map‑based inference for longer ones. In contrast, the goal‑directed models learn a path‑dependent algorithm, remaining reliant on explicit directional inputs throughout all layers. The hybrid model, despite demonstrating improved generalisation over its parent, retains the same path‑dependent strategy. These findings suggest that the nature of spatial intelligence in transformers may lie on a spectrum, ranging from generalisable world models shaped by exploratory data to heuristics optimised for goal‑directed tasks. We provide a mechanistic account of this generalisation‑optimisation trade‑off and highlight how the choice of training regime influences the strategies that emerge.

Abstract:
End‑to‑end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data‑driven approaches suffers due to the notorious long‑tail problem (i.e., rare but safety‑critical failure cases). In this work, we explore whether recent diffusion‑based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self‑correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM‑Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high‑fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM‑Agent. We integrate these components into our self‑correcting agentic system, CorrectAD. Importantly, our pipeline is end‑to‑end model‑agnostic and can be applied to improve any end‑to‑end planner. Evaluated on both nuScenes and a more challenging in‑house dataset across multiple end‑to‑end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.

Abstract:
Real‑world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well‑understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes‑no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual‑perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer‑regularization based fine‑tuning strategies that suppress the effect of the harmful layers.

Abstract:
Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long‑term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state‑of‑the‑art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion‑RNN approaches often suffer from performance degradation due to a training‑inference gap or a lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame‑wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.

Abstract:
Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low‑level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real‑world physical interactions. To address these limitations, we propose MTV‑World, an embodied world model that introduces Multi‑view Trajectory‑Video control for precise visuomotor prediction. Specifically, instead of directly using low‑level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian‑space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi‑view framework that compensates for spatial information loss and ensures high consistency with the physical world. MTV‑World forecasts future frames from multi‑view trajectory videos, conditioned on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto‑evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV‑World achieves precise control execution and accurate physical interaction modeling in complex dual‑arm scenarios.
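
The Jaccard Index mentioned above is the ratio of intersection to union of two regions. A minimal sketch for object location matching follows, assuming each object is given as a set of pixel coordinates (for instance from a segmentation mask). The representation, the helper names, and the convention for two empty masks are illustrative assumptions rather than details of the paper's pipeline.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def jaccard_index(predicted: Set[Pixel], reference: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets, in [0, 1]."""
    union = predicted | reference
    if not union:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return len(predicted & reference) / len(union)


def box_pixels(top: int, left: int, height: int, width: int) -> Set[Pixel]:
    """Hypothetical helper: pixel set covered by an axis-aligned box."""
    return {
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    }


# Object location in a generated frame vs. the ground-truth frame.
generated = box_pixels(0, 0, 2, 2)
ground_truth = box_pixels(0, 1, 2, 2)
score = jaccard_index(generated, ground_truth)
```

Here the two boxes share 2 of 6 covered pixels, so the score is 1/3; a score of 1.0 would mean the generated object sits exactly where the reference object does.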

Abstract:
World models have garnered substantial interest in the AI community. These are internal representations that simulate aspects of the external world, track entities and states, capture causal relationships, and enable prediction of consequences. This contrasts with representations based solely on statistical correlations. A key motivation behind this research direction is that humans possess such mental world models, and finding evidence of similar representations in AI models might indicate that these models "understand" the world in a human‑like way. In this paper, we use case studies from the philosophy of science literature to critically examine whether the world model framework adequately characterizes human‑level understanding. We focus on specific philosophical analyses where the distinction between world model capabilities and human understanding is most pronounced. While these represent particular views of understanding rather than universal definitions, they help us explore the limits of world models.

Abstract:
While NVIDIA remains the dominant provider of AI accelerators within cloud data centers, emerging vendors such as AMD, Intel, Mac, and Huawei offer cost‑effective alternatives with claims of compatibility and performance. This paper presents the first empirical study investigating divergence in machine learning models across heterogeneous AI accelerators. Utilizing an automated pipeline, we synthesize over 100,000 variant models derived from 4,000 real‑world models and execute them across five different enterprise‑grade accelerators. Our findings suggest that newer AI platforms from Mac and Huawei support at least 17% fewer operators than NVIDIA. These platforms also exhibit a higher rate of output discrepancies (exceeding 5%), which stem from differences in operator implementations, handling of exceptional numerical values, and instruction scheduling. They are also more susceptible to failures during model compilation‑based acceleration, and in some cases, the compiled models produce outputs that differ noticeably from those generated using the standard execution mode. In addition, we identify 7 implementation flaws in PyTorch and 40 platform‑specific issues across vendors. These results underscore the challenges of achieving consistent machine learning behavior in an increasingly diverse hardware ecosystem.

Abstract:
Training generalist policies for robotic manipulation has shown great promise, as they enable language‑conditioned, multi‑task behaviors across diverse scenarios. However, evaluating these policies remains difficult because real‑world testing is expensive, time‑consuming, and labor‑intensive. It also requires frequent environment resets and carries safety risks when deploying unproven policies on physical robots. Manually creating and populating simulation environments with assets for robotic manipulation has not addressed these issues, primarily due to the significant engineering effort required and the substantial sim‑to‑real gap, both in terms of physics and rendering. In this paper, we explore the use of action‑conditional video generation models as a scalable way to learn world models for policy evaluation. We demonstrate how to incorporate action conditioning into existing pre‑trained video generation models. This allows leveraging internet‑scale in‑the‑wild online videos during the pre‑training stage and alleviates the need for a large dataset of paired video‑action data, which is expensive to collect for robotic manipulation. Our paper examines the effect of dataset diversity, pre‑trained weights, and common failure cases for the proposed evaluation pipeline. Our experiments demonstrate that across various metrics, including policy ranking and the correlation between actual policy values and predicted policy values, these models offer a promising approach for evaluating policies without requiring real‑world interactions.

Abstract:
We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input‑output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object‑level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC‑AGI‑1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human‑like reasoning, advancing explainability, alignment, and generalizable intelligence.

Abstract:
World models enable robots to conduct counterfactual reasoning in physical environments by predicting future world states. While conventional approaches often prioritize pixel‑level reconstruction of future scenes, such detailed rendering is computationally intensive and unnecessary for planning tasks like navigation. We therefore propose that prediction and planning can be efficiently performed directly within a latent space of high‑level semantic representations. To realize this, we introduce the Representative Latent space Navigation World Model (ReL‑NWM). Rather than relying on reconstruction‑oriented latent embeddings, our method leverages a pre‑trained representation encoder, DINOv3, and incorporates specialized mechanisms to effectively integrate action signals and historical context within this representation space. By operating entirely in the latent domain, our model bypasses expensive explicit reconstruction and achieves highly efficient navigation planning. Experiments show state‑of‑the‑art trajectory prediction and image‑goal navigation performance on multiple benchmarks. Additionally, we demonstrate real‑world applicability by deploying the system on a Unitree G1 humanoid robot, confirming its efficiency and robustness in practical navigation scenarios.

Abstract:
Continuous‑time stochastic processes underlie many natural and engineered systems. In healthcare, autonomous driving, and industrial control, direct interaction with the environment is often unsafe or impractical, motivating offline reinforcement learning from historical data. However, there is limited statistical understanding of the approximation errors inherent in learning policies from offline datasets. We address this by linking reinforcement learning to the Hamilton‑Jacobi‑Bellman equation and proposing an operator‑theoretic algorithm based on a simple dynamic programming recursion. Specifically, we represent our world model in terms of the infinitesimal generator of controlled diffusion processes learned in a reproducing kernel Hilbert space. By integrating statistical learning methods and operator theory, we establish global convergence of the value function and derive finite‑sample guarantees with bounds tied to system properties such as smoothness and stability. Our theoretical and numerical results indicate that operator‑based approaches may hold promise in solving offline reinforcement learning using continuous‑time optimal control.
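
For orientation, the objects named above take the following standard textbook form for a controlled diffusion. This is a sketch under the assumption of a discounted infinite-horizon objective with discount rate \rho and running reward r; the paper's exact formulation and notation may differ.

```latex
% Controlled diffusion with drift b and diffusion coefficient \sigma
dX_t = b(X_t, a_t)\,dt + \sigma(X_t, a_t)\,dW_t

% Infinitesimal generator for a fixed action a
(\mathcal{L}^{a} f)(x) = b(x,a)^{\top} \nabla f(x)
  + \tfrac{1}{2}\,\mathrm{tr}\!\left(\sigma(x,a)\,\sigma(x,a)^{\top}\,\nabla^{2} f(x)\right)

% Hamilton-Jacobi-Bellman equation for the value function V
\rho\,V(x) = \sup_{a}\left\{\, r(x,a) + (\mathcal{L}^{a} V)(x) \,\right\}
```

Written this way, the dynamics enter the HJB equation only through the generator \mathcal{L}^{a}, which is why estimating that operator from offline data can serve as the world model.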

Abstract:
Offline‑to‑online reinforcement learning (O2O‑RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine‑tuning strategies. Our contributions are threefold: (1) a multi‑seed dynamics‑aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion‑based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni‑O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.

Abstract:
Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black‑box attacks focus on vector‑based or discrete‑action RL, and their effectiveness on image‑based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample‑efficient framework for black‑box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real‑world queries. Through a two‑stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black‑box and white‑box methods.

Abstract:
This study presents a generative optimization framework based on a guided denoising diffusion probabilistic model (DDPM) that leverages surrogate gradients to generate heat sink designs minimizing pressure drop while maintaining surface temperatures below a specified threshold. Geometries are represented using boundary representations of multiple fins, and a multi‑fidelity approach is employed to generate training data. Using this dataset, along with vectors representing the boundary representation geometries, we train a denoising diffusion probabilistic model to generate heat sinks with characteristics consistent with those observed in the data. We train two different residual neural networks to predict the pressure drop and surface temperature for each geometry. We use the gradients of these surrogate models with respect to the design variables to guide the geometry generation process toward satisfying the low‑pressure and surface temperature constraints. This inference‑time guidance directs the generative process toward heat sink designs that not only prevent overheating but also achieve lower pressure drops compared to traditional optimization methods such as CMA‑ES. In contrast to traditional black‑box optimization approaches, our method is scalable, provided sufficient training data is available. Unlike traditional topology optimization methods, once the model is trained and the heat sink world model is saved, inference under new constraints (e.g., temperature) is computationally inexpensive and does not require retraining. Samples generated using the guided diffusion model achieve pressure drops up to 10 percent lower than the limits obtained by traditional black‑box optimization methods. This work represents a step toward building a foundational generative model for electronics cooling.

Abstract:
Vision‑Language‑Action (VLA) models have shown strong potential for general‑purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self‑corrections. Reinforcement learning (RL) addresses these limitations through self‑improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World‑Model‑based Policy Optimization (WMPO), a principled framework for on‑policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel‑based predictions that align the "imagined" trajectories with the VLA features pretrained with web‑scale images. Crucially, WMPO enables the policy to perform on‑policy GRPO, which provides stronger performance than the often‑used off‑policy methods. Extensive experiments in both simulation and real‑robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self‑correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
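
The abstract above names GRPO but does not spell out how rollouts are scored. As a rough illustration only, the sketch below shows the group-relative advantage commonly associated with GRPO: each rollout's reward is normalized against the other rollouts in its own group. The function name, the binary success rewards, and the `eps` constant are assumptions made for this example, not details taken from WMPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    rewards: scalar returns of several rollouts that share a start state.
    eps: small constant (an assumption here) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four imagined rollouts with binary task success.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, with no separate value network; a group whose rollouts all tie contributes zero advantage.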

Abstract:
Predictive learning has emerged as a central paradigm for training models across diverse data domains and is increasingly viewed as a foundation for modern artificial intelligence. A common intuition for this success is that accurate prediction requires models to capture the underlying dynamics of the environment, leading to the emergence of structured world models. However, predictive learning does not universally yield such representations, and a mechanistic account of when and why it does remains incomplete. In this work, we identify the prediction horizon as a critical, but often implicit, component of predictive learning objectives. We show that increasing the prediction horizon fundamentally shapes the effective structure of the learning problem. In a minimal setting, we demonstrate both theoretically and empirically that the model's implicit biases interact with this structural change to recover the latent geometry of the task. We then extend these empirical results to nonlinear architectures and more complex datasets, where similar phenomena persist. These findings provide a principled explanation for the emergence of structured representations in predictive learning paradigms and clarify the conditions under which such representations should be expected.

Abstract:
A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in a prompt‑to‑full‑video manner, without the causal control, interactivity, or long‑horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D‑scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long‑horizon world model that predicts future world states through high‑quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture, which combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text‑based knowledge and enables conditioning on language‑specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, thereby unifying latent space reasoning (imagination) with realizable world dynamics (reality). Trained on large‑scale video‑action pairs spanning diverse domains, PAN supports open‑domain, action‑conditioned simulation with coherent, long‑term dynamics. Extensive experiments show that PAN achieves strong performance in action‑conditioned world simulation, long‑horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.

Abstract:
The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias. In this work, we critically examine these assumptions by analyzing ground‑truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks. We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state‑dependent, and (iii) whether local system dynamics change sparsely. Our results indicate that global sparsity is rare, but instead the tasks show local, state‑dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state‑dependent sparsity structure of real‑world dynamics.
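
One way to make the notion of state-dependent sparsity above concrete is to look at the Jacobian of a one-step dynamics function at different states and count near-zero entries. The sketch below does this with finite differences on a hypothetical toy system (a point mass with a spring-like floor contact); `toy_step`, the tolerance, and the step size are illustrative assumptions and are not taken from the MuJoCo Playground analysis.

```python
def finite_difference_jacobian(step, state, h=1e-6):
    """Approximate d step(state)[i] / d state[j] with forward differences."""
    base = step(state)
    jac = [[0.0] * len(state) for _ in base]
    for j in range(len(state)):
        bumped = list(state)
        bumped[j] += h
        out = step(bumped)
        for i in range(len(base)):
            jac[i][j] = (out[i] - base[i]) / h
    return jac


def sparsity(jac, tol=1e-4):
    """Fraction of Jacobian entries whose magnitude is below tol."""
    entries = [abs(v) for row in jac for v in row]
    return sum(v < tol for v in entries) / len(entries)


def toy_step(state, dt=0.01):
    """Hypothetical dynamics: (position, velocity) with a floor at position 0."""
    pos, vel = state
    force = -100.0 * pos if pos < 0 else 0.0  # contact spring acts only below the floor
    return [pos + dt * vel, vel + dt * force]


free_flight = sparsity(finite_difference_jacobian(toy_step, [1.0, 0.5]))
in_contact = sparsity(finite_difference_jacobian(toy_step, [-0.1, 0.5]))
```

In this toy case the next velocity does not depend on position during free flight, so one of the four Jacobian entries vanishes; during contact that dependency appears and the local causal graph becomes dense.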

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress in vision‑language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture‑specific modifications, and remain constrained by large‑scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D‑aware MLLM trained with RL to integrate structured spatial grounding with multi‑step reasoning. The model simulates human‑like spatial perception by constructing a scene graph of task‑relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA‑7K, a high‑quality spatial VQA dataset, and (2) online RL with a multi‑objective dense spatial reward enforcing spatial grounding. SpatialThinker‑7B outperforms supervised fine‑tuning and the sparse RL baseline on spatial understanding and real‑world VQA benchmarks, nearly doubling the base‑model gain compared to sparse RL, and surpassing GPT‑4o. These results showcase the effectiveness of combining spatial supervision with reward‑aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human‑level visual reasoning.

Abstract:
Transformers have shown strong ability to model long‑term dependencies and are increasingly adopted as world models in model‑based reinforcement learning (RL) under partial observability. However, unlike natural language corpora, RL trajectories are sparse and reward‑driven, making standard self‑attention inefficient because it distributes weight uniformly across all past tokens rather than emphasizing the few transitions critical for control. To address this, we introduce structured inductive priors into the self‑attention mechanism of the dynamics head: (i) per‑head memory‑length priors that constrain attention to task‑specific windows, and (ii) distributional priors that learn smooth Gaussian weightings over past state‑action pairs. We integrate these mechanisms into UniZero, a model‑based RL agent with a Transformer‑based world model that supports planning under partial observability. Experiments on the Atari 100k benchmark show that most efficiency gains arise from the Gaussian prior, which smoothly allocates attention to informative transitions, while memory‑length priors often truncate useful signals with overly restrictive cut‑offs. In particular, Gaussian Attention achieves a 77% relative improvement in mean human‑normalized scores over UniZero. These findings suggest that in partially observable RL domains with non‑stationary temporal dependencies, discrete memory windows are difficult to learn reliably, whereas smooth distributional priors flexibly adapt across horizons and yield more robust data efficiency. Overall, our results demonstrate that encoding structured temporal priors directly into self‑attention improves the prioritization of informative histories for dynamics modeling under partial observability.
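
The abstract does not give the exact parameterization of the Gaussian prior. A minimal sketch, assuming the prior is applied as an additive log-Gaussian bias on the attention logits as a function of look-back distance, could look like the following; `mu` and `sigma` stand in for per-head parameters that would be learned in practice, and all names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_prior_attention(scores, mu, sigma):
    """Attention weights of the newest query over T past tokens.

    scores: raw query-key logits; index 0 is the oldest token, T-1 the newest.
    mu, sigma: assumed per-head preferred look-back distance and spread,
               measured in time steps.
    """
    last = len(scores) - 1
    biased = []
    for t, s in enumerate(scores):
        distance = last - t  # how many steps back token t lies
        log_prior = -0.5 * ((distance - mu) / sigma) ** 2
        biased.append(s + log_prior)
    return softmax(biased)


# With uninformative (all-equal) logits, the prior alone shapes the weights.
weights = gaussian_prior_attention([0.0] * 6, mu=2.0, sigma=1.0)
```

Adding the log-prior to the logits is equivalent to multiplying the softmax weights by a Gaussian and renormalizing, so the weighting stays smooth rather than imposing a hard memory window.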

Abstract:
Model‑based reinforcement learning (MBRL) is a crucial approach to enhance the generalization capabilities and improve the sample efficiency of RL algorithms. However, current MBRL methods focus primarily on building world models for single tasks and rarely address generalization across different scenarios. Building on the insight that dynamics within the same simulation engine share inherent properties, we attempt to construct a unified world model capable of generalizing across different scenarios, named Meta‑Regularized Contextual World‑Model (MrCoM). This method first decomposes the latent state space into various components based on the dynamic characteristics, thereby enhancing the accuracy of world‑model prediction. Further, MrCoM adopts meta‑state regularization to extract a unified representation of scenario‑relevant information, and meta‑value regularization to align world‑model optimization with policy learning across diverse scenario objectives. We theoretically analyze the generalization error upper bound of MrCoM in multi‑scenario settings. We systematically evaluate our algorithm's generalization ability across diverse scenarios, demonstrating significantly better performance than previous state‑of‑the‑art methods.

Abstract:
Object‑centric world models (OCWM) aim to decompose visual scenes into object‑level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object‑level representations, by localizing task‑relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object‑centric world model that learns object‑level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out‑of‑distribution (OOD) visual variations. However, when used for downstream model‑based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent‑trajectory analyses, we identify representation shift during multi‑object interactions as a key driver of unstable policy learning. Our results suggest that, although object‑centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.

Abstract:
Wireless networks are undergoing a paradigm shift toward massive connectivity with energy‑efficient operation, driving the integration of satellite‑terrestrial architectures with simultaneous wireless information and power transfer (SWIPT). Optimizing transmit beamforming and power splitting in such systems faces formidable challenges, e.g., time‑varying channels and multi‑tier interference, which create a complex decision landscape where conventional model‑free multi‑agent reinforcement learning (MARL) suffers from sample inefficiency due to rarely‑encountered state transitions and poor coordination as decentralized agents act independently. This paper proposes the Decentralized World Model with Reasoning Offloading (DWM‑RO) framework to address these fundamental limitations. Specifically, each agent employs a world model to learn compact predictive representations of environment dynamics, enabling imagination‑based policy training that dramatically reduces required environment interactions. An uncertainty‑aware offloading gate monitors local interference levels and model reconstruction errors to trigger selective edge coordination. When activated, a lightweight latent decorrelation mechanism at the edge refines agents' strategic representations, guiding them toward orthogonal actions that minimize resource conflicts. Extensive simulations demonstrate that DWM‑RO converges 5 times faster than state‑of‑the‑art baselines while achieving 34.7% higher spectral efficiency and reducing constraint violations by 40%. In dense network scenarios with 10 users, DWM‑RO maintains violation rates below 20% while baselines exceed 70%, validating superior robustness.

Abstract:
This paper challenges a prevailing epistemological assumption in End‑to‑End Autonomous Driving: that high‑performance planning necessitates high‑fidelity world reconstruction. Inspired by cognitive science, we propose the Mental Bayesian Causal World Model (MBCWM) and instantiate it as the Tokenized Intent World Model (TIWM), a novel cognitive computing architecture. Its core philosophy posits that intelligence emerges not from pixel‑level objective fidelity, but from the Cognitive Consistency between the agent's internal intentional world and physical reality. By synthesizing von Uexküll's Umwelt theory, the neural assembly hypothesis, and the triple causal model (integrating symbolic deduction, probabilistic induction, and force dynamics) into an end‑to‑end embodied planning system, we demonstrate the feasibility of this paradigm on the nuPlan benchmark. Experimental results in open‑loop validation confirm that our Belief‑Intent Co‑Evolution mechanism effectively enhances planning performance. Crucially, in closed‑loop simulations, the system exhibits emergent human‑like cognitive behaviors, including map affordance understanding, free exploration, and self‑recovery strategies. We identify Cognitive Consistency as the core learning mechanism: during long‑term training, belief (state understanding) and intent (future prediction) spontaneously form a self‑organizing equilibrium through implicit computational replay, achieving semantic alignment between internal representations and physical world affordances. TIWM offers a neuro‑symbolic, cognition‑first alternative to reconstruction‑based planners, establishing a new direction: planning as active understanding, not passive reaction.

Abstract:
In this article, we investigate the proposed duality between the island and the defect extremal surface (DES) prescriptions using the fine‑grained entanglement entropy in Karch‑Randall (KR) brane‑world models with gravitating radiation baths. We consider the AdS_3 black string geometry and compute the entanglement entropy for radiation subsystems on an AdS_2 eternal black hole background using both the island and the DES prescriptions. We find an agreement between the two proposals for the island and the no‑island phases, thus verifying the validity of the proposed duality. We further extend to a T\bar{T} deformed AdS_3 black string geometry with a cut‑off and find consistent results for both phases. We finally plot and compare the Page curves for the undeformed and deformed scenarios, and discuss the modifications due to T\bar{T} deformation.

Abstract:
Cooperative multi‑agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi‑agent planning. Cooperation unfolds through a two‑phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step‑level alignment and enables higher‑level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block‑push tasks show that agents adapt across episodes: the dynamic world model captures reusable patterns and improves task completion rates and efficiency through negotiation and self‑refinement, trading a time overhead for evolving, more efficient collaboration strategies.

Abstract:
Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high‑temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert‑curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert‑formulated questions that probe deep understanding of the literature. We then evaluate six different LLM‑based systems for answering these questions, including both commercially available closed models and a custom retrieval‑augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well‑supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert‑formulated questions and the rubric will be valuable for assessing expert‑level performance of LLM‑based reasoning systems.

Abstract:
Research indicates that humans can mistakenly assume that robots and humans have the same field of view, possessing an inaccurate mental model of robots. This misperception may lead to failures during human‑robot collaboration tasks where robots might be asked to complete impossible tasks about out‑of‑view objects. The issue is more severe when robots do not have a chance to scan the scene to update their world model while focusing on assigned tasks. To help align humans' mental models of robots' vision capabilities, we propose four field‑of‑view indicators in augmented reality and conduct a human‑subjects experiment (N=41) to evaluate them in a collaborative assembly task regarding accuracy, confidence, task efficiency, and workload. These indicators span a spectrum of positions: two in the robot's eye and head space ‑‑ deepening the eye sockets and adding blocks to both sides of the eyes (i.e., egocentric), and two anchored in the robot's task space ‑‑ extending blocks from the sides of the eyes to the table and placing blocks directly on the table (i.e., allocentric). Results showed that the allocentric indicator placed directly in the task space yields the highest accuracy, although with a delay in interpreting the robot's field of view. When placed at the robot's eyes, the egocentric indicator of deeper eye sockets, which is also feasible as a physical alteration, increased accuracy as well. Across all indicators, participants' confidence was high while cognitive load remained low. Finally, we contribute six guidelines for practitioners to apply our augmented reality indicators or physical alterations to align humans' mental models with robots' vision capabilities.

Abstract:
Robots must understand their environment from raw sensory inputs and reason about the consequences of their actions in it to solve complex tasks. Behavior Cloning (BC) leverages task‑specific human demonstrations to learn this knowledge as end‑to‑end policies. However, these policies are difficult to transfer to new tasks, and generating training data is challenging because it requires careful demonstrations and frequent environment resets. In contrast to such a policy‑based view, in this paper we take a model‑based approach in which we collect a few hours of unstructured, easy‑to‑collect play data to learn an action‑conditioned visual world model, a diffusion‑based action sampler, and optionally a reward model. The world model ‑‑ in combination with the action sampler and a reward model ‑‑ is then used to optimize long sequences of actions with a Monte Carlo Tree Search (MCTS) planner. The resulting plans are executed on the robot via a zeroth‑order Model Predictive Controller (MPC). We show that the action sampler mitigates hallucinations of the world model during planning and validate our approach on 3 real‑world robotic tasks with varying levels of planning and modeling complexity. Our experiments support the hypothesis that planning leads to a significant improvement over BC baselines on a standard manipulation test environment.

Abstract:
Data‑driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data‑driven discovery. Given an open‑ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20‑cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.

Abstract:
We argue that sixth‑generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose ‑‑ to simulate future scenarios, weigh trade‑offs, and act with calibrated uncertainty. We reframe open radio access network (O‑RAN) near‑real‑time (Near‑RT) control via counterfactual dynamics and a world modeling (WM) paradigm that learns an action‑conditioned generative state space. This enables quantitative "what‑if" forecasting beyond large language models (LLMs) as the primary modeling primitive. Actions such as physical resource blocks (PRBs) are treated as first‑class control inputs in a causal world model, and both aleatoric and epistemic uncertainty are modeled for prediction and what‑if analysis. An agentic, model predictive control (MPC)‑based cross‑entropy method (CEM) planner operates over short horizons, using prior‑mean rollouts within data‑driven PRB bounds to maximize a deterministic reward. The model couples multi‑scale structured state‑space mixtures (MS3M) with a compact stochastic latent to form WM‑MS3M, summarizing key performance indicator (KPI) histories and predicting next‑step KPIs under hypothetical PRB sequences. On realistic O‑RAN traces, WM‑MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35‑80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3‑4.1x faster inference, enabling rare‑event simulation and offline policy screening.
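
To illustrate the planning loop described above, here is a minimal cross-entropy method (CEM) sketch over a short horizon of bounded scalar actions. It is not the WM-MS3M planner: `toy_reward` is a hypothetical stand-in for a deterministic reward computed from world-model rollouts, and the bounds, population size, and elite count are arbitrary assumptions.

```python
import random
from statistics import mean, pstdev


def cem_plan(rollout_reward, horizon, low, high,
             iters=20, pop=64, elite=8, seed=0):
    """Return the mean action sequence found by CEM within [low, high]."""
    rng = random.Random(seed)
    mu = [(low + high) / 2.0] * horizon
    sigma = [(high - low) / 2.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences and clip them to the bounds.
        candidates = [
            [min(high, max(low, rng.gauss(m, s))) for m, s in zip(mu, sigma)]
            for _ in range(pop)
        ]
        # Keep the highest-reward sequences and refit the sampling distribution.
        candidates.sort(key=rollout_reward, reverse=True)
        elites = candidates[:elite]
        mu = [mean(col) for col in zip(*elites)]
        sigma = [pstdev(col) + 1e-3 for col in zip(*elites)]
    return mu


def toy_reward(prbs, demand=30.0):
    """Hypothetical reward: penalize the gap between allocation and demand."""
    return -sum((a - demand) ** 2 for a in prbs)


plan = cem_plan(toy_reward, horizon=3, low=0.0, high=50.0)
first_action = plan[0]  # MPC applies only the first action, then replans
```

In an MPC setting the planner is re-run at every control step from the latest state, so only `first_action` is ever executed.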

Abstract:
Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research‑‑relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD‑Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non‑experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD‑Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language‑spatial mapping. Our extensive experiments with state‑of‑the‑art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts‑‑a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD‑Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.

Abstract:
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object‑centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object‑Centric World Model (FIOC‑WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC‑WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC‑WM first learns object‑centric latents and an interaction structure directly from pixels, leveraging pre‑trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied‑AI benchmarks, FIOC‑WM improves policy‑learning sample efficiency and generalization over world‑model baselines, indicating that explicit, modular interaction learning is crucial for robust control.

Abstract:
Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision‑making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond reactive control or simple replication of observed states. This motivates the development of world models as internal representations that encode environmental states, capture dynamics, and support prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, we go beyond prescribing a fixed definition and limiting our scope to methods explicitly labeled as world models. Instead, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a fully realized world model should possess. Building on this analysis, we aim to motivate further development toward generalizable and practical world models for robotics.

Abstract:
The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling‑‑which become partially‑observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision‑making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure‑parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
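
As a concrete instance of the discrete building block mentioned above, the sketch below implements standard forward filtering for a discrete HMM, i.e. the recursive computation of p(z_t | x_1..x_t). The two-state parameters are made up for illustration; indexing the transition matrix by an action would be one way to turn this into the belief update of a POMDP, which is an assumption about the construction rather than the paper's implementation.

```python
def hmm_forward(init, trans, emit, observations):
    """Filtered posteriors p(z_t | x_1..x_t) for a discrete HMM.

    init[z]: prior over the first hidden state.
    trans[i][j]: probability of moving from state i to state j.
    emit[z][x]: probability of observing symbol x in state z.
    """
    n = len(init)
    belief = [init[z] * emit[z][observations[0]] for z in range(n)]
    norm = sum(belief)
    belief = [b / norm for b in belief]
    history = [belief]
    for x in observations[1:]:
        # Predict: push the current belief through the transition model.
        predicted = [sum(belief[i] * trans[i][j] for i in range(n))
                     for j in range(n)]
        # Update: reweight by the likelihood of the new observation.
        belief = [predicted[j] * emit[j][x] for j in range(n)]
        norm = sum(belief)
        belief = [b / norm for b in belief]
        history.append(belief)
    return history


# Hypothetical two-state model with "sticky" transitions.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
posteriors = hmm_forward(init, trans, emit, [0, 0, 1])
```

Each step costs O(n^2) in the number of hidden states, which is part of what keeps such structured components cheap to run and easy to inspect.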

Abstract:
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high‑stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert‑curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four‑tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo‑3 model with a zero‑shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board‑certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo‑3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real‑world healthcare domains.

Abstract:
Learning cooperative multi‑agent policies directly from high‑dimensional, multimodal sensory inputs such as pixels and audio is notoriously sample‑inefficient. Model‑free Multi‑Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment's dynamics by fusing distributed, multimodal observations from all agents using a scalable attention‑based mechanism. Subsequently, we leverage this learned MWM as a fast, "imagined" simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi‑agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM‑MARL framework achieves orders‑of‑magnitude greater sample efficiency compared to state‑of‑the‑art model‑free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor‑dropout, a critical feature for real‑world deployment.

Abstract:
Cross‑embodiment learning seeks to build generalist robots that operate across diverse morphologies, but differences in action spaces and kinematics hinder data sharing and policy transfer. This raises a central question: Is there any invariance that allows actions to transfer across embodiments? We conjecture that environment dynamics are embodiment‑invariant, and that world models capturing these dynamics can provide a unified interface across embodiments. To learn such a unified world model, the crucial step is to design state and action representations that abstract away embodiment‑specific details while preserving control relevance. To this end, we represent different embodiments (e.g., human hands and robot hands) as sets of 3D particles and define actions as particle displacements, creating a shared representation for heterogeneous data and control problems. A graph‑based world model is then trained on exploration data from diverse simulated robot hands and real human hands, and integrated with model‑based planning for deployment on novel hardware. Experiments on rigid and deformable manipulation tasks reveal three findings: (i) scaling to more training embodiments improves generalization to unseen ones, (ii) co‑training on both simulated and real data outperforms training on either alone, and (iii) the learned models enable effective control on robots with varied degrees of freedom. These results establish world models as a promising interface for cross‑embodiment dexterous manipulation.
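
The shared representation described above can be pictured with a small sketch: each embodiment is a list of 3D particles, and an action is the per-particle displacement between two configurations. The function names and the toy coordinates are hypothetical and only illustrate the idea; the paper's actual data structures are not specified in the abstract.

```python
def displacement_action(current, target):
    """Per-particle displacement (dx, dy, dz) that moves `current` onto `target`."""
    return [tuple(t - c for c, t in zip(p, q)) for p, q in zip(current, target)]


def apply_action(particles, action):
    """Move every particle by its displacement and return the new particle set."""
    return [tuple(c + d for c, d in zip(p, a)) for p, a in zip(particles, action)]


# Two toy "embodiments" with different particle counts (made-up coordinates).
two_finger_gripper = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
three_finger_hand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 1.0, 0.0)]

# Lifting either embodiment by 0.5 along z yields actions in the same format:
# a list of 3-vectors, one per particle.
lift_gripper = displacement_action(
    two_finger_gripper, [(x, y, z + 0.5) for x, y, z in two_finger_gripper]
)
lift_hand = displacement_action(
    three_finger_hand, [(x, y, z + 0.5) for x, y, z in three_finger_hand]
)
```

Because both actions are plain lists of displacements, a single dynamics model over particles can in principle consume either one, regardless of the joint structure that produced the motion.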

Abstract:
Continual Learning (CL) methods have traditionally focused on mitigating catastrophic forgetting through gradient‑based retraining, an approach ill‑suited for deployed agents that must adapt in real time. We introduce our Adaptive Teaching and Learning System (ATLAS), a dual‑agent architecture that decouples reasoning (Teacher) from execution (Student) and incorporates a persistent learning memory that stores distilled guidance from experience. This informs the orchestration layer, enabling the system to dynamically adjust its operational strategies, such as supervision level or initial plan selection, at inference time. In doing so, ATLAS achieves gradient‑free continual learning, shifting the locus of adaptation from model parameters to system‑level orchestration. We formulate this as a system‑centric paradigm for continual learning, where the objective is adaptive efficiency: maximizing task success while minimizing computational cost through inference‑time orchestration rather than parameter updates. Evaluated on Microsoft's ExCyTIn‑Bench, an open‑source benchmark simulating complex cyberthreat investigation, ATLAS achieves 54.1% success with GPT‑5‑mini as its Student, outperforming the larger GPT‑5 (High) by 13% while reducing cost by 86%. Cross‑incident validation demonstrates generalization: frozen pamphlets from Incident #5 improve accuracy from 28% to 41% with zero retraining, while shifting output composition from verbose exploration to structured reasoning. Together, these findings establish gradient‑free continual learning as a viable path toward adaptive, deployable AI systems and provide causally annotated traces valuable for training explicit world models.

Abstract:
Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi‑stage pipelines. In this work, we propose URDF‑Anything, an end‑to‑end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF‑Anything utilizes an autoregressive prediction framework based on point‑cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine‑grained part‑level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real‑world datasets demonstrate that our method significantly outperforms existing approaches regarding geometric segmentation (mIoU 17% improvement), kinematic parameter prediction (average error reduction of 29%), and physical executability (surpassing baselines by 50%). Notably, our method exhibits excellent generalization ability, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing the sim‑to‑real transfer capability.

Abstract:
Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real‑world traffic complexity and dynamics. This study introduces a novel single‑agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi‑agent systems through a centralized decision‑making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real‑time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model's effectiveness: under inference scenarios with multi‑level (10%, 20%, 30%) Origin‑Destination (OD) demand fluctuations, the framework exhibits robust anti‑fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.

Abstract:
While world models are increasingly positioned as a pathway to overcoming data scarcity in domains such as robotics, open training infrastructure for world modeling remains nascent. We introduce Jasmine, a performant JAX‑based world modeling codebase that scales from single hosts to hundreds of accelerators with minimal code changes. Jasmine achieves an order‑of‑magnitude faster reproduction of the CoinRun case study compared to prior open implementations, enabled by performance optimizations across data loading, training and checkpointing. The codebase guarantees fully reproducible training and supports diverse sharding configurations. By pairing Jasmine with curated large‑scale datasets, we establish infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.

Abstract:
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high‑fidelity modeling of deterministic scenarios (such as fixed‑map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through a diagnostic experiment, we quantitatively demonstrate that high‑fidelity cloning is feasible and that the primary bottleneck for long‑horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying the temporal contrastive learning principle as a geometric regularizer can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically‑Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.
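
To make the idea of temporal contrastive regularization concrete, here is a minimal sketch using a generic InfoNCE loss in which each latent's temporal successor is the positive and all other frames are negatives. This is an assumed, textbook-style formulation, not the GRWM loss itself; the cosine similarity, the temperature value, and the function names are illustrative choices.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: small when the positive is the closest candidate."""
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def temporal_contrastive_loss(latents, temperature=0.1):
    """Average InfoNCE over a trajectory: z[t+1] is the positive for z[t]."""
    losses = []
    for t in range(len(latents) - 1):
        negatives = [z for i, z in enumerate(latents) if i not in (t, t + 1)]
        losses.append(info_nce(latents[t], latents[t + 1], negatives, temperature))
    return sum(losses) / len(losses)
```

In a setup like the one the abstract describes, a term of this kind would be added to the autoencoder's reconstruction loss so that temporally adjacent observations map to nearby latents; the relative weighting of the two terms is a design choice the abstract does not state.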

Abstract:
Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two‑stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co‑adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but doing so is non‑trivial and prone to representational collapse. In this work, we propose CoLA‑World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm‑up phase that effectively aligns the representations of the from‑scratch LAM with the pretrained world model. This unlocks a co‑evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high‑quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA‑World matches or outperforms prior two‑stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.

Abstract:
What happens at spacetime singularities is poorly understood. The Penrose‑Wall singularity theorem constrains possible scenarios, but until recently its key assumption‑‑the generalized second law (GSL)‑‑had only been proven perturbatively, severely limiting this application. We highlight that recent progress enables a proof of the GSL in holographic brane‑world models, valid non‑perturbatively at the species scale cG (with c the number of matter fields and G Newton's constant). This enables genuine constraints: an outer‑trapped surface in the Einstein gravity regime implies geodesic incompleteness non‑perturbatively at the species scale. Conversely, any genuine resolution must evade Penrose's criteria. We illustrate both possibilities with explicit examples: the classical BTZ black hole evolves to a more severe singularity, while a null singularity on the Rindler horizon is resolved, both by species‑scale effects. Subject to the GSL, these constraints on singularity resolution apply beyond brane‑worlds: namely, in any theory with a geometric UV scale‑‑roughly, where the metric remains well‑defined but classical Einstein gravity breaks down.

Abstract:
Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high‑dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under‑explored critical states and synthesis of dynamics‑consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion‑based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one‑step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off‑policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.

Abstract:
Semantic communication is a promising technique for emerging wireless applications, which reduces transmission overhead by transmitting only task‑relevant features instead of raw data. However, existing methods struggle under extremely low bandwidth and varying channel conditions, where corrupted or missing semantics lead to severe reconstruction errors. To resolve this difficulty, we propose a world foundation model (WFM)‑aided semantic video transmission framework that leverages the predictive capability of WFMs to generate future frames based on the current frame and textual guidance. This design allows transmissions to be omitted when predictions remain reliable, thereby saving bandwidth. Through the WFM's prediction, the key semantics are preserved, yet minor prediction errors tend to amplify over time. To mitigate this issue, a lightweight depth‑based feedback module is introduced to determine whether transmission of the current frame is needed. Apart from transmitting the entire frame, a segmentation‑assisted partial transmission method is proposed to repair degraded frames, which can further balance performance and bandwidth cost. Furthermore, an active transmission strategy is developed for mobile scenarios by exploiting camera trajectory information and proactively scheduling transmissions before channel quality deteriorates. Simulation results show that the proposed framework significantly reduces transmission overhead while maintaining task performance across varying scenarios and channel conditions.
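
The transmit-or-skip decision mentioned above can be sketched as a simple threshold test on depth maps. The sketch assumes the sender can compare the depth of the WFM-predicted frame against the depth of the frame it actually captured; the mean-absolute-error metric, the threshold value, and the function names are hypothetical, since the abstract does not describe the module's internals.

```python
def mean_abs_depth_error(predicted_depth, reference_depth):
    """Mean absolute difference between two equally sized depth maps (lists of rows)."""
    total, count = 0.0, 0
    for pred_row, ref_row in zip(predicted_depth, reference_depth):
        for pred, ref in zip(pred_row, ref_row):
            total += abs(pred - ref)
            count += 1
    return total / count


def should_transmit(predicted_depth, reference_depth, threshold=0.5):
    """Request a fresh frame only when the predicted depth has drifted too far."""
    return mean_abs_depth_error(predicted_depth, reference_depth) > threshold


# Toy 2x2 depth maps in arbitrary units (made-up values).
captured = [[1.0, 2.0], [3.0, 4.0]]
good_prediction = [[1.0, 2.0], [3.0, 4.5]]
drifted_prediction = [[2.0, 3.0], [4.0, 5.0]]

skip = should_transmit(good_prediction, captured)      # small error, keep predicting
resend = should_transmit(drifted_prediction, captured)  # large error, send the frame
```

A lower threshold trades bandwidth for fidelity: frames are retransmitted more often, so prediction errors have less time to accumulate.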

Abstract:
We present a framework for uncovering and exploiting dependencies among tools and documents to enhance exemplar artifact generation. Our method begins by constructing a tool knowledge graph from tool schemas, including descriptions, arguments, and output payloads, using a DeepResearch‑inspired analysis. In parallel, we derive a complementary knowledge graph from internal documents and SOPs, which is then fused with the tool graph. To generate exemplar plans, we adopt a deep‑sparse integration strategy that aligns structural tool dependencies with procedural knowledge. Experiments demonstrate that this unified framework effectively models tool interactions and improves plan generation, underscoring the benefits of linking tool graphs with domain knowledge graphs for tool‑augmented reasoning and planning.

Abstract:
We present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi‑turn interactive diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction‑tuned models trained on static data, our method acquires diagnostic strategies through dynamic exploration and outcome‑based feedback, mapping evolving patient states to the next optimal examination and subsequent diagnosis. Our contributions include: (i) DiagGym, a diagnostics world model trained with electronic health records, serving as a virtual clinical environment to support closed‑loop in‑silico training and evaluation for interactive diagnosis; (ii) DiagAgent, trained via end‑to‑end multi‑turn RL to learn dynamic diagnostic policies that optimize both interactive effectiveness and final accuracy; (iii) DiagBench, a multi‑center diagnostic benchmark designed to evaluate multi‑turn diagnostic interaction trajectories, comprising 2.2K physician‑validated cases sourced from 4 distinct distributions alongside 3.3K physician‑written rubrics for granular process‑oriented evaluation; and (iv) extensive evaluations demonstrating DiagAgent's superior performance across both in‑domain and out‑of‑domain (OOD) settings. DiagAgent significantly outperforms 11 SOTA LLMs and 2 prompt‑engineered agents. In the end‑to‑end setting, it delivers an 11.20% increase in diagnostic accuracy and a 17.58% boost in examination recommendation F1 score, while consistently maintaining SOTA performance across all three external centers. Furthermore, in rubric‑based evaluations, it surpasses the next‑best model by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers long‑term diagnostic management abilities unattainable through passive training.

Abstract:
Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model‑free RL (MFRL) and model‑based RL (MBRL) are data inefficient and short‑sighted. Such RL‑based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long‑term planning requirements. To address these limitations, in this paper, a novel dual‑mind world model‑based learning framework is proposed with the goal of optimizing completeness‑weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual‑mind world model encompasses a pattern‑driven System 1 component and a logic‑driven System 2 component to learn the dynamics and logic of the wireless network, and to provide long‑term link scheduling over reliable imagined trajectories. Link scheduling is learned through end‑to‑end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason about network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real‑world physical channels, ray tracing, and scene objects with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency, as well as strong generalization and adaptation to unseen environments, compared to the state‑of‑the‑art RL baselines and to the world model approach with only System 1.

Abstract:
The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling a verbose, raw DOM into a compact, task‑relevant representation or world model optimized for an agent's reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.

Abstract:
Social robot navigation increasingly relies on large language models for reasoning, path planning, and enabling movement in dynamic human spaces. However, relying solely on LLMs for planning often leads to unpredictable and unsafe behaviors, especially in dynamic human spaces, due to limited physical grounding and weak logical consistency. In this work, we introduce NaviWM, a socially‑aware robot Navigation World Model that augments LLM reasoning with a structured world model and a logic‑driven chain‑of‑thought process. NaviWM consists of two main components: (1) a spatial‑temporal world model that captures the positions, velocities, and activities of agents in the environment, and (2) a deductive reasoning module that guides LLMs through a multi‑step, logic‑based inference process. This integration enables the robot to generate navigation decisions that are both socially compliant and physically safe, under well‑defined constraints such as personal space, collision avoidance, and timing. Unlike previous methods based on prompting or fine‑tuning, NaviWM encodes social norms as first‑order logic, enabling interpretable and verifiable reasoning. Experiments show that NaviWM improves success rates and reduces social violations, particularly in crowded environments. These results demonstrate the benefit of combining formal reasoning with LLMs for robust social navigation. Additional experimental details and demo videos for this work can be found at: https://sites.google.com/view/NaviWM.

Abstract:
Autonomous robotic navigation in real‑world environments requires exploration to acquire environmental information as well as goal‑directed navigation in order to reach specified targets. Active inference (AIF) based on the free‑energy principle provides a unified framework for these behaviors by minimizing the expected free energy (EFE), thereby combining epistemic and extrinsic values. To realize this practically, we propose a deep AIF framework that integrates a diffusion policy as the policy model and a multiple timescale recurrent state‑space model (MTRSSM) as the world model. The diffusion policy generates diverse candidate actions while the MTRSSM predicts their long‑horizon consequences through latent imagination, enabling action selection that minimizes EFE. Real‑world navigation experiments demonstrated that our framework achieved higher success rates and fewer collisions compared with the baselines, particularly in exploration‑demanding scenarios. These results highlight how AIF based on EFE minimization can unify exploration and goal‑directed navigation in real‑world robotic settings.

Abstract:
Generated networks are widely used in network‑based research as a convenient simulation environment. Generating universal networks that more accurately reflect real‑world patterns is a cornerstone task. This study proposes a vari‑linear network generation model that incorporates two core mechanisms: exponential probabilistic growth and vari‑linear preferential attachment. It concurrently overcomes the limitations of traditional growth in characterizing the low‑degree region of the degree distribution and the issues regarding the universality of linear preferential attachment. Results indicate that our model describes real‑world networks more comprehensively and faithfully, and is highly interpretable. Its performance on diverse empirical datasets is several times better than traditional methods. Related mechanisms and conclusions are substantiated through ablation experiments and statistical analysis. Notably, it achieves a unified interpretation of previously isolated classical network characteristics. This work not only provides a higher‑quality universal network generation method, but also bridges the boundaries between traditional concepts, thereby promoting substantive progress in the "world model" of networks.

Abstract:
In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). Compared to the conventional Model‑Free Reinforcement Learning (MFRL) scheme, Model‑Based RL (MBRL) first learns a generative world model for subsequent planning. The reuse of historical experience in MBRL promises more stable training behavior, yet its deployment in large‑scale wireless networks remains challenging due to high‑dimensional stochastic dynamics, strong inter‑agent cooperation, and communication constraints. To overcome these challenges, we propose the Multi‑Agent Conditional Diffusion Model Planner (MA‑CDMP) for decentralized communication resource management. Built upon the Distributed Training with Decentralized Execution (DTDE) paradigm, MA‑CDMP models each communication node as an autonomous agent and employs Diffusion Models (DMs) to capture and predict environment dynamics. Meanwhile, an inverse dynamics model guides action generation, thereby enhancing sample efficiency and policy scalability. Moreover, to approximate large‑scale agent interactions, a Mean‑Field (MF) mechanism is introduced as an assistance to the classifier in DMs. This design mitigates inter‑agent non‑stationarity and enhances cooperation with minimal communication overhead in distributed settings. We further theoretically establish an upper bound on the distributional approximation error introduced by the MF‑based diffusion generation, guaranteeing convergence stability and reliable modeling of multi‑agent stochastic dynamics. Extensive experiments demonstrate that MA‑CDMP consistently outperforms existing MARL baselines in terms of average reward and QoS metrics, showcasing its scalability and practicality for real‑world wireless network optimization.

Abstract:
Large Language Model (LLM) web agents often struggle with long‑horizon web navigation and web task completion in new websites, producing inefficient action sequences unless fine‑tuned on environment‑specific data. We show that experience‑driven memory, combined with look‑ahead action simulation, is sufficient for LLM agents to adapt to unseen web environments by remembering past failures and predicting the consequences of future actions. We introduce WebATLAS (Actor‑Critic Task‑completion with Look‑ahead Action Simulation), a memory‑augmented LLM web agent that learns a lightweight internal model of the environment from interaction experience and performs hypothetical action rollouts before acting in the real world. WebATLAS builds a persistent cognitive map via curiosity‑driven exploration, stores interaction outcomes as experience‑based memory, and evaluates candidate actions in cognitive space using a planner‑simulator‑critic loop. This enables the agent to reuse past experience, avoid previously unsuccessful behaviors, and generate more efficient plans. We evaluate WebATLAS on the WebArena‑Lite benchmark for autonomous web navigation and demonstrate a success rate of 63%, outperforming the previous state‑of‑the‑art at 53.9%. Unlike previous systems, our modular architecture requires no website‑specific LLM fine‑tuning. Ablation studies confirm that experience‑driven memory, look‑ahead action simulation, and hierarchical replanning play complementary roles in enabling robust, training‑free web agents.

Abstract:
Reliable assessment of safe landing sites in unstructured environments is essential for deploying Unmanned Aerial Vehicles (UAVs) in real‑world applications such as delivery, inspection, and surveillance. Existing learning‑based approaches often degrade under covariate shift and offer limited transparency, making their decisions difficult to interpret and validate on resource‑constrained platforms. We present NeuroSymLand, a neuro‑symbolic framework for marker‑free UAV landing site safety assessment that explicitly separates perception‑driven world modeling from logic‑based safety reasoning. A lightweight segmentation model incrementally constructs a probabilistic semantic scene graph encoding objects, attributes, and spatial relations. Symbolic safety rules, synthesized offline via large language models with human‑in‑the‑loop refinement, are executed directly over this world model at runtime to perform white‑box reasoning, producing ranked landing candidates with human‑readable explanations of the underlying safety constraints. Across 72 simulated and hardware‑in‑the‑loop landing scenarios, NeuroSymLand achieves 61 successful assessments, outperforming four competitive baselines, which achieve between 37 and 57 successes. Qualitative analysis highlights its superior interpretability and transparent reasoning, while deployment incurs negligible edge overhead. Our results suggest that combining explicit world modeling with symbolic reasoning can support accurate, interpretable, and edge‑deployable safety assessment in mobile systems, as demonstrated through UAV landing site assessment.

Abstract:
Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat‑Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high‑quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat‑Video supports Text‑to‑Video, Image‑to‑Video, and Video‑Continuation tasks with a single model; Long video generation: Pretraining on Video‑Continuation tasks enables LongCat‑Video to maintain high quality and temporal coherence in the generation of minutes‑long videos; Efficient inference: LongCat‑Video generates 720p, 30fps videos within minutes by employing a coarse‑to‑fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi‑reward RLHF: Multi‑reward RLHF training enables LongCat‑Video to achieve performance on par with the latest closed‑source and leading open‑source models. Code and model weights are publicly available to accelerate progress in the field.

Abstract:
Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety‑critical scenarios known as corner cases. Existing models often underperform in these situations due to an over‑representation of common scenes in training data and limited generalization capabilities. To address this limitation, we present WM‑MoE, the first world model‑based motion forecasting framework that unifies perception, temporal memory, and decision making to address the challenges of high‑risk corner‑case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long‑horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM's feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture‑of‑experts (MoE) is introduced to decompose complex corner cases into subproblems and allocate capacity across scenario types, and a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes‑corner, a new benchmark that comprises four real‑world corner‑case scenarios for rigorous evaluation. Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) show that WM‑MoE consistently outperforms state‑of‑the‑art (SOTA) baselines and remains robust under corner‑case and data‑missing conditions, indicating the promise of world model‑based architectures for robust and generalizable motion forecasting in fully autonomous vehicles.

Abstract:
This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State‑of‑the‑art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL‑based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state‑of‑the‑art video generative model MAGI‑1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA‑2) to guide the generation process. We show that by leveraging VJEPA‑2 as a reward signal, we can improve the physics plausibility of state‑of‑the‑art video generative models by ~6%.

Abstract:
Interactive world models that simulate object dynamics are crucial for robotics, VR, and AR. However, it remains a significant challenge to learn physics‑consistent dynamics models from limited real‑world video data, especially for deformable objects with spatially‑varying physical properties. To overcome the challenge of data scarcity, we propose PhysWorld, a novel framework that utilizes a simulator to synthesize physically plausible and diverse demonstrations to learn efficient world models. Specifically, we first construct a physics‑consistent digital twin within an MPM simulator via constitutive model selection and global‑to‑local optimization of physical properties. Subsequently, we apply part‑aware perturbations to the physical properties and generate various motion patterns for the digital twin, synthesizing extensive and diverse demonstrations. Finally, using these demonstrations, we train a lightweight GNN‑based world model that is embedded with physical properties. The real video can be used to further refine the physical properties. PhysWorld achieves accurate and fast future predictions for various deformable objects, and also generalizes well to novel interactions. Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state‑of‑the‑art method, PhysTwin.

Abstract:
In reinforcement learning (RL) theory, the concept of most confusing instances is central to establishing regret lower bounds, that is, the minimal exploration needed to solve a problem. Given a reference model and its optimal policy, a most confusing instance is the statistically closest alternative model that makes a suboptimal policy optimal. While this concept is well‑studied in multi‑armed bandits and ergodic tabular Markov decision processes, constructing such instances remains an open question in the general case. In this paper, we formalize this problem for neural network world models as a constrained optimization: finding a modified model that is statistically close to the reference one, while producing divergent performance between optimal and suboptimal policies. We propose an adversarial training procedure to solve this problem and conduct an empirical study across world models of varying quality. Our results suggest that the degree of achievable confusion correlates with uncertainty in the approximate model, which may inform theoretically‑grounded exploration strategies for deep model‑based RL.

Abstract:
World models, which explicitly learn environmental dynamics to lay the foundation for planning, reasoning, and decision‑making, are rapidly advancing in predicting both physical dynamics and aspects of social behavior, yet predominantly in separate silos. This division results in a systemic failure to model the crucial interplay between physical environments and social constructs, rendering current models fundamentally incapable of adequately addressing the true complexity of real‑world systems where physical and social realities are inextricably intertwined. This position paper argues that the systematic, bidirectional unification of physical and social predictive capabilities is the next crucial frontier for world model development. We contend that comprehensive world models must holistically integrate objective physical laws with the subjective, evolving, and context‑dependent nature of social dynamics. Such unification is paramount for AI to robustly navigate complex real‑world challenges and achieve more generalizable intelligence. This paper substantiates this imperative by analyzing core impediments to integration, proposing foundational guiding principles (ACE Principles), and outlining a conceptual framework alongside a research roadmap towards truly holistic world models.

Abstract:
Causal representation learning (CRL) has emerged as a powerful unsupervised framework that (i) disentangles the latent generative factors underlying high‑dimensional data, and (ii) learns the cause‑and‑effect interactions among the disentangled variables. Despite extensive recent advances in identifiability and some practical progress, a substantial gap remains between theory and real‑world practice. This paper takes a step toward closing that gap by bringing CRL to robotics, a domain that has motivated CRL. Specifically, this paper addresses the well‑defined problem of robot pose estimation (the recovery of position and orientation from raw images) by introducing Robotic Pose Estimation via Score‑Based CRL (ROPES). Being an unsupervised framework, ROPES embodies the essence of interventional CRL by identifying those generative factors that are actuated: images are generated by intrinsic and extrinsic latent factors (e.g., joint angles, arm/limb geometry, lighting, background, and camera configuration) and the objective is to disentangle and recover the controllable latent variables, i.e., those that can be directly manipulated (intervened upon) through actuation. Interventional CRL theory shows that variables that undergo variations via interventions can be identified. In robotics, such interventions arise naturally by commanding actuators of various joints and recording images under varied controls. Empirical evaluations in semi‑synthetic manipulator experiments demonstrate that ROPES successfully disentangles latent generative factors with high fidelity with respect to the ground truth. Crucially, this is achieved by leveraging only distributional changes, without using any labeled data. The paper also includes a comparison with a baseline based on a recently proposed semi‑supervised framework. This paper concludes by positioning robot pose estimation as a near‑practical testbed for CRL.

Abstract:
World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next‑frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build general‑purpose models that can answer many different questions about an environment, including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple environment‑level queries, that is, questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the effects of interventions) that no single rollout distribution determines. Collectively, they assess model generality across query types. We instantiate WorldTest as AutumnBench, a benchmark of 43 interactive grid‑world environments and 129 tasks across three query families for both humans and learning agents. Experiments with 517 human participants and five frontier models show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating. AutumnBench provides a framework for evaluating world‑model learning in grid‑world environments with environment‑level queries, and WorldTest provides a template for extending such evaluations to richer domains.

Abstract:
Training Vision‑Language‑Action (VLA) models for generalist robots typically requires large‑scale real‑world robot data, which is expensive and time‑consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain‑0, a novel VLA foundation model empowered by world model‑generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain‑0 significantly reduces reliance on real robot data while improving cross‑task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain‑of‑Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long‑horizon dependencies during task execution. This leads to substantial gains in real‑world performance on dexterous, long‑horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain‑0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain‑0‑Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

Abstract:
Uncertainty‑aware robot motion prediction is crucial for downstream traversability estimation and safe autonomous navigation in unstructured, off‑road environments, where terrain is heterogeneous and perceptual uncertainty is high. Most existing methods assume deterministic or spatially independent terrain uncertainties, ignoring the inherent local correlations of 3D spatial data and often producing unreliable predictions. In this work, we introduce an efficient probabilistic framework that explicitly models spatially correlated aleatoric uncertainty over terrain parameters as a probabilistic world model and propagates this uncertainty through a differentiable physics engine for probabilistic trajectory forecasting. By leveraging structured convolutional operators, our approach provides high‑resolution multivariate predictions at manageable computational cost. Experimental evaluation on a publicly available dataset shows significantly improved uncertainty estimation and trajectory prediction accuracy over aleatoric uncertainty estimation baselines.

Abstract:
Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi‑agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real‑world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM‑AP (Social World Model‑Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically modeling agents' behavior to enhance mechanism design. Specifically, the social world model infers agents' traits from their interaction trajectories and learns a trait‑based model to predict agents' responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents' traits online during real‑world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM‑AP outperforms established model‑based and model‑free RL baselines in cumulative rewards and sample efficiency.

Abstract:
We investigate how embedding dimension affects the emergence of an internal "world model" in a transformer trained with reinforcement learning to perform bubble‑sort‑style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves representation quality in addition to end performance. We release our metrics and analyses, which can be used to probe similar algorithmic tasks.
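
The second mechanism described above, choosing the swap at the largest adjacent difference of the encoded values, can be illustrated with a small hand-written rule. This is only a sketch under stated assumptions: `pick_swap`, `sort_by_swaps`, and the `encode` argument are hypothetical names, the encoding is assumed to increase with token value, and nothing here reproduces the trained transformer itself.

```python
def pick_swap(scores):
    """Return i for the adjacent pair (i, i+1) with the largest drop, or None if sorted."""
    if len(scores) < 2:
        return None
    # Assumption: a larger score means a larger token, so a positive drop
    # scores[i] - scores[i+1] marks an out-of-order adjacent pair.
    best = max(range(len(scores) - 1), key=lambda i: scores[i] - scores[i + 1])
    return best if scores[best] > scores[best + 1] else None


def sort_by_swaps(tokens, encode=float):
    """Repeatedly apply the selected adjacent swap until no inversion remains."""
    tokens = list(tokens)
    swaps = []
    while (i := pick_swap([encode(t) for t in tokens])) is not None:
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        swaps.append(i)
    return tokens, swaps
```

Each selected swap removes exactly one inversion, so the loop terminates with a sorted list; any monotonic `encode` leaves the sorted result unchanged, although the order of swaps may differ.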

Abstract:
A key challenge in training Vision‑Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task‑dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn‑level supervision for accurate state prediction, and introduce Bi‑Level General Advantage Estimation (Bi‑Level GAE) for turn‑aware credit assignment. Through this form of visual state reasoning, a 3B‑parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT‑5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi‑turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen‑ai.github.io.
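
For background on the credit-assignment step, the sketch below computes standard single-level Generalized Advantage Estimation over one trajectory. It is not the Bi‑Level variant named above, whose details the abstract does not give; the function name `gae` and the default `gamma` and `lam` values are illustrative assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` this reduces to the one-step TD residual, and with `lam=1` it becomes the discounted return minus the value baseline.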

Abstract:
End‑to‑end autonomous driving systems increasingly rely on vision‑centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR‑WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR‑WM first establishes a robust bird's‑eye‑view (BEV) representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego‑vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting‑planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR‑WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

Abstract:
World Models have vastly permeated the field of Reinforcement Learning. Their ability to model the transition dynamics of an environment has greatly improved sample efficiency in online RL. Among them, the best‑known example is Dreamer, a model that learns to act in a diverse set of image‑based environments. In this paper, we leverage similarity search and stochastic representations to approximate a world model without a training procedure. We establish a comparison with PlaNet, a well‑established world model of the Dreamer family. We evaluate the models on the quality of latent reconstruction and on the perceived similarity of the reconstructed image, on both next‑step and long‑horizon dynamics prediction. The results of our study demonstrate that a search‑based world model is comparable to a training‑based one in both cases. Notably, our model shows stronger performance in long‑horizon prediction with respect to the baseline on a range of visually different environments.
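
The idea of approximating a world model by search rather than training can be sketched as a nearest-neighbour lookup over stored transitions. This is a minimal illustration, not the paper's method: the class `SearchWorldModel`, the Euclidean distance, and the deterministic lookup are all assumptions, and the stochastic representations mentioned in the abstract are left out.

```python
import math


class SearchWorldModel:
    """Predict the next latent state by looking up the closest stored transition."""

    def __init__(self):
        self.keys = []         # concatenated (state, action) vectors
        self.next_states = []  # next state recorded for each key

    def add(self, state, action, next_state):
        self.keys.append(tuple(state) + tuple(action))
        self.next_states.append(tuple(next_state))

    def predict(self, state, action):
        if not self.keys:
            raise ValueError("no transitions stored")
        query = tuple(state) + tuple(action)
        # Similarity search: Euclidean nearest neighbour over the stored keys.
        idx = min(range(len(self.keys)), key=lambda i: math.dist(self.keys[i], query))
        return self.next_states[idx]

    def rollout(self, state, actions):
        """Long-horizon prediction by feeding each prediction back in as the next state."""
        trajectory = []
        for action in actions:
            state = self.predict(state, action)
            trajectory.append(state)
        return trajectory
```

No parameters are fitted here; adding more transitions is the only way the model improves, which is the sense in which such an approach needs no training procedure.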

Abstract:
We propose Grid‑like Code Quantization (GCQ), a brain‑inspired method for compressing observation‑action sequences into discrete representations using grid‑like patterns in attractor dynamics. Unlike conventional vector quantization approaches that operate on static inputs, GCQ performs spatiotemporal compression through an action‑conditioned codebook, where codewords are derived from continuous attractor neural networks and dynamically selected based on actions. This enables GCQ to jointly compress space and time, serving as a unified world model. The resulting representation supports long‑horizon prediction, goal‑directed planning, and inverse modeling. Experiments across diverse tasks demonstrate GCQ's effectiveness in compact encoding and downstream performance. Our work offers both a computational tool for efficient sequence modeling and a theoretical perspective on the formation of grid‑like codes in neural systems.

Abstract:
Open‑world machine learning (OWML) aims to develop intelligent systems capable of recognizing known categories, rejecting unknown samples, and continually learning from novel information. Despite significant progress in open‑set recognition, novelty detection, and continual learning, the field still lacks a unified theoretical foundation that can quantify uncertainty, characterize information transfer, and explain learning adaptability in dynamic, nonstationary environments. This paper presents a comprehensive review of information‑theoretic approaches in open‑world machine learning, emphasizing how core concepts such as entropy, mutual information, and Kullback‑Leibler divergence provide a mathematical language for describing knowledge acquisition, uncertainty suppression, and risk control under open‑world conditions. We synthesize recent studies into three major research axes: information‑theoretic open‑set recognition enabling safe rejection of unknowns, information‑driven novelty discovery guiding new concept formation, and information‑retentive continual learning ensuring stable long‑term adaptation. Furthermore, we discuss theoretical connections between information theory and provable learning frameworks, including PAC‑Bayes bounds, open‑space risk theory, and causal information flow, to establish a pathway toward provable and trustworthy open‑world intelligence. Finally, the review identifies key open problems and future research directions, such as the quantification of information risk, development of dynamic mutual information bounds, multimodal information fusion, and integration of information theory with causal reasoning and world model learning.
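
For reference, the three core quantities named above have the following textbook definitions for discrete random variables. The notation is the standard one and is not taken from the review itself.

```latex
\begin{align}
H(X) &= -\sum_{x} p(x)\,\log p(x)
  && \text{(entropy)} \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
  && \text{(Kullback-Leibler divergence)} \\
I(X;Y) &= \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
  = H(X) - H(X \mid Y)
  && \text{(mutual information)}
\end{align}
```

Mutual information is the Kullback-Leibler divergence between the joint distribution and the product of its marginals, which is why it vanishes exactly when X and Y are independent.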

Abstract:
Simulating human reasoning in open‑ended tasks has long been a central aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population‑level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human‑like reasoning in machines, we introduce HugAgent (Human‑Grounded Agent Benchmark), which rethinks human reasoning simulation along three dimensions: (i) from averaged to individualized reasoning, (ii) from behavioral mimicry to cognitive alignment, and (iii) from vignette‑based to open‑ended data. The benchmark evaluates whether a model can predict a specific person's behavioral responses and the underlying reasoning dynamics in out‑of‑distribution scenarios, given partial evidence of their prior views. HugAgent adopts a dual‑track design: a human track that automates and scales the think‑aloud method to collect ecologically valid human reasoning data, and a synthetic track for further scalability and systematic stress testing. This architecture enables low‑cost, extensible expansion to new tasks and populations. Experiments with state‑of‑the‑art language models reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. The benchmark, along with its complete data collection pipeline and companion chatbot, is open‑sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace‑your‑thinking).

Abstract:
Large Language Models (LLMs) as agents often struggle in out‑of‑distribution (OOD) scenarios. Real‑world environments are complex and dynamic, governed by task‑specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k (the probability that at least one of k sampled trajectories succeeds) drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model‑based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision‑making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold‑starts the policy via a Self‑Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world‑modeling baseline and greatly boosts the RL‑based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5‑1.5B‑Instruct model.
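
The Pass@k metric mentioned above can be estimated from a batch of sampled trajectories. The sketch below uses the combinatorial estimator that is common in code-generation evaluation; the abstract does not say which estimator the authors use, so treat this choice, and the name `pass_at_k`, as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled trajectories, c of which succeeded.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 succeeded, `pass_at_k(4, 1, 2)` gives 0.5, since 3 of the 6 possible pairs contain the successful sample.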

Abstract:
Digital twin worlds with realistic interactive dynamics present a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry‑agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry‑agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.

Abstract:
Training robust world models requires large‑scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production‑tested GAZE pipeline that automates the conversion of raw, long‑form video into rich, task‑ready supervision for world‑model training. Our system (i) normalizes proprietary 360‑degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre‑annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto‑skipping of low‑salience segments. By increasing label density and consistency while integrating privacy safeguards and chain‑of‑custody metadata, our method generates high‑fidelity, privacy‑aware datasets directly consumable for learning cross‑modal dynamics and action‑conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high‑quality world model training data without sacrificing throughput or governance.

Abstract:
Autonomous drone racing (ADR) systems have recently achieved champion‑level performance, yet remain highly specific to drone racing. While end‑to‑end vision‑based methods promise broader applicability, no system to date simultaneously achieves full sim‑to‑real transfer, onboard execution, and champion‑level performance. In this work, we present SkyDreamer, to the best of our knowledge, the first end‑to‑end vision‑based ADR policy that maps directly from pixel‑level representations to motor commands. SkyDreamer builds on informed Dreamer, a model‑based reinforcement learning approach where the world model decodes to privileged information only available during training. By extending this concept to end‑to‑end vision‑based ADR, the world model effectively functions as an implicit state and parameter estimator, greatly improving interpretability. SkyDreamer runs fully onboard without external aid, resolves visual ambiguities by tracking progress using the state decoded from the world model's hidden state, and requires no extrinsic camera calibration, enabling rapid deployment across different drones without retraining. Real‑world experiments show that SkyDreamer achieves robust, high‑speed flight, executing tight maneuvers such as an inverted loop, a split‑S and a ladder, reaching speeds of up to 21 m/s and accelerations of up to 6 g. It further demonstrates a non‑trivial visual sim‑to‑real transfer by operating on poor‑quality segmentation masks, and exhibits robustness to battery depletion by accurately estimating the maximum attainable motor RPM and adjusting its flight path in real‑time. These results highlight SkyDreamer's adaptability to important aspects of the reality gap, bringing robustness while still achieving extremely high‑speed, agile flight.

Abstract:
What are the physical requirements for agency? We investigate whether a purely quantum system (one evolving unitarily in a coherent regime without decoherence or collapse) can satisfy three minimal conditions for agency: an agent must be able to create a world‑model, use it to evaluate the likely consequences of alternative actions, and reliably perform the action that maximizes expected utility. We show that the first two conditions conflict with the no‑cloning theorem, which forbids copying unknown quantum states: world‑model construction requires copying information from the environment, and deliberation requires copying the world‑model to assess multiple actions. Approximate cloning strategies do not permit sufficient fidelity or generality for agency to be viable in purely quantum systems. The third agency condition also fails due to the linearity of quantum dynamics. These results imply four key consequences. First, agency requires significant classical resources, placing clear constraints on its physical basis. Second, they provide insight into how classical agents emerge within a quantum universe. Third, they show that quantum computers cannot straightforwardly simulate agential behavior without significant classical components. Finally, they challenge quantum theories of agency, free will, and consciousness.
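
The no-cloning argument invoked above can be stated in a few lines. This is the standard textbook derivation, included for orientation; it is not the paper's own proof of the agency results.

```latex
% Assume a single unitary U copies every state onto a blank register:
\begin{align}
U\,\bigl(\lvert\psi\rangle \otimes \lvert 0\rangle\bigr) &= \lvert\psi\rangle \otimes \lvert\psi\rangle, \\
U\,\bigl(\lvert\phi\rangle \otimes \lvert 0\rangle\bigr) &= \lvert\phi\rangle \otimes \lvert\phi\rangle.
\end{align}
% Unitaries preserve inner products, so pairing the two lines gives
\begin{equation}
\langle\psi\vert\phi\rangle\,\langle 0\vert 0\rangle = \langle\psi\vert\phi\rangle^{2}
\quad\Longrightarrow\quad
\langle\psi\vert\phi\rangle \in \{0, 1\}.
\end{equation}
```

So a single unitary can copy only states that are identical or mutually orthogonal, which is why copying an unknown environment state, or an unknown world-model state, is ruled out.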

Abstract:
Scaling Vision‑Language‑Action (VLA) models on large‑scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low‑dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA‑W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self‑supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real‑time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in‑house dataset demonstrate that DriveVLA‑W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.

Abstract:
Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well‑defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, "deep" analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on‑policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE‑57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.

Abstract:
Large Language Models (LLMs) can serve as world models to enhance agent decision‑making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial‑and‑error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long‑horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models‑‑future state prediction and reward estimation‑‑through three tasks: next‑state identification, full‑procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full‑procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval‑augmented World Model (R‑WoM), which grounds LLM simulations by incorporating factual, up‑to‑date knowledge retrieved from external tutorials. Experiments show that R‑WoM achieves relative improvements of up to 23.4% and 16.3% on the subsets of OSWorld and Webarena compared to baselines, with particular advantage in longer‑horizon simulations.

Abstract:
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization‑based planners struggle with contact complexity, while on‑policy reinforcement learning (RL) is sample‑inefficient and has limited multi‑task ability. We propose a framework combining a learned world model with sampling‑based Model Predictive Control (MPC), trained on a demonstration‑free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact‑aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height‑limited arches, with improved sample efficiency and multi‑task capability over on‑policy RL. Deployed on a physical humanoid, our system achieves robust, real‑time contact planning from proprioception and ego‑centric depth images. Code and dataset are available at our website: https://ego‑vcp.github.io/

Abstract:
Recent Text‑to‑Video (T2V) models have demonstrated powerful capability in visual simulation of real‑world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two‑stage paradigm to adapt pre‑trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre‑trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint‑agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid‑condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre‑trained T2V model, by taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in the real world.

Abstract:
The seamless integration of physical and digital environments in Cyber‑Physical Systems (CPS), particularly within Industry 4.0, presents significant challenges stemming from system heterogeneity and complexity. Traditional approaches often rely on rigid, data‑centric solutions like co‑simulation frameworks or brittle point‑to‑point middleware bridges, which lack the semantic richness and flexibility required for intelligent, autonomous coordination. This report introduces the Knowledge Graph‑Enhanced Multi‑Agent Infrastructure (KG‑MAS) to address these limitations. KG‑MAS leverages a centralized Knowledge Graph (KG) as a dynamic, shared world model, providing a common semantic foundation for a Multi‑Agent System (MAS). Autonomous agents, representing both physical and digital components, query this KG for decision‑making and update it with real‑time state information. The infrastructure features a model‑driven architecture which facilitates the automatic generation of agents from semantic descriptions, thereby simplifying system extension and maintenance. By abstracting away underlying communication protocols and providing a unified, intelligent coordination mechanism, KG‑MAS offers a robust, scalable, and flexible solution for coupling heterogeneous physical and digital robotic environments.

Abstract:
Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real‑world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to rollout within imagination space. However, a key challenge is building a controllable world model that can handle multi‑step interactions with generalist robot policies. This requires a world model compatible with modern generalist policies by supporting multi‑view prediction, fine‑grained action control, and consistent long‑horizon interactions, which is not achieved by previous works. In this paper, we make a step forward by introducing a controllable multi‑view world model that can be used to evaluate and improve the instruction‑following ability of generalist robot policies. Our model maintains long‑horizon consistency with a pose‑conditioned memory retrieval mechanism and achieves precise action control through frame‑level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real‑world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine‑tuning, our approach can improve policy success by 44.7%.

Abstract:
Vision‑and‑Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory‑persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory‑persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed‑horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision‑making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language‑conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint‑Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience‑augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory‑persistent VLN benchmarks with 10 distinct testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory‑persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination‑guided paradigm.

Abstract:
The recent rapid advancement of Text‑to‑Video (T2V) generation technologies is endowing trained models with stronger world‑modeling ability, making existing benchmarks increasingly insufficient for evaluating state‑of‑the‑art T2V models. First, current evaluation dimensions, such as per‑frame aesthetic quality and temporal consistency, are no longer able to differentiate state‑of‑the‑art T2V models. Second, event‑level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark focusing on evaluating whether current T2V models can understand complex temporal causality and world knowledge to synthesize videos. We collect representative videos across diverse domains and extract their event‑level descriptions with inherent temporal causality, which are then rewritten into text‑to‑video prompts by independent annotators. For each prompt, we design ten evaluation dimensions covering dynamic and static properties, resulting in 300 prompts, 815 events, and 793 evaluation questions. Consequently, a human preference‑aligned QA‑based evaluation pipeline is developed by using modern vision‑language models to systematically benchmark leading open‑ and closed‑source T2V systems, revealing the current gap between T2V models and desired world modeling abilities.

Abstract:
While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek‑R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model‑enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi‑ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.

Abstract:
The application of advanced generative artificial intelligence in education is often constrained by the lack of real‑time adaptability, personalization, and reliability of the content. To address these challenges, we propose ExpertAgent, an intelligent agent framework designed for personalized education that provides reliable knowledge and enables highly adaptive learning experiences. ExpertAgent offers users a proactive and personalized learning experience: it dynamically plans the learning content and strategy based on a continuously updated student model, thereby overcoming the limitations of traditional static learning content and providing optimized teaching strategies and learning experiences in real time. All instructional content is grounded in a validated curriculum repository, effectively reducing hallucination risks in large language models and improving reliability and trustworthiness.

Abstract:
Coordinating heterogeneous robot teams from free‑form natural‑language instructions is hard. Language‑only planners struggle with long‑horizon coordination and hallucination, while purely formal methods require closed‑world models. We present FLEET, a hybrid decentralized framework that turns language into optimized multi‑robot schedules. An LLM front‑end produces (i) a task graph with durations and precedence and (ii) a capability‑aware robot‑‑task fitness matrix; a formal back‑end solves a makespan‑minimization problem while the underlying robots execute their free‑form subtasks with agentic closed‑loop control. Across multiple free‑form language‑guided autonomy coordination benchmarks, FLEET improves success over state‑of‑the‑art generative planners on two‑agent teams across heterogeneous tasks. Ablations show that mixed integer linear programming (MILP) primarily improves temporal structure, while LLM‑derived fitness is decisive for capability‑coupled tasks; together they deliver the highest overall performance. We demonstrate the translation to real‑world challenges with hardware trials using a pair of quadruped robots with disjoint capabilities.
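
To make the scheduling step concrete, the following is a minimal sketch of makespan minimization over a task graph and a robot-task fitness matrix. It is not FLEET's solver: the abstract describes a MILP back-end, whereas this toy enumerates all assignments by brute force and schedules tasks in one fixed topological order, so it is only exact for small, simple instances. The function names, the task and robot names, the convention that a fitness of 0 means "not capable", and the tie-break on total fitness are all assumptions made for illustration.

```python
from itertools import product


def topo_order(durations, preds):
    """Return the tasks in an order where every predecessor comes first."""
    order, seen = [], set()

    def visit(task, stack=()):
        if task in seen:
            return
        if task in stack:
            raise ValueError("precedence graph contains a cycle")
        for p in preds.get(task, []):
            visit(p, stack + (task,))
        seen.add(task)
        order.append(task)

    for task in durations:
        visit(task)
    return order


def schedule(durations, preds, fitness):
    """Brute-force the task-to-robot assignment with the smallest makespan.

    durations: {task: duration}
    preds:     {task: [tasks that must finish first]}
    fitness:   {robot: {task: score}}; a score of 0 is treated as "cannot do"
               (hypothetical convention, not taken from the paper).
    """
    tasks = topo_order(durations, preds)
    robots = list(fitness)
    best = None
    for choice in product(robots, repeat=len(tasks)):
        assign = dict(zip(tasks, choice))
        if any(fitness[r].get(t, 0) <= 0 for t, r in assign.items()):
            continue  # skip assignments a robot is not capable of
        free = {r: 0.0 for r in robots}  # time at which each robot is idle
        finish = {}
        for t in tasks:  # predecessors are always scheduled before t
            r = assign[t]
            start = max([free[r]] + [finish[p] for p in preds.get(t, [])])
            finish[t] = start + durations[t]
            free[r] = finish[t]
        makespan = max(finish.values())
        total_fit = sum(fitness[r][t] for t, r in assign.items())
        key = (makespan, -total_fit)  # shortest first, then best fitted
        if best is None or key < best[0]:
            best = (key, assign, makespan)
    if best is None:
        raise ValueError("no feasible assignment")
    return best[1], best[2]


# Toy instance: two robots with disjoint capabilities.
durations = {"open_door": 2, "carry_box": 3, "inspect": 4}
preds = {"carry_box": ["open_door"]}
fitness = {
    "arm_robot": {"open_door": 1.0, "carry_box": 0.8, "inspect": 0.0},
    "cam_robot": {"open_door": 0.0, "carry_box": 0.0, "inspect": 0.9},
}
assignment, makespan = schedule(durations, preds, fitness)
```

A real back-end would replace the enumeration with a MILP formulation, which scales to larger teams and can also optimize the task order on each robot.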

Abstract:
Wrist‑view observations are crucial for VLA models as they capture fine‑grained hand‑object interactions that directly enhance manipulation performance. Yet large‑scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist‑view first frame and thus fail to generate wrist‑view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross‑view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist‑view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist‑view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist‑view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state‑of‑the‑art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor‑wrist view gap.

Abstract:
World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open‑source benchmark of real‑world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan‑2.2 TI2V‑5B to video‑state‑conditioned future frame prediction. We condition the video generation on robot states using AdaLN‑Zero, and further post‑train the model using LoRA. For the compression track, we train a Spatio‑Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top‑500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
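
For readers unfamiliar with the sampling-track metric, here is a minimal sketch of peak signal-to-noise ratio (PSNR) in dB between a predicted and a reference frame, using the standard definition 10 * log10(MAX^2 / MSE). The flattened-list frame representation and the default peak value of 255 are assumptions; the challenge's exact evaluation protocol (data range, resolution, averaging over frames) is not stated in the abstract.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    pred, target: flat sequences of pixel intensities.
    max_value:    largest possible intensity (assumed 255 for 8-bit images).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and have the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a prediction closer to the reference frame; identical frames give an infinite PSNR.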

Abstract:
Safe control techniques, such as Hamilton‑Jacobi reachability, provide principled methods for synthesizing safety‑preserving robot policies but typically assume hand‑designed state spaces and full observability. Recent work has relaxed these assumptions via latent‑space safe control, where state representations and dynamics are learned jointly through world models that reconstruct future high‑dimensional observations (e.g., RGB images) from current observations and actions. This enables safety constraints that are difficult to specify analytically (e.g., spilling) to be framed as classification problems in latent space, allowing controllers to operate directly from raw observations. However, these methods assume that safety‑critical features are observable in the learned latent state. We ask: when are latent state spaces sufficient for safe control? To study this, we examine temperature‑based failures, comparable to overheating in cooking or manufacturing tasks, and find that RGB‑only observations can produce myopic safety behaviors, e.g., avoiding seeing failure states rather than preventing failure itself. To predict such behaviors, we introduce a mutual information‑based measure that identifies when observations fail to capture safety‑relevant features. Finally, we propose a multimodal‑supervised training strategy that shapes the latent state with additional sensory inputs during training, but requires no extra modalities at deployment, and validate our approach in simulation and on hardware with a Franka Research 3 manipulator preventing a pot of wax from overheating.

Abstract:
Transferability estimation metrics are used to find a high‑performing pre‑trained model for a given target task without fine‑tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset‑agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real‑world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.

Abstract:
Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end‑to‑end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out‑of‑distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost‑effective alternative to real‑world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.

Abstract:
The rapid progress in embodied artificial intelligence has highlighted the necessity for more advanced and integrated models that can perceive, interpret, and predict environmental dynamics. In this context, World Models (WMs) have been introduced to provide embodied agents with the abilities to anticipate future environmental states and fill in knowledge gaps, thereby enhancing agents' ability to plan and execute actions. However, when dealing with embodied agents it is fundamental to ensure that predictions are safe for both the agent and the environment. In this article, we conduct a comprehensive literature review of World Models in the domains of autonomous driving and robotics, with a specific focus on the safety implications of scene and control generation tasks. Our review is complemented by an empirical analysis, wherein we collect and examine predictions from state‑of‑the‑art models, identify and categorize common faults (herein referred to as pathologies), and provide a quantitative evaluation of the results.

Abstract:
The reasoning abilities of Large Language Models (LLMs) are increasingly being applied to classical board and card games, but the dominant approach ‑‑ involving prompting for direct move generation ‑‑ has significant drawbacks. It relies on the model's implicit, fragile pattern‑matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: We use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model ‑‑ comprising functions for state transition, legal move enumeration, and termination checks ‑‑ serves as a verifiable simulation engine for high‑performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient), and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: The generated code world model (CWM) serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: We combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: We direct the LLM to focus on the meta‑task of data‑to‑code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. 5 of the games are fully observed (perfect information), and 5 are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
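
As an illustration of what such an executable world model might look like, here is a hand-written sketch for a tiny Nim variant (take 1 to 3 objects, whoever takes the last one wins). In the paper the code is synthesized by the LLM; the game, the function names, and the state encoding below are hypothetical. A flat random-rollout planner stands in for MCTS to keep the example short.

```python
import random

# State: (objects left in the pile, index of the player to move).


def legal_moves(state):
    """Enumerate the moves allowed in this state."""
    pile, _player = state
    return [k for k in (1, 2, 3) if k <= pile]


def transition(state, move):
    """Apply a move and hand the turn to the other player."""
    pile, player = state
    if move not in legal_moves(state):
        raise ValueError(f"illegal move {move} in state {state}")
    return (pile - move, 1 - player)


def is_terminal(state):
    """The game ends when the pile is empty."""
    return state[0] == 0


def winner(state):
    """The player who took the last object, i.e. the one NOT to move."""
    return 1 - state[1]


def rollout(state, rng):
    """Play uniformly random legal moves until the game ends."""
    while not is_terminal(state):
        state = transition(state, rng.choice(legal_moves(state)))
    return winner(state)


def plan(state, n_rollouts=500, seed=0):
    """Pick the move with the best random-rollout win rate (not full MCTS)."""
    rng = random.Random(seed)
    me = state[1]

    def win_rate(move):
        nxt = transition(state, move)
        wins = sum(rollout(nxt, rng) == me for _ in range(n_rollouts))
        return wins / n_rollouts

    return max(legal_moves(state), key=win_rate)
```

Because the planner only calls `legal_moves`, `transition`, and `is_terminal`, it cannot propose an illegal move as long as those functions encode the rules correctly, which is the verifiability argument made in the abstract.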

Abstract:
The computational role of imagination remains debated. While classical accounts emphasize reward maximization, emerging evidence suggests it accesses internal world models (IWMs). We employ psychological network analysis to compare IWMs in humans and large language models (LLMs) via imagination vividness ratings, distinguishing offline world models (persistent memory structures accessed independent of immediate goals) from online models (task‑specific representations). Analyzing 2,743 humans across three populations and six LLM variants, we find human imagination networks exhibit robust structural consistency, with high centrality correlations and aligned clustering. LLMs show minimal clustering and weak correlations with human networks, even with conversational memory, across environmental and sensory contexts. These differences highlight disparities in how biological and artificial systems organize internal representations. Our framework offers quantitative metrics for evaluating offline world models in cognitive agents.

Abstract:
We introduce GDPval, a benchmark evaluating AI model capabilities on real‑world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improve model performance on GDPval. Finally, we open‑source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real‑world model capabilities.

Abstract:
To address the dual challenges of inherent stochasticity and non‑differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model‑Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high‑fidelity future states, enabling an "imagination‑based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search‑based planning algorithm that leverages non‑differentiable domain metrics as reward signals to explore high‑return future sequences. These identified high‑reward candidates then serve as pseudo‑labels to continuously optimize the agent's policy through iterative self‑training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
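
The planning loop described above can be sketched as a beam search over candidate futures scored by a non-differentiable metric. This is a toy stand-in, not the SFP implementation: `propose` replaces the generative world model with a fixed set of increments, the reward is a simple threshold hit count chosen for illustration, and all names and constants are assumptions.

```python
def beam_search(initial, propose, reward, horizon, beam_width):
    """Keep the `beam_width` highest-reward sequences at every step."""
    beams = [[initial]]
    for _ in range(horizon):
        candidates = [seq + [nxt] for seq in beams for nxt in propose(seq[-1])]
        candidates.sort(key=reward, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Hypothetical observed values for steps 1..3 and an "extreme event" threshold.
TRUTH = [1.0, 2.0, 3.0]
THRESHOLD = 1.5


def propose(state):
    """Stand-in for sampling candidate next states from a world model."""
    return [state, state + 1.0, state + 2.0]


def reward(seq):
    """Non-differentiable metric: steps where the event flag matches the truth."""
    return sum((p >= THRESHOLD) == (t >= THRESHOLD)
               for p, t in zip(seq[1:], TRUTH))


best = beam_search(0.0, propose, reward, horizon=3, beam_width=2)
```

In the setting of the abstract, high-reward sequences such as `best` would then serve as pseudo-labels for further training of the base forecasting model; that self-training step is omitted here.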

Abstract:
Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities such as performing counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and performing multifaceted reasoning. We investigate what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high‑level semantics and fine‑grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.

Abstract:
We are interested in solving the problem of imitation learning with a limited amount of real‑world expert data. Existing offline imitation methods often struggle with poor data coverage and severe performance degradation. We propose a solution that leverages robot simulators to achieve online imitation learning. Our sim‑to‑real framework is based on world models and combines online imitation pretraining with offline finetuning. By leveraging online interactions, our approach alleviates the data coverage limitations of offline methods, leading to improved robustness and reduced performance degradation during finetuning. It also enhances generalization during domain transfer. Our empirical results demonstrate its effectiveness, improving success rates by at least 31.7% in sim‑to‑sim transfer and 23.3% in sim‑to‑real transfer over existing offline imitation learning baselines.

Abstract:
We release Code World Model (CWM), a 32‑billion‑parameter open‑weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid‑train CWM on a large amount of observation‑action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi‑task reasoning RL in verifiable coding, math, and multi‑turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step‑by‑step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder‑only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE‑bench Verified (with test‑time scaling), 68.6% on LiveCodeBench, 96.6% on Math‑500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid‑training, SFT, and RL.

Abstract:
Current video models fail as world models because they lack fine‑grained control. General‑purpose household robots require real‑time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine‑grained multimodal actions to capture such precise control. We consider the senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine‑grained interactions that are difficult to simulate with text‑conditioned generative models. To effectively simulate fine‑grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

Abstract:
Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open‑source community is constrained by the lack of high‑quality, permissively licensed tool‑agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi‑tool and multi‑turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool‑agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real‑world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool‑use queries using five distinct models, applies model‑based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule‑based and model‑based validation ensures high‑quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi‑turn conversations. Models fine‑tuned on Toucan outperform larger closed‑source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP‑Universe Bench.

Abstract:
Trained on internet‑scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundation models, might they supplant conventional vision encoder paradigms for general‑purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we investigate what happens when world model priors are transferred into Vision‑Language Models: we re‑purpose a video diffusion model as a generative encoder that performs a single denoising step, and we treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World‑Language Models (WorldLMs), and we find that generative encoders can capture latents that are useful for downstream understanding and that differ from those of conventional encoders. Naming our best‑performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single‑image models to perform multi‑frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open‑source and proprietary baselines, achieving state‑of‑the‑art or comparable performance. We attribute these gains to WorldLM's inherited motion‑consistency internalization from video pre‑training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.

Abstract:
Mobile ground robots lacking prior knowledge of an environment must rely on sensor data to develop a model of their surroundings. In these scenarios, consistent identification of obstacles and terrain features can be difficult due to noise and algorithmic shortcomings, which in turn hinders motion planning systems from generating safe motions. One particular difficulty arises when regions of the cost map switch between being marked as obstacles and free space through successive planning cycles. One potential solution to this, which we refer to as Valid in Every Hypothesis (VEH), is for the planning system to plan motions that are guaranteed to be safe through a history of world models. Another approach is to track a history of world models, and adjust node costs according to the potential penalty of needing to reroute around previously hazardous areas. This work discusses three major iterations on this idea. The first iteration, called PEH, invokes a sub‑search for every node expansion that crosses through a divergence point in the world models. The second and third iterations, called GEH and GEGRH respectively, defer the sub‑search until after an edge expands into the goal region. GEGRH uses an additional step to revise the graph based on divergent nodes in each world. Initial results showed that, although PEH and GEH find more optimistic solutions than VEH, they are unable to generate solutions in less than one second, which exceeds our requirements for field deployment. Analysis of results from a field experiment in an unstructured, off‑road environment on a Clearpath Robotics Warthog UGV indicates that GEGRH finds lower cost trajectories and has faster average planning times than VEH. Compared to single‑hypothesis (SH) search, where only the latest world model is considered, GEGRH generates more conservative plans with a small increase in average planning time.
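As a rough illustration of the VEH idea described in this abstract, the minimal sketch below accepts a path only if it is collision‑free in every world model in the history, and contrasts this with single‑hypothesis (SH) checking against the latest model alone. The grid‑cell representation, the obstacle sets, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
def path_is_free(path, obstacles):
    """Return True if no cell on the path is marked as an obstacle."""
    return all(cell not in obstacles for cell in path)


def valid_in_every_hypothesis(path, world_history):
    """VEH-style check: the path must be free in every stored world model."""
    return all(path_is_free(path, obstacles) for obstacles in world_history)


def valid_in_latest(path, world_history):
    """SH-style check: only the most recent world model is consulted."""
    return path_is_free(path, world_history[-1])


# Hypothetical history of three world models (sets of obstacle cells).
# Cell (2, 1) flickers: it is an obstacle in the second model only.
history = [{(1, 1)}, {(1, 1), (2, 1)}, {(1, 1)}]

risky_path = [(2, 0), (2, 1), (2, 2)]   # crosses the flickering cell
safe_path = [(3, 0), (3, 1), (3, 2)]    # never marked as an obstacle

print(valid_in_latest(risky_path, history))            # SH accepts it
print(valid_in_every_hypothesis(risky_path, history))  # VEH rejects it
print(valid_in_every_hypothesis(safe_path, history))   # VEH accepts it
```

The sketch only covers the validity check; the sub‑search and graph‑revision steps of PEH, GEH, and GEGRH are not modeled here.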

Abstract:
Long‑horizon embodied planning is challenging because the world does not only change through an agent's actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent's actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause‑effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held‑out tasks with more objects and more complex goals, outperforming a range of baselines.

Abstract:
Autonomous navigation for mechanical thrombectomy (MT) remains a critical challenge due to the complexity of vascular anatomy and the need for precise, real‑time decision‑making. Reinforcement learning (RL)‑based approaches have demonstrated potential in automating endovascular navigation, but current methods often struggle with generalization across multiple patient vasculatures and long‑horizon tasks. We propose a world model for autonomous endovascular navigation using TD‑MPC2, a model‑based RL algorithm. We trained a single RL agent across multiple endovascular navigation tasks in ten real patient vasculatures, comparing performance against the state‑of‑the‑art Soft Actor‑Critic (SAC) method. Results indicate that TD‑MPC2 significantly outperforms SAC in multi‑task learning, achieving a 65% mean success rate compared to SAC's 37%, with notable improvements in path ratio. TD‑MPC2 exhibited increased procedure times, suggesting a trade‑off between success rate and execution speed. These findings highlight the potential of world models for improving autonomous endovascular navigation and lay the foundation for future research in generalizable AI‑driven robotic interventions.

Abstract:
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human‑like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision‑language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine‑grained alignment with textual instructions; and Cognition, the higher‑order capability for proactive, multi‑step, goal‑oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe‑think‑verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting‑edge methods designed to address these challenges, spanning from techniques that enhance low‑level visual representations to those that improve high‑level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next‑generation models capable of deep reasoning and a genuine understanding of the world.

Abstract:
Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low‑code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms rely on probabilistic associations rather than genuine causal understanding. This paper introduces a new programming paradigm: Causal‑Visual Programming (CVP), designed to address this fundamental issue by explicitly introducing causal structures into the workflow design. CVP allows users to define a simple "world model" for workflow modules through an intuitive low‑code interface, effectively creating a Directed Acyclic Graph (DAG) that explicitly defines the causal relationships between modules. This causal graph acts as a crucial constraint during the agent's reasoning process, anchoring its decisions to a user‑defined causal structure and significantly reducing logical errors and hallucinations by preventing reliance on spurious correlations. To validate the effectiveness of CVP, we designed a synthetic experiment that simulates a common real‑world problem: a distribution shift between the training and test environments. Our results show that a causally anchored model maintained stable accuracy in the face of this shift, whereas a purely associative baseline model that relied on probabilistic correlations experienced a significant performance drop. The primary contributions of this study are: a formal definition of causal structures for workflow modules; the proposal and implementation of a CVP framework that anchors agent reasoning to a user‑defined causal graph; and empirical evidence demonstrating the framework's effectiveness in enhancing agent robustness and reducing errors caused by causal confusion in dynamic environments. CVP offers a viable path toward building more interpretable, reliable, and trustworthy AI agents.
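To make the notion of a user‑defined causal DAG over workflow modules more concrete, the following minimal sketch checks that a set of declared cause‑effect links is acyclic and, if so, returns an execution order that respects them. It uses Python's standard `graphlib` module; the module names (`fetch`, `clean`, `report`) and the `causal_order` helper are hypothetical and are not part of the CVP framework described in the abstract.

```python
from graphlib import CycleError, TopologicalSorter


def causal_order(causes):
    """Return modules in an order consistent with the causal graph.

    `causes` maps each module to the set of modules that directly cause it.
    Returns None if the declared relationships contain a cycle, i.e. they
    do not form a valid DAG.
    """
    try:
        return list(TopologicalSorter(causes).static_order())
    except CycleError:
        return None


# Hypothetical workflow: fetch -> clean -> report.
workflow = {"clean": {"fetch"}, "report": {"clean"}}
print(causal_order(workflow))

# A cyclic declaration is rejected rather than silently accepted.
print(causal_order({"a": {"b"}, "b": {"a"}}))
```

A constraint of this kind could be consulted before an agent proposes a step, although how CVP actually integrates the graph into the agent's reasoning is not specified in the abstract.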

Abstract:
Vision Language Action models (VLAs) trained with policy‑based reinforcement learning (RL) encode complex behaviors without explicitly modeling environmental dynamics. However, it remains unclear whether VLAs implicitly learn world models, a hallmark of model‑based RL. We propose an experimental methodology using embedding arithmetic on state representations to probe whether OpenVLA, the current state of the art in VLAs, contains latent knowledge of state transitions. Specifically, we measure the difference between embeddings of sequential environment states and test whether this transition vector is recoverable from intermediate model activations. Using linear and nonlinear probes trained on activations across layers, we find statistically significant predictive ability on state transitions exceeding baselines (embeddings), indicating that OpenVLA encodes an internal world model (as opposed to the probes learning the state transitions). We investigate the predictive ability of an earlier checkpoint of OpenVLA, and uncover hints that the world model emerges as training progresses. Finally, we outline a pipeline leveraging Sparse Autoencoders (SAEs) to analyze OpenVLA's world model.

Abstract:
World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real‑time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.

Abstract:
Sampling‑based motion planning is a well‑established approach in autonomous driving, valued for its modularity and analytical tractability. In complex urban scenarios, however, uniform or heuristic sampling often produces many infeasible or irrelevant trajectories. We address this limitation with a hybrid framework that learns where to sample while keeping trajectory generation and evaluation fully analytical and verifiable. A reinforcement learning (RL) agent guides the sampling process toward regions of the action space likely to yield feasible trajectories, while evaluation and final selection remain governed by deterministic feasibility checks and cost functions. We couple the RL sampler with a world model (WM) based on a decodable deep set encoder, enabling both variable numbers of traffic participants and reconstructable latent representations. The approach is evaluated in the CommonRoad (CR) simulation environment and compared against uniform‑sampling baselines, showing up to 99% fewer required samples and a runtime reduction of up to 84% while maintaining planning quality in terms of success and collision‑free rates. These improvements lead to faster, more reliable decision‑making for autonomous vehicles in urban environments.

Abstract:
Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training‑free, inference‑time techniques that fully exploit explicit action parameters in diffusion‑based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier‑free guidance process and the initialization of Gaussian latents. First, action‑scaled classifier‑free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action‑scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
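The first technique, action‑scaled classifier‑free guidance, can be sketched as the standard guidance combination with a scale that grows with the action magnitude. The sketch below is only an illustration under stated assumptions: the linear schedule `base_scale + gain * ||action||`, the parameter names, and the use of plain Python lists in place of noise tensors are all hypothetical, and the abstract does not specify the exact schedule the authors use.

```python
import math


def action_scaled_guidance(eps_uncond, eps_cond, action,
                           base_scale=1.5, gain=1.0):
    """Combine unconditional and conditional noise predictions.

    Standard classifier-free guidance computes
        eps = eps_uncond + scale * (eps_cond - eps_uncond).
    Here the scale increases with the Euclidean norm of the action
    (an assumed linear schedule, for illustration only).
    """
    magnitude = math.sqrt(sum(a * a for a in action))
    scale = base_scale + gain * magnitude
    guided = [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
    return guided, scale


# Toy two-element "noise predictions" and a hypothetical 2-D action.
guided, scale = action_scaled_guidance(
    eps_uncond=[0.0, 1.0], eps_cond=[1.0, 1.0], action=[3.0, 4.0])
print(scale)   # larger actions give a larger guidance scale
print(guided)
```

With a zero action the scale falls back to `base_scale`, so small motions receive ordinary guidance while large motions are pushed more strongly toward the action‑conditioned prediction.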

Abstract:
LLM‑based agents have seen promising advances, yet they are still limited in "hard‑exploration" tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual‑scale world models, maintaining a trajectory frontier of high‑value discoveries at the global scale, while learning from local trial‑and‑error in exploration through a Multi‑path Advantage Reflection mechanism which infers advantage‑based progress signals to guide exploration. To evaluate our framework for hard‑exploration, we tackle the Jericho benchmark suite of text‑based games, where GLoW achieves a new state‑of‑the‑art performance for LLM‑based approaches. Compared to state‑of‑the‑art RL‑based methods, our approach achieves comparable performance while requiring 100‑800x fewer environment interactions.

Abstract:
Simulating interactive world models remains a core challenge in Large Language Models (LLMs). In this work, we introduce ByteSized32Refactored, a refactored, modular, and extensible implementation of the original ByteSized32 corpus to explore the task of text game generation. We further optimize the code structure of each text game and create the GameBasic.py foundation library, which centralizes common logic across all 32 games by abstracting 7 base classes (GameObject, etc.) into reusable modules, thereby reducing the total from 20k to 10k lines of Python code compared to the original ByteSized32. Our refactored implementation enables extensibility: with our centralized design, ByteSized32Refactored can be more efficiently extended to include text games of new scenarios and specifications by reusing the shared logic and functionalities. Extensive experiments with GPT‑4o show mixed results: with ByteSized32Refactored, the generated text games for unseen scenarios show quality improvements on two of the four evaluation dimensions and decreases on the other two, indicating that the hierarchical structure of the refactored code presents new challenges for LLMs. Overall, we highlight that our extensible code structure, centered on the foundation library and the modular optimization, not only facilitates LLM adaptation to environment specifications but also establishes a scalable environment that supports future extensions.

Abstract:
World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability to accurately model human‑specified actions remains under‑explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action‑following capability of pre‑trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post‑training methods to world models is impractical due to the prohibitive cost of large‑scale preference annotations and the infeasibility of constructing rule‑based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post‑training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping the high‑dimensional video modality to a low‑dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5‑10% gains in action‑following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post‑training method specifically designed to enhance action‑following in video world models.
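The reward construction described in this abstract can be sketched in two steps: score each generated video by how well the actions recovered from it match the actions that were fed in, then normalize those scores within a group of samples, as Group Relative Policy Optimization does. In the sketch below the Inverse Dynamics Model is not implemented; its output is passed in as `recovered_actions`. The exact‑match accuracy reward, the discrete action labels, and the function names are assumptions for illustration rather than the paper's definitions.

```python
from statistics import mean, pstdev


def inverse_reward(input_actions, recovered_actions):
    """Fraction of time steps where the recovered action equals the input.

    `recovered_actions` stands in for the output of an inverse dynamics
    model applied to a generated video (not implemented here).
    """
    matches = sum(a == b for a, b in zip(input_actions, recovered_actions))
    return matches / len(input_actions)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group by their mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three generated videos for the same action sequence.
actions = ["left", "right", "jump", "left"]
recovered = [
    ["left", "right", "jump", "left"],   # follows every action
    ["left", "right", "left", "left"],   # misses one action
    ["jump", "jump", "left", "right"],   # follows none
]
rewards = [inverse_reward(actions, r) for r in recovered]
print(rewards)
print(group_relative_advantages(rewards))
```

Samples that follow the input actions more faithfully than the group average receive positive advantages, which is the signal a policy‑gradient update would then reinforce.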

Abstract:
The Model Context Protocol (MCP) defines a schema‑bound execution model for agent‑tool interaction, enabling modular computer vision workflows without retraining. To our knowledge, this is the first protocol‑level, deployment‑scale audit of MCP in vision systems, identifying systemic weaknesses in schema semantics, interoperability, and runtime coordination. We analyze 91 publicly registered vision‑centric MCP servers, annotated along nine dimensions of compositional fidelity, and develop an executable benchmark with validators to detect and categorize protocol violations. The audit reveals a high prevalence of schema format divergence, missing runtime schema validation, undeclared coordinate conventions, and reliance on untracked bridging scripts. Validator‑based testing quantifies these failures, with schema format checks flagging misalignments in 78.0 percent of systems, coordinate convention checks detecting spatial reference errors in 24.6 percent, and memory scope checks issuing an average of 33.8 warnings per 100 executions. Security probes show that dynamic and multi‑agent workflows exhibit elevated risks of privilege escalation and untyped tool connections. The proposed benchmark and validator suite, implemented in a controlled testbed and to be released on GitHub, establishes a reproducible framework for measuring and improving the reliability and security of compositional vision workflows.

Abstract:
Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14‑billion‑parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision‑language model agents evaluate the DiT‑generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co‑trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination‑to‑action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state‑of‑the‑art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large‑scale, real‑world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open‑sourced.

Abstract:
The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in‑context learning (ICL) of world models, shifting attention from zero‑shot performance to the growth and asymptotic limits of the world model. Our contributions are three‑fold: (1) we formalize ICL of a world model and identify two core mechanisms: environment recognition (ER) and environment learning (EL); (2) we derive error upper‑bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self‑adapting world models and highlight the key factors behind the emergence of EL/ER, most notably the necessity of long context and diverse environments.

Abstract:
High‑quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry‑enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross‑branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D‑aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per‑scene optimization or fine‑tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry‑consistent baselines in multi‑view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross‑branch information exchange.

Abstract:
We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene‑wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real‑world intuitive physics datasets. Although recent state‑of‑the‑art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine‑tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Abstract:
Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real‑world applications. This stems from the redundancy of the prevailing frame‑to‑frame generation approach, where the model conducts costly computation on similar frames, as well as from its neglect of the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text‑conditioned robotic world models by concentrating transformer computation on a few semantic key frames while employing a lightweight convolutional model to fill the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot's motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason and generate these physically meaningful key frames from textual task descriptions. Finally, a lightweight interpolator efficiently reconstructs the full video by inpainting all intermediate frames. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68× acceleration compared to the frame‑to‑frame generation baseline, and focusing on the motion‑aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real‑time robotic control and other domains requiring both efficient and effective world models. Code is released at https://anonymous.4open.science/r/Keyworld‑E43D.

Abstract:
Evaluating AI agents that solve real‑world tasks through function‑call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool‑use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness ‑ Kendall's tau Composite, Prefix Criticality, Harmful‑Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final‑state evaluation schemes.
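A task encoded as a deterministic finite automaton over tool calls can be sketched as follows: states track progress, transitions are labeled with tool names, and an agent's call sequence is valid only if it drives the automaton into an accepting state. The class, the flight‑booking task, and the tool names below are hypothetical illustrations; the sketch shows plain DFA acceptance and does not implement any of the CORE metrics.

```python
class ToolUseDFA:
    """Deterministic finite automaton over tool-call names."""

    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        # Maps (state, tool_call) -> next state; missing keys are invalid moves.
        self.transitions = dict(transitions)

    def accepts(self, calls):
        """Return True if the call sequence ends in an accepting state."""
        state = self.start
        for call in calls:
            key = (state, call)
            if key not in self.transitions:
                return False  # call not allowed in the current state
            state = self.transitions[key]
        return state in self.accepting


# Hypothetical task: search, then book, then confirm.
booking = ToolUseDFA(
    start="start",
    accepting={"done"},
    transitions={
        ("start", "search_flights"): "searched",
        ("searched", "book_flight"): "booked",
        ("booked", "send_confirmation"): "done",
    },
)

print(booking.accepts(["search_flights", "book_flight", "send_confirmation"]))
print(booking.accepts(["book_flight"]))                     # skips the search
print(booking.accepts(["search_flights", "book_flight"]))   # never confirms
```

Because the automaton inspects every intermediate call, it distinguishes sequences that a final‑state check alone would treat as equivalent.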

Abstract:
Reinforcement learning has enabled significant progress in complex domains such as coordinating and navigating multiple quadrotors. However, even well‑trained policies remain vulnerable to collisions in obstacle‑rich environments. Addressing these infrequent but critical safety failures through retraining or fine‑tuning is costly and risks degrading previously learned skills. Inspired by activation steering in large language models and latent editing in computer vision, we introduce a framework for inference‑time Latent Activation Editing (LAE) that refines the behavior of pre‑trained policies without modifying their weights or architecture. The framework operates in two stages: (i) an online classifier monitors intermediate activations to detect states associated with undesired behaviors, and (ii) an activation editing module selectively modifies flagged activations to shift the policy towards safer regimes. In this work, we focus on improving safety in multi‑quadrotor navigation. We hypothesize that amplifying a policy's internal perception of risk can induce safer behaviors. We instantiate this idea through a latent collision world model trained to predict future pre‑collision activations, thereby prompting earlier and more cautious avoidance responses. Extensive simulations and real‑world Crazyflie experiments demonstrate that LAE achieves a statistically significant reduction in collisions (nearly 90% fewer cumulative collisions compared to the unedited baseline) and substantially increases the fraction of collision‑free trajectories, while preserving task completion. More broadly, our results establish LAE as a lightweight paradigm, feasible on resource‑constrained hardware, for post‑deployment refinement of learned robot policies.
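The two‑stage structure (detect, then edit) can be sketched with a logistic classifier over an activation vector followed by a conditional edit. Note the substitution: the paper derives its edits from a latent collision world model, whereas this sketch simply adds a fixed steering direction to flagged activations. The weights, threshold, direction, and function names are hypothetical placeholders, not values or APIs from the paper.

```python
import math


def risk_probability(activation, weights, bias):
    """Stage (i): logistic classifier score for an intermediate activation."""
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def edit_if_flagged(activation, weights, bias, direction,
                    threshold=0.5, strength=1.0):
    """Stage (ii): shift the activation only when the classifier flags it.

    Returns the (possibly edited) activation and whether an edit occurred.
    The policy's weights are never touched; only the activation changes.
    """
    if risk_probability(activation, weights, bias) < threshold:
        return list(activation), False
    edited = [a + strength * d for a, d in zip(activation, direction)]
    return edited, True


# Hypothetical classifier and steering direction for a 2-D activation.
weights, bias = [1.0, 0.0], 0.0
direction = [0.5, 0.0]

print(edit_if_flagged([-2.0, 0.3], weights, bias, direction, strength=2.0))
print(edit_if_flagged([2.0, 0.3], weights, bias, direction, strength=2.0))
```

Activations that the classifier does not flag pass through unchanged, which is what allows behavior in ordinary states to be preserved.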

Abstract:
Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications and driving the evolution from cyberspace to physical systems. Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI. On the one hand, LLMs empower embodied AI via semantic reasoning and task decomposition, bringing high‑level natural language instructions and low‑level natural language actions into embodied cognition. On the other hand, WMs empower embodied AI by building internal representations and future predictions of the external world, facilitating physical law‑compliant embodied interactions. As such, this paper comprehensively explores the literature in embodied AI from basics to advances, covering both LLM‑driven and WM‑driven works. In particular, we first present the history, key technologies, key components, and hardware systems of embodied AI, and discuss its development from a unimodal to a multimodal perspective. We then scrutinize the two burgeoning fields of embodied AI, i.e., embodied AI with LLMs/multimodal LLMs (MLLMs) and embodied AI with WMs, meticulously delineating their indispensable roles in end‑to‑end embodied cognition and physical law‑driven embodied interactions. Building upon the above advances, we further share our insights on the necessity of a joint MLLM‑WM‑driven embodied AI architecture, shedding light on its profound significance in enabling complex tasks within physical worlds. In addition, we examine representative applications of embodied AI, demonstrating its wide applicability in real‑world scenarios. Last but not least, we point out future research directions of embodied AI that deserve further investigation.

Abstract:
Recent works have shown that foundational safe control methods, such as Hamilton‑Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard‑to‑model vision‑based tasks, existing approaches assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter's adaptability across scenarios. To address this, we propose constraint‑parameterized latent safety filters that can adapt to user‑specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent‑space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model's imagination, treating any image seen by the model as a potential test‑time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision‑based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user‑specified constraint images, without sacrificing performance. Video results can be found at https://any‑safe.github.io

Abstract:
Diffusion‑based world models have demonstrated strong capabilities in synthesizing realistic long‑horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value‑based offline RL algorithms that rely on one‑step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose DAWM, a diffusion‑based world model that generates future state‑reward trajectories conditioned on the current state, action, and return‑to‑go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one‑step TD‑based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion‑based baselines across multiple tasks in the D4RL benchmark.

Abstract:
Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real‑robot training is costly and unsafe, while training in simulators suffers from the sim‑to‑real gap. Recent advances in generative models have demonstrated remarkable capabilities in real‑world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model‑based world models can be combined to enhance pre‑trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion‑based world models as high‑fidelity simulators to refine pre‑trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end‑to‑end policy optimization. World4RL is designed around two principles: pre‑training a diffusion world model that captures diverse dynamics on multi‑task datasets and refining policies entirely within a frozen world model to avoid online real‑world interactions. We further design a two‑hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real‑world experiments demonstrate that World4RL provides high‑fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines.
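
For readers unfamiliar with the term, the sketch below shows one common reading of two‑hot encoding: a continuous scalar is spread over its two nearest bins so that the weighted bin average recovers the original value. The bin layout, the function names `two_hot` and `decode`, and the per‑dimension treatment are assumptions made for illustration; the abstract does not specify how World4RL tailors the scheme to manipulation actions.

```python
import numpy as np


def two_hot(x: float, bins: np.ndarray) -> np.ndarray:
    """Encode a scalar as weights on its two neighbouring bins.

    `bins` must be sorted in increasing order. Values outside the bin
    range are clipped to the nearest edge.
    """
    x = float(np.clip(x, bins[0], bins[-1]))
    enc = np.zeros(len(bins))
    hi = int(np.searchsorted(bins, x, side="left"))  # first bin >= x
    if hi == 0:
        enc[0] = 1.0  # x sits exactly on the lowest bin
        return enc
    lo = hi - 1
    w_hi = (x - bins[lo]) / (bins[hi] - bins[lo])
    enc[lo] = 1.0 - w_hi
    enc[hi] = w_hi
    return enc


def decode(enc: np.ndarray, bins: np.ndarray) -> float:
    """Recover the scalar as the expectation over bin centres."""
    return float(enc @ bins)


# Hypothetical discretisation of one normalised action dimension.
bins = np.linspace(-1.0, 1.0, 5)   # [-1, -0.5, 0, 0.5, 1]
code = two_hot(0.25, bins)         # weight split between bins 0.0 and 0.5
value = decode(code, bins)
```

With these bins, `code` puts equal weight on the bins at 0.0 and 0.5, and `value` recovers 0.25. The appeal of such an encoding is that a continuous action can be predicted with a categorical output head without losing sub‑bin precision.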

Abstract:
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real‑world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction‑conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World‑Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT‑4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine‑tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state‑of‑the‑art baselines on RSWISE.

Abstract:
While reinforcement learning from scratch has shown impressive results in solving sequential decision‑making tasks with efficient simulators, real‑world applications with expensive interactions require more sample‑efficient agents. Foundation models (FMs) are natural candidates to improve sample efficiency as they possess broad knowledge and reasoning capabilities, but it is not yet clear how to effectively integrate them into the reinforcement learning framework. In this paper, we anticipate and, most importantly, evaluate two promising strategies. First, we consider the use of foundation world models (FWMs) that exploit the prior knowledge of FMs to enable training and evaluating agents with simulated interactions. Second, we consider the use of foundation agents (FAs) that exploit the reasoning capabilities of FMs for decision‑making. We evaluate both approaches empirically in a family of grid‑world environments that are suitable for the current generation of large language models (LLMs). Our results suggest that improvements in LLMs already translate into better FWMs and FAs; that FAs based on current LLMs can already provide excellent policies for sufficiently simple environments; and that the coupling of FWMs and reinforcement learning agents is highly promising for more complex settings with partial observability and stochastic elements.

Abstract:
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long‑horizon decision‑making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale‑wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra‑frame generation with causal modeling for next‑frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi‑scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory‑aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action‑conditioned video prediction and model‑based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero‑shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

Abstract:
Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows. First, unlike several earlier works for video generation, such as GAIA‑1, we provide a deep analysis of the three components of our system (image tokenizer, world model, and video decoder) through separate quantitative and qualitative evaluation. Second, we build purely upon powerful pre‑trained open source models from various domains, which we fine‑tune on publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining interfaces of our components. Fourth, due to public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256×256 at 4 fps, we are able to predict realistic driving scene videos frame‑by‑frame with only one frame of algorithmic latency.

Abstract:
Trajectory prediction is a fundamental problem in computer vision, vision‑language‑action models, world models, and autonomous systems, with broad impact on autonomous driving, robotics, and surveillance. However, most existing methods assume complete and clean observations, and therefore do not adequately handle out‑of‑sight agents or noisy sensing signals caused by limited camera coverage, occlusions, and the absence of ground‑truth denoised trajectories. These challenges raise safety concerns and reduce robustness in real‑world deployment. In this extended study, we introduce major improvements to Out‑of‑Sight Trajectory (OST), a task for predicting noise‑free visual trajectories of out‑of‑sight objects from noisy sensor observations. Building on our prior work, we expand Out‑of‑Sight Trajectory Prediction (OOSTraj) from pedestrians to both pedestrians and vehicles, increasing its relevance to autonomous driving, robotics, and surveillance. Our improved Vision‑Positioning Denoising Module exploits camera calibration to establish vision‑position correspondence, mitigating the lack of direct visual cues and enabling effective unsupervised denoising of noisy sensor signals. Extensive experiments on the Vi‑Fi and JRDB datasets show that our method achieves state‑of‑the‑art results for both trajectory denoising and trajectory prediction, with clear gains over prior baselines. We also compare with classical denoising methods, including Kalman filtering, and adapt recent trajectory prediction models to this setting, establishing a stronger benchmark. To the best of our knowledge, this is the first work to use vision‑positioning projection to denoise noisy sensor trajectories of out‑of‑sight agents, opening new directions for future research.

Abstract:
Ensuring safety of vision‑based control systems remains a major challenge hindering their deployment in critical settings. Safety filters have gained increased interest as effective tools for ensuring the safety of classical control systems, but their applications in vision‑based control settings have so far been limited. Pre‑trained vision models (PVRs) have been shown to be effective perception backbones for control in various robotics domains. In this paper, we are interested in examining their effectiveness when used for designing vision‑based safety filters. We use them as backbones for classifiers defining failure sets, for Hamilton‑Jacobi (HJ) reachability‑based safety filters, and for latent world models. We discuss the trade‑offs between training from scratch, fine‑tuning, and freezing the PVRs when training the models they are backbones for. We also evaluate whether one of the PVRs is superior across all tasks, evaluate whether learned world models or Q‑functions are better for switching decisions to safe policies, and discuss practical considerations for deploying these PVRs on resource‑constrained devices.

Abstract:
We introduce PhysicalAgent, an agentic framework for robotic manipulation that integrates iterative reasoning, diffusion‑based video generation, and closed‑loop execution. Given a textual instruction, our method generates short video demonstrations of candidate trajectories, executes them on the robot, and iteratively re‑plans in response to failures. This approach enables robust recovery from execution errors. We evaluate PhysicalAgent across multiple perceptual modalities (egocentric, third‑person, and simulated) and robotic embodiments (bimanual UR3, Unitree G1 humanoid, simulated GR1), comparing against state‑of‑the‑art task‑specific baselines. Experiments demonstrate that our method consistently outperforms prior approaches, achieving up to 83% success on human‑familiar tasks. Physical trials reveal that first‑attempt success is limited (20‑30%), yet iterative correction increases overall success to 80% across platforms. These results highlight the potential of video‑based generative reasoning for general‑purpose robotic manipulation and underscore the importance of iterative execution for recovering from initial failures. Our framework paves the way for scalable, adaptable, and robust robot control.

Abstract:
We study whether next‑token prediction can yield world models that truly support planning, in a controlled symbolic setting where propositional STRIPS action models are learned from action traces alone and correctness can be evaluated exactly. We introduce two architectures. The first is the STRIPS Transformer, a symbolically aligned model grounded in theoretical results linking transformers and the formal language structure of STRIPS domains. The second is a standard transformer architecture without explicit symbolic structure built in, for which we study different positional encoding schemes and attention aggregation mechanisms. We evaluate both architectures on five classical planning domains, measuring training accuracy, generalization, and planning performance across domains and problem sizes. Interestingly, both approaches can be used to produce models that support planning with off‑the‑shelf STRIPS planners over exponentially many unseen initial states and goals. Although the STRIPS Transformer incorporates a strong symbolic inductive bias, it is harder to optimize and requires larger datasets to generalize reliably. In contrast, a standard transformer with stick‑breaking attention achieves near‑perfect training accuracy and strong generalization. Finally, standard transformers without stick‑breaking attention do not generalize to long traces, whereas a symbolic STRIPS model extracted from a transformer trained on shorter traces does.
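
To make the setting concrete, the sketch below shows the standard propositional STRIPS semantics that such learned action models must reproduce: an action is applicable when its preconditions hold, and applying it removes its delete effects and adds its add effects. The two‑action, single‑block domain and all atom names are hypothetical; this is not the paper's learning method, only the target formalism.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Action:
    name: str
    pre: FrozenSet[str]      # atoms that must hold before the action
    add: FrozenSet[str]      # atoms made true by the action
    delete: FrozenSet[str]   # atoms made false by the action


def apply(state: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Return the successor state, or None if the action is not applicable."""
    if not action.pre <= state:
        return None
    return (state - action.delete) | action.add


def run_trace(state: FrozenSet[str],
              trace: Iterable[Action]) -> Optional[FrozenSet[str]]:
    """Apply a sequence of actions; None means the trace is invalid."""
    current: Optional[FrozenSet[str]] = state
    for action in trace:
        current = apply(current, action)
        if current is None:
            return None
    return current


# Hypothetical one-block domain.
free = frozenset({"clear_a", "ontable_a", "handempty"})
pickup = Action("pickup_a", pre=free,
                add=frozenset({"holding_a"}), delete=free)
putdown = Action("putdown_a", pre=frozenset({"holding_a"}),
                 add=free, delete=frozenset({"holding_a"}))

after_pickup = run_trace(free, [pickup])
round_trip = run_trace(free, [pickup, putdown])
invalid = run_trace(free, [putdown])
```

Because validity of a trace is decidable exactly under these semantics, a model trained only on action traces can be checked precisely, which is what allows the exact evaluation the abstract refers to.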

Abstract:
Training agents to act in embodied environments typically requires vast training data or access to accurate simulation, neither of which exists for many cases in the real world. Instead, world models are emerging as an alternative: leveraging offline, passively collected data, they make it possible to generate diverse worlds for training agents in simulation. In this work, we harness world models to generate imagined environments to train robust agents capable of generalizing to novel task variations. One of the challenges in doing this is ensuring the agent trains on useful generated data. We thus propose a novel approach, IMAC (Imagined Autocurricula), leveraging Unsupervised Environment Design (UED), which induces an automatic curriculum over generated worlds. In a series of challenging, procedurally generated environments, we show it is possible to achieve strong transfer performance on held‑out environments, having trained only inside a world model learned from a narrower dataset. We believe this opens the path to utilizing larger‑scale, foundation world models for generally capable agents.

Abstract:
A major challenge in deploying world models is the trade‑off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. We propose the Physics‑Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird's‑eye‑view (BEV) representations. PIWM uses Soft Mask during training to improve dynamic object modeling and future prediction. We also introduce a simple yet effective technique, Warm Start, for inference to enhance prediction quality with a zero‑shot model. Experiments show that at the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M Soft Mask) achieves a 7.4% higher weighted overall score with a 28% faster inference speed.

Abstract:
Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human‑like intelligence, namely robust, sample‑efficient learning, stems from an understanding of causal mechanisms. In this work, we introduce Causal‑Symbolic Meta‑Learning (CSML), a novel framework that learns to infer the latent causal structure of a task distribution. CSML comprises three key modules: a perception module that maps raw inputs to disentangled symbolic representations; a differentiable causal induction module that discovers the underlying causal graph governing these symbols; and a graph‑based reasoning module that leverages this graph to make predictions. By meta‑learning a shared causal world model across a distribution of tasks, CSML can rapidly adapt to novel tasks, including those requiring reasoning about interventions and counterfactuals, from only a handful of examples. We introduce CausalWorld, a new physics‑based benchmark designed to test these capabilities. Our experiments show that CSML dramatically outperforms state‑of‑the‑art meta‑learning and neuro‑symbolic baselines, particularly on tasks demanding true causal inference.

Abstract:
While generative world models have advanced video and occupancy‑based data synthesis, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. We present LiDARCrafter, a unified framework that converts free‑form language into editable LiDAR sequences. Instructions are parsed into ego‑centric scene graphs, which a tri‑branch diffusion model transforms into object layouts, trajectories, and shapes. A range‑image diffusion model generates the initial scan, and an autoregressive module extends it into a temporally coherent sequence. The explicit layout design further supports object‑level editing, such as insertion or relocation. To enable fair assessment, we provide EvalSuite, a benchmark spanning scene‑, object‑, and sequence‑level metrics. On nuScenes, LiDARCrafter achieves state‑of‑the‑art fidelity, controllability, and temporal consistency, offering a foundation for LiDAR‑based simulation and data augmentation.

Abstract:
Intelligent agents, particularly those powered by language models (LMs), play a critical role in various environments that require intelligent and autonomous decision‑making. Environments are not passive testing grounds; they supply the data agents need to learn and to exhibit adaptive, complex, and autonomous decision‑making under very challenging conditions. While the paradigm of scaling models and datasets has led to remarkable emergent capabilities, we argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial, yet underexplored, dimension of AI research. This paper introduces a neuro‑symbolic multi‑agent architecture where the belief states of individual agents are formally represented as Kripke models. This foundational choice enables them to reason about known concepts of possibility and necessity using the formal language of modal logic. In this work, we use immutable, domain‑specific knowledge to make an informed root cause diagnosis, which is encoded as logical constraints essential for proper, reliable, and explainable diagnosis. In the proposed model, we show constraints that actively guide the hypothesis generation of LMs, effectively preventing them from reaching physically or logically untenable conclusions. In a high‑fidelity simulated particle accelerator environment, our system successfully diagnoses complex, cascading failures by combining the powerful semantic intuition of LMs with the rigorous, verifiable validation of modal logic and a factual world model, showcasing a viable path toward more robust, reliable, and verifiable autonomous agents.
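
A minimal sketch of how possibility and necessity are evaluated on a Kripke model may help here. A proposition is possible at a world if it holds in at least one accessible world, and necessary if it holds in all of them. The `KripkeModel` class, the world names, and the fault atoms below are hypothetical and are not taken from the paper's accelerator environment.

```python
from typing import Dict, Set


class KripkeModel:
    """Worlds, an accessibility relation, and a valuation of atoms."""

    def __init__(self, access: Dict[str, Set[str]],
                 valuation: Dict[str, Set[str]]) -> None:
        self.access = access        # world -> worlds it considers possible
        self.valuation = valuation  # world -> atoms true in that world

    def holds(self, world: str, atom: str) -> bool:
        return atom in self.valuation.get(world, set())

    def possibly(self, world: str, atom: str) -> bool:
        """Diamond: the atom holds in some accessible world."""
        return any(self.holds(v, atom) for v in self.access.get(world, set()))

    def necessarily(self, world: str, atom: str) -> bool:
        """Box: the atom holds in every accessible world (vacuously true if none)."""
        return all(self.holds(v, atom) for v in self.access.get(world, set()))


# Hypothetical belief state: from w0 the agent cannot rule out w1 or w2.
beliefs = KripkeModel(
    access={"w0": {"w1", "w2"}},
    valuation={
        "w1": {"beam_loss", "magnet_fault"},
        "w2": {"beam_loss"},
    },
)

fault_possible = beliefs.possibly("w0", "magnet_fault")
fault_certain = beliefs.necessarily("w0", "magnet_fault")
loss_certain = beliefs.necessarily("w0", "beam_loss")
```

In this toy belief state the agent is certain of a beam loss but only considers a magnet fault possible, so a diagnosis asserting the fault outright would not yet be justified.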

Abstract:
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three‑step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random‑access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low‑dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero‑shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles, akin to an LLM‑like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state‑of‑the‑art optical flow, self‑supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.

Abstract:
Future AI‑native wireless networks are moving from reactive optimization to agentic decision‑making that can sense, predict, and plan under fast‑varying channels. This calls for wireless world models that can predict and roll out channel dynamics, for which multi‑step channel state information (CSI) prediction offers a practical short‑horizon look‑ahead. Recent advances in foundation sequence models further motivate large language models (LLMs) as general‑purpose dynamics learners when suitably adapted to non‑text time‑series signals. However, bridging CSI to LLMs is non‑trivial because an effective adapter must expose informative spectral and temporal evolution patterns, while prior designs provide limited inductive bias to capture such channel structures. To this end, we propose SCA‑LLM, a spectral‑attentive LLM‑based wireless world modeling framework that bridges CSI to LLMs via a spectral‑channel attention (SCA) adapter. Specifically, the SCA adapter performs multi‑spectral representation learning to extract informative channel features and align CSI with the LLM's sequence modeling capability, enabling parameter‑efficient adaptation while keeping the LLM backbone largely frozen. Extensive simulations show that SCA‑LLM achieves state‑of‑the‑art prediction performance and strong zero‑shot generalization, yielding up to ‑2.4 dB normalized mean squared error (NMSE) advantage over the previous LLM‑based method. Our ablation studies further confirm the effectiveness of the proposed SCA adapter in mitigating domain mismatch.
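
For reference, NMSE in decibels is commonly computed as ten times the base‑10 logarithm of the squared prediction error divided by the energy of the true channel, so more negative values indicate better prediction. The sketch below follows that common definition on synthetic complex‑valued data; the function name `nmse_db` is hypothetical, and the paper's exact averaging over samples, antennas, or subcarriers is not stated in the abstract and may differ.

```python
import numpy as np


def nmse_db(h_true: np.ndarray, h_pred: np.ndarray) -> float:
    """Normalised mean squared error in dB (lower is better).

    Works for real or complex arrays of identical shape.
    """
    err = np.sum(np.abs(h_true - h_pred) ** 2)
    ref = np.sum(np.abs(h_true) ** 2)
    return float(10.0 * np.log10(err / ref))


# Synthetic example: the prediction is off by 10% in amplitude everywhere.
h_true = np.ones(4, dtype=complex)
h_pred = 0.9 * h_true
score = nmse_db(h_true, h_pred)   # error energy is 1% of signal energy
```

In this example the error energy is 1% of the signal energy, giving roughly -20 dB. Under this convention, a 2.4 dB advantage corresponds to an NMSE that is 2.4 dB lower than the baseline's.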

Abstract:
Recent research has been increasingly focusing on developing 3D world models that simulate complex real‑world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim‑to‑real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA‑2‑7B) alongside the industry‑grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large‑scale 3D interactive worlds with dynamic agents, featuring competitive multi‑agent interaction, high‑fidelity physics simulation, and real‑time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90× increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

Abstract:
LLMs struggle with decision‑making in high‑stakes environments like MOBA games, primarily due to a lack of proactive reasoning and limited understanding of complex game dynamics. To address this, we propose What‑if Analysis LLM (WiA‑LLM), a framework that trains an LLM as an explicit, language‑based world model. Instead of representing the environment in latent vectors, WiA‑LLM uses natural language to simulate how the game state evolves over time in response to candidate actions, and provides textual justifications for these predicted outcomes. WiA‑LLM is trained in two stages: supervised fine‑tuning on human‑like reasoning traces, followed by reinforcement learning with outcome‑based rewards based on the alignment between predicted and actual future states. In the Honor of Kings (HoK) environment, WiA‑LLM attains 74.2% accuracy (27%↑ vs. base model) in forecasting game‑state changes. In addition, WiA‑LLM demonstrates strategic behavior more closely aligned with expert players than purely reactive LLMs, indicating enhanced foresight and expert‑like decision‑making.

Abstract:
Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long‑horizon multi‑agent tasks such as robotic soccer, this end‑to‑end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision‑making must model the world's physics and also its task semantics. A systematic review of 2024 research in low‑resource multi‑agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi‑agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general‑purpose agents with greater sample efficiency than purely end‑to‑end approaches.

Abstract:
The capacity of an embodied agent to understand, predict, and interact with its environment is fundamentally contingent on an internal world model. This paper introduces a novel framework for investigating the formation and adaptation of such world models within a biological substrate: human neural organoids. We present a curriculum of three scalable, closed‑loop virtual environments designed to train these biological agents and probe the underlying synaptic mechanisms of learning, such as long‑term potentiation (LTP) and long‑term depression (LTD). We detail the design of three distinct task environments that demand progressively more sophisticated world models for successful decision‑making: (1) a conditional avoidance task for learning static state‑action contingencies, (2) a one‑dimensional predator‑prey scenario for goal‑directed interaction, and (3) a replication of the classic Pong game for modeling dynamic, continuous‑time systems. For each environment, we formalize the state and action spaces, the sensory encoding and motor decoding mechanisms, and the feedback protocols based on predictable (reward) and unpredictable (punishment) stimulation, which serve to drive model refinement. In a significant methodological advance, we propose a meta‑learning approach where a Large Language Model automates the generative design and optimization of experimental protocols, thereby scaling the process of environment and curriculum design. Finally, we outline a multi‑modal evaluation strategy that moves beyond task performance to directly measure the physical correlates of the learned world model by quantifying synaptic plasticity at electrophysiological, cellular, and molecular levels. This work bridges the gap between model‑based reinforcement learning and computational neuroscience, offering a unique platform for studying embodiment, decision‑making, and the physical basis of intelligence.

Abstract:
Global human motion reconstruction from in‑the‑wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates, a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human‑motion‑centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World‑aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard‑decoding approaches. Through experiments on in‑the‑wild benchmarks, WATCH achieves state‑of‑the‑art performance in end‑to‑end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera‑human motion relationships and offers new insights for addressing the long‑standing challenge of camera translation integration in global human motion reconstruction. The code will be available publicly.

Abstract:
Speech synthesis systems can now produce highly realistic vocalisations that pose significant authenticity challenges. Despite substantial progress in deepfake detection models, their real‑world effectiveness is often undermined by evolving distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, which hinder systematic evaluation and open‑world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large‑scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, impacting overall generalisation. As a complementary contribution, we propose an effective curriculum‑learning‑based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to novel deepfakes and human speech in AUDETER, whereas XLR‑based detectors trained on AUDETER achieve strong cross‑domain performance across multiple benchmarks, achieving an EER of 1.87% on In‑the‑Wild. AUDETER is available on GitHub.
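
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from detector scores follows; the function name, the toy scores, and the convention that a higher score means "more likely real speech" are assumptions made for illustration, not details taken from the AUDETER evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumed convention: a higher score means "more likely real speech".
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False acceptance: spoofed clips scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: real clips scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores (hypothetical): one real clip and one spoofed clip are misranked.
eer = equal_error_rate([0.9, 0.8, 0.4], [0.1, 0.3, 0.6])
print(f"EER = {eer:.3f}")  # EER = 0.333
```

With only a handful of scores the two error rates may never be exactly equal, so the sketch reports their average at the threshold where they are closest.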

Abstract:
In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain‑specific world models through test‑time composition. By allowing seamless implantation and removal of the world models, the embodied agent's policy achieves and maintains cross‑domain adaptability. In the WorMI framework, we employ a prototype‑based world model retrieval approach, utilizing efficient trajectory‑based abstract representation matching, to incorporate relevant models into test‑time composition. We also develop a world‑wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model's representation within the agent's policy. This framework design effectively fuses domain‑specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero‑shot and few‑shot performance compared to several LLM‑based approaches across a range of unseen domains. These results highlight the framework's potential for scalable, real‑world deployment in embodied agent scenarios where adaptability and data efficiency are essential.

Abstract:
In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high‑fidelity long‑term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine‑grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long‑term generation and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next‑scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale‑by‑scale generation and temporal scene‑by‑scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego‑motion. Experiments show that OccTENS outperforms the state‑of‑the‑art method with both higher occupancy quality and faster inference time.

Abstract:
On‑the‑fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low‑data and out‑of‑distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few‑shot, in‑context learning demonstrations. As a proof‑of‑concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test‑time training, (2) counterfactual reasoning with in‑context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within‑ and between‑model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
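
To make the three kinds of feedback concrete, here is a minimal sketch of a structural causal model with observational, interventional, and counterfactual queries. The two-variable linear model and all function names are hypothetical and chosen only for illustration; they are not a CausalARC task.

```python
# Toy structural causal model (hypothetical, not a CausalARC task):
#   X := U_x
#   Y := 2 * X + U_y
def mechanism_y(x, u_y):
    """Structural equation for Y."""
    return 2 * x + u_y


def observe(u_x, u_y):
    """Observational sample: every mechanism runs undisturbed."""
    x = u_x
    return x, mechanism_y(x, u_y)


def intervene(x_forced, u_y):
    """Interventional sample under do(X = x_forced): X's mechanism is replaced."""
    return x_forced, mechanism_y(x_forced, u_y)


def counterfactual_y(x_obs, y_obs, x_alt):
    """Answer 'what would Y have been had X been x_alt?' for one observed unit."""
    u_y = y_obs - 2 * x_obs         # abduction: recover the noise behind the observation
    return mechanism_y(x_alt, u_y)  # action and prediction with the same noise


print(observe(u_x=3, u_y=1))                         # (3, 7)
print(intervene(x_forced=0, u_y=1))                  # (0, 1)
print(counterfactual_y(x_obs=3, y_obs=7, x_alt=0))   # 1
```

The counterfactual differs from the intervention in that it reuses the noise inferred from a specific observation rather than drawing or assuming fresh noise.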

Abstract:
As AI technology advances, research on playing text‑based games with agents has become increasingly popular. In this paper, a novel approach to agent design and agent learning is presented in the context of reinforcement learning. A deep learning model is first applied to process game text and build a world model. Next, the agent is trained through a policy gradient‑based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text‑based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical ground for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.

Abstract:
Non‑deductive reasoning, encompassing inductive and abductive reasoning, is essential in addressing complex real‑world questions. One key feature of inductive and abductive reasoning is that there are many valid hypotheses; the simplest ones (those that adhere to Occam's Razor) are often most useful. However, this aspect is ignored in recent work that evaluates the non‑deductive reasoning capabilities of large language models (LLMs). This work fills this gap, focusing on understanding whether the inductive and abductive reasoning capabilities of LLMs adhere to Occam's Razor, while also examining the correctness of their reasoning. To accomplish this goal, we introduce a framework that synthetically generates reasoning questions that (a) require inductive reasoning and abductive reasoning simultaneously, and that (b) can be readily extended to produce any abductive/inductive reasoning question expressible in first‑order logic. The task for the intelligent agent is to produce hypotheses to explain observations under a given world model. We also propose a new automated metric to assess whether hypotheses quantitatively adhere to Occam's Razor; those hypotheses that are correct and simplest are considered high‑quality. Our findings on state‑of‑the‑art LLMs suggest that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high‑quality hypotheses, even with popular reasoning‑enhancing techniques such as in‑context learning and RLVR.

Abstract:
Effective planning requires strong world models, but high‑level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language‑based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self‑Refine conditioned on compressed future observations represented by Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system‑1 plan decoding and reflective system‑2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll‑outs and the expected goal state, and is measured by a critic model that we trained in a self‑supervised manner. The VLWM achieves state‑of‑the‑art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system‑2 improves the Elo score by +27% upon system‑1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.

Abstract:
Humans intuitively navigate social interactions by simulating unspoken dynamics and reasoning about others' perspectives, even with limited information. In contrast, AI systems struggle to structure and reason about implicit social contexts, as they lack explicit representations for unobserved dynamics such as intentions, beliefs, and evolving social states. In this paper, we introduce the concept of social world models (SWMs) to characterize the complex social dynamics. To operationalize SWMs, we introduce a novel structured social world representation formalism (S3AP), which captures the evolving states, actions, and mental states of agents, addressing the lack of explicit structure in traditional free‑text‑based inputs. Through comprehensive experiments across five social reasoning benchmarks, we show that S3AP significantly enhances LLM performance, achieving a +51% improvement on FANToM over OpenAI's o1. Our ablations further reveal that these gains are driven by the explicit modeling of hidden mental states, which proves more effective than a wide range of baseline methods. Finally, we introduce an algorithm for social world models using S3AP, which enables AI agents to build models of their interlocutors and predict their next actions and mental states. Empirically, S3AP‑enabled social world models yield up to +18% improvement on the SOTOPIA multi‑turn social interaction benchmark. Our findings highlight the promise of S3AP as a powerful, general‑purpose representation for social world states, enabling the development of more socially‑aware systems that better navigate social interactions.

Abstract:
Achieving human‑like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision‑language models (VLMs) excel in static scene understanding, their limitations in spatio‑temporal reasoning and adaptation to dynamic, open‑set tasks like task‑oriented navigation and embodied question answering (EQA) persist due to inadequate modeling of fine‑grained spatio‑temporal cues and physical world comprehension. To address this, we propose VEME, a novel cross‑modal alignment method that enhances generalization in unseen scenes by learning an ego‑centric, experience‑centered world model. Our framework integrates three key components: (1) a cross‑modal alignment framework bridging objects, spatial representations, and visual semantics with spatio‑temporal cues to enhance VLM in‑context learning; (2) a dynamic, implicit cognitive map activated by world embedding to enable task‑relevant geometric‑semantic memory recall; and (3) an instruction‑based navigation and reasoning framework leveraging embodied priors for long‑term planning and efficient exploration. By embedding geometry‑aware spatio‑temporal episodic experiences, our method significantly improves reasoning and planning in dynamic environments. Experimental results on VSI‑Bench and VLN‑CE demonstrate 1%‑3% accuracy and exploration efficiency improvement compared to traditional approaches.

Abstract:
The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models ‑‑ revealing how structured, language‑compatible representations might enable human‑machine collaborative learning.

Abstract:
While video‑generation‑based embodied world models have gained increasing attention, their reliance on large‑scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long‑horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine‑grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision‑Language Model (VLM) planner and a Start‑Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed‑loop control and supports compositional generalization of primitive‑level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine‑grained physical interaction and high‑level reasoning, paving the way toward scalable, interpretable, and general‑purpose embodied intelligence.

Abstract:
Real‑world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics‑Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent‑environment interactions. By training a self‑supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI's latent space enables counterfactual consistency: Perturbing a gravity‑encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context‑unaware baselines, often surpassing context‑aware baselines in extrapolation tasks, enabling zero‑shot generalization to unseen contextual variations.

Abstract:
Large Language Models (LLMs) exhibit emergent capabilities in structured domains, suggesting they may implicitly internalize high‑fidelity representations of world models. While probing techniques have shown promising signs of this in scientific and game‑based settings, they rely on model‑specific internal activations, which limit interpretability and generalizability. In this work, we propose a model‑agnostic, state‑based evaluation framework using chess as a benchmark to assess whether LLMs preserve the semantics of structured environments. Our method analyzes the downstream legal move distributions (state affordances) to estimate semantic fidelity between predicted and actual game states. This approach offers a more meaningful evaluation than conventional string‑based metrics by aligning more closely with the strategic and rule‑governed nature of chess. Experimental results demonstrate that our metrics capture deficiencies in state‑tracking, highlighting limitations of LLMs in maintaining coherent internal models over long sequences. Our framework provides a robust tool for evaluating structured reasoning in LLMs without requiring internal model access, and generalizes to a wide class of symbolic environments.
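
One way to picture a state‑affordance comparison is to score the overlap between the legal moves available in the predicted state and those available in the actual state. The sketch below uses Jaccard similarity over move sets; the choice of Jaccard, the function name, and the hard‑coded move lists are assumptions for illustration, since the abstract does not specify the exact metric. In practice the move sets would come from a chess rules engine.

```python
def affordance_similarity(predicted_moves, actual_moves):
    """Jaccard overlap between two sets of legal moves (1.0 means identical affordances)."""
    predicted, actual = set(predicted_moves), set(actual_moves)
    if not predicted and not actual:
        return 1.0  # both states offer no legal moves, e.g. the same terminal position
    return len(predicted & actual) / len(predicted | actual)


# Hypothetical legal-move sets in UCI notation for a predicted and an actual board state.
actual = ["e2e4", "d2d4", "g1f3"]
predicted = ["e2e4", "d2d4", "b1c3"]
print(affordance_similarity(predicted, actual))  # 0.5
```

Two board strings that differ by a single character can have very different legal moves, which is why an affordance‑based score can expose state‑tracking errors that string matching hides.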

Abstract:
We study the usage of language models (LMs) for planning over world models specified in the Planning Domain Definition Language (PDDL). We prompt LMs to generate Python programs that serve as generalised policies for solving PDDL problems from a given domain. Notably, our approach synthesises policies that are provably sound relative to the PDDL domain without reliance on external verifiers. We conduct experiments on competition benchmarks which show that our policies can solve more PDDL problems than PDDL planners and recent LM approaches within fixed time and memory constraints. Our approach manifests in the LMPlan planner which can solve planning problems with several hundreds of relevant objects. Surprisingly, we observe that LMs used in our framework sometimes plan more effectively over PDDL problems written in meaningless symbols in place of natural language; e.g. rewriting (at dog kitchen) as (p2 o1 o3). This finding challenges hypotheses that LMs reason over word semantics and memorise solutions from their training corpora, and is worth further exploration.
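
The symbol rewriting mentioned above can be illustrated with a small sketch that replaces each symbol in a PDDL atom according to a lookup table. The function name and the mapping are hypothetical; the mapping simply reproduces the example from the abstract and is not taken from the LMPlan code.

```python
import re


def obfuscate_atom(atom, symbol_map):
    """Replace each symbol in a PDDL atom via symbol_map; unmapped symbols are kept."""
    # A symbol is any run of characters that is neither whitespace nor a parenthesis.
    return re.sub(r"[^\s()]+", lambda m: symbol_map.get(m.group(0), m.group(0)), atom)


# Hypothetical mapping reproducing the example in the abstract.
symbol_map = {"at": "p2", "dog": "o1", "kitchen": "o3"}
print(obfuscate_atom("(at dog kitchen)", symbol_map))  # (p2 o1 o3)
```

Applying one consistent mapping across the domain and problem files preserves the planning structure while removing any natural‑language meaning from the names.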

Abstract:
Training robot policies within a learned world model is trending due to the inefficiency of real‑world interactions. The established image‑based world models and policies have shown prior success, but lack the robust geometric information needed for consistent spatial and physical understanding of the three‑dimensional world, even when pre‑trained on internet‑scale video sources. To this end, we propose a novel branch of world models named Gaussian World Model (GWM) for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine‑grained scene‑level future state reconstruction with Gaussian Splatting. GWM can not only enhance the visual representation for imitation learning agents by self‑supervised future prediction training, but can also serve as a neural simulator that supports model‑based reinforcement learning. Both simulated and real‑world experiments show that GWM can precisely predict future scenes conditioned on diverse robot actions, and can be further utilized to train policies that outperform the state‑of‑the‑art by impressive margins, showcasing the initial data scaling potential of 3D world models.

Abstract:
Generation‑driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training‑free hierarchical acceleration framework tailored for efficient world models. Owing to the multi‑modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch‑wise refresh mechanism efficiently selects tokens for recomputation. With patch‑wise sampling and frequency‑aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed‑forward networks. Our experiments show that HERO achieves a 1.73× speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
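
As a rough illustration of estimating features by linear extrapolation instead of recomputing them, the sketch below predicts the next feature vector from the two most recent ones. The first‑order formula, the assumption of equally spaced steps, and the function name are illustrative choices; the abstract does not state which steps or coefficients HERO actually uses.

```python
def extrapolate_features(prev2, prev1):
    """First-order extrapolation: f_t ≈ f_{t-1} + (f_{t-1} - f_{t-2}), element-wise."""
    if len(prev2) != len(prev1):
        raise ValueError("feature vectors must have the same length")
    return [2 * b - a for a, b in zip(prev2, prev1)]


# Hypothetical deep-layer features cached from the two previous denoising steps.
f_t_minus_2 = [1.0, 2.0]
f_t_minus_1 = [2.0, 2.5]
print(extrapolate_features(f_t_minus_2, f_t_minus_1))  # [3.0, 3.0]
```

The estimate costs one multiply and one subtract per element, which is why skipping attention and feed‑forward computation in this way can be cheap when features change smoothly between steps.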

Abstract:
World models have been widely utilized in robotics, gaming, and auto‑driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which can predict the user's emotion, sentiment, intention, and future utterances. By defining a POMDP, we argue that emotion, sentiment, and intention can be modeled as the user belief and solved by maximizing the information bottleneck. With this user belief modeling, we apply the model‑based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state‑of‑the‑art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this approach maintains a reasonable exploration‑exploitation balance and also transfers well to out‑of‑domain scenarios such as empathetic dialogues.

Abstract:
Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so‑called "world models". In this work, we investigate the effects of existing fine‑tuning video generation approaches on structured driving datasets and uncover a potential trade‑off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine‑grained dynamic behavior. As a result, fine‑tuning encourages the model to prioritize surface‑level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.

Abstract:
Purposeful behavior is a hallmark of natural and artificial intelligence. Its acquisition is often believed to rely on world models, comprising both descriptive (what is) and prescriptive (what is desirable) aspects that identify and evaluate states of affairs in the world, respectively. Canonical computational accounts of purposeful behavior, such as reinforcement learning, posit distinct components of a world model comprising a state representation (descriptive aspect) and a reward function (prescriptive aspect). However, an alternative possibility, which has not yet been computationally formulated, is that these two aspects instead co‑emerge interdependently from an agent's goal. Here, we describe a computational framework of goal‑directed state representation in cognitive agents, in which the descriptive and prescriptive aspects of a world model co‑emerge from agent‑environment interaction sequences, or experiences. Drawing on Buddhist epistemology, we introduce a construct of goal‑directed, or telic, states, defined as classes of goal‑equivalent experience distributions. Telic states provide a parsimonious account of goal‑directed learning in terms of the statistical divergence between behavioral policies and desirable experience features. We review empirical and theoretical literature supporting this novel perspective and discuss its potential to provide a unified account of behavioral, phenomenological and neural dimensions of purposeful behaviors across diverse substrates.

Abstract:
The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long‑horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP‑Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real‑world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution‑based evaluators, including format evaluators for agent format compliance, static evaluators for time‑invariant content matching, and dynamic evaluators that automatically retrieve real‑time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT‑5 (43.72%), Grok‑4 (33.33%) and Claude‑4.0‑Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long‑context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown‑tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise‑level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open‑source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.

Abstract:
We present 4DNeX, the first feed‑forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi‑frame video inputs, 4DNeX enables efficient, end‑to‑end image‑to‑4D generation by fine‑tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX‑10M, a large‑scale dataset with high‑quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high‑quality dynamic point clouds that enable novel‑view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image‑to‑4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.

Abstract:
Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real‑time performance. Consequently, they struggle to simulate real‑world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix‑Game 2.0, an interactive world model that generates long videos on‑the‑fly via few‑step auto‑regressive diffusion. Our framework consists of three key components: (1) A scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) An action injection module that enables frame‑level mouse and keyboard inputs as interactive conditions; (3) A few‑step distillation based on the causal architecture for real‑time and streaming video generation. Matrix‑Game 2.0 can generate high‑quality minute‑level videos across diverse scenes at an ultra‑fast speed of 25 FPS. We open‑source our model weights and codebase to advance research in interactive world modeling.

Abstract:
Multi‑agent path finding (MAPF) is the problem of planning conflict‑free paths from the designated start locations to goal positions for multiple agents. It underlies a variety of real‑world tasks, including multi‑robot coordination, robot‑assisted logistics, and social navigation. Recent decentralized learnable solvers have shown great promise for large‑scale MAPF, especially when leveraging foundation models and large datasets. However, these agents are reactive policy models and exhibit limited modeling of environmental temporal dynamics and inter‑agent dependencies, resulting in performance degradation in complex, long‑term planning scenarios. To address these limitations, we propose MAPF‑World, an autoregressive action world model for MAPF that unifies situation understanding and action generation, guiding decisions beyond immediate local observations. It improves situational awareness by explicitly modeling environmental dynamics, including spatial features and temporal dependencies, through future state and actions prediction. By incorporating these predicted futures, MAPF‑World enables more informed, coordinated, and far‑sighted decision‑making, especially in complex multi‑agent settings. Furthermore, we augment MAPF benchmarks by introducing an automatic map generator grounded in real‑world scenarios, capturing practical map layouts for training and evaluating MAPF solvers. Extensive experiments demonstrate that MAPF‑World outperforms state‑of‑the‑art learnable solvers, showcasing superior zero‑shot generalization to out‑of‑distribution cases. Notably, MAPF‑World is trained with a 96.5% smaller model size and 92% reduced data.

Abstract:
Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision‑Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi‑modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action‑level decisions with high‑fidelity pixel‑level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end‑to‑end autonomous driving framework that integrates a VLM‑based driving agent with a DWM‑based scene imaginer to form a unified imagination‑and‑planning loop. The driving agent predicts initial driving trajectories based on multi‑modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open‑loop and closed‑loop conditions.

Abstract:
Grasping is a fundamental task in robot‑assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as low signal‑to‑noise ratio in visual observations, demands for high safety and millimeter‑level precision, as well as the complex surgical environment. This paper addresses three key challenges: (i) sim‑to‑real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair ‑‑ the standard RAS setup, and (iii) object‑agnostic grasping with a single policy that generalizes to diverse, unseen surgical objects without retraining or task‑specific models. We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. GASv2 leverages a world‑model‑based architecture and a surgical perception pipeline for visual observations, combined with a hybrid control system for safe execution. We train the policy in simulation using domain randomization for sim‑to‑real transfer and deploy it on a real robot in both phantom‑based and ex vivo surgical settings, using only a single pair of endoscopic cameras. Extensive experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances, demonstrating strong performance, generality, and robustness.

Abstract:
Spatio‑physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio‑physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human‑like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine‑tuning followed by rule‑based reinforcement learning to Qwen2.5‑VL‑7B, resulting in significant improvements in spatio‑physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited ‑‑ underscoring the pressing need for new approaches in spatio‑physical reasoning.

Abstract:
With the advent of Joint Embedding Predictive Architectures (JEPAs), which appear to be more capable than reconstruction‑based methods, this paper introduces a novel technique for creating world models using continuous‑time dynamic systems from arbitrary observation data. The proposed method integrates sequence embeddings with neural ordinary differential equations (neural ODEs). It employs loss functions that enforce contractive embeddings and Lipschitz constants in state transitions to construct a well‑organized latent state space. The approach's effectiveness is demonstrated through the generation of structured latent state‑space models for a simple pendulum system using only image data. This opens up a new technique for developing more general control algorithms and estimation techniques with broad applications in robotics.

Abstract:
Embodied AI aims to develop intelligent systems with physical forms capable of perceiving, decision‑making, acting, and learning in real‑world environments, providing a promising path toward Artificial General Intelligence (AGI). Despite decades of exploration, it remains challenging for embodied agents to achieve human‑level intelligence for general‑purpose tasks in open dynamic environments. Recent breakthroughs in large models have revolutionized embodied AI by enhancing perception, interaction, planning and learning. In this article, we provide a comprehensive survey on large model empowered embodied AI, focusing on autonomous decision‑making and embodied learning. We investigate both hierarchical and end‑to‑end decision‑making paradigms, detailing how large models enhance high‑level planning, low‑level execution, and feedback for hierarchical decision‑making, and how large models enhance Vision‑Language‑Action (VLA) models for end‑to‑end decision‑making. For embodied learning, we introduce mainstream learning methodologies, elaborating in depth on how large models enhance imitation learning and reinforcement learning. For the first time, we integrate world models into the survey of embodied AI, presenting their design methods and critical roles in enhancing decision‑making and learning. Though solid advances have been achieved, challenges remain; we discuss them at the end of this survey as potential directions for further research.

Abstract:
Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously across diverse, dynamic environments. Central to this vision are world models, which act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi‑step actions with foresight. This proactive nature allows agents to anticipate potential outcomes and optimize decisions ahead of real‑world interactions. While prior works in robotics and gaming have showcased the potential of world models, their integration into the wireless edge for EGI remains underexplored. This survey bridges this gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge. We first examine the architectural foundations of world models, including latent representation learning, dynamics modeling, and imagination‑based planning. Building on these core capabilities, we illustrate their proactive applications across EGI scenarios such as vehicular networks, unmanned aerial vehicle (UAV) networks, Internet of Things (IoT) systems, and network functions virtualization, thereby highlighting how they can enhance optimization under latency, energy, and privacy constraints. We then explore their synergy with foundation models and digital twins, positioning world models as the cognitive backbone of EGI. Finally, we highlight open challenges, such as safety guarantees, efficient training, and constrained deployment, and outline future research directions. This survey provides both a conceptual foundation and a practical roadmap for realizing the next generation of intelligent, autonomous edge systems.

Abstract:
Autonomous robots that rely on deep neural network controllers pose critical challenges for safety prediction, especially under partial observability and distribution shift. Traditional model‑based verification techniques are limited in scalability and require access to low‑dimensional state models, while model‑free methods often lack reliability guarantees. This paper addresses these limitations by introducing a framework for calibrated safety prediction in end‑to‑end vision‑controlled systems, where neither the state‑transition model nor the observation model is accessible. Building on the foundation of world models, we leverage variational autoencoders and recurrent predictors to forecast future latent trajectories from raw image sequences and estimate the probability of satisfying safety properties. We distinguish between monolithic and composite prediction pipelines and introduce a calibration mechanism to quantify prediction confidence. In long‑horizon predictions from high‑dimensional observations, the forecasted inputs to the safety evaluator can deviate significantly from the training distribution due to compounding prediction errors and changing environmental conditions, leading to miscalibrated risk estimates. To address this, we incorporate unsupervised domain adaptation to ensure robustness of safety evaluation under distribution shift in predictions without requiring manual labels. Our formulation provides theoretical calibration guarantees and supports practical evaluation across long prediction horizons. Experimental results on three benchmarks show that our UDA‑equipped evaluators maintain high accuracy and substantially lower false positive rates under distribution shift. Similarly, world model‑based composite predictors outperform their monolithic counterparts on long‑horizon tasks, and our conformal calibration provides reliable statistical bounds.
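
The conformal calibration mentioned above can be illustrated with a generic split conformal sketch. This is not the paper's exact procedure: it assumes a held‑out calibration set of nonconformity scores (for example, the absolute gap between a predicted safety probability and the observed 0/1 outcome), exchangeability between calibration and test data, and hypothetical function names.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Split conformal threshold with finite-sample correction.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def safety_interval(predicted_prob, q):
    """Interval around a predicted safety probability, clipped to [0, 1]."""
    return max(0.0, predicted_prob - q), min(1.0, predicted_prob + q)
```

Under the exchangeability assumption, the interval built from `conformal_quantile(scores, alpha)` covers the true outcome with probability at least `1 - alpha` on average. Distribution shift breaks that assumption, which is the situation the abstract addresses with unsupervised domain adaptation.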

Abstract:
Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents' in‑context RL (ICRL) ability, this work formulates ICRL as a two‑agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by decoupling latent representation learning from control. In CORAL, an Information Agent (IA) is pre‑trained as a world model on a diverse distribution of tasks. Its objective is not to maximize task reward, but to build a world model and distill its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero‑shot adaptation with the help of pre‑trained IA in entirely unseen sparse‑reward environments, validating the efficacy of learning a transferable communicative representation.

Abstract:
Vision‑Language‑Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open‑loop setup, which tends to reproduce the behaviors recorded in the dataset, leading to suboptimal and constrained performance; (2) closed‑loop training relies heavily on high‑fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL‑VLA, a novel closed‑loop reinforcement learning approach that combines a reward world model learned via inverse reinforcement learning with a self‑built VLA policy. Our framework proceeds in three stages. In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed‑loop reward computation. In the third stage, to further enhance planning performance, we design specialized reinforcement learning guided by the reward world model, using Proximal Policy Optimization (PPO), to effectively balance safety incidents, driving comfort, and traffic efficiency. Our approach achieves state‑of‑the‑art performance on the NAVSIM v2 end‑to‑end driving benchmark and was 1st runner‑up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed‑loop autonomous driving.

Abstract:
Recent work on visual world models shows significant promise in latent state dynamics obtained from pre‑trained image backbones. However, most of the current approaches are sensitive to training quality, requiring near‑complete coverage of the action and state space during training to prevent divergence during inference. To make a model‑based planning algorithm more robust to the quality of the learned world model, we propose in this work to use a variational autoencoder as a novelty detector to ensure that proposed action trajectories during planning do not cause the learned model to deviate from the training data distribution. To evaluate the effectiveness of this approach, a series of experiments in challenging simulated robot environments was carried out, with the proposed method incorporated into a model‑predictive control policy loop extending the DINO‑WM architecture. The results clearly show that the proposed method improves over state‑of‑the‑art solutions in terms of data efficiency.

Abstract:
This paper argues that Active Inference (AIF) provides a crucial foundation for developing autonomous AI agents capable of learning from experience without continuous human reward engineering. As AI systems begin to exhaust high‑quality training data and rely on increasingly large human workforces for reward design, the current paradigm faces significant scalability challenges that could impede progress toward genuinely autonomous intelligence. The proposal for an "Era of Experience," where agents learn from self‑generated data, is a promising step forward. However, this vision still depends on extensive human engineering of reward functions, effectively shifting the bottleneck from data curation to reward curation. This highlights what we identify as the grounded‑agency gap: the inability of contemporary AI systems to autonomously formulate, adapt, and pursue objectives in response to changing circumstances. We propose that AIF can bridge this gap by replacing external reward signals with an intrinsic drive to minimize free energy, allowing agents to naturally balance exploration and exploitation through a unified Bayesian objective. By integrating Large Language Models as generative world models with AIF's principled decision‑making framework, we can create agents that learn efficiently from experience while remaining aligned with human values. This synthesis offers a compelling path toward AI systems that can develop autonomously while adhering to both computational and physical constraints.

Abstract:
Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free‑form natural language inputs, we parse instructions into ego‑centric scene graphs, which condition a tri‑branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine‑grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene‑, object‑, and sequence‑level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state‑of‑the‑art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.

Abstract:
Fine‑tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real‑world interactions, posing a major bottleneck for practical fine‑tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL‑based updates, its strong dependence on environment interaction remains highly inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine‑tuning diffusion‑based robotic skills entirely offline with reinforcement learning. Unlike model‑free approaches that require millions of environment interactions to fine‑tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real‑world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model‑free baselines. To our knowledge, this is the first demonstration of fine‑tuning diffusion policies for real‑world robotic skills using an offline world model. We make the code publicly available at https://diwa.cs.uni‑freiburg.de.

Abstract:
Robust coordination is critical for effective decision‑making in multi‑agent systems, especially under partial observability. A central question in Multi‑Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end‑to‑end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task‑allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end‑to‑end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal‑directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model‑based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal‑driven coordination.

Abstract:
There is a broad consensus that the inability to form long‑term plans is one of the key limitations of current foundation models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. From a world‑model perspective, each instance induces a fully specified transition model (dynamics) over states and actions, enabling evaluation of planning with verifiable outcomes. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for evaluating planning capabilities. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP‑complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establish the computational complexity result, and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM‑assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM‑based approaches.
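
As an illustration of the transition model that a Countdown instance induces, here is a minimal brute‑force solver sketch. It is not the benchmark's generator or evaluation code. The rule set is an assumption (positive integer inputs, each number used at most once, not all numbers required, intermediate results kept as positive integers, division only when exact), and `solve_countdown` is a hypothetical name.

```python
def solve_countdown(numbers, target):
    """Return an expression string that evaluates to target, or None.

    A state is a list of (value, expression) pairs; an action picks two
    entries and replaces them with the result of one arithmetic operation.
    """

    def search(items):
        # Goal test: any value in the current state equals the target.
        for value, expr in items:
            if value == target:
                return expr
        # Expand every ordered pair of distinct entries.
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                successors = []
                if i < j:  # addition and multiplication are commutative
                    successors.append((a + b, f"({ea} + {eb})"))
                    successors.append((a * b, f"({ea} * {eb})"))
                if a > b:  # keep intermediate results positive
                    successors.append((a - b, f"({ea} - {eb})"))
                if b != 0 and a % b == 0:  # exact division only
                    successors.append((a // b, f"({ea} / {eb})"))
                for successor in successors:
                    found = search(rest + [successor])
                    if found is not None:
                        return found
        return None

    return search([(n, str(n)) for n in numbers])
```

Each recursive call shrinks the state by one entry, so the search terminates, but the number of states grows very quickly with the number of inputs, consistent with the hardness the abstract describes. A returned expression can be checked independently by evaluating it, which is what makes outcomes verifiable.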

Abstract:
World models have become increasingly popular as learned traffic simulators. Recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we examine the robustness of existing metrics for evaluating world models as traffic simulators, to see whether the same metrics are suitable for evaluating a world model as a pseudo‑environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim‑Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego action‑conditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top‑ranking models perform well under no perturbation but fail when the ego agent is forced to replay the original trajectory. To address these cases, we propose new metrics that highlight the sensitivity of world models to uncontrollable objects and that evaluate the performance of world models as pseudo‑environments for policy training. We then analyze some state‑of‑the‑art world models under these new metrics.

Abstract:
Power is a key concept in AI safety: power‑seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human‑AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality‑ and risk‑averse long‑term aggregate of human power. It takes into account humans' bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi‑agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub‑goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility‑based objectives.

Abstract:
Lane segment topology reasoning provides comprehensive bird's‑eye view (BEV) road scene understanding, which can serve as a key perception module in planning‑oriented end‑to‑end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream‑based temporal propagation methods have demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, they remain limited by over‑reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast‑slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane‑V2 benchmark demonstrate that FASTopoWM outperforms state‑of‑the‑art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).

Abstract:
Performance monitoring is essential for safe clinical deployment of image classification models. However, because ground‑truth labels are typically unavailable in the target dataset, direct assessment of real‑world model performance is infeasible. State‑of‑the‑art performance estimation methods address this by leveraging confidence scores to estimate the target accuracy. Despite being a promising direction, the established methods mainly estimate the model's accuracy and are rarely evaluated in a clinical domain, where strong class imbalances and dataset shifts are common. Our contributions are twofold. First, we introduce generalisations of existing performance prediction methods that directly estimate the full confusion matrix. Second, we benchmark their performance on chest x‑ray data under real‑world distribution shifts as well as simulated covariate and prevalence shifts. The proposed confusion matrix estimation methods reliably predicted clinically relevant counting metrics on medical images under distribution shifts. However, our simulated shift scenarios exposed important failure modes of current performance estimation techniques, calling for a better understanding of real‑world deployment contexts when implementing these performance monitoring techniques for postmarket surveillance of medical AI models.
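
One simple way to picture label‑free confusion matrix estimation is to extend the idea of average confidence from accuracy to all four cells of a binary confusion matrix. The sketch below is an illustrative assumption, not necessarily one of the estimators proposed in the paper: it treats each predicted positive‑class probability as calibrated on the target data and accumulates expected counts. The function name and the 0.5 threshold are hypothetical.

```python
def estimate_confusion_matrix(positive_probs, threshold=0.5):
    """Expected binary confusion matrix from unlabeled predictions.

    Each sample with positive-class probability p contributes p to the
    "truly positive" cell and 1 - p to the "truly negative" cell of the
    row selected by its thresholded prediction. Assumes p is calibrated.
    """
    tp = fp = fn = tn = 0.0
    for p in positive_probs:
        if p >= threshold:  # predicted positive
            tp += p
            fp += 1.0 - p
        else:  # predicted negative
            fn += p
            tn += 1.0 - p
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

Metrics such as sensitivity or precision can then be derived from the estimated cells. The estimate is only as good as the calibration assumption, which can fail under covariate or prevalence shift, in line with the failure modes the abstract reports.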

Abstract:
Planning in modern LLM agents relies on using the LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. The LLM's static internal world model therefore becomes progressively misaligned with the true underlying state of the world, leading to divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co‑evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code‑based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.

Abstract:
Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real‑world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause‑and‑effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state‑of‑the‑art LLMs on zero‑shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa‑Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.

Abstract:
We present DINO‑world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre‑trained image encoder and training a future predictor on a large‑scale uncurated video dataset, DINO‑world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO‑world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine‑tune the predictor on observation‑action trajectories. The resulting action‑conditioned world model can be used for planning by simulating candidate trajectories in latent space.
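
Planning by simulating candidate trajectories in latent space can be sketched with a generic random‑shooting planner. This is a toy illustration rather than DINO‑world's planner: the latent state is a single float, `predict_next` stands in for a learned action‑conditioned predictor, and the cost (distance of the final latent to a goal latent), the action range, and all names are assumptions.

```python
import random


def plan_by_random_shooting(z0, goal, predict_next, horizon=5,
                            num_candidates=256, seed=0):
    """Sample action sequences, roll each out in latent space, keep the best.

    predict_next(z, a) is a hypothetical stand-in for a learned predictor
    that maps the current latent z and action a to the next latent.
    """
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z = z0
        for a in actions:  # imagined rollout, no environment interaction
            z = predict_next(z, a)
        cost = abs(z - goal)  # distance of the final latent to the goal
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost
```

In a model‑predictive control loop, only the first action of the best sequence would typically be executed before replanning from the newly observed state.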

Abstract:
Self‑supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video‑distilled single‑image encoder trained to predict the next‑frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre‑training on a single 2‑hour video, our approach raises the mean Intersection‑over‑Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop‑in replacement for image‑only pipelines. Our results highlight video self‑distillation as a lightweight route to geometry‑aware perception, an essential ingredient for physically plausible world models and Physical AI.

Abstract:
Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user‑item interaction data. While privacy‑enhancing techniques are actively studied in the research community, real‑world model development still depends on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy‑aware RecSys model development and deployment. We propose a membership‑inference attack (MIA)‑based privacy scoring method, RecPS, to measure privacy risks at both the interaction and user levels. The RecPS interaction‑level score definition is motivated and derived from differential privacy, which is then extended to the user‑level scoring method. A critical component is the interaction‑level MIA method RecLiRA, which gives high‑quality membership estimation. We have conducted extensive experiments on well‑known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning.

Abstract:
Do large language models (LLMs) construct and manipulate internal world models, or do they rely solely on statistical associations represented as output layer token probabilities? We adapt cognitive science methodologies from human mental models research to test LLMs on pulley system problems using TikZ‑rendered stimuli. Study 1 examines whether LLMs can estimate mechanical advantage (MA). State‑of‑the‑art models performed marginally but significantly above chance, and their estimates correlated significantly with ground‑truth MA. Significant correlations between number of pulleys and model estimates suggest that models employed a pulley counting heuristic, without necessarily simulating pulley systems to derive precise values. Study 2 tested this by probing whether LLMs represent global features crucial to MA estimation. Models evaluated a functionally connected pulley system against a fake system with randomly placed components. Without explicit cues, models identified the functional system as having greater MA with F1=0.8, suggesting LLMs could represent systems well enough to differentiate jumbled from functional systems. Study 3 built on this by asking LLMs to compare functional systems with matched systems which were connected up but which transferred no force to the weight; LLMs identified the functional system with F1=0.46, suggesting random guessing. Insofar as they may generalize, these findings are compatible with the notion that LLMs manipulate internal world models, sufficient to exploit statistical associations between pulley count and MA (Study 1), and to approximately represent system components' spatial relations (Study 2). However, they may lack the facility to reason over nuanced structural connectivity (Study 3). We conclude by advocating the utility of cognitive scientific methods to evaluate the world‑modeling capacities of artificial intelligence systems.

Abstract:
Language models are often said to face a symbol grounding problem. While some have argued the problem can be solved without resort to other modalities, many have speculated that grounded learning is more efficient. We explore this question in Othello, a simplified, rule‑based world that offers a controlled and interpretable testbed for studying world understanding. Building on prior work, we introduce VISOTHELLO, a multi‑modal model trained jointly on move sequences and board images. Using the Othello rule understanding task, we examine whether multi‑modal learning provides advantages over text‑only approaches. We further evaluate robustness under semantically irrelevant perturbations and analyze the consistency of cross‑modal alignment. Our results suggest that multi‑modal training not only improves performance and robustness but also promotes convergence toward shared internal representations across different model architectures.

Abstract:
Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem‑solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures ‑‑ we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human‑like rapid adaptation and robust generalization ‑‑ a critical component of artificial general intelligence.

Abstract:
Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high‑quality training data and the frequent absence of crucial object‑level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain‑informed prompts to create high‑resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio‑temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real‑world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety‑critical autonomous driving applications.

Abstract:
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea ‑‑ a "Model Synthesis Architecture" (MSA) ‑‑ using language models to implement global relevance‑based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset ‑‑ built around a "Model Olympics" domain of sports vignettes ‑‑ tests models' capacity for human‑like, open‑ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model‑only baselines, under both direct and chain‑of‑thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open‑ended domains.

Abstract:
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open‑ended task solving in embodied environments in a reward‑free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal‑conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi‑task offline visual control benchmarks, excelling in capturing the deep‑level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground‑truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder‑rl.

Abstract:
LinkedIn, one of the world's largest platforms for professional networking and job seeking, encounters various modeling challenges in building recommendation systems for its job matching product, including cold‑start, filter bubbles, and biases affecting candidate‑job matching. To address these, we developed the STAR (Signal Integration for Talent And Recruiters) system, leveraging the combined strengths of Large Language Models (LLMs) and Graph Neural Networks (GNNs). LLMs excel at understanding textual data, such as member profiles and job postings, while GNNs capture intricate relationships and mitigate cold‑start issues through network effects. STAR integrates diverse signals by uniting LLM and GNN capabilities with industrial‑scale paradigms including adaptive sampling and version management. It provides an end‑to‑end solution for developing and deploying embeddings in large‑scale recommender systems. Our key contributions include a robust methodology for building embeddings in industrial applications, a scalable GNN‑LLM integration for high‑performing recommendations, and practical insights for real‑world model deployment.

Abstract:
Accurate modeling and simulation of mobile networks are essential for enabling intelligent and cost‑effective network optimization. In this paper, we propose MobiWorld, a generative world model designed to support high‑fidelity and flexible environment simulation for mobile network planning and optimization. Unlike traditional predictive models constrained by limited generalization capabilities, MobiWorld exhibits strong universality by integrating heterogeneous data sources, including sensors, mobile devices, and base stations, as well as multimodal data types such as sequences and images. It is capable of generating both network element‑level observations (e.g., traffic load, user distribution) and system‑level performance indicators (e.g., throughput, energy consumption) to support a wide range of planning and optimization tasks. Built upon advanced diffusion models, MobiWorld offers powerful controllable generation capabilities by modeling the joint distribution between mobile network data and diverse conditional factors including spatio‑temporal contexts, user behaviors, and optimization policies. This enables accurate simulation of dynamic network states under varying policy configurations, providing optimization agents with precise environmental feedback and facilitating effective decision‑making without relying on costly real‑network interactions. We demonstrate the effectiveness of MobiWorld in a collaborative energy‑saving scenario, where an agent uses observations and rewards generated by MobiWorld to optimize base station sleep and user offloading policies. Experimental results show that MobiWorld exhibits strong controllable generation performance and outperforms traditional methods in energy optimization.

Abstract:
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow‑The‑Leader (FTL) shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2 D \log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming an FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model‑planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
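
The idea that a Follow‑The‑Leader world model has nothing to forget can be illustrated with a minimal sketch. The code below is not the paper's model: it assumes a hypothetical scalar linear system s_next = a*s + b*u and refits it by (lightly ridge‑regularised) least squares over running sufficient statistics. The names FTLLinearModel, update, weights, true_step, and fit are invented for this illustration.

```python
# Hypothetical illustration only: Follow-The-Leader (FTL) fitting of a scalar
# linear dynamics model s_next ~ a*s + b*u. After every transition the model
# is the least-squares minimiser over ALL data seen so far. A tiny ridge term
# is an added assumption that keeps the 2x2 system solvable early on.

class FTLLinearModel:
    def __init__(self, ridge=1e-6):
        # Running sufficient statistics: Gram matrix entries and
        # feature/target correlations.
        self.g_ss = ridge
        self.g_su = 0.0
        self.g_uu = ridge
        self.c_s = 0.0
        self.c_u = 0.0

    def update(self, s, u, s_next):
        # Incremental update: add one transition to the statistics.
        self.g_ss += s * s
        self.g_su += s * u
        self.g_uu += u * u
        self.c_s += s * s_next
        self.c_u += u * s_next

    def weights(self):
        # Solve the 2x2 normal equations in closed form.
        det = self.g_ss * self.g_uu - self.g_su ** 2
        a = (self.g_uu * self.c_s - self.g_su * self.c_u) / det
        b = (self.g_ss * self.c_u - self.g_su * self.c_s) / det
        return a, b

    def predict(self, s, u):
        a, b = self.weights()
        return a * s + b * u


def true_step(s, u, a=0.9, b=0.5):
    # Assumed ground-truth dynamics used to generate toy data.
    return a * s + b * u


def fit(pairs):
    # Feed (state, action) pairs to a fresh model one transition at a time.
    model = FTLLinearModel()
    for s, u in pairs:
        model.update(s, u, true_step(s, u))
    return model


# Two toy "tasks" that visit different parts of the state-action space.
task_a = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
task_b = [(-1.0, 2.0), (3.0, -1.0)]

forward = fit(task_a + task_b).weights()   # task A first, then task B
backward = fit(task_b + task_a).weights()  # task B first, then task A
```

Because the weights are recomputed from statistics of every transition seen so far, the estimate is the same (up to floating‑point rounding) whichever task arrives first, which is the sense in which such a learner cannot overwrite earlier experience. A planner such as model predictive control would then query `predict` on the latest model.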

Abstract:
What drives an agent to explore the world while also maintaining control over the environment? From a child at play to scientists in the lab, intelligent agents must balance curiosity (the drive to seek knowledge) with competence (the drive to master and control the environment). Bridging cognitive theories of intrinsic motivation with reinforcement learning, we ask how evolving internal representations mediate the trade‑off between curiosity (novelty or information gain) and competence (empowerment). We compare two model‑based agents using handcrafted state abstractions (Tabular) or learning an internal world model (Dreamer). The Tabular agent shows curiosity and competence guide exploration in distinct patterns, while prioritizing both improves exploration. The Dreamer agent reveals a two‑way interaction between exploration and representation learning, mirroring the developmental co‑evolution of curiosity and competence. Our findings formalize adaptive exploration as a balance between pursuing the unknown and the controllable, offering insights for cognitive theories and efficient reinforcement learning.

Abstract:
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high‑quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) a data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high‑fidelity multiview 3D video sequences; and 2) a Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric‑scale resolution. MarsGen, fine‑tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.

Abstract:
We show that deep neural networks, including transformers and RNNs, pretrained as usual on next‑token prediction, intrinsically discover and represent beliefs over 'quantum' and 'post‑quantum' low‑dimensional generative models of their training data ‑‑ as if performing iterative Bayesian updates over the latent state of this world model during inference as they observe more context. Notably, neural nets easily find these representations, whereas there is no finite classical circuit that would do the job. The corresponding geometric relationships among neural activations induced by different input sequences are found to be largely independent of neural‑network architecture. Each point in this geometry corresponds to a history‑induced probability density over all possible futures, and the relative displacement of these points reflects the difference in mechanism and magnitude for how these distinct pasts affect the future.

Abstract:
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task‑specific heuristics that fail to generalize.

Abstract:
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real‑world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion‑based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single‑view RGB‑D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real‑world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real‑world robotic task success rate by 8.6%.

Abstract:
The world model, the supposed algorithmic surrogate of the real‑world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci‑Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general‑purpose world model, based on hierarchical, multi‑level, and mixed continuous/discrete representations, and a generative and self‑supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Abstract:
Surgical action planning requires predicting future instrument‑verb‑target triplets for real‑time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through self‑exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual‑task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP with smooth planning degradation to 29.2% at 10‑second horizons. We evaluated three RL variants: world model‑based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL‑‑world model RL dropped to 3.1% mAP at 10s while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert‑annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.

Abstract:
In this work, we explore the use of compact latent representations with learned time dynamics ('World Models') to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low‑dimensional space and then concatenating to form a history ('Tokenization') is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least‑squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full‑scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto‑Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.

Abstract:
Recent advancements in large language models (LLMs) have significantly improved the capabilities of web agents. However, effectively navigating complex and dynamic web environments still requires more advanced trajectory‑level planning and execution. Prior studies have addressed self‑improving agents by collecting extensive GUI trajectories from real‑environment interactions. Despite their effectiveness, these approaches encounter two critical challenges: (1) Uncontrollable environment states, where real or sandboxed web environments often yield unstable and non‑deterministic feedback, complicating the reproduction and debugging of agent behaviors; and (2) High API costs, as generating even a single interaction trajectory can involve hundreds of queries, leading to considerable API usage and computational expenses. To address these limitations and enable scalable self‑improvement for agents, we propose WebSynthesis, a novel framework for trajectory synthesis and training. WebSynthesis leverages a learned world model to simulate virtual web environments, allowing a policy agent to perform efficient and reversible tree‑based planning. This approach supports the large‑scale generation of diverse and high‑quality trajectories, which are subsequently utilized to refine the agent's policy. Experimental results demonstrate that an agent trained using WebSynthesis on a small‑scale synthetic dataset achieves performance comparable to or even surpassing that of models trained on large‑scale real‑world data.

Abstract:
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as Δ‑IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent's performance. On the Crafter benchmark, EMERALD achieves new state‑of‑the‑art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.

Abstract:
World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object‑centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object‑centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can generalize to more complex settings with diverse textures and cluttered scenes. In this paper, we fill this gap by introducing Dyn‑O, an enhanced structured world model built upon object‑centric representations. Compared to prior work in object‑centric representations, Dyn‑O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we find that our method can learn object‑centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object‑centric features into dynamics‑agnostic and dynamics‑aware components, we enable finer‑grained manipulation of these features and generate more diverse imagined trajectories.

Abstract:
The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real‑world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of data protection and rigorously enforce it. In this perspective, we propose a four‑level taxonomy, including non‑usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade‑offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI‑generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Abstract:
LiDAR‑based world models offer more structured and geometry‑aware representations than their image‑based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. Can we develop LiDAR world models that exhibit strong transferability across multiple domains? We conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse‑beam & dense‑beam adaptation, and (iii) non‑semantic to semantic transfer. Given different amounts of fine‑tuning data, our experiments show that a single pre‑trained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability of dynamic learning significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds the previous semantic occupancy forecasting models with only 5% of the labeled training data required by prior models. We also observed inefficiencies of current LiDAR world models, mainly through their under‑compression of LiDAR data and inefficient training objectives. To address this, we propose a latent conditional flow matching (CFM)‑based framework that achieves state‑of‑the‑art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model achieves SOTA performance on future‑trajectory‑conditioned semantic occupancy forecasting while being 23x more computationally efficient (a 28x FPS speedup); and achieves SOTA performance on semantic occupancy forecasting while being 2x more computationally efficient (a 1.1x FPS speedup).

Abstract:
When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This creates a problem: when building a world model, even subtle shifts in policy or environment states can alter the observed causal mechanisms. In this work, we introduce the Meta‑Causal Graph as a world model, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta‑Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state, which lies in the latent state space. Building on this representation, we introduce a Causality‑Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships through the agent's curiosity‑driven intervention policy, and (3) iteratively refine the Meta‑Causal Graph through ongoing curiosity‑driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.

Abstract:
One classic idea from the cybernetics literature is the Every Good Regulator Theorem (EGRT). The EGRT provides a means to identify good regulation, or the conditions under which an agent (regulator) can match the dynamical behavior of a system. We reevaluate and recast the EGRT in a modern context to provide insight into how intelligent autonomous learning systems might utilize a compressed global representation (world model). One‑to‑one mappings between a regulator (R) and the corresponding system (S) provide a reduced representation that preserves useful variety to match all possible outcomes of a system. The EGRT also extends to second‑order cybernetics, where an internal model (M) observes the behavior of S and supervises an S‑R closed‑loop mapping. Secondarily, we demonstrate how physical phenomena such as temporal criticality, non‑normal denoising, and alternating procedural acquisition can recast behavior as statistical mechanics and yield regulatory relationships. These diverse physical systems challenge the notion of tightly‑coupled good regulation when applied to non‑uniform and out‑of‑distribution phenomena. Overall, we aim to recast the EGRT as a potential approach for developing world models that adapt and respond to a wide range of task environments.

Abstract:
Just like power, water, and transportation systems, wireless networks are a crucial societal infrastructure. As natural and human‑induced disruptions continue to grow, wireless networks must be resilient. This requires them to withstand and recover from unexpected adverse conditions, shocks, unmodeled disturbances and cascading failures. Unlike robustness and reliability, resilience is based on the understanding that disruptions will inevitably happen. Resilience, as elasticity, focuses on the ability to bounce back to favorable states, while resilience as plasticity involves agents and networks that can flexibly expand their states and hypotheses through real‑time adaptation and reconfiguration. This situational awareness and active preparedness, adapting world models and counterfactually reasoning about potential system failures and the best responses, is a core aspect of resilience. This article will first disambiguate resilience from reliability and robustness, before delving into key mathematical foundations of resilience grounded in abstraction, compositionality and emergence. Subsequently, we focus our attention on a plethora of techniques and methodologies pertaining to the unique characteristics of resilience, as well as their applications through a comprehensive set of use cases. Ultimately, the goal of this paper is to establish a unified foundation for understanding, modeling, and engineering resilience in wireless communication systems, while laying a roadmap for the next‑generation of resilient‑native and intelligent wireless systems.

Abstract:
This paper describes our research on AI agents embodied in visual, virtual or physical forms, enabling them to interact with both users and their environments. These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn and act within their surroundings, which makes them more similar to how humans learn and interact with the environments as compared to disembodied agents. We propose that the development of world models is central to reasoning and planning of embodied AI agents, allowing these agents to understand and predict their environment, to understand user intentions and social contexts, thereby enhancing their ability to perform complex tasks autonomously. World modeling encompasses the integration of multimodal perception, planning through reasoning for action and control, and memory to create a comprehensive understanding of the physical world. Beyond the physical world, we also propose to learn the mental world model of users to enable better human‑agent collaboration.

Abstract:
Offline reinforcement learning (RL) has emerged as a prevalent and effective methodology for real‑world recommender systems, enabling learning policies from historical data and capturing user preferences. In offline RL, reward shaping encounters significant challenges, and past efforts have incorporated uncertainty‑based strategies to improve world models or penalize underexplored state‑action pairs. Despite these efforts, a critical gap remains: the simultaneous balancing of intrinsic biases in world models and the diversity of policy recommendations. To address this limitation, we present an innovative offline RL framework termed Reallocated Reward for Recommender Systems (R3S). By integrating inherent model uncertainty to tackle the intrinsic fluctuations in reward predictions, we boost diversity for decision‑making to align with a more interactive paradigm, incorporating extra penalizers with decay that deter actions leading to diminished state variety at both local and global scales. The experimental results demonstrate that R3S improves the accuracy of world models and efficiently harmonizes the heterogeneous preferences of the users.

Abstract:
The goal of traffic simulation is to augment a potentially limited amount of manually‑driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end‑to‑end generative world model trained on a single loss function capable of point A‑to‑B simulation on a city scale integrating all the requirements above. We demonstrate the city‑scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip‑level simulation.

Abstract:
Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision‑Language Models (VLMs), such as OpenAI o3, GPT‑4o and Gemini, exhibit potential as general‑purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs' fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two‑stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM‑ABench, a large‑scale benchmark comprising 23 fine‑grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 of the latest commercial and open‑source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near‑random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding ‑‑ e.g., some models tend to believe blue objects move faster than green ones. Richer results and analyses reveal significant gaps between VLMs and human‑level world modeling.

Abstract:
We introduce Massively Multi‑Task Model‑Based Policy Optimization (M3PO), a scalable model‑based reinforcement learning (MBRL) framework designed to address sample inefficiency in single‑task settings and poor generalization in multi‑task domains. Existing model‑based approaches like DreamerV3 rely on pixel‑level generative models that neglect control‑centric representations, while model‑free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model‑based planning and model‑free uncertainty‑driven bonuses. This eliminates the bias‑variance trade‑off in prior methods by using discrepancies between model‑based and model‑free value estimates to guide exploration, while maintaining stable policy updates through a trust‑region optimizer. M3PO provides an efficient and robust alternative to existing model‑based policy optimization approaches and achieves state‑of‑the‑art performance across multiple benchmarks.
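
The abstract above describes guiding exploration by the discrepancy between model‑based and model‑free value estimates, without giving the formula. The following is a minimal sketch of one way such a bonus could look, assuming the bonus is simply a scaled absolute difference. The function names, the `beta` coefficient, and the example values are hypothetical and are not taken from M3PO.

```python
def exploration_bonus(v_model_based, v_model_free, beta=0.5):
    """Assumed form: bonus proportional to the absolute disagreement of two value estimates."""
    return beta * abs(v_model_based - v_model_free)


def shaped_reward(reward, v_model_based, v_model_free, beta=0.5):
    """Environment reward plus the disagreement bonus."""
    return reward + exploration_bonus(v_model_based, v_model_free, beta)


# Hypothetical (model-based, model-free) value estimates for three states.
estimates = {"s0": (1.0, 1.0), "s1": (2.0, 0.5), "s2": (0.0, 3.0)}
bonuses = {s: exploration_bonus(mb, mf) for s, (mb, mf) in estimates.items()}

# The state where the two estimators disagree most receives the largest bonus.
most_uncertain = max(bonuses, key=bonuses.get)
```

Under this assumption, states where the two estimators agree receive no bonus, so exploration concentrates where the implicit world model and the model‑free critic diverge.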

Abstract:
World models have garnered increasing attention in the development of artificial general intelligence (AGI), serving as computational frameworks for learning representations of the external world and forecasting future states. While early efforts focused on 2D visual perception and simulation, recent 3D‑aware generative world models have demonstrated the ability to synthesize geometrically consistent, interactive 3D environments, marking a shift toward 3D spatial cognition. Despite rapid progress, the field lacks systematic analysis to categorize emerging techniques and clarify their roles in advancing 3D cognitive world models. This survey addresses this need by introducing a conceptual framework, providing a structured and forward‑looking review of world models transitioning from 2D perception to 3D cognition. Within this framework, we highlight two key technological drivers, particularly advances in 3D representations and the incorporation of world knowledge, as fundamental pillars. Building on these, we dissect three core cognitive capabilities that underpin 3D world modeling: 3D physical scene generation, 3D spatial reasoning, and 3D spatial interaction. We further examine the deployment of these capabilities in real‑world applications, including embodied AI, autonomous driving, digital twin, and gaming/VR. Finally, we identify challenges across data, modeling, and deployment, and outline future directions for advancing more robust and generalizable 3D world models.

Abstract:
Chest X‑rays (CXRs) are the most widely used medical imaging modality and play a pivotal role in diagnosing diseases. However, as 2D projection images, CXRs are limited by structural superposition, which constrains their effectiveness in precise disease diagnosis and risk prediction. To address the limitations of 2D CXRs, this study introduces Xray2Xray, a novel World Model that learns latent representations encoding 3D structural information from chest X‑rays. Xray2Xray captures the latent representations of the chest volume by modeling the transition dynamics of X‑ray projections across different angular positions with a vision model and a transition model. We employed the latent representations of Xray2Xray for downstream risk prediction and disease diagnosis tasks. Experimental results showed that Xray2Xray outperformed both supervised methods and self‑supervised pretraining methods for cardiovascular disease risk estimation and achieved competitive performance in classifying five pathologies in CXRs. We also assessed the quality of Xray2Xray's latent representations through synthesis tasks and demonstrated that the latent representations can be used to reconstruct volumetric context.

Abstract:
Video Generation Models (VGMs) have become powerful backbones for Vision‑Language‑Action (VLA) models, leveraging large‑scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame‑by‑frame video diffusion is computationally inefficient for real‑time robotics. To address these, we propose Manipulate in Dream (MinD), a dual‑system world model for real‑time, risk‑aware planning. MinD uses two asynchronous diffusion processes: a low‑frequency visual generator (LoDiff) that predicts future scenes and a high‑frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low‑resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video‑action alignment module with a novel co‑training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RL‑Bench, 60% on real‑world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single‑step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real‑time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.

Abstract:
Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill‑posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian‑based dynamic scene representation. This is done by aligning the output of a diffusion‑based video generation model with the 4D reconstruction at a single frozen "bullet‑time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state‑of‑the‑art results on both novel‑view synthesis, and 2D/3D tracking tasks.

Abstract:
We present the Multi‑Agent Transformer World Model (MATWM), a novel transformer‑based world model designed for multi‑agent reinforcement learning in both vector‑ and image‑based environments. MATWM combines a decentralized imagination framework with a semi‑centralized critic and a teammate prediction module, enabling agents to model and anticipate the behavior of others under partial observability. To address non‑stationarity, we incorporate a prioritized replay mechanism that trains the world model on recent experiences, allowing it to adapt to agents' evolving policies. We evaluated MATWM on a broad suite of benchmarks, including the StarCraft Multi‑Agent Challenge, PettingZoo, and MeltingPot. MATWM achieves state‑of‑the‑art performance, outperforming both model‑free and prior world model approaches, while demonstrating strong sample efficiency, achieving near‑optimal performance in as few as 50K environment interactions. Ablation studies confirm the impact of each component, with substantial gains in coordination‑heavy tasks.
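
The abstract above mentions a prioritized replay mechanism that trains the world model on recent experiences, but does not state the sampling rule. As an illustration only, the sketch below assumes exponentially decaying weights over transition age. The `decay` parameter, the function names, and the integer stand‑in transitions are hypothetical.

```python
import random


def recency_weights(n, decay=0.9):
    """Weights for n stored transitions, oldest first; the newest gets weight 1."""
    return [decay ** (n - 1 - i) for i in range(n)]


def sample_batch(buffer, batch_size, decay=0.9, rng=random):
    """Sample with replacement, favouring recently stored transitions."""
    weights = recency_weights(len(buffer), decay)
    return rng.choices(buffer, weights=weights, k=batch_size)


# Stand-in transitions: each integer is its own insertion index (0 = oldest).
buffer = list(range(100))
rng = random.Random(0)
batch = sample_batch(buffer, 1000, decay=0.9, rng=rng)

# Uniform sampling would give a mean index near 49.5; recency weighting pushes it higher.
mean_index = sum(batch) / len(batch)
```

Biasing samples toward recent transitions is one simple way to keep a world model aligned with teammates' current policies, which is the non‑stationarity concern the abstract raises.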

Abstract:
World models ‑ generative models that simulate environment dynamics conditioned on past observations and actions ‑ are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine‑grained, temporally grounded assessment of action alignment and semantic consistency ‑ capabilities not captured by existing metrics. Vision‑Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine‑grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks ‑ action recognition and character recognition ‑ each assessed across binary, multiple‑choice, and open‑ended formats. To support this, we present UNIVERSE (UNIfied Vision‑language Evaluator for Rollouts in Simulated Environments), a VLM‑based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU‑days, we explore full, partial, and parameter‑efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task‑specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics‑aware evaluator for video world models.

Abstract:
World models enable robots to "imagine" future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test‑time strategy that enables world models to predict more reliable action outcomes in open‑world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI "reimagines" future outcomes with the modified observation and reintroduces the distractors post‑hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in‑distribution and out‑of‑distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.

Abstract:
This survey provides a comprehensive overview of the emerging field of world models grounded in the foundation of acoustic physical information. It examines the theoretical underpinnings, essential methodological frameworks, and recent technological advancements in leveraging acoustic signals for high‑fidelity environmental perception, causal physical reasoning, and predictive simulation of dynamic events. The survey explains how acoustic signals, as direct carriers of mechanical wave energy from physical events, encode rich, latent information about material properties, internal geometric structures, and complex interaction dynamics. Specifically, this survey establishes the theoretical foundation by explaining how fundamental physical laws govern the encoding of physical information within acoustic signals. It then reviews the core methodological pillars, including Physics‑Informed Neural Networks (PINNs), generative models, and self‑supervised multimodal learning frameworks. Furthermore, the survey details the significant applications of acoustic world models in robotics, autonomous driving, healthcare, and finance. Finally, it systematically outlines the important technical and ethical challenges while proposing a concrete roadmap for future research directions toward robust, causal, uncertainty‑aware, and responsible acoustic intelligence. These elements collectively point to a research pathway towards embodied active acoustic intelligence, empowering AI systems to construct an internal "intuitive physics" engine through sound.

Abstract:
Recent advancements in open‑world robot manipulation have been largely driven by vision‑language models (VLMs). While these models exhibit strong generalization ability in high‑level planning, they struggle to predict low‑level robot controls due to limited physical‑world understanding. To address this issue, we propose a model predictive control framework for open‑world manipulation that combines the semantic reasoning capabilities of VLMs with physically‑grounded, interactive digital twins of the real‑world environments. By constructing and simulating the digital twins, our approach generates feasible motion trajectories, simulates corresponding outcomes, and prompts the VLM with future observations to evaluate and select the most suitable outcome based on language instructions of the task. To further enhance the capability of pre‑trained VLMs in understanding complex scenes for robotic control, we leverage the flexible rendering capabilities of the digital twin to synthesize the scene at various novel, unoccluded viewpoints. We validate our approach on a diverse set of complex manipulation tasks, demonstrating superior performance compared to baseline methods for language‑conditioned robotic control using VLMs.
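
The generate, simulate, and select loop described above follows the usual shape of sampling‑based model predictive control. The sketch below shows that shape on a toy one‑dimensional task. The `simulate` function is a hypothetical stand‑in for the digital twin, `score` is a hypothetical stand‑in for the VLM's evaluation of predicted observations, and neither reflects the paper's actual components.

```python
import itertools


def simulate(state, actions):
    """Stand-in for the digital twin: a 1-D point that moves by each action."""
    states = []
    for a in actions:
        state += a
        states.append(state)
    return states


def score(states, goal):
    """Stand-in for the VLM's judgement: rollouts that stay close to the goal score higher."""
    return -sum(abs(goal - s) for s in states)


def plan_step(state, goal, horizon=3, action_set=(-1, 0, 1)):
    """Enumerate candidate action sequences, simulate each, return the first action of the best."""
    candidates = itertools.product(action_set, repeat=horizon)
    best = max(candidates, key=lambda seq: score(simulate(state, seq), goal))
    return best[0]


state, goal = 0, 4
trajectory = [state]
for _ in range(6):  # receding horizon: replan after every executed step
    state += plan_step(state, goal)
    trajectory.append(state)
```

Scoring the whole simulated rollout, rather than only its final state, breaks ties in favour of sequences that reach the goal sooner.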

Abstract:
Large Language Models (LLMs) possess intricate internal representations of the world, yet these latent structures are notoriously difficult to interpret or repurpose beyond the original prediction task. Building on our earlier work (Rothenfusser, 2025), which introduced the concept of vector ontologies as a framework for translating high‑dimensional neural representations into interpretable geometric structures, this paper provides the first empirical validation of that approach. A vector ontology defines a domain‑specific vector space spanned by ontologically meaningful dimensions, allowing geometric analysis of concepts and relationships within a domain. We construct an 8‑dimensional vector ontology of musical genres based on Spotify audio features and test whether an LLM's internal world model of music can be consistently and accurately projected into this space. Using GPT‑4o‑mini, we extract genre representations through multiple natural language prompts and analyze the consistency of these projections across linguistic variations and their alignment with ground‑truth data. Our results show (1) high spatial consistency of genre projections across 47 query formulations, (2) strong alignment between LLM‑inferred genre locations and real‑world audio feature distributions, and (3) evidence of a direct relationship between prompt phrasing and spatial shifts in the LLM's inferred vector ontology. These findings demonstrate that LLMs internalize structured, repurposable knowledge and that vector ontologies offer a promising method for extracting and analyzing this knowledge in a transparent and verifiable way.

Abstract:
The generation of temporally consistent, high‑fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio‑temporal dynamics and limited cross‑frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto‑regressive framework that pioneers hierarchical feature coordination and multi‑phase optimization for sustainable video synthesis. To achieve high‑quality long‑horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi‑stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi‑stage training strategy divides training into three stages, using model decoupling and simulation of the auto‑regressive inference process to accelerate model convergence and reduce error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods in the long‑horizon driving video generation task. We also explored STAGE's ability to generate unlimited‑length driving videos. We generated 600 frames of high‑quality driving videos on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods.

Abstract:
Just like power, water and transportation systems, wireless networks are a crucial societal infrastructure. As natural and human‑induced disruptions continue to grow, wireless networks must be resilient to unforeseen events, able to withstand and recover from unexpected adverse conditions, shocks, unmodeled disturbances and cascading failures. Despite its critical importance, resilience remains an elusive concept, with its mathematical foundations still underdeveloped. Unlike robustness and reliability, resilience is premised on the fact that disruptions will inevitably happen. Resilience, in terms of elasticity, focuses on the ability to bounce back to favorable states, while resilience as plasticity involves agents (or networks) that can flexibly expand their states, hypotheses and courses of action, by transforming through real‑time adaptation and reconfiguration. This constant situational awareness and vigilance in adapting world models and counterfactually reasoning about potential system failures and the corresponding best responses is a core aspect of resilience. This article seeks to first define resilience and disambiguate it from reliability and robustness, before delving into the mathematics of resilience. Finally, the article concludes by presenting nuanced metrics and discussing trade‑offs tailored to the unique characteristics of network resilience.

Abstract:
The growing number of pretrained models in Machine Learning (ML) presents significant challenges for practitioners. Given a new dataset, they need to determine the most suitable deep learning (DL) pipeline, consisting of the pretrained model and the hyperparameters for finetuning it. Moreover, as models grow in scale, the increasing reliance on real‑world data poses a bottleneck for training and requires leveraging data more effectively. Addressing the first challenge often involves manual model selection and hyperparameter tuning. At the same time, as models grow larger and more and more of the available human‑generated data is being used for training, data augmentation and synthetic data become critical elements. Automated machine learning offers a path to address these challenges but is traditionally designed for tabular data and classical ML methods. This dissertation adopts meta‑learning to extend automated machine learning to the deep learning domain. We propose empirical approaches to automate DL pipeline selection for Computer Vision tasks using prior task knowledge to learn surrogate models for pipeline ranking. Extending these methods to the language domain, we learn to finetune large language models. As a result, we show that our approach can outperform finetuning foundation models. Additionally, we meta‑learn data augmentation and synthetic data to enhance performance in up‑stream and down‑stream tasks. We empirically show the underestimated importance of data augmentation when using Self‑Supervised Learning and meta‑learn advanced data augmentation strategies. Leveraging synthetic data, we also propose to meta‑learn neural synthetic data generators as proxies for Reinforcement Learning (RL) environments. Additionally, we learn a multiple‑environment world model in an in‑context learning fashion by purely using synthetic, randomly sampled data.

Abstract:
World models aim to simulate environments and enable effective agent behavior. However, modeling real‑world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio‑Temporal Road Image Dataset for Exploration (STRIDE) permuting 360‑degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer‑based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self‑control, and state‑of‑the‑art georeferencing. These results suggest a promising direction towards sophisticated generalist agents‑‑capable of understanding and manipulating the spatial and temporal aspects of their material environments‑‑with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera‑AI/STRIDE.

Abstract:
Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object‑centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models, given that humans gain physical insights by observing the world and apply this knowledge to reason accurately about various dynamic scenarios; and 2) the validation of model adaptability across diverse scenarios, since real‑world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot‑based physics‑informed object‑centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio‑temporal prediction module for dynamic forecasting. Our experiments highlight the model's strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real‑world dataset encompassing object interactions, fluid dynamics, and fluid‑object interactions, on which we validated our model's capabilities. The model's robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.

Abstract:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self‑supervised approach that combines internet‑scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre‑train an action‑free joint‑embedding‑predictive architecture, V‑JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V‑JEPA 2 achieves strong performance on motion understanding (77.3 top‑1 accuracy on Something‑Something v2) and state‑of‑the‑art performance on human action anticipation (39.7 recall‑at‑5 on Epic‑Kitchens‑100) surpassing previous task‑specific models. Additionally, after aligning V‑JEPA 2 with a large language model, we demonstrate state‑of‑the‑art performance on multiple video question‑answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self‑supervised learning can be applied to robotic planning tasks by post‑training a latent action‑conditioned world model, V‑JEPA 2‑AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V‑JEPA 2‑AC zero‑shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task‑specific training or reward. This work demonstrates how self‑supervised learning from web‑scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

Abstract:
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real‑world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non‑expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real‑world human demonstrations with diverse non‑expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open‑world driving scenarios under various actions, including hazardous non‑expert ones. To close the gap between high‑fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non‑expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

Abstract:
We introduce Option Kernel Bellman Equations (OKBEs) for a new reward‑free Markov Decision Process. Rather than a value function, OKBEs directly construct and optimize a predictive map called a state‑time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation‑to‑termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman‑Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high‑dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal‑success and constraint‑violation events, needed for formal verification. Given a high‑dimensional state‑transition model for an intractable planning problem, we can decompose it with local STOKs and goal‑conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward‑plan at the level of goals in high‑dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta‑policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward‑maximization is in conflict with the properties of compositionality, modularity, and interpretability. Alternatively, OKBEs facilitate these properties to support verifiable long‑horizon planning and intrinsic motivation that scales to dynamic high‑dimensional world‑models.
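
The composition property in point 1 can be illustrated with a deliberately simplified sketch. It treats a kernel as a plain state-to-state probability table, dropping the time dimension that a STOK carries, and composes two kernels with the discrete Chapman-Kolmogorov sum. The three-state toy problem, the `compose` helper, and all probabilities are illustrative assumptions and are not taken from the paper.

```python
def compose(kernel_a, kernel_b):
    """Discrete Chapman-Kolmogorov composition.

    composed[i][k] = sum_j kernel_a[i][j] * kernel_b[j][k]
    """
    inner = len(kernel_b)
    if any(len(row) != inner for row in kernel_a):
        raise ValueError("kernel_a columns must match kernel_b rows")
    width = len(kernel_b[0])
    return [
        [sum(row[j] * kernel_b[j][k] for j in range(inner)) for k in range(width)]
        for row in kernel_a
    ]


# Hypothetical states: 0 = still running, 1 = goal reached, 2 = constraint violated.
# Goal and violation are absorbing, so their rows map back to themselves.
option = [
    [0.2, 0.7, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Running the same option twice in sequence.
two_step = compose(option, option)
goal_probability = two_step[0][1]       # 0.2 * 0.7 + 0.7 * 1.0
violation_probability = two_step[0][2]  # 0.2 * 0.1 + 0.1 * 1.0
```

Because the goal and violation states are kept as explicit absorbing entries, the composed table still reports the probability of each event directly, which is the kind of bookkeeping the abstract associates with verification.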

Abstract:
Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments. Existing methods may struggle with adapting to new information or efficiently utilizing past experiences for multi‑step reasoning without fine‑tuning. We introduce a novel LLM agent framework that enhances planning capabilities through in‑context learning, facilitated by atomic fact augmentation and a recursive lookahead search. Our agent learns to extract task‑critical "atomic facts" from its interaction trajectories. These facts dynamically augment the prompts provided to LLM‑based components responsible for action proposal, latent world model simulation, and state‑value estimation. Planning is performed via a depth‑limited lookahead search, where the LLM simulates potential trajectories and evaluates their outcomes, guided by the accumulated facts and interaction history. This approach allows the agent to improve its understanding and decision‑making online, leveraging its experience to refine its behavior without weight updates. We provide a theoretical motivation linking performance to the quality of fact‑based abstraction and LLM simulation accuracy. Empirically, our agent demonstrates improved performance and adaptability on challenging interactive tasks, achieving more optimal behavior as it accumulates experience, showcased in tasks such as TextFrozenLake and ALFWorld.
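
The depth-limited lookahead described above can be sketched as a small recursive search. In this sketch the three LLM-based components (action proposal, world-model simulation, value estimation) are replaced by hand-written stand-in functions on a toy corridor. The function names, the `facts` dictionary, and the discount factor are all hypothetical choices made for illustration, not the authors' implementation.

```python
def lookahead(state, depth, facts, propose, simulate, evaluate, discount=0.9):
    """Depth-limited search; returns (estimated value, best first action)."""
    actions = propose(state, facts)
    if depth == 0 or not actions:
        return evaluate(state, facts), None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state, reward, done = simulate(state, action, facts)
        future = 0.0 if done else lookahead(
            next_state, depth - 1, facts, propose, simulate, evaluate, discount
        )[0]
        value = reward + discount * future
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


# Stand-ins for the LLM-based components, on a 1-D corridor with cells 0..4.
def propose(state, facts):
    return ["left", "right"]


def simulate(state, action, facts):
    next_state = max(0, min(4, state + (1 if action == "right" else -1)))
    if next_state in facts["holes"]:
        return next_state, -1.0, True   # falling into a known hole ends the episode
    if next_state == facts["goal"]:
        return next_state, 1.0, True    # reaching the goal ends the episode
    return next_state, 0.0, False


def evaluate(state, facts):
    return 0.0  # no information beyond the search horizon


# Hypothetical facts accumulated from earlier episodes.
facts = {"holes": {1}, "goal": 4}
value, action = lookahead(2, 2, facts, propose, simulate, evaluate)
```

Starting from cell 2 with a search depth of 2, the search avoids the hole recorded in `facts` and picks `"right"`. Changing the facts changes the plan without touching the search code, which mirrors the abstract's point about improving behavior without weight updates.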

Abstract:
In this work, we introduce the Time‑Aware World Model (TAWM), a model‑based approach that explicitly incorporates temporal dynamics. By conditioning on the time‑step size, Δt, and training over a diverse range of Δt values ‑‑ rather than sampling at a fixed time‑step ‑‑ TAWM learns both high‑ and low‑frequency task dynamics across diverse control problems. Grounded in the information‑theoretic insight that the optimal sampling rate depends on a system's underlying dynamics, this time‑aware formulation improves both performance and data efficiency. Empirical evaluations show that TAWM consistently outperforms conventional models across varying observation rates in a variety of control tasks, using the same number of training samples and iterations. Our code can be found online at: github.com/anh‑nn01/Time‑Aware‑World‑Model.
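
The idea of conditioning on the time-step size can be shown on a toy problem. The sketch below assumes a one-dimensional decaying system, a one-parameter model in which Δt is an explicit input, and uniform sampling of Δt during data collection. None of these choices come from the paper; the actual TAWM architecture and sampling scheme are not described in this abstract.

```python
import math
import random

DECAY = 2.0  # hypothetical true system: dx/dt = -DECAY * x


def true_step(x, dt):
    """Exact next state of the toy system after a step of size dt."""
    return x * math.exp(-DECAY * dt)


def sample_transition(rng, dt_range):
    """Draw one (x, dt, x_next) tuple with a randomly chosen step size."""
    x = rng.choice([-1.0, 1.0]) * rng.uniform(0.1, 1.0)
    dt = rng.uniform(*dt_range)
    return x, dt, true_step(x, dt)


def fit_time_aware_model(transitions):
    """Least-squares fit of theta in the model x_next = x + dt * theta * x."""
    numerator = sum(dt * x * (x_next - x) for x, dt, x_next in transitions)
    denominator = sum((dt * x) ** 2 for x, dt, _ in transitions)
    return numerator / denominator


def predict(theta, x, dt):
    """Time-aware prediction: the step size is an explicit model input."""
    return x + dt * theta * x


rng = random.Random(0)
# Training data mixes many step sizes instead of using one fixed step.
data = [sample_transition(rng, dt_range=(0.001, 0.05)) for _ in range(500)]
theta = fit_time_aware_model(data)
```

Because Δt is an input, the single fitted parameter can be queried at any step size in the training range, for example `predict(theta, 1.0, 0.01)`, rather than being tied to one sampling rate.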

Abstract:
We study multi‑agent reinforcement learning (MARL) for tasks in complex high‑dimensional environments, such as autonomous driving. MARL is known to suffer from partial observability and non‑stationarity. To tackle these challenges, information sharing is often employed, which however faces major hurdles in practice, including overwhelming communication overhead and scalability concerns. By making use of generative AI embodied in a world model together with its latent representation, we develop CALL, a Communicative World Model for MARL, where 1) each agent first learns its world model that encodes its state and intention into a low‑dimensional latent representation with a smaller memory footprint, which can be shared with other agents of interest via lightweight communication; and 2) each agent carries out ego‑centric learning while exploiting lightweight information sharing to enrich its world model, and then exploits its generalization capacity to improve prediction for better planning. We characterize the gain in prediction accuracy from information sharing and its impact on the performance gap. Extensive experiments are carried out on challenging local trajectory planning tasks in the CARLA platform to demonstrate the performance gains of using CALL.

Abstract:
A major bottleneck in the training process for Zero‑Shot Coordination (ZSC) agents is the generation of partner agents that are diverse in collaborative conventions. Current Cross‑play Minimization (XPM) methods for population generation can be very computationally expensive and sample inefficient, as the training objective requires sampling multiple types of trajectories. Each partner agent in the population is also trained from scratch, despite all of the partners in the population learning policies for the same coordination task. In this work, we propose that simulated trajectories from the dynamics model of an environment can drastically speed up the training process for XPM methods. We introduce XPM‑WM, a framework for generating simulated trajectories for XPM via a learned World Model (WM). We show that XPM with simulated trajectories removes the need to sample multiple trajectories. In addition, we show that our proposed method can effectively generate partners with diverse conventions that match the performance of previous methods in terms of SP population training reward as well as training partners for ZSC agents. Our method is thus significantly more sample efficient and scalable to a larger number of partners.

Abstract:
Understanding the behavior of deep reinforcement learning (DRL) agents, particularly as task and agent sophistication increase, requires more than simple comparison of reward curves, yet standard methods for behavioral analysis remain underdeveloped in DRL. We apply tools from neuroscience and ethology to study DRL agents in a novel, complex, partially observable environment, ForageWorld, designed to capture key aspects of real‑world animal foraging, including sparse, depleting resource patches, predator threats, and spatially extended arenas. We use this environment as a platform for applying joint behavioral and neural analysis to agents, revealing detailed, quantitatively grounded insights into agent strategies, memory, and planning. Contrary to common assumptions, we find that model‑free RNN‑based DRL agents can exhibit structured, planning‑like behavior purely through emergent dynamics, without requiring explicit memory modules or world models. Our results show that studying DRL agents like animals, analyzing them with neuroethology‑inspired tools that reveal structure in both behavior and neural dynamics, uncovers rich structure in their learning dynamics that would otherwise remain invisible. We distill these tools into a general analysis framework linking core behavioral and representational features to diagnostic methods, which can be reused for a wide range of tasks and agents. As agents grow more complex and autonomous, bridging neuroscience, cognitive science, and AI will be essential, not just for understanding their behavior, but for ensuring safe alignment and maximizing desirable behaviors that are hard to measure via reward. We show how this can be done by drawing on lessons from how biological intelligence is studied.

Abstract:
Efficient simulation is essential for enhancing proactive preparedness for sudden‑onset disasters such as earthquakes. Recent advancements in large language models (LLMs) as world models show promise in simulating complex scenarios. This study examines multiple LLMs to proactively estimate perceived earthquake impacts. Leveraging multimodal datasets including geospatial, socioeconomic, building, and street‑level imagery data, our framework generates Modified Mercalli Intensity (MMI) predictions at zip code and county scales. Evaluations on the 2014 Napa and 2019 Ridgecrest earthquakes using USGS "Did You Feel It?" (DYFI) reports demonstrate significant alignment, as evidenced by a high correlation of 0.88 and a low RMSE of 0.77 as compared to real reports at the zip code level. Techniques such as RAG and ICL can improve simulation performance, while visual inputs notably enhance accuracy compared to structured numerical data alone. These findings show the promise of LLMs in simulating disaster impacts that can help strengthen pre‑event planning.

Abstract:
Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on the mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in different action spaces within a simple scene. This hinders the robot from learning a unified and robust action representation for different robots within diverse scenes. Observing how humans understand a manipulation task, we find that understanding how the objects should move in 3D space is a critical clue for guiding actions. This clue is embodiment‑agnostic and suitable for both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. This model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large‑scale 3D optical flow dataset, named ManiFlow‑110k, through a moving object auto‑detect pipeline. A video diffusion‑based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow‑guided rendering mechanism, which renders the predicted final state and leverages GPT‑4o to assess whether the predicted flow aligns with the task description. This equips the robot with a closed‑loop planning ability. Finally, we consider the predicted 3D optical flow as constraints for an optimization policy to determine a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross‑embodiment adaptation without hardware‑specific training.

Abstract:
To what extent do vision‑and‑language foundation models possess a realistic world model (observation × action → observation) and a dynamics model (observation × observation → action), when actions are expressed through language? While open‑source foundation models struggle with both, we find that fine‑tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action‑centric image editing on Aurora‑Bench. Our best model achieves a performance competitive with state‑of‑the‑art image editing models, improving on them by a margin of 15% on real‑world subsets according to GPT4o‑as‑judge, and achieving the best average human evaluation across all subsets of Aurora‑Bench.

Abstract:
Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long‑term consistency of video world models through a geometry‑grounded long‑term spatial memory. Our framework includes mechanisms to store and retrieve information from the long‑term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long‑term consistent world generation.

Abstract:
Building an efficient and physically consistent world model from limited observations is a long‑standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi‑stage processing (such as segmentation, background completion, and inpainting) due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG‑World, a novel end‑to‑end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation‑aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co‑pruning strategies to refine geometric completeness. DSG‑World enables efficient real‑to‑simulation transfer purely in the explicit Gaussian representation space, supporting high‑fidelity rendering and object‑level scene manipulation without relying on dense observations or multi‑stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real‑world 3D reconstruction and simulation.

Abstract:
Reinforcement Learning (RL) applications in real‑world scenarios must prioritize safety and reliability, which impose strict constraints on agent behavior. Model‑based RL leverages predictive world models for action planning and policy optimization, but inherent model inaccuracies can lead to catastrophic failures in safety‑critical settings. We propose a novel model‑based RL framework that jointly optimizes task performance and safety. To address world model errors, our method incorporates an adaptive mechanism that dynamically switches between model‑based planning and direct policy execution. We resolve the objective mismatch problem of traditional model‑based approaches using an implicit world model. Furthermore, our framework employs dynamic safety thresholds that adapt to the agent's evolving capabilities, consistently selecting actions that surpass safe policy suggestions in both performance and safety. Experiments demonstrate significant improvements over non‑adaptive methods, showing that our approach optimizes safety and performance simultaneously rather than merely meeting minimum safety requirements. The proposed framework achieves robust performance on diverse safety‑critical continuous control tasks, outperforming existing methods.

Abstract:
Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video‑based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low‑level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction‑WM) or the properly ordered sequence of actions (WorldPrediction‑PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low‑level continuity cues in background scenes, we provide "action equivalents" (identical actions observed in different contexts) as candidates for selection. This benchmark is grounded in a formal partially observable semi‑MDP framework, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction‑WM and 38% on WorldPrediction‑PP, whereas humans are able to solve both tasks perfectly.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, these models encounter significant challenges in cross‑domain generalization, hindering their practical efficacy in real‑world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities, and in this study, we present AetherVision‑Bench, a benchmark for multi‑angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state‑of‑the‑art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero‑shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.

Abstract:
Many real‑world tasks where Large Language Models (LLMs) can be used require spatial reasoning, like Point of Interest (POI) recommendation and itinerary planning. However, on their own LLMs lack reliable spatial reasoning capabilities, especially about distances. To address this problem, we develop a novel approach, DistRAG, that enables an LLM to retrieve relevant spatial information not explicitly learned during training. Our method encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question. Using this technique, our method enables an LLM to answer distance‑based reasoning questions that it otherwise cannot answer. Given the vast array of possible places an LLM could be asked about, DistRAG offers a flexible first step towards providing a rudimentary "world model" to complement the linguistic knowledge held in LLMs.
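
The two steps named above, encoding distances in a graph and retrieving a question-relevant subgraph, can be sketched in a few lines. This sketch makes several assumptions that are not in the abstract: it uses the haversine great-circle formula on a spherical Earth as a stand-in for true geodesic distance, it matches place names by simple substring search, and the helper names and the three-city gazetteer are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

# Approximate (latitude, longitude) in degrees, for illustration only.
CITIES = {
    "Paris": (48.8566, 2.3522),
    "London": (51.5074, -0.1278),
    "Berlin": (52.5200, 13.4050),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_distance_graph(cities):
    """Undirected graph stored as {(city_u, city_v): distance_km}, with u < v."""
    names = sorted(cities)
    return {
        (u, v): haversine_km(cities[u], cities[v])
        for i, u in enumerate(names)
        for v in names[i + 1:]
    }


def retrieve_context(question, graph):
    """Return the edges whose two endpoints are both mentioned in the question."""
    text = question.lower()
    mentioned = {name for pair in graph for name in pair if name.lower() in text}
    return [
        f"{u} to {v}: {km:.0f} km"
        for (u, v), km in sorted(graph.items())
        if u in mentioned and v in mentioned
    ]


graph = build_distance_graph(CITIES)
context = retrieve_context("Is Paris closer to London or to Berlin?", graph)
```

The resulting `context` lines would be prepended to the prompt so that the LLM reads the distances instead of recalling them.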

Abstract:
Physical intelligence ‑‑ anticipating and shaping the world from partial, multisensory observations ‑‑ is critical for next‑generation world models. We propose FOLIAGE, a physics‑informed multimodal world model for unbounded accretive surface growth. In its Action‑Perception loop, a unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state. A physics‑aware predictor, conditioned on physical control actions, advances this latent state in time to align with the target latent of the surface, yielding a Modality‑Agnostic Growth Embedding (MAGE) that interfaces with critic heads for downstream objectives. FOLIAGE's Accretive Graph Network (AGN) captures dynamic connectivity through Age Positional Encoding and Energy‑Gated Message‑Passing. Geometry‑Correspondence Fusion and Cross‑Patch Masking enhance MAGE's expressiveness, while Hierarchical Pooling balances global context with local dynamics. We create SURF‑GARDEN, a world model learning platform comprising a Counterfactual Physics Simulator, a Multimodal Correspondence Extractor, and Evolution Tracing, which generates 7,200 diverse surface‑growth sequences. SURF‑BENCH, our physical‑intelligence evaluation suite, evaluates six core tasks ‑‑ topology recognition, inverse material estimation, growth‑stage classification, latent roll‑out, cross‑modal retrieval, and dense correspondence ‑‑ and four stress tests ‑‑ sensor dropout, zero‑shot modality transfer, long‑horizon prediction, and physics ablation ‑‑ to probe resilience. FOLIAGE outperforms specialized baselines while remaining robust across dynamic environments, establishing a new world‑model based, multimodal pathway to physical intelligence.

Abstract:
Large language models (LLMs) have demonstrated emergent abilities across diverse tasks, raising the question of whether they acquire internal world models. In this work, we investigate whether LLMs implicitly encode linear spatial world models, which we define as linear representations of physical space and object configurations. We introduce a formal framework for spatial world models and assess whether such structure emerges in contextual embeddings. Using a synthetic dataset of object positions, we train probes to decode object positions and evaluate geometric consistency of the underlying space. We further conduct causal interventions to test whether these spatial representations are functionally used by the model. Our results provide empirical evidence that LLMs encode linear spatial world models.

Abstract:
As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent's beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

Abstract:
Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scene generation either oversimplify object semantics through one‑hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM‑based methods enable richer semantics via natural language, but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for autoregressive text‑driven 3D indoor scene synthesis and editing. Our approach features a compact structured scene representation with explicit room boundaries that enables asset‑agnostic deployment and frames scene manipulation as a next‑token prediction task, supporting object addition, removal, and swapping via natural language. We employ supervised fine‑tuning with a preference alignment stage to train a specialized language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene‑level composition. We further introduce a voxelization‑based evaluation metric capturing fine‑grained geometric violations beyond 3D bounding boxes. Experiments surpass state‑of‑the‑art on object addition and achieve superior human‑perceived quality on the application of full scene synthesis, despite not being trained on it.

Abstract:
Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision‑making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision‑language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post‑treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post‑treatment tumors, with state‑of‑the‑art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical‑specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision‑making for interventional physicians, boosting F1‑score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as the second readers.

Abstract:
Language‑instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state‑of‑the‑art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open‑vocabulary object localization policies that: (i) uses a Gaussian Splatting‑based real‑to‑sim‑to‑real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open‑vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high‑level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a broad range of zero‑shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim‑to‑real transfer on a TidyBot.

Abstract:
Learning meaningful abstract models of Markov Decision Processes (MDPs) is crucial for improving generalization from limited data. In this work, we show how geometric priors can be imposed on the low‑dimensional representation manifold of a learned transition model. We incorporate known symmetric structures via appropriate choices of the latent space and the associated group actions, which encode prior knowledge about invariances in the environment. In addition, our framework allows the embedding of additional unstructured information alongside these symmetries. We show experimentally that this leads to better predictions of the latent transition model than fully unstructured approaches, as well as better learning on downstream RL tasks, in environments with rotational and translational features, including in first‑person views of 3D environments. Additionally, our experiments show that this leads to simpler and more disentangled representations. The full code is available on GitHub to ensure reproducibility.

Abstract:
Reinforcement learning (RL) has driven breakthroughs in AI, from game‑play to scientific discovery and AI alignment. However, its broader applicability remains limited by challenges such as low data efficiency and poor generalizability. Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task‑agnostic planning. In this work, we propose the Agentic Episodic Control (AEC), a novel architecture that integrates RL with LLMs to enhance decision‑making. The AEC can leverage a large language model (LLM) to map the observations into language‑grounded embeddings, which further can be stored in an episodic memory for rapid retrieval of high‑value experiences. Simultaneously, a World‑Graph working memory module is utilized to capture structured environmental dynamics in order to enhance relational reasoning. Furthermore, a lightweight critical state detector dynamically arbitrates between the episodic memory recall and the world‑model‑guided exploration. On the whole, by combining the trial‑and‑error learning scheme with LLM‑derived semantic priors, the proposed AEC can improve both data efficiency and generalizability in reinforcement learning. In experiments on BabyAI‑Text benchmark tasks, AEC demonstrates substantial improvements over existing baselines, especially on complex and generalization tasks like FindObj, where it outperforms the best baseline by up to 76%. The proposed AEC framework bridges the strengths of numeric reinforcement learning and symbolic reasoning, which provides a pathway toward more adaptable and sample‑efficient agents.

Abstract:
Humanoid robots, with their human‑like form, are uniquely suited for interacting in environments built for people. However, enabling humanoids to reason, plan, and act in complex open‑world settings remains a challenge. World models, models that predict the future outcome of a given action, can support these capabilities by serving as a dynamics model in long‑horizon planning and generating synthetic data for policy learning. We introduce Humanoid World Models (HWM), a family of lightweight, open‑source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow‑Matching, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter‑sharing strategies. Our parameter‑sharing techniques reduce model size by 33‑53% with minimal impact on performance or visual fidelity. HWMs are designed to be trained and deployed in practical academic and small‑lab settings, such as 1‑2 GPUs.

Abstract:
World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio‑temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry‑aware memory retrieval, effectively preserving long‑term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high‑fidelity, long‑horizon predictions grounded in geometry‑aware dynamics.

Abstract:
End‑to‑end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision‑language‑guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision‑Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty‑triggered VLM encoder‑decoder, fine‑tuned via chain‑of‑thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 ± 2.3 km/h average speed, 0.98 ± 0.03 route completion, and near‑zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero‑shot to real dash‑cam data with minimal distributional shift, demonstrating robust cross‑domain alignment and potential for real‑world deployment.

Abstract:
Evaluating robot control policies is difficult: real‑world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world‑model‑based policy evaluation environment (WorldGym), an autoregressive, action‑conditioned video generation model which serves as a proxy to real‑world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision‑language model providing rewards. We evaluate a set of VLA‑based real‑robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real‑world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA‑based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
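
The evaluation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not WorldGym's implementation: `rollout_fn` is a hypothetical stand-in for "roll the policy out in the world model from a real initial frame and ask the vision-language model whether the task succeeded", and `make_bernoulli_stub` replaces both models with a coin flip so that the sketch runs on its own. The Pearson correlation is one common way to compare world-model and real-world success rates; the abstract does not say which statistic the authors use.

```python
import random
from statistics import mean


def mc_success_rate(rollout_fn, n_rollouts, seed=0):
    """Monte Carlo estimate of a success rate: the mean of binary rollout outcomes.

    rollout_fn(rng) is a hypothetical callable that performs one rollout
    and returns True on success.
    """
    rng = random.Random(seed)
    return mean(1.0 if rollout_fn(rng) else 0.0 for _ in range(n_rollouts))


def make_bernoulli_stub(p_success):
    """Stand-in for world model + VLM reward: succeeds with probability p_success."""
    return lambda rng: rng.random() < p_success


def rank_policies(success_rates):
    """Return policy names sorted from highest to lowest estimated success rate."""
    return sorted(success_rates, key=success_rates.get, reverse=True)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

With real components, `rollout_fn` would wrap the policy, the video world model, and the reward model; the estimated rates could then be passed to `rank_policies` and compared against real-robot success rates with `pearson`.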

Abstract:
World models are emerging as a transformative paradigm in artificial intelligence, enabling agents to construct internal representations of their environments for predictive reasoning, planning, and decision‑making. By learning latent dynamics, world models provide a sample‑efficient framework that is especially valuable in data‑constrained or safety‑critical scenarios. In this paper, we present a comprehensive overview of world models, highlighting their architecture, training paradigms, and applications across prediction, generation, planning, and causal reasoning. We compare and distinguish world models from related concepts such as digital twins, the metaverse, and foundation models, clarifying their unique role as embedded cognitive engines for autonomous agents. We further propose Wireless Dreamer, a novel world model‑based reinforcement learning framework tailored for wireless edge intelligence optimization, particularly in low‑altitude wireless networks (LAWNs). Through a weather‑aware UAV trajectory planning case study, we demonstrate the effectiveness of our framework in improving learning efficiency and decision quality.

Abstract:
Recent progress in reasoning with large language models (LLMs), such as DeepSeek‑R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self‑reflection. However, it is unclear what behavior is effective and what behavior is missing for long‑horizon AI agent tasks. In this work, we propose Dyna‑Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna‑Think, we propose Dyna‑Think Imitation Learning (DIT) and Dyna‑Think Dyna Training (DDT). To initialize a policy with Dyna‑Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna‑Think, DDT uses a two‑stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna‑Think improves the agent's in‑domain and out‑of‑domain performance, achieving best‑of‑n performance similar to R1 while generating 2x fewer tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective for improving policy performance; and 2) better AI agent performance correlates with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.

Abstract:
Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in reward‑free environments, including a class of methods known as model‑based intrinsic motivation, exhibit inconsistent exploration patterns and do not converge to an exploratory policy, thus failing to capture robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in ethological, naturalistic and task‑independent behavior. To bridge these gaps, we introduce a novel model‑based intrinsic drive explicitly designed after the principles of autonomous exploration in animals. Our method (3M‑Progress) achieves animal‑like exploration by tracking divergence between an online world model and a fixed prior learned from an ecological niche. To the best of our knowledge, we introduce the first autonomous embodied agent that predicts brain data entirely from self‑supervised optimization of an intrinsic goal ‑‑ without any behavioral or neural training data ‑‑ demonstrating that 3M‑Progress agents capture the explainable variance in behavioral patterns and whole‑brain neural‑glial dynamics recorded from autonomously behaving larval zebrafish, thereby providing the first goal‑driven, population‑level model of neural‑glial computation. Our findings establish a computational framework connecting model‑based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal‑like autonomy.

Abstract:
Current deep reinforcement learning (DRL) approaches achieve state‑of‑the‑art performance in various domains, but struggle with data efficiency compared to human learning, which leverages core priors about objects and their interactions. Active inference offers a principled framework for integrating sensory information with prior knowledge to learn a world model and quantify the uncertainty of its own beliefs and predictions. However, active inference models are usually crafted for a single task with bespoke knowledge, so they lack the domain flexibility typical of DRL approaches. To bridge this gap, we propose a novel architecture that integrates a minimal yet expressive set of core priors about object‑centric dynamics and interactions to accelerate learning in low‑data regimes. The resulting approach, which we call AXIOM, combines the usual data efficiency and interpretability of Bayesian approaches with the across‑task generalization usually associated with DRL. AXIOM represents scenes as compositions of objects, whose dynamics are modeled as piecewise linear trajectories that capture sparse object‑object interactions. The structure of the generative model is expanded online by growing and learning mixture models from single events and periodically refined through Bayesian model reduction to induce generalization. AXIOM masters various games within only 10,000 interaction steps, with a small number of parameters compared to DRL and without the computational expense of gradient‑based optimization.

Abstract:
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi‑domain tasks. However, due to potential vulnerabilities in models available on open‑source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real‑world models. Moreover, our attack demonstrates robustness against two inference‑time defenses (Paraphrasing and CLEANGEN) and one training‑time defense (Fine‑pruning).

Abstract:
The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high‑quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. It must not only retain long‑horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, existing datasets do not explicitly enforce spatial consistency constraints, limiting both the ability to systematically evaluate this capability and to learn it through data‑driven approaches. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long‑range spatial consistency. To bridge this gap, we propose LoopNav, a dataset and corresponding benchmark centered on loop‑based navigation for evaluating spatial consistency. The dataset comprises 250 hours (20 million frames) of loop‑based navigation videos with actions, collected from diverse locations in the open‑world environment of Minecraft. We further introduce a Scene Graph Consistency Score to quantify spatial consistency while remaining invariant to pixel‑level variations. Dataset, benchmark, and code are open‑sourced to support future research.

Abstract:
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image‑to‑video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanisms lead to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long‑term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval‑augmented generation prove less effective for video generation, primarily due to the limited in‑context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

Abstract:
Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy‑motivated settings, we recast unlearning as a general‑purpose tool for post‑deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary‑based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first‑order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting‑retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real‑world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real‑world model maintenance in clinical applications.

Abstract:
The trajectories of 6G and AI are set for a creative collision. However, current visions for 6G remain largely incremental evolutions of 5G, while progress in AI is hampered by brittle, data‑hungry models that lack robust reasoning capabilities. This paper argues for a foundational paradigm shift, moving beyond the purely technical level of communication toward systems capable of semantic understanding and effective, goal‑oriented interaction. We propose a unified research vision rooted in the principles of System‑2 cognition, built upon three pillars: Abstraction, enabling agents to learn meaningful world models from raw sensorimotor data; Compositionality, providing the algebraic tools to combine learned concepts and subsystems; and Emergent Communication, allowing intelligent agents to create their own adaptive and grounded languages. By integrating these principles, we lay the groundwork for truly intelligent systems that can reason, adapt, and collaborate, unifying advances in wireless communications, machine learning, and robotics under a single coherent framework.

Abstract:
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long‑term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state‑space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non‑causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block‑wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long‑term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long‑range memory, while maintaining practical inference speeds suitable for interactive applications.

Abstract:
With the recent success of world‑model agents, which extend the core idea of model‑based reinforcement learning by learning a differentiable model for sample‑efficient control across diverse tasks, active inference (AIF) offers a complementary, neuroscience‑grounded paradigm that unifies perception, learning, and action within a single probabilistic framework powered by a generative model. Despite this promise, practical AIF agents still rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments requiring plans over long horizons of tens to hundreds of steps. Moreover, most existing agents are evaluated on robotic or vision benchmarks which, while natural for biological agents, fall short of real‑world industrial complexity. We address these limitations with a generative‑policy architecture featuring (i) a multi‑step latent transition that lets the generative model predict an entire horizon in a single look‑ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive planning from the control loop. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long‑horizon settings. The empirical results confirm the effectiveness of the proposed approach, demonstrating that coupling the world model with the AIF formalism yields an end‑to‑end probabilistic controller capable of effective decision making in delayed, long‑horizon settings without handcrafted rewards or expensive planning.

Abstract:
Ensuring safe operation of safety‑critical complex systems interacting with their environment poses significant challenges, particularly when the system's world model relies on machine learning algorithms to process the perception input. A comprehensive safety argumentation requires knowledge of how faults or functional insufficiencies propagate through the system and interact with external factors, to manage their safety impact. While statistical analysis approaches can support the safety assessment, associative reasoning alone is neither sufficient for the safety argumentation nor for the identification and investigation of safety measures. A causal understanding of the system and its interaction with the environment is crucial for safeguarding safety‑critical complex systems. It allows knowledge, such as insights gained from testing, to be transferred and generalized, and facilitates the identification of potential improvements. This work explores using causal Bayesian networks to model the system's causalities for safety analysis, and proposes measures to assess causal influences based on Pearl's framework of causal inference. We compare the approach of causal Bayesian networks to the well‑established fault tree analysis, outlining advantages and limitations. In particular, we examine importance metrics typically employed in fault tree analysis as a foundation to discuss suitable causal metrics. An evaluation is performed on the example of a perception system for automated driving. Overall, this work presents an approach for causal reasoning in safety analysis that enables the integration of data‑driven and expert‑based knowledge to account for uncertainties arising from complex systems operating in open environments.
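
The gap between associative and causal reasoning mentioned above can be made concrete with a toy causal Bayesian network. The sketch below is illustrative only and is not taken from the paper: the three binary variables (Z for adverse weather, X for a perception fault, Y for a hazardous event), the graph Z -> X, Z -> Y, X -> Y, and every probability are assumptions chosen for the example. It contrasts the observational quantity P(Y=1 | X=x) with the interventional quantity P(Y=1 | do(X=x)), the latter computed with Pearl's backdoor adjustment over Z.

```python
# Hypothetical network; all probabilities below are made-up illustration values.
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.2,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.5, (1, 1): 0.6}


def p_z(z):
    """P(Z=z) for binary z."""
    return P_Z1 if z == 1 else 1.0 - P_Z1


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary x and z."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def p_y1_given_x(x):
    """Associational P(Y=1 | X=x): Z is weighted by P(Z=z | X=x)."""
    joint = sum(p_z(z) * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))
    marginal = sum(p_z(z) * p_x_given_z(x, z) for z in (0, 1))
    return joint / marginal


def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via backdoor adjustment: Z keeps its prior."""
    return sum(p_z(z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))
```

In this toy network the observed difference `p_y1_given_x(1) - p_y1_given_x(0)` is larger than the causal effect `p_y1_do_x(1) - p_y1_do_x(0)`, because part of the association between fault and hazard is carried by the shared cause Z.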

Abstract:
Timely and personalized treatment decisions are essential across a wide range of healthcare settings where patient responses can vary significantly and evolve over time. Clinical data used to support these treatment decisions are often irregularly sampled, where missing data frequencies may implicitly convey information about the patient's condition. Existing Reinforcement Learning (RL) based clinical decision support systems often ignore the missing patterns and distort them with coarse discretization and simple imputation. They are also predominantly model‑free and largely depend on retrospective data, which could lead to insufficient exploration and bias by historical behaviors. To address these limitations, we propose medDreamer, a novel model‑based reinforcement learning framework for personalized treatment recommendation. medDreamer contains a world model with an Adaptive Feature Integration module that simulates latent patient states from irregular data and a two‑phase policy trained on a hybrid of real and imagined trajectories. This enables learning optimal policies that go beyond the sub‑optimality of historical clinical decisions, while remaining close to real clinical data. We evaluate medDreamer on both sepsis and mechanical ventilation treatment tasks using two large‑scale Electronic Health Records (EHRs) datasets. Comprehensive evaluations show that medDreamer significantly outperforms model‑free and model‑based baselines in both clinical outcomes and off‑policy metrics.

Abstract:
Recently, Model‑Based Reinforcement Learning (MBRL) agents have achieved super‑human level performance on the Atari100k benchmark on average. However, we discover that conventional aggregates mask a major problem, Performance Asymmetry: MBRL agents dramatically outperform humans in certain tasks (Agent‑Optimal tasks) while drastically underperforming humans in other tasks (Human‑Optimal tasks). Indeed, despite achieving SOTA in the overall mean Human‑Normalized Scores (HNS), the SOTA agent scored the worst among baselines on Human‑Optimal tasks, with a striking 21X performance gap between the Human‑Optimal and Agent‑Optimal subsets. To address this, we partition Atari100k evenly into Human‑Optimal and Agent‑Optimal subsets, and introduce a more balanced aggregate, Sym‑HNS. Furthermore, we trace the striking Performance Asymmetry in the SOTA pixel diffusion world model to the curse of dimensionality and its prowess on high visual detail tasks (e.g. Breakout). To this end, we propose a novel latent end‑to‑end Joint Embedding DIffusion (JEDI) world model that achieves SOTA results in Sym‑HNS, Human‑Optimal tasks, and Breakout, thus reversing the worsening Performance Asymmetry trend while improving computational efficiency and remaining competitive on the full Atari100k.
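
To see how a mean aggregate can hide the asymmetry described above, the following sketch computes the standard Human-Normalized Score, (agent - random) / (human - random), per task and then averages it. The two tasks and all raw scores are invented for illustration and are not Atari100k results. The abstract does not give the formula for Sym-HNS, so the sketch deliberately stops at per-subset means rather than guessing that definition.

```python
from statistics import mean


def human_normalized_score(agent, random_score, human):
    """Standard HNS: 0.0 matches random play, 1.0 matches the human reference."""
    if human == random_score:
        raise ValueError("human and random scores must differ")
    return (agent - random_score) / (human - random_score)


def mean_hns(tasks):
    """Mean HNS over an iterable of (agent, random_score, human) tuples."""
    return mean(human_normalized_score(a, r, h) for a, r, h in tasks)


# Invented raw scores, for illustration only: (agent, random, human).
agent_optimal_task = (100.0, 0.0, 10.0)   # HNS = 10.0, far above human level
human_optimal_task = (1.0, 0.0, 10.0)     # HNS = 0.1, far below human level
```

Here `mean_hns([agent_optimal_task, human_optimal_task])` is about 5.05, which reads as comfortably super-human even though the agent reaches only a tenth of human performance on one of the two tasks; reporting the two subsets separately makes that gap visible.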

Abstract:
Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well‑explored, physically meaningful interactions that mimic real‑world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force‑video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real‑world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.

Abstract:
Data‑driven learning has advanced autonomous driving, yet task‑specific models struggle with out‑of‑distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self‑supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large‑scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic‑aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task‑specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state‑of‑the‑art results on diverse tasks including occupancy prediction, flow estimation, and end‑to‑end driving. These results validate DriveX's capability as a general‑purpose world model, paving the way for robust and unified autonomous driving frameworks.

Abstract:
The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision‑language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open‑ended GUI environments. We introduce a world‑model‑based curiosity reward function to help the agent overcome the cold‑start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to static deployment models. Our findings offer a scalable pathway toward AGI systems with self‑improving capabilities in complex interactive settings.
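
The abstract does not specify how the curiosity reward is computed. A common formulation in the exploration literature, shown here purely as an assumption and not as ScreenExplorer's method, rewards the agent in proportion to the world model's prediction error: transitions the model predicts poorly are treated as novel. In the sketch, `predicted` and `observed` are hypothetical feature vectors for the next screen state, and `scale` is an illustrative weighting parameter.

```python
def curiosity_reward(predicted, observed, scale=1.0):
    """Prediction-error curiosity: scaled mean squared error between the
    world model's predicted next-state features and the observed ones."""
    if len(predicted) != len(observed):
        raise ValueError("feature vectors must have the same length")
    if not predicted:
        raise ValueError("feature vectors must be non-empty")
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return scale * squared_error / len(predicted)
```

A perfectly predicted transition yields zero reward, so under this formulation the bonus fades for screens the world model has already learned to anticipate.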

Abstract:
The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real‑world scenarios remains time‑consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real‑world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high‑dimensional encoding methods often fails to generate action‑following videos. To address this, we propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent actions to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real‑world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real‑world environments, we demonstrate a strong correlation between policy performance in WorldEval and real‑world scenarios. Furthermore, our method significantly outperforms popular methods such as the real‑to‑sim approach.

Abstract:
Achieving versatile humanoid locomotion with a single policy presents a critical scalability challenge. Prevailing methods often rely on distilling multiple terrain‑specific teacher policies into a unified student policy. However, while such distillation captures basic locomotion primitives, it struggles to organically compose these skills to adapt to complex environments, resulting in poor generalization to novel composite terrains unseen during training. To overcome this, we present DreamPolicy, a unified framework that integrates offline data with a diffusion‑based world model, enabling a single policy to master both known and unseen terrains. Central to our approach is a terrain‑aware world model, driven by an autoregressive diffusion world model trained on aggregated rollouts from specialized policies. This model synthesizes physically plausible future trajectories, which serve as dynamic objectives for a conditioned policy, thereby bypassing manual reward engineering. Unlike distillation, our world model captures generalizable locomotion skills, allowing for robust zero‑shot transfer to unseen composite terrains. DreamPolicy naturally scales with data availability. As the offline dataset expands, the diffusion world model continuously acquires richer skills. Experiments demonstrate that DreamPolicy outperforms the strongest baseline by up to 27% on unseen terrains and 38% on combined terrains. By unifying world model‑based planning and policy learning, DreamPolicy breaks the "one task, one policy" bottleneck and establishes a scalable, data‑driven paradigm for generalist humanoid control.

Abstract:
Real‑world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement aligns well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly, where they could predict the future by controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Following this line, previous methods are limited because predicting the video requires given actions of the same length as the video, and because they ignore the dynamical laws of actions. To address these issues, we propose ProphetDWM, a novel end‑to‑end driving world model that jointly predicts future videos and actions. Our world model has an action module that learns latent actions from the present to the future period given the action sequence and observations, and a diffusion‑model‑based transition module that learns the state distribution. The model is jointly trained by learning latent actions given finite states and predicting action and video. The joint learning connects the action dynamics and states and enables long‑term future prediction. We evaluate our method in video generation and action prediction tasks on the nuScenes dataset. Compared to the state‑of‑the‑art methods, our method achieves the best video consistency and best action prediction accuracy, while also enabling high‑quality long‑term video and action generation.

Abstract:
Model‑based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model‑free RL methods by simultaneously training a world model that learns to predict the future. These models constitute the large majority of training compute and time and they are subsequently used to train actors entirely in simulation, but once this is done they are quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model's short‑horizon latent predictions, offering a principled alternative to traditional curiosity‑driven methods that chase outdated estimates of high uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi‑step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, planning horizon length, and the commitment to searching entropy. While our method can theoretically be applied to any model that trains its own actors with solely model‑generated data, we have applied it to Dreamer to illustrate the concept. Our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer at convergence and in only 60% of the environment steps that base Dreamer's policy needs; in Crafter, it displays reasoned exploratory behaviour and achieves the same reward as base Dreamer in a third of the steps; and on DeepMind Control tasks, planning tends to improve sample efficiency.

Abstract:
The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision‑making. To address this problem, we propose Foresighted Planning with World Model‑Driven Code Execution (FPWC), a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment by developing a task‑oriented, refinable world model at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate compared to the state‑of‑the‑art in the simulated environment. Code and demo are provided in the supplementary material.

Abstract:
Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end‑to‑end autonomous driving (E2E‑AD) remains an open problem because of its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model‑based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual‑stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL‑based end‑to‑end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state‑of‑the‑art performance.

Abstract:
Autonomous driving policy learning with reinforcement learning (RL) is fundamentally limited by low sample efficiency, weak generalization, and a dependence on unsafe online trial‑and‑error interactions. Although safe RL introduces explicit constraints or costs, existing methods often fail to capture the semantic meaning of safety in real driving scenes, leading to conservative behaviors in simple cases and insufficient risk awareness in complex ones. To address this issue, we propose VLM‑SAFE, an offline safe RL framework that follows a human cognitive loop of observe‑imagine‑evaluate‑act. Starting from offline driving data, VLM‑SAFE observes traffic scenarios and leverages a vision‑language model (VLM) to provide semantic safety signals grounded in scene understanding. A learned world model then imagines future trajectories from the observed context, enabling the agent to reason about possible consequences without interacting with the real environment. Rather than using imagined rollouts solely for return estimation, VLM‑SAFE further evaluates these predicted futures with VLM‑based safety guidance, explicitly coupling future anticipation with semantic risk assessment. The resulting safety‑aware imagined experience is finally used to optimize the policy via actor‑critic learning, such that actions are chosen based on both predicted outcomes and their safety implications. By tightly integrating observation, imagination, evaluation, and action into a unified closed loop, VLM‑SAFE enables safer and more efficient offline policy learning for autonomous driving. Extensive experiments in simulation show that VLM‑SAFE achieves improved safety, stronger robustness under traffic‑density shift, and a better safety‑performance trade‑off than representative baselines.

Abstract:
Deploying learned control policies in real‑world environments poses a fundamental challenge. When system dynamics change unexpectedly, performance degrades until models are retrained on new data. We introduce Reflexive World Models (RWM), a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long‑term reward maximization through reinforcement learning and robust motor execution through rapid latent control. This dual architecture achieves significantly faster adaptation with low online computational cost compared to model‑based RL baselines, while maintaining near‑optimal performance. The approach combines the benefits of flexible policy learning through reinforcement learning with rapid error correction capabilities, providing a principled approach to maintaining performance in high‑dimensional continuous control tasks under varying dynamics.

Abstract:
The task of estimating the world model describing the dynamics of a real‑world process is of immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a set of neuro‑symbolic, human‑interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.

Abstract:
Causal world models are systems that can answer counterfactual questions about an environment of interest, i.e. predict how it would have evolved if an arbitrary subset of events had been realized differently. It requires understanding the underlying causes behind chains of events and conducting causal inference for arbitrary unseen distributions. So far, this task eludes foundation models, notably large language models (LLMs), which do not have demonstrated causal reasoning capabilities beyond the memorization of existing causal relationships. Furthermore, evaluating counterfactuals in real‑world applications is challenging since only the factual world is observed, limiting evaluation to synthetic datasets. We address these problems by explicitly extracting and modeling causal relationships and propose the Causal Cartographer framework. First, we introduce a graph retrieval‑augmented generation agent tasked to retrieve causal relationships from data. This approach allows us to construct a large network of real‑world causal relationships that can serve as a repository of causal knowledge and build real‑world counterfactuals. In addition, we create a counterfactual reasoning agent constrained by causal relationships to perform reliable step‑by‑step causal inference. We show that our approach can extract causal knowledge and improve the robustness of LLMs for causal reasoning tasks while reducing inference costs and spurious correlations.

Abstract:
Offline reinforcement learning (RL) offers a powerful paradigm for data‑driven control. Compared to model‑free approaches, offline model‑based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two‑stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation. To address these, we propose a framework that dynamically adapts the world model alongside the policy under a unified learning objective aimed at improving robustness. At the core of our method is a maximin optimization problem, which we solve by innovatively utilizing Stackelberg learning dynamics. We provide theoretical analysis to support our design and introduce computationally efficient implementations. We benchmark our algorithm on twelve noisy D4RL MuJoCo tasks and three stochastic Tokamak Control tasks, demonstrating its state‑of‑the‑art performance.

Abstract:
Many animals possess a remarkable capacity to rapidly construct flexible cognitive maps of their environments. These maps are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. Existing computational models typically require long sequential trajectories to build accurate maps, but neuroscience evidence suggests maps can also arise from integrating disjoint experiences governed by consistent spatial rules. We introduce the Episodic Spatial World Model (ESWM), a novel framework that constructs spatial maps from sparse, disjoint episodic memories. Across environments of varying complexity, ESWM predicts unobserved transitions from minimal experience, and the geometry of its latent space aligns with that of the environment. Because it operates on episodic memories that can be independently stored and updated, ESWM is inherently adaptive, enabling rapid adjustment to environmental changes. Furthermore, we demonstrate that ESWM readily enables near‑optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training. Our work demonstrates how neuroscience‑inspired principles of episodic memory can advance the development of more flexible and generalizable world models.

Abstract:
Standard imitation learning (IL) methods have achieved considerable success in robotics, yet often rely on the Markov assumption, which falters in long‑horizon tasks where history is crucial for resolving perceptual ambiguity. This limitation stems not only from a conceptual gap but also from a fundamental computational barrier: prevailing architectures like Transformers are often constrained by quadratic complexity, rendering the processing of long, high‑dimensional observation sequences infeasible. To overcome this dual challenge, we introduce Mamba Temporal Imitation Learning (MTIL). Our approach represents a new paradigm for robotic learning, which we frame as a practical synthesis of World Model and Dynamical System concepts. By leveraging the linear‑time recurrent dynamics of State Space Models (SSMs), MTIL learns an implicit, action‑oriented world model that efficiently encodes the entire trajectory history into a compressed, evolving state. This allows the policy to be conditioned on a comprehensive temporal context, transcending the confines of Markovian approaches. Through extensive experiments on simulated benchmarks (ACT, Robomimic, LIBERO) and on challenging real‑world tasks, MTIL demonstrates superior performance against SOTA methods like ACT and Diffusion Policy, particularly in resolving long‑term temporal ambiguities. Our findings not only affirm the necessity of full temporal context but also validate MTIL as a powerful and computationally feasible approach for learning long‑horizon, non‑Markovian behaviors from high‑dimensional observations.

Abstract:
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot‑object interactions from world models remains a well‑known challenge, particularly in achieving high‑quality pixel‑level representations. To this end, we propose LaDi‑WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi‑WM leverages the well‑established latent space aligned with pre‑trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO‑based) and semantic features (CLIP‑based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel‑level images. Building on LaDi‑WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real‑world benchmarks demonstrate that LaDi‑WM significantly enhances policy performance by 27.9% on the LIBERO‑LONG benchmark and 20% on the real‑world scenario. Furthermore, our world model and policies achieve impressive generalizability in real‑world experiments.

Abstract:
Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision‑making. Further, non‑AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model‑Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model‑Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations that show users what the world should have been like significantly increase their understanding of the agent policy. We hypothesize that our explanations can help users learn how to control the agent's execution by manipulating the environment.

Abstract:
Current robotic planning methods often rely on predicting multi‑frame images with full pixel details. While this fine‑grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real‑time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse‑grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off‑task predictions due to accumulation errors, leading to misalignment with long‑term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real‑time control in long‑horizon, multi‑stage tasks? To address this, we propose a Latent Space Backward Planning scheme (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on‑task prediction along the entire planning horizon. The subgoal‑conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real‑robot long‑horizon experiments, we show that LBP outperforms existing fine‑grained and forward planning methods, achieving SOTA performance. Project Page: https://lbp‑authors.github.io
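
The backward ordering described in this abstract (ground the final goal first, then recursively predict subgoals that lie closer to the current state) can be illustrated with a toy sketch. This is not the LBP implementation: `backward_subgoals`, `predict_closer`, and the midpoint rule standing in for a learned subgoal predictor are all hypothetical names and assumptions chosen only to show the recursion.

```python
import numpy as np

def backward_subgoals(z_now, z_goal, predict_closer, depth=3):
    """Build a subgoal chain backward from the final goal.

    predict_closer(z_now, z_far) is a hypothetical stand-in for a learned
    model that returns a subgoal between the current latent and z_far.
    The returned list is ordered from the nearest subgoal to the final goal.
    """
    chain = [z_goal]                      # start from the grounded final goal
    for _ in range(depth):
        chain.append(predict_closer(z_now, chain[-1]))
    return chain[::-1]                    # nearest subgoal first

def midpoint(z_now, z_far):
    # Assumption for illustration only: a real predictor would be learned.
    return 0.5 * (z_now + z_far)

z_now = np.zeros(2)
z_goal = np.array([8.0, 0.0])
subgoals = backward_subgoals(z_now, z_goal, midpoint)
```

Because every subgoal is derived from the final goal, each one stays tied to task completion, which is the property the abstract attributes to backward planning.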

Abstract:
Offline reinforcement learning (RL) enables policy optimization using static datasets, avoiding the risks and costs of extensive real‑world exploration. However, it struggles with suboptimal offline behaviors and inaccurate value estimation due to the lack of environmental interaction. We present Video‑Enhanced Offline RL (VeoRL), a model‑based method that constructs an interactive world model from diverse, unlabeled video data readily available online. Leveraging model‑based behavior guidance, our approach transfers commonsense knowledge of control policy and physical dynamics from natural videos to the RL agent within the target domain. VeoRL achieves substantial performance gains (over 100% in some cases) across visual control tasks in robotic manipulation, autonomous driving, and open‑world video games.

Abstract:
Understanding and forecasting scene evolution deeply affects the exploration and decision‑making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine‑grained overall scene dynamics. However, existing methods concentrate on outdoor structured road scenes and neglect forecasting 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolution of observed fine‑grained occupancy and propose RoboOccWorld, an occupancy world model that forecasts scene evolution based on a combined spatio‑temporal receptive field and a guided autoregressive transformer. We propose Conditional Causal State Attention (CCSA), which uses the camera poses of the next state as conditions to guide the autoregressive transformer to adapt to and understand indoor robotics scenarios. To effectively exploit the spatio‑temporal cues from historical observations, Hybrid Spatio‑Temporal Aggregation (HSTA) is proposed to obtain the combined spatio‑temporal receptive field based on multi‑scale spatio‑temporal windows. In addition, we restructure the OccWorld‑ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that RoboOccWorld outperforms state‑of‑the‑art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.

Abstract:
The ability to simulate the effects of future actions on the world is a crucial ability of intelligent embodied agents, enabling agents to anticipate the effects of their actions and make plans accordingly. While a large body of existing work has explored how to construct such world models using video models, they are often myopic in nature, without any memory of a scene not captured by currently observed images, preventing agents from making consistent long‑horizon plans in complex environments where many parts of the scene are partially observed. We introduce a new persistent embodied world model with an explicit memory of previously generated content, enabling much more consistent long‑horizon simulation. During generation time, our video diffusion model predicts RGB‑D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment. By conditioning the video model on this 3D spatial map, we illustrate how this enables video world models to faithfully simulate both seen and unseen parts of the world. Finally, we illustrate the efficacy of such a world model in downstream embodied applications, enabling effective planning and policy learning.
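
As a rough illustration of how predicted RGB‑D frames could be aggregated into a persistent 3D map, the sketch below back‑projects a depth image through a pinhole camera model and accumulates the points in a sparse voxel set. The abstract does not specify the map representation, so the voxel set, the function names `unproject` and `integrate`, and the intrinsics used here are assumptions made for illustration.

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth image (H, W) to world-space points (H*W, 3)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]             # pixel row (v) and column (u) indices
    x = (u - cx) * depth / fx             # pinhole camera model
    y = (v - cy) * depth / fy
    ones = np.ones_like(depth)
    pts_cam = np.stack([x, y, depth, ones], axis=-1).reshape(-1, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]

def integrate(voxel_map, points, voxel_size=1.0):
    """Add points to a persistent map stored as a set of voxel indices."""
    for idx in np.floor(points / voxel_size).astype(int):
        voxel_map.add(tuple(idx.tolist()))
    return voxel_map

depth = np.ones((2, 2))                   # toy 2x2 depth frame, 1 unit away
pose_a = np.eye(4)                        # camera at the origin
pose_b = np.eye(4)
pose_b[0, 3] = 2.0                        # camera shifted 2 units along x

world_map = set()
integrate(world_map, unproject(depth, 1.0, 1.0, 0.5, 0.5, pose_a))
n_first = len(world_map)
integrate(world_map, unproject(depth, 1.0, 1.0, 0.5, 0.5, pose_a))
n_repeat = len(world_map)                 # revisiting adds nothing new
integrate(world_map, unproject(depth, 1.0, 1.0, 0.5, 0.5, pose_b))
n_moved = len(world_map)                  # a new viewpoint extends the map
```

Re‑observing the same region leaves the map unchanged, while a new viewpoint extends it. A generator conditioned on such a map has access to content that is no longer in view.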

Abstract:
The 6G wireless communications aim to establish an intelligent world of ubiquitous connectivity, providing an unprecedented communication experience. Large artificial intelligence models (LAMs) are characterized by significantly larger scales (e.g., billions or trillions of parameters) compared to typical artificial intelligence (AI) models. LAMs exhibit outstanding cognitive abilities, including strong generalization capabilities for fine‑tuning to downstream tasks, and emergent capabilities to handle tasks unseen during training. Therefore, LAMs efficiently provide AI services for diverse communication applications, making them crucial tools for addressing complex challenges in future wireless communication systems. This study provides a comprehensive review of the foundations, applications, and challenges of LAMs in communication. First, we introduce the current state of AI‑based communication systems, emphasizing the motivation behind integrating LAMs into communications and summarizing the key contributions. We then present an overview of the essential concepts of LAMs in communication. This includes an introduction to the main architectures of LAMs, such as transformer, diffusion models, and mamba. We also explore the classification of LAMs, including large language models (LLMs), large vision models (LVMs), large multimodal models (LMMs), and world models, and examine their potential applications in communication. Additionally, we cover the training methods and evaluation techniques for LAMs in communication systems. Lastly, we introduce optimization strategies such as chain of thought (CoT), retrieval augmented generation (RAG), and agentic systems. Following this, we discuss the research advancements of LAMs across various communication scenarios. Finally, we analyze the challenges in the current research and provide insights into potential future research directions.

Abstract:
Joint‑embedding self‑supervised learning (SSL) commonly relies on transformations such as data augmentation and masking to learn visual representations, a task achieved by enforcing invariance or equivariance with respect to these transformations applied to two views of an image. This dominant two‑view paradigm in SSL often limits the flexibility of learned representations for downstream adaptation by creating performance trade‑offs between high‑level invariance‑demanding tasks such as image classification and more fine‑grained equivariance‑related tasks. In this work, we propose seq‑JEPA, a world modeling framework that introduces architectural inductive biases into joint‑embedding predictive architectures to resolve this trade‑off. Without relying on dual equivariance predictors or loss terms, seq‑JEPA simultaneously learns two architecturally separate representations for equivariance‑ and invariance‑demanding tasks. To do so, our model processes short sequences of different views (observations) of inputs. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. These view‑action pairs are passed through a transformer encoder that outputs an aggregate representation. A predictor head then conditions this aggregate representation on the upcoming action to predict the representation of the next observation. Empirically, seq‑JEPA demonstrates strong performance on both equivariance‑ and invariance‑demanding downstream tasks without sacrificing one for the other. Furthermore, it excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.

Abstract:
World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Despite their impressive performance, many solutions fail to learn a causal representation of the environment they are trying to model, which would be necessary to gain a deep enough understanding of the world to perform complex tasks. With this work, we aim to broaden the research at the intersection of causality theory and neural world modelling by assessing the potential for causal discovery of the State Space Model (SSM) architecture, which has been shown to have several advantages over the widespread Transformer. We show empirically that, compared to an equivalent Transformer, an SSM can model the dynamics of a simple environment and learn a causal model at the same time with equivalent or better performance, thus paving the way for further experiments that lean into the strength of SSMs and further enhance them with causal awareness.

Abstract:
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self‑supervised depth estimation, PosePilot leverages structure‑from‑motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self‑supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose‑aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general‑domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion‑based and auto‑regressive world models. By steering camera pose with self‑supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.

Abstract:
Traditional reinforcement learning (RL)‑based learning approaches for wireless networks rely on expensive trial‑and‑error mechanisms and real‑time feedback based on extensive environment interactions, which leads to low data efficiency and short‑sighted policies. These limitations become particularly problematic in complex, dynamic networks with high uncertainty and long‑term planning requirements. To address these limitations, in this paper, a novel world model‑based learning framework is proposed to minimize packet‑completeness‑aware age of information (CAoI) in a vehicular network. Particularly, a challenging representative scenario is considered pertaining to a millimeter‑wave (mmWave) vehicle‑to‑everything (V2X) communication network, which is characterized by high mobility, frequent signal blockages, and extremely short coherence time. Then, a world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling. In particular, the long‑term policy is learned in differentiable imagined trajectories instead of environment interactions. Moreover, owing to its imagination abilities, the world model can jointly predict time‑varying wireless data and optimize link scheduling in real‑world wireless and V2X networks. Thus, during intervals without actual observations, the world model remains capable of making efficient decisions. Extensive experiments are performed on a realistic simulator based on Sionna that integrates physics‑based end‑to‑end channel modeling, ray‑tracing, and scene geometries with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency, and achieves 26% improvement and 16% improvement in CAoI, respectively, compared to the model‑based RL (MBRL) method and the model‑free RL (MFRL) method.

Abstract:
Planning remains a core challenge for large language models (LLMs), particularly in domains that require coherent multi‑step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LLMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine‑grained comparison of candidate plans by evaluating them jointly. Conceptually, SymPlanner operationalizes two cognitive faculties: (i) error monitoring and repair via externalized feedback (IC) and (ii) preference formation among alternatives via pairwise comparison (CR), advancing cognitively plausible, symbol‑grounded planning aligned with the rich structure in intelligent systems. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.
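
The propose, verify, and correct loop that the abstract calls Iterative Correction can be sketched on a toy problem. Everything below is hypothetical: the single‑stack block domain, the `toy_policy` standing in for an LLM policy, and the function names are assumptions for illustration, and Contrastive Ranking is not modelled.

```python
from typing import Callable, List, Optional, Set, Tuple

State = Tuple[str, ...]  # toy symbolic state: one stack of blocks, bottom to top

def valid_actions(state: State, blocks: Set[str]) -> Set[str]:
    """Symbolic rule: only a block not already on the stack may be stacked."""
    return {f"stack {b}" for b in blocks - set(state)}

def execute(state: State, action: str) -> State:
    """Deterministic transition of the symbolic environment."""
    return state + (action.split()[1],)

def plan_with_correction(
    propose: Callable[[State, State, Set[str]], str],
    start: State, goal: State, blocks: Set[str],
    max_retries: int = 3, max_steps: int = 10,
) -> Optional[List[str]]:
    state, plan = start, []
    for _ in range(max_steps):
        if state == goal:
            return plan
        rejected: Set[str] = set()
        for _ in range(max_retries):
            action = propose(state, goal, rejected)
            if action in valid_actions(state, blocks):
                break
            rejected.add(action)           # environment feedback for the retry
        else:
            return None                    # no valid action within the budget
        state = execute(state, action)
        plan.append(action)
    return plan if state == goal else None

def toy_policy(state: State, goal: State, rejected: Set[str]) -> str:
    # Deliberately flawed stand-in for an LLM: its first guess is always the
    # bottom block of the goal, so it needs correction after the first step.
    for candidate in (f"stack {goal[0]}", f"stack {goal[len(state)]}"):
        if candidate not in rejected:
            return candidate
    return f"stack {goal[len(state)]}"

plan = plan_with_correction(toy_policy, (), ("a", "b", "c"), {"a", "b", "c"})
```

The policy's invalid proposals never reach the plan, because the symbolic environment rejects them and the rejection set steers the next proposal.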

Abstract:
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB‑DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off‑the‑shelf models. Next, we fine‑tune a video generation model on this annotated dataset, which jointly predicts RGB‑DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high‑quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video‑based world models.

Abstract:
Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing evaluation protocols. Collecting real‑world data is resource‑intensive and inefficient, while benchmarking in real‑world scenarios remains highly complex. Synthetic data and simulation offer promising alternatives, yet existing efforts often fall short in data quality, diversity, and benchmark standardization. To address these challenges, we introduce RoboVerse, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks. Our simulation platform supports multiple simulators and robotic embodiments, enabling seamless transitions between different environments. The synthetic dataset, featuring high‑fidelity physics and photorealistic rendering, is constructed through multiple approaches. Additionally, we propose unified benchmarks for imitation learning and reinforcement learning, enabling evaluation across different levels of generalization. At the core of the simulation platform is MetaSim, an infrastructure that abstracts diverse simulation environments into a universal interface. It restructures existing simulation environments into a simulator‑agnostic configuration system, as well as an API aligning different simulator functionalities, such as launching simulation environments, loading assets with initial states, stepping the physics engine, etc. This abstraction ensures interoperability and extensibility. Comprehensive experiments demonstrate that RoboVerse enhances the performance of imitation learning, reinforcement learning, world model learning, and sim‑to‑real transfer. These results validate the reliability of our dataset and benchmarks, establishing RoboVerse as a robust solution for advancing robot learning.

Abstract:
This paper presents DriVerse, a generative model for simulating navigation‑driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low‑fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine‑grained, trajectory‑specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter‑frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.

Abstract:
This paper introduces an approach for developing surrogate environments in reinforcement learning (RL) using the Sparse Identification of Nonlinear Dynamics (SINDy) algorithm. We demonstrate the effectiveness of our approach through extensive experiments in OpenAI Gym environments, particularly Mountain Car and Lunar Lander. Our results show that SINDy‑based surrogate models can accurately capture the underlying dynamics of these environments while reducing computational costs by 20‑35%. With only 75 interactions for Mountain Car and 1000 for Lunar Lander, we achieve state‑wise correlations exceeding 0.997, with mean squared errors as low as 3.11e‑06 for Mountain Car velocity and 1.42e‑06 for Lunar Lander position. RL agents trained in these surrogate environments require fewer total steps (65,075 vs. 100,000 for Mountain Car and 801,000 vs. 1,000,000 for Lunar Lander) while achieving comparable performance to those trained in the original environments, exhibiting similar convergence patterns and final performance metrics. This work contributes to the field of model‑based RL by providing an efficient method for generating accurate, interpretable surrogate environments.
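The sparse-regression idea behind SINDy can be illustrated on a toy system. The sketch below is not the pipeline from the abstract above: it assumes a one-dimensional system dx/dt = -2x with exact derivatives, a hypothetical candidate library [1, x, x^2], and a hand-picked threshold. It uses sequentially thresholded least squares, a common SINDy optimizer, written in plain Python.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstsq(theta, y, active):
    """Least squares restricted to the active library columns (normal equations)."""
    A = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
    b = [sum(row[i] * yi for row, yi in zip(theta, y)) for i in active]
    return solve(A, b)


def stlsq(theta, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, drop small terms, refit."""
    n = len(theta[0])
    active = list(range(n))
    coef = [0.0] * n
    for _ in range(max_iter):
        sol = lstsq(theta, y, active)
        coef = [0.0] * n
        for i, c in zip(active, sol):
            coef[i] = c
        kept = [i for i in active if abs(coef[i]) >= threshold]
        if kept == active:
            break
        if not kept:
            return [0.0] * n
        active = kept
    return [c if abs(c) >= threshold else 0.0 for c in coef]


# Toy data (assumption): samples of x and the exact derivative dx/dt = -2x.
xs = [0.1 * k for k in range(1, 21)]
dxs = [-2.0 * x for x in xs]
theta = [[1.0, x, x * x] for x in xs]  # candidate library: 1, x, x^2

coef = stlsq(theta, dxs)
print(coef)  # coefficients for [1, x, x^2]
```

Only the x term should survive thresholding, with a coefficient close to -2. In a surrogate-environment setting the derivatives would instead be estimated from recorded transitions, which introduces noise that this sketch ignores.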

Abstract:
While non‑prehensile manipulation (e.g., controlled pushing/poking) constitutes a foundational robotic skill, its learning remains challenging due to the high sensitivity to complex physical interactions involving friction and restitution. To achieve robust policy learning and generalization, we opt to learn a world model of the 3D rigid body dynamics involved in non‑prehensile manipulations and use it for model‑based reinforcement learning. We propose PIN‑WM, a Physics‑INformed World Model that enables efficient end‑to‑end identification of a 3D rigid body dynamical system from visual observations. Adopting differentiable physics simulation, PIN‑WM can be learned with only few‑shot and task‑agnostic physical interaction trajectories. Further, PIN‑WM is learned with an observational loss induced by Gaussian Splatting, without needing state estimation. To bridge Sim2Real gaps, we turn the learned PIN‑WM into a group of Digital Cousins via physics‑aware randomizations, which perturb physics and rendering parameters to generate diverse and meaningful variations of the PIN‑WM. Extensive evaluations in both simulation and real‑world tests demonstrate that PIN‑WM, enhanced with physics‑aware digital cousins, facilitates learning robust non‑prehensile manipulation skills with Sim2Real transfer, surpassing state‑of‑the‑art Real2Sim2Real methods.

Abstract:
Reinforcement Learning (RL) has achieved impressive results in robotics, yet high‑performing pipelines remain highly task‑specific, with little reuse of prior data. Offline Model‑based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long‑horizon rollouts. Although existing methods have shown success in controlled simulation benchmarks, robustly applying them to the noisy, biased, and partially observed datasets typical of real‑world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM‑U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi‑step rollouts with uncertainty effectively propagated over long horizons. We combine RWM‑U with MOPO‑PPO, which adapts uncertainty‑penalized policy optimization to the stable, on‑policy PPO framework for real‑world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on real quadruped and humanoid robots, training policies entirely from offline datasets. The resulting policies consistently outperform model‑free and uncertainty‑unaware model‑based baselines, and fusing real‑world data in model learning further yields robust policies that surpass online model‑free baselines trained solely in simulation.
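Uncertainty-penalized policy optimization in the MOPO style subtracts an uncertainty estimate from the model's predicted reward. The sketch below shows that penalty under stated assumptions: the abstract above does not say how RWM‑U estimates epistemic uncertainty, so ensemble disagreement (largest per-dimension standard deviation across hypothetical ensemble members) stands in for it, and `penalty_coef` is an illustrative value.

```python
import statistics


def ensemble_uncertainty(predictions):
    """Disagreement among ensemble members (assumed uncertainty proxy).

    predictions: list of next-state vectors, one per ensemble member.
    Returns the largest per-dimension population standard deviation.
    """
    return max(statistics.pstdev(dim) for dim in zip(*predictions))


def penalized_reward(reward, predictions, penalty_coef=1.0):
    """MOPO-style reward: r - lambda * u(s, a)."""
    return reward - penalty_coef * ensemble_uncertainty(predictions)


# Two hypothetical ensemble members that disagree on the second state dimension.
disagreeing = [[1.0, 0.0], [1.0, 2.0]]
# Two members that agree exactly, so no penalty applies.
agreeing = [[1.0, 0.0], [1.0, 0.0]]

r_uncertain = penalized_reward(3.0, disagreeing, penalty_coef=0.5)
r_confident = penalized_reward(3.0, agreeing, penalty_coef=0.5)
print(r_uncertain, r_confident)
```

The penalized reward discourages the policy from exploiting regions where the learned model is unreliable, which is the failure mode that compounding errors create in long offline rollouts.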

Abstract:
While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction‑following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower‑level primitives, conditioning the world model on these primitives to achieve compositional instruction‑following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on the action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as an action tree and assign embeddings to tree nodes, so that each instruction acquires its embeddings by navigating through the action tree. The instruction embeddings can then be used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts instruction‑following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics in both seen and unseen tasks, with PSNR improved from 19.55 to 21.05, SSIM improved from 0.7474 to 0.7982, and Flow Error reduced from 3.506 to 3.201 in unseen tasks, compared to the recent RoboDreamer model. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% on average across six RLBench tasks.

Abstract:
As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self‑improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self‑monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non‑duality dissolves adversarial self‑other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint‑reward on the Prisoner's Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain‑of‑thought. For future systems, active inference may offer the self‑organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.

Abstract:
Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real‑world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task‑specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user‑tailored dialogue policy planning. Building on this foundation, we present the User‑Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge‑inspired anticipator to predict user reactions; and (3) User‑Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non‑collaborative settings, demonstrate the effectiveness of UDP in learning user‑specific dialogue strategies. Results validate the protocol's utility and highlight UDP's robustness, adaptability, and potential to advance user‑centric dialogue systems.

Abstract:
Imitation learning is a powerful tool for training agents by leveraging expert knowledge, and being able to replicate a given trajectory is an integral part of it. In complex environments, like modern 3D video games, distribution shift and stochasticity necessitate robust approaches beyond simple action replay. In this study, we apply Inverse Dynamics Models (IDM) with different encoders and policy heads to trajectory following in a modern 3D video game ‑‑ Bleeding Edge. Additionally, we investigate several future alignment strategies that address the distribution shift caused by the aleatoric uncertainty and imperfections of the agent. We measure both the trajectory deviation distance and the first significant deviation point between the reference and the agent's trajectory and show that the optimal configuration depends on the chosen setting. Our results show that in a diverse data setting, a GPT‑style policy head with an encoder trained from scratch performs the best, DINOv2 encoder with the GPT‑style policy head gives the best results in the low data regime, and both GPT‑style and MLP‑style policy heads had comparable results when pre‑trained on a diverse setting and fine‑tuned for a specific behaviour setting.

Abstract:
Spatial reasoning in partially observable environments has often been approached through passive predictive models, yet theories of embodied cognition suggest that genuinely useful representations arise only when perception is tightly coupled to action. Here we ask whether a recurrent agent, trained solely by sparse rewards to solve procedurally generated planar mazes, can autonomously internalize metric concepts such as direction, distance and obstacle layout. After training, the agent consistently produces near‑optimal paths in unseen mazes, behavior that hints at an underlying spatial model. To probe this possibility, we cast the closed agent‑environment loop as a hybrid dynamical system, identify stable limit cycles in its state space, and characterize behavior with a Ridge Representation that embeds whole trajectories into a common metric space. Canonical correlation analysis exposes a robust linear alignment between neural and behavioral manifolds, while targeted perturbations of the most informative neural dimensions sharply degrade navigation performance. Taken together, these dynamical, representational, and causal signatures show that sustained sensorimotor interaction is sufficient for the spontaneous emergence of compact, embodied world models, providing a principled path toward interpretable and transferable navigation policies.

Abstract:
Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English‑centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit‑implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non‑English languages we explore, which enables LLMs to be steered from the synthetic English cultural world‑model toward each non‑English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.

Abstract:
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM‑agent‑driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large‑scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large‑scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

Abstract:
Controllers trained with Reinforcement Learning tend to be very specialized and thus generalize poorly when their testing environment differs from their training one. We propose a Model‑Based approach to increase generalization where both world model and policy are trained in a dimensionless state‑action space. To do so, we introduce the Dimensionless Markov Decision Process (Π‑MDP): an extension of Contextual‑MDPs in which state and action spaces are non‑dimensionalized with the Buckingham‑Π theorem. This procedure induces policies that are equivariant with respect to changes in the context of the underlying dynamics. We provide a generic framework for this approach and apply it to a model‑based policy search algorithm using Gaussian Process models. We demonstrate the applicability of our method on simulated actuated pendulum and cartpole systems, where policies trained on a single environment are robust to shifts in the distribution of the context.
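The non-dimensionalization step can be made concrete for an actuated pendulum. The sketch below is an illustration under assumptions, not the Π groups used in the abstract above: it takes mass m, length l, and gravity g as the context, uses sqrt(l/g) as the time scale and m·g·l as the torque scale, and plugs in a hypothetical linear policy defined in dimensionless space.

```python
import math


def to_dimensionless(theta, omega, m, l, g):
    """Map a pendulum state to dimensionless form.

    The angle is already dimensionless; angular velocity is scaled by the
    characteristic time sqrt(l / g). Mass does not enter the state scaling.
    """
    return theta, omega * math.sqrt(l / g)


def to_dimensional_torque(torque_star, m, l, g):
    """Rescale a dimensionless torque by the characteristic torque m * g * l."""
    return torque_star * m * g * l


def policy_star(theta, omega_star):
    """Hypothetical policy acting purely on dimensionless quantities."""
    return -1.5 * theta - 0.5 * omega_star


def act(theta, omega, m, l, g):
    """Dimensional torque for a pendulum with context (m, l, g)."""
    theta_star, omega_star = to_dimensionless(theta, omega, m, l, g)
    return to_dimensional_torque(policy_star(theta_star, omega_star), m, l, g)


g = 9.81
theta, omega_star = 0.3, 0.4
# Two pendulums with different contexts but the same dimensionless state.
m_a, l_a = 1.0, 1.0
m_b, l_b = 2.0, 0.25
tau_a = act(theta, omega_star * math.sqrt(g / l_a), m_a, l_a, g)
tau_b = act(theta, omega_star * math.sqrt(g / l_b), m_b, l_b, g)

# The physical torques differ, but their dimensionless values coincide.
star_a = tau_a / (m_a * g * l_a)
star_b = tau_b / (m_b * g * l_b)
print(tau_a, tau_b, star_a, star_b)
```

Because the policy only sees dimensionless inputs, a change of context rescales its output rather than changing its behavior, which is the equivariance property the abstract refers to.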

Abstract:
World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real‑time interactive world model on Minecraft, an open‑ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual‑action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer, respectively, we construct the model input by interleaving the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. At inference, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models of different scales generate 4 to 7 frames per second and enabling real‑time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action‑following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, which significantly outperforms SoTA open‑sourced diffusion‑based world models. The code and model have been released.
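The interleaved input described above can be sketched in a few lines. This is an assumption-laden illustration rather than MineWorld's actual tokenization: the vocabulary size, the id offset that keeps action ids separate from image ids, and the helper names are all hypothetical.

```python
IMAGE_VOCAB = 8192  # hypothetical size of the image tokenizer's codebook


def interleave(frames, actions):
    """Build one sequence: frame tokens, then that frame's action tokens, repeated.

    frames:  list of lists of image token ids, one list per game scene
    actions: list of lists of action token ids, one list per game scene
    Action ids are shifted by IMAGE_VOCAB so the two id ranges never collide
    (an assumed convention for sharing a single vocabulary).
    """
    if len(frames) != len(actions):
        raise ValueError("each frame needs a matching action")
    seq = []
    for frame_ids, action_ids in zip(frames, actions):
        seq.extend(frame_ids)
        seq.extend(IMAGE_VOCAB + a for a in action_ids)
    return seq


def next_token_pairs(seq):
    """(input, target) pairs for next token prediction."""
    return list(zip(seq[:-1], seq[1:]))


frames = [[5, 9], [7, 1]]
actions = [[2], [0]]
seq = interleave(frames, actions)
pairs = next_token_pairs(seq)
print(seq)
print(pairs[:2])
```

With this layout, predicting the tokens of the second frame is conditioned on both the first frame and the action taken in it, which is what lets a single next-token objective capture state-action dependencies.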

Abstract:
Reinforcement learning (RL) agents have shown remarkable performance in various environments, where they can discover effective policies directly from sensory inputs. However, these agents often exploit spurious correlations in the training data, resulting in brittle behaviours that fail to generalize to new or slightly modified environments. To address this, we introduce the Causal Object‑centric Model Extraction Tool (COMET), a novel algorithm designed to learn exact, interpretable causal world models (CWMs). COMET first extracts object‑centric state descriptions from observations and identifies the environment's internal states related to the depicted objects' properties. Using symbolic regression, it models object‑centric transitions and derives causal relationships governing object dynamics. COMET further incorporates large language models (LLMs) for semantic inference, annotating causal variables to enhance interpretability. By leveraging these capabilities, COMET constructs CWMs that align with the true causal structure of the environment, enabling agents to focus on task‑relevant features. The extracted CWMs mitigate the danger of shortcuts, permitting the development of RL systems capable of better planning and decision‑making across dynamic scenarios. Our results, validated in Atari environments such as Pong and Freeway, demonstrate the accuracy and robustness of COMET, highlighting its potential to bridge the gap between object‑centric reasoning and causal inference in reinforcement learning.

Abstract:
An embodied system must not only model the patterns of the external world but also understand its own motion dynamics. A motion dynamics model is essential for efficient skill acquisition and effective planning. In this work, we introduce the neural motion simulator (MoSim), a world model that predicts the future physical state of an embodied system based on current observations and actions. MoSim achieves state‑of‑the‑art performance in physical state prediction and provides competitive performance across a range of downstream tasks. This work shows that when a world model is accurate enough and performs precise long‑horizon predictions, it can facilitate efficient skill acquisition in imagined worlds and even enable zero‑shot reinforcement learning. Furthermore, MoSim can transform any model‑free reinforcement learning (RL) algorithm into a model‑based approach, effectively decoupling physical environment modeling from RL algorithm development. This separation allows for independent advancements in RL algorithms and world modeling, significantly improving sample efficiency and enhancing generalization capabilities. Our findings highlight that world models for motion dynamics are a promising direction for developing more versatile and capable embodied systems.

Abstract:
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic 'brain in a vat' thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade‑off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade‑off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.

Abstract:
Developing effective world models is crucial for creating artificial agents that can reason about and navigate complex environments. In this paper, we investigate a deep supervision technique for encouraging the development of a world model in a network trained end‑to‑end to predict the next observation. While deep supervision has been widely applied for task‑specific learning, our focus is on improving the world models. Using an experimental environment based on the Flappy Bird game, where the agent receives only LIDAR measurements as observations, we explore the effect of adding a linear probe component to the network's loss function. This additional term encourages the network to encode a subset of the true underlying world features into its hidden state. Our experiments demonstrate that this supervision technique improves both training and test performance, enhances training stability, and results in more easily decodable world features ‑‑ even for those world features which were not included in the training. Furthermore, we observe a reduced distribution drift in networks trained with the linear probe, particularly during high‑variability phases of the game (flying between successive pipe encounters). Including the world features loss component roughly corresponded to doubling the model size, suggesting that the linear probe technique is particularly beneficial in compute‑limited settings or when aiming to achieve the best performance with smaller models. These findings contribute to our understanding of how to develop more robust and sophisticated world models in artificial agents, paving the way for further advancements in this field.
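The linear probe term in the loss can be written out explicitly. The sketch below is a minimal illustration, assuming mean squared error for both terms and a hypothetical weighting `probe_weight`; the abstract above does not specify either choice, and in practice the hidden state and probe parameters would come from a trained network.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def linear_probe(hidden, weights, bias):
    """Linear readout of world features from the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def total_loss(pred_obs, true_obs, hidden, probe_w, probe_b, world_feats,
               probe_weight=0.5):
    """Next-observation loss plus a weighted linear-probe loss.

    The probe term rewards hidden states from which the supervised subset of
    true world features is linearly decodable.
    """
    prediction_loss = mse(pred_obs, true_obs)
    probe_loss = mse(linear_probe(hidden, probe_w, probe_b), world_feats)
    return prediction_loss + probe_weight * probe_loss


# Hand-picked illustrative values.
loss = total_loss(
    pred_obs=[1.0, 2.0], true_obs=[1.0, 4.0],   # prediction MSE = 2.0
    hidden=[1.0, -1.0],
    probe_w=[[2.0, 0.0]], probe_b=[0.5],        # probe output = [2.5]
    world_feats=[1.5],                          # probe MSE = 1.0
)
print(loss)
```

Because the probe is linear, gradients from the probe term flow into the hidden state itself, so the network is pushed to organize its representation around the world features rather than merely memorizing them in the probe.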

Abstract:
We propose a fully decentralized multi‑agent world model that enables both symbol emergence for communication and coordinated behavior through temporal extension of collective predictive coding. Unlike previous research that focuses on either communication or coordination separately, our approach achieves both simultaneously. Our method integrates world models with communication channels, enabling agents to predict environmental dynamics, estimate states from partial observations, and share critical information through bidirectional message exchange with contrastive learning for message alignment. Using a two‑agent trajectory drawing task, we demonstrate that our communication‑based approach outperforms non‑communicative models when agents have divergent perceptual capabilities, achieving the second‑best coordination after centralized models. Importantly, our decentralized approach with constraints preventing direct access to other agents' internal states facilitates the emergence of more meaningful symbol systems that accurately reflect environmental states. These findings demonstrate the effectiveness of decentralized communication for supporting coordination while developing shared representations of the environment.

Abstract:
Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real‑world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics‑informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics‑informed neural networks and vision‑language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open‑sourced at our project page.
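A conservation-law check of the kind motivated above can be sketched for a falling object. This is not one of the Morpheus metrics, which the abstract does not specify: it assumes that object height and velocity have already been extracted from the video, ignores drag, and uses a hypothetical relative-drift score for mechanical energy.

```python
G = 9.81  # gravitational acceleration in m/s^2


def mechanical_energy(mass, height, velocity):
    """Kinetic plus gravitational potential energy."""
    return 0.5 * mass * velocity ** 2 + mass * G * height


def conservation_violation(mass, heights, velocities):
    """Largest relative deviation of total energy from its initial value."""
    energies = [mechanical_energy(mass, h, v)
                for h, v in zip(heights, velocities)]
    e0 = energies[0]
    return max(abs(e - e0) for e in energies) / abs(e0)


# Physically consistent free fall from 10 m (no drag).
times = [0.0, 0.2, 0.4, 0.6]
fall_h = [10.0 - 0.5 * G * t ** 2 for t in times]
fall_v = [-G * t for t in times]

# Implausible clip: the object rises and speeds up with no energy source.
bad_h = [10.0, 10.5, 11.0]
bad_v = [0.0, 2.0, 4.0]

ok_score = conservation_violation(1.0, fall_h, fall_v)
bad_score = conservation_violation(1.0, bad_h, bad_v)
print(ok_score, bad_score)
```

A score near zero indicates that the tracked motion respects energy conservation, while a large score flags a generated clip whose motion gains or loses energy without a physical cause.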

Abstract:
Simulation‑to‑reality reinforcement learning (RL) faces the critical challenge of reconciling discrepancies between simulated and real‑world dynamics, which can severely degrade agent performance. A promising approach involves learning corrections to simulator forward dynamics, represented as a residual error function; however, this operation is impractical with high‑dimensional states such as images. To overcome this, we propose ReDRAW, a latent‑state autoregressive world model pretrained in simulation and calibrated to target environments through residual corrections of latent‑state dynamics rather than of explicit observed states. Using this adapted world model, ReDRAW enables RL agents to be optimized with imagined rollouts under corrected dynamics and then deployed in the real world. In multiple vision‑based MuJoCo domains and a physical robot visual lane‑following task, ReDRAW effectively models changes to dynamics and avoids overfitting in low‑data regimes where traditional transfer methods fail.
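The idea of correcting latent dynamics with a residual can be sketched with a deliberately simple model. The example below is an assumption throughout: the frozen simulation dynamics are a hypothetical linear map, and the residual is a constant per-dimension offset fitted as a mean error, which is far simpler than whatever parameterization ReDRAW uses.

```python
def sim_dynamics(z, a):
    """Frozen latent dynamics learned in simulation (hypothetical linear map)."""
    return [0.9 * zi + 0.1 * a for zi in z]


def fit_residual(transitions):
    """Fit a constant residual per latent dimension from target-domain data.

    transitions: list of (z, a, z_next) tuples collected in the target domain.
    Returns the mean of (z_next - sim_dynamics(z, a)) for each dimension.
    """
    n = len(transitions)
    dim = len(transitions[0][0])
    residual = [0.0] * dim
    for z, a, z_next in transitions:
        pred = sim_dynamics(z, a)
        for i in range(dim):
            residual[i] += (z_next[i] - pred[i]) / n
    return residual


def corrected_dynamics(z, a, residual):
    """Simulation prediction plus the learned residual correction."""
    return [p + r for p, r in zip(sim_dynamics(z, a), residual)]


# Target domain (assumption): simulation dynamics shifted by a fixed offset.
TRUE_SHIFT = [0.2, -0.1]


def target_step(z, a):
    return [p + s for p, s in zip(sim_dynamics(z, a), TRUE_SHIFT)]


# A handful of target-domain transitions is enough for this toy residual.
data = [(z, a, target_step(z, a))
        for z, a in [([0.0, 1.0], 0.5), ([1.0, -1.0], -0.5), ([0.5, 0.5], 1.0)]]
residual = fit_residual(data)

z_test, a_test = [0.3, 0.7], 0.2
pred = corrected_dynamics(z_test, a_test, residual)
truth = target_step(z_test, a_test)
print(residual, pred, truth)
```

Keeping the simulation model frozen and learning only a small correction is what limits overfitting when target-domain data is scarce: the residual has far fewer degrees of freedom than the full dynamics model.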

Abstract:
Recent advances in foundation models (FMs), including large language models (LLMs), vision‑language models (VLMs), and world models, have opened new opportunities for autonomous driving systems (ADSs) in perception, reasoning, decision‑making, and interaction. However, ADSs are safety‑critical cyber‑physical systems, and integrating FMs into them raises substantial software engineering challenges in data curation, system design, deployment, evaluation, and assurance. To clarify this rapidly evolving landscape, we present an initial roadmap, grounded in a structured literature review, for integrating FMs into autonomous driving across three dimensions: FM infrastructure, in‑vehicle integration, and practical deployment. For each dimension, we summarize the state of the art, identify key challenges, and highlight open research opportunities. Based on this analysis, we outline research directions for building reliable, safe, and trustworthy FM‑enabled ADSs.

Abstract:
We generalize and extend results on the localization of gravity on Karch‑Randall‑Sundrum brane‑worlds with positive, negative, or zero cosmological constant on the brane. We do so both from the study of bulk metric perturbations, and from their reinterpretation through brane‑world holography: an induced higher‑derivative theory of gravity coupled to a cut‑off CFT on the brane. We then enhance these models by adding an explicit Einstein‑Hilbert term on the brane action (i.e. a DGP term) and, through studying the brane position and the localization of gravity on the brane, we establish bounds for its coupling constant, beyond which the theory presents pathologies. We finally study the limit in which the brane reaches the boundary, and comment on adding further higher‑derivative terms on the brane action.

Abstract:
Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open‑world environments. Yet, prior work either incorporates only one of these abilities in an end‑to‑end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end‑to‑end Generalist policy, termed RIG. To train RIG in an end‑to‑end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than 17× sample efficiency improvements and better generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action's outcomes, which offers the agent a chance to review and self‑correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test‑time scaling to enhance overall performance.

Abstract:
One of the core components of our world models is 'intuitive physics' ‑ an understanding of objects, space, and causality. This capability enables us to predict events, plan action and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research ‑ Gestalt psychology, enactive cognition, and developmental psychology ‑ and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities, not solving the objecthood challenge fully. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general‑purpose AI with genuine object understanding in real‑world contexts.

Abstract:
The inability of autonomous vehicles (AVs) to infer the material properties of obstacles limits their decision‑making capacity. While AVs rely on sensor systems such as cameras, LiDAR, and radar to detect obstacles, this study suggests combining sensors with a knowledge graph (KG)‑based world model to improve AVs' comprehension of physical material qualities. Beyond sensor data, AVs can infer qualities such as malleability, density, and elasticity using a semantic KG that depicts the relationships between obstacles and their attributes. Using the CARLA autonomous driving simulator, we evaluated AV performance with and without KG integration. The findings demonstrate that the KG‑based method improves obstacle management, which allows AVs to use material qualities to make better decisions about when to change lanes or apply emergency braking. For example, the KG‑integrated AV changed lanes for hard impediments like traffic cones and successfully avoided collisions with flexible items such as plastic bags by passing over them. Compared to the control system, the KG framework demonstrated improved responsiveness to obstacles by resolving conflicting sensor data, causing emergency stops for 13.3% more cases. In addition, our method exhibits a 6.6% higher success rate in lane‑changing maneuvers in experimental scenarios, particularly for larger, high‑impact obstacles. While we focus particularly on autonomous driving, our work demonstrates the potential of KG‑based world models to improve decision‑making in embodied AI systems and scale to other domains, including robotics, healthcare, and environmental simulation.

Abstract:
Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model‑free algorithms, has shown potential in addressing sparse feedback but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3's returns in Crafter but the algorithm attains a suboptimal policy in Minigrid with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre‑training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.

Abstract:
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain‑specific requirements of autonomous driving ‑ such as multi‑agent interactions, fine‑grained control, and multi‑camera consistency. We introduce GAIA‑2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA‑2 supports controllable video generation conditioned on a rich set of structured inputs: ego‑vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high‑resolution, spatiotemporally consistent multi‑camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA‑2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia‑2.

Abstract:
Navigating in environments alongside humans requires agents to reason under uncertainty and account for the beliefs and intentions of those around them. Under a sequential decision‑making framework, egocentric navigation can naturally be represented as a Markov Decision Process (MDP). However, social navigation additionally requires reasoning about the hidden beliefs of others, inherently leading to a Partially Observable Markov Decision Process (POMDP), where agents lack direct access to others' mental states. Inspired by Theory of Mind and Epistemic Planning, we propose (1) a neuro‑symbolic model‑based reinforcement learning architecture for social navigation, addressing the challenge of belief tracking in partially observable environments; and (2) a perspective‑shift operator for belief estimation, leveraging recent work on Influence‑based Abstractions (IBA) in structured multi‑agent settings.

Abstract:
Modern reinforcement learning (RL) systems have demonstrated remarkable capabilities in complex environments, such as video games. However, they still fall short of achieving human‑like sample efficiency and adaptability when learning new domains. Theory‑based reinforcement learning (TBRL) is an algorithmic framework specifically designed to address this gap. Modeled on cognitive theories, TBRL leverages structured, causal world models ‑ "theories" ‑ as forward simulators for use in planning, generalization and exploration. Although current TBRL systems provide compelling explanations of how humans learn to play video games, they face several technical limitations: their theory languages are restrictive, and their planning algorithms are not scalable. To address these challenges, we introduce TheoryCoder, an instantiation of TBRL that exploits hierarchical representations of theories and efficient program synthesis methods for more powerful learning and planning. TheoryCoder equips agents with general‑purpose abstractions (e.g., "move to"), which are then grounded in a particular environment by learning a low‑level transition model (a Python program synthesized from observations by a large language model). A bilevel planning algorithm can exploit this hierarchical structure to solve large domains. We demonstrate that this approach can be successfully applied to diverse and challenging grid‑world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions.

Abstract:
As interest grows in world models that predict future states from current observations and actions, accurately modeling part‑level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet‑Master, rely on fine‑tuning large‑scale pre‑trained video diffusion models, which are impractical for real‑world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part‑level motion from multi‑view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag‑4D dataset, providing multi‑view observations of part‑level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi‑scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine‑tuning, we implement a two‑stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state‑of‑the‑art in part‑level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.

Abstract:
Intelligent organisms can solve truly novel problems which they have never encountered before, either in their lifetime or their evolution. An important component of this capacity is the ability to "think", that is, to mentally manipulate objects, concepts and behaviors in order to plan and evaluate possible solutions to novel problems, even without environment interaction. To generate problems that are truly qualitatively novel, while still solvable zero‑shot (by mental simulation), we use the combinatorial nature of environments: we train the agent while withholding a specific combination of the environment's elements. The novel test task, based on this combination, is thus guaranteed to be truly novel, while still mentally simulable since the agent has been exposed to each individual element (and their pairwise interactions) during training. We propose a method to train agents endowed with world models to make use of their mental simulation abilities, by selecting tasks based on the difference between the agent's pre‑thinking and post‑thinking performance. When tested on the novel, withheld problem, the resulting agent successfully simulated alternative scenarios and used the resulting information to guide its behavior in the actual environment, solving the novel task in a single real‑environment trial (zero‑shot).

Abstract:
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human‑like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry‑aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action‑conditioned video prediction, and (3) goal‑conditioned visual planning. Through task‑interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero‑shot synthetic‑to‑real generalization despite never observing real‑world data during training. Furthermore, our approach achieves zero‑shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real‑world data, its reconstruction performance is comparable with or even better than that of domain‑specific models. Additionally, Aether employs camera trajectories as geometry‑informed action spaces, enabling effective action‑conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically‑reasonable world modeling and its applications.

Abstract:
Given its ability to analyse stochastic models ranging from discrete and continuous‑time Markov chains to Markov decision processes and stochastic games, probabilistic model checking (PMC) is widely used to verify system dependability and performance properties. However, modelling the behaviour of, and verifying these properties for, many software‑intensive systems requires the joint analysis of multiple interdependent stochastic models of different types, which existing PMC techniques and tools cannot handle. To address this limitation, we introduce a tool‑supported UniversaL stochasTIc Modelling, verificAtion and synThEsis (ULTIMATE) framework that supports the representation, verification and synthesis of heterogeneous multi‑model stochastic systems with complex model interdependencies. Through its unique integration of multiple PMC paradigms, and underpinned by a novel verification method for handling model interdependencies, ULTIMATE unifies, for the first time, the modelling of probabilistic and nondeterministic uncertainty, discrete and continuous time, partial observability, and the use of both Bayesian and frequentist inference to exploit domain knowledge and data about the modelled system and its context. A comprehensive suite of case studies and experiments confirms the generality and effectiveness of our novel verification framework.

Abstract:
World Models help Artificial Intelligence (AI) predict outcomes, reason about its environment, and guide decision‑making. While widely used in reinforcement learning, they lack the structured, adaptive representations that even young children intuitively develop. Advancing beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget's cognitive development theory. We highlight six key research areas ‑‑ physics‑informed learning, neurosymbolic learning, continual learning, causal inference, human‑in‑the‑loop AI, and responsible AI ‑‑ as essential for enabling true reasoning in AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition to genuine understanding, adaptation and reasoning capabilities.

Abstract:
We introduce BriLLM, a brain‑inspired large language model that fundamentally redefines the foundations of machine learning through its implementation of Signal Fully‑connected flowing (SiFu) learning. This work addresses the critical bottleneck hindering AI's progression toward Artificial General Intelligence (AGI)‑‑the disconnect between language models and "world models"‑‑as well as the fundamental limitations of Transformer‑based architectures rooted in the conventional representation learning paradigm. BriLLM incorporates two pivotal neurocognitive principles: (1) static semantic mapping, where tokens are mapped to specialized nodes analogous to cortical areas, and (2) dynamic signal propagation, which simulates electrophysiological information dynamics observed in brain activity. This architecture enables multiple transformative breakthroughs: natural multi‑modal compatibility, full model interpretability at the node level, context‑length independent scaling, and the first global‑scale simulation of brain‑like information processing for language tasks. Our initial 1‑2B parameter models successfully replicate GPT‑1‑level generative capabilities while demonstrating stable perplexity reduction. Scalability analyses confirm the feasibility of 100‑200B parameter variants capable of processing 40,000‑token vocabularies. The paradigm is reinforced by both Occam's Razor‑‑evidenced in the simplicity of direct semantic mapping‑‑and natural evolution‑‑given the brain's empirically validated AGI architecture. BriLLM establishes a novel, biologically grounded framework for AGI advancement that addresses fundamental limitations of current approaches.

Abstract:
Recent advances in large vision‑language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D^2PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial‑and‑error. Extensive experiments on VoTa‑Bench demonstrate that our D^2PO‑based method significantly outperforms existing methods and GPT‑4o when applied to Qwen2‑VL (7B), LLaVA‑1.6 (7B), and LLaMA‑3.2 (11B), achieving superior task success rates with more efficient execution paths.

Abstract:
We introduce LUMOS, a language‑conditioned multi‑task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long‑horizon rollouts in the latent space of a learned world model and transfers these skills zero‑shot to a real robot. By learning on‑policy in the latent space of the learned world model, our algorithm mitigates policy‑induced distribution shift which most offline imitation learning methods suffer from. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long‑horizon performance by combining latent planning with both image‑ and language‑based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long‑horizon CALVIN benchmark, LUMOS outperforms prior learning‑based methods with comparable approaches on chained multi‑task evaluations. To the best of our knowledge, we are the first to learn a language‑conditioned continuous visuomotor control for a real‑world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni‑freiburg.de.

Abstract:
Multimodal information‑gathering settings, where users collaborate with AI in dynamic environments, are increasingly common. These involve complex processes with textual and multimodal interactions, often requiring additional structural information via cost‑incurring requests. AI helpers lack access to users' true goals, beliefs, and preferences and struggle to integrate diverse information effectively. We propose a social continual learning framework for causal knowledge acquisition and collaborative decision‑making. It focuses on autonomous agents learning through dialogues, question‑asking, and interaction in open, partially observable environments. A key component is a natural language oracle that answers the agent's queries about environmental mechanisms and states, refining causal understanding while balancing exploration or learning, and exploitation or knowledge use. Evaluation tasks inspired by developmental psychology emphasize causal reasoning and question‑asking skills. They complement benchmarks by assessing the agent's ability to identify knowledge gaps, generate meaningful queries, and incrementally update reasoning. The framework also evaluates how knowledge acquisition costs are amortized across tasks within the same environment. We propose two architectures: 1) a system combining Large Language Models (LLMs) with the ReAct framework and question‑generation, and 2) an advanced system with a causal world model, symbolic, graph‑based, or subsymbolic, for reasoning and decision‑making. The latter builds a causal knowledge graph for efficient inference and adaptability under constraints. Challenges include integrating causal reasoning into ReAct and optimizing exploration and question‑asking in error‑prone scenarios. Beyond applications, this framework models developmental processes combining causal reasoning, question generation, and social learning.

Abstract:
Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object‑centric autoencoder. On synthetic benchmark and real‑world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Abstract:
Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step‑by‑step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high‑quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD‑Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow‑matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD‑Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion‑based methods. Empirically, we validate TD‑Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD‑Flow with recent behavior foundation models for planning over pre‑trained policies demonstrates substantial performance gains, underscoring its promise for long‑horizon decision‑making.

Abstract:
Advanced end‑to‑end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. A world model that can foresee the outcome of the trajectory has been used to evaluate the autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT‑WM, unifying Ego‑Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego‑other vehicle trajectories in the BEV space into image coordinates to match vehicles and trajectories via pixel positions. Then, trajectory videos are encoded by the Spatial‑Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory‑injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego‑other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state‑of‑the‑art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self‑produced trajectories.

Abstract:
We present a novel study on enhancing the capability of preserving the content in world models, focusing on a property we term World Stability. Recent diffusion‑based generative models have advanced the synthesis of immersive and realistic environments that are pivotal for applications such as reinforcement learning and interactive game engines. However, while these models excel in quality and diversity, they often neglect the preservation of previously generated scenes over time‑‑a shortfall that can introduce noise into agent learning and compromise performance in safety‑critical settings. In this work, we introduce an evaluation framework that measures world stability by having world models perform a sequence of actions followed by their inverses to return to their initial viewpoint, thereby quantifying the consistency between the starting and ending observations. Our comprehensive assessment of state‑of‑the‑art diffusion‑based world models reveals significant challenges in achieving high world stability. Moreover, we investigate several improvement strategies to enhance world stability. Our results underscore the importance of world stability in world modeling and provide actionable insights for future research in this domain.
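
The action-then-inverse evaluation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' framework: the names `world_stability_gap`, `inverse_action`, `perfect_step` and `drifting_step` are hypothetical, observations are assumed to be 2D positions, actions are assumed to be displacement vectors whose inverse is their negation, and plain L2 distance stands in for whatever consistency measure the paper uses.

```python
import numpy as np


def inverse_action(action):
    """Hypothetical inverse: actions are displacements, so negate them."""
    return -action


def world_stability_gap(step, obs0, actions, distance):
    """Apply `actions`, then their inverses in reverse order, and return
    the distance between the initial and the final observation."""
    obs = obs0
    for a in actions:
        obs = step(obs, a)
    for a in reversed(actions):
        obs = step(obs, inverse_action(a))
    return distance(obs0, obs)


def perfect_step(obs, action):
    """Toy world model that applies every action exactly."""
    return obs + action


def drifting_step(obs, action):
    """Toy world model that adds a small constant drift at every step."""
    return obs + action + np.array([0.1, 0.0])


def l2(a, b):
    """Assumed consistency measure: Euclidean distance."""
    return float(np.linalg.norm(a - b))


obs0 = np.zeros(2)
actions = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 1.0])]

gap_perfect = world_stability_gap(perfect_step, obs0, actions, l2)
gap_drift = world_stability_gap(drifting_step, obs0, actions, l2)
print(gap_perfect, gap_drift)
```

In this toy setting the exact model returns to its starting observation, while the drifting model accumulates an offset over the six steps, which is the kind of inconsistency the round-trip protocol is meant to expose.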

Abstract:
In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D‑GS) through P‑Optimality. While 3D‑GS has proven to be a useful world model with high‑quality rasterizations, it does not natively quantify uncertainty or information, posing a challenge for real‑world applications such as 3D‑GS SLAM. We propose to quantify information gain in 3D‑GS by reformulating the problem through the lens of optimal experimental design, which is a classical solution widely used in literature. By restructuring information quantification of 3D‑GS through optimal experimental design, we arrive at multiple solutions, of which T‑Optimality and D‑Optimality perform the best quantitatively and qualitatively as measured on two popular datasets. Additionally, we propose a block diagonal covariance approximation which provides a measure of correlation at the expense of a greater computation cost.

Abstract:
The article analyses foundational principles relevant to the creation of artificial general intelligence (AGI). Intelligence is understood as the ability to create novel skills that make it possible to achieve goals under previously unknown conditions. To this end, intelligence utilises reasoning methods such as deduction, induction and abduction as well as other methods such as abstraction and classification to develop a world model. The methods are applied to indirect and incomplete representations of the world, which are obtained through perception, for example, and which do not depict the world but only correspond to it. Due to these limitations and the uncertain and contingent nature of reasoning, the world model is constructivist. Its value is functionally determined by its viability, i.e., its potential to achieve the desired goals. In consequence, meaning is assigned to representations by attributing to them a function that makes it possible to achieve a goal. This representational and functional conception of intelligence enables a naturalistic interpretation that does not presuppose mental features, such as intentionality and consciousness, which are regarded as independent of intelligence. Based on a phenomenological analysis, it is shown that AGI can gain a more fundamental access to the world than humans, although it is limited by the No Free Lunch theorems, which require assumptions to be made.

Abstract:
Occupancy World Models (OWMs) aim to predict future scenes via 3D voxelized representations of the environment to support intelligent motion planning. Existing approaches typically generate full future occupancy states from VAE‑style latent encodings, which can be computationally expensive and redundant. We propose Delta‑Triplane Transformers (DTT), a novel 4D OWM for autonomous driving that introduces two key innovations: (1) a triplane‑based representation that encodes 3D occupancy more compactly than previous approaches, and (2) an incremental prediction strategy for OWM that models changes in occupancy rather than dealing with full states. The core insight is that changes in the compact 3D latent space are naturally sparser and easier to model, enabling higher accuracy with a lighter‑weight architecture. Building on this representation, DTT extracts multi‑scale motion features from historical data and iteratively predicts future triplane deltas. These deltas are combined with past states to decode future occupancy and ego‑motion trajectories. Extensive experiments demonstrate that DTT delivers a 1.44× speedup (26 FPS) over the state of the art, improves mean IoU to 30.85, and reduces the mean absolute planning error to 1.0 meters. Demo videos are provided in the supplementary material.

Abstract:
A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. To achieve this, recent advancements have focused on video generation, which has gained significant attention due to the impressive success of diffusion models. However, these models require substantial computational resources. To address these challenges, we propose a world model leveraging object‑centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object‑centric representation and predicts future states in this representation space conditioned on natural language instructions. This approach results in a more compact and computationally efficient model compared to diffusion‑based generative alternatives. Furthermore, it flexibly predicts future states based on language instructions, and offers a significant advantage in manipulation tasks where object recognition is crucial. In this paper, we demonstrate that our latent predictive world model surpasses generative world models in visuo‑linguo‑motor control tasks, achieving superior sample and computation efficiency. We also investigate the generalization performance of the proposed method and explore various strategies for predicting actions using object‑centric representations.

Abstract:
Many reinforcement learning (RL) algorithms are impractical for training in operational systems or computationally expensive high‑fidelity simulations, as they require large amounts of data. Meanwhile, low‑fidelity simulators, e.g., reduced‑order models, heuristic rewards, or learned world models, can cheaply provide useful data, even if they are too coarse for zero‑shot transfer. We propose multi‑fidelity policy gradients (MFPGs), a sample‑efficient RL framework that mixes scarce target‑environment data with a control variate formed from abundant low‑fidelity simulation data to construct an unbiased, variance‑reduced estimator for on‑policy policy gradients. We instantiate the framework with a practical, multi‑fidelity variant of the classical REINFORCE algorithm. Under standard assumptions, the MFPG estimator guarantees asymptotic convergence to locally optimal policies in the target environment and achieves faster finite‑sample convergence than standard REINFORCE. We evaluate MFPG on robotics benchmark tasks with limited high‑fidelity data but abundant off‑dynamics, low‑fidelity data. When low‑fidelity data are neutral or beneficial and dynamics gaps are mild‑moderate, MFPG is, among the evaluated off‑dynamics RL and low‑fidelity‑only approaches, the only method that consistently achieves statistically significant improvements over a high‑fidelity‑only baseline. When low‑fidelity data become harmful, MFPG exhibits the strongest robustness, whereas strong off‑dynamics RL methods exploit low‑fidelity data aggressively and fail much more severely. An additional experiment with anti‑correlated high‑ and low‑fidelity rewards shows MFPG can remain effective even under reward misspecification. MFPG thus offers a reliable paradigm for exploiting cheap low‑fidelity data (e.g., for efficient sim‑to‑real transfer) while managing the trade‑off between policy performance and data collection cost.
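
The control-variate idea behind this abstract can be illustrated on a scalar mean-estimation problem. The sketch below is a generic textbook control variate, not the MFPG policy-gradient estimator itself: the function names, the Gaussian toy data, the sample sizes and the fixed coefficient `c = 1.0` are all assumptions made for illustration. A few "high-fidelity" samples are paired with correlated "low-fidelity" samples, and a large extra batch of low-fidelity samples supplies an accurate estimate of the low-fidelity mean.

```python
import numpy as np


def control_variate_estimate(x_hi, y_lo_paired, y_lo_extra, c=1.0):
    """Estimate E[X] from scarce high-fidelity samples `x_hi`.

    `y_lo_paired` are low-fidelity samples correlated with `x_hi`;
    `y_lo_extra` is a large independent low-fidelity batch. The correction
    term has zero mean when both low-fidelity batches share the same
    distribution, so the estimate stays unbiased for E[X].
    """
    return x_hi.mean() - c * (y_lo_paired.mean() - y_lo_extra.mean())


def simulate(n_trials=200, n_hi=10, n_lo=10_000, seed=0):
    """Compare the variance of the plain mean and the control-variate mean."""
    rng = np.random.default_rng(seed)
    plain, cv = [], []
    for _ in range(n_trials):
        z = rng.normal(1.0, 1.0, size=n_hi)          # shared randomness
        x_hi = z + 0.1 * rng.normal(size=n_hi)       # toy high-fidelity signal
        y_lo_paired = z                              # toy correlated low-fidelity signal
        y_lo_extra = rng.normal(1.0, 1.0, size=n_lo)  # abundant low-fidelity data
        plain.append(x_hi.mean())
        cv.append(control_variate_estimate(x_hi, y_lo_paired, y_lo_extra))
    return float(np.var(plain)), float(np.var(cv))


var_plain, var_cv = simulate()
print(f"plain variance: {var_plain:.5f}, control-variate variance: {var_cv:.5f}")
```

The variance reduction in this toy case depends on how strongly the two fidelities are correlated; with weakly correlated or anti-correlated signals, a different coefficient `c` would be needed.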

Abstract:
Model‑based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet, existing solutions often rely on carefully crafted, task specific extrinsic rewards, limiting generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration), a method that leverages purely intrinsic, disagreement based rewards within a Dreamer based MBRL framework. By training an ensemble of world models, the agent actively explores high uncertainty regions of environments without any task specific feedback. This approach yields a task agnostic latent representation, allowing for rapid zero shot or few shot fine tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions compared to DreamerV2 and DreamerV3 baselines despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle control behaviors, paving the way for more scalable and adaptable autonomous driving systems.
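
A disagreement-based intrinsic reward of the kind mentioned in this abstract is often computed as the variance of an ensemble's predictions for the same input. The sketch below shows that common formulation under stated assumptions; it is not taken from InDRiVE, and the function name `disagreement_reward` and the array layout (one row per ensemble member, one column per predicted feature) are hypothetical.

```python
import numpy as np


def disagreement_reward(predictions):
    """Intrinsic reward as ensemble disagreement.

    `predictions` has shape (ensemble_size, feature_dim): each row is one
    ensemble member's prediction of the next latent state for the same
    state-action pair. The reward is the per-feature variance across
    members, averaged over features.
    """
    predictions = np.asarray(predictions, dtype=float)
    return float(predictions.var(axis=0).mean())


# All members agree: no disagreement, so no intrinsic reward.
agree = disagreement_reward([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])

# Members differ on the first feature: positive intrinsic reward.
disagree = disagreement_reward([[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]])

print(agree, disagree)
```

States where the ensemble members diverge receive a higher reward, which steers exploration toward regions the world model has not yet learned well.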

Abstract:
Adapting quickly to dynamic, uncertain environments, often called "open worlds", remains a major challenge in robotics. Traditional Task and Motion Planning (TAMP) approaches struggle to cope with unforeseen changes, are data‑inefficient when adapting, and do not leverage world models during learning. We address this issue with a hybrid planning and learning system that integrates two models: a low‑level neural‑network‑based model that learns stochastic transitions and drives exploration via an Intrinsic Curiosity Module (ICM), and a high‑level symbolic planning model that captures abstract transitions using operators, enabling the agent to plan in an "imaginary" space and generate reward machines. Our evaluation in a robotic manipulation domain with sequential novelty injections demonstrates that our approach converges faster and outperforms state‑of‑the‑art hybrid methods.

Abstract:
Li et al. (2023) used the Othello board game as a test case for the ability of GPT‑2 to induce world models, and were followed up by Nanda et al. (2023b). We briefly discuss the original experiments, expanding them to include more language models with more comprehensive probing. Specifically, we analyze sequences of Othello board states and train the model to predict the next move based on previous moves. We evaluate seven language models (GPT‑2, T5, Bart, Flan‑T5, Mistral, LLaMA‑2, and Qwen2.5) on the Othello task and conclude that these models not only learn to play Othello, but also induce the Othello board layout. We find that all models achieve up to 99% accuracy in unsupervised grounding and exhibit high similarity in the board features they learned. This provides considerably stronger evidence for the Othello World Model Hypothesis than previous works.

Abstract:
The DreamerV3 algorithm recently obtained remarkable performance across diverse environment domains by learning an accurate world model based on Recurrent Neural Networks (RNNs). Following the success of model‑based reinforcement learning algorithms and the rapid adoption of the Transformer architecture for its superior training efficiency and favorable scaling properties, recent works such as STORM have proposed replacing RNN‑based world models with Transformer‑based world models using masked self‑attention. However, despite the improved training efficiency of these methods, their impact on performance remains limited compared to the Dreamer algorithm, struggling to learn competitive Transformer‑based world models. In this work, we show that the next state prediction objective adopted in previous approaches is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER (Transformer‑based World model wIth contraSTivE Representations), a world model using action‑conditioned Contrastive Predictive Coding to learn high‑level temporal feature representations and improve the agent performance. TWISTER achieves a human‑normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state‑of‑the‑art methods that do not employ look‑ahead search.
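
Contrastive Predictive Coding is typically trained with the InfoNCE loss, which scores the true future representation against negative samples. The sketch below shows that loss for a single prediction. It assumes the similarity scores have already been computed, and the function name `info_nce` is hypothetical rather than taken from the TWISTER code.

```python
import math


def info_nce(positive_score, negative_scores):
    """InfoNCE loss for one prediction: -log softmax of the positive score.

    positive_score  -- similarity between the prediction and the true future
    negative_scores -- similarities between the prediction and negatives
    """
    logits = [positive_score] + list(negative_scores)
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - positive_score
```

When every candidate scores the same, the loss equals the log of the number of candidates. It approaches zero as the positive score dominates.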

Abstract:
We propose DRAGO, a novel approach for continual model‑based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.

Abstract:
Learning‑based controllers are often purposefully kept out of real‑world applications due to concerns about their safety and reliability. We explore how state‑of‑the‑art world models in Model‑Based Reinforcement Learning can be utilized beyond the training phase to ensure a deployed policy only operates within regions of the state‑space it is sufficiently familiar with. This is achieved by continuously monitoring discrepancies between a world model's predictions and observed system behavior during inference. It allows for triggering appropriate measures, such as an emergency stop, once an error threshold is surpassed. This does not require any task‑specific knowledge and is thus universally applicable. Simulated experiments on established robot control tasks show the effectiveness of this method, recognizing changes in local robot geometry and global gravitational magnitude. Real‑world experiments using an agile quadcopter further demonstrate the benefits of this approach by detecting unexpected forces acting on the vehicle. These results indicate how even in new and adverse conditions, safe and reliable operation of otherwise unpredictable learning‑based controllers can be achieved.
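
The monitoring idea described above can be sketched as a running check on the world model's prediction error. The windowed mean of Euclidean errors, the class name `DiscrepancyMonitor`, and its parameters are illustrative assumptions; the abstract does not specify the error metric or how the threshold is chosen.

```python
from collections import deque


class DiscrepancyMonitor:
    """Flags when recent world-model prediction errors exceed a threshold."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.errors = deque(maxlen=window)  # keeps only the latest errors

    def update(self, predicted, observed):
        """Record one step and return True if a safety measure should fire."""
        # Euclidean distance between predicted and observed state vectors.
        err = sum((p - o) ** 2 for p, o in zip(predicted, observed)) ** 0.5
        self.errors.append(err)
        return sum(self.errors) / len(self.errors) > self.threshold
```

A deployed controller would call `update` once per control step and trigger a measure such as an emergency stop when it returns `True`. Averaging over a window trades detection latency for robustness to single noisy observations.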

Abstract:
Reinforcement learning (RL) has evolved into a widely investigated technology for the development of smart traffic signal control (TSC) strategies. However, current RL algorithms necessitate excessive interaction with the environment to learn effective policies, making them impractical for large‑scale tasks. The DreamerV3 algorithm presents compelling properties for policy learning. It summarizes general dynamics knowledge about the environment and enables the prediction of future outcomes of potential actions from past experience, reducing the interaction with the environment through imagination training. In this paper, a corridor TSC model is trained using the DreamerV3 algorithm to explore the benefits of world models for TSC strategy learning. In the RL environment design, to manage congestion levels effectively, both the state and reward functions are defined based on queue length, and the action is designed to manage queue length efficiently. Using the SUMO simulation platform, two hyperparameters of the DreamerV3 algorithm (training ratio and model size) were tuned and analyzed across different origin‑destination (OD) matrix scenarios. We discovered that choosing a smaller model size and initially attempting several medium training ratios can significantly reduce the time spent on hyperparameter tuning. Additionally, we found that the approach is generally applicable, as it can solve two TSC task scenarios with the same hyperparameters. Regarding the claimed data‑efficiency of the DreamerV3 algorithm, due to the significant fluctuation of the episode reward curve in the early stages of training, it can only be confirmed that larger model sizes exhibit modest data‑efficiency, and no evidence was found that increasing the training ratio accelerates convergence.

Abstract:
As autonomous systems are increasingly deployed in open and uncertain settings, there is a growing need for trustworthy world models that can reliably predict future high‑dimensional observations. The learned latent representations in world models lack direct mapping to meaningful physical quantities and dynamics, limiting their utility and interpretability in downstream planning, control, and safety verification. In this paper, we argue for a fundamental shift from physically informed to physically interpretable world models ‑ and crystallize four principles that leverage symbolic knowledge to achieve these ends: (1) functionally organizing the latent space according to the physical intent, (2) learning aligned invariant and equivariant representations of the physical world, (3) integrating multiple forms and strengths of supervision into a unified training process, and (4) partitioning generative outputs to support scalability and verifiability. We experimentally demonstrate the value of each principle on two benchmarks. This paper opens several intriguing research directions to achieve and capitalize on full physical interpretability in world models.

Abstract:
Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task‑based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain, often only uncover low‑level interactions. In contrast, children's play suggests that they engage in meaningful high‑level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language‑embedded environments or access to high‑level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model‑based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model‑based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game‑like simulations SENSEI discovers a variety of meaningful behaviors from image observations and low‑level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction, as VLMs become more powerful.

Abstract:
To go from (passive) process monitoring to active process control, an effective AI system must learn about the behavior of the complex system from very limited training data, forming an ad‑hoc digital twin with respect to process inputs and outputs that captures the consequences of actions on the process's world. We propose a novel methodology based on learning world models that disentangles process parameters in the learned latent representation, allowing for fine‑grained control. Representation learning is driven by the latent factors influencing the processes through contrastive learning within a joint embedding predictive architecture. This makes changes in representations predictable from changes in inputs and vice versa, facilitating interpretability of key factors responsible for process variations, paving the way for effective control actions to keep the process within operational bounds. The effectiveness of our method is validated on the example of plastic injection molding, demonstrating practical relevance in proposing specific control actions for a notoriously unstable process.

Abstract:
In robotic reinforcement learning, the Sim2Real gap remains a critical challenge. However, the impact of static friction on Sim2Real has been underexplored. Conventional domain randomization methods typically exclude static friction from their parameter space. In our robotic reinforcement learning task, such conventional domain randomization approaches resulted in significantly underperforming real‑world models. To address this Sim2Real challenge, we employed Actuator Net as an alternative to conventional domain randomization. While this method enabled successful transfer to flat‑ground locomotion, it failed on complex terrains like stairs. To further investigate physical parameters affecting Sim2Real in robotic joints, we developed a control‑theoretic joint model and performed systematic parameter identification. Our analysis revealed unexpectedly high friction‑torque ratios in our robotic joints. To mitigate the impact of this friction, we implemented static‑friction‑aware domain randomization for Sim2Real. Recognizing the increased training difficulty introduced by friction modeling, we proposed a simple and novel solution to reduce learning complexity. To validate this approach, we conducted comprehensive Sim2Sim and Sim2Real experiments comparing three methods: conventional domain randomization (without static friction), Actuator Net, and our static‑friction‑aware domain randomization. All experiments utilized the Rapid Motor Adaptation (RMA) algorithm. Results demonstrated that our method achieved superior adaptive capabilities and overall performance.

Abstract:
In this paper, we propose AUKAI, an Adaptive Unified Knowledge‑Action Intelligence for embodied cognition that seamlessly integrates perception, memory, and decision‑making via multi‑scale error feedback. Interpreting AUKAI as an embedded world model, our approach simultaneously predicts state transitions and evaluates intervention utility. The framework is underpinned by rigorous theoretical analysis drawn from convergence theory, optimal control, and Bayesian inference, which collectively establish conditions for convergence, stability, and near‑optimal performance. Furthermore, we present a hybrid implementation that combines the strengths of neural networks with symbolic reasoning modules, thereby enhancing interpretability and robustness. Finally, we demonstrate the potential of AUKAI through a detailed application in robotic navigation and obstacle avoidance, and we outline comprehensive experimental plans to validate its effectiveness in both simulated and real‑world environments.

Abstract:
Brain‑inspired spiking neural networks (SNNs) have garnered significant research attention in algorithm design and perception applications. However, their potential in the decision‑making domain, particularly in model‑based reinforcement learning, remains underexplored. The difficulty lies in the need for spiking neurons with long‑term temporal memory capabilities, as well as network optimization that can integrate and learn information for accurate predictions. The dynamic dendritic information integration mechanism of biological neurons brings us valuable insights for addressing these challenges. In this study, we propose a multi‑compartment neuron model capable of nonlinearly integrating information from multiple dendritic sources to dynamically process long sequential inputs. Based on this model, we construct a Spiking World Model (Spiking‑WM), to enable model‑based deep reinforcement learning (DRL) with SNNs. We evaluated our model using the DeepMind Control Suite, demonstrating that Spiking‑WM outperforms existing SNN‑based models and achieves performance comparable to artificial neural network (ANN)‑based world models employing Gated Recurrent Units (GRUs). Furthermore, we assess the long‑term memory capabilities of the proposed model in speech datasets, including SHD, TIMIT, and LibriSpeech 100h, showing that our multi‑compartment neuron model surpasses other SNN‑based architectures in processing long sequences. Our findings underscore the critical role of dendritic information integration in shaping neuronal function, emphasizing the importance of cooperative dendritic processing in enhancing neural computation.

Abstract:
Humanoid robots are engineered to navigate terrains akin to those encountered by humans, which necessitates human‑like locomotion and perceptual abilities. Currently, the most reliable controllers for humanoid motion rely exclusively on proprioception, a reliance that becomes both dangerous and unreliable when coping with rugged terrain. Although the integration of height maps into perception can enable proactive gait planning, robust utilization of this information remains a significant challenge, especially when exteroceptive perception is noisy. To surmount these challenges, we propose a solution based on a teacher‑student distillation framework. In this paradigm, an oracle policy accesses noise‑free data to establish an optimal reference policy, while the student policy not only imitates the teacher's actions but also simultaneously trains a world model with a variational information bottleneck for sensor denoising and state estimation. Extensive evaluations demonstrate that our approach markedly enhances performance in scenarios characterized by unreliable terrain estimations. Moreover, in rigorous testing across both challenging urban settings and off‑road environments, the model successfully traversed 2 km of varied terrain without external intervention.

Abstract:
In reinforcement learning (RL), world models serve as internal simulators, enabling agents to predict environment dynamics and future outcomes in order to make informed decisions. While previous approaches leveraging discrete latent spaces, such as DreamerV3, have demonstrated strong performance in discrete action settings and visual control tasks, their comparative performance in state‑based continuous control remains underexplored. In contrast, methods with continuous latent spaces, such as TD‑MPC2, have shown notable success in state‑based continuous control benchmarks. In this paper, we demonstrate that modeling discrete latent states has benefits over continuous latent states and that discrete codebook encodings are more effective representations for continuous control, compared to alternative encodings, such as one‑hot and label‑based encodings. Based on these insights, we introduce DCWM: Discrete Codebook World Model, a self‑supervised world model with a discrete and stochastic latent space, where latent states are codes from a codebook. We combine DCWM with decision‑time planning to get our model‑based RL algorithm, named DC‑MPC: Discrete Codebook Model Predictive Control, which performs competitively against recent state‑of‑the‑art algorithms, including TD‑MPC2 and DreamerV3, on continuous control benchmarks. See our project website www.aidanscannell.com/dcmpc.

Abstract:
Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high‑level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW‑Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision‑making in RL agents.

Abstract:
Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision‑making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application‑driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: By incorporating instruction‑following and physics‑adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, which are issues overlooked by prior benchmarks. (2) Alignment with large‑scale human preferences: We crowd‑source 67K human labels to accurately measure 14 frontier models. Using our high‑quality human labels, we further fine‑tune an accurate judger with 2B parameters to automate the evaluation procedure, achieving 8.6% higher average accuracy in predicting world modeling violations than GPT‑4o. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves the world modeling capability. The website is available at https://worldmodelbench‑team.github.io.

Abstract:
Reinforcement learning (RL) is a powerful approach for robot learning. However, model‑free RL (MFRL) requires a large number of environment interactions to learn successful control policies. This is due to the noisy RL training updates and the complexity of robotic systems, which typically involve highly non‑linear dynamics and noisy sensor signals. In contrast, model‑based RL (MBRL) not only trains a policy but simultaneously learns a world model that captures the environment's dynamics and rewards. The world model can either be used for planning, for data collection, or to provide first‑order policy gradients for training. Leveraging a world model significantly improves sample efficiency compared to model‑free RL. However, training a world model alongside the policy increases the computational complexity, leading to longer training times that are often intractable for complex real‑world scenarios. In this work, we propose a new method for accelerating model‑based RL using state‑space world models. Our approach leverages state‑space models (SSMs) to parallelize the training of the dynamics model, which is typically the main computational bottleneck. Additionally, we propose an architecture that provides privileged information to the world model during training, which is particularly relevant for partially observable environments. We evaluate our method in several real‑world agile quadrotor flight tasks, involving complex dynamics, for both fully and partially observable environments. We demonstrate a significant speedup, reducing the world model training time by up to 10 times, and the overall MBRL training time by up to 4 times. This benefit comes without compromising performance, as our method achieves similar sample efficiency and task rewards to state‑of‑the‑art MBRL methods.

Abstract:
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline‑to‑online RL by leveraging abundant non‑curated data that is reward‑free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine‑tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine‑tuning. To address this issue and effectively use the offline data, we propose two essential techniques: (i) experience rehearsal and (ii) execution guidance. With these modifications, the non‑curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning‑from‑scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

Abstract:
As robots are increasingly deployed in diverse application domains, enabling robust mobility across different embodiments has become a critical challenge. Classical mobility stacks, though effective on specific platforms, require extensive per‑robot tuning and do not scale easily to new embodiments. Learning‑based approaches, such as imitation learning (IL), offer alternatives, but are limited by the need for high‑quality demonstrations for each embodiment. To address these challenges, we introduce COMPASS, a unified framework that enables scalable cross‑embodiment mobility using expert demonstrations from only a single embodiment. We first pre‑train a mobility policy on a single robot using IL, combining a world model with a policy model. We then apply residual reinforcement learning (RL) to efficiently adapt this policy to diverse embodiments through corrective refinements. Finally, we distill specialist policies into a single generalist policy conditioned on an embodiment embedding vector. This design significantly reduces the burden of collecting data while enabling robust generalization across a wide range of robot designs. Our experiments demonstrate that COMPASS scales effectively across diverse robot platforms while maintaining adaptability to various environment configurations. The generalist policy achieves a success rate approximately 5X higher than the pre‑trained IL policy on unseen embodiments, and we further demonstrate zero‑shot sim‑to‑real transfer.

Abstract:
Humanoid robots are designed to navigate environments accessible to humans using their legs. However, classical research has primarily focused on controlled laboratory settings, resulting in a gap in developing controllers for navigating complex real‑world terrains. This challenge mainly arises from the limitations and noise in sensor data, which hinder the robot's understanding of itself and the environment. In this study, we introduce World Model Reconstruction (WMR), an end‑to‑end learning‑based approach for blind humanoid locomotion across challenging terrains. We propose training an estimator to explicitly reconstruct the world state and utilize it to enhance the locomotion policy. The locomotion policy takes inputs entirely from the reconstructed information. The policy and the estimator are trained jointly; however, the gradient between them is intentionally cut off. This ensures that the estimator focuses solely on world reconstruction, independent of the locomotion policy's updates. We evaluated our model on rough, deformable, and slippery surfaces in real‑world scenarios, demonstrating robust adaptability and resistance to interference. The robot successfully completed a 3.2 km hike without any human assistance, mastering terrains covered with ice and snow.

Abstract:
The leading AI companies are increasingly focused on building generalist AI agents ‑‑ systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self‑preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency‑driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non‑agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question‑answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non‑agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

Abstract:
Autonomous artificial agents must be able to learn behaviors in complex environments without humans to design tasks and rewards. Designing these functions for each environment is not feasible, which motivates the development of intrinsic reward functions. In this paper, we propose using several cognitive elements that have been neglected for a long time to build an internal world model for an intrinsically motivated agent. Our agent performs satisfactory iterations with the environment, learning complex behaviors without needing previously designed reward functions. We used 18 Atari games to evaluate what cognitive skills emerge in games that require reactive and deliberative behaviors. Our results show superior performance compared to the state‑of‑the‑art in many test cases with dense and sparse rewards.

Abstract:
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high‑fidelity video sequences with advanced diffusion‑based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we address this problem by combining the generation loss with MAE‑style feature‑level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion‑related mask tokens that deal with the fuzzy relations between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial‑temporal domain, utilizing a row‑wise mask for shifted self‑attention rather than the masked self‑attention used in MAE. We then adopt a row‑wise cross‑view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM‑long, focusing on long‑horizon prediction, and MaskGWM‑mview, dedicated to multi‑view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long‑horizon rollout on the OpenDV‑2K dataset, and zero‑shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state‑of‑the‑art driving world models.

Abstract:
Differentiable simulators represent an environment's dynamics as a differentiable function. Within robotics and autonomous driving, this property is used in Analytic Policy Gradients (APG), which relies on backpropagating through the dynamics to train accurate policies for diverse tasks. Here we show that differentiable simulation also has an important role in world modeling, where it can impart predictive, prescriptive, and counterfactual capabilities to an agent. Specifically, we design three novel task setups in which the differentiable dynamics are combined within an end‑to‑end computation graph not with a policy, but a state predictor. This allows us to learn relative odometry, optimal planners, and optimal inverse states. We collectively call these predictors Analytic World Models (AWMs) and demonstrate how differentiable simulation enables their efficient, end‑to‑end learning. In autonomous driving scenarios, they have broad applicability and can augment an agent's decision‑making beyond reactive control.

Abstract:
Visual reinforcement learning agents typically face serious performance declines in real‑world applications caused by visual distractions. Existing methods rely on fine‑tuning the policy's representations with hand‑crafted augmentations. In this work, we propose Self‑Consistent Model‑based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug‑and‑play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre‑trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.

Abstract:
Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we present the first theoretical results for this problem, showing that in a multi‑task setting, models with a low‑degree bias provably recover latent data‑generating variables under mild assumptions‑‑even if proxy tasks involve complex, non‑linear functions of the latents. However, such recovery is sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier‑Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self‑supervised learning, out‑of‑distribution generalization, and the linear representation hypothesis in large language models.

Abstract:
Imagination in world models is crucial for enabling agents to learn long‑horizon policies in a sample‑efficient manner. Existing recurrent state‑space model (RSSM)‑based world models depend on single‑step statistical inference to capture the environment dynamics and are therefore unable to perform long‑term imagination tasks due to the accumulation of prediction errors. Inspired by the dual‑process theory of human cognition, we propose a novel dual‑mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM‑based System 1 (RSSM‑S1) component that handles state transitions in an intuitive manner and a logic‑integrated neural network‑based System 2 (LINN‑S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter‑system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long‑term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long‑term imagination over the state‑of‑the‑art world models.

Abstract:
Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi‑supervised vision‑centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two‑stage training paradigm: the self‑supervised pre‑training stage and the fully‑supervised fine‑tuning stage. Specifically, during the pre‑training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state‑conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.

Abstract:
Completing Long‑Horizon (LH) tasks in open‑ended worlds is an important yet difficult problem for embodied agents. Existing approaches suffer from two key challenges: (1) they heavily rely on experiences obtained from human‑created data or curricula, failing to autonomously update and select multimodal experiences, and (2) they may encounter catastrophic forgetting issues when faced with new tasks, failing to autonomously update world knowledge. To solve these challenges, this paper presents EvolvingAgent, a curriculum self‑evolving agent with a continual World Model (WM), which can autonomously complete various LH tasks across environments through self‑planning, self‑control, and self‑reflection, without human intervention. Specifically, EvolvingAgent contains three modules: i) the experience‑driven task planner, which uses an LLM along with multimodal experiences to convert LH tasks into executable sub‑tasks; ii) the WM‑guided action controller, which leverages the WM to generate low‑level actions and incorporates a self‑verification mechanism to update multimodal experiences; iii) the Curriculum Learning (CL) based reflector, which implements a two‑stage CL algorithm to select multimodal experiences for task‑adaptive WM updates. By building a planner‑controller‑reflector closed‑loop dynamic, the continual WM for EvolvingAgent can autonomously update multimodal experiences and world knowledge. In extensive experiments on Minecraft, EvolvingAgent improves the average success rate by 111.74% compared with existing methods, reduces ineffective actions by more than 6x, and generalizes to the Atari environment with human‑level performance.

Abstract:
Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic search algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test‑time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high‑quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best‑of‑N sampling approach to improve the quality of the initial solution and then refines the solution in a fine‑grained manner with verbalized machine learning. Our method outperforms o1‑mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state‑of‑the‑art methods on almost all competition‑level planning tasks.
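
The Best‑of‑N step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the generator, the keyword‑counting score, and the draft domains below are hypothetical stand‑ins chosen only to show the selection logic (draw N candidates, keep the one with the highest score).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int) -> T:
    """Draw n candidates from `generate` and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-in for an LLM sampler: replays three fixed drafts.
drafts = iter([
    "(define (domain d))",
    "(define (domain d) (:predicates (on ?x ?y)))",
    "(define (domain d) (:predicates (on ?x ?y)) (:action move))",
])


def toy_generate() -> str:
    return next(drafts)


def toy_score(domain: str) -> float:
    # Hypothetical score: count how many expected PDDL sections are present.
    return float(sum(key in domain for key in ("(:predicates", "(:action")))


best_domain = best_of_n(toy_generate, toy_score, n=3)
```

In practice the score would come from a validator or planner feedback; the abstract does not specify it, so the keyword count here is purely illustrative.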

Abstract:
We investigate the Free Energy Principle as a foundation for measuring risk in agentic and multi‑agent systems. From these principles we introduce a Cumulative Risk Exposure metric that is flexible to differing contexts and needs. We contrast this to other popular theories for safe AI that hinge on massive amounts of data or describing arbitrarily complex world models. In our framework, stakeholders need only specify their preferences over system outcomes, providing straightforward and transparent decision rules for risk governance and mitigation. This framework naturally accounts for uncertainty in both world model and preference model, allowing for decision‑making that is epistemically and axiologically humble, parsimonious, and future‑proof. We demonstrate this novel approach in a simplified autonomous vehicle environment with multi‑agent vehicles whose driving policies are mediated by gatekeepers that evaluate, in an online fashion, the risk to the collective safety in their neighborhood, and intervene through each vehicle's policy when appropriate. We show that the introduction of gatekeepers in an AV fleet, even at low penetration, can generate significant positive externalities in terms of increased system safety.

Abstract:
Imitation learning and world models have shown significant promise in advancing generalizable robotic learning, with robotic grasping remaining a critical challenge for achieving precise manipulation. Existing methods often rely heavily on robot arm state data and RGB images, leading to overfitting to specific object shapes or positions. To address these limitations, we propose RoboGrasp, a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning. By leveraging robust visual guidance from object detection and segmentation tasks, RoboGrasp significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few‑shot learning and grasping box prompt tasks. Built on diffusion‑based methods, RoboGrasp is adaptable to various robotic learning paradigms, enabling precise and reliable manipulation across diverse and complex scenarios. This framework represents a scalable and versatile solution for tackling real‑world challenges in robotic grasping.

Abstract:
We present three improvements to the standard model‑based RL paradigm based on transformers: (a) "Dyna with warmup", which trains the policy on real and imaginary data, but only starts using imaginary data after the world model has been sufficiently trained; (b) "nearest neighbor tokenizer" for image patches, which improves upon previous tokenization schemes, which are needed when using a transformer world model (TWM), by ensuring the code words are static after creation, thus providing a constant target for TWM learning; and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep, instead of generating them sequentially. We then show that our method significantly improves upon prior methods in various environments. We mostly focus on the challenging Craftax‑classic benchmark, where our method achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and exceeding human performance of 65.0% for the first time. We also show preliminary results on Craftax‑full, MinAtar, and three different two‑player games, to illustrate the generality of the approach.
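
The "Dyna with warmup" idea in (a) can be sketched as a simple data‑source schedule. This is an assumption‑laden illustration, not the paper's code: the function name, `warmup_steps`, and `imagined_fraction` are hypothetical parameters introduced here.

```python
import random


def choose_data_source(env_step: int, warmup_steps: int,
                       imagined_fraction: float, rng: random.Random) -> str:
    """Return "real" or "imagined" for the next policy-training batch.

    Before `warmup_steps` environment steps, only real data is used, so the
    policy never trains on rollouts from an under-trained world model.
    Afterwards, imagined data is mixed in with probability `imagined_fraction`.
    """
    if env_step < warmup_steps:
        return "real"
    return "imagined" if rng.random() < imagined_fraction else "real"


rng = random.Random(0)
during_warmup = {choose_data_source(t, 1000, 0.5, rng) for t in range(1000)}
after_warmup = {choose_data_source(t, 1000, 1.0, rng) for t in range(1000, 1100)}
```

During warmup every batch is real; after warmup with `imagined_fraction=1.0` every batch is imagined, and intermediate values give a mixture.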

Abstract:
Hamilton‑Jacobi (HJ) reachability is a rigorous mathematical framework that enables robots to simultaneously detect unsafe states and generate actions that prevent future failures. While in theory, HJ reachability can synthesize safe controllers for nonlinear systems and nonconvex constraints, in practice, it has been limited to hand‑engineered collision‑avoidance constraints modeled via low‑dimensional state‑space representations and first‑principles dynamics. In this work, our goal is to generalize safe robot controllers to prevent failures that are hard‑‑if not impossible‑‑to write down by hand, but can be intuitively identified from high‑dimensional observations: for example, spilling the contents of a bag. We propose Latent Safety Filters, a latent‑space generalization of HJ reachability that tractably operates directly on raw observation data (e.g., RGB images) to automatically compute safety‑preserving actions without explicit recovery demonstrations by performing safety analysis in the latent embedding space of a generative world model. Our method leverages diverse robot observation‑action data of varying quality (including successes, random exploration, and unsafe demonstrations) to learn a world model. Constraint specification is then transformed into a classification problem in the latent space of the learned world model. In simulation and hardware experiments, we compute an approximation of Latent Safety Filters to safeguard arbitrary policies (from imitation‑learned policies to direct teleoperation) from complex safety hazards, like preventing a Franka Research 3 manipulator from spilling the contents of a bag or toppling cluttered objects.

Abstract:
Text‑to‑3D asset generation has achieved significant optimization under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: 1) failure to ensure that composite scene layouts comply with physical laws; 2) difficulty in accurately capturing the assets and relationships described in complex scene descriptions; 3) limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To avoid these compromises, we propose a novel framework for compositional scene generation, PhiP‑G, which seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM‑based agents, PhiP‑G analyzes the complex scene description to generate a scene graph, and integrates a multimodal 2D generation agent and a 3D Gaussian generation method for targeted asset creation. For the layout stage, PhiP‑G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP‑G significantly enhances the generation quality and physical rationality of the compositional scenes. Notably, PhiP‑G attains state‑of‑the‑art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by the T^3Bench, and improves efficiency by 24x.

Abstract:
Large language models are increasingly customized through fine‑tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts. Tracking model origins is crucial both for protecting intellectual property and for identifying derived models when biases or vulnerabilities are discovered in foundation models. We address this challenge by developing a framework for testing model provenance: Whether one model is derived from another. Our approach is based on the key observation that real‑world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black‑box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real‑world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90‑95% precision and 80‑90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.

Abstract:
We present Generative Predictive Control (GPC), an inference‑time method for improving pretrained behavior‑cloning policies without retraining. GPC augments a frozen diffusion policy at deployment with an action‑conditioned world model trained on expert demonstrations and random exploration rollouts. The world model predicts the consequences of action proposals generated by the diffusion policy and enables lightweight online planning that ranks and refines these proposals through model‑based look‑ahead. By combining a generative prior with predictive foresight, GPC enables test‑time adaptation while keeping the original policy fixed. Across diverse robotic manipulation tasks, including state‑ and vision‑based settings in both simulation and real‑world experiments, GPC consistently outperforms standard behavior cloning and compares favorably with other inference‑time adaptation baselines.
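
The ranking step described above, scoring action proposals by their predicted consequences under a world model, can be sketched as follows. This is a toy illustration under stated assumptions rather than GPC itself: the one‑dimensional state, the additive `toy_world_model`, and the goal‑distance `toy_score` are hypothetical, and the refinement step is omitted.

```python
from typing import Callable, List, Sequence


def rollout(state: float, plan: Sequence[float],
            world_model: Callable[[float, float], float]) -> float:
    """Apply each action in `plan` through the world model; return the final state."""
    for action in plan:
        state = world_model(state, action)
    return state


def rank_proposals(state: float, proposals: List[Sequence[float]],
                   world_model: Callable[[float, float], float],
                   score: Callable[[float], float]) -> List[Sequence[float]]:
    """Sort proposals from best to worst by the score of their predicted outcome."""
    return sorted(proposals,
                  key=lambda plan: score(rollout(state, plan, world_model)),
                  reverse=True)


GOAL = 4.0


def toy_world_model(state: float, action: float) -> float:
    # Hypothetical dynamics: the action is added to the state.
    return state + action


def toy_score(state: float) -> float:
    # Hypothetical objective: closer to GOAL is better.
    return -abs(state - GOAL)


# Stand-ins for action sequences sampled from a frozen policy.
proposals = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
ranked = rank_proposals(0.0, proposals, toy_world_model, toy_score)
best_plan = ranked[0]
```

The policy that produces the proposals is never updated here; only the choice among its proposals changes, which matches the abstract's description of keeping the original policy fixed.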

Abstract:
World models represent a promising approach for training reinforcement learning agents with significantly improved sample efficiency. While most world model methods primarily rely on sequences of discrete latent variables to model environment dynamics, this compression often neglects critical visual details essential for reinforcement learning. Recent diffusion‑based world models condition generation on a fixed context length of frames to predict the next observation, using separate recurrent neural networks to model rewards and termination signals. Although this architecture effectively enhances visual fidelity, the fixed context length inherently limits memory capacity. In this paper, we introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, the memory‑demanding Crafter benchmark, and 3D first‑person ViZDoom environments, demonstrating superior performance in all these diverse challenges.

Abstract:
Guesstimation ‑‑ the task of making approximate quantitative estimates about objects or events ‑‑ is a common real‑world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC), where the median of multiple estimates improves accuracy, we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy decoding, self‑consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position guesstimation as a useful probe of LLM world knowledge and highlight WOC decoding as a strategy for enhancing LLM guesstimation performance on real‑world tasks.
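
Median aggregation across sampled responses, as described above, can be sketched in a few lines. The parsing rule (take the first number in each response) and the sample responses are assumptions made for illustration; the abstract does not say how estimates are extracted.

```python
import re
from statistics import mean, median
from typing import List, Optional


def parse_estimate(response: str) -> Optional[float]:
    """Return the first number found in a response, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(match.group()) if match else None


def numeric_estimates(responses: List[str]) -> List[float]:
    """Keep only the responses that contain a parsable number."""
    parsed = [parse_estimate(r) for r in responses]
    return [value for value in parsed if value is not None]


def woc_decode(responses: List[str]) -> float:
    """Wisdom-of-Crowds style decoding: the median of the sampled estimates."""
    return median(numeric_estimates(responses))


def mean_decode(responses: List[str]) -> float:
    """Baseline for comparison: the mean of the sampled estimates."""
    return mean(numeric_estimates(responses))


# Hypothetical sampled responses to "how many marbles fit in a cup?"
samples = ["about 120", "150 marbles", "I'd guess 135", "1,000", "no idea"]
woc_answer = woc_decode(samples)    # median of 120, 135, 150, 1000
mean_answer = mean_decode(samples)  # pulled upward by the 1000 outlier
```

The toy samples show why the median is attractive here: a single extreme response shifts the mean substantially but barely moves the median.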

Abstract:
While deep reinforcement learning (RL) from pixels has achieved remarkable success, its sample inefficiency remains a critical limitation for real‑world applications. Model‑based RL (MBRL) addresses this by learning a world model to generate simulated experience, but standard approaches that rely on pixel‑level reconstruction losses often fail to capture small, task‑critical objects in complex, dynamic scenes. We posit that an object‑centric (OC) representation can direct model capacity toward semantically meaningful entities, improving dynamics prediction and sample efficiency. In this work, we introduce OC‑STORM, an object‑centric MBRL framework that enhances a learned world model with object representations extracted by a pretrained segmentation network. By conditioning on a minimal number of annotated frames, OC‑STORM learns to track decision‑relevant object dynamics and inter‑object interactions without extensive labeling or access to privileged information. Empirical results demonstrate that OC‑STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state‑of‑the‑art sample efficiency on challenging boss fights in the visually complex game Hollow Knight. Our findings underscore the potential of integrating OC priors into MBRL for complex visual domains. Project page: https://oc‑storm.weipuzhang.com

Abstract:
Learning efficient representations for decision‑making policies is a challenge in imitation learning (IL). Current IL methods require expert demonstrations, which are expensive to collect. Additionally, they are not explicitly trained to understand the environment. Consequently, they have underdeveloped world models. Self‑supervised learning (SSL) offers an alternative, as it can learn a world model from diverse, unlabeled data. However, most SSL methods are inefficient because they operate in raw input space. In this work, we propose ACT‑JEPA, a novel architecture that unifies IL and SSL to enhance policy representations. It is trained end‑to‑end to jointly predict 1) action sequences and 2) latent observation sequences. To learn in latent space, we utilize Joint‑Embedding Predictive Architecture, which allows the model to filter out irrelevant details and learn a robust world model. We evaluate ACT‑JEPA in different environments and across multiple tasks. Our results show that it outperforms the strongest baseline in all environments. ACT‑JEPA achieves up to 40% improvement in world model understanding and up to 10% higher task success rate. Finally, we show that predicting latent observation sequences effectively generalizes to predicting action sequences. This work demonstrates how integrating IL and SSL leads to efficient policy representation learning, an improved world model, and a higher task success rate.

Abstract:
Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data, such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and generate compositional future simulations. Our approach leverages a novel Recurrent Block‑Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi‑future‑frame prediction objective to capture disentangled representations of both dynamic and static concepts more effectively. In experiments, we demonstrate that our model outperforms current state‑of‑the‑art baselines for world modeling when evaluated under the DCI framework across multiple datasets. Furthermore, we show how the modularized concept representations of our model enable compositional imagination, allowing the generation of novel videos by recombining attributes from previously seen objects. Project page: cun‑bjy.github.io/dreamweaver‑website

Abstract:
World model based reinforcement learning (RL) has emerged as a promising approach for autonomous driving, which learns a latent dynamics model and uses it to train a planning policy. To speed up the learning process, the pretrain‑finetune paradigm is often used, where online RL is initialized by a pretrained model and a policy learned offline. However, naively performing such initialization in RL may result in dramatic performance degradation during the online interactions in the new task. To tackle this challenge, we first analyze the performance degradation and identify two primary root causes therein: the mismatch of the planning policy and the mismatch of the dynamics model, due to distribution shift. We further analyze the effects of these factors on performance degradation during finetuning, and our findings reveal that the choice of finetuning strategies plays a pivotal role in mitigating these effects. We then introduce AdaWM, an Adaptive World Model based planning method, featuring two key steps: (a) mismatch identification, which quantifies the mismatches and informs the finetuning strategy, and (b) alignment‑driven finetuning, which selectively updates either the policy or the model as needed using efficient low‑rank updates. Extensive experiments on the challenging CARLA driving tasks demonstrate that AdaWM significantly improves the finetuning process, resulting in more robust and efficient performance in autonomous driving systems.

Abstract:
Mimicking the real interaction trajectory in the inference of the world model has been shown to improve the sample efficiency of model‑based reinforcement learning (MBRL) algorithms. Many methods directly use known state sequences for reasoning. However, this approach fails to enhance the quality of reasoning by capturing the subtle variation between states. Much like how humans infer trends in event development from this variation, in this work, we introduce Global‑Local variation Awareness Mamba‑based world model (GLAM) that improves reasoning quality by perceiving and predicting variation between states. GLAM comprises two Mamba‑based parallel reasoning modules, GMamba and LMamba, which focus on perceiving variation from global and local perspectives, respectively, during the reasoning process. GMamba focuses on identifying patterns of variation between states in the input sequence and leverages these patterns to enhance the prediction of future state variation. LMamba emphasizes reasoning about unknown information, such as rewards, termination signals, and visual representations, by perceiving variation in adjacent states. By integrating the strengths of the two modules, GLAM accounts for higher‑value variation in environmental changes, providing the agent with more efficient imagination‑based training. We demonstrate that our method outperforms existing methods in normalized human scores on the Atari 100k benchmark.

Abstract:
The rapid growth of artificial intelligence in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time‑consuming and costly, making it impractical for modern systems that continuously generate data. This study addresses this challenge by exploring semi‑supervised auto‑labeling methods, integrating self and active learning approaches to develop an efficient, label‑scarce framework for auto‑labeling large poultry datasets (ALPD). For this study, video data were collected from housed broilers and laying hens. Various machine learning models, including zero‑shot models and supervised models, were used for broiler and hen detection. The results showed that YOLOv8s‑World and YOLOv9s performed better on the performance metrics for broiler and hen detection under supervised learning, while among the semi‑supervised models, YOLOv8s‑ALPD achieved the highest precision (96.1%) and recall (99%) with an RMSE of 1.87. The hybrid YOLO‑World model, incorporating the optimal YOLOv8s backbone with zero‑shot models, demonstrated the highest overall performance. It achieved a precision of 99.2%, recall of 99.4%, and an F1 score of 98.7% for detection. In addition, the semi‑supervised models with minimal human intervention (active learning) reduced annotation time by over 80% compared to full manual labeling. Moreover, integrating zero‑shot models with the best models enhanced broiler and hen detection, achieving comparable results to supervised models while significantly increasing speed. In conclusion, integrating semi‑supervised auto‑labeling and zero‑shot models significantly improves detection accuracy. It reduces manual annotation efforts, offering a promising solution to optimize AI‑driven systems in poultry farming, advancing precision livestock management, and promoting more sustainable practices.

Abstract:
Humans learn about the world, and how to act in the world, in many ways: from individually conducting experiments to observing and reproducing others' behavior. Different learning strategies come with different costs and likelihoods of successfully learning more about the world. The choice that any one individual makes of how to learn can have an impact on the collective understanding of a whole population if people learn from each other. Alan Rogers developed simulations of a population of agents to study these network phenomena where agents could individually or socially learn amidst a dynamic, uncertain world and uncovered a confusing result: the availability of cheap social learning yielded no benefit to population fitness over individual learning. This paradox spawned decades of work trying to understand and uncover factors that foster the relative benefit of social learning that centuries of human behavior suggest exists. What happens in such network models now that humans can socially learn from AI systems that are themselves socially learning from us? We revisit Rogers' Paradox in the context of human‑AI interaction to probe a simplified network of humans and AI systems learning together about an uncertain world. We propose and examine the impact of several learning strategies on the quality of the equilibrium of a society's 'collective world model'. We consider strategies that can be undertaken by various stakeholders involved in a single human‑AI interaction: human, AI model builder, and society or regulators around the interaction. We then consider possible negative feedback loops that may arise from humans learning socially from AI: that learning from the AI may impact our own ability to learn about the world. We close with open directions into studying networks of human and AI systems that can be explored in enriched versions of our simulation framework.

Abstract:
In recent years, Model‑based Multi‑Agent Reinforcement Learning (MARL) has demonstrated significant advantages over model‑free methods in terms of sample efficiency by using independent environment dynamics world models for data sample augmentation. However, without considering the limited sample size, these methods still lag behind model‑free methods in terms of final convergence performance and stability. This is primarily due to the world model's insufficient and unstable representation of global states in partially observable environments. This limitation hampers the ability to ensure global consistency in the data samples and results in a time‑varying and unstable distribution mismatch between the pseudo data samples generated by the world model and the real samples. This issue becomes particularly pronounced in more complex multi‑agent environments. To address this challenge, we propose a model‑based MARL method called GAWM, which enhances the centralized world model's ability to achieve globally unified and accurate representation of state information while adhering to the CTDE paradigm. GAWM uniquely leverages an additional Transformer architecture to fuse local observation information from different agents, thereby improving its ability to extract and represent global state information. This enhancement not only improves sample efficiency but also enhances training stability, leading to superior convergence performance, particularly in complex and challenging multi‑agent environments. This advancement enables model‑based methods to be effectively applied to more complex multi‑agent environments. Experimental results demonstrate that GAWM outperforms various model‑free and model‑based approaches, achieving exceptional performance in the challenging domains of SMAC.

Abstract:
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real‑world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual‑autoregressive mechanism and self‑supervised training to achieve reliable long‑horizon predictions without relying on domain‑specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real‑world systems. This work advances model‑based reinforcement learning by addressing the challenges of long‑horizon prediction, error accumulation, and sim‑to‑real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real‑world applications.

Abstract:
Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others. Although this capability is crucial for human interaction, testing on Large Language Models (LLMs) reveals that they possess only a rudimentary understanding of it. Although the most capable closed‑source LLMs have come close to human performance on some ToM tasks, they still perform poorly on complex variations of the task that involve more structured reasoning. In this work, we utilize the concept of "pretend‑play", or "Simulation Theory", from cognitive psychology to propose "Decompose‑ToM": an LLM‑based inference algorithm that improves model performance on complex ToM tasks. We recursively simulate user perspectives and decompose the ToM task into a set of simpler functions: subject identification, question‑reframing, world model updation, and knowledge availability. We test the algorithm on higher‑order ToM tasks and a task testing for ToM capabilities in a conversational setting, demonstrating that our approach shows significant improvement across models compared to baseline methods while requiring minimal prompt tuning across tasks and no additional model training.

Abstract:
While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions‑‑crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings‑‑marketplace interactions, restaurant recommendations, and online course advising‑‑using both online (PPO) and offline (DPO) fine‑tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post‑hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single‑task fine‑tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at https://rl‑hindsight.github.io.

Abstract:
Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and personalization. Powered by modern AI technologies such as multimodal large language models and world models, Embodied AI (EmAI) represents a transformative frontier, offering enhanced autonomy and the ability to interact with the physical world to address these challenges. As an interdisciplinary and rapidly evolving research domain, "EmAI in healthcare" spans diverse fields such as algorithms, robotics, and biomedicine. This complexity underscores the importance of timely reviews and analyses to track advancements, address challenges, and foster cross‑disciplinary collaboration. In this paper, we provide a comprehensive overview of the "brain" of EmAI for healthcare, wherein we introduce foundational AI algorithms for perception, actuation, planning, and memory, and focus on presenting the healthcare applications spanning clinical interventions, daily care & companionship, infrastructure support, and biomedical research. Despite its promise, the development of EmAI for healthcare is hindered by critical challenges such as safety concerns, gaps between simulation platforms and real‑world applications, the absence of standardized benchmarks, and uneven progress across interdisciplinary domains. We discuss the technical barriers and explore ethical considerations, offering a forward‑looking perspective on the future of EmAI in healthcare. A hierarchical framework of intelligent levels for EmAI systems is also introduced to guide further development. By providing systematic insights, this work aims to inspire innovation and practical applications, paving the way for a new era of intelligent, patient‑centered healthcare.

Abstract:
Efficient control in long‑horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model‑based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long‑horizon environments. To address these limitations, we propose the Recognize‑Sense‑Plan‑Act (RSPA) pipeline for long‑horizon tasks and further introduce RoboHorizon, an LLM‑assisted multi‑view world model tailored for long‑horizon robotic manipulation. In RoboHorizon, pre‑trained LLMs generate dense reward structures for multi‑stage sub‑tasks based on task language instructions, enabling robots to better recognize long‑horizon tasks. Keyframe discovery is then integrated into the multi‑view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi‑stage perception of long‑horizon processes. Leveraging these dense rewards and multi‑view representations, a robotic world model is constructed to efficiently plan long‑horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state‑of‑the‑art visual model‑based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short‑horizon tasks and a 29.23% improvement on 6 long‑horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.

Abstract:
Wheel loaders in mines and construction sites repeatedly load soil from a pile to load receivers. Automating this task presents a challenging planning problem since each loading's performance depends on the pile state, which depends on previous loadings. We investigate an end‑to‑end optimization approach considering future loading outcomes and transportation costs between the pile and load receivers. To predict the evolution of the pile state and the loading performance, we use world models that leverage deep neural networks trained on numerous simulated loading cycles. A look‑ahead tree search optimizes the sequence of loading actions by evaluating the performance of thousands of action candidates, which expand into subsequent action candidates under the predicted pile states recursively. Test results demonstrate that, over a horizon of 15 sequential loadings, the look‑ahead tree search is 6% more efficient than a greedy strategy, which always selects the action that maximizes the current single loading performance, and 14% more efficient than using a fixed loading controller optimized for the nominal case.
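
Illustrative sketch (not from the paper): the contrast between a greedy strategy and a look-ahead tree search can be shown with a small exhaustive search over a deterministic model. The `toy_model` function, the scalar pile state, and the two-action set are assumptions made up for this example; the paper's world models are deep networks trained on simulated loading cycles, and its search evaluates thousands of candidates.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical world-model signature: (pile_state, action) -> (next_pile_state, performance)
WorldModel = Callable[[float, int], Tuple[float, float]]


def lookahead_value(model: WorldModel, state: float, actions: Sequence[int], depth: int) -> float:
    """Best total predicted performance over the next `depth` loadings."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        next_state, reward = model(state, a)
        best = max(best, reward + lookahead_value(model, next_state, actions, depth - 1))
    return best


def plan_first_action(model: WorldModel, state: float, actions: Sequence[int], depth: int) -> int:
    """Pick the action whose subtree has the highest predicted total performance."""
    def score(a: int) -> float:
        next_state, reward = model(state, a)
        return reward + lookahead_value(model, next_state, actions, depth - 1)
    return max(actions, key=score)


def rollout(model: WorldModel, state: float, actions: Sequence[int], horizon: int, depth: int) -> float:
    """Execute `horizon` loadings, replanning before each one; depth=1 is the greedy strategy."""
    total = 0.0
    for step in range(horizon):
        a = plan_first_action(model, state, actions, min(depth, horizon - step))
        state, reward = model(state, a)
        total += reward
    return total


def toy_model(state: float, action: int) -> Tuple[float, float]:
    """Toy stand-in for a learned world model (an assumption for illustration).

    Action 1 yields more now but degrades the pile; action 0 yields less and improves it.
    """
    if action == 1:
        return state - 1.0, 2.0 + state
    return state + 1.0, 1.0


greedy_total = rollout(toy_model, 0.0, (0, 1), horizon=4, depth=1)
planned_total = rollout(toy_model, 0.0, (0, 1), horizon=4, depth=4)
```

In this toy setting the greedy rollout collects 6.0 while the depth-4 search collects 9.0, because the search accepts two low-yield loadings that improve the pile before taking the high-yield ones. The numbers say nothing about the 6% and 14% figures reported above.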

Abstract:
Assistive mobile robots are a transformative technology that helps persons with disabilities regain the ability to move freely. Although autonomous wheelchairs significantly reduce user effort, they still require human input to allow users to maintain control and adapt to changing environments. Brain Computer Interface (BCI) stands out as a highly user‑friendly option that does not require physical movement. Current BCI systems can understand whether users want to accelerate or decelerate, but they implement these changes in discrete speed steps rather than allowing for smooth, continuous velocity adjustments. This limitation prevents the systems from mimicking the natural, fluid speed changes seen in human self‑paced motion. The authors aim to address this limitation by redesigning the perception‑action cycle in a BCI‑controlled robotic system: improving how the robotic agent interprets the user's motion intentions (world state) and implementing these actions in a way that better reflects natural physical properties of motion, such as inertia and damping. This paper focuses on the perception aspect. We asked and answered a normative question: "what computation should the robotic agent carry out to optimally perceive incomplete or noisy sensory observations?" Empirical EEG data were collected, and probabilistic representations that served as world state distributions were learned and evaluated in a Generative Adversarial Network framework. A ROS framework was established and connected to a Gazebo environment containing a digital twin of an indoor space and a virtual model of a robotic wheelchair. Signal processing and statistical analyses were implemented to identify the most discriminative features in the spatial‑spectral‑temporal dimensions, which are then used to construct the world model with which the robotic agent interprets user motion intentions as a Bayesian observer.

Abstract:
We propose an efficient knowledge transfer approach for model‑based reinforcement learning, addressing the challenge of deploying large world models in resource‑constrained environments. Our method distills a high‑capacity multi‑task agent (317M parameters) into a compact 1M parameter model, achieving state‑of‑the‑art performance on the MT30 benchmark with a normalized score of 28.45, a substantial improvement over the original 1M parameter model's score of 18.93. This demonstrates the ability of our distillation technique to consolidate complex multi‑task knowledge effectively. Additionally, we apply FP16 post‑training quantization, reducing the model size by 50% while maintaining performance. Our work bridges the gap between the power of large models and practical deployment constraints, offering a scalable solution for efficient and accessible multi‑task reinforcement learning in robotics and other resource‑limited domains.
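
Illustrative sketch (not from the paper): the 50% size reduction follows directly from storing each weight in 16 bits instead of 32. The example below uses Python's standard `struct` module, whose `e` format is IEEE half precision. The random weights are a made-up stand-in for a trained model, and the paper's actual quantization tooling is not specified in the abstract.

```python
import random
import struct
from typing import List


def pack_fp32(weights: List[float]) -> bytes:
    """Serialize weights as little-endian 32-bit floats."""
    return struct.pack(f"<{len(weights)}f", *weights)


def pack_fp16(weights: List[float]) -> bytes:
    """Serialize weights as little-endian 16-bit (half-precision) floats."""
    return struct.pack(f"<{len(weights)}e", *weights)


def unpack_fp16(blob: bytes) -> List[float]:
    """Recover the (rounded) weights from a half-precision blob."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


random.seed(0)
# Hypothetical stand-in for trained parameters.
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

blob32 = pack_fp32(weights)
blob16 = pack_fp16(weights)

size_ratio = len(blob16) / len(blob32)
max_abs_error = max(abs(w - q) for w, q in zip(weights, unpack_fp16(blob16)))
```

Here `size_ratio` is exactly 0.5, and `max_abs_error` stays small for weights of this magnitude. Whether task performance survives the rounding is an empirical question that has to be checked per model, as the abstract reports doing.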

Abstract:
Deploying learned decision‑making systems often requires transferring to new sites where the sensing pipeline differs. In such cases, observations can change in semantics and dimensionality even when action primitives and objectives remain comparable. In this work, we study transferable model‑based planning under this observation mismatch, which remains challenging for existing learning‑based approaches. We propose Adaptive Modularized Model (AMM), a modular planning architecture that separates a domain‑specific observation adapter from a shared internal dynamics model defined in a common planning state space. The dynamics model is meta‑learned from multiple source domains to enable fast adaptation with limited target interaction. At run time, AMM performs receding‑horizon planning by rolling out candidate action sequences under the learned dynamics and selecting actions that optimize a task‑specific objective over predicted futures. We instantiate the approach on cross‑domain traffic signal control, where actions correspond to signal phases and the planning objective captures congestion. Experiments show that AMM improves both performance and data efficiency compared with existing conventional controllers and learning‑based baselines.
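
Illustrative sketch (not from the paper): receding-horizon planning rolls out candidate action sequences under a dynamics model, scores the predicted futures, executes only the first action of the best sequence, and then replans. The two-queue intersection, `toy_dynamics`, and the `congestion` cost below are assumptions invented for this example; AMM's meta-learned dynamics model and observation adapter are not reproduced here.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int]  # hypothetical planning state: queue lengths on two approaches


def toy_dynamics(state: State, phase: int) -> State:
    """Stand-in for a learned dynamics model.

    One car arrives on each approach per step; the approach given green discharges up to three.
    """
    queues = [q + 1 for q in state]
    queues[phase] = max(0, queues[phase] - 3)
    return (queues[0], queues[1])


def congestion(state: State) -> int:
    """Planning objective: total number of queued cars."""
    return sum(state)


def plan(state: State,
         dynamics: Callable[[State, int], State],
         cost: Callable[[State], int],
         phases: Sequence[int],
         horizon: int) -> int:
    """Enumerate every phase sequence of length `horizon`; return the first phase of the cheapest."""
    best_seq, best_cost = None, float("inf")
    for seq in product(phases, repeat=horizon):
        s, total = state, 0
        for phase in seq:
            s = dynamics(s, phase)
            total += cost(s)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]


def control_loop(state: State, steps: int, horizon: int = 3) -> List[Tuple[int, State]]:
    """Replan at every step and apply only the first planned phase."""
    history = []
    for _ in range(steps):
        phase = plan(state, toy_dynamics, congestion, (0, 1), horizon)
        state = toy_dynamics(state, phase)
        history.append((phase, state))
    return history


history = control_loop((6, 0), steps=4)
```

Exhaustive enumeration is used here only because the toy action space is tiny; with more phases or longer horizons, sampling-based candidate generation would be the usual substitute.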

Abstract:
Powered by their superior performance, deep neural networks (DNNs) have found widespread applications across various domains. Many deep learning (DL) models are now embedded in mobile apps, making them more accessible to end users through on‑device DL. However, deploying on‑device DL to users' smartphones simultaneously introduces several security threats. One primary threat is backdoor attacks. Extensive research has explored backdoor attacks for several years and has proposed numerous attack approaches. However, few studies have investigated backdoor attacks on DL models deployed in the real world, or they have shown obvious deficiencies in effectiveness and stealthiness. In this work, we explore more effective and stealthy backdoor attacks on real‑world DL models extracted from mobile apps. Our main justification is that imperceptible and sample‑specific backdoor triggers generated by DNN‑based steganography can enhance the efficacy of backdoor attacks on real‑world models. We first confirm the effectiveness of steganography‑based backdoor attacks on four state‑of‑the‑art DNN models. Subsequently, we systematically evaluate and analyze the stealthiness of the attacks to ensure they are difficult to perceive. Finally, we implement the backdoor attacks on real‑world models and compare our approach with three baseline methods. We collect 38,387 mobile apps, extract 89 DL models from them, and analyze these models to obtain the prerequisite model information for the attacks. After identifying the target models, our approach achieves an average of 12.50% higher attack success rate than DeepPayload while better maintaining the normal performance of the models. Extensive experimental results demonstrate that our method enables more effective, robust, and stealthy backdoor attacks on real‑world models.

Abstract:
Our aim is to learn to solve long‑horizon decision‑making problems in complex robotics domains given low‑level skills and a handful of short‑horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero‑shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision‑language models (VLMs) to propose a large set of visual predicates potentially relevant for decision‑making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization‑based model‑learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search‑based planning algorithm to find a sequence of low‑level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.

Abstract:
Large Language Models (LLMs) have demonstrated a remarkable ability to capture extensive world knowledge, yet how this is achieved without direct sensorimotor experience remains a fundamental puzzle. This study proposes a novel theoretical solution by introducing the Collective World Model hypothesis. We argue that an LLM does not learn a world model from scratch; instead, it learns a statistical approximation of a collective world model that is already implicitly encoded in human language through a society‑wide process of embodied, interactive sense‑making. To formalize this process, we introduce generative emergent communication (Generative EmCom), a framework built on the Collective Predictive Coding (CPC). This framework models the emergence of language as a process of decentralized Bayesian inference over the internal states of multiple agents. We argue that this process effectively creates an encoder‑decoder structure at a societal scale: human society collectively encodes its grounded, internal representations into language, and an LLM subsequently decodes these symbols to reconstruct a latent space that mirrors the structure of the original collective representations. This perspective provides a principled, mathematical explanation for how LLMs acquire their capabilities. The main contributions of this paper are: 1) the formalization of the Generative EmCom framework, clarifying its connection to world models and multi‑agent reinforcement learning, and 2) its application to interpret LLMs, explaining phenomena such as distributional semantics as a natural consequence of representation reconstruction. This work provides a unified theory that bridges individual cognitive development, collective language evolution, and the foundations of large‑scale AI.

Abstract:
World models have recently emerged as a promising approach to reinforcement learning (RL), achieving state‑of‑the‑art performance across a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating the world model learning as a stochastic dynamical system, and characterize the impact of latent representation errors on robustness and generalization, for both cases with zero‑drift representation errors and with non‑zero‑drift representation errors. Our somewhat surprising findings, based on both theoretic and experimental studies, reveal that for the case with zero drift, modest latent representation errors can in fact function as implicit regularization and hence result in improved robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error propagation effects of non‑zero drift, thereby enhancing training stability and robustness. Our experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves accuracy of long‑horizon prediction.
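
Illustrative sketch (not from the paper): a generic Jacobian penalty adds the squared Frobenius norm of the latent transition's Jacobian to the training loss, which discourages small latent errors from being amplified from one step to the next. The `tanh(W z)` transition, the function names, and the penalty weight are assumptions chosen so the Jacobian has a closed form; the paper's actual regularization scheme may differ.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def transition(W: Matrix, z: Vector) -> Vector:
    """Toy latent transition z' = tanh(W z); a stand-in for a learned dynamics model."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W]


def jacobian(W: Matrix, z: Vector) -> Matrix:
    """Analytic Jacobian of tanh(W z): row i is (1 - tanh(W z)_i ** 2) * W[i]."""
    out = transition(W, z)
    return [[(1.0 - o * o) * w for w in row] for o, row in zip(out, W)]


def jacobian_penalty(W: Matrix, z: Vector) -> float:
    """Squared Frobenius norm of the transition Jacobian at z."""
    return sum(v * v for row in jacobian(W, z) for v in row)


def regularized_loss(prediction_loss: float, W: Matrix, z: Vector, weight: float = 0.01) -> float:
    """Prediction loss plus the weighted Jacobian penalty (weight is a hypothetical hyperparameter)."""
    return prediction_loss + weight * jacobian_penalty(W, z)
```

In a real training loop the Jacobian would come from automatic differentiation rather than a hand-derived formula, and the penalty would be averaged over a batch of latent states.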

Abstract:
World model‑based searching and planning are widely recognized as a promising path toward human‑level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next‑token prediction. Our DrivingGPT demonstrates strong performance in both action‑conditioned video generation and end‑to‑end planning, outperforming strong baselines on large‑scale nuPlan and NAVSIM benchmarks.
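
Illustrative sketch (not from the paper): casting world modeling and planning as one sequence problem means flattening per-step image tokens and action tokens into a single stream and training on next-token prediction. The vocabulary sizes, the id-offset scheme, and the function names below are assumptions for this example; DrivingGPT's actual tokenizers and sequence layout are not given in the abstract.

```python
from typing import List, Sequence, Tuple

IMAGE_VOCAB_SIZE = 1024  # assumed size of the image tokenizer's codebook
ACTION_VOCAB_SIZE = 64   # assumed number of discretized action bins


def interleave(frames: Sequence[Sequence[int]], actions: Sequence[Sequence[int]]) -> List[int]:
    """Flatten per-step tokens into one sequence: img_1, act_1, img_2, act_2, ..."""
    if len(frames) != len(actions):
        raise ValueError("need one group of action tokens per frame")
    sequence: List[int] = []
    for image_tokens, action_tokens in zip(frames, actions):
        sequence.extend(image_tokens)
        # Shift action ids past the image vocabulary so both modalities share one id space.
        sequence.extend(IMAGE_VOCAB_SIZE + a for a in action_tokens)
    return sequence


def next_token_pairs(sequence: Sequence[int]) -> List[Tuple[List[int], int]]:
    """(context, target) pairs for standard next-token prediction."""
    return [(list(sequence[:i]), sequence[i]) for i in range(1, len(sequence))]


tokens = interleave([[1, 2], [3, 4]], [[0], [5]])
pairs = next_token_pairs(tokens)
```

With this layout, predicting an image token given the preceding actions corresponds to action-conditioned generation, and predicting an action token given the preceding frames corresponds to planning, so a single model with a single loss covers both.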

Abstract:
Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the generated videos, particularly in terms of their smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time‑frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix based on the Discrete Short‑Time Fourier Transform. This method is supported by a frequency‑based analysis, ensuring that the edited attention score matrix achieves improved consistency across frames. It represents the first‑of‑its‑kind for frequency‑based methods in video diffusion models. For videos generated by multiple prompts, we further uncover key factors such as the alignment of the prompts affecting prompt interpolation quality. Inspired by our analyses, we propose PromptBlend, an advanced prompt interpolation pipeline that systematically aligns the prompts. Extensive experimental results validate the efficacy of our proposed method, demonstrating consistent and substantial improvements over multiple baselines.

Abstract:
The field of autonomous driving is experiencing a surge of interest in world models, which aim to predict potential future scenarios based on historical observations. In this paper, we introduce DFIT‑OccWorld, an efficient 3D occupancy world model that leverages decoupled dynamic flow and image‑assisted training strategy, substantially improving 4D scene forecasting performance. To simplify the training process, we discard the previous two‑stage training strategy and innovatively reformulate the occupancy forecasting problem as a decoupled voxels warping process. Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation. Moreover, our method incorporates an image‑assisted training paradigm to enhance prediction reliability. Specifically, differentiable volume rendering is adopted to generate rendered depth maps through predicted future volumes, which are adopted in render‑based photometric consistency. Experiments demonstrate the effectiveness of our approach, showcasing its state‑of‑the‑art performance on the nuScenes and OpenScene benchmarks for 4D occupancy forecasting, end‑to‑end motion planning and point cloud forecasting. Concretely, it achieves state‑of‑the‑art performances compared to existing 3D world models while incurring substantially lower computational costs.

Abstract:
Learning predictive models from high‑dimensional sensory observations is fundamental for cyber‑physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety‑critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real‑world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground‑truth physical annotations, PIWM employs weak distribution‑based supervision that captures state uncertainty naturally arising from real‑world sensing pipelines. The architecture integrates a VQ‑based visual encoder, a transformer‑based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long‑horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data‑driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.

Abstract:
Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop highly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we identify WMs in transformers trained on maze‑solving tasks. By using Sparse Autoencoders (SAEs) and analyzing attention patterns, we examine the construction of WMs and demonstrate consistency between SAE feature‑based and circuit‑based analyses. By subsequently intervening on isolated features to confirm their causal role, we find that it is easier to activate features than to suppress them. Furthermore, we find that models can reason about mazes involving more simultaneously active features than they encountered during training; however, when these same mazes (with greater numbers of connections) are provided to models via input tokens instead, the models fail. Finally, we demonstrate that positional encoding schemes appear to influence how World Models are structured within the model's residual stream.

Abstract:
We present GEM, a Generalizable Ego‑vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego‑trajectories. Hence, our model has precise control over object dynamics, ego‑agent motion and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long‑horizon generations. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. Pseudo‑labels are used to get depth maps, ego‑trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long generations. Code, models, and datasets are fully open‑sourced.

Abstract:
Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and Llama can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects of the exact same 3D shape but with different orientations and materials/textures. A large number of test images were created using CGI with a huge number of highly diverse objects, materials, and scenes. The results of this test show that the ability of such models to match 3D shapes is significantly below that of humans but much higher than random guessing, suggesting that the models have gained some abstract understanding of 3D shapes but still trail far behind humans in this task. In particular, the models can easily identify the same object in a different orientation, as well as match identical 3D shapes of the same orientation but with different materials and textures. However, when both the object material and orientation are changed, all models perform poorly relative to humans. Code and benchmark are available.

Abstract:
The believable simulation of multi‑user behavior is crucial for understanding complex social systems. Recently, large language model (LLM)‑based AI agents have made significant progress, enabling them to achieve human‑like intelligence across various tasks. However, real human societies are often dynamic and complex, involving numerous individuals engaging in multimodal interactions. In this paper, taking e‑commerce scenarios as an example, we present LMAgent, a very large‑scale, multimodal agent society based on multimodal LLMs. In LMAgent, besides freely chatting with friends, the agents can autonomously browse, purchase, and review products, and even perform live‑streaming e‑commerce. To simulate this complex system, we introduce a self‑consistency prompting mechanism to augment agents' multimodal capabilities, resulting in significantly improved decision‑making performance over existing multi‑agent systems. Moreover, we propose a fast memory mechanism combined with the small‑world model to enhance system efficiency, which supports simulations of more than 10,000 agents in a society. Experiments on agents' behavior show that these agents achieve performance comparable to humans on behavioral indicators. Furthermore, compared with existing LLM‑based multi‑agent systems, LMAgent exhibits more diverse and valuable phenomena, such as herd behavior, which demonstrates its potential for credible large‑scale social behavior simulations.

Abstract:
Are generative pre‑trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero‑shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre‑trained on real‑world games played with the intention of winning, was tested on out‑of‑distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out‑of‑distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.

Abstract:
In offline reinforcement learning, deriving an effective policy from a pre‑collected set of experiences is challenging due to the distribution mismatch between the target policy and the behavioral policy used to collect the data, as well as the limited sample size. Model‑based reinforcement learning improves sample efficiency by generating simulated experiences using a learned dynamic model of the environment. However, these synthetic experiences often suffer from the same distribution mismatch. To address these challenges, we introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences from the world model. SimuDICE enhances the quality of these simulated experiences by adjusting the sampling probabilities of state‑action pairs based on stationary DIstribution Correction Estimation (DICE) and the estimated confidence in the model's predictions. This approach guides policy improvement by balancing experiences similar to those frequently encountered with ones that have a distribution mismatch. Our experiments show that SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre‑collected experiences and planning steps, and it remains robust across varying data collection policies.

Abstract:
Model‑based reinforcement learning (MBRL) is a promising route to sample‑efficient policy optimization. However, a known vulnerability of reconstruction‑based MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods ‑‑ including DreamerV3 and DreamerPro ‑‑ with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through synergy of a pretrained segmentation model, a task‑aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model‑based reinforcement learning.

Abstract:
World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real‑world data and enable closed‑loop evaluations. However, current research primarily evaluates these models based on visual realism or downstream task performance, with limited focus on fidelity to specific action instructions ‑ a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed‑source mechanisms, limiting reproducibility. To address this gap, we develop an open‑access evaluation framework, ACT‑Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large‑scale dataset pairing short context videos from nuScenes with corresponding future trajectory data, which provides conditional input for generating future video frames and enables evaluation of action fidelity for executed motions. Furthermore, Terra is trained on multiple large‑scale trajectory‑annotated datasets to enhance action fidelity. Leveraging this framework, we demonstrate that the state‑of‑the‑art model does not fully adhere to given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.

Abstract:
Navigation is a fundamental skill of agents with visual‑motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next‑generation navigation systems.
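
The planning-by-simulation idea described above (roll candidate trajectories through a world model, discard those that violate a constraint, and rank the rest by how close they end to the goal) can be sketched as follows. This is only a toy illustration: `toy_world_model`, `rank_trajectories`, and the 2D point state are hypothetical stand-ins, not NWM's video-based CDiT predictor or its actual scoring function.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned predictor: next state = state + action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(start: State, actions: Sequence[Action],
            model: Callable[[State, Action], State]) -> List[State]:
    """Simulate a full trajectory without acting in the real environment."""
    states = [start]
    for action in actions:
        states.append(model(states[-1], action))
    return states


def rank_trajectories(start: State, goal: State,
                      candidates: Sequence[Sequence[Action]],
                      model: Callable[[State, Action], State],
                      is_allowed: Callable[[State], bool] = lambda s: True):
    """Return (distance_to_goal, actions) pairs, best first.

    Trajectories whose imagined states violate `is_allowed` are dropped,
    which mimics adding a constraint at planning time.
    """
    scored = []
    for actions in candidates:
        states = rollout(start, actions, model)
        if not all(is_allowed(s) for s in states):
            continue
        final = states[-1]
        dist = ((final[0] - goal[0]) ** 2 + (final[1] - goal[1]) ** 2) ** 0.5
        scored.append((dist, list(actions)))
    scored.sort(key=lambda pair: pair[0])
    return scored


# Candidate action sequences, e.g. sampled from an external policy.
candidates = [
    [(1.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(1.0, 1.0), (1.0, 1.0)],
]

if __name__ == "__main__":
    for dist, actions in rank_trajectories((0.0, 0.0), (2.0, 2.0),
                                           candidates, toy_world_model):
        print(f"{dist:.2f}  {actions}")
```

Passing a different `is_allowed` predicate changes which trajectories survive without retraining anything, which is the property the abstract contrasts with fixed supervised policies.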

Abstract:
We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p high‑fidelity real‑scene video streams with real‑time, responsive control in both first‑ and third‑person perspectives, enabling immersive exploration of richly dynamic environments. Trained on limited supervised data from AAA games like Forza Horizon 5 and Cyberpunk 2077, complemented by large‑scale unsupervised footage from real‑world settings like Tokyo streets, The Matrix allows users to traverse diverse terrains ‑‑ deserts, grasslands, water bodies, and urban landscapes ‑‑ in continuous, uncut hour‑long sequences. Operating at 16 FPS, the system supports real‑time interactivity and demonstrates zero‑shot generalization, translating virtual game environments to real‑world contexts where collecting continuous movement data is often infeasible. For example, The Matrix can simulate a BMW X3 driving through an office setting ‑‑ an environment present in neither gaming data nor real‑world sources. This approach showcases the potential of AAA game data to advance robust world models, bridging the gap between simulations and real‑world applications in scenarios with limited data.

Abstract:
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio‑temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object‑specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high‑level user requests into detailed, semi‑dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion‑prompting.github.io/

Abstract:
In self‑supervised robotic learning, agents acquire data through active interaction with their environment, incurring costs such as energy use, human oversight, and experimental time. To mitigate these, sample‑efficient exploration is essential. While intrinsic motivation (IM) methods like learning progress (LP) are widely used in robotics, and active learning (AL) is well established for classification in machine learning, few frameworks address continuous, high‑dimensional regression tasks typical of world model learning. We propose MUSEL (Model Uncertainty for Sample‑Efficient Learning), a novel AL framework tailored for regression tasks in robotics, such as action‑effect prediction. MUSEL introduces a model uncertainty metric that combines total predictive uncertainty, learning progress, and input diversity to guide data acquisition. We validate our approach using a Stochastic Variational Deep Kernel Learning (SVDKL) model in two robotic tabletop tasks. Experimental results demonstrate that MUSEL improves both learning accuracy and sample efficiency, validating its effectiveness in learning action effects and selecting informative samples.

Abstract:
Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real‑world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which contain complementary information for generation. Existing generation methods ignore this crucial feature, so their outputs cover only separate 2D or 3D information. To fill the gap in 2D‑3D multi‑modal joint generation for autonomous driving, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV‑to‑Camera and Camera‑to‑BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un‑projection from image space to BEV space. We then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate that our method leads to significant performance gains over SOTA methods in terms of generation metrics.

Abstract:
In recent years, model‑based reinforcement learning (MBRL) has emerged as a solution to address sample complexity in multi‑agent reinforcement learning (MARL) by modeling agent‑environment dynamics to improve sample efficiency. However, most MBRL methods assume complete and continuous observations from each agent during the inference stage, which can be overly idealistic in practical applications. To address this limitation, a novel model‑based MARL approach called RMIO is introduced, specifically designed for scenarios in which some agents lose their observations. RMIO leverages the world model to reconstruct missing observations, and further reduces reconstruction errors through inter‑agent information integration to ensure stable multi‑agent decision‑making. Moreover, unlike CTCE methods such as MAMBA, RMIO adopts the CTDE paradigm in standard environments and enables limited communication only when agents lack observation data, thereby reducing reliance on communication. Additionally, RMIO improves asymptotic performance through strategies such as reward smoothing, a dual‑layer experience replay buffer, and an RNN‑augmented policy model, surpassing previous work. Our experiments conducted in both the SMAC and MaMuJoCo environments demonstrate that RMIO outperforms current state‑of‑the‑art approaches in terms of asymptotic convergence performance and policy robustness, both in standard mission settings and in scenarios involving observation loss.

Abstract:
Closed‑loop simulation is crucial for end‑to‑end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent works have demonstrated that integrating world model knowledge alleviates these issues. Despite their efficiency, these approaches still encounter difficulties in the accurate representation of more complex maneuvers, with multi‑lane shifts being a notable example. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, DriveRestorer is proposed to mitigate artifacts via online restoration. This is complemented by a progressive data update strategy designed to ensure high‑quality rendering for more complex maneuvers. To the best of our knowledge, ReconDreamer is the first method to effectively render large maneuvers. Experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA‑IoU, NTL‑IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%, respectively. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA‑IoU metric and a comprehensive user study.

Abstract:
Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks, such as poisonings, fires, and even explosions. In this paper, we present responsible robotic manipulation, which requires robots to consider potential hazards in the real‑world environment while completing instructions and performing complex operations safely and efficiently. However, such scenarios in the real world are variable and risky for training. To address this challenge, we propose Safety‑as‑policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections and gradually develop the cognition of safety, allowing robots to accomplish tasks while avoiding dangers. Additionally, we create the SafeBox synthetic dataset, which includes one hundred responsible robotic manipulation tasks with different safety risk scenarios and instructions, effectively reducing the risks associated with real‑world experiments. Experiments demonstrate that Safety‑as‑policy can avoid risks and efficiently complete tasks in both the synthetic dataset and real‑world experiments, significantly outperforming baseline methods. Our SafeBox dataset shows consistent evaluation results with real‑world scenarios, serving as a safe and effective benchmark for future research.

Abstract:
Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low‑level spaces of sensory input and motor commands to the high‑level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels ‑‑ ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain‑inspired, deep‑learning architecture that learns from pixels to interpret, control, and reason about its environment, using object‑centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high‑level) logical reasoning and (low‑level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as $(A \to B) \land (\neg A \to C)$, as well as logical composition $(A \to B) \land (A \to C) \vdash A \to (B \land C)$ and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real‑world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.
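
For readers who want to check the logical composition rule quoted above, it can be stated and proved in a few lines of Lean 4. This sketch only verifies the propositional rule itself; the theorem name `compose_implications` is made up for illustration and says nothing about how the agent learns the rule.

```lean
-- From (A → B) ∧ (A → C) we may conclude A → (B ∧ C).
theorem compose_implications (A B C : Prop)
    (h : (A → B) ∧ (A → C)) : A → (B ∧ C) :=
  fun a => ⟨h.1 a, h.2 a⟩
```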

Abstract:
This study explores the potential for artificial agents to develop core consciousness, as proposed by Antonio Damasio's theory of consciousness. According to Damasio, the emergence of core consciousness relies on the integration of a self model, informed by representations of emotions and feelings, and a world model. We hypothesize that an artificial agent, trained via reinforcement learning (RL) in a virtual environment, can develop preliminary forms of these models as a byproduct of its primary task. The agent's main objective is to learn to play a video game and explore the environment. To evaluate the emergence of world and self models, we employ probes, i.e., feedforward classifiers that use the activations of the trained agent's neural networks to predict the spatial positions of the agent itself. Our results demonstrate that the agent can form rudimentary world and self models, suggesting a pathway toward developing machine consciousness. This research provides foundational insights into the capabilities of artificial agents in mirroring aspects of human consciousness, with implications for future advancements in artificial intelligence.
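
The probing procedure mentioned above can be illustrated with a minimal sketch: a small classifier is trained on frozen activations to predict a property of the agent's position, and its held-out accuracy is read as evidence of how much position information the activations carry. Everything here is assumed for illustration: the activations are synthetic, the label is a coarse "left half vs. right half" position, and the probe is a single-layer logistic classifier rather than whatever architecture the study used.

```python
import math
import random


def make_dataset(n, rng):
    """Synthetic 'activations': two units leak the agent's (x, y), two are noise."""
    data = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        activation = [x, y, rng.gauss(0, 1), rng.gauss(0, 1)]
        label = 1 if x > 0 else 0  # coarse position: right half of the map?
        data.append((activation, label))
    return data


def predict_proba(w, b, activation):
    z = b + sum(wi * ai for wi, ai in zip(w, activation))
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(data, epochs=50, lr=0.1):
    """Fit the probe with plain SGD on the logistic loss; the agent is never updated."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for activation, label in data:
            err = predict_proba(w, b, activation) - label
            w = [wi - lr * err * ai for wi, ai in zip(w, activation)]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    hits = sum((predict_proba(w, b, a) > 0.5) == bool(label) for a, label in data)
    return hits / len(data)


rng = random.Random(0)
train_set, test_set = make_dataset(200, rng), make_dataset(200, rng)
w, b = train_probe(train_set)

if __name__ == "__main__":
    print(f"held-out probe accuracy: {accuracy(w, b, test_set):.2f}")
```

A probe that scores well above chance on held-out data suggests the information is linearly decodable from the activations; it does not by itself show that the agent uses that information.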

Abstract:
AI's significant recent advances using general‑purpose circuit computations offer a potential window into how the neocortex and cerebellum of the brain are able to achieve a diverse range of functions across sensory, cognitive, and motor domains, despite their uniform circuit structures. However, comparing the brain and AI is challenging unless clear similarities exist, and past reviews have been limited to comparison of brain‑inspired vision AI and the visual neocortex. Here, to enable comparisons across diverse functional domains, we subdivide circuit computation into three elements ‑‑ circuit structure, input/outputs, and the learning algorithm ‑‑ and evaluate the similarities for each element. With this novel approach, we identify wide‑ranging similarities and convergent evolution in the brain and AI, providing new insights into key concepts in neuroscience. Furthermore, inspired by processing mechanisms of AI, we propose a new theory that integrates established neuroscience theories, particularly the theories of internal models and the mirror neuron system. Both the neocortex and cerebellum predict future world events from past information and learn from prediction errors, thereby acquiring models of the world. These models enable three core processes: (1) Prediction ‑‑ generating future information, (2) Understanding ‑‑ interpreting the external world via compressed and abstracted sensory information, and (3) Generation ‑‑ repurposing the future‑information generation mechanism to produce other types of outputs. The universal application of these processes underlies the ability of the neocortex and cerebellum to accomplish diverse functions with uniform circuits. Our systematic approach, insights, and theory promise groundbreaking advances in understanding the brain.

Abstract:
Biological intelligence is inherently adaptive ‑‑ animals continually adjust their actions based on environmental feedback. However, creating adaptive artificial intelligence (AI) remains a major challenge. The next frontier is to go beyond traditional AI to develop "adaptive intelligence," defined here as harnessing insights from biological intelligence to build agents that can learn online, generalize, and rapidly adapt to changes in their environment. Recent advances in neuroscience offer inspiration through studies that increasingly focus on how animals naturally learn and adapt their world models. In this Perspective, I will review the behavioral and neural foundations of adaptive biological intelligence, the parallel progress in AI, and explore brain‑inspired approaches for building more adaptive algorithms.

Abstract:
Addressing the challenge of ensuring safety in ever‑changing and unpredictable environments, particularly in the swiftly advancing realm of autonomous driving in today's 5G wireless communication world, we present Navigation Secure (NavSecure). This vision‑based navigation framework merges the strengths of world models with crucial safety‑focused decision‑making capabilities, enabling autonomous vehicles to navigate real‑world complexities securely. Our approach anticipates potential threats and formulates safer routes by harnessing the predictive capabilities of world models, thus significantly reducing the need for extensive real‑world trial‑and‑error learning. Additionally, our method empowers vehicles to autonomously learn and develop through continuous practice, ensuring the system evolves and adapts to new challenges. Incorporating radio frequency technology, NavSecure leverages 5G networks to enhance real‑time data exchange, improving communication and responsiveness. Validated through rigorous experiments under simulation‑to‑real driving conditions, NavSecure has shown exceptional performance in safety‑critical scenarios, such as sudden obstacle avoidance. Results indicate that NavSecure excels in key safety metrics, including collision prevention and risk reduction, surpassing other end‑to‑end methodologies. This framework not only advances autonomous driving safety but also demonstrates how world models can enhance decision‑making in critical applications. NavSecure sets a new standard for developing more robust and trustworthy autonomous driving systems, capable of handling the inherent dynamics and uncertainties of real‑world environments.

Abstract:
In the last decade, the free energy principle (FEP) and active inference (AIF) have achieved many successes connecting conceptual models of learning and cognition to mathematical models of perception and action. This effort is driven by a multidisciplinary interest in understanding aspects of self‑organizing complex adaptive systems, including elements of agency. Various reinforcement learning (RL) models performing active inference have been proposed and trained on standard RL tasks using deep neural networks. Recent work has focused on improving such agents' performance in complex environments by incorporating the latest machine learning techniques. In this paper, we take an alternative approach. Within the constraints imposed by the FEP and AIF, we attempt to model agents in an interpretable way without deep neural networks by introducing Free Energy Projective Simulation (FEPS). Using internal rewards only, FEPS agents build a representation of their partially observable environments with which they interact. Following AIF, the policy to achieve a given task is derived from this world model by minimizing the expected free energy. Leveraging the interpretability of the model, techniques are introduced to deal with long‑term goals and reduce prediction errors caused by erroneous hidden state estimation. We test the FEPS model on two RL environments inspired from behavioral biology: a timed response task and a navigation task in a partially observable grid. Our results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only. In addition, they infer optimal policies flexibly for any target observation in the environment.

Abstract:
The Experience Goal Visual Rearrangement task is a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, such as 3D Gaussian Splatting, offer fast rendering of high‑quality, photo‑realistic novel views. Our approach gives the agent consistent views of the current and goal settings of the rearrangement task, allowing it to directly compare the goal state and the shuffled state of the world in image space. To compare these views, we propose to use a dense feature matching method with visual features extracted from a foundation model, leveraging its more universal feature representation, which facilitates robustness and generalization. We validate our approach on the AI2‑THOR rearrangement challenge benchmark and demonstrate improvements over the current state‑of‑the‑art methods.

Abstract:
Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UAV‑based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim‑ROS2 simulator and a 2D simulator. The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel "Shrinking POMCP" approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neurosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim‑ROS simulator. Experimental results from both simulators demonstrate that our proposed shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UAV‑assisted search and rescue operations.
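
To make the "belief maintenance" part of a POMDP search formulation concrete, here is a minimal Bayes-filter sketch over a discrete set of cells. It is not the Shrinking POMCP planner or the paper's world model: the function `update_belief`, the cell names, the detection probability, and the assumption of no false positives are all illustrative choices.

```python
def update_belief(belief, searched_cell, detected, p_detect=0.9):
    """Bayes update of P(target in cell) after searching one cell.

    Assumptions (hypothetical sensor model): the target is static, the sensor
    finds it with probability `p_detect` when looking at the right cell, and
    it never reports a detection in a wrong cell.
    """
    posterior = {}
    for cell, prior in belief.items():
        if cell == searched_cell:
            likelihood = p_detect if detected else 1.0 - p_detect
        else:
            likelihood = 0.0 if detected else 1.0
        posterior[cell] = prior * likelihood
    total = sum(posterior.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {cell: p / total for cell, p in posterior.items()}


# Uniform prior over four neighbourhood cells.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# The drone searches cell A and sees nothing: mass shifts to the other cells.
after_miss = update_belief(prior, "A", detected=False)

if __name__ == "__main__":
    for cell, p in after_miss.items():
        print(f"{cell}: {p:.3f}")
```

A planner would then pick the next cell to visit based on the updated belief, for example the highest-probability cell reachable within the remaining time budget.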

Abstract:
The development of artificial intelligence systems capable of understanding and reasoning about complex real‑world scenarios is a significant challenge. In this work we present a novel approach to enhance and exploit LLM reactive capability to address complex problems and interpret deeply contextual real‑world meaning. We introduce a method and a tool for creating a multimodal, knowledge‑augmented formal representation of meaning that combines the strengths of large language models with structured semantic representations. Our method begins with an image input, utilizing state‑of‑the‑art large language models to generate a natural language description. This description is then transformed into an Abstract Meaning Representation (AMR) graph, which is formalized and enriched with logical design patterns, and layered semantics derived from linguistic and factual knowledge bases. The resulting graph is then fed back into the LLM to be extended with implicit knowledge activated by complex heuristic learning, including semantic implicatures, moral values, embodied cognition, and metaphorical representations. By bridging the gap between unstructured language models and formal semantic structures, our method opens new avenues for tackling intricate problems in natural language understanding and reasoning.

Abstract:
We discuss the possibility of world models and active exploration as emergent properties of open‑ended behavior optimization in autonomous agents. In discussing the source of the open‑endedness of living things, we start from the perspective of biological systems as understood by the mechanistic approach of theoretical biology and artificial life. From this perspective, we discuss the potential of homeostasis in particular as an open‑ended objective for autonomous agents and as a general, integrative extrinsic motivation. We then discuss the possibility of implicitly acquiring a world model and active exploration through the internal dynamics of a network, and a hypothetical architecture for this, by combining meta‑reinforcement learning, which assumes domain adaptation as a system that achieves robust homeostasis.

Abstract:
World Model‑based Reinforcement Learning (WMRL) enables sample efficient policy learning by reducing the need for online interactions which can potentially be costly and unsafe, especially for autonomous driving. However, existing world models often suffer from low prediction fidelity and compounding one‑step errors, leading to policy degradation over long horizons. Additionally, traditional RL policies, often deterministic or single Gaussian‑based, fail to capture the multi‑modal nature of decision‑making in complex driving scenarios. To address these challenges, we propose Imagine‑2‑Drive, a novel WMRL framework that integrates a high‑fidelity world model with a multi‑modal diffusion‑based policy actor. It consists of two key components: DiffDreamer, a diffusion‑based world model that generates future observations simultaneously, mitigating error accumulation, and DPA (Diffusion Policy Actor), a diffusion‑based policy that models diverse and multi‑modal trajectory distributions. By training DPA within DiffDreamer, our method enables robust policy learning with minimal online interactions. We evaluate our method in CARLA using standard driving benchmarks and demonstrate that it outperforms prior world model baselines, improving Route Completion and Success Rate by 15% and 20% respectively.

Abstract:
Recent advancements utilizing large‑scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. This suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics‑aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two‑stage training mechanism inspired by dual‑process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre‑trained on the Open X‑Embodiment dataset (OXE) for predicting future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long‑horizon awareness of the environment's dynamics. In the second stage, a flexible yet effective layer‑wise self‑attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts actions modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms the state‑of‑the‑art baseline model GR‑1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small‑scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Code and models will be made public.

Abstract:
World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world models are either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. In this work, we propose a comprehensive evaluation of the world models with LLMs from the decision making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate the rule‑based policy of each environment for the diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world models can be used for decision making solely. Finally, we conduct the comprehensive evaluation of the advanced LLMs, i.e., GPT‑4o and GPT‑4o‑mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT‑4o significantly outperforms GPT‑4o‑mini on the three main tasks, especially for the tasks which require the domain knowledge, ii) the performance of the world model with LLM decreases for long‑term decision‑making tasks, and iii) combining different functionalities of the world model brings additional instability in performance.

Abstract:
Most learning‑based approaches to complex physical reasoning sidestep the crucial problem of parameter identification (e.g., mass, friction) that governs scene dynamics, despite its importance in real‑world applications such as collision avoidance and robotic manipulation. In this paper, we present LLMPhy, a black‑box optimization framework that integrates large language models (LLMs) with physics simulators for physical reasoning. The core insight of LLMPhy is to bridge the textbook physical knowledge embedded in LLMs with the world models implemented in modern physics engines, enabling the construction of digital twins of input scenes via latent parameter estimation. Specifically, LLMPhy decomposes digital twin construction into two subproblems: (i) a continuous problem of estimating physical parameters and (ii) a discrete problem of estimating scene layout. For each subproblem, LLMPhy iteratively prompts the LLM to generate computer programs encoding parameter estimates, executes them in the physics engine to reconstruct the scene, and uses the resulting reconstruction error as feedback to refine the LLM's predictions. As existing physical reasoning benchmarks rarely account for parameter identifiability, we introduce three new datasets designed to evaluate physical reasoning in zero‑shot settings. Our results show that LLMPhy achieves state‑of‑the‑art performance on our tasks, recovers physical parameters more accurately, and converges more reliably than prior black‑box methods. See the LLMPhy project page for details: https://www.merl.com/research/highlights/LLMPhy

Abstract:
With the proliferation of the Large Language Model (LLM), the concept of World Models (WM) has recently attracted a great deal of attention in the AI research community, especially in the context of AI agents. It is arguably evolving into an essential foundation for building AI agent systems. A WM is intended to help the agent predict the future evolution of environmental states or help the agent fill in missing information so that it can plan its actions and behave safely. The safety property of WM plays a key role in their effective use in critical applications. In this work, we review and analyze the impacts of the current state‑of‑the‑art in WM technology from the point of view of trustworthiness and safety based on a comprehensive survey and the fields of application envisaged. We provide an in‑depth analysis of state‑of‑the‑art WMs and derive technical research challenges and their impact in order to call on the research community to collaborate on improving the safety and trustworthiness of WM.

Abstract:
Capturing the interactions between entities in a structured way plays a central role in world models that flexibly adapt to changes in the environment. Recent works motivate the benefits of models that explicitly represent the structure of interactions and formulate the problem as discovering local causal structures. In this work, we demonstrate that reliably capturing these relationships in complex settings remains challenging. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local structures. To this end, we present the SPARse TrANsformer World model (SPARTAN), a Transformer‑based world model that learns context‑dependent interaction structures between entities in a scene. By applying sparsity regularisation on the attention patterns between object‑factored tokens, SPARTAN learns sparse, context‑dependent interaction graphs that accurately predict future object states. We further extend our model to adapt to sparse interventions with unknown targets in the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state‑of‑the‑art in object‑centric world models in observation‑based environments and demonstrate that our model can learn local causal graphs that accurately reflect the underlying interactions between objects, achieving significantly improved few‑shot adaptation to dynamics changes, as well as robustness against distractors.

Abstract:
Language agents based on large language models (LLMs) have demonstrated great promise in automating web‑based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real‑world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test‑time search also hurts efficiency. We advocate model‑based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by (1) proposing a model‑based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; and (2) training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena) while being 4‑5 times more efficient, and it also works effectively on real‑world websites (Online‑Mind2Web and Mind2Web‑Live). Furthermore, our trained world model, Dreamer‑7B, performs comparably to GPT‑4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments.
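
The planning loop described in this abstract (simulate each candidate action with a world model, score the imagined outcome, then commit to the best one) can be illustrated with a minimal sketch. This is not the WebDreamer implementation: in the paper both roles are played by LLMs, whereas here `toy_world_model` and `toy_value` are hypothetical plain functions over integer states, and `plan_one_step` is a name introduced only for illustration.

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def plan_one_step(
    state: State,
    candidates: Sequence[Action],
    world_model: Callable[[State, Action], State],
    value_fn: Callable[[State], float],
) -> Action:
    """Return the candidate whose simulated outcome scores highest.

    No action is executed in the real environment here; every candidate
    is only imagined through `world_model`, so nothing needs to be undone.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=lambda a: value_fn(world_model(state, a)))


# Hypothetical toy stand-ins: states are integers and the goal state is 10.
def toy_world_model(state: int, action: int) -> int:
    return state + action


def toy_value(state: int) -> float:
    return -abs(10 - state)


best = plan_one_step(7, [-1, 1, 2, 5], toy_world_model, toy_value)
```

Because only the chosen action would be executed, this kind of lookahead avoids the backtracking that irreversible web actions make infeasible.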

Abstract:
World models play a crucial role in decision‑making within embodied environments, enabling cost‑free explorations that would otherwise be expensive in the real world. To facilitate effective decision‑making, world models must be equipped with strong generalizability to support faithful imagination in out‑of‑distribution (OOD) regions and provide reliable uncertainty estimation to assess the credibility of the simulated experiences, both of which present significant challenges for prior scalable approaches. This paper introduces WHALE, a framework for learning generalizable world models, consisting of two key techniques: behavior‑conditioning and retracing‑rollout. Behavior‑conditioning addresses the policy distribution shift, one of the primary sources of the world model generalization error, while retracing‑rollout enables efficient uncertainty estimation without the necessity of model ensembles. These techniques are universal and can be combined with any neural network architecture for world model learning. Incorporating these two techniques, we present Whale‑ST, a scalable spatial‑temporal transformer‑based world model with enhanced generalizability. We demonstrate the superiority of Whale‑ST in simulation tasks by evaluating both value estimation accuracy and video generation fidelity. Additionally, we examine the effectiveness of our uncertainty estimation technique, which enhances model‑based policy optimization in fully offline scenarios. Furthermore, we propose Whale‑X, a 414M parameter world model trained on 970K trajectories from Open X‑Embodiment datasets. We show that Whale‑X exhibits promising scalability and strong generalizability in real‑world manipulation scenarios using minimal demonstrations.

Abstract:
The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task‑specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre‑collected trajectories, 2) support test‑time behavior optimization, and 3) facilitate task‑agnostic reasoning. To this end, we present DINO World Model (DINO‑WM), a new method to model visual dynamics without reconstructing the visual world. DINO‑WM leverages spatial patch features pre‑trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO‑WM to achieve observational goals through action sequence optimization, facilitating task‑agnostic planning by treating goal features as prediction targets. We demonstrate that DINO‑WM achieves zero‑shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre‑learned inverse models, outperforming prior state‑of‑the‑art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi‑particle scenarios.
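
To make "action sequence optimization with goal features as prediction targets" concrete, the following is a minimal sketch under stated assumptions. The optimizer is plain random shooting, chosen for brevity and not taken from the abstract, and `toy_step` is a hypothetical two-dimensional latent dynamics function standing in for a learned predictor over patch features.

```python
import numpy as np


def rollout(z0, actions, step):
    """Apply `step` along an action sequence and return the final latent state."""
    z = z0
    for a in actions:
        z = step(z, a)
    return z


def plan_random_shooting(z0, z_goal, step, horizon, action_dim, n_samples=512, seed=0):
    """Sample action sequences and keep the one whose predicted final
    latent state is closest (squared distance) to the goal features."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    costs = np.array(
        [np.sum((rollout(z0, seq, step) - z_goal) ** 2) for seq in candidates]
    )
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])


# Hypothetical toy dynamics: each action nudges the latent state.
def toy_step(z, a):
    return z + 0.5 * a


z_start = np.zeros(2)
z_target = np.array([1.0, -1.0])
best_actions, best_cost = plan_random_shooting(
    z_start, z_target, toy_step, horizon=4, action_dim=2
)
# Cost of doing nothing, for comparison with the planned sequence.
baseline_cost = float(np.sum((z_start - z_target) ** 2))
```

The cost is defined purely in feature space, so no reward model or expert demonstration is needed; only a goal observation encoded into the same latent space.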

Abstract:
The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre‑training) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that "bigger is better", we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task, and architecture, which has important implications for the optimal sizing of models and data.
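
As a generic illustration of how a power law of the form y ≈ a · x^b is typically estimated, the sketch below fits a straight line in log-log space. The data are synthetic and hypothetical (they are not results from the paper), and `fit_power_law` is a name introduced here for illustration only.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**b by least squares in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, hypothetical data: loss falls as a power law in parameter count.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * model_sizes ** -0.25

a, b = fit_power_law(model_sizes, losses)
```

The fitted pair `(a, b)` corresponds to the coefficients that, according to the abstract, shift with the tokenizer, task, and architecture.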

Abstract:
The Daisy World model has long served as a foundational framework for understanding the self‑regulation of planetary biospheres, providing insights into the feedback mechanisms that may govern inhabited exoplanets. In this study, we extend the classic Daisy World model through the lens of Semantic Information Theory (SIT), aiming to characterize the information flow between the biosphere and planetary environment, which we term the information architecture of Daisy World systems. Our objective is to develop novel methodologies for analyzing the evolution of coupled planetary systems, including biospheres and geospheres, with implications for astrobiological observations and the identification of agnostic biosignatures. To operationalize SIT in this context, we introduce a version of the Daisy World model tailored to reflect potential conditions on M‑dwarf exoplanets, formulating a system of stochastic differential equations that describe the co‑evolution of the daisies and their planetary environment. Analysis of this Exo‑Daisy World model reveals how correlations between the biosphere and environment intensify with rising stellar luminosity, and how these correlations correspond to distinct phases of information exchange between the coupled systems. This rein control provides a quantitative description of the informational feedback between the biosphere and its host planet. Finally, we discuss the broader implications of our approach for developing detailed ExoGaia models of inhabited exoplanetary systems, proposing new avenues for interpreting astrobiological data and exploring biosignature candidates.
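
For readers unfamiliar with the classic model that this work extends, the following is a minimal sketch of the deterministic two-species Daisy World, integrated with a simple Euler scheme. It is not the paper's Exo‑Daisy World model, which is stochastic and tailored to M‑dwarf conditions. The parameter values are the commonly quoted ones from the original Watson and Lovelock formulation and are assumptions here, as are the function names, the step size, and the seed cover.

```python
SIGMA = 5.67e-8    # Stefan-Boltzmann constant (W m^-2 K^-4)
FLUX = 917.0       # stellar flux scale S (W m^-2), assumed classic value
Q = 2.06e9         # heat-redistribution parameter q (K^4), assumed classic value
DEATH_RATE = 0.3   # daisy death rate gamma
ALBEDO_GROUND, ALBEDO_WHITE, ALBEDO_BLACK = 0.5, 0.75, 0.25
SEED_COVER = 0.01  # minimum cover so that a species can always regrow


def growth_rate(temp_kelvin):
    """Parabolic growth curve peaking at 22.5 C and clipped at zero."""
    temp_c = temp_kelvin - 273.15
    return max(0.0, 1.0 - 0.003265 * (22.5 - temp_c) ** 2)


def planetary_albedo(white, black):
    bare = 1.0 - white - black
    return bare * ALBEDO_GROUND + white * ALBEDO_WHITE + black * ALBEDO_BLACK


def simulate_daisyworld(luminosity, steps=5000, dt=0.02):
    """Euler-integrate daisy cover; returns (white, black, planet temperature in K)."""
    white = black = SEED_COVER
    for _ in range(steps):
        bare = 1.0 - white - black
        albedo = planetary_albedo(white, black)
        temp4 = FLUX * luminosity * (1.0 - albedo) / SIGMA
        # Local temperatures: white daisies run cooler, black daisies warmer.
        temp_white = max(Q * (albedo - ALBEDO_WHITE) + temp4, 0.0) ** 0.25
        temp_black = max(Q * (albedo - ALBEDO_BLACK) + temp4, 0.0) ** 0.25
        d_white = white * (bare * growth_rate(temp_white) - DEATH_RATE)
        d_black = black * (bare * growth_rate(temp_black) - DEATH_RATE)
        white = max(white + dt * d_white, SEED_COVER)
        black = max(black + dt * d_black, SEED_COVER)
    albedo = planetary_albedo(white, black)
    planet_temp = (FLUX * luminosity * (1.0 - albedo) / SIGMA) ** 0.25
    return white, black, planet_temp


white_cover, black_cover, planet_temp = simulate_daisyworld(luminosity=1.0)
```

Sweeping `luminosity` over a range of values and recording `planet_temp` reproduces the qualitative self-regulation behaviour for which the model is known; a stochastic variant would add noise terms to the two cover updates.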

Abstract:
World models and video generation are pivotal technologies in the domain of autonomous driving, each playing a critical role in enhancing the robustness and reliability of autonomous systems. World models, which simulate the dynamics of real‑world environments, and video generation models, which produce realistic video sequences, are increasingly being integrated to improve situational awareness and decision‑making capabilities in autonomous vehicles. This paper investigates the relationship between these two technologies, focusing on how their structural parallels, particularly in diffusion‑based models, contribute to more accurate and coherent simulations of driving scenarios. We examine leading works such as JEPA, Genie, and Sora, which exemplify different approaches to world model design, thereby highlighting the lack of a universally accepted definition of world models. These diverse interpretations underscore the field's evolving understanding of how world models can be optimized for various autonomous driving tasks. Furthermore, this paper discusses the key evaluation metrics employed in this domain, such as Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance (FID) for assessing the quality of generated video content. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions, emphasizing the potential of these technologies to jointly advance the performance of autonomous driving systems. The findings presented in this paper aim to provide a comprehensive understanding of how the integration of video generation and world models can drive innovation in the development of safer and more reliable autonomous vehicles.

Abstract:
Learning world models offers a promising avenue for goal‑conditioned reinforcement learning with sparse rewards. By allowing agents to plan actions or exploratory goals without direct interaction with the environment, world models enhance exploration efficiency. The quality of a world model hinges on the richness of data stored in the agent's replay buffer, with expectations of reasonable generalization across the state space surrounding recorded trajectories. However, challenges arise in generalizing learned world models to state transitions backward along recorded trajectories or between states across different trajectories, hindering their ability to accurately model real‑world dynamics. To address these challenges, we introduce a novel goal‑directed exploration algorithm, MUN (short for "World Models for Unconstrained Goal Navigation"). This algorithm is capable of modeling state transitions between arbitrary subgoal states in the replay buffer, thereby facilitating the learning of policies to navigate between any "key" states. Experimental results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy's capacity to generalize across new goal settings.

Abstract:
OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in‑distribution, out‑of‑distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large‑scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion‑based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out‑of‑distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case‑based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io

Abstract:
Large language model (LLM)‑based agents are increasingly employed to interact with external environments (e.g., games, APIs, world models) to solve user‑provided tasks. However, current frameworks often lack the ability to collaborate effectively with users in fully conversational settings. Conversations are essential for aligning on task details, achieving user‑defined goals, and satisfying preferences. While existing agents address ambiguity through clarification questions, they underutilize the broader potential of an LLM's conversational capabilities. In this work, we introduce ReSpAct, an LLM‑based agent designed to seamlessly integrate reasoning, decision‑making, and dynamic dialogue for task‑solving. Expanding on reasoning‑first approaches like ReAct, ReSpAct employs active, free‑flowing dialogues to interpret instructions, clarify goals, provide status updates, resolve subtask failures, and refine plans based on user inputs without any explicit dialogue schema. By alternating between task‑solving actions and interactive conversations, ReSpAct demonstrates improved performance across diverse environments. We evaluate ReSpAct in user‑interactive settings, including task‑oriented dialogue systems (MultiWOZ) and decision‑making tasks (ALFWorld, WebShop). ReSpAct outperforms ReAct with absolute success rate improvements of 6% and 4% in ALFWorld and WebShop, respectively, and achieves a 5.5% gain in Inform and a 3% gain in Success scores in MultiWOZ. These results highlight the value of integrating dynamic user‑agent collaboration for more effective task resolution.

Abstract:
We introduce Image‑GOal Representations (IGOR), aiming to learn a unified, semantically consistent action space across humans and various robots. Through this unified latent action space, IGOR enables knowledge transfer among large‑scale robot and human activity data. We achieve this by compressing visual changes between an initial image and its goal state into latent actions. IGOR allows us to generate latent action labels for internet‑scale video data. This unified latent action space enables the training of foundation policy and world models across a wide variety of tasks performed by both robots and humans. We demonstrate that: (1) IGOR learns a semantically consistent action space for both humans and robots, characterizing various possible motions of objects representing the physical interaction knowledge; (2) IGOR can "migrate" the movements of an object in one video to other videos, even across humans and robots, by jointly using the latent action model and world model; (3) IGOR can learn to align latent actions with natural language through the foundation policy model, and integrate latent actions with a low‑level policy model to achieve effective robot control. We believe IGOR opens new possibilities for human‑to‑robot knowledge transfer and control.

Abstract:
Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre‑training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast‑VGen, a novel dual‑speed learning system for action‑driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference‑time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow‑fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi‑episode experiences for context‑aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large‑scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast‑VGen outperforms baselines across various metrics for action‑driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow‑fast learning loop algorithm significantly enhances performances on long‑horizon planning tasks as well. Project Website: https://slowfast‑vgen.github.io

Abstract:
Broadly intelligent agents should form task‑specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro‑Symbolic Predicates, a first‑order abstraction language that combines the strengths of symbolic and neural knowledge representations. We outline an online algorithm for inventing such predicates and learning abstract world models. We compare our approach to hierarchical reinforcement learning, vision‑language model planning, and symbolic predicate invention approaches, on both in‑ and out‑of‑distribution tasks across five simulated robotic domains. Results show that our approach offers better sample complexity, stronger out‑of‑distribution generalization, and improved interpretability.

Abstract:
As reinforcement learning agents become increasingly deployed in real‑world scenarios, predicting future agent actions and events during deployment is important for facilitating better human‑agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non‑planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e.g., plans or neuron activations), and a simulation‑based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation‑based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real‑world deployments.

Abstract:
Mobile manipulators require coordinated control between navigation and manipulation to accomplish tasks. Typically, coordinated mobile manipulation behaviors have base navigation to approach the goal followed by arm manipulation to reach the desired pose. Selecting the embodiment between the base and arm can be determined based on reachability. Previous methods evaluate reachability by computing inverse kinematics and activate arm motions once solutions are identified. In this study, we introduce a new approach called predictive reachability that decides reachability based on predicted arm motions. Our model utilizes a hierarchical policy framework built upon a world model. The world model allows the prediction of future trajectories and the evaluation of reachability. The hierarchical policy selects the embodiment based on the predicted reachability and plans accordingly. Unlike methods that require prior knowledge about robots and environments for inverse kinematics, our method only relies on image‑based observations. We evaluate our approach through basic reaching tasks across various environments. The results demonstrate that our method outperforms previous model‑based approaches in both sample efficiency and performance, while enabling more reasonable embodiment selection based on predictive reachability.

Abstract:
Unlike quasi‑static robotic manipulation tasks like pick‑and‑place, dynamic tasks such as non‑prehensile manipulation pose greater challenges, especially for vision‑based control. Successful control requires the extraction of features relevant to the target task. In visual imitation learning settings, these features can be learnt by backpropagating the policy loss through the vision backbone. Yet, this approach tends to learn task‑specific features with limited generalizability. Alternatively, learning world models can realize more generalizable vision backbones. Utilizing the learnt features, task‑specific policies are subsequently trained. Commonly, these models are trained solely to predict the next RGB state from the current state and action taken. But RGB‑only prediction might not fully capture the task‑relevant dynamics. In this work, we hypothesize that direct supervision of target dynamic states (Dynamics Mapping) can learn better dynamics‑informed world models. Besides reconstructing the next RGB state, the world model is also trained to directly predict the position, velocity, and acceleration of the rigid bodies in the environment. To verify our hypothesis, we designed a non‑prehensile 2D environment tailored to two tasks: "Balance‑Reaching" and "Bin‑Dropping". When trained on the first task, dynamics mapping enhanced the task performance under different training configurations (Decoupled, Joint, End‑to‑End) and policy architectures (Feedforward, Recurrent). Notably, its most significant impact was on world model pretraining, boosting the success rate from 21% to 85%. Frozen dynamics‑informed world models generalized well to a task with in‑domain dynamics, but poorly to one with out‑of‑domain dynamics.

Abstract:
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi‑step predictions or handle Out‑of‑Distribution (OOD) scenarios. To address this challenge, we propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre‑trained vision‑language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce Embodied Video Anticipation Benchmark (EVA‑Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, utilizing both in‑domain and OOD datasets. Building on this foundation, we devise a world model, Embodied Video Anticipator (EVA), that follows a multistage training paradigm to generate high‑fidelity video frames and applies an autoregressive strategy to enable adaptive generalization for longer video sequences. Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large‑scale pre‑trained models in real‑world video prediction applications. The video demos are available at https://sites.google.com/view/icml‑eva.

Abstract:
With the increasing availability of open‑source robotic data, imitation learning has become a promising approach for both manipulation and locomotion. Diffusion models are now widely used to train large, generalized policies that predict controls or trajectories, leveraging their ability to model multimodal action distributions. However, this generality comes at the cost of larger model sizes and slower inference, an acute limitation for robotic tasks requiring high control frequencies. Moreover, Diffusion Policy (DP), a popular trajectory‑generation approach, suffers from a trade‑off between performance and action horizon: fewer diffusion queries lead to larger trajectory chunks, which in turn accumulate tracking errors. To overcome these challenges, we introduce WARPD (World model Assisted Reactive Policy Diffusion), a method that generates closed‑loop policies (weights for neural policies) directly, instead of open‑loop trajectories. By learning behavioral distributions in parameter space rather than trajectory space, WARPD offers two major advantages: (1) extended action horizons with robustness to perturbations, while maintaining high task performance, and (2) significantly reduced inference costs. Empirically, WARPD outperforms DP in long‑horizon and perturbed environments, and achieves multitask performance on par with DP while requiring only ~1/45th of the inference‑time FLOPs per step.

Abstract:
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self‑report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self‑reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short‑ or long‑term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground‑truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT‑4, GPT‑4o, and Llama‑3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground‑truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out‑of‑distribution generalization.

Abstract:
Closed‑loop simulation is essential for advancing end‑to‑end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward‑driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous‑driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatial‑temporal consistency of traffic elements. Besides, the cousin data training strategy is proposed to facilitate merging real and synthetic data for optimizing 4DGS. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving a relative improvement in FID by 32.1%, 46.4%, and 16.3% compared to PVG, S3Gaussian, and Deformable‑GS. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, which is verified by a comprehensive user study and the relative increases of 22.6%, 43.5%, and 15.6% in the NTA‑IoU metric.

Abstract:
The number of publicly available models is rapidly increasing, yet most remain undocumented. Users looking for suitable models for their tasks must first determine what each model does. Training machine learning models to infer missing documentation directly from model weights is challenging, as these weights often contain significant variation unrelated to model functionality (denoted nuisance). Here, we identify a key property of real‑world models: most public models belong to a small set of Model Trees, where all models within a tree are fine‑tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within trees. While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. Notably, ProbeX is the first probing method specifically designed to learn from the weights of a single hidden model layer. We demonstrate the effectiveness of ProbeX by predicting the categories in a model's training dataset based only on its weights. Excitingly, ProbeX can map the weights of Stable Diffusion into a weight‑language embedding space, enabling model search via text, i.e., zero‑shot model classification.

Abstract:
Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM‑based web agents in long‑horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non‑refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT‑4o, Claude‑3.5‑Sonnet, etc.). Then, we present a World‑model‑augmented (WMA) web agent, which simulates the outcomes of its actions for better decision‑making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition‑focused observation abstraction, where the prediction objectives are free‑form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost‑ and time‑efficiency compared to recent tree‑search‑based agents.

Abstract:
Large‑scale generative models have achieved remarkable success in a number of domains. However, for sequential decision‑making problems, such as robotics, action‑labelled data is often scarce and therefore scaling‑up foundation models for decision‑making remains a challenge. A potential solution lies in leveraging widely‑available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision‑making in downstream tasks. Image‑to‑video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action‑conditioned, and the most powerful models are closed‑source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action‑conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain‑specific dataset of action‑labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action‑conditioned videos. We evaluate AVID on video game and real‑world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.

Abstract:
Advancements in reinforcement learning have led to the development of sophisticated models capable of learning complex decision‑making tasks. However, efficiently integrating world models with decision transformers remains a challenge. In this paper, we introduce a novel approach that combines the Dreamer algorithm's ability to generate anticipatory trajectories with the adaptive learning strengths of the Online Decision Transformer. Our methodology enables parallel training where Dreamer‑produced trajectories enhance the contextual decision‑making of the transformer, creating a bidirectional enhancement loop. We empirically demonstrate the efficacy of our approach on a suite of challenging benchmarks, achieving notable improvements in sample efficiency and reward maximization over existing methods. Our results indicate that the proposed integrated framework not only accelerates learning but also showcases robustness in diverse and dynamic scenarios, marking a significant step forward in model‑based reinforcement learning.

Abstract:
Large language models (LLMs) embed extensive knowledge and utilize it to perform exceptionally well across various tasks. Nevertheless, outdated knowledge or factual errors within LLMs can lead to misleading or incorrect responses, causing significant issues in practical applications. To rectify such flaws without costly model retraining, various model editing approaches have been proposed to correct inaccurate knowledge within LLMs in a cost‑efficient way. To evaluate these model editing methods, previous work introduced a series of datasets. However, most of the previous datasets only contain fabricated data in a single format, which diverges from real‑world model editing scenarios, raising doubts about their usability in practice. To facilitate the application of model editing in real‑world scenarios, we propose the challenge of practicality. To resolve such challenges and effectively enhance the capabilities of LLMs, we present FAME, a factual, comprehensive, and multi‑task dataset, which is designed to enhance the practicality of model editing. We then propose SKEME, a model editing method that uses a novel caching mechanism to ensure synchronization with the real world. The experiments demonstrate that SKEME performs excellently across various tasks and scenarios, confirming its practicality.

Abstract:
Language‑guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work fits the data trivially without revealing the relation between instructions and low‑level executable actions, so these models are prone to memorizing superficial patterns in the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes. To address this issue, we propose a PrImitive‑driVen waypOinT‑aware world model for Robotic manipulation (PIVOT‑R) that focuses solely on the prediction of task‑relevant waypoints. Specifically, PIVOT‑R consists of a Waypoint‑aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive‑driven waypoint prediction, while the latter focuses on decoding low‑level actions. Additionally, we design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby helping the model reduce computational redundancy and improve model execution efficiency. Our PIVOT‑R outperforms state‑of‑the‑art (SoTA) open‑source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT‑R, the execution efficiency of PIVOT‑R with AHE is increased by 28‑fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT‑R can significantly improve both the performance and efficiency of robotic manipulation.

Abstract:
Tight coordination is required for effective human‑robot teams in domains involving fast dynamics and tactical decisions, such as multi‑car racing. In such settings, robot teammates must react to cues of a human teammate's tactical objective to assist in a way that is consistent with the objective (e.g., navigating left or right around an obstacle). To address this challenge, we present Dream2Assist, a framework that combines a rich world model able to infer human objectives and value functions, and an assistive agent that provides appropriate expert assistance to a given human teammate. Our approach builds on a recurrent state space model to explicitly infer human intents, enabling the assistive agent to select actions that align with the human and enabling a fluid teaming interaction. We demonstrate our approach in a high‑speed racing domain with a population of synthetic human drivers pursuing mutually exclusive objectives, such as "stay‑behind" and "overtake". We show that the combined human‑robot team, when blending its actions with those of the human, outperforms the synthetic humans alone as well as several baseline assistance strategies, and that intent‑conditioning enables adherence to human preferences during task execution, leading to improved performance while satisfying the human's objective.

Abstract:
Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval‑augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model‑based planning. Additionally, DAVIS implements an agentic, multi‑turn retrieval system, similar to a human's inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark compared to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS's World Model demonstrates competitive performance on the well‑known HotpotQA and MusiqueQA datasets for multi‑hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.

Abstract:
Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update‑to‑data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real‑world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on‑policy actions. We mitigate this issue directly by augmenting the off‑policy RL training process with a small amount of data generated from a learned world model. Our method, Model‑Augmented Data for TD Learning (MAD‑TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD‑TD's ability to combat value overestimation, and its practical stability gains for continued learning.

Abstract:
Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed ‑‑ more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
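
The tail-masking idea in the abstract above can be illustrated with a minimal sketch. It assumes only what the abstract states (a random number of tokens is dropped at the end of each frame's encoding); the function name `random_tail_mask`, the `min_keep` parameter, and the uniform choice of how many tokens to keep are hypothetical and are not taken from the ElasticTok code.

```python
import random

def random_tail_mask(num_frames, tokens_per_frame, rng, min_keep=1):
    """Build one keep-mask per frame that keeps a random-length prefix.

    Assumption for illustration: the number of kept tokens is drawn
    uniformly from [min_keep, tokens_per_frame] for every frame.
    """
    mask = []
    for _ in range(num_frames):
        keep = rng.randint(min_keep, tokens_per_frame)  # inclusive bounds
        # True for the first `keep` positions, False for the dropped tail.
        mask.append([pos < keep for pos in range(tokens_per_frame)])
    return mask

rng = random.Random(0)
mask = random_tail_mask(num_frames=4, tokens_per_frame=8, rng=rng)
kept_per_frame = [sum(row) for row in mask]  # tokens kept in each frame
```

Because only a prefix survives, a model trained with such masks can be asked at inference time for any prefix length, which is one plausible reading of how a variable token budget per frame becomes possible.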

Abstract:
Due to the difficulty of acquiring extensive real‑world data, robot simulation has become crucial for parallel training and sim‑to‑real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single‑task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision‑language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task‑level generalization, we assess the zero‑shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: https://sites.google.com/view/evaltasks.

Abstract:
Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model‑based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real‑world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self‑supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer‑based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT‑STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT‑STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer‑based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer‑based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.

Abstract:
Reinforcement Learning (RL) applied in healthcare can lead to unsafe medical decisions and treatment, such as excessive dosages or abrupt changes, often due to agents overlooking common‑sense constraints. Consequently, Constrained Reinforcement Learning (CRL) is a natural choice for safe decisions. However, specifying the exact cost function is inherently difficult in healthcare. Recent Inverse Constrained Reinforcement Learning (ICRL) is a promising approach that infers constraints from expert demonstrations. ICRL algorithms model Markovian decisions in an interactive environment. These settings do not align with the practical requirement of a decision‑making system in healthcare, where decisions rely on historical treatment recorded in an offline dataset. To tackle these issues, we propose the Constraint Transformer (CT). Specifically, 1) we utilize a causal attention mechanism to incorporate historical decisions and observations into the constraint modeling, while employing a Non‑Markovian layer for weighted constraints to capture critical states. 2) A generative world model is used to perform exploratory data augmentation, enabling offline RL methods to simulate unsafe decision sequences. In multiple medical scenarios, empirical results demonstrate that CT can capture unsafe states and achieve strategies that approximate lower mortality rates, reducing the occurrence probability of unsafe behaviors.

Abstract:
We introduce a novel, general‑purpose audio generation framework specifically designed for anomaly detection and localization. Unlike existing datasets that predominantly focus on industrial and machine‑related sounds, our framework focuses on a broader range of environments, particularly useful in real‑world scenarios where only audio data are available, such as in video‑derived or telephonic audio. To generate such data, we propose a new method inspired by the LLM‑Modulo framework, which leverages large language models (LLMs) as world models to simulate such real‑world scenarios. This tool is modular, allowing a plug‑and‑play approach. It operates by first using LLMs to predict plausible real‑world scenarios. An LLM further extracts the constituent sounds, the order and the way in which these should be merged to create coherent wholes. Much like the LLM‑Modulo framework, we include rigorous verification of each output stage, ensuring the reliability of the generated data. The data produced using the framework serves as a benchmark for anomaly detection applications, potentially enhancing the performance of models trained on audio data, particularly in handling out‑of‑distribution cases. Our contributions thus fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.

Abstract:
Enhancing the reasoning capabilities of language models (LMs) remains a key challenge, especially for tasks that require complex, multi‑step decision‑making where existing Chain‑of‑Thought (CoT) approaches struggle with consistency and verification. In this paper, we propose a novel reasoning framework, referred to as Structure‑aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. Unlike prior methods that rely purely on natural language reasoning, SWAP leverages entailment graphs to encode structured dependencies and enable symbolic verification of intermediate steps. To systematically construct and update the graph, SWAP employs a policy model to propose candidate expansions and a world model to predict structural updates. To improve accuracy, the world model generates multiple alternative updates, and a discriminator re‑ranks them based on plausibility. To encourage diverse exploration, we introduce Diversity‑based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves the discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta‑knowledge to improve ranking quality. We evaluate SWAP across diverse reasoning‑intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP significantly improves upon the base models and consistently outperforms existing reasoning methods.
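
The sampling rule described for Diversity‑based Modelling can be read as sampling without replacement with renormalization. The sketch below is one interpretation of that single sentence, not the SWAP implementation: it assumes a small explicit candidate set with known probabilities, and the names `sample_diverse` and `policy_probs` are hypothetical.

```python
import random

def sample_diverse(probs, num_samples, rng):
    """Draw distinct candidates from a dict of candidate -> probability.

    After each draw the chosen candidate is removed, so the next draw is
    taken from the remaining probability mass (implicitly renormalized).
    """
    remaining = dict(probs)
    chosen = []
    for _ in range(min(num_samples, len(probs))):
        total = sum(remaining.values())
        if total <= 0:
            break  # no probability mass left to sample from
        threshold = rng.random() * total
        cumulative = 0.0
        pick = None
        for candidate, p in remaining.items():
            cumulative += p
            pick = candidate  # falls back to the last item on round-off
            if threshold < cumulative:
                break
        chosen.append(pick)
        del remaining[pick]
    return chosen

# Hypothetical policy distribution over four candidate expansions.
policy_probs = {"expand_a": 0.6, "expand_b": 0.25, "expand_c": 0.1, "expand_d": 0.05}
picks = sample_diverse(policy_probs, num_samples=3, rng=random.Random(0))
```

High-probability candidates are still likely to appear early, but no candidate can be returned twice, which is the diversity property the abstract describes.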

Abstract:
Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent‑based data generator to automatically create high‑quality and diverse instruction datasets. The generator includes an iterative self‑refining module for temporally consistent experience sampling, a diverse set of question‑answering instruction seeds, and a retrieval‑augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open‑source LLMs like LLaMA‑3 with a performance boost of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The performance is able to compete with or surpass their larger counterparts such as GPT‑4.

Abstract:
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi‑agent decision‑making problems because they lack the trial‑and‑error experience and reasoning that humans have. To address this limitation, we explore a paradigm that integrates a language‑guided simulator into the multi‑agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi‑agent decision‑making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi‑Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.

Abstract:
This paper offers a roadmap for the development of scalable aligned artificial intelligence (AI) from first principle descriptions of natural intelligence. In brief, a possible path toward scalable aligned AI rests upon enabling artificial agents to learn a good model of the world that includes a good model of our preferences. For this, the main objective is creating agents that learn to represent the world and other agents' world models; a problem that falls under structure learning (a.k.a. causal representation learning or model discovery). We expose the structure learning and alignment problems with this goal in mind, as well as principles to guide us forward, synthesizing various ideas across mathematics, statistics, and cognitive science. 1) We discuss the essential role of core knowledge, information geometry and model reduction in structure learning, and suggest core structural modules to learn a wide range of naturalistic worlds. 2) We outline a way toward aligned agents through structure learning and theory of mind. As an illustrative example, we mathematically sketch Asimov's Laws of Robotics, which prescribe agents to act cautiously to minimize the ill‑being of other agents. We supplement this example by proposing refined approaches to alignment. These observations may guide the development of artificial intelligence in helping to scale existing ‑‑ or design new ‑‑ aligned structure learning systems.

Abstract:
Long‑distance driving is an important component of planetary surface exploration. Unforeseen events often require human operators to adjust mobility plans, but this approach does not scale and will be insufficient for future missions. Interest in self‑reliant rovers is increasing; however, the research community has not yet given significant attention to autonomous, adaptive decision‑making. In this paper, we look back at specific planetary mobility operations where human‑guided adaptive planning played an important role in mission safety and productivity. Inspired by the abilities of human experts, we identify shortcomings of existing autonomous mobility algorithms for robots operating in off‑road environments like planetary surfaces. We advocate for adaptive decision‑making capabilities such as unassisted learning from past experiences and more reliance on stochastic world models. The aim of this work is to highlight promising research avenues to enhance ground planning tools and, ultimately, long‑range autonomy algorithms on board planetary rovers.

Abstract:
Despite the complexity of quantum systems in the real world, models with just a few effective many‑body states often suffice to describe their quantum dynamics, provided decoherence is accounted for. We show that a machine learning algorithm is able to construct such models, given a straightforward set of quantum dynamics measurements. The effective Hilbert space can be a black box, with variations of the coupling to just one accessible output state being sufficient to generate the required training data. We demonstrate through simulations of a Markovian open quantum system that a neural network can automatically detect the number N of effective states and the most relevant Hamiltonian terms and state‑dephasing processes and rates. For systems with N ≤ 5 we find typical mean relative errors of predictions in the 10% range. With more advanced networks and larger training sets, it is conceivable that a future single software can provide the automated first stop solution to model building for an unknown device or system, complementing and validating the conventional approach based on physical insight into the system.

Abstract:
Optimal decision‑making under partial observability requires reasoning about the uncertainty of the environment's hidden state. However, most reinforcement learning architectures handle partial observability with sequence models that have no internal mechanism to incorporate uncertainty in their hidden state representation, such as recurrent neural networks, deterministic state‑space models and transformers. Inspired by advances in probabilistic world models for reinforcement learning, we propose a standalone Kalman filter layer that performs closed‑form Gaussian inference in linear state‑space models and train it end‑to‑end within a model‑free architecture to maximize returns. Similar to efficient linear recurrent layers, the Kalman filter layer processes sequential data using a parallel scan, which scales logarithmically with the sequence length. By design, Kalman filter layers are a drop‑in replacement for other recurrent layers in standard model‑free architectures, but importantly they include an explicit mechanism for probabilistic filtering of the latent state representation. Experiments in a wide variety of tasks with partial observability show that Kalman filter layers excel in problems where uncertainty reasoning is key for decision‑making, outperforming other stateful models.
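
For readers unfamiliar with closed-form Gaussian inference in a linear state-space model, a scalar predict/update step is sketched below. This is the textbook Kalman filter run sequentially, not the paper's layer: the parallel-scan formulation and end-to-end training are not shown, and the parameter names and default values (`a`, `q`, `c`, `r`) are assumptions chosen for illustration.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.1, c=1.0, r=0.5):
    """One predict/update step of a scalar Kalman filter.

    Model assumed here: x_t = a * x_{t-1} + noise(q), y_t = c * x_t + noise(r),
    where q and r are noise variances.
    """
    # Predict: push the Gaussian belief through the linear dynamics.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: condition the predicted belief on the new observation.
    innovation = obs - c * pred_mean
    innovation_var = c * c * pred_var + r
    gain = pred_var * c / innovation_var
    new_mean = pred_mean + gain * innovation
    new_var = (1.0 - gain * c) * pred_var
    return new_mean, new_var

# Filter a short observation sequence starting from a N(0, 1) belief.
mean, var = 0.0, 1.0
for obs in [1.0, 1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, obs)
```

The returned variance is the explicit uncertainty estimate that deterministic recurrent layers lack; a policy network can consume both the mean and the variance of the latent state.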

Abstract:
We propose the use of latent space generative world models to address the covariate shift problem in autonomous driving. A world model is a neural network capable of predicting an agent's next state given past states and actions. By leveraging a world model during training, the driving policy effectively mitigates covariate shift without requiring an excessive amount of training data. During end‑to‑end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations, so that at runtime it can recover from perturbations outside the training distribution. Additionally, we introduce a novel transformer‑based perception encoder that employs multi‑view cross‑attention and a learned scene query. We present qualitative and quantitative results, demonstrating significant improvements upon prior state of the art in closed‑loop testing in the CARLA simulator, as well as showing the ability to handle perturbations in both CARLA and NVIDIA's DRIVE Sim.

Abstract:
A World Model is a compressed spatial and temporal representation of a real world environment that allows one to train an agent or execute planning methods. However, world models are typically trained on observations from the real world environment, and they usually do not enable learning policies for other real environments. We propose One‑Shot World Model (OSWM), a transformer world model that is learned in an in‑context learning fashion from purely synthetic data sampled from a prior distribution. Our prior is composed of multiple randomly initialized neural networks, where each network models the dynamics of each state and reward dimension of a desired target environment. We adopt the supervised learning procedure of Prior‑Fitted Networks by masking next‑state and reward at random context positions and querying OSWM to make probabilistic predictions based on the remaining transition context. During inference time, OSWM is able to quickly adapt to the dynamics of a simple grid world, as well as the CartPole gym and a custom control environment by providing 1k transition steps as context and is then able to successfully train environment‑solving agent policies. However, transferring to more complex environments currently remains a challenge. Despite these limitations, we see this work as an important stepping‑stone in the pursuit of learning world models purely from synthetic data.

Abstract:
Few‑shot adaptation is an important capability for intelligent robots that perform tasks in open‑world settings such as everyday environments or flexible production. In this paper, we propose a novel approach for non‑prehensile manipulation which incrementally adapts a physics‑based dynamics model for model‑predictive control (MPC). The model prediction is aligned with a few examples of robot‑object interactions collected with the MPC. This is achieved by using a parallelizable rigid‑body physics simulation as dynamic world model and sampling‑based optimization of the model parameters. In turn, the optimized dynamics model can be used for MPC using efficient sampling‑based optimization. We evaluate our few‑shot adaptation approach in object pushing experiments in simulation and with a real robot.
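
The parameter-adaptation step can be illustrated with a toy example. The sketch below fits a single friction parameter of a hand-written 1‑D push model to a few observed pushes using a cross-entropy-style sampling optimizer. Everything here is an assumption made for illustration: the abstract does not name the optimizer, and `simulate_push`, `fit_friction`, and all hyperparameters are hypothetical stand-ins for the paper's rigid-body simulator and method.

```python
import random

def simulate_push(friction, push_force, steps=10, dt=0.1):
    """Toy 1-D stand-in for a physics simulator: distance a pushed block travels."""
    velocity, position = 0.0, 0.0
    for _ in range(steps):
        accel = push_force - friction * velocity  # viscous friction
        velocity += accel * dt
        position += velocity * dt
    return position

def fit_friction(observations, rng, iterations=20, population=64, elite=8):
    """Estimate friction from (push_force, observed_distance) pairs by sampling."""
    def loss(friction):
        return sum((simulate_push(friction, force) - dist) ** 2
                   for force, dist in observations)

    mean, std = 1.0, 1.0  # initial Gaussian over the parameter
    for _ in range(iterations):
        samples = [max(0.0, rng.gauss(mean, std)) for _ in range(population)]
        samples.sort(key=loss)
        best = samples[:elite]
        # Refit the sampling distribution to the best-scoring parameters.
        mean = sum(best) / elite
        std = (sum((s - mean) ** 2 for s in best) / elite) ** 0.5 + 1e-6
    return mean

# A few "real" interactions generated with a friction value unknown to the optimizer.
true_friction = 2.0
few_shot_data = [(f, simulate_push(true_friction, f)) for f in (1.0, 2.0, 3.0)]
estimated = fit_friction(few_shot_data, random.Random(0))
```

Each candidate parameter is evaluated independently, so the inner loop parallelizes naturally, which matches the abstract's emphasis on a parallelizable simulation.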

Abstract:
World models, which encapsulate the dynamics of how actions affect environments, are foundational to the functioning of intelligent agents. In this work, we explore the potential of Large Language Models (LLMs) to operate as world models. Although LLMs are not inherently designed to model real‑world dynamics, we show that they can be induced to perform two critical world model functions: determining the applicability of an action based on a given world state, and predicting the resulting world state upon action execution. This is achieved by fine‑tuning two separate LLMs, one for precondition prediction and another for effect prediction, while leveraging synthetic data generation techniques. Through human‑participant studies, we validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics. We also analyze the extent to which the world model trained on our synthetic data results in an inferred state space that supports the creation of action chains, a necessary property for planning.

Abstract:
Object manipulation capabilities are essential skills that set apart embodied agents engaging with the world, especially in the realm of robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model‑based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. Analyzing this limitation, we trace the underperformance to the way current world models represent crucial positional information, especially about the target's goal specification for object positioning tasks. We introduce a general approach that empowers world model‑based agents to effectively solve object‑positioning tasks. We propose two variants of this approach for generative world models: position‑conditioned (PCP) and latent‑conditioned (LCP) policy learning. In particular, LCP employs object‑centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model‑based control approaches.

Abstract:
Vision‑only end‑to‑end autonomous driving is not only more cost‑effective compared to LiDAR‑vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision‑only end‑to‑end autonomous driving framework, which generates 3D occupancy labels using a self‑supervised Gaussian‑based Img2Occ Module, then encodes the labels by AM‑VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF‑based methods. By applying AM‑VAE to encode air and non‑air separately, RenderWorld achieves more fine‑grained scene element representation, leading to state‑of‑the‑art performance in both 4D occupancy forecasting and motion planning with an autoregressive world model.

Abstract:
The rapid development of artificial intelligence technologies, particularly Large Language Models (LLMs), has revolutionized the landscape of lifelong learning. This paper introduces a conceptual framework for a self‑constructed lifelong learning environment supported by LLMs. It highlights the inadequacies of traditional education systems in keeping pace with the rapid deactualization of knowledge and skills. The proposed framework emphasizes the transformation from institutionalized education to personalized, self‑driven learning. It leverages the natural language capabilities of LLMs to provide dynamic and adaptive learning experiences, facilitating the creation of personal intellectual agents that assist in knowledge acquisition. The framework integrates principles of lifelong learning, including the necessity of building personal world models, the dual modes of learning (training and exploration), and the creation of reusable learning artifacts. Additionally, it underscores the importance of curiosity‑driven learning and reflective practices in maintaining an effective learning trajectory. The paper envisions the evolution of educational institutions into "flipped" universities, focusing on supporting global knowledge consistency rather than merely structuring and transmitting knowledge.

Abstract:
This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard‑prone environments with keep‑out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro‑symbolic system designed for interpretable UAV search and navigation in realistic scenarios. NEUSIS integrates neuro‑symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms a state‑of‑the‑art (SOTA) vision‑language model and a SOTA search planning model in success rate, search efficiency, and 3D localization. These results demonstrate the effectiveness of our compositional neuro‑symbolic approach in handling complex, real‑world scenarios, making it a promising solution for autonomous UAV systems in search missions.

Abstract:
The rise of multi‑modal large language models (MLLMs) has spurred their application in autonomous driving. Recent MLLM‑based methods produce actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on a 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy‑language‑action generative world model, which uses semantic occupancy as a general visual representation and unifies vision‑language‑action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE‑like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, accounting for their sparsity and class imbalance. Then, we build a unified multi‑modal vocabulary for vision, language, and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.