CVPR 2026

NeuROK: Generative 4D Neural Object Kinematics

¹Stanford University   ·   ²University of Cambridge   ·   ³Cornell University

NeuROK is a neural simulation framework that turns any 3D shape into an interactable 4D asset — with no physical annotations and no category-specific structural assumptions. Taking Lagrangian mechanics as its minimal inductive bias for simulation, it uses a large pre-trained transformer to predict the kinematic state space of an input 3D object, then simulates motion directly in that latent state space by solving an ODE.

Problem Setting

Generating simulative 4D dynamics from a static shape

Given a shape-only static 3D object and an initial physical condition, we study how to generate its simulative 4D dynamics: plausible temporal deformations of the object under the specified physical condition.

Application

Scan your room and make it interactable!

Our method is robust enough to turn real scanned 3D objects into interactive 4D objects.

Interacting with real objects in a Stanford office

3D Scan · Smartphone
Generated Output · Stanford Office

We scan each scene and perform post-processing, including object segmentation, to obtain the model inputs.

Interacting with real objects in a Stanford office kitchen

3D Scan · Smartphone
Generated Output · Stanford Office Kitchen

Interacting with real objects in an apartment kitchen

3D Scan · Smartphone
Generated Output · Apartment Kitchen

Interacting with real objects in a Cornell office

3D Scan · Smartphone
Generated Output · Cornell Office

Model Prediction

Unified model for diverse phenomena

Starting from shape-only static 3D assets without any dynamic annotations, our pipeline can build a 3D world that supports diverse user interactions.

Our method uses only the minimal inductive bias of Lagrangian mechanics and assumes no object category or dynamic structure — so the same unified model can be applied to a diverse range of objects.

Headphones, flowers, Newton's cradle & more in your office

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D office to form an interactive 3D world.

Curtains, oral rinse, faucets & more in your bathroom

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D bathroom to form an interactive 3D world.

Sponges, kettles, microwaves & more in your kitchen

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D kitchen to form an interactive 3D world.

Method

Simpler coordinates, simpler dynamics

NeuROK's idea is to simulate inside a learned latent state space of the object.
It draws on Lagrangian mechanics: with the right choice of coordinates, a hard dynamics problem becomes a simple one.
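
As a textbook illustration of this point (not taken from the paper), consider a planar pendulum of mass $m$ and length $\ell$, with the angle $\theta$ measured from the downward vertical. In Cartesian coordinates the bob has to satisfy the constraint $x^2 + y^2 = \ell^2$ at every instant. In the single generalized coordinate $\theta$ the constraint disappears, and the Euler–Lagrange equation reduces the problem to one scalar ODE:

```latex
\begin{aligned}
L(\theta, \dot\theta) &= \tfrac{1}{2} m \ell^2 \dot\theta^2 + m g \ell \cos\theta, \\
\frac{d}{dt}\frac{\partial L}{\partial \dot\theta} - \frac{\partial L}{\partial \theta} &= 0
\;\Longrightarrow\; m \ell^2 \ddot\theta + m g \ell \sin\theta = 0
\;\Longrightarrow\; \ddot\theta = -\frac{g}{\ell} \sin\theta .
\end{aligned}
```

Here $\theta$ plays the role that the learned latent state plays in NeuROK; the symbols $m$, $\ell$, $g$, and $\theta$ are introduced only for this example.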

Encoding kinematics

NeuROK learns this latent space from data: it captures the object's possible states, and a decoder maps any latent vector to a valid deformation. Below, we make this tangible: sweeping across a 2D slice of the space for an eyeglass, we decode each latent vector into 3D on the fly.

Interactive figure: 2D slice of the latent space and the decoded 3D shape
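
The sweep described above can be sketched in a few lines. This is a toy illustration only: NeuROK's decoder is a learned neural network, whereas the `decode` function below is a hypothetical linear stand-in (rest shape plus two hand-written deformation modes), and the names `REST`, `MODES`, and `sweep` are assumptions made for this example.

```python
# Hypothetical rest shape: four 3D points standing in for an eyeglass mesh.
REST = [(-1.0, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]

# Two hypothetical deformation modes: one displacement per point per latent axis.
MODES = [
    [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
]


def decode(z):
    """Map a 2D latent vector to deformed points: rest + sum_k z[k] * mode_k."""
    return [
        tuple(
            p[i] + sum(z[k] * MODES[k][j][i] for k in range(len(z)))
            for i in range(3)
        )
        for j, p in enumerate(REST)
    ]


def sweep(lo=-1.0, hi=1.0, n=5):
    """Decode every latent vector on an n-by-n grid covering [lo, hi]^2."""
    ticks = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return {(a, b): decode((a, b)) for a in ticks for b in ticks}


shapes = sweep()  # maps each latent grid point to its decoded point set
```

In this stand-in, the latent vector (0.0, 0.0) decodes to the rest shape, and moving along either latent axis lifts one end point, which mimics how dragging in the latent slice changes the decoded geometry.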

Solving dynamics on a latent state space

Simulating is then straightforward: a single equation of motion (the Euler–Lagrange equation) handles every kind of object. As the eyeglass is dropped, its latent vector follows a path over time; decoding that path frame by frame gives the full 3D motion.

Interactive figure: latent trajectory and simulated drop
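
The simulate-then-decode loop can be sketched with a pendulum standing in for the object. This is a minimal sketch under stated assumptions, not NeuROK's implementation: the latent state is a single angle `z`, the hypothetical `decode` function is a closed-form map rather than a learned network, the equation of motion is the pendulum's Euler–Lagrange equation, and velocity Verlet is one reasonable choice of ODE integrator.

```python
import math

G = 9.81      # gravitational acceleration (m/s^2)
LENGTH = 1.0  # pendulum length (m), hypothetical toy value


def decode(z):
    """Toy 'decoder': map the 1D latent state z (an angle) to a 2D point."""
    return (LENGTH * math.sin(z), -LENGTH * math.cos(z))


def acceleration(z):
    """Euler-Lagrange equation of the pendulum: z'' = -(g / l) * sin(z)."""
    return -(G / LENGTH) * math.sin(z)


def rollout(z0, v0, dt=1e-3, steps=2000):
    """Integrate the latent ODE with velocity Verlet; return the latent path."""
    z, v = z0, v0
    a = acceleration(z)
    path = [z]
    for _ in range(steps):
        z = z + v * dt + 0.5 * a * dt * dt   # position update
        a_new = acceleration(z)              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
        path.append(z)
    return path


latent_path = rollout(z0=math.radians(60.0), v0=0.0)
frames = [decode(z) for z in latent_path]  # decode the path frame by frame
```

All of the dynamics happen in the latent variable; the decoded frames stay at distance `LENGTH` from the pivot by construction, so no constraint has to be enforced during integration.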

Comparisons

We show video results comparing our method with baselines on physically inspired 4D generation.

Video comparison: Newton's Cradle (input and comparison with baselines)

Note that the goal of this paper is to generate one plausible 4D sequence that satisfies one valid physical configuration and conforms to human physical intuition.

BibTeX

@InProceedings{Geng_2026_CVPR,
    author    = {Geng, Chen and He, Guangzhao and Gao, Yue and Zhang, Yunzhi and Wu, Shangzhe and Wu, Jiajun},
    title     = {{NeuROK}: Generative 4D Neural Object Kinematics},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39239-39251}
}

Acknowledgements

This work is in part supported by NSF RI #2211258 and #2338203, ONR YIP N00014-24-12117, ONR MURI N00014-22-1-2740, the Stanford Institute for Human-Centered AI (HAI), and the Magic Grant from the Brown Institute for Media Innovation.

We acknowledge the compute support from the NSF ACCESS program #CIS250696, Stanford Data Science and Marlowe Computing Platform, and the AMD University Program for AI & HPC Cluster.

We thank Robyn Lockwood (Stanford Language Center) for editorial and writing suggestions that improved the clarity of the manuscript. We thank Chong Zeng and Ruocheng Wang for early feedback on the manuscript and members of Stanford Vision and Learning Lab and Stanford Graphics Lab for fruitful discussion.