CVPR 2026

NeuROK: Generative 4D Neural Object Kinematics

1Stanford University   ·   2University of Cambridge   ·   3Cornell University

NeuROK is a neural simulation framework that turns any 3D shape into an interactable 4D asset — with no physical annotations and no category-specific structural assumptions. Taking Lagrangian mechanics as its minimal inductive bias for simulation, it uses a large pre-trained transformer to predict the kinematic state space of an input 3D object, then simulates motion directly in that latent state space by solving an ODE.
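
As a rough illustration of what "simulating in a low-dimensional state space by solving an ODE" can look like, the toy sketch below integrates the Euler-Lagrange equations for a Lagrangian L(q, q') = 1/2 Σ m_k q_k'² − V(q) and decodes each latent state into 3D points. Everything here is an assumption made for illustration: the two latent coordinates, the linear decoder, the quadratic potential, and all names (`simulate_latent`, `decode`, `REST`, `MODES`, `MASS`, `STIFFNESS`). In NeuROK the state space is predicted by a pre-trained transformer, so this is not the paper's model.

```python
from typing import Callable, List, Sequence, Tuple

Vec = List[float]
Point = Tuple[float, float, float]

# Hypothetical toy asset: three points along a vertical stem.
REST: List[Point] = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
# Hypothetical linear decoder: one displacement field per latent coordinate.
MODES: List[List[Point]] = [
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)],  # bend along x
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.5), (0.0, 0.0, 1.0)],  # bend along z
]
MASS: Vec = [1.0, 2.0]       # generalized masses m_k (assumed diagonal)
STIFFNESS: Vec = [4.0, 1.0]  # assumed potential V(q) = 1/2 * sum_k k_k * q_k^2


def grad_potential(q: Sequence[float]) -> Vec:
    """Gradient of the assumed quadratic potential V(q)."""
    return [k * qi for k, qi in zip(STIFFNESS, q)]


def simulate_latent(
    q0: Sequence[float],
    v0: Sequence[float],
    mass: Sequence[float],
    grad_v: Callable[[Sequence[float]], Vec],
    dt: float,
    steps: int,
) -> List[Tuple[Vec, Vec]]:
    """Integrate m_k * q_k'' = -dV/dq_k with semi-implicit (symplectic) Euler."""
    q, v = list(q0), list(v0)
    states = [(list(q), list(v))]
    for _ in range(steps):
        g = grad_v(q)
        v = [vi - dt * gi / mi for vi, gi, mi in zip(v, g, mass)]  # velocity first
        q = [qi + dt * vi for qi, vi in zip(q, v)]                 # then position
        states.append((list(q), list(v)))
    return states


def decode(rest: Sequence[Point], modes: Sequence[Sequence[Point]], q: Sequence[float]) -> List[Point]:
    """Map a latent state to 3D points: x_i = rest_i + sum_k q_k * modes[k][i]."""
    return [
        tuple(p[a] + sum(qk * modes[k][i][a] for k, qk in enumerate(q)) for a in range(3))
        for i, p in enumerate(rest)
    ]


def energy(q: Sequence[float], v: Sequence[float], mass: Sequence[float], stiffness: Sequence[float]) -> float:
    """Total energy (kinetic + potential) of the toy system."""
    kinetic = 0.5 * sum(m * vi * vi for m, vi in zip(mass, v))
    potential = 0.5 * sum(k * qi * qi for k, qi in zip(stiffness, q))
    return kinetic + potential


# Initial physical condition: rest pose plus an initial latent velocity (a "poke").
states = simulate_latent([0.0, 0.0], [1.0, 0.5], MASS, grad_potential, dt=0.01, steps=500)
# Decoding every latent state yields the 4D sequence: time x points x 3.
frames = [decode(REST, MODES, q) for q, _ in states]
```

Decoding each latent state gives one frame, so `frames` is a sequence of point sets over time. The semi-implicit update is used here only because it keeps the energy of this toy oscillator bounded over long rollouts; any standard ODE solver could take its place.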

Problem Setting

Generating simulative 4D dynamics from a static shape

Given a static 3D object specified by shape alone and an initial physical condition, we study how to generate its simulative 4D dynamics: plausible temporal deformations of the object under that condition.

Application

Scan your room and make it interactable!

Our method is robust enough to turn real scanned 3D objects into interactive 4D objects.

Interacting with real objects in the Gates building

3D Scan · Smartphone
Generated Output · Gates Building

We scan each scene and perform post-processing, including object segmentation, to obtain the model inputs.

Interacting with real objects in your kitchen

3D Scan · Smartphone
Generated Output · Real Kitchen

Interacting with real objects in your office

3D Scan · Smartphone
Generated Output · Real Office

Model Prediction

Unified model for diverse phenomena

From static, shape-only 3D assets without any dynamic annotations, our pipeline can build a 3D world that supports diverse user interactions.

Our method uses only the minimal inductive bias of Lagrangian mechanics and assumes no object category or dynamic structure — so the same unified model can be applied to a diverse range of objects.

Headphones, flowers, Newton's cradle & more in your office

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D office to form an interactive 3D world.

Curtains, oral rinse, faucets & more in your bathroom

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D bathroom to form an interactive 3D world.

Sponges, kettles, microwaves & more in your kitchen

Input: 3D shapes of static objects and initial physical conditions
Generated Output

We insert the generated 4D objects into a 3D kitchen to form an interactive 3D world.

Comparisons

We show video results comparing our method with baselines on physically-inspired 4D generation.

Newton's Cradle: input and comparison

Note that the goal of this paper is to generate one plausible 4D sequence that satisfies one valid physical configuration and conforms to human physical intuition.

BibTeX

@InProceedings{Geng_2026_CVPR,
    author    = {Geng, Chen and He, Guangzhao and Gao, Yue and Zhang, Yunzhi and Wu, Shangzhe and Wu, Jiajun},
    title     = {{NeuROK}: Generative 4D Neural Object Kinematics},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39239-39251}
}

Acknowledgements

This work is supported in part by NSF RI #2211258 and #2338203, ONR YIP N00014-24-12117, ONR MURI N00014-22-1-2740, the Stanford Institute for Human-Centered AI (HAI), and the Magic Grant from the Brown Institute for Media Innovation.

We acknowledge compute support from the NSF ACCESS program #CIS250696, Stanford Data Science and Marlowe Computing Platform, and the AMD University Program for AI & HPC Cluster.

We thank Robyn Lockwood (Stanford Language Center) for editorial and writing suggestions that improved the clarity of the manuscript. We thank Chong Zeng and Ruocheng Wang for early feedback on the manuscript, and members of the Stanford Vision and Learning Lab and the Stanford Graphics Lab for fruitful discussions.