Chen GENG「耿 晨
My first name is Chen, and my last name is Geng.
I prefer to be addressed by my first name Chen.
Possible pronunciation: Chen (ch-uhn) Geng (guh-ng).
」
We humans live in a physical world, where depictions of reality captured through camera lenses can be seen as visual representations rendered by (imaginary) underlying graphics engines. These engines can be modeled as physically based engines (rasterization, volume rendering, NeRF, etc.), statistical generative engines (GANs, diffusion models, etc.), or a combination of both. My current research interest lies in teaching machines to perceive and understand such a physical world by inverting these forward graphics engines, enabling them to reason about the world structurally, intrinsically, and in a self-supervised manner, just as we humans naturally do.
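To make "inverting the forward graphics engine" concrete, here is a minimal, self-contained sketch (plain NumPy, toy inputs) of one such forward model: NeRF-style volume rendering along a single ray. An inverse approach would fit the per-sample densities and colors by gradient descent so that rendered pixels match observed photographs. Everything below is illustrative, not code from any of my projects.

import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-sample densities/colors along a ray (NeRF-style quadrature).

    densities: (N,) non-negative sigma values along the ray
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)                        # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # accumulated transmittance
    weights = trans * alphas                                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                    # rendered pixel color

# Toy usage: render one ray with 64 samples of a made-up medium.
n = 64
rgb = volume_render(
    densities=np.linspace(0.0, 2.0, n),
    colors=np.tile([0.8, 0.5, 0.3], (n, 1)),
    deltas=np.full(n, 1.0 / n),
)
print(rgb)  # an inverse method would adjust densities/colors to reproduce observed pixels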
Previously, I received my bachelor's degree in Computer Science from Zhejiang University in 2023, with an honors degree from Chu Kochen Honors College. During my undergraduate studies, I was privileged to work closely with Prof. Xiaowei Zhou and Prof. Sida Peng on several research projects. I also spent a wonderful summer at Stanford with the CogAI group in 2022.
If you have shared research interests or have any topics you'd like to chat about — especially if you're from underrepresented groups — don't hesitate to shoot me an email. I'm always up for exploring potential collaborations and/or engaging in insightful conversations.
Email: X × Y, where X = {gengchen}, Y = {@cs.stanford.edu}
tl;dr: We decompose the shading of objects into a tree-structured representation, which can be edited or interpreted by users easily.
Abstract: We study the problem of obtaining a tree-structured representation of object shading. Prior work typically models shading with parametric or measured representations, which are neither interpretable nor easily editable. Our method instead uses the shade tree representation, which combines basic shading nodes with compositing methods, to model and decompose material shading. Such a representation enables users, including novices unfamiliar with constructing shade trees, to edit previously rigid material appearances and understand object shading in an efficient and intuitive manner. The biggest challenge in this task is that the discrete structure of the shade tree is not differentiable. We propose a hybrid algorithm to address this issue: given an input image, a recursive amortized inference model first initializes a guess of the tree structure and the corresponding leaf-node parameters, and an optimization-based method then fine-tunes the result. Experiments show that our method works well on synthetic images, realistic images, and non-realistic vector drawings, surpassing the baselines significantly.
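For intuition, here is a rough sketch of what a shade-tree representation can look like as a data structure: leaf shading nodes combined by compositing operators and evaluated recursively into a final shading image. The node types, operators, and toy values are hypothetical illustrations rather than the paper's actual implementation, and the hybrid amortized-inference plus optimization pipeline itself is not shown.

from dataclasses import dataclass
from typing import Callable, List, Union
import numpy as np

@dataclass
class Leaf:
    # A basic shading node, e.g. a constant albedo or a highlight layer.
    value: np.ndarray  # (H, W, 3) shading component

@dataclass
class Composite:
    # An interior node combining child shadings with an operator (add, multiply, mix, ...).
    op: Callable[[np.ndarray, np.ndarray], np.ndarray]
    children: List["Node"]

Node = Union[Leaf, Composite]

def evaluate(node: Node) -> np.ndarray:
    """Recursively evaluate the shade tree into the final shading image."""
    if isinstance(node, Leaf):
        return node.value
    out = evaluate(node.children[0])
    for child in node.children[1:]:
        out = node.op(out, evaluate(child))
    return out

# Toy tree on a 2x2 image: (diffuse * albedo) + highlight.
H = W = 2
diffuse = Leaf(np.full((H, W, 3), 0.6))
albedo = Leaf(np.full((H, W, 3), [0.8, 0.3, 0.2]))
highlight = Leaf(np.full((H, W, 3), 0.1))
tree = Composite(np.add, [Composite(np.multiply, [diffuse, albedo]), highlight])
print(evaluate(tree))  # editing any leaf or operator and re-evaluating changes the appearance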
tl;dr: We accelerate the learning of neural volumetric videos of dynamic humans by over 100 times.
Abstract: This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos. Some recent works represent the dynamic human as a canonical neural radiance field (NeRF) and a motion field, which are learned from videos through differentiable rendering but generally require a lengthy optimization process. Other generalization methods leverage priors learned from datasets and reduce the optimization time by only fine-tuning on new scenes, at the cost of visual fidelity. In this paper, we propose a novel method for free-viewpoint human performance synthesis from sparse-view videos in minutes with competitive visual quality. Specifically, we leverage the prior knowledge of the human body to define a novel part-based voxelized NeRF representation, which distributes the representational power of the canonical human model efficiently. Furthermore, we propose a novel dimensionality-reduced 2D motion parameterization scheme to increase the convergence rate of the human deformation field. Experiments demonstrate that our approach can be trained 100 times faster than prior per-scene optimization methods while remaining competitive in rendering quality. We show that given a 100-frame video of a human performer, our model typically takes about 5 minutes of training on a single RTX 3090 GPU to produce photorealistic free-viewpoint videos. The code will be released for reproducibility.
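As a rough illustration of the part-based voxelized representation idea, the sketch below assigns each body part a small feature voxel grid in canonical space and looks up a canonical point in its nearest part's grid, so representational capacity is spent only near the body. All class names, grid resolutions, and the nearest-voxel lookup are hypothetical simplifications, not the released code.

import numpy as np

class PartVoxels:
    def __init__(self, num_parts=16, res=32, feat_dim=8, part_size=0.4):
        # one low-resolution feature grid per body part
        self.grids = np.random.randn(num_parts, res, res, res, feat_dim) * 0.01
        self.centers = np.random.rand(num_parts, 3)  # placeholder canonical part centers
        self.res, self.part_size = res, part_size

    def query(self, x):
        """Nearest-voxel lookup of canonical point x in its closest part's grid."""
        part = int(np.argmin(np.linalg.norm(self.centers - x, axis=1)))
        local = (x - self.centers[part]) / self.part_size + 0.5        # map into the part's unit box
        idx = np.clip((local * self.res).astype(int), 0, self.res - 1)
        return self.grids[part, idx[0], idx[1], idx[2]]                # feature -> small MLP -> (sigma, rgb)

rep = PartVoxels()
print(rep.query(np.array([0.5, 0.5, 0.5])).shape)  # (8,) feature vector for a canonical-space point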
tl;dr: Given sparse multi-view videos of crowded scenes with multiple human performers, our approach is able to generate high-fidelity novel views and accurate instance masks.
@inproceedings{multinb,
     title = {Novel View Synthesis of Human Interactions from Sparse Multi-view Videos},
     author = {Shuai, Qing and Geng, Chen and Fang, Qi and Peng, Sida and Shen, Wenhao and Zhou, Xiaowei and Bao, Hujun},
     booktitle = {SIGGRAPH Conference Proceedings},
     year = {2022},
}
Experience 🧑‍🎓
Stanford University 2023 - Present, Stanford, California