I work at the intersection of multimodal learning, 3D human modeling, and generative AI, advancing neural avatars and virtual humans and replicating aspects of the human brain. I serve as an AI Researcher at VOLV AI, where I direct the machine learning team exploring 3D computer vision solutions for virtual try-on and digital human synthesis. Previously, I worked as a Research AI Engineer at BHuman AI, leading research and development for the company's core generative AI products.
tldr: An embedding-driven approach that combines an audio encoder with a multimodal projector to enable direct speech-to-text processing, achieving strong performance while training only a small number of parameters through block-wise optimization.
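A minimal sketch of the embedding-driven idea, under the assumption of the usual frozen-backbone setup (all module names and sizes here are illustrative, not the actual code): a frozen audio encoder produces speech features, a small trainable projector maps them into the LLM's embedding space, and only that projector block is optimized.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Maps audio-encoder features into the LLM's token-embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)  # (B, T, llm_dim)

audio_encoder = nn.GRU(80, 512, batch_first=True)  # stand-in for a speech encoder
projector = MultimodalProjector(audio_dim=512, llm_dim=1024)

# "Block optimization": freeze everything except the projector block.
for p in audio_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

mels = torch.randn(2, 100, 80)       # (B, audio frames, mel bins)
feats, _ = audio_encoder(mels)
soft_tokens = projector(feats)       # fed to the frozen LLM as input embeddings
```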
tldr: An interactive 3D body modeling system that allows real-time manipulation of human body shapes through intuitive measurement sliders, with immediate visual feedback in a fully navigable 3D environment.
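A hypothetical sketch of how slider-driven shape editing can work (the mapping and basis below are placeholders, assuming an SMPL-like PCA shape space rather than the system's actual pipeline): measurement offsets are mapped linearly to shape coefficients, so every slider move yields an immediate mesh update.

```python
import numpy as np

N_VERTS, N_BETAS, N_MEASUREMENTS = 6890, 10, 4  # SMPL-like sizes, assumed

template = np.zeros((N_VERTS, 3))                          # mean body mesh
shape_basis = np.random.randn(N_BETAS, N_VERTS, 3) * 1e-3  # placeholder PCA basis
# Learned or fitted mapping from measurement offsets (cm) to shape coefficients.
meas_to_betas = np.random.randn(N_MEASUREMENTS, N_BETAS) * 0.1

def mesh_from_sliders(slider_offsets_cm: np.ndarray) -> np.ndarray:
    """Rebuild the body mesh from raw slider values (one per measurement)."""
    betas = slider_offsets_cm @ meas_to_betas                    # (N_BETAS,)
    return template + np.tensordot(betas, shape_basis, axes=1)  # (N_VERTS, 3)

verts = mesh_from_sliders(np.array([3.0, -2.0, 0.0, 1.5]))  # e.g. +3cm height, -2cm waist
```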
tldr: VMVLM enhances vision-language models with dual visual pathways, combining Q-Former queries with direct ViT feature injection for improved multimodal instruction following.
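A rough sketch of the dual-pathway idea (module sizes are illustrative and the fusion is simplified): a small set of learned Q-Former queries cross-attends to ViT patch features, while the raw ViT features are also projected and injected directly, so the language model sees both compressed and unfiltered visual signals.

```python
import torch
import torch.nn as nn

vit_dim, lm_dim, n_queries = 768, 1024, 32

cross_attn = nn.MultiheadAttention(vit_dim, num_heads=8, batch_first=True)
queries = nn.Parameter(torch.randn(1, n_queries, vit_dim))  # learned Q-Former queries
q_proj = nn.Linear(vit_dim, lm_dim)       # pathway 1: Q-Former output -> LM space
direct_proj = nn.Linear(vit_dim, lm_dim)  # pathway 2: direct ViT feature injection

patches = torch.randn(2, 197, vit_dim)    # (B, patches, dim) from a frozen ViT

q_out, _ = cross_attn(queries.expand(2, -1, -1), patches, patches)
visual_tokens = torch.cat([q_proj(q_out), direct_proj(patches)], dim=1)
# (B, 32 + 197, lm_dim), prepended to the language model's text tokens
```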
tldr: A unified framework for 3D virtual try-on that transforms simple 2D images into realistic 3D representations by efficiently integrating clothing with the human body in a pose-adaptive manner.
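One common pose-adaptive strategy, sketched here as an assumption rather than the framework's actual method: each garment vertex borrows the skinning transform of its nearest body vertex, so the clothing follows the body into any pose.

```python
import numpy as np

def drape_garment(garment_v, body_v, body_transforms):
    """garment_v: (G,3); body_v: (N,3); body_transforms: (N,4,4) per-vertex
    skinning matrices from the posed body model."""
    # Nearest body vertex for every garment vertex (brute force for clarity).
    d = np.linalg.norm(garment_v[:, None, :] - body_v[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)                                    # (G,)
    homo = np.concatenate([garment_v, np.ones((len(garment_v), 1))], axis=1)
    posed = np.einsum('gij,gj->gi', body_transforms[nearest], homo)
    return posed[:, :3]

posed_garment = drape_garment(
    np.random.rand(500, 3), np.random.rand(6890, 3),
    np.tile(np.eye(4), (6890, 1, 1)),  # identity transforms for the demo
)
```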
tldr: A two-stage unified audio-driven talking face generation framework that renders high-fidelity, lip-synchronized videos with improved inference speed.
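An illustrative two-stage decomposition (names and sizes assumed, not the framework's published modules): stage one maps audio to a compact facial-motion latent, stage two renders frames from a reference image conditioned on that latent; decoupling the stages lets the lightweight motion predictor run fast at inference time.

```python
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):        # stage 1: audio -> motion latent
    def __init__(self, mel_bins=80, motion_dim=64):
        super().__init__()
        self.rnn = nn.GRU(mel_bins, 128, batch_first=True)
        self.head = nn.Linear(128, motion_dim)
    def forward(self, mels):
        h, _ = self.rnn(mels)
        return self.head(h)            # (B, T, motion_dim)

class MotionToFrame(nn.Module):        # stage 2: toy neural renderer
    def __init__(self, motion_dim=64):
        super().__init__()
        self.fuse = nn.Conv2d(3 + motion_dim, 3, kernel_size=3, padding=1)
    def forward(self, ref_img, motion):
        m = motion[:, :, None, None].expand(-1, -1, *ref_img.shape[2:])
        return self.fuse(torch.cat([ref_img, m], dim=1))

stage1, stage2 = AudioToMotion(), MotionToFrame()
motion = stage1(torch.randn(1, 25, 80))                    # 25 audio frames
frame0 = stage2(torch.randn(1, 3, 64, 64), motion[:, 0])   # render first frame
```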
tldr: Current image inpainting techniques are too heavy; this paper introduces the Row-wise Flat Pixel LSTM, a small hybrid model for efficient, high-quality restoration of small images.
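A minimal sketch of the row-wise idea, assuming the straightforward reading of the name (the hybrid convolutional part that would mix information across rows is omitted for brevity): each image row is flattened into a pixel sequence and processed by one shared LSTM, which keeps the parameter count small while propagating context along scanlines.

```python
import torch
import torch.nn as nn

class RowWisePixelLSTM(nn.Module):
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, channels)

    def forward(self, x):              # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one sequence per row
        feats, _ = self.lstm(rows)
        y = self.out(feats).reshape(b, h, w, c).permute(0, 3, 1, 2)
        return y                       # restored image, same size as input

restored = RowWisePixelLSTM()(torch.randn(1, 3, 32, 32))
```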
tldr: A robust and efficient talking face generation model with highly accurate lip synchronization and full facial expressiveness, supporting longer audio inputs and higher video resolutions.
tldr: An unsupervised one-shot talking head video generation model that uses neural rendering and motion transfer with non-linear transformations to animate static images.
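A schematic of keypoint-driven motion transfer, in the spirit of first-order-motion methods (an assumed formulation, not the model's actual one): sparse keypoints detected on the static source and on each driving frame define a dense, non-linear warp field that animates the source image via grid sampling.

```python
import torch
import torch.nn.functional as F

def warp_from_keypoints(source, kp_src, kp_drv, sigma=0.1):
    """source: (B,3,H,W); kp_src/kp_drv: (B,K,2) in [-1,1] coordinates."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    # Gaussian weights around each driving keypoint blend per-keypoint shifts.
    diff = grid[:, :, :, None, :] - kp_drv[:, None, None, :, :]  # (B,H,W,K,2)
    wgt = torch.softmax(-(diff ** 2).sum(-1) / sigma, dim=-1)    # (B,H,W,K)
    shift = (kp_src - kp_drv)[:, None, None, :, :]               # (B,1,1,K,2)
    flow = grid + (wgt[..., None] * shift).sum(dim=3)            # non-linear warp
    return F.grid_sample(source, flow, align_corners=True)

frame = warp_from_keypoints(
    torch.randn(1, 3, 64, 64),
    torch.rand(1, 10, 2) * 2 - 1,   # source keypoints
    torch.rand(1, 10, 2) * 2 - 1)   # driving keypoints
```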
tldr: An innovative face-swapping model that accurately preserves source identity features while seamlessly adapting target attributes, applicable to both images and videos.
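A hedged sketch of one widespread face-swap design, offered as an assumption rather than this model's published architecture: an identity embedding extracted from the source face modulates the generator through adaptive instance normalization, while spatial features from the target frame carry pose, expression, and lighting.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Injects a source-identity embedding into target attribute features."""
    def __init__(self, channels, id_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_bias = nn.Linear(id_dim, channels * 2)

    def forward(self, feat, id_emb):
        scale, bias = self.to_scale_bias(id_emb).chunk(2, dim=1)
        return (self.norm(feat) * (1 + scale[:, :, None, None])
                + bias[:, :, None, None])

id_emb = torch.randn(1, 512)               # source identity (e.g. from a face recognizer)
target_feat = torch.randn(1, 64, 32, 32)   # target-frame attribute features
swapped = AdaIN(64)(target_feat, id_emb)   # identity injected, attributes kept
```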