I am currently the AI Lead at Eris AI, where I
drive research and engineering in multimodal and generative modeling to build foundation models for
agentic AI applications. I oversee model design, system architecture, and deployment of
next-generation interactive agents. Previously, I was an AI Researcher at VOLV AI and a Research AI Engineer at
BHuman AI.
Research Work
My work spans the intersection of multimodal learning, generative modeling, and
computational neuroscience, advancing neural avatars, virtual humans, and systems that
move from pattern recognition toward causal understanding of intelligent behavior.
tldr: An embedding-driven approach that combines audio encoders with multimodal
projectors to enable direct speech-to-text processing, achieving strong performance
while training only a minimal set of parameters through block optimization.
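As a rough illustration of this bridging idea, the sketch below maps frozen audio-encoder
features into an LLM's token-embedding space through a small trainable projector; the
module names and dimensions are my assumptions, not the paper's actual configuration.

import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    # Minimal sketch: map frozen audio-encoder features into an LLM's
    # token-embedding space; names and sizes are illustrative only.
    def __init__(self, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim) from a frozen encoder
        return self.proj(audio_feats)  # (batch, frames, llm_dim)

# Only the projector trains; the audio encoder and LLM stay frozen, e.g.:
# for p in audio_encoder.parameters(): p.requires_grad = False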
tldr: An interactive 3D body modeling system that allows real-time
manipulation of human body shapes through intuitive measurement sliders with immediate visual
feedback in a fully navigable 3D environment.
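The slider-to-mesh mapping can be pictured as a linear shape basis, in the spirit of
SMPL-style parametric body models; the sketch below uses a random, purely illustrative
basis and hypothetical slider semantics, not the system's actual model.

import numpy as np

n_vertices, n_sliders = 6890, 4  # hypothetical mesh size and slider count
mean_mesh = np.zeros((n_vertices, 3))
shape_basis = np.random.randn(n_sliders, n_vertices, 3) * 0.01  # illustrative only

def apply_sliders(slider_values):
    # slider_values: length-n_sliders array of normalized measurements
    # (e.g. height, waist); offsets are a linear blend of basis directions.
    offsets = np.tensordot(slider_values, shape_basis, axes=1)  # (n_vertices, 3)
    return mean_mesh + offsets

mesh = apply_sliders(np.array([0.5, -0.2, 0.0, 1.0]))  # immediate mesh update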
tldr: VMVLM enhances vision-language models by using dual visual pathways,
combining Q-Former queries with direct ViT feature injection for improved multimodal instruction
following.
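A minimal sketch of the dual-pathway idea, with learned queries attending over ViT patch
features alongside a direct feature injection; the fusion by concatenation and all
dimensions are assumptions rather than VMVLM's exact design.

import torch
import torch.nn as nn

class DualVisualPathway(nn.Module):
    # Illustrative: learned queries attend over ViT patch features
    # (Q-Former-style), while projected raw ViT features are injected
    # directly; both become visual tokens for the language model.
    def __init__(self, vit_dim=1024, llm_dim=4096, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vit_dim))
        self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads=8, batch_first=True)
        self.q_proj = nn.Linear(vit_dim, llm_dim)       # query pathway
        self.direct_proj = nn.Linear(vit_dim, llm_dim)  # direct ViT injection

    def forward(self, vit_feats):
        # vit_feats: (batch, patches, vit_dim)
        q = self.queries.unsqueeze(0).expand(vit_feats.size(0), -1, -1)
        q_out, _ = self.cross_attn(q, vit_feats, vit_feats)
        return torch.cat([self.q_proj(q_out), self.direct_proj(vit_feats)], dim=1)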
tldr: A unified framework for 3D virtual try-on that transforms simple 2D images into
realistic 3D representations by efficiently integrating clothing with the human body in
a pose-adaptive manner.
tldr: A two-stage, unified audio-driven talking face generation framework that renders
high-fidelity, lip-synchronized videos with improved inference speed.
tldr: Current image inpainting techniques are computationally heavy; this paper
introduces a Row-wise Flat Pixel LSTM, a compact hybrid model for efficient,
high-quality restoration of small images.
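A minimal sketch of the row-wise idea, treating each image row as a pixel sequence for
one shared LSTM; layer sizes and the bidirectional choice are illustrative assumptions,
not the paper's exact architecture.

import torch.nn as nn

class RowwisePixelLSTM(nn.Module):
    # Toy reading of the idea: flatten each image row into a pixel
    # sequence and run one shared LSTM, keeping the model small.
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, channels)

    def forward(self, x):
        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one sequence per row
        feats, _ = self.lstm(rows)
        pixels = self.out(feats)  # predict restored pixel values
        return pixels.reshape(b, h, w, c).permute(0, 3, 1, 2)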
tldr: A robust and efficient talking face generation model with highly accurate lip
synchronization and full facial expressiveness, supporting longer audio inputs and
higher video resolutions.
tldr: An unsupervised one-shot talking head video generation model that uses neural
rendering and motion transfer with non-linear transformations to animate static images.
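A generic sketch of the warping step common to such motion-transfer pipelines: a dense
flow field, assumed here to come from a separate motion network, deforms the static
source image non-linearly via grid sampling. This is not the paper's full pipeline.

import torch
import torch.nn.functional as F

def warp_with_motion(source_img, flow):
    # source_img: (batch, channels, h, w) static image; flow: (batch, h, w, 2)
    # offsets in normalized [-1, 1] coordinates, assumed to be predicted by
    # a motion network from the driving frames.
    b, _, h, w = source_img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # grid_sample applies the non-linear deformation to animate the image.
    return F.grid_sample(source_img, grid + flow, align_corners=True)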
tldr: An innovative face-swapping model that accurately preserves source identity
features while seamlessly adapting target attributes, applicable to both images and
videos.