I am currently the AI Lead at ErisAI, where I
drive research and engineering in multimodal and generative modeling to build foundation models for
agentic AI applications. I oversee model design, system architecture, and deployment of
next-generation interactive agents. Previously, I was an AI Researcher at VOLV AI and a Research AI Engineer at
BHuman AI.
Research Work
My work spans multimodal learning, generative modeling, and
intelligent systems, with a focus on building reliable AI that moves from research into real-world
deployment.
DarwinPatch is a budgeted repair-search controller for coding agents that converts failed patch
attempts into bounded evidence, routes follow-up repairs through hard verification gates, and
records auditable candidate lineage.
A research-first look at how Synapse combines retrieval, routing, memory,
verification, and runtime controls into a dependable applied intelligence system.
tldr: An embedding-driven approach that combines audio encoders with multimodal
projectors to enable direct speech-to-text processing, achieving strong performance while
training only a small number of parameters through block optimization.
tldr: An interactive 3D body modeling system that allows real-time
manipulation of human body shapes through intuitive measurement sliders with immediate visual
feedback in a fully navigable 3D environment.
tldr: VMVLM enhances vision-language models by using dual visual pathways,
combining Q-Former queries with direct ViT feature injection for improved multimodal instruction
following.
tldr: A unified framework for 3D virtual try-on that transforms simple 2D images
into realistic 3D representations by efficiently integrating clothing with the human body in a
pose-adaptive manner.
tldr: A two-stage unified audio-driven talking face generation framework, which can
render high-fidelity, lip-synchronized videos with improved inference speed.
tldr: Current image inpainting techniques are computationally heavy; this paper introduces a
Row-wise Flat Pixel LSTM, a small hybrid model for efficient, high-quality restoration of small
images.
tldr: A robust and efficient talking face generation model with highly accurate lip
synchronization and full facial expressiveness that supports longer audio inputs and
high-quality video resolutions.
tldr: An unsupervised one-shot talking head video generation model using neural
rendering and motion transfer techniques with non-linear transformation to animate static images.
tldr: A face-swapping model that accurately preserves source identity features
while seamlessly adapting target attributes, applicable to both images and videos.