>> user: What is the speaker trying to test?
>> assistant: The speaker is trying to test the functionality and
effectiveness of the driver.
>> user: Is this a male or a female voice?
>> assistant: Male. The use of "Hello, name" and the intonation suggest a
male speaker.
>> user: Can you transcribe this voice for me?
>> assistant: Hello, name. This is the driver video. So in this video, I'm
going to talk a little bit about how the driver works, and we're going to experiment to see how the results
are.
>> user: What was my first question in this conversation?
>> assistant: Your first question in this conversation was: "What is the speaker trying to test?"
We present EmbedVoiceLLM, a novel multimodal architecture that seamlessly integrates speech and text processing through advanced embedding space mapping techniques. Our approach eliminates the traditional Automatic Speech Recognition (ASR) pipeline by directly projecting audio embeddings into the high-dimensional space of Large Language Models (LLMs). The system achieves significant efficiency with ~150ms time-to-first-token and ~60 tokens/second inference speed while maintaining state-of-the-art performance across diverse audio-language tasks. To ensure robust and cost-efficient streaming, we introduce a Persistent Adaptive Token (PAT) strategy that predicts and allocates the required output length on-the-fly from prompt features and modality signals, continuously refining estimates with file-backed, session-persistent feedback. PAT dynamically sets the number of maximum tokens for single-pass decoding, reducing truncation and avoiding costly multi-pass regeneration while improving token efficiency, especially in multimodal settings. By employing Low-Rank Adaptation (LoRA) fine-tuning strategies and block-optimized architectures, EmbedVoiceLLM demonstrates superior parameter efficiency, training only 3.5–4.4% of total parameters while achieving competitive performance on speech understanding, transcription, and conversational AI tasks. Our extensible learning framework supports multiple backbone architectures including Mistral NeMo 12B and Meta Llama 3.1 8B, coupled with advanced audio encoders for robust multimodal comprehension.
The convergence of speech and language understanding has become a critical frontier in artificial
intelligence, with applications spanning from voice assistants to multimodal content creation. Traditional
approaches rely on cascaded systems that first convert speech to text through ASR, then process the text
using language models. This pipeline approach introduces latency, error propagation, and loss of
paralinguistic information crucial for natural human-machine interaction.
Recent advances in multimodal learning have demonstrated the potential for end-to-end systems that can
process audio and text simultaneously. However, existing approaches often suffer from computational
inefficiency, require extensive pre-training on massive datasets, or introduce architectural complexity
that limits practical deployment.
EmbedVoiceLLM addresses these challenges through three key innovations: (i) direct projection of audio embeddings into the LLM's embedding space, which removes the cascaded ASR stage; (ii) a Persistent Adaptive Token (PAT) strategy for length-aware, single-pass decoding; and (iii) parameter-efficient LoRA fine-tuning over multiple backbone architectures.
This approach achieves competitive performance while significantly reducing computational requirements and training time, making advanced multimodal capabilities accessible for practical applications.
EmbedVoiceLLM employs a three-component architecture that seamlessly integrates audio understanding with language generation capabilities:
```mermaid
graph TD
    A[Audio Input<br/>Waveform] --> B[Audio Encoder<br/>OpenAI Whisper]
    B --> C[Audio Embeddings<br/>Hidden States]
    C --> D[Multimodal Projector<br/>Linear + Activation]
    D --> E[Projected Embeddings<br/>LLM Dimension]
    F[Text Input<br/>Tokenized] --> G[Text Embeddings<br/>LLM Embeddings]
    E --> H[Embedding Fusion<br/>Concatenation]
    G --> H
    H --> I[Language Model<br/>NeMo/Llama3.1]
    I --> J[Generated Response<br/>Text Output]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style D fill:#fff3e0
    style I fill:#e8f5e8
    style J fill:#fff8e1
```
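As a concrete illustration of the fusion step, the sketch below concatenates projected audio embeddings with the prompt's text embeddings and feeds the fused sequence to a Hugging Face causal LM via `inputs_embeds`. The function and variable names are illustrative, not the released API.

```python
import torch

def fuse_and_forward(audio_embeds, text_ids, llm):
    """Fuse projected audio embeddings (B, T_a, d_llm) with text embeddings and
    run the (frozen) language model on the concatenated sequence."""
    text_embeds = llm.get_input_embeddings()(text_ids)        # (B, T_t, d_llm)
    fused = torch.cat([audio_embeds, text_embeds], dim=1)     # (B, T_a + T_t, d_llm)
    attn_mask = torch.ones(fused.shape[:2], dtype=torch.long, device=fused.device)
    return llm(inputs_embeds=fused, attention_mask=attn_mask)
```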
The audio encoder processes raw waveform input and extracts meaningful acoustic representations. We use OpenAI's Whisper as our primary audio backbone because of its robust, large-scale multilingual pre-training and strong generalization to real-world audio.
The encoder consumes input waveforms sampled at 16kHz and outputs contextualized embeddings with the temporal dimension preserved for accurate speech-text alignment.
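A minimal sketch of this step, assuming the Hugging Face `transformers` Whisper implementation and the `openai/whisper-large-v3` checkpoint (the exact checkpoint used by EmbedVoiceLLM is an assumption here):

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").get_encoder()
encoder.eval()  # the audio encoder stays frozen

def encode_audio(waveform_16khz):
    """waveform_16khz: 1-D float array sampled at 16 kHz."""
    feats = feature_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(feats.input_features).last_hidden_state  # (1, T, d_audio)
    return hidden
```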
The multimodal projector serves as the critical bridge between audio and text modalities, mapping audio embeddings into the LLM's high-dimensional space:
```mermaid
graph LR
    A[Audio Embeddings<br/>Shape: B×T×D_audio] --> B[Stack Factor<br/>Downsampling: 8x]
    B --> C[Stacked Features<br/>Shape: B×T/8×D_audio*8]
    C --> D[Linear Layer 1<br/>Input → Hidden]
    D --> E[SwiGLU Activation<br/>Gated Linear Unit]
    E --> F[Linear Layer 2<br/>Hidden → LLM_dim]
    F --> G[Layer Normalization<br/>Stable Training]
    G --> H[Projected Embeddings<br/>Shape: B×T/8×D_llm]
    style A fill:#e3f2fd
    style B fill:#f1f8e9
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#fff3e0
    style H fill:#e8f5e8
```
Key Design Decisions: the 8x stack factor downsamples the audio sequence before projection, reducing the number of audio positions the LLM must attend to; a SwiGLU gated activation between the two linear layers provides the projector's non-linearity; and a final layer normalization stabilizes training and matches the output scale to the LLM's embedding space.
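The module below is a sketch of the projector as described above (8x frame stacking, two linear layers with a SwiGLU gate, final layer normalization); the hidden width and the exact SwiGLU formulation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated linear unit: splits the input in half and gates one half with SiLU."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, gate = x.chunk(2, dim=-1)
        return x * F.silu(gate)

class MultimodalProjector(nn.Module):
    """Maps audio encoder states (B, T, d_audio) to LLM embeddings (B, T/stack, d_llm)."""
    def __init__(self, d_audio: int, d_llm: int, stack_factor: int = 8, d_hidden: int = 4096):
        super().__init__()
        self.stack_factor = stack_factor
        self.fc1 = nn.Linear(d_audio * stack_factor, 2 * d_hidden)  # Linear Layer 1
        self.act = SwiGLU()                                         # gated activation
        self.fc2 = nn.Linear(d_hidden, d_llm)                       # Linear Layer 2
        self.norm = nn.LayerNorm(d_llm)                             # stable training

    def forward(self, audio_embeds: torch.Tensor) -> torch.Tensor:
        b, t, d = audio_embeds.shape
        # Pad the time axis so it divides by the stack factor, then stack
        # consecutive frames into one vector (8x temporal downsampling).
        pad = (-t) % self.stack_factor
        audio_embeds = F.pad(audio_embeds, (0, 0, 0, pad))
        stacked = audio_embeds.reshape(b, (t + pad) // self.stack_factor, d * self.stack_factor)
        return self.norm(self.fc2(self.act(self.fc1(stacked))))
```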
EmbedVoiceLLM supports multiple state-of-the-art language model backbones. These backbones are kept frozen during training to preserve their pre-trained capabilities, while the multimodal projector learns to map audio embeddings into the appropriate semantic space.
EmbedVoiceLLM uses a parameter-efficient training approach that freezes the majority of model parameters while fine-tuning specific components:
```mermaid
graph TD
    A[Training Components] --> B[Frozen Components]
    A --> C[Trainable Components]
    B --> D[Audio Encoder<br/>Whisper Large<br/>*Frozen*]
    B --> E[Language Model<br/>NeMo/Llama3.1<br/>*Frozen*]
    C --> F[Multimodal Projector<br/>100% Trainable<br/>*Active*]
    C --> G[LoRA Adapters<br/>Text Model<br/>*Active*]
    C --> H[LoRA Adapters<br/>Audio Model<br/>*Active*]
    F --> I[Training Statistics<br/>Total: ~1.3B params<br/>Trainable: ~45M - 3.5%]
    G --> I
    H --> I
    style D fill:#e3f2fd
    style E fill:#e3f2fd
    style F fill:#ffebee
    style G fill:#ffebee
    style H fill:#ffebee
    style I fill:#e8f5e8
```
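The frozen/trainable split can be expressed as in the sketch below; `audio_encoder`, `llm`, and `projector` are illustrative attribute names, not the repository's actual module names.

```python
def freeze_backbones(model):
    """Freeze the audio encoder and the LLM; keep the projector (and any LoRA
    adapters) trainable, then report the trainable fraction."""
    for p in model.audio_encoder.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True

    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable: {trainable / 1e6:.1f}M / {total / 1e9:.2f}B "
          f"({100 * trainable / total:.1f}%)")
```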
We apply LoRA fine-tuning to both the text and audio models, enabling task-specific adaptation without full parameter updates and keeping the number of trainable parameters small.
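A sketch of the LoRA setup using the `peft` library; the checkpoint id, rank, alpha, and target modules shown are assumptions rather than EmbedVoiceLLM's exact hyperparameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a backbone (Llama 3.1 8B shown as an example) and attach LoRA adapters.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
lora_cfg = LoraConfig(
    r=16,                     # adapter rank (assumed value)
    lora_alpha=32,            # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)   # base weights stay frozen; only adapters train
llm.print_trainable_parameters()
```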
EmbedVoiceLLM employs KL divergence loss to align the multimodal model's output distribution with text-only predictions:
L_KL = KL(P(y|x_audio, x_text) || P(y|x_text))
where P(y|x_audio, x_text) is the output distribution of the multimodal model given both audio and text inputs, and P(y|x_text) is the text-only distribution produced by the same language model from the text input alone.
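A minimal sketch of this objective in PyTorch, assuming logits from the multimodal and text-only forward passes are aligned position by position; treating the text-only reference as detached and using `batchmean` reduction are implementation assumptions.

```python
import torch.nn.functional as F

def kl_alignment_loss(multimodal_logits, text_only_logits):
    """Computes KL(P(y | x_audio, x_text) || P(y | x_text)) over the vocabulary
    at each target position, averaged over the batch."""
    p_multi = F.softmax(multimodal_logits, dim=-1)                  # P(y | x_audio, x_text)
    log_q_text = F.log_softmax(text_only_logits.detach(), dim=-1)   # log P(y | x_text), fixed reference
    # F.kl_div(input=log Q, target=P) returns KL(P || Q) when target holds probabilities.
    return F.kl_div(log_q_text, p_multi, reduction="batchmean")
```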
```mermaid
flowchart TD
    A[Raw Dataset] --> B[Audio Preprocessing<br/>16kHz, Mono, Normalization]
    A --> C[Text Preprocessing<br/>Tokenization, Formatting]
    B --> D[Audio Features<br/>Log-mel Spectrograms]
    C --> E[Text Tokens<br/>Model-specific Encoding]
    D --> F[Multimodal Sample<br/>Audio + Text + Labels]
    E --> F
    F --> G[Data Augmentation<br/>Speed, Noise, Masking]
    G --> H[Training Batch<br/>Dynamic Padding]
    H --> I[Model Forward Pass]
    I --> J[Loss Computation<br/>KL Divergence]
    J --> K[Backward Pass<br/>LoRA + Projector Updates]
    style A fill:#e1f5fe
    style F fill:#fff3e0
    style I fill:#e8f5e8
    style K fill:#ffebee
```
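The audio side of this pipeline might look like the following torchaudio-based sketch; the released code may instead rely on the encoder's own feature extractor, and the normalization and mel settings here are assumptions.

```python
import torch
import torchaudio

def preprocess_audio(path: str, target_sr: int = 16000) -> torch.Tensor:
    """Load audio, convert to 16 kHz mono, peak-normalize, and compute log-mel features."""
    wav, sr = torchaudio.load(path)                        # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                    # mixdown to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav / wav.abs().max().clamp(min=1e-8)            # peak normalization
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=target_sr, n_mels=80)(wav)
    return torch.log(mel + 1e-6)                           # log-mel spectrogram
```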
The inference pipeline is optimized for low-latency, real-time applications. In addition, a Persistent Adaptive Token (PAT) strategy predicts the required number of output tokens on the fly from prompt and modality features and refines its estimates via a file-backed history, enabling single-pass decoding that reduces truncation and compute waste.
```mermaid
sequenceDiagram
    participant User
    participant AudioProc as Audio Processor
    participant Encoder as Audio Encoder
    participant Projector as Projector
    participant LLM as Language Model
    participant Output as Text Generator
    User->>AudioProc: Raw Audio Input
    AudioProc->>Encoder: Preprocessed Waveform
    Encoder->>Projector: Audio Embeddings B×T×D_audio
    Projector->>LLM: Projected Embeddings B×T/8×D_llm
    User->>LLM: Text Prompt (Optional)
    LLM->>LLM: Embedding Fusion
    LLM->>Output: Contextualized Hidden States
    Output->>User: Generated Response
    Note over Encoder,Projector: ~50ms processing time
    Note over LLM,Output: ~100ms time-to-first-token
```
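The class below is a hypothetical sketch of the PAT idea rather than the released implementation: a file-backed history of observed output lengths per (task, modality) bucket drives the `max_new_tokens` budget for single-pass decoding, and each completed generation refines the estimate. The JSON file format, safety margin, and bucketing are assumptions.

```python
import json
import os

class PersistentAdaptiveToken:
    """Estimate a max_new_tokens budget from session-persistent, file-backed history."""

    def __init__(self, path: str = "pat_history.json", margin: float = 1.25, default: int = 256):
        self.path, self.margin, self.default = path, margin, default
        if os.path.exists(path):
            with open(path) as f:
                self.history = json.load(f)
        else:
            self.history = {}

    def budget(self, task: str, modality: str, prompt_tokens: int) -> int:
        """Predict the token budget for a single decoding pass."""
        lengths = self.history.get(f"{task}:{modality}", [])
        if not lengths:
            return self.default
        avg = sum(lengths) / len(lengths)
        # Lightly condition on prompt size and add a margin against truncation.
        return int(self.margin * max(avg, 0.25 * prompt_tokens))

    def update(self, task: str, modality: str, observed_tokens: int) -> None:
        """Record the actual output length and persist the refined history."""
        key = f"{task}:{modality}"
        self.history.setdefault(key, []).append(observed_tokens)
        self.history[key] = self.history[key][-100:]   # bounded sliding window
        with open(self.path, "w") as f:
            json.dump(self.history, f)
```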
EmbedVoiceLLM supports streaming inference for real-time applications:
```mermaid
graph TD
    A[Audio Stream<br/>Continuous Input] --> B[Sliding Window<br/>Processing: 30s chunks]
    B --> C[Feature Extraction<br/>Overlapping segments]
    C --> D[Embedding Projection<br/>Incremental processing]
    D --> E[LLM Generation<br/>Streaming output]
    E --> F[Response Stream<br/>Token-by-token]
    B --> G[VAD Integration<br/>Voice Activity Detection]
    G --> H[Silence Handling<br/>Optimized processing]
    H --> D
    style A fill:#e3f2fd
    style E fill:#e8f5e8
    style F fill:#fff8e1
```
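A sketch of the sliding-window chunking implied by the diagram, assuming a generator of raw PCM blocks at 16 kHz; the window and overlap sizes are illustrative.

```python
from typing import Iterator
import numpy as np

def sliding_windows(stream: Iterator[np.ndarray], sr: int = 16000,
                    window_s: float = 30.0, overlap_s: float = 2.0) -> Iterator[np.ndarray]:
    """Yield overlapping ~30 s chunks from a continuous 16 kHz audio stream."""
    window, hop = int(window_s * sr), int((window_s - overlap_s) * sr)
    buffer = np.zeros(0, dtype=np.float32)
    for block in stream:
        buffer = np.concatenate([buffer, block.astype(np.float32)])
        while len(buffer) >= window:
            yield buffer[:window]      # hand this chunk to the encoder/projector
            buffer = buffer[hop:]      # keep the overlap for the next chunk
    if len(buffer):
        yield buffer                   # flush the tail at end of stream
```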
EmbedVoiceLLM is evaluated across multiple dimensions to assess both speech understanding and language generation capabilities. The evaluation spans diverse multimodal tasks, covering speech recognition and speech understanding, without relying on any single dataset.
EmbedVoiceLLM represents a significant advancement in multimodal AI, demonstrating that efficient embedding-driven architectures can achieve strong performance while maintaining practical computational requirements. Our approach of directly mapping audio embeddings into LLM space eliminates traditional ASR bottlenecks and enables more natural human-machine interaction.
Key contributions include: direct mapping of audio embeddings into the LLM embedding space without an ASR pipeline; the Persistent Adaptive Token (PAT) strategy for efficient, single-pass decoding; and a parameter-efficient training recipe that updates only 3.5–4.4% of parameters through LoRA adapters and the multimodal projector, supporting multiple backbone architectures.
The extensible learning framework positions EmbedVoiceLLM as a foundation for diverse applications, from voice assistants to content creation tools. Our comprehensive evaluation demonstrates robust performance across speech understanding, transcription, and conversational AI tasks. As multimodal AI continues to evolve, EmbedVoiceLLM provides both a practical solution for current applications and a research platform for future innovations in audio-language understanding. The combination of efficiency, performance, and extensibility makes it particularly suitable for both academic research and industrial deployment scenarios.
@article{
embedvoicellm,
title={Efficient Multimodal Block-optimized Embedding-Driven Voice Operations with extensible learning},
author={Taneem Ullah Jan},
year={2025}
}