 Although lipsync2 (our previous work) met its goals, it relied on an ensemble of networks trained within an adversarial framework, and its inference was slow. LipSyncFace addresses these issues with a two-stage framework. In the first stage, audio and video encoders form the face and predict an approximate sketch of the lip region conditioned on the audio. This stage generates the face at 160x160 resolution, compared with 96x96 in the earlier method. In the second stage, a rendering decoder synchronizes the lip movements with the audio and outputs a high-fidelity talking face. The two-stage framework replaces the ensemble with a single, unified network pipeline that is easier to train. Because it is less computationally complex than GAN-based approaches, inference is also faster, which is crucial for real-time applications. So far, the method has achieved a PSNR of 34.306, a Lip-Sync Error Confidence (LSE-C) of 7.436, and a Lip-Sync Error Distance (LSE-D) of 6.010 on the LRS2 dataset. These results indicate improved visual quality and lip synchronization, making the method more practical for real-time applications.
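For readers unfamiliar with the PSNR figure reported here, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), computed between a reference frame and a generated frame. The function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are illustrative assumptions; this is not the evaluation code behind the reported numbers.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `max_value` is the largest possible pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 grayscale frames, flattened row by row (hypothetical values).
reference_frame = [52, 55, 61, 59]
generated_frame = [54, 55, 60, 58]
score = psnr(reference_frame, generated_frame)
print(f"PSNR: {score:.2f} dB")
```

Higher PSNR means the generated frame is closer to the reference. In practice the score would typically be averaged over all frames of a test video.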
@article{LipSyncFace-Taneem,
  title={LipSyncFace: High-Fidelity Audio-Driven and Lip-Synchronized Talking Face Generation},
  author={Taneem Ullah Jan},
  year={2025}
}