Full Citation: “Mukhopadhyay, Soumik, et al. “Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization.” arXiv preprint arXiv:2308.09716 (2023).”
Link to Paper: https://arxiv.org/pdf/2308.09716.pdf
Conference Details: arXiv 2023
Project Page: Link

Lip synchronization task

Audio에 맞게 사람의 입술 움직임을 합성하는 task.

영화 산업(더빙), 가상 아바타 등에서 다양한 응용이 가능하다.

도전과제

디테일한 입술 움직임 구현

identity, pose, emotions 등 source의 특징을 보존해야함

굉장히 재미있어 보이는 분야가 있어서 읽어봄.

위의 비디오 예시처럼 오디오에 맞게 사람의 입술 움직임을 합성하는 task인데 Lip synchronization task라고 한다.

ChatGPT로 시나리오랑 text 짜달라고 한뒤 Text-to-Video + Lip synchronization + Video-to-Audio 3개면 영화나 deepfake, 양산형 유투브 쇼츠 비디오 뚝딱 만들 수 있을 듯…

쇼츠공장으로 월 몇백 버는 사람 있다고하던데;

Introduction

립 싱크(lip-synchronization)는 다른 speech audio에 맞게 사람의 입술 움직임을 합성하는 task.

이 과정에서 중요한 것은 자연스러운 입술 움직임만 합성하는 것이 아니라, 사용자의 identity, pose 등도 동시에 유지되어야 한다.

Video frame 처리 작업이 포함되어 있기 때문에, 4차원의 입력값에 연산이 필요하다.

입력 size: (C, F, H, W)
- C: Channel, F: Frame, H: Height, W: Width

본 논문, ‘Diff2Lip’은 위 립 싱크 task를 해결하기 위해 다음과 같은 구성 요소를 사용함:

Pose context를 얻기 위한 masked input frame.
Identity와 mouth region textures를 얻기 위한 reference frame.
Lip shape를 제어하기 위한 audio frame.

Contributions

Frame 차원이 포함된 입력 데이터에서 diffusion 기법을 사용하여 립 싱크를 수행한 ‘Diff2Lip’의 주요한 contribution은 아래와 같다:

Audio-conditioned image generation을 위한 새로운 diffusion model 기반의 접근법을 제안.
Frame-wise 및 sequential losses를 통해 고품질의 립 싱크를 성공적으로 수행.
Sequential adv loss를 활용하여 diffusion 모델을 이용한 frame-wise video 생성을 더욱 stable하게 수행.

Methods

Diffusion Models

Diffusion 모델은 데이터의 복잡한 분포를 모델링하기 위해 확률적 확산 과정을 사용한다.

주요 아이디어는 원본 데이터에 노이즈를 점진적으로 추가하고, 그 노이즈를 점차 제거하여 원래의 데이터를 재구성한다.

노이즈 추가 과정:

\[x_t = \sqrt{ \bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \epsilon \in N(0, I)\]
$\left\{ \bar{\alpha}_t \right\}_{t=1}^{T}$: list of noise scales
$x_0$는 원본 데이터

제거할 예측 noise에 대한 variational lower bound on the maximum likelihood objective:

\[L_{simple} = \mathbb{E}_{x_0, t, \epsilon}\left [ \left\| \epsilon_\theta (x_t, t)-\epsilon \right\|_2^2 \right ]\]

노이즈 제거 과정 (denoising), posterior sampling:

\[x_{t-1} = \sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{t}}} + (\sqrt{1-\bar{\alpha}_{t-1}}-\sqrt{\frac{\bar{\alpha}_{t-1}(1-\bar{\alpha}_{t})}{\bar{\alpha}_{t}}})\cdot \epsilon\]

자세한 Diffusion model은 DDPM 논문 참고

Proposed Approach

Notation 정리

$s$번 째 frame $ s \in {1, 2, \ldots, S}$
Video $V = {v_1, \ldots, v_S }$
Audio $A = {a_1, \ldots, a_S }$
Binary mask $M$
- 입 부분은 1, 나머지 0
$x_{T, s}=v_s∙(1-M)+\eta∙M$ , where $\eta∈N(0, I)$
- $T$ 번째 diffusion step의 이미지 (코 밑으로 gaussian noise)

Noisy Video $x_{t, s:s+5}$, reference image $x_r$, input audio $a_{s:s+5}$ 를 입력받아 audio-lip 이 일치하는 video $\hat{x}_{0, s:s+5}$ 생성하는 것이 목표.

Diffusion 만으로 lip-sync를 학습 한 경우

$x_{T, s}=v_s∙(1-M)+\eta∙M$ , where $\eta∈N(0, I)$
$x_r=v_{random(1, S) ≠s}$
Noise predictoin model $\epsilon_\theta(x_{s,t}, a_s, a_r, t)$

$L_{simple} = \mathbb{E}\left [ \left| \epsilon_\theta(x_{s,t}, a_s, a_r, t) - \epsilon \right| \right ]$

자연스러운 얼굴, 입모양이 형성되지만 오디오와 입술 모양의 sync가 맞지 않음.

Noise prediction(Original Diffusion model) 만으론 lip-sync 하기엔 부족함.

다양한 입모양을 생성할 수 있도록 해야함.
생성한 입모양이 audio에 맞도록 일치 시켜야함
각 video frame이 부드럽게 학습 되어야함.

1. 다양한 입모양 생성을 위한 Direct recover

$t$ 시점의 예측한 de-noised 이미지 $x_t$를 direct recover한 $x^{\theta}_0$ 를 활용.

$ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{(1-\bar{\alpha}_t)}\cdot \epsilon$ , where $\epsilon \in N(0, 1)$
\[x^\theta_0(x_t, t) = \frac{x_t-\sqrt{(1-\bar{\alpha}_t)}\cdot \epsilon}{\sqrt{\bar{\alpha}_t}}\]

다양한 입모양 생성 능력을 길러주기 위해 2가지 loss 추가 활용

\[L_2=\mathbb{E}_{x_{0,s}, t, \epsilon}\left [ \left\| x^\theta_{0,s}-x_{0,s}\right\|^2_2 \right ]\]

\[L_{lpips}=\mathbb{E}_{x_{0,s}, t, \epsilon}\left [ \left\|\phi(x^\theta_{0,s}) -\phi(x_{0,s})\right\|^2_2 \right ]\]

$\phi$: Pre-trained VGG network

2. 입모양과 audio 사이의 sync를 맞추기 위한 $L_{sync}$

To impose audio synchronization, use SyncNet [Wav2Lip, ACM 2020].

\[L_{sync} = \mathbb{E}_{x_{0,x},t,\epsilon}\left [ \mathbf{SyncNet}(x^\theta_{0,s:s+5}, a_{s:s+5}) \right ]\]

3. 각 frame이 realistic하고, 부드러워지도록 $L_{GAN}$

\[L_{GAN} = \mathbb{E}_{x_{0,s},t,\epsilon}\left [ log D(x^\theta_{0,s:s+5}) \right ] + \mathbb{E}_{x_{0,s},t,\epsilon}\left [ log(1-D(x^\theta_{0,s:s+5})) \right ]\]

Diffusion 만으로 부족한 입술모양, 입술 sync, frame 학습을 하기위해 GAN loss를 같이 활용.

$L_{total} = L_{simple} + \lambda_{L_2}L_2 + \lambda_{sync}L_{sync} + \lambda_{lips}L_{lipis} + \lambda_{GAN}L_{GAN}$

Experiments

Datasets

Train, Test

Voxceleb2: 1M face-cropped Youtube videos coming from 6000+ identies.

Test

LRW: 1000 videos each of 500 different words for a length of 1 second coming from BBC news

Video resolution 224×224, crop the face and resize it to 128×128.

Comparison Methods

Wav2Lip (ACM 2020, GAN 방식)
PC-AVSS (CVPR 2021, GAN 방식)

Quantitative Evaluation

Reconstruction: Given only the first frame and the audio corresponding to the same video.
Cross generation: The identity and the pose are controlled using a video while the lip-sync is driven using input audio corresponding to a different video.

솔직히 좋다고 하기엔 애매한 score. Diffusion 모델이기에 GAN 방식 선행연구에 비해 이미지 품질(FID)는 좋은 모습을 보여줌.

Qualitative Evaluation

왼쪽부터 Video Source, Wav2Lip, PC-AVS, Diff2Lip Diff2LIP project page

Qualitative 평가는 상당히 괜찮은 모습.

아직 생성분야에 대한 정량적 평가가 어렵기 때문에 애매한 quantative 결과대비 좋은 qualitative 결과를 보여줌.

Ablations

Reconstruction: $L_{simple}$ 만 사용
+SyncNet: $L_{simple}+L_2+L_{sync}$
- $L_{sync}$ 추가 시 audio-lip 사이의 sync는 매우 좋아지지만($Sync_c$ ) 이미지 quality(FID)가 살짝 안좋아짐
+Perceptual: $L_{simple}+L_2+L_{sync}+L_{lpips}$
+Seq. GAN: $L_{simple}+L_2+L_{sync}+L_{lpips}+L_{GAN}$
- Achieve temporal consistency

Conclusion

Diff2Lip is able to generate high-quality lip synchronization.

The authors pose it as a mouth region inpainting task and solve it by learning an audio-conditioned diffusion model.

SyncNet loss is required in our framework to introduce lip-sync while sequential adversarial loss improves both image quality and temporal consistency.

개인 Review

인상깊었던 부분

Direct recover noise image $x^\theta_0(x_t, t)$ 활용한 것은 매우 신박한듯.

GAN과 Diffusion을 함께 활용한 논문은 처음봤음.

아쉬운 부분

어디서부터 어디까지가 저자들의 contribution인지 명확하게 설명이 없음.
- 솔직히 말해서 Wav2Lip의 GAN 모델을 Diffusion 모델로 바꾼 것으로 밖에 안보임.
Diffusion image quality 향상의 핵심인 guidance-sampling 사용하지 않았음.
최근 $(B, C, F, H, W)$ 모양의 입력에 대해 효율적으로 temporal 차원을 학습할 수 있는 기술이 있음에도 Frame 사이의 관계를 Discriminator만으로 간접적으로 training 하였음
- 예시) Make-A-Video(ICLR 2023**)
  - B×F×C×H×W →(B×F)×C×H×W :이미지 quality 학습
  - B×F×C×H×W →(B×H×W)×F×C : Frame 관계 학습
Comparison Methods 다양성 부족
- GAN Inpainting method들과 비교
- 다른 Diffusion들의 결과와 비교 부재

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization.