How to integrate Speaker Diarization into VideoDubber

Step 1: Understand the current code structure

VideoDubber typically runs this pipeline:

video → audio → STT → translation → TTS → lipsync

It needs to become:

video → audio → SPEAKER DIARIZATION → (for each speaker: STT → translation → TTS) → merge → lipsync
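
The first arrow (video → audio) is usually a single ffmpeg call. The sketch below builds that command as an argument list; the function names, the default output path and the 16 kHz mono format are assumptions for illustration, not part of VideoDubber.

```python
import subprocess
from pathlib import Path


def build_extract_audio_cmd(video_path, audio_path, sample_rate=16000):
    """Return the ffmpeg argument list that writes a mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(video_path),    # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample (16 kHz is an assumed default)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(audio_path),
    ]


def extract_audio(video_path, audio_path="temp/audio.wav"):
    """Run ffmpeg (must be on PATH) and return the path of the WAV file."""
    Path(audio_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_audio_cmd(video_path, audio_path), check=True)
    return audio_path
```

Keeping the command builder separate from the call that runs it makes the arguments easy to inspect or log before any file is touched.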

Step 2: Choose a suitable diarization tool

Option 1: pyannote.audio (best accuracy)

# Add to requirements.txt
pyannote.audio==3.1.1
pyannote.core==5.0.1

# In the main code
from pyannote.audio import Pipeline

def diarize_audio(audio_path):
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN"  # Create one at huggingface.co and accept the model's user conditions
    )
    
    # Run the pipeline on the audio file
    diarization = pipeline(audio_path)
    
    # Collect one entry per speaker turn
    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        segments.append({
            'speaker': speaker,
            'start': turn.start,
            'end': turn.end,
            'duration': turn.end - turn.start
        })
    return segments
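
Raw diarization output often contains many very short turns and back-to-back turns from the same speaker. A possible post-processing step, sketched here under assumptions, merges neighbouring turns of one speaker and drops fragments that are too short to transcribe. The function name and the `max_gap` default are hypothetical; the 0.5 s minimum mirrors the `min_segment_length` value used in the YAML configuration later in this post.

```python
def merge_speaker_turns(segments, max_gap=0.3, min_duration=0.5):
    """Merge consecutive turns of the same speaker, then drop short ones.

    segments: list of dicts with 'speaker', 'start', 'end' (seconds),
    in the shape returned by diarize_audio().
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        prev = merged[-1] if merged else None
        same_speaker = prev is not None and prev["speaker"] == seg["speaker"]
        if same_speaker and seg["start"] - prev["end"] <= max_gap:
            # Extend the previous turn instead of starting a new one
            prev["end"] = max(prev["end"], seg["end"])
        else:
            merged.append(
                {"speaker": seg["speaker"], "start": seg["start"], "end": seg["end"]}
            )

    result = []
    for seg in merged:
        seg["duration"] = seg["end"] - seg["start"]
        if seg["duration"] >= min_duration:
            result.append(seg)
    return result
```

Fewer, longer segments usually mean fewer STT and TTS calls, at the cost of slightly coarser timing.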

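Option 2: NVIDIA NeMo (a balance between accuracy and speed)

# Add to requirements.txt (no quotes around the extra)
nemo_toolkit[asr]==1.21.0

# Integration code
import nemo.collections.asr as nemo_asr

def diarize_with_nemo(audio_path):
    model = nemo_asr.models.EncDecDiarLabelModel.from_pretrained(
        model_name="diar_msdd_telephonic"
    )
    
    # Diarization
    # NOTE: check this call against your installed NeMo version; the documented
    # MSDD workflow runs through NeuralDiarizer with a config and a manifest file.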
    diarization_result = model.diarize(
        paths2audio_files=[audio_path],
        batch_size=1
    )
    
    return diarization_result
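
Diarization tools, NeMo included, commonly report their result in RTTM format, one `SPEAKER` line per turn. Assuming the result is available as RTTM text, the following sketch converts it into the same segment dictionaries that `diarize_audio` returns, so the rest of the pipeline does not care which option produced them. The function name `parse_rttm` is hypothetical.

```python
def parse_rttm(text):
    """Convert RTTM text into a list of speaker segments.

    RTTM columns: type, file id, channel, onset, duration,
    orthography, speaker type, speaker name, confidence, lookahead.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and other record types
        start = float(fields[3])
        duration = float(fields[4])
        segments.append({
            "speaker": fields[7],
            "start": start,
            "end": start + duration,
            "duration": duration,
        })
    return sorted(segments, key=lambda s: s["start"])
```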

Option 3: Simple diarization with Whisper (lightweight, easy to integrate)

# Use Whisper together with speaker embeddings
import whisper
from sklearn.cluster import KMeans
import numpy as np

def diarize_with_whisper(audio_path):
    model = whisper.load_model("medium")
    
    # Transcribe with segment and word timestamps
    result = model.transcribe(audio_path, word_timestamps=True)
    
    # Get embeddings and cluster them
    segments = result["segments"]
    # Speaker embeddings still have to be extracted (more involved),
    # then clustered, e.g. with KMeans.
    # A simpler alternative: split turns at long pauses.
    return segments  # timestamped segments, not yet labelled by speaker
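
The "split at long pauses" idea can be sketched as follows. It groups Whisper segments (dicts with `start`, `end` and `text`) into turns whenever the silence between two segments reaches a threshold. The function name and the 1-second default are assumptions. Note that a pause only suggests a change of turn: this produces turn boundaries, not speaker identities.

```python
def split_turns_by_pause(segments, min_pause=1.0):
    """Group consecutive transcript segments into turns separated by long pauses."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and seg["start"] - turns[-1]["end"] < min_pause:
            # Short gap: treat it as a continuation of the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            # Long gap (or first segment): start a new turn
            turns.append({"start": seg["start"], "end": seg["end"], "text": text})
    return turns
```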

Step 3: Modify the VideoDubber pipeline

Create the file multi_speaker_pipeline.py:

import json
import os
from pathlib import Path

class MultiSpeakerVideoDubber:
    def __init__(self, video_path):
        self.video_path = video_path
        self.speakers = {}
        self.audio_segments = []
        
    def extract_and_diarize(self):
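        """Step 1: Extract the audio track and run diarization."""
        # Extract audio from the video
        audio_file = self.extract_audio()
        
        # Diarization (assumes diarize_audio from Option 1 is saved in diarization_module.py)
        from diarization_module import diarize_audio
        speaker_segments = diarize_audio(audio_file)
        
        # Cut the audio into one file per speaker segment
        os.makedirs("temp", exist_ok=True)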
        for i, segment in enumerate(speaker_segments):
            output_file = f"temp/speaker_{segment['speaker']}_seg_{i}.wav"
            self.cut_audio_segment(
                audio_file, 
                segment['start'], 
                segment['end'], 
                output_file
            )
            self.audio_segments.append({
                'speaker': segment['speaker'],
                'file': output_file,
                'start': segment['start'],
                'end': segment['end']
            })
    
    def process_each_speaker(self):
        """Step 2: Process each speaker separately."""
        for segment in self.audio_segments:
            speaker_id = segment['speaker']
            
            # Create a profile the first time a speaker appears
            if speaker_id not in self.speakers:
                self.speakers[speaker_id] = {
                    'voice_model': self.select_voice_model(speaker_id),
                    'segments': []
                }
            
            # STT (speech to text)
            text = self.transcribe_segment(segment['file'], source_lang='zh')
            
            # Translation
            translated_text = self.translate_text(text, 'zh', 'en')
            
            # TTS with this speaker's own voice
            voice_file = self.text_to_speech(
                translated_text,
                voice_model=self.speakers[speaker_id]['voice_model']
            )
            
            # Store the result for this segment
            self.speakers[speaker_id]['segments'].append({
                'original_text': text,
                'translated_text': translated_text,
                'audio_file': voice_file,
                'start': segment['start'],
                'end': segment['end']
            })
    
    def merge_and_sync(self):
        """Step 3: Merge the audio and lip-sync."""
        # Create a new, empty audio timeline
        merged_audio = self.create_empty_audio(self.get_video_duration())
        
        # Overlay each segment at its original start time
        for speaker_id, data in self.speakers.items():
            for segment in data['segments']:
                merged_audio = self.overlay_audio(
                    merged_audio,
                    segment['audio_file'],
                    segment['start']
                )
        
        # Save the merged audio
        merged_audio.export("temp/merged_audio.wav", format="wav")
        
        # Lip-sync against the original video
        self.lip_sync(
            self.video_path,
            "temp/merged_audio.wav",
            "output/final_video.mp4"
        )
    
    def select_voice_model(self, speaker_id):
        """Choose a suitable voice for each speaker."""
        # Could be based on pitch or gender detection,
        # or on a fixed mapping
        voice_mapping = {
            'SPEAKER_00': 'en_male_01',
            'SPEAKER_01': 'en_female_01',
            # Add more mappings here
        }
        return voice_mapping.get(speaker_id, 'en_default')
    
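    # Utility functions (stubs to be filled in)
    def extract_audio(self):
        # Use ffmpeg; should return the path of the extracted audio file
        raise NotImplementedError
    
    def cut_audio_segment(self, audio_file, start, end, output_file):
        # Use pydub
        raise NotImplementedError
    
    def transcribe_segment(self, audio_file, source_lang):
        # Use Whisper
        raise NotImplementedError
    
    def translate_text(self, text, src_lang, tgt_lang):
        # Use a translation model
        raise NotImplementedError
    
    def text_to_speech(self, text, voice_model):
        # Use TTS with the given voice model; should return the generated file path
        raise NotImplementedError
    
    def get_video_duration(self):
        # Called by merge_and_sync; should return the video length
        raise NotImplementedError
    
    def create_empty_audio(self, duration):
        # Called by merge_and_sync; should return a silent track of that length
        raise NotImplementedError
    
    def overlay_audio(self, base_audio, audio_file, start):
        # Called by merge_and_sync; should mix audio_file into base_audio at start
        raise NotImplementedError
    
    def lip_sync(self, video_path, audio_path, output_path):
        # Use Wav2Lip or SadTalker
        raise NotImplementedError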
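
The merge step relies on two helpers, `create_empty_audio` and `overlay_audio`, that the class never spells out. To make the idea concrete, here is a minimal sketch of the underlying arithmetic on plain lists of 16-bit samples: start from silence, add each clip at its start offset, and clip the sum to the valid range. The function name, the `(start_seconds, samples)` input shape and the sample rate are assumptions; a real implementation would more likely lean on an audio library.

```python
def mix_timeline(total_seconds, clips, sample_rate=16000):
    """Place mono 16-bit clips on a silent timeline and return the mixed samples.

    clips: list of (start_seconds, samples), samples being a list of ints
    in the signed 16-bit range.
    """
    timeline = [0] * int(round(total_seconds * sample_rate))  # silence
    for start, samples in clips:
        offset = int(round(start * sample_rate))
        for i, sample in enumerate(samples):
            pos = offset + i
            if pos >= len(timeline):
                break  # the clip runs past the end of the video
            mixed = timeline[pos] + sample
            # Clamp so overlapping speech cannot overflow 16 bits
            timeline[pos] = max(-32768, min(32767, mixed))
    return timeline
```

One thing this makes visible: if a translated TTS clip is longer than the original turn, it simply overlaps whatever comes next, so clip lengths may need to be checked against the gap before the following segment.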

Step 4: Integrate into the main VideoDubber

Add to VideoDubber's main.py:

# Add a multi-speaker option
parser.add_argument('--multi-speaker', action='store_true',
                   help='Enable multi-speaker diarization')

# Inside the main function
if args.multi_speaker:
    from multi_speaker_pipeline import MultiSpeakerVideoDubber
    dubber = MultiSpeakerVideoDubber(args.input_video)
    dubber.extract_and_diarize()
    dubber.process_each_speaker()
    dubber.merge_and_sync()
else:
    # Handle the single-speaker case as before
    pass
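
For readers who want to try the flag in isolation, here is a self-contained sketch of the same argument handling. It shows that argparse exposes `--multi-speaker` as `args.multi_speaker`. The positional `input_video` argument and the function names are assumptions; VideoDubber's real parser may define its inputs differently.

```python
import argparse


def build_parser():
    """Build a stand-in parser with the new flag (not VideoDubber's real one)."""
    parser = argparse.ArgumentParser(description="VideoDubber CLI sketch")
    parser.add_argument("input_video", help="path to the source video")
    parser.add_argument("--multi-speaker", action="store_true",
                        help="Enable multi-speaker diarization")
    return parser


def choose_pipeline(argv):
    """Return which pipeline the given command line selects."""
    args = build_parser().parse_args(argv)
    # argparse turns the dash in --multi-speaker into an underscore
    return "multi" if args.multi_speaker else "single"
```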

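🚀 Advanced improvements

1. Voice cloning for each character

import subprocess

def create_voice_clone(self, speaker_id, reference_audio):
    """Create a voice clone from a reference sample."""
    # Use RVC (Retrieval-based Voice Conversion)
    # or OpenVoice for voice cloning
    
    # Sample RVC invocation; the script path and flags depend on your RVC checkout.
    # An argument list (no shell=True) keeps unusual file names away from the shell.
    cmd = ["python", "rvc/infer_cli.py",
           "--input", str(reference_audio), "--model", str(speaker_id)]
    subprocess.run(cmd, check=True)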
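
Cloning needs enough clean reference audio per speaker. One simple way to gather it, sketched here as an assumption rather than a fixed recipe, is to take a speaker's longest diarized segments until a minimum total length is reached. The 10-second default mirrors the `min_samples` value in the YAML configuration later in this post; the function name is hypothetical.

```python
def pick_reference_segments(segments, min_seconds=10.0):
    """Pick the longest segments of one speaker until min_seconds is covered.

    Returns (chosen_segments, total_seconds). The caller should check
    total_seconds >= min_seconds before attempting to clone.
    """
    chosen, total = [], 0.0
    by_length = sorted(segments, key=lambda s: s["end"] - s["start"], reverse=True)
    for seg in by_length:
        if total >= min_seconds:
            break
        chosen.append(seg)
        total += seg["end"] - seg["start"]
    return chosen, total
```

Longer segments are preferred because they tend to contain fewer cut-off words than many tiny fragments of the same total length.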

2. Visual Speaker Identification (identifying the speaker from the picture)

def identify_speaker_by_face(self, video_path, audio_segments):
    """Combine face recognition with diarization."""
    # Uses the face_recognition library
    import face_recognition
    
    # Extract keyframes
    keyframes = self.extract_keyframes(video_path)
    
    # Detect and encode the faces
    faces = {}
    for frame in keyframes:
        face_locations = face_recognition.face_locations(frame)
        face_encodings = face_recognition.face_encodings(frame, face_locations)
        
        # Map the faces to their timestamp
        timestamp = self.get_frame_time(frame)
        faces[timestamp] = face_encodings
    
    # Combine with the audio diarization
    return self.match_face_to_voice(faces, audio_segments)
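
The `match_face_to_voice` helper is left undefined above. One possible approach is a simple vote: for each diarized speaker, count which face is on screen alone while that speaker is talking. The sketch below assumes the face encodings have already been grouped into stable face IDs (a step not shown here), so its input maps timestamps to lists of ID strings rather than raw encodings. That input shape is an assumption.

```python
from collections import Counter, defaultdict


def match_face_to_voice(face_ids_by_time, audio_segments):
    """Map each speaker label to the face most often seen alone during its turns.

    face_ids_by_time: {timestamp_seconds: [face_id, ...]}
    audio_segments: list of dicts with 'speaker', 'start', 'end'
    """
    votes = defaultdict(Counter)
    for seg in audio_segments:
        for timestamp, face_ids in face_ids_by_time.items():
            in_turn = seg["start"] <= timestamp <= seg["end"]
            # Frames with several faces are ambiguous, so only solo frames vote
            if in_turn and len(face_ids) == 1:
                votes[seg["speaker"]][face_ids[0]] += 1
    return {spk: counts.most_common(1)[0][0] for spk, counts in votes.items()}
```

Speakers who never appear alone on screen get no entry, which lets the caller fall back to audio-only labels for them.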

3. Context-aware Translation (translating according to each character's context)

def get_speaker_context(self, speaker_id):
    """Collect the context of one character."""
    # Gather every utterance of this speaker
    all_utterances = [
        segment['original_text']
        for segment in self.speakers[speaker_id]['segments']
    ]
    
    # Build the context from the utterances
    context = self.analyze_speaker_style(all_utterances)
    
    return {
        'speaking_style': context['style'],  # formal, casual, etc.
        'terminology': context['terms'],     # domain-specific terms
        'personality': context['personality']  # character traits
    }

📦 Directory structure after the improvements

VideoDubber/
├── multi_speaker/
│   ├── __init__.py
│   ├── diarization.py      # Speaker diarization
│   ├── voice_manager.py    # Voice management
│   ├── merge_utils.py      # Audio/video merging
│   └── visual_speaker.py   # Visual speaker identification
├── configs/
│   ├── speaker_config.yaml # Voice configuration per speaker type
│   └── diarization_config.yaml
├── models/
│   ├── speaker_models/     # Voice models
│   └── face_models/        # Face recognition models
└── main_multi_speaker.py   # New entry point

⚙️ YAML configuration for multi-speaker mode

Create the file configs/multi_speaker_config.yaml:

diarization:
  method: "pyannote"  # or "nemo", "simple"
  min_speakers: 1
  max_speakers: 10
  min_segment_length: 0.5  # seconds

speaker_voice_mapping:
  default_male: "tts_models/en/vctk/vits"
  default_female: "tts_models/en/ljspeech/tacotron2-DDC"
  child_voice: "tts_models/en/sam/tacotron-DDC"

voice_cloning:
  enabled: false
  method: "rvc"  # or "openvoice", "voicecraft"
  min_samples: 10  # minimum seconds of reference audio needed to clone

visual_identification:
  enabled: true
  method: "face_recognition"
  confidence_threshold: 0.6

output:
  separate_tracks: false
  subtitles_per_speaker: true
  export_speaker_timeline: true
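
Once the YAML file has been parsed into a dictionary (for instance with PyYAML's `yaml.safe_load`), it helps to validate the values before starting a long run. The sketch below covers only the `diarization` section; the class and function names are hypothetical, and the defaults simply repeat the values shown above.

```python
from dataclasses import dataclass

ALLOWED_METHODS = {"pyannote", "nemo", "simple"}


@dataclass
class DiarizationConfig:
    method: str = "pyannote"
    min_speakers: int = 1
    max_speakers: int = 10
    min_segment_length: float = 0.5  # seconds

    def __post_init__(self):
        # Fail early on values the pipeline could not use
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown diarization method: {self.method!r}")
        if not 1 <= self.min_speakers <= self.max_speakers:
            raise ValueError("need 1 <= min_speakers <= max_speakers")
        if self.min_segment_length < 0:
            raise ValueError("min_segment_length must not be negative")


def load_diarization_config(raw):
    """Build the config from the parsed YAML dict; missing keys use defaults."""
    return DiarizationConfig(**raw.get("diarization", {}))
```

With pyannote, the two speaker bounds can then be forwarded when the pipeline is called, for example `pipeline(audio_path, min_speakers=cfg.min_speakers, max_speakers=cfg.max_speakers)`.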

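🧪 Testing with a multi-speaker video

Create the test script test_multi_speaker.py:

import sys
sys.path.append('.')

# NOTE: this script assumes an extended class that lives in the multi_speaker
# package and adds config_path, process() and export_output(); the Step 3 class
# defines none of these yet.
from multi_speaker.multi_speaker_dubber import MultiSpeakerVideoDubber

# Test with an interview video (2 people)
dubber = MultiSpeakerVideoDubber(
    video_path="interview.mp4",
    config_path="configs/multi_speaker_config.yaml"
)

# Run the pipeline
dubber.process()

# Export the results
dubber.export_output(
    output_dir="output/interview",
    formats=["mp4", "srt", "json"]
)

# Show the per-speaker analysis
print("Speaker analysis:")
for speaker_id, data in dubber.speakers.items():
    # Computed from the segments, since the Step 3 class stores no total_duration
    total_duration = sum(s['end'] - s['start'] for s in data['segments'])
    print(f"Speaker {speaker_id}:")
    print(f"  Number of turns: {len(data['segments'])}")
    print(f"  Total time: {total_duration:.2f}s")
    print(f"  Voice used: {data['voice_model']}")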

🎨 Improved web interface

Add a UI element to manage each speaker:

<!-- In templates/index.html -->
<div id="speaker-management">
    <h3>Character management</h3>
    
    <div v-for="(speaker, id) in speakers" :key="id" class="speaker-card">
        <h4>Speaker {{ id }}</h4>
        
        <div class="voice-selection">
            <label>Select voice:</label>
            <select v-model="speaker.voice">
                <option value="male_01">Deep male</option>
                <option value="female_01">High female</option>
                <option value="clone">Clone the original voice</option>
            </select>
        </div>
        
        <div class="timeline-preview">
            <p>Duration: {{ speaker.total_time }}s</p>
            <p>Segments: {{ speaker.segments.length }}</p>
        </div>
        
        <button @click="previewSpeaker(id)">Preview</button>
    </div>
</div>

⚡ Performance optimization

Parallel processing:

from concurrent.futures import ThreadPoolExecutor

def process_speakers_parallel(self):
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = []
        for speaker_id in self.speakers:
            future = executor.submit(
                self.process_single_speaker,
                speaker_id
            )
            futures.append(future)
        
        # Wait for all of them to finish
        results = [f.result() for f in futures]
    return results

Cache voice models to avoid loading them repeatedly
Batch small segments that belong to the same speaker
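
The two tips above can be sketched with the standard library alone. `functools.lru_cache` keeps a loaded voice model in memory so that each voice is loaded once, and `itertools.groupby` collects the segments of one speaker so they can be processed together. The loader below is a placeholder that returns a dictionary; a real one would load the TTS model, and both function names are hypothetical.

```python
from functools import lru_cache
from itertools import groupby


@lru_cache(maxsize=8)
def load_voice_model(voice_name):
    """Placeholder loader: repeated calls with the same name reuse one object."""
    return {"name": voice_name}  # stands in for an expensive model load


def batch_by_speaker(segments):
    """Group segments by speaker, each group ordered by start time."""
    ordered = sorted(segments, key=lambda s: (s["speaker"], s["start"]))
    return {
        speaker: list(group)
        for speaker, group in groupby(ordered, key=lambda s: s["speaker"])
    }
```

`groupby` only groups adjacent items, which is why the segments are sorted by speaker first.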

📊 Evaluating the result

After these improvements, VideoDubber should be able to:

✅ Automatically detect 2-10 characters in a video
✅ Assign a different voice to each person
✅ Translate separately according to each character's context
✅ Export subtitles color-coded per speaker
✅ Optionally clone a voice from an existing sample
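
As an illustration of the color-coded subtitle export, the sketch below writes SRT text and wraps each line in a `<font color>` tag. Such tags are a common but unofficial SRT extension, so whether the colors show up depends on the player. The function names, the input shape and the fallback color are assumptions.

```python
def format_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(entries, colors, default_color="#FFFFFF"):
    """Build SRT text from dicts with 'speaker', 'start', 'end', 'text'."""
    blocks = []
    ordered = sorted(entries, key=lambda e: e["start"])
    for index, entry in enumerate(ordered, start=1):
        color = colors.get(entry["speaker"], default_color)
        timing = "{} --> {}".format(
            format_timestamp(entry["start"]), format_timestamp(entry["end"])
        )
        text = '<font color="{}">{}</font>'.format(color, entry["text"])
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)  # blank line between subtitle blocks
```

The entries can be built directly from the `translated_text`, `start` and `end` fields that the pipeline stores for each speaker.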

🚨 Important notes

A capable GPU is needed to run multi-speaker diarization at a practical speed
Quality depends on:
- How distinct the voices are in the audio
- The gaps between utterances
- Background noise
Processing time grows with the number of speakers

🔗 Additional resources

Pre-trained diarization models:
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://catalog.ngc.nvidia.com/orgs/nvidia/models/nemo_diar_msdd_telephonic
Multi-speaker TTS:
- https://github.com/coqui-ai/TTS/wiki/Multi-speaker-TTS
Diarization tutorials:
- https://github.com/pyannote/pyannote-audio

Conclusion: VideoDubber can certainly be extended to recognize multiple speakers, but doing so takes a significant investment of time and compute. Starting with pyannote.audio is the best choice in terms of accuracy and ease of integration.

Trang Ánh Nam