Speaker Clustering Methods
Overview
After extracting speaker embeddings from audio segments, you need to cluster them to identify unique speakers. Different clustering methods have different strengths.
When to Use
- •After extracting speaker embeddings from VAD segments
- •Need to group similar speakers together
- •Determining number of speakers automatically or manually
Available Clustering Methods
1. Hierarchical Clustering (Recommended for Auto-tuning)
Best for: Automatically determining number of speakers, flexible threshold tuning
python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import numpy as np
# Prepare embeddings
embeddings_array = np.array(embeddings_list)
n_segments = len(embeddings_array)
# Compute distance matrix
distances = pdist(embeddings_array, metric='cosine')
# Create linkage matrix
linkage_matrix = linkage(distances, method='average')
# Auto-tune threshold to get reasonable speaker count
min_speakers = 2
max_speakers = max(2, min(10, n_segments // 2))
threshold = 0.7
labels = fcluster(linkage_matrix, t=threshold, criterion='distance')
n_speakers = len(set(labels))
# Adjust threshold if needed
if n_speakers > max_speakers:
for t in [0.8, 0.9, 1.0, 1.1, 1.2]:
labels = fcluster(linkage_matrix, t=t, criterion='distance')
n_speakers = len(set(labels))
if n_speakers <= max_speakers:
threshold = t
break
elif n_speakers < min_speakers:
for t in [0.6, 0.5, 0.4]:
labels = fcluster(linkage_matrix, t=t, criterion='distance')
n_speakers = len(set(labels))
if n_speakers >= min_speakers:
threshold = t
break
print(f"Selected: t={threshold}, {n_speakers} speakers")
Advantages:
- •Automatically determines speaker count
- •Flexible threshold tuning
- •Good for unknown number of speakers
- •Can visualize dendrogram
2. KMeans Clustering
Best for: Known number of speakers, fast clustering
python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
# Normalize embeddings
embeddings_array = np.array(embeddings_list)
norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None)
# Try different k values and choose best
best_k = 2
best_score = -1
best_labels = None
for k in range(2, min(7, len(embeddings_normalized))):
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings_normalized)
if len(set(labels)) < 2:
continue
score = silhouette_score(embeddings_normalized, labels, metric='cosine')
if score > best_score:
best_score = score
best_k = k
best_labels = labels
print(f"Best k={best_k}, silhouette score={best_score:.3f}")
Advantages:
- •Fast and efficient
- •Works well with known speaker count
- •Simple to implement
Disadvantages:
- •Requires specifying number of clusters
- •May get stuck in local minima
3. Agglomerative Clustering
Best for: Similar to hierarchical but with fixed number of clusters
python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import numpy as np
# Normalize embeddings
embeddings_array = np.array(embeddings_list)
norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None)
# Try different numbers of clusters
best_n = 2
best_score = -1
best_labels = None
for n_clusters in range(2, min(6, len(embeddings_normalized))):
clustering = AgglomerativeClustering(n_clusters=n_clusters)
labels = clustering.fit_predict(embeddings_normalized)
if len(set(labels)) < 2:
continue
score = silhouette_score(embeddings_normalized, labels, metric='cosine')
if score > best_score:
best_score = score
best_n = n_clusters
best_labels = labels
print(f"Best n_clusters={best_n}, silhouette score={best_score:.3f}")
Advantages:
- •Deterministic results
- •Good for fixed number of clusters
- •Can use different linkage methods
Comparison Table
| Method | Auto Speaker Count | Speed | Best For |
|---|---|---|---|
| Hierarchical | ✅ Yes | Medium | Unknown speaker count |
| KMeans | ❌ No | Fast | Known speaker count |
| Agglomerative | ❌ No | Medium | Fixed clusters needed |
Embedding Normalization
Always normalize embeddings before clustering:
python
# L2 normalization
embeddings_normalized = embeddings_array / np.clip(
np.linalg.norm(embeddings_array, axis=1, keepdims=True),
1e-9, None
)
Distance Metrics
- •Cosine: Best for speaker embeddings (default)
- •Euclidean: Can work but less ideal for normalized embeddings
Choosing Number of Speakers
Method 1: Silhouette Score (for KMeans/Agglomerative)
python
from sklearn.metrics import silhouette_score
best_k = 2
best_score = -1
for k in range(2, min(7, len(embeddings))):
labels = clusterer.fit_predict(embeddings)
score = silhouette_score(embeddings, labels, metric='cosine')
if score > best_score:
best_score = score
best_k = k
Method 2: Threshold Tuning (for Hierarchical)
python
# Start with reasonable threshold
threshold = 0.7
labels = fcluster(linkage_matrix, t=threshold, criterion='distance')
n_speakers = len(set(labels))
# Adjust based on constraints
if n_speakers > max_speakers:
# Increase threshold to merge more
threshold = 0.9
elif n_speakers < min_speakers:
# Decrease threshold to split more
threshold = 0.5
Post-Clustering: Merging Segments
After clustering, merge adjacent segments with same speaker:
python
def merge_speaker_segments(labeled_segments, gap_threshold=0.15):
"""
labeled_segments: list of (start, end, speaker_label)
gap_threshold: merge if gap <= this (seconds)
"""
labeled_segments.sort(key=lambda x: (x[0], x[1]))
merged = []
cur_s, cur_e, cur_spk = labeled_segments[0]
for s, e, spk in labeled_segments[1:]:
if spk == cur_spk and s <= cur_e + gap_threshold:
cur_e = max(cur_e, e)
else:
merged.append((cur_s, cur_e, cur_spk))
cur_s, cur_e, cur_spk = s, e, spk
merged.append((cur_s, cur_e, cur_spk))
return merged
Common Issues
- •Too many speakers: Increase threshold (hierarchical) or decrease k (KMeans)
- •Too few speakers: Decrease threshold (hierarchical) or increase k (KMeans)
- •Poor clustering: Check embedding quality, try different normalization
- •Over-segmentation: Increase gap_threshold when merging segments
Best Practices
- •Normalize embeddings before clustering
- •Use cosine distance for speaker embeddings
- •Try multiple methods and compare results
- •Validate speaker count with visual features if available
- •Merge adjacent segments after clustering
- •After diarization, use high-quality ASR: Use Whisper
smallorlarge-v3model for transcription (see automatic-speech-recognition skill)