AgentSkillsCN

video-search

Enable this skill when the user needs to search video content by text or image. Trigger conditions include: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, finding highlight moments in a video, and similar requests.

SKILL.md
---
name: video-search
description: "Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video."
---

Video Search

Semantic search on video content — find specific moments by describing what you're looking for.

When to Activate

Activate this skill when:

  • User wants to search within video content by description
  • User mentions "find in video", "video search", "meeting recordings"
  • User has tutorial videos, meetings, or surveillance footage to search
  • User needs to find specific scenes or discussions in long videos

Do NOT activate when:

  • User only needs image search → use image-search
  • User wants to search images by text → use text-to-image-search
  • User has static documents with images → use multimodal-rag

Interactive Flow

Step 1: Understand Video Type

"What type of videos are you searching?"

A) Speech-heavy (tutorials, meetings, lectures)

  • Primary content is spoken words
  • ASR (speech-to-text) is key

B) Visual-heavy (surveillance, sports, vlogs)

  • Actions and scenes matter more than speech
  • Keyframe extraction is key

C) Mixed (documentaries, how-to videos)

  • Both speech and visuals important
  • Need both approaches

Which describes your videos? (A/B/C)

Step 2: Determine Search Granularity

"How precise should search results be?"

| Granularity | Segment Length | Use Case |
|---|---|---|
| Coarse | 5-10 minutes | "Find the meeting about budget" |
| Medium | 30-60 seconds | "Find where they discuss pricing" |
| Fine | 5-15 seconds | "Find exactly when John mentioned the deadline" |

Step 3: Confirm Configuration

"Based on your requirements:

  • Processing: ASR (Whisper) + Keyframes every 30s
  • Segment size: 30 seconds
  • Embedding: BGE for text, CLIP for frames

Proceed? (yes / adjust [what])"
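
Once the user confirms, the chosen settings can be captured in one place before processing starts. A minimal sketch; the dataclass and field names are illustrative, not part of any library:

python
from dataclasses import dataclass

@dataclass
class VideoSearchConfig:
    use_asr: bool = True                          # Whisper transcription of the audio track
    keyframe_interval_s: int = 30                 # extract one keyframe every N seconds
    segment_length_s: int = 30                    # transcript segment size
    text_model: str = "BAAI/bge-large-en-v1.5"    # text embeddings (dim 1024)
    frame_model: str = "clip-ViT-B-32"            # image embeddings via CLIP

config = VideoSearchConfig()  # defaults mirror the confirmation above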

Core Concepts

Mental Model: Video as Searchable Book

Think of video processing as converting a video into a searchable book:

  • Each chapter = video segment (30-60 seconds)
  • Each chapter has text (transcript) + illustrations (keyframes)
  • Search finds the right "chapter"
code
┌─────────────────────────────────────────────────────────┐
│                    Video Search Pipeline                 │
│                                                          │
│  Original Video (2 hours)                                │
│       │                                                  │
│       ├────────────────────────────────────┐            │
│       │                                    │            │
│       ▼                                    ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │ Audio Track  │                  │ Video Track  │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         ▼                                  ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │    Whisper   │                  │   Keyframe   │    │
│  │     ASR      │                  │  Extraction  │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         ▼                                  ▼            │
│  [Transcript segments]             [Keyframe images]    │
│  "At 0:30, John said..."           [img1] [img2] ...   │
│         │                                  │            │
│         ▼                                  ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │     BGE      │                  │    CLIP      │    │
│  │   Encoder    │                  │   Encoder    │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         └────────────┬─────────────────────┘            │
│                      ▼                                   │
│              ┌──────────────┐                           │
│              │    Milvus    │                           │
│              │   Storage    │                           │
│              └──────────────┘                           │
└─────────────────────────────────────────────────────────┘

Two Search Approaches

| Approach | What it Searches | Best For |
|---|---|---|
| Transcript | Spoken words | "What did they say about X?" |
| Keyframes | Visual content | "Find the scene with Y" |

Both can be combined for comprehensive search.

Implementation

python
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer

class VideoSearch:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.text_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.collection_name = "video_search"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("video_path", DataType.VARCHAR, max_length=512)
        schema.add_field("content_type", DataType.VARCHAR, max_length=16)  # transcript/frame
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("start_time", DataType.FLOAT)
        schema.add_field("end_time", DataType.FLOAT)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def search(self, query: str, limit: int = 10, search_type: str = "all") -> list:
        """Search video clips
        search_type: "all" | "transcript" | "frame"
        """
        embedding = self.text_model.encode(query).tolist()

        filter_expr = ""
        if search_type == "transcript":
            filter_expr = 'content_type == "transcript"'
        elif search_type == "frame":
            filter_expr = 'content_type == "frame"'

        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            filter=filter_expr if filter_expr else None,
            limit=limit,
            output_fields=["video_path", "content", "start_time", "end_time", "content_type"]
        )

        return [{
            "video": hit["entity"]["video_path"],
            "type": hit["entity"]["content_type"],
            "content": hit["entity"]["content"][:200] + "..." if len(hit["entity"]["content"]) > 200 else hit["entity"]["content"],
            "start": hit["entity"]["start_time"],
            "end": hit["entity"]["end_time"],
            "score": hit["distance"]
        } for hit in results[0]]

    def format_timestamp(self, seconds: float) -> str:
        """Convert seconds to HH:MM:SS"""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        if hours > 0:
            return f"{hours}:{minutes:02d}:{secs:02d}"
        return f"{minutes}:{secs:02d}"

# Usage
search = VideoSearch()
results = search.search("how to configure the database connection")

for r in results:
    start = search.format_timestamp(r['start'])
    end = search.format_timestamp(r['end'])
    print(f"[{start} - {end}] ({r['type']})")
    print(f"  {r['content']}")
    print(f"  Score: {r['score']:.3f}")
    print()

Video Processing Pipeline

Audio Processing (Transcription)

python
import whisper
import subprocess

def extract_audio(video_path: str, audio_path: str):
    """Extract audio track from video."""
    subprocess.run([
        'ffmpeg', '-i', video_path, '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1', audio_path, '-y'
    ], check=True)

def transcribe_audio(audio_path: str, segment_length: int = 30):
    """Transcribe and segment audio."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)

    segments = []
    current_segment = {"text": "", "start": 0, "end": 0}

    for segment in result["segments"]:
        if segment["end"] - current_segment["start"] > segment_length:
            if current_segment["text"]:
                segments.append(current_segment)
            current_segment = {
                "text": segment["text"],
                "start": segment["start"],
                "end": segment["end"]
            }
        else:
            current_segment["text"] += " " + segment["text"]
            current_segment["end"] = segment["end"]

    if current_segment["text"]:
        segments.append(current_segment)

    return segments
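
To make these segments searchable, each one can be embedded with the same BGE model used by VideoSearch and inserted into the Milvus collection. A minimal sketch building on the class and function above; index_transcript is an illustrative helper name:

python
import uuid

def index_transcript(search: VideoSearch, video_path: str, segments: list):
    """Embed transcript segments and store them as content_type='transcript'."""
    rows = []
    for seg in segments:
        rows.append({
            "id": uuid.uuid4().hex,
            "video_path": video_path,
            "content_type": "transcript",
            "content": seg["text"].strip(),
            "start_time": float(seg["start"]),
            "end_time": float(seg["end"]),
            "embedding": search.text_model.encode(seg["text"]).tolist(),
        })
    if rows:
        search.client.insert(collection_name=search.collection_name, data=rows)

# Usage: index_transcript(VideoSearch(), "meeting.mp4", transcribe_audio("meeting.wav"))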

Frame Extraction

python
import cv2

def extract_keyframes(video_path: str, interval_seconds: int = 30):
    """Extract keyframes at regular intervals."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = max(int(fps * interval_seconds), 1)  # guard against fps=0 on unreadable streams

    frames = []
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            timestamp = frame_count / fps
            frames.append({
                "frame": frame,
                "timestamp": timestamp
            })

        frame_count += 1

    cap.release()
    return frames
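
Keyframes can be embedded with CLIP through sentence-transformers. Note that clip-ViT-B-32 produces 512-dimensional vectors, which do not fit the 1024-dimensional embedding field defined earlier, so in this sketch frame vectors would go into a separate collection declared with dim=512 (collection handling omitted; the helper name is illustrative):

python
import cv2
from PIL import Image
from sentence_transformers import SentenceTransformer

clip_model = SentenceTransformer("clip-ViT-B-32")  # joint image/text embeddings, dim 512

def encode_keyframes(frames: list) -> list:
    """Convert OpenCV BGR frames to PIL images and embed them with CLIP."""
    encoded = []
    for f in frames:
        image = Image.fromarray(cv2.cvtColor(f["frame"], cv2.COLOR_BGR2RGB))
        encoded.append({
            "timestamp": f["timestamp"],
            "embedding": clip_model.encode(image).tolist(),
        })
    return encoded

Insertion into Milvus then mirrors the transcript case, with content_type="frame" and start/end times derived from the keyframe timestamp and the sampling interval.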

Processing Strategy by Video Type

| Video Type | ASR | Keyframes | Segment Length |
|---|---|---|---|
| Tutorials | ✅ Primary | Every 30s | 30s |
| Meetings | ✅ Primary | Every 60s | 60s |
| Surveillance | ❌ Skip | Every 5s | 10s |
| Movies/Shows | ✅ Subtitles | Every 10s | 30s |
| Sports | ⚠️ Commentary | Every 3s | 15s |
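
If it helps, the table translates directly into a lookup used when configuring processing. The values come from the table above; the dict itself is illustrative:

python
# Per-type presets: whether to run ASR, keyframe interval, and segment length (seconds)
PROCESSING_PRESETS = {
    "tutorial":     {"use_asr": True,  "keyframe_interval_s": 30, "segment_length_s": 30},
    "meeting":      {"use_asr": True,  "keyframe_interval_s": 60, "segment_length_s": 60},
    "surveillance": {"use_asr": False, "keyframe_interval_s": 5,  "segment_length_s": 10},
    "movie":        {"use_asr": True,  "keyframe_interval_s": 10, "segment_length_s": 30},
    "sports":       {"use_asr": True,  "keyframe_interval_s": 3,  "segment_length_s": 15},
}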

Common Pitfalls

❌ Pitfall 1: Processing Full Resolution

Problem: Processing takes forever on 4K videos

Why: Video processing is compute-intensive

Fix: Downscale for processing

python
# Extract at 720p for processing
subprocess.run([
    'ffmpeg', '-i', input_path, '-vf', 'scale=-1:720',
    output_path, '-y'
], check=True)

❌ Pitfall 2: Too Many Keyframes

Problem: Storage explodes with frequent keyframes

Why: Every 5 seconds on a 2-hour video = 1440 frames

Fix: Use scene detection or longer intervals

python
# Scene-change detection
from scenedetect import detect, ContentDetector
scenes = detect(video_path, ContentDetector())
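
detect() returns a list of (start, end) FrameTimecode pairs; grabbing one representative frame per scene keeps storage proportional to content changes rather than runtime. A sketch building on the snippet above (PySceneDetect 0.6 API; the helper name is illustrative):

python
import cv2
from scenedetect import detect, ContentDetector

def extract_scene_frames(video_path: str) -> list:
    """Grab the middle frame of each detected scene."""
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, end in scenes:
        midpoint = (start.get_seconds() + end.get_seconds()) / 2
        cap.set(cv2.CAP_PROP_POS_MSEC, midpoint * 1000)  # seek to the scene midpoint
        ret, frame = cap.read()
        if ret:
            frames.append({"frame": frame, "timestamp": midpoint})
    cap.release()
    return frames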

❌ Pitfall 3: Ignoring ASR Errors

Problem: Transcription errors make search miss results

Why: Speech recognition isn't perfect

Fix: Store both raw and corrected transcripts, or use phonetic search
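
One way to keep both versions is to add an extra raw text field to the schema sketched earlier and embed only the corrected text. Illustrative only; the schema above would need the additional field (e.g. schema.add_field("raw_content", DataType.VARCHAR, max_length=65535)):

python
import uuid

def index_corrected_segment(search, video_path, raw_text, corrected_text, start, end):
    """Store original ASR output alongside a corrected version; embed the corrected text."""
    search.client.insert(collection_name=search.collection_name, data=[{
        "id": uuid.uuid4().hex,
        "video_path": video_path,
        "content_type": "transcript",
        "content": corrected_text,        # used for embedding and search
        "raw_content": raw_text,          # kept for display and debugging
        "start_time": float(start),
        "end_time": float(end),
        "embedding": search.text_model.encode(corrected_text).tolist(),
    }])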

❌ Pitfall 4: No Timestamp Indexing

Problem: Can't quickly seek to result in video player

Why: Only the content was stored, not the timestamps

Fix: Always store start_time and end_time for each segment
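
With both timestamps stored, each result can be turned into a directly seekable link. HTML5 players understand Media Fragments syntax (#t=start,end), so a link built from the stored fields opens the player at the matched segment:

python
def result_link(result: dict) -> str:
    """Build a URL that seeks straight to the matched segment (Media Fragments #t=start,end)."""
    return f"{result['video']}#t={result['start']:.0f},{result['end']:.0f}"

# e.g. result_link(results[0]) -> "meeting.mp4#t=90,120"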

When to Level Up

| Need | Upgrade To |
|---|---|
| Search by uploading image | Combine with image-search |
| Q&A on video content | Add RAG layer |
| Real-time streaming | Consider specialized tools |
| Speaker identification | Add speaker diarization |

References

  • Frame extraction tools: references/frame-sampling.md
  • ASR models: Whisper, FunASR, Azure Speech
  • Batch processing: core:ray