Multimodal RAG
Handle Q&A on documents containing images, tables, charts, and text — answer questions that require understanding both visual and textual content.
When to Activate
Activate this skill when:
- User has documents with images (PDFs with charts, manuals with diagrams)
- User needs to answer questions about visual content in documents
- User mentions "PDF with images", "document with charts", "visual Q&A"
- User's documents have tables, flowcharts, or screenshots
Do NOT activate when:
- User only needs image similarity search → use image-search
- User only needs text-based RAG → use rag-toolkit:rag
- User needs video search → use video-search
Interactive Flow
Step 1: Understand Document Type
"What type of documents are you processing?"
A) Technical manuals (product docs, installation guides)
- Mix of text and diagrams
- Questions like "How do I install component X?"
B) Reports with charts (financial, research, analytics)
- Data visualizations, tables
- Questions like "What was Q3 revenue?"
C) Mixed content (presentations, marketing materials)
- Varied image types
- Diverse question types
Which describes your documents? (A/B/C)
Step 2: Choose Processing Strategy
"How should we handle images?"
| Strategy | When to Use |
|---|---|
| VLM Description | Charts, diagrams, complex visuals |
| OCR Extraction | Screenshots with text, tables |
| Caption Only | Photos, simple images |
For most cases, VLM Description is recommended.
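The three strategies can sit behind a single dispatch function. A minimal sketch, assuming an OpenAI client for the VLM and caption paths and pytesseract as one possible OCR backend (none of these choices are mandated by the skill):

```python
import base64
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def image_to_text(image_path: str, strategy: str = "vlm") -> str:
    """Turn an image into searchable text using the chosen strategy."""
    if strategy == "ocr":
        # Screenshots and tables: plain OCR is cheap and often sufficient.
        return pytesseract.image_to_string(Image.open(image_path))
    with open(image_path, "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode()
    prompt = (
        "Describe this image in detail, including any data it shows."  # charts, diagrams
        if strategy == "vlm"
        else "Write a one-sentence caption for this image."  # photos, simple images
    )
    response = client.chat.completions.create(
        model="gpt-4o" if strategy == "vlm" else "gpt-4o-mini",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
        max_tokens=500,
    )
    return response.choices[0].message.content
```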
Step 3: Confirm Configuration
"Based on your requirements:
- Text extraction: PyMuPDF
- Image processing: GPT-4o for descriptions
- Embedding: text-embedding-3-small (1536 dim)
- Answer model: GPT-4o with vision
Proceed? (yes / adjust [what])"
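A small configuration object makes these choices easy to adjust later. A sketch only; the field names are illustrative and not used by the implementation below:

```python
from dataclasses import dataclass

@dataclass
class MultimodalRAGConfig:
    """Illustrative defaults matching the configuration above."""
    text_extractor: str = "pymupdf"                   # PDF text extraction
    vision_model: str = "gpt-4o"                      # image descriptions and answers
    embedding_model: str = "text-embedding-3-small"
    embedding_dim: int = 1536                         # must match the collection schema
    image_strategy: str = "vlm"                       # "vlm" / "ocr" / "caption"
```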
Core Concepts
Mental Model: Illustrated Encyclopedia
Think of multimodal RAG as building a searchable illustrated encyclopedia:
- Extract all content: text paragraphs AND image descriptions
- Index everything: both become searchable vectors
- Retrieve mixed results: a query may return both text and images
- Generate answer: a VLM combines all retrieved context to answer
```
Multimodal RAG Pipeline

Document (PDF with images)
        │
        ├────────────────────────────┐
        ▼                            ▼
  Text Chunks                     Images
  "Section 1 describes..."        [diagram.png] [chart.png]
        │                            │
        │                            ▼
        │                       VLM Caption
        │                       "This chart shows..."
        │                            │
        ▼                            ▼
  Unified Text Embeddings
  (both text chunks and image captions)
        │
        ▼
      Milvus
        │
        ▼
  Query: "What does the chart show?"
        │
        ▼
  Retrieved: [text chunk] + [image caption + image]
        │
        ▼
  VLM Answer Generation (with images)
  "Based on the chart, revenue increased..."
```
Why Multimodal Matters
| Question Type | Text-Only RAG | Multimodal RAG |
|---|---|---|
| "What does paragraph 3 say?" | ✅ Works | ✅ Works |
| "What does the chart show?" | ❌ Can't see | ✅ Understands |
| "Summarize the diagram" | ❌ Can't see | ✅ Describes |
| "What's the value in the table?" | ⚠️ If OCR'd | ✅ Reads directly |
Implementation
```python
from pymilvus import MilvusClient, DataType
from openai import OpenAI
import fitz  # PyMuPDF
import base64
import mimetypes
import os
import uuid
class MultimodalRAG:
def __init__(self, uri: str = "./milvus.db"):
self.client = MilvusClient(uri=uri)
self.openai = OpenAI()
self.collection_name = "multimodal_rag"
self._init_collection()
def _init_collection(self):
if self.client.has_collection(self.collection_name):
return
schema = self.client.create_schema()
schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
schema.add_field("content_type", DataType.VARCHAR, max_length=16) # text/image
schema.add_field("content", DataType.VARCHAR, max_length=65535)
schema.add_field("image_path", DataType.VARCHAR, max_length=512)
schema.add_field("source", DataType.VARCHAR, max_length=512)
schema.add_field("page", DataType.INT32)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1536)
index_params = self.client.prepare_index_params()
index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
self.client.create_collection(
collection_name=self.collection_name,
schema=schema,
index_params=index_params
)
def _embed(self, text: str) -> list:
"""Generate embedding using OpenAI API."""
response = self.openai.embeddings.create(
model="text-embedding-3-small",
input=[text]
)
return response.data[0].embedding
def _describe_image(self, image_path: str) -> str:
"""Generate description of image using VLM."""
        with open(image_path, "rb") as f:
            b64_image = base64.standard_b64encode(f.read()).decode()
        # Match the data-URL MIME type to the actual image file type
        mime = mimetypes.guess_type(image_path)[0] or "image/png"
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. If it's a chart or diagram, explain what it shows. If there's text, include it."},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64_image}"}}
]
}],
max_tokens=1000
)
return response.choices[0].message.content
def ingest_pdf(self, pdf_path: str, image_output_dir: str = "./images"):
"""Process PDF and index text + images."""
os.makedirs(image_output_dir, exist_ok=True)
doc = fitz.open(pdf_path)
source = os.path.basename(pdf_path)
data = []
for page_num, page in enumerate(doc):
# Extract text
text = page.get_text()
if text.strip():
chunks = self._chunk_text(text)
for chunk in chunks:
data.append({
"id": str(uuid.uuid4()),
"content_type": "text",
"content": chunk,
"image_path": "",
"source": source,
"page": page_num + 1,
"embedding": self._embed(chunk)
})
# Extract images
images = page.get_images()
for img_idx, img in enumerate(images):
xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                # Keep the real image format so the saved file matches its bytes
                image_ext = base_image.get("ext", "png")
                image_path = f"{image_output_dir}/{source}_p{page_num+1}_img{img_idx}.{image_ext}"
with open(image_path, "wb") as f:
f.write(image_bytes)
# Generate description
description = self._describe_image(image_path)
data.append({
"id": str(uuid.uuid4()),
"content_type": "image",
"content": description,
"image_path": image_path,
"source": source,
"page": page_num + 1,
"embedding": self._embed(description)
})
self.client.insert(collection_name=self.collection_name, data=data)
return len(data)
def _chunk_text(self, text: str, chunk_size: int = 500) -> list:
"""Split text into chunks."""
words = text.split()
chunks = []
current_chunk = []
for word in words:
current_chunk.append(word)
if len(" ".join(current_chunk)) > chunk_size:
chunks.append(" ".join(current_chunk))
current_chunk = []
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def retrieve(self, query: str, limit: int = 10) -> list:
"""Retrieve relevant text and images."""
embedding = self._embed(query)
results = self.client.search(
collection_name=self.collection_name,
data=[embedding],
limit=limit,
output_fields=["content_type", "content", "image_path", "source", "page"]
)
return [{
"type": hit["entity"]["content_type"],
"content": hit["entity"]["content"],
"image_path": hit["entity"]["image_path"],
"source": hit["entity"]["source"],
"page": hit["entity"]["page"],
"score": hit["distance"]
} for hit in results[0]]
def query(self, question: str, use_images: bool = True) -> dict:
"""Answer question using retrieved context."""
contexts = self.retrieve(question, limit=10)
text_contexts = [c for c in contexts if c["type"] == "text"]
image_contexts = [c for c in contexts if c["type"] == "image"]
# Build message with text and images
messages = [{"role": "user", "content": []}]
# Add text context
context_text = "\n\n".join([
f"[{c['source']} Page {c['page']}]\n{c['content']}"
for c in text_contexts[:5]
])
messages[0]["content"].append({
"type": "text",
"text": f"Context:\n{context_text}\n"
})
# Add images
if use_images and image_contexts:
for img in image_contexts[:3]:
                if os.path.exists(img["image_path"]):
                    with open(img["image_path"], "rb") as f:
                        b64 = base64.standard_b64encode(f.read()).decode()
                    mime = mimetypes.guess_type(img["image_path"])[0] or "image/png"
                    messages[0]["content"].append({
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{b64}"}
})
# Add question
messages[0]["content"].append({
"type": "text",
"text": f"\nQuestion: {question}\nAnswer based on the provided context and images:"
})
response = self.openai.chat.completions.create(
model="gpt-4o",
messages=messages,
temperature=0.3
)
return {
"answer": response.choices[0].message.content,
"sources": list(set(c["source"] for c in contexts)),
"pages": list(set(c["page"] for c in contexts))
}
# Usage
rag = MultimodalRAG()
# Ingest PDF with images
rag.ingest_pdf("product_manual.pdf")
# Ask questions
result = rag.query("What are the installation steps shown in the diagram?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
Processing Strategy by Document Type
| Document Type | Text Extraction | Image Processing | VLM Prompt |
|---|---|---|---|
| Technical manual | Full text | Diagram description | "Explain what this diagram shows step by step" |
| Financial report | Full text | Chart data extraction | "Extract all data points from this chart" |
| Medical report | Full text | Image analysis | "Describe any medical findings visible" |
| Presentation | Slide text | Screenshot description | "Summarize what this slide conveys" |
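These prompts can be applied at ingestion time with a simple lookup. A sketch; the dictionary keys and the fallback prompt are illustrative:

```python
# Map document type to the VLM prompt used when describing its images.
DOC_TYPE_PROMPTS = {
    "technical_manual": "Explain what this diagram shows step by step.",
    "financial_report": "Extract all data points from this chart.",
    "medical_report": "Describe any medical findings visible.",
    "presentation": "Summarize what this slide conveys.",
}

def prompt_for(doc_type: str) -> str:
    """Fall back to a generic description prompt for unknown document types."""
    return DOC_TYPE_PROMPTS.get(
        doc_type,
        "Describe this image in detail. If it's a chart or diagram, explain what it shows.",
    )
```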
Common Pitfalls
❌ Pitfall 1: Images Too Small
Problem: VLM can't read chart text
Why: PDF images extracted at low resolution
Fix: Extract at higher resolution
```python
mat = fitz.Matrix(2.0, 2.0)  # render at 2x zoom
pix = page.get_pixmap(matrix=mat)
```
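If the embedded images are low-resolution, or a chart is drawn as vector graphics and never appears in `page.get_images()`, an alternative is to render whole pages as high-resolution images and describe those instead. A rough sketch using PyMuPDF; the function name and output directory are illustrative:

```python
import os
import fitz  # PyMuPDF

def render_pages(pdf_path: str, out_dir: str = "./pages", zoom: float = 2.0) -> list:
    """Render each page to a high-resolution PNG for VLM processing."""
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    paths = []
    for page_num, page in enumerate(doc):
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))  # scale both axes
        path = f"{out_dir}/page_{page_num + 1}.png"
        pix.save(path)
        paths.append(path)
    return paths
```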
❌ Pitfall 2: Ignoring Image-Text Connection
Problem: Image description doesn't mention surrounding context
Why: Image processed in isolation
Fix: Include nearby text in prompt
prompt = f"This image appears near the text: '{nearby_text}'. Describe the image and how it relates to this text."
❌ Pitfall 3: Too Many API Calls
Problem: Processing costs explode
Why: Calling VLM for every image
Fix: Batch processing, caching, or use cheaper models
```python
# Use gpt-4o-mini for the initial pass
# Only use gpt-4o for complex charts
```
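Caching helps as much as model choice: the same image should never be described twice. A minimal sketch keyed on the image bytes; `describe_fn` is any image-to-description callable (for example `_describe_image` above), and the cache file path is illustrative:

```python
import hashlib
import json
import os

CACHE_PATH = "./description_cache.json"

def cached_describe(image_path: str, describe_fn) -> str:
    """Reuse a stored description when the exact same image was seen before."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    with open(image_path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()  # content hash, not filename
    if key not in cache:
        cache[key] = describe_fn(image_path)        # only call the VLM on a miss
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[key]
```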
❌ Pitfall 4: Not Returning Images in Answer
Problem: User asks about chart but can't see it
Why: Only returning text answer
Fix: Include image references in response
```python
return {
    "answer": answer,
    "referenced_images": [c["image_path"] for c in image_contexts[:3]]
}
```
When to Level Up
| Need | Upgrade To |
|---|---|
| Better chunking | Add core:chunking |
| Higher precision | Add core:rerank |
| Video content | video-search |
| Pure text documents | rag-toolkit:rag |
References
- VLM models: GPT-4o, Claude 3, Qwen-VL, LLaVA
- PDF processing: PyMuPDF, pdf2image
- Batch processing: core:ray