Vision & Multimodal AI
Multimodal models process multiple input types: text, images, audio, video.
Vision capabilities
| Task | Description |
|---|---|
| OCR | Extract text from images |
| Description | Describe visual content |
| Analysis | Identify objects/people |
| Comparison | Compare images |
| Diagrams | Understand charts/graphs |
Multimodal models
| Model | Provider | Input |
|---|---|---|
| GPT-4V | OpenAI | Text + Image |
| Claude 3 | Anthropic | Text + Image |
| Gemini Pro | Google | Text + Image + Audio |
| LLaVA | Open Source | Text + Image |
Send image to Claude
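import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

// Base64-encode a local JPEG ("photo.jpg" is a placeholder path).
const base64Image = fs.readFileSync("photo.jpg").toString("base64");

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/jpeg",
          data: base64Image
        }
      },
      {
        type: "text",
        text: "What's in this image?"
      }
    ]
  }]
});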
Send image to OpenAI
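import base64
from openai import OpenAI

# Reads OPENAI_API_KEY from the environment by default.
client = OpenAI()

# Base64-encode a local JPEG ("photo.jpg" is a placeholder path).
with open("photo.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }]
)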
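Both requests embed the image as base64 and declare its media type, and they differ only in the shape of the content part. The following is a minimal sketch of a helper that prepares either shape from one file. The names `encode_image`, `anthropic_image_block`, and `openai_image_part` are hypothetical, and the set of accepted media types is an assumption to check against each provider's documentation.

```python
import base64
import mimetypes
from pathlib import Path

# Assumed set of accepted formats; verify against the provider's documentation.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def encode_image(path: str) -> tuple:
    """Return (media_type, base64_data) for an image file on disk."""
    # The media type is guessed from the file extension, not the file contents.
    media_type, _ = mimetypes.guess_type(path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type for {path!r}: {media_type}")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return media_type, data


def anthropic_image_block(media_type: str, data: str) -> dict:
    """Content block in the shape used by the Claude request above."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


def openai_image_part(media_type: str, data: str) -> dict:
    """Content part in the shape used by the OpenAI request above (a data URL)."""
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{data}"},
    }
```

Guessing the type from the extension keeps the sketch dependency-free; a production service would more likely inspect the file's bytes before trusting it.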
Real use cases
| Application | Example |
|---|---|
| Accessibility | Describe images for blind users |
| Documents | Extract data from invoices |
| Retail | Analyze products in photos |
| Health | Preliminary X-ray analysis |
| Security | Content detection |
Image classification models
For specific tasks, use specialized models:
from transformers import pipeline

# Classification (no model is named, so the pipeline loads its default checkpoint)
classifier = pipeline("image-classification")
result = classifier("cat.jpg")
# → illustrative output: [{"label": "cat", "score": 0.99}]

# Object detection
detector = pipeline("object-detection")
objects = detector("street.jpg")
# → illustrative output: [{"label": "car", "box": {...}}]
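Detector output usually needs post-processing before it is useful. As a rough sketch, assuming each detection is a dict with `label`, `score`, and `box` keys (the shape returned by the object-detection pipeline), the hypothetical helpers below drop low-confidence detections and count what remains. The sample data is made up for illustration.

```python
from collections import Counter
from typing import Optional


def filter_detections(detections: list, min_score: float = 0.9,
                      labels: Optional[set] = None) -> list:
    """Keep detections at or above min_score, optionally restricted to given labels."""
    return [
        d for d in detections
        if d["score"] >= min_score and (labels is None or d["label"] in labels)
    ]


def count_by_label(detections: list) -> Counter:
    """Count detections per label."""
    return Counter(d["label"] for d in detections)


# Hypothetical detector output, for illustration only.
sample = [
    {"label": "car", "score": 0.98, "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 90}},
    {"label": "person", "score": 0.55, "box": {"xmin": 200, "ymin": 40, "xmax": 240, "ymax": 160}},
    {"label": "car", "score": 0.91, "box": {"xmin": 300, "ymin": 25, "xmax": 420, "ymax": 95}},
]

confident = filter_detections(sample, min_score=0.9)
print(count_by_label(confident))  # Counter({'car': 2})
```

The 0.9 threshold is arbitrary; the right value depends on how costly false positives are for the task.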
🏦 Fintech Case: KYC Verification with Vision
Know Your Customer (KYC) requires identity document verification. Vision AI automates this process:
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

interface KYCResult {
  documentType: 'passport' | 'id_card' | 'drivers_license' | 'unknown';
  extractedData: {
    fullName?: string;
    documentNumber?: string;
    expiryDate?: string;
    nationality?: string;
  };
  validationChecks: {
    isExpired: boolean;
    formatValid: boolean;
    photoDetected: boolean;
  };
  confidence: number;
  requiresManualReview: boolean;
}

async function verifyKYCDocument(imageBase64: string): Promise<KYCResult> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/jpeg", data: imageBase64 }
        },
        {
          type: "text",
          text: `Analyze this identity document for KYC. Extract:
1. Document type (passport, id_card, drivers_license)
2. Full name
3. Document number
4. Expiry date
5. Nationality
Verify:
- Is the document expired?
- Does the format appear valid?
- Is a photo of the holder detected?
Respond ONLY in JSON with this format:
{
"documentType": "...",
"extractedData": {...},
"validationChecks": {...},
"confidence": 0.0-1.0,
"requiresManualReview": true/false
}
IMPORTANT: If there's ANY doubt about authenticity, set requiresManualReview: true`
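        }
      ]
    }]
  });

  // The first content block should be text; narrow the type before reading it
  const block = response.content[0];
  if (block.type !== "text") {
    throw new Error("Expected a text block in the model response");
  }

  // Parse response (JSON.parse throws if the model returned invalid JSON)
  const result: KYCResult = JSON.parse(block.text);

  // Business rule: low confidence = manual review
  if (result.confidence < 0.85) {
    result.requiresManualReview = true;
  }

  // Audit log (no sensitive data); auditLog is assumed to be defined elsewhere in the application
  await auditLog({
    action: 'KYC_VERIFICATION',
    documentType: result.documentType,
    confidence: result.confidence,
    requiresManualReview: result.requiresManualReview,
    timestamp: new Date().toISOString()
  });

  return result;
}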
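The function above trusts the model's JSON as-is. One way to harden that step, sketched here in Python, is to parse defensively and recompute whatever can be checked deterministically. Everything in this block is an assumption rather than part of the original flow: the helper name `parse_kyc_response`, the fallback result, and the expectation that `expiryDate` arrives as an ISO 8601 date (`YYYY-MM-DD`), which the prompt does not enforce.

```python
import json
from datetime import date

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON
ALLOWED_TYPES = {"passport", "id_card", "drivers_license", "unknown"}
CONFIDENCE_THRESHOLD = 0.85  # same threshold as the business rule above


def _fallback() -> dict:
    """Result used when the reply cannot be parsed: always sent to manual review."""
    return {"documentType": "unknown", "extractedData": {}, "validationChecks": {},
            "confidence": 0.0, "requiresManualReview": True}


def parse_kyc_response(raw: str, today: date) -> dict:
    """Parse the model's reply and apply deterministic checks on top of it."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[-1].rsplit(FENCE, 1)[0]
    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        return _fallback()
    if not isinstance(result, dict):
        return _fallback()

    # A missing flag defaults to manual review.
    review = bool(result.get("requiresManualReview", True))

    if result.get("documentType") not in ALLOWED_TYPES:
        result["documentType"] = "unknown"
        review = True

    confidence = result.get("confidence")
    valid = (isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
             and 0.0 <= confidence <= 1.0)
    result["confidence"] = float(confidence) if valid else 0.0
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        review = True

    extracted = result.get("extractedData")
    extracted = extracted if isinstance(extracted, dict) else {}
    checks = result.get("validationChecks")
    checks = checks if isinstance(checks, dict) else {}
    try:
        # Recompute expiry locally instead of trusting the model's answer.
        expiry = date.fromisoformat(str(extracted.get("expiryDate")))
        checks["isExpired"] = expiry < today
    except ValueError:
        review = True  # missing or unparseable date: let a human decide

    result["extractedData"] = extracted
    result["validationChecks"] = checks
    result["requiresManualReview"] = review
    return result
```

Passing `today` in as a parameter, rather than calling `date.today()` inside the function, keeps the expiry check easy to test.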
KYC Security Considerations
| Aspect | Recommendation |
|---|---|
| Storage | Encrypt images at rest (AES-256) |
| Retention | Delete images 30-90 days after verification |
| Logs | DO NOT store extracted data in logs |
| Fallback | Always have human review available |
| Regulation | Comply with GDPR/CCPA for biometric data |
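The logging and retention rows can be enforced in code rather than by convention. The sketch below is one possible approach, not a compliance recipe: `build_audit_record` copies only an explicit allowlist of fields into the log entry, and `is_due_for_deletion` flags images older than the retention window. Both function names, the allowlist, and the 90-day default (the upper end of the range in the table) are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Only these fields may reach the audit log; extractedData is deliberately absent.
AUDIT_FIELDS = ("documentType", "confidence", "requiresManualReview")
RETENTION_DAYS = 90  # assumed default: upper end of the 30-90 day window


def build_audit_record(result: dict, now: datetime) -> dict:
    """Build a log entry from allowlisted fields only."""
    record = {field: result.get(field) for field in AUDIT_FIELDS}
    record["action"] = "KYC_VERIFICATION"
    record["timestamp"] = now.isoformat()
    return record


def is_due_for_deletion(verified_at: datetime, now: datetime,
                        retention_days: int = RETENTION_DAYS) -> bool:
    """True once the image has been kept for the full retention period."""
    return now - verified_at >= timedelta(days=retention_days)


# Hypothetical verification result, for illustration only.
result = {
    "documentType": "passport",
    "extractedData": {"fullName": "Jane Doe", "documentNumber": "X1234567"},
    "confidence": 0.93,
    "requiresManualReview": False,
}
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
record = build_audit_record(result, now)
print(sorted(record))  # ['action', 'confidence', 'documentType', 'requiresManualReview', 'timestamp']
```

An allowlist fails safe: a field added to the result later stays out of the logs until someone deliberately adds it to `AUDIT_FIELDS`.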
💡 Vision AI accelerates KYC from days to minutes, but always keep a human in the loop for low-confidence cases.
Practice
→ Image Classifier → Multimodal App