Vision & Multimodal AI

Multimodal models accept more than one type of input, such as text, images, audio, and video.

Vision capabilities

Task	Description
OCR	Extract text from images
Description	Describe visual content
Analysis	Identify objects/people
Comparison	Compare images
Diagrams	Understand charts/graphs

Multimodal models

Model	Provider	Input
GPT-4V	OpenAI	Text + Image
Claude 3	Anthropic	Text + Image
Gemini Pro	Google	Text + Image + Audio
LLaVA	Open Source	Text + Image

Send an image to Claude

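import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic();

// base64Image: the image bytes encoded as a base64 string (no "data:" prefix)
const response = await anthropic.messages.create({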
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/jpeg",
          data: base64Image
        }
      },
      {
        type: "text",
        text: "What's in this image?"
      }
    ]
  }]
});
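
The request above assumes `base64Image` already exists. A minimal sketch of one way to build the image block from a file on disk, assuming Node.js; the helper names (`mediaTypeFor`, `imageBlock`, `imageBlockFromFile`) are hypothetical and not part of the SDK:

```typescript
import { readFileSync } from "node:fs";
import { extname } from "node:path";

// Image formats accepted for base64 image input
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

const MEDIA_TYPES: Record<string, ImageMediaType> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Hypothetical helper: infer the media type from the file extension
function mediaTypeFor(path: string): ImageMediaType {
  const mediaType = MEDIA_TYPES[extname(path).toLowerCase()];
  if (!mediaType) throw new Error(`Unsupported image extension: ${path}`);
  return mediaType;
}

// Hypothetical helper: wrap raw bytes in the content-block shape used above
function imageBlock(bytes: Buffer, mediaType: ImageMediaType) {
  return {
    type: "image" as const,
    source: {
      type: "base64" as const,
      media_type: mediaType,
      data: bytes.toString("base64"),
    },
  };
}

// Hypothetical helper: read a file and build the block in one step
function imageBlockFromFile(path: string) {
  return imageBlock(readFileSync(path), mediaTypeFor(path));
}
```

The result of `imageBlockFromFile("photo.jpg")` can be placed directly in the `content` array in place of the hand-written image object.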

Send an image to OpenAI

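from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment
client = OpenAI()

# base64_image: the image bytes encoded as a base64 string
response = client.chat.completions.create(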
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }]
)
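
The two requests carry the same image in different shapes: Claude takes a `source` object with separate `media_type` and `data` fields, while the OpenAI request takes a single data URL. A small sketch of converting one into the other, using only the fields shown in the two examples; the interface and function names are hypothetical:

```typescript
// Shape of the image block in the Claude example
interface AnthropicImageBlock {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
}

// Shape of the image part in the OpenAI example
interface OpenAIImagePart {
  type: "image_url";
  image_url: { url: string };
}

// Hypothetical helper: fold media type and base64 data into one data URL
function toOpenAIImagePart(block: AnthropicImageBlock): OpenAIImagePart {
  const { media_type, data } = block.source;
  return {
    type: "image_url",
    image_url: { url: `data:${media_type};base64,${data}` },
  };
}
```

A converter like this can help when the same application has to support both providers from one image-loading path.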

Real use cases

Application	Example
Accessibility	Describe images for blind users
Documents	Extract data from invoices
Retail	Analyze products in photos
Health	Preliminary X-ray analysis
Security	Detect unsafe or prohibited content in images

Image classification models

For narrow, well-defined tasks such as classification or object detection, use a specialized model:

from transformers import pipeline

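# Classification (no model argument: a default checkpoint is downloaded; pin one for production)
classifier = pipeline("image-classification")
result = classifier("cat.jpg")
# → a list of label/score dicts, e.g. [{"label": "cat", "score": 0.99}]; exact labels depend on the model

# Object detection
detector = pipeline("object-detection")
objects = detector("street.jpg")
# → e.g. [{"label": "car", "score": ..., "box": {...}}]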

🏦 Fintech Case: KYC Verification with Vision

Know Your Customer (KYC) checks require verifying identity documents. A vision model can automate the first pass of this process:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

interface KYCResult {
  documentType: 'passport' | 'id_card' | 'drivers_license' | 'unknown';
  extractedData: {
    fullName?: string;
    documentNumber?: string;
    expiryDate?: string;
    nationality?: string;
  };
  validationChecks: {
    isExpired: boolean;
    formatValid: boolean;
    photoDetected: boolean;
  };
  confidence: number;
  requiresManualReview: boolean;
}

async function verifyKYCDocument(imageBase64: string): Promise<KYCResult> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/jpeg", data: imageBase64 }
        },
        {
          type: "text",
          text: `Analyze this identity document for KYC. Extract:
1. Document type (passport, id_card, drivers_license)
2. Full name
3. Document number
4. Expiry date
5. Nationality

Verify:
- Is the document expired?
- Does the format appear valid?
- Is a photo of the holder detected?

Respond ONLY in JSON with this format:
{
  "documentType": "...",
  "extractedData": {...},
  "validationChecks": {...},
  "confidence": 0.0-1.0,
  "requiresManualReview": true/false
}

IMPORTANT: If there's ANY doubt about authenticity, set requiresManualReview: true`
        }
      ]
    }]
  });

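  // Parse response and validate. The first content block should be text;
  // narrow the union type before reading it. JSON.parse throws if the
  // model returned anything other than plain JSON.
  const block = response.content[0];
  if (!block || block.type !== 'text') {
    throw new Error('Expected a text block in the model response');
  }
  const result = JSON.parse(block.text) as KYCResult;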

  // Business rule: low confidence = manual review
  if (result.confidence < 0.85) {
    result.requiresManualReview = true;
  }

  // Audit log (no sensitive data); auditLog is an application-specific helper defined elsewhere
  await auditLog({
    action: 'KYC_VERIFICATION',
    documentType: result.documentType,
    confidence: result.confidence,
    requiresManualReview: result.requiresManualReview,
    timestamp: new Date().toISOString()
  });

  return result;
}
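
The function above trusts the model's reply as-is, including its own judgment of whether the document is expired. A hedged sketch of checking the reply in code instead: it tolerates JSON wrapped in prose or a Markdown fence, falls back to safe defaults, and recomputes expiry from the extracted date. All names here (`CheckedKYC`, `extractJson`, `isExpired`, `checkKYCReply`) are hypothetical, and the sketch assumes the model was asked to return dates as `YYYY-MM-DD`, which the prompt above does not yet require.

```typescript
type DocumentType = "passport" | "id_card" | "drivers_license" | "unknown";

interface CheckedKYC {
  documentType: DocumentType;
  confidence: number;
  isExpired: boolean | undefined; // undefined = could not be determined
  requiresManualReview: boolean;
}

const DOCUMENT_TYPES: readonly string[] = ["passport", "id_card", "drivers_license", "unknown"];

// Parse the span from the first "{" to the last "}", ignoring any surrounding text
function extractJson(text: string): unknown {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found in model reply");
  return JSON.parse(text.slice(start, end + 1));
}

// Assumption: dates are ISO "YYYY-MM-DD". A document is still valid on its expiry date.
function isExpired(expiryDate: string, today: Date = new Date()): boolean | undefined {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(expiryDate);
  if (!match) return undefined;
  const expiry = Date.UTC(Number(match[1]), Number(match[2]) - 1, Number(match[3]));
  const todayUtc = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  return expiry < todayUtc;
}

function checkKYCReply(text: string, today: Date = new Date()): CheckedKYC {
  const raw = extractJson(text) as Record<string, any>;

  const documentType: DocumentType = DOCUMENT_TYPES.includes(raw.documentType)
    ? (raw.documentType as DocumentType)
    : "unknown";

  // Anything that is not a number in [0, 1] counts as zero confidence
  const confidence =
    typeof raw.confidence === "number" && raw.confidence >= 0 && raw.confidence <= 1
      ? raw.confidence
      : 0;

  const expiryDate = raw.extractedData?.expiryDate;
  const expired = typeof expiryDate === "string" ? isExpired(expiryDate, today) : undefined;

  // Review unless the model explicitly cleared it AND every local check passed
  const requiresManualReview =
    raw.requiresManualReview !== false ||
    confidence < 0.85 ||
    documentType === "unknown" ||
    expired === undefined;

  return { documentType, confidence, isExpired: expired, requiresManualReview };
}
```

The 0.85 threshold mirrors the business rule in the function above. Defaulting to manual review whenever a field is missing or malformed keeps failures on the cautious side.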

KYC Security Considerations

Aspect	Recommendation
Storage	Encrypt images at rest (AES-256)
Retention	Delete images after verification, within a defined retention window (e.g. 30-90 days)
Logs	DO NOT store extracted data in logs
Fallback	Always have human review available
Regulation	Comply with GDPR/CCPA for biometric data
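
As an illustration of the storage recommendation, here is a minimal sketch of encrypting an image buffer with AES-256-GCM using Node's built-in `crypto` module. The `EncryptedImage` shape and function names are hypothetical, and key management (for example, fetching the key from a secrets manager) is assumed to happen elsewhere.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical record: everything needed to decrypt later, except the key
interface EncryptedImage {
  iv: Buffer;
  authTag: Buffer;
  ciphertext: Buffer;
}

function encryptImage(image: Buffer, key: Buffer): EncryptedImage {
  if (key.length !== 32) throw new Error("AES-256 needs a 32-byte key");
  const iv = randomBytes(12); // fresh 96-bit IV for every image
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(image), cipher.final()]);
  return { iv, authTag: cipher.getAuthTag(), ciphertext };
}

function decryptImage(encrypted: EncryptedImage, key: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, encrypted.iv);
  decipher.setAuthTag(encrypted.authTag);
  // final() throws if the ciphertext or auth tag was tampered with
  return Buffer.concat([decipher.update(encrypted.ciphertext), decipher.final()]);
}
```

GCM is used here because it also authenticates the data, so a modified file fails to decrypt instead of yielding corrupted bytes. Many teams rely on storage-level encryption from their cloud provider instead, which is a reasonable alternative.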

💡 Vision AI can cut KYC turnaround from days to minutes, but always keep a human in the loop for low-confidence cases.

Practice

→ Image Classifier
→ Multimodal App