
# Operations Overview

This document covers monitoring, logging, and operational procedures for Pebble DMS.


## Monitoring Architecture

```mermaid
flowchart TB
    subgraph Services
        A[API]
        B[Workers]
        C[PostgreSQL]
        D[Qdrant]
        E[Meilisearch]
    end

    subgraph Observability
        F[Prometheus]
        G[Grafana]
        H[Loki]
        I[Alertmanager]
    end

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F

    F --> G
    F --> I

    A --> H
    B --> H

    I --> J[Slack/Email]
```

## Key Metrics

### API Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `api_request_duration_seconds` | Latency histogram | p99 > 1s |
| `api_requests_total` | Request count | - |
| `api_errors_total` | Error count | > 10/min |
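
For reference, a minimal sketch of exporting these three metrics from FastAPI middleware with `prometheus_client`; the label set, the 5xx error definition, and the `/metrics` mount are assumptions, not the service's actual instrumentation:

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

# Metric names mirror the table above; the label set is an assumption.
REQUEST_DURATION = Histogram(
    "api_request_duration_seconds", "API request latency", ["method", "path"]
)
REQUESTS = Counter(
    "api_requests_total", "API request count", ["method", "path", "status"]
)
ERRORS = Counter("api_errors_total", "API error count", ["method", "path"])

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrape target

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_DURATION.labels(request.method, request.url.path).observe(
        time.perf_counter() - start
    )
    REQUESTS.labels(request.method, request.url.path, str(response.status_code)).inc()
    if response.status_code >= 500:  # only server errors counted here
        ERRORS.labels(request.method, request.url.path).inc()
    return response
```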

### Worker Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `worker_jobs_total` | Jobs processed | - |
| `worker_jobs_failed` | Failed jobs | > 5% |
| `worker_queue_depth` | Pending jobs | > 1000 |
| `worker_job_duration_seconds` | Processing time | p95 > 300s |
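
`worker_queue_depth` is typically exported by a small standalone exporter rather than the workers themselves. A sketch assuming Celery's default Redis list `celery` as the queue (the same key the runbook below inspects); broker URL, exporter port, and poll interval are placeholders:

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting in the Celery queue")

def main() -> None:
    # Broker URL, port, and interval are placeholders.
    r = redis.Redis.from_url("redis://localhost:6379/0")
    start_http_server(9200)
    while True:
        # Celery's default queue is the Redis list named "celery"
        # (the same key the runbook checks with `redis-cli llen celery`).
        QUEUE_DEPTH.set(r.llen("celery"))
        time.sleep(15)

if __name__ == "__main__":
    main()
```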

### OCR Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `ocr_pages_processed` | Pages OCR'd | - |
| `ocr_confidence_avg` | Average confidence | < 0.7 |
| `ocr_duration_seconds` | Processing time | > 60s/page |
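
`ocr_confidence_avg` cannot be exported directly as a counter; one common pattern is to export a running confidence sum alongside the page counter and divide at query time. A sketch with illustrative names:

```python
from prometheus_client import Counter, Histogram

# Note: prometheus_client appends "_total" to counter names on exposition.
OCR_PAGES = Counter("ocr_pages_processed", "Pages OCR'd")
OCR_DURATION = Histogram("ocr_duration_seconds", "Per-page OCR time")
OCR_CONFIDENCE_SUM = Counter("ocr_confidence_sum", "Running sum of per-page confidences")

def record_page(confidence: float, seconds: float) -> None:
    # Call once per page; the average confidence is derived at query time:
    #   rate(ocr_confidence_sum_total[5m]) / rate(ocr_pages_processed_total[5m])
    OCR_PAGES.inc()
    OCR_DURATION.observe(seconds)
    OCR_CONFIDENCE_SUM.inc(confidence)
```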

## Logging

### Log Format

```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "service": "api",
  "trace_id": "abc123",
  "message": "Document uploaded",
  "document_id": "doc_xyz",
  "filename": "invoice.pdf",
  "size_bytes": 125000
}
```
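
One way to emit this shape with only the standard library; a sketch, with the fixed extra-field list and hard-coded service name as simplifications:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Fields passed via `extra=` land on the record as attributes; the fixed
    # list here and the hard-coded service name are simplifications.
    EXTRA_FIELDS = ("trace_id", "document_id", "filename", "size_bytes")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        entry = {
            "timestamp": ts.replace("+00:00", "Z"),
            "level": record.levelname,
            "service": "api",
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info(
    "Document uploaded",
    extra={"document_id": "doc_xyz", "filename": "invoice.pdf", "size_bytes": 125000},
)
```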

### Log Levels

| Level | Use Case |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Recoverable issues |
| INFO | Key business events |
| DEBUG | Diagnostic details |

## Alerting

### Critical Alerts

| Alert | Condition | Action |
|---|---|---|
| API Down | No response > 1 min | Page on-call |
| Worker Queue Overflow | Queue > 5000 | Scale workers |
| Database Connection Failed | Connection errors | Check DB health |
| Storage Full | Disk > 90% | Expand storage |

### Warning Alerts

| Alert | Condition | Action |
|---|---|---|
| High Latency | p95 > 2s | Investigate |
| OCR Failures High | > 10% failures | Check document quality |
| Low Classification Confidence | Avg < 0.6 | Review model |

## Health Checks

### Endpoints

```text
GET /health          # Basic health
GET /health/ready    # Full readiness
GET /health/live     # Liveness probe
```

### Readiness Checks

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/ready")
async def readiness():
    # Assuming async check helpers (defined elsewhere): they must be awaited,
    # otherwise the coroutine objects are always truthy and the endpoint
    # never reports a failure.
    checks = {
        "database": await check_db(),
        "redis": await check_redis(),
        "qdrant": await check_qdrant(),
        "meilisearch": await check_meilisearch(),
        "storage": await check_minio(),
    }

    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=status_code,
    )
```
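
Each check helper reduces a dependency probe to a bool. One possible shape, shown here for Meilisearch's `/health` endpoint via `httpx` (the base URL and timeout are assumptions):

```python
import httpx

async def check_meilisearch() -> bool:
    # Meilisearch exposes GET /health; base URL and timeout are assumptions.
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get("http://localhost:7700/health")
            return resp.status_code == 200
    except httpx.HTTPError:
        return False
```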

## Backup & Recovery

### Backup Schedule

| Component | Frequency | Retention |
|---|---|---|
| PostgreSQL | Daily | 30 days |
| Qdrant | Daily | 14 days |
| MinIO | Continuous sync | 90 days |

### Backup Procedures

```bash
# PostgreSQL backup
pg_dump -h localhost -U pebble pebble_db | gzip > backup_$(date +%Y%m%d).sql.gz

# Qdrant snapshot
curl -X POST 'http://localhost:6333/collections/documents/snapshots'

# MinIO sync (assumes the `minio` and `s3` aliases are configured in mc)
mc mirror minio/documents s3/backup-bucket/documents
```

### Recovery Procedures

```bash
# PostgreSQL restore
gunzip -c backup_20240115.sql.gz | psql -h localhost -U pebble pebble_db

# Qdrant restore
curl -X PUT 'http://localhost:6333/collections/documents/snapshots/recover' \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///snapshots/documents-snap.snapshot"}'
```

## Scaling Guidelines

### Horizontal Scaling Triggers

| Component | Scale Up When | Scale Down When |
|---|---|---|
| API | CPU > 70%, RPS > 1000 | CPU < 30%, RPS < 200 |
| OCR Worker | Queue > 100 | Queue < 10 |
| Embedding Worker | Queue > 50 | Queue < 5 |
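
In Kubernetes these triggers would normally be encoded as an HPA or KEDA ScaledObject; purely for illustration, here is a hand-rolled sketch of the OCR-worker rule using the official `kubernetes` client. The deployment name, namespace, queue key, replica cap, and poll interval are all assumptions:

```python
import time

import redis
from kubernetes import client, config

def desired_replicas(depth: int, current: int) -> int:
    # Thresholds from the OCR Worker row above; the cap of 10 is an assumption.
    if depth > 100:
        return min(current + 1, 10)
    if depth < 10:
        return max(current - 1, 1)
    return current

def main() -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    r = redis.Redis.from_url("redis://localhost:6379/0")
    while True:
        scale = apps.read_namespaced_deployment_scale("ocr-worker", "pebble")
        current = scale.spec.replicas
        target = desired_replicas(r.llen("ocr"), current)  # queue key assumed
        if target != current:
            apps.patch_namespaced_deployment_scale(
                "ocr-worker", "pebble", {"spec": {"replicas": target}}
            )
        time.sleep(30)

if __name__ == "__main__":
    main()
```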

### Resource Limits

```yaml
# Kubernetes resource requests/limits per component
# (values to apply in each Deployment spec, not a complete manifest)
resources:
  api:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "2Gi"

  worker:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "4000m"
      memory: "8Gi"
```

## Runbooks

### High Queue Depth

1. Check worker health: `kubectl get pods -l app=worker`
2. Check for errors: `kubectl logs -l app=worker --tail=100`
3. Scale workers: `kubectl scale deployment worker --replicas=5`
4. Monitor the queue: `redis-cli llen celery`

### OCR Failures Spike

1. Check error logs: `grep "OCR failed" /var/log/worker.log`
2. Identify the problematic documents
3. Check OCR engine health
4. Restart the OCR workers if needed

### Database Connection Issues

1. Check PostgreSQL status: `systemctl status postgresql`
2. Check the connection count: `SELECT count(*) FROM pg_stat_activity;`
3. Kill idle connections if needed, e.g. `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();`
4. Restart the application if the problem persists

## Maintenance Windows

| Task | Schedule | Duration | Impact |
|---|---|---|---|
| Database maintenance | Sunday 02:00 UTC | 30 min | Read-only |
| Index optimization | Saturday 03:00 UTC | 1 hour | Slower search |
| Model updates | As needed | 10 min | Brief classification delay |
