
# Operations Overview

This document covers monitoring, logging, and operational procedures for Pebble DMS.


## Monitoring Architecture

```mermaid
flowchart TB
    subgraph Services
        A[API]
        B[Workers]
        C[PostgreSQL]
        D[Qdrant]
        E[Meilisearch]
    end

    subgraph Observability
        F[Prometheus]
        G[Grafana]
        H[Loki]
        I[Alertmanager]
    end

    A --> F
    B --> F
    C --> F
    D --> F
    E --> F

    F --> G
    F --> I

    A --> H
    B --> H

    I --> J[Slack/Email]
```

## Key Metrics

### API Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `api_request_duration_seconds` | Latency histogram | p99 > 1s |
| `api_requests_total` | Request count | - |
| `api_errors_total` | Error count | > 10/min |
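
For reference, a minimal sketch of exporting these three metrics from FastAPI middleware with `prometheus_client`; the label set, the 5xx error definition, and the `/metrics` mount are assumptions, not the service's actual instrumentation:

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

# Metric names mirror the table above; the label set is an assumption.
REQUEST_DURATION = Histogram(
    "api_request_duration_seconds", "API request latency", ["method", "path"]
)
REQUESTS = Counter(
    "api_requests_total", "API request count", ["method", "path", "status"]
)
ERRORS = Counter("api_errors_total", "API error count", ["method", "path"])

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrape target

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_DURATION.labels(request.method, request.url.path).observe(
        time.perf_counter() - start
    )
    REQUESTS.labels(request.method, request.url.path, str(response.status_code)).inc()
    if response.status_code >= 500:  # only server errors counted here
        ERRORS.labels(request.method, request.url.path).inc()
    return response
```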

### Worker Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `worker_jobs_total` | Jobs processed | - |
| `worker_jobs_failed` | Failed jobs | > 5% |
| `worker_queue_depth` | Pending jobs | > 1000 |
| `worker_job_duration_seconds` | Processing time | p95 > 300s |
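
`worker_queue_depth` is typically exported by a small standalone exporter rather than the workers themselves. A sketch assuming Celery's default Redis list `celery` as the queue (the same key the runbook below inspects); broker URL, exporter port, and poll interval are placeholders:

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting in the Celery queue")

def main() -> None:
    # Broker URL, port, and interval are placeholders.
    r = redis.Redis.from_url("redis://localhost:6379/0")
    start_http_server(9200)
    while True:
        # Celery's default queue is the Redis list named "celery"
        # (the same key the runbook checks with `redis-cli llen celery`).
        QUEUE_DEPTH.set(r.llen("celery"))
        time.sleep(15)

if __name__ == "__main__":
    main()
```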

### OCR Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `ocr_pages_processed` | Pages OCR'd | - |
| `ocr_confidence_avg` | Average confidence | < 0.7 |
| `ocr_duration_seconds` | Processing time | > 60s/page |
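
`ocr_confidence_avg` cannot be exported directly as a counter; one common pattern is to export a running confidence sum alongside the page counter and divide at query time. A sketch with illustrative names:

```python
from prometheus_client import Counter, Histogram

# Note: prometheus_client appends "_total" to counter names on exposition.
OCR_PAGES = Counter("ocr_pages_processed", "Pages OCR'd")
OCR_DURATION = Histogram("ocr_duration_seconds", "Per-page OCR time")
OCR_CONFIDENCE_SUM = Counter("ocr_confidence_sum", "Running sum of per-page confidences")

def record_page(confidence: float, seconds: float) -> None:
    # Call once per page; the average confidence is derived at query time:
    #   rate(ocr_confidence_sum_total[5m]) / rate(ocr_pages_processed_total[5m])
    OCR_PAGES.inc()
    OCR_DURATION.observe(seconds)
    OCR_CONFIDENCE_SUM.inc(confidence)
```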

## Logging

### Log Format

```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "service": "api",
  "trace_id": "abc123",
  "message": "Document uploaded",
  "document_id": "doc_xyz",
  "filename": "invoice.pdf",
  "size_bytes": 125000
}
```
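
One way to emit this shape with only the standard library; a sketch, with the fixed extra-field list and hard-coded service name as simplifications:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Fields passed via `extra=` land on the record as attributes; the fixed
    # list here and the hard-coded service name are simplifications.
    EXTRA_FIELDS = ("trace_id", "document_id", "filename", "size_bytes")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        entry = {
            "timestamp": ts.replace("+00:00", "Z"),
            "level": record.levelname,
            "service": "api",
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info(
    "Document uploaded",
    extra={"document_id": "doc_xyz", "filename": "invoice.pdf", "size_bytes": 125000},
)
```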

### Log Levels

| Level | Use Case |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Recoverable issues |
| INFO | Key business events |
| DEBUG | Diagnostic details |

## Alerting

### Critical Alerts

| Alert | Condition | Action |
|---|---|---|
| API Down | No response > 1 min | Page on-call |
| Worker Queue Overflow | Queue > 5000 | Scale workers |
| Database Connection Failed | Connection errors | Check DB health |
| Storage Full | Disk > 90% | Expand storage |

### Warning Alerts

| Alert | Condition | Action |
|---|---|---|
| High Latency | p95 > 2s | Investigate |
| OCR Failures High | > 10% failures | Check document quality |
| Low Classification Confidence | Avg < 0.6 | Review model |

## Health Checks

### Endpoints

```text
GET /health          # Basic health
GET /health/ready    # Full readiness
GET /health/live     # Liveness probe
```

### Readiness Checks

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/ready")
async def readiness():
    # Assuming async check helpers (defined elsewhere): they must be awaited,
    # otherwise the coroutine objects are always truthy and the endpoint
    # never reports a failure.
    checks = {
        "database": await check_db(),
        "redis": await check_redis(),
        "qdrant": await check_qdrant(),
        "meilisearch": await check_meilisearch(),
        "storage": await check_minio(),
    }

    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=status_code,
    )
```
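
Each check helper reduces a dependency probe to a bool. One possible shape, shown here for Meilisearch's `/health` endpoint via `httpx` (the base URL and timeout are assumptions):

```python
import httpx

async def check_meilisearch() -> bool:
    # Meilisearch exposes GET /health; base URL and timeout are assumptions.
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get("http://localhost:7700/health")
            return resp.status_code == 200
    except httpx.HTTPError:
        return False
```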

## Backup & Recovery

### Backup Schedule

| Component | Frequency | Retention |
|---|---|---|
| PostgreSQL | Daily | 30 days |
| Qdrant | Daily | 14 days |
| MinIO | Continuous sync | 90 days |

### Backup Procedures

```bash
# PostgreSQL backup
pg_dump -h localhost -U pebble pebble_db | gzip > backup_$(date +%Y%m%d).sql.gz

# Qdrant snapshot
curl -X POST 'http://localhost:6333/collections/documents/snapshots'

# MinIO sync (assumes the `minio` and `s3` aliases are configured in mc)
mc mirror minio/documents s3/backup-bucket/documents
```

### Recovery Procedures

```bash
# PostgreSQL restore
gunzip -c backup_20240115.sql.gz | psql -h localhost -U pebble pebble_db

# Qdrant restore
curl -X PUT 'http://localhost:6333/collections/documents/snapshots/recover' \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///snapshots/documents-snap.snapshot"}'
```

## Scaling Guidelines

### Horizontal Scaling Triggers

| Component | Scale Up When | Scale Down When |
|---|---|---|
| API | CPU > 70%, RPS > 1000 | CPU < 30%, RPS < 200 |
| OCR Worker | Queue > 100 | Queue < 10 |
| Embedding Worker | Queue > 50 | Queue < 5 |
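
In Kubernetes these triggers would normally be encoded as an HPA or KEDA ScaledObject; purely for illustration, here is a hand-rolled sketch of the OCR-worker rule using the official `kubernetes` client. The deployment name, namespace, queue key, replica cap, and poll interval are all assumptions:

```python
import time

import redis
from kubernetes import client, config

def desired_replicas(depth: int, current: int) -> int:
    # Thresholds from the OCR Worker row above; the cap of 10 is an assumption.
    if depth > 100:
        return min(current + 1, 10)
    if depth < 10:
        return max(current - 1, 1)
    return current

def main() -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    r = redis.Redis.from_url("redis://localhost:6379/0")
    while True:
        scale = apps.read_namespaced_deployment_scale("ocr-worker", "pebble")
        current = scale.spec.replicas
        target = desired_replicas(r.llen("ocr"), current)  # queue key assumed
        if target != current:
            apps.patch_namespaced_deployment_scale(
                "ocr-worker", "pebble", {"spec": {"replicas": target}}
            )
        time.sleep(30)

if __name__ == "__main__":
    main()
```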

### Resource Limits

```yaml
# Kubernetes resource requests/limits per component
# (values to apply in each Deployment spec, not a complete manifest)
resources:
  api:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "2Gi"

  worker:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "4000m"
      memory: "8Gi"
```

## Runbooks

### High Queue Depth

1. Check worker health: `kubectl get pods -l app=worker`
2. Check for errors: `kubectl logs -l app=worker --tail=100`
3. Scale workers: `kubectl scale deployment worker --replicas=5`
4. Monitor the queue: `redis-cli llen celery`

### OCR Failures Spike

1. Check error logs: `grep "OCR failed" /var/log/worker.log`
2. Identify the problematic documents
3. Check OCR engine health
4. Restart the OCR workers if needed

### Database Connection Issues

1. Check PostgreSQL status: `systemctl status postgresql`
2. Check the connection count: `SELECT count(*) FROM pg_stat_activity;`
3. Kill idle connections if needed, e.g. `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();`
4. Restart the application if the problem persists

## Maintenance Windows

| Task | Schedule | Duration | Impact |
|---|---|---|---|
| Database maintenance | Sunday 02:00 UTC | 30 min | Read-only |
| Index optimization | Saturday 03:00 UTC | 1 hour | Slower search |
| Model updates | As needed | 10 min | Brief classification delay |
