Operations Overview
This document covers monitoring, logging, and operational procedures for Pebble DMS.
Monitoring Architecture
flowchart TB
    subgraph Services
        A[API]
        B[Workers]
        C[PostgreSQL]
        D[Qdrant]
        E[Meilisearch]
    end
    subgraph Observability
        F[Prometheus]
        G[Grafana]
        H[Loki]
        I[Alertmanager]
    end
    A --> F
    B --> F
    C --> F
    D --> F
    E --> F
    F --> G
    F --> I
    A --> H
    B --> H
    I --> J[Slack/Email]
Key Metrics
API Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| api_request_duration_seconds | Latency histogram | p99 > 1s |
| api_requests_total | Request count | - |
| api_errors_total | Error count | > 10/min |
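The p99 threshold is evaluated against a histogram rather than raw samples, so the quantile has to be estimated from bucket counts. The following is a minimal sketch of that estimate using linear interpolation inside the matching bucket, which is the approach Prometheus documents for `histogram_quantile`. The bucket boundaries and counts are hypothetical, not values taken from Pebble DMS.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs
    sorted by upper bound, ending with the +Inf bucket. Observations are
    assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty one) no interpolation is
            # possible, so fall back to the last finite boundary.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for api_request_duration_seconds.
buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)  # about 0.83 s
latency_alert = p99 > 1.0                # threshold from the table above
```

One consequence of this method is that the estimate can never exceed the largest finite bucket boundary, so the bucket layout needs at least one boundary above 1s for the p99 alert to be able to fire.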
Worker Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| worker_jobs_total | Jobs processed | - |
| worker_jobs_failed | Failed jobs | > 5% |
| worker_queue_depth | Pending jobs | > 1000 |
| worker_job_duration_seconds | Processing time | p95 > 300s |
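The thresholds in this table can be read as a simple predicate over a snapshot of the worker metrics. The sketch below is one possible reading, not the actual alert rules: `WorkerSnapshot` and `breached_thresholds` are hypothetical names, and the "> 5%" threshold is assumed to mean failed jobs as a fraction of all processed jobs, since the table does not state the evaluation window.

```python
from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    """Hypothetical point-in-time readings of the worker metrics."""
    jobs_total: int         # worker_jobs_total
    jobs_failed: int        # worker_jobs_failed
    queue_depth: int        # worker_queue_depth
    p95_duration_s: float   # p95 of worker_job_duration_seconds


def breached_thresholds(s: WorkerSnapshot) -> list[str]:
    """Return the worker metrics that exceed their alert threshold."""
    breached = []
    # Assumption: "> 5%" is the share of failed jobs among all processed jobs.
    if s.jobs_total > 0 and s.jobs_failed / s.jobs_total > 0.05:
        breached.append("worker_jobs_failed")
    if s.queue_depth > 1000:
        breached.append("worker_queue_depth")
    if s.p95_duration_s > 300:
        breached.append("worker_job_duration_seconds")
    return breached


snapshot = WorkerSnapshot(jobs_total=1000, jobs_failed=60,
                          queue_depth=1500, p95_duration_s=120.0)
alerts = breached_thresholds(snapshot)
```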
OCR Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| ocr_pages_processed | Pages OCR'd | - |
| ocr_confidence_avg | Average confidence | < 0.7 |
| ocr_duration_seconds | Processing time | > 60s/page |
Logging
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "service": "api",
  "trace_id": "abc123",
  "message": "Document uploaded",
  "document_id": "doc_xyz",
  "filename": "invoice.pdf",
  "size_bytes": 125000
}
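A log entry of this shape can be produced with the standard `logging` module alone. The following is a minimal sketch, not the formatter Pebble DMS actually uses: the `JsonFormatter` class and the convention of passing event-specific fields under a `fields` key in `extra` are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        # Assumption: event-specific fields arrive via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("pebble.api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Document uploaded",
    extra={
        "trace_id": "abc123",
        "fields": {"document_id": "doc_xyz", "filename": "invoice.pdf", "size_bytes": 125000},
    },
)
```

Emitting one JSON object per line keeps the entries easy to parse once they reach Loki, and carrying `trace_id` on every record lets API and worker entries for the same request be correlated.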
Log Levels
| Level | Use Case |
|-------|----------|
| ERROR | Failures requiring attention |
| WARN | Recoverable issues |
| INFO | Key business events |
| DEBUG | Diagnostic details |
Alerting
Critical Alerts
| Alert | Condition | Action |
|-------|-----------|--------|
| API Down | No response > 1 min | Page on-call |
| Worker Queue Overflow | Queue > 5000 | Scale workers |
| Database Connection Failed | Connection errors | Check DB health |
| Storage Full | Disk > 90% | Expand storage |
Warning Alerts
| Alert | Condition | Action |
|-------|-----------|--------|
| High Latency | p95 > 2s | Investigate |
| OCR Failures High | > 10% failures | Check quality |
| Low Classification Confidence | Avg < 0.6 | Review model |
Health Checks
Endpoints
GET /health # Basic health
GET /health/ready # Full readiness
GET /health/live # Liveness probe
Readiness Checks
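# Assumes FastAPI; the check_* helpers are defined elsewhere and return
# True when the dependency is reachable.
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()


@app.get("/health/ready")
async def readiness():
    checks = {
        "database": check_db(),
        "redis": check_redis(),
        "qdrant": check_qdrant(),
        "meilisearch": check_meilisearch(),
        "storage": check_minio(),
    }
    # Return 503 if any dependency is unhealthy so traffic is routed elsewhere.
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=status_code,
    )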
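The handler above calls each check in turn and assumes every helper returns a plain boolean. If the helpers are coroutines, they need to be awaited, and a single slow dependency should not stall the whole probe. The following is a minimal sketch of one way to do that with `asyncio`; `run_checks` and its `timeout_s` parameter are hypothetical and not part of the Pebble DMS codebase.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[bool]]


async def run_checks(checks: dict[str, Check], timeout_s: float = 2.0) -> dict[str, bool]:
    """Run all dependency checks concurrently.

    A check that raises or exceeds the timeout is reported as unhealthy
    instead of failing the whole readiness probe.
    """

    async def guarded(check: Check) -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=timeout_s))
        except Exception:
            return False

    results = await asyncio.gather(*(guarded(c) for c in checks.values()))
    return dict(zip(checks.keys(), results))


# Stand-in checks for illustration only.
async def healthy() -> bool:
    return True


async def failing() -> bool:
    raise ConnectionError("dependency unreachable")


status = asyncio.run(run_checks({"database": healthy, "redis": failing}))
```

Running the checks concurrently bounds the probe's latency by the slowest check (or the timeout) rather than by the sum of all of them.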
Backup & Recovery
Backup Schedule
| Component | Frequency | Retention |
|-----------|-----------|-----------|
| PostgreSQL | Daily | 30 days |
| Qdrant | Daily | 14 days |
| MinIO | Continuous sync | 90 days |
Backup Procedures
# PostgreSQL backup
pg_dump -h localhost -U pebble pebble_db | gzip > backup_$(date +%Y%m%d).sql.gz
# Qdrant snapshot
curl -X POST 'http://localhost:6333/collections/documents/snapshots'
# MinIO sync
mc mirror minio/documents s3/backup-bucket/documents
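The 30-day PostgreSQL retention can be enforced by parsing the date that the `pg_dump` command above embeds in each filename. The sketch below only selects the expired files and leaves the deletion to the caller. `expired_backups` is a hypothetical helper, and it assumes all dumps follow the `backup_YYYYMMDD.sql.gz` naming scheme.

```python
import re
from datetime import date, datetime, timedelta

BACKUP_NAME = re.compile(r"^backup_(\d{8})\.sql\.gz$")


def expired_backups(filenames: list[str], today: date, retention_days: int = 30) -> list[str]:
    """Return the dump files that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # not a PostgreSQL dump produced by the command above
        try:
            taken_on = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a valid date
        if taken_on < cutoff:
            expired.append(name)
    return expired


files = ["backup_20240115.sql.gz", "backup_20240116.sql.gz", "backup_20240214.sql.gz"]
to_delete = expired_backups(files, today=date(2024, 2, 15))
```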
Recovery Procedures
# PostgreSQL restore
gunzip -c backup_20240115.sql.gz | psql -h localhost -U pebble pebble_db
# Qdrant restore
curl -X PUT 'http://localhost:6333/collections/documents/snapshots/recover' \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///snapshots/documents-snap.snapshot"}'
Scaling Guidelines
Horizontal Scaling Triggers
| Component | Scale Up When | Scale Down When |
|-----------|---------------|-----------------|
| API | CPU > 70%, RPS > 1000 | CPU < 30%, RPS < 200 |
| OCR Worker | Queue > 100 | Queue < 10 |
| Embedding Worker | Queue > 50 | Queue < 5 |
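The triggers in this table amount to a small decision function with a dead band between the scale-up and scale-down thresholds. The sketch below shows one interpretation, with hypothetical function names. The table does not say how the two API signals combine, so the code assumes that either signal alone triggers a scale-up and that both must be low before scaling down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueRule:
    scale_up_above: int
    scale_down_below: int


# Thresholds copied from the table above.
QUEUE_RULES = {
    "ocr_worker": QueueRule(scale_up_above=100, scale_down_below=10),
    "embedding_worker": QueueRule(scale_up_above=50, scale_down_below=5),
}


def queue_scaling_decision(component: str, queue_depth: int) -> str:
    """Return "scale_up", "scale_down", or "hold" for a queue-driven worker."""
    rule = QUEUE_RULES[component]
    if queue_depth > rule.scale_up_above:
        return "scale_up"
    if queue_depth < rule.scale_down_below:
        return "scale_down"
    return "hold"


def api_scaling_decision(cpu_percent: float, rps: float) -> str:
    """Assumption: scale up on either signal, scale down only when both are low."""
    if cpu_percent > 70 or rps > 1000:
        return "scale_up"
    if cpu_percent < 30 and rps < 200:
        return "scale_down"
    return "hold"
```

The gap between the two thresholds (for example, 10 to 100 queued jobs for the OCR worker) keeps the replica count from oscillating when load hovers near a single cutoff.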
Resource Limits
# Kubernetes resource config
resources:
  api:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "2Gi"
  worker:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "4000m"
      memory: "8Gi"
Runbooks
High Queue Depth
- Check worker health:
  kubectl get pods -l app=worker
- Check for errors:
  kubectl logs -l app=worker --tail=100
- Scale workers:
  kubectl scale deployment worker --replicas=5
- Monitor queue:
  redis-cli llen celery
OCR Failures Spike
- Check error logs:
  grep "OCR failed" /var/log/worker.log
- Identify problematic documents
- Check OCR engine health
- Restart OCR workers if needed
Database Connection Issues
- Check PostgreSQL status:
  systemctl status postgresql
- Check connection count:
  SELECT count(*) FROM pg_stat_activity;
- Kill idle connections if needed
- Restart application if persistent
Maintenance Windows
| Task | Schedule | Duration | Impact |
|------|----------|----------|--------|
| Database maintenance | Sunday 02:00 UTC | 30 min | Read-only |
| Index optimization | Saturday 03:00 UTC | 1 hour | Slower search |
| Model updates | As needed | 10 min | Brief classification delay |
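Automation such as alert silencing or deploy gates may need to know whether a scheduled window is currently open. The following is a minimal sketch of that check for the two fixed windows in the table. `active_windows` is a hypothetical helper, and it expects a timezone-aware datetime. Model updates are omitted because they run as needed.

```python
from datetime import datetime, timedelta, timezone

# (weekday, start hour in UTC, duration), taken from the table above.
# Weekdays follow datetime.weekday(): Monday is 0, Sunday is 6.
WINDOWS = {
    "Database maintenance": (6, 2, timedelta(minutes=30)),
    "Index optimization": (5, 3, timedelta(hours=1)),
}


def active_windows(now: datetime) -> list[str]:
    """Return the maintenance tasks whose window contains `now`."""
    now = now.astimezone(timezone.utc)
    active = []
    for task, (weekday, hour, duration) in WINDOWS.items():
        # Most recent occurrence of the window's weekday (today if it matches).
        days_since = (now.weekday() - weekday) % 7
        start = (now - timedelta(days=days_since)).replace(
            hour=hour, minute=0, second=0, microsecond=0
        )
        if start <= now < start + duration:
            active.append(task)
    return active


# 2024-01-14 was a Sunday.
current = active_windows(datetime(2024, 1, 14, 2, 15, tzinfo=timezone.utc))
```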