High-Level Architecture
This document describes the system architecture for Pebble DMS.
System Overview
Pebble DMS uses a modular, event-driven architecture designed for:
- Scale: Handle 1TB+ document collections
- Reliability: Zero data loss, with raw files stored before asynchronous processing begins
- Extensibility: Easy to add new processors and integrations
- Performance: Fast ingestion and search
Architecture Diagram
flowchart TB
subgraph Clients
A[Web UI]
B[REST API Clients]
C[Batch Upload CLI]
end
subgraph API Layer
D[API Gateway / Load Balancer]
E[Auth Service]
end
subgraph Application Services
F[Ingestion Service]
G[Processing Orchestrator]
H[Classification Service]
I[Tagging Service]
J[Search Service]
end
subgraph Processing Workers
K[OCR Worker]
L[Embedding Worker]
M[Dedup Worker]
end
subgraph Message Queue
N[Job Queue - Redis/RabbitMQ]
end
subgraph Data Stores
O[(Document Store - S3/MinIO)]
P[(Metadata DB - PostgreSQL)]
Q[(Vector Store - Qdrant)]
R[(Search Index - Meilisearch)]
end
A --> D
B --> D
C --> D
D --> E
E --> F
E --> J
F --> N
N --> K
N --> L
N --> M
K --> G
L --> G
M --> G
G --> H
G --> I
H --> P
I --> P
F --> O
G --> P
L --> Q
J --> R
J --> Q
Component Details
1. API Gateway
Purpose: Single entry point for all client requests.
| Feature | Implementation |
|---|---|
| Load balancing | Nginx / Traefik |
| Rate limiting | Token bucket |
| Authentication | JWT validation |
| Request routing | Path-based |
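The token bucket listed for rate limiting can be sketched as follows. This is a minimal, single-process illustration; the class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions for the example, not the gateway's actual configuration (Nginx and Traefik ship their own rate-limiting settings).

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the behaviour can be tested
        self.tokens = capacity        # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A gateway would typically keep one bucket per client or API key and answer with HTTP 429 when `allow()` returns `False`.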
Endpoints:
POST /api/v1/documents # Upload
GET /api/v1/documents/:id # Get metadata
GET /api/v1/documents/:id/file # Download file
DELETE /api/v1/documents/:id # Delete
POST /api/v1/search # Search
GET /api/v1/tags # List tags
2. Ingestion Service
Purpose: Accept and validate incoming documents.
sequenceDiagram
participant Client
participant API
participant Ingestion
participant Queue
participant Storage
Client->>API: POST /documents (file)
API->>Ingestion: Validate & store
Ingestion->>Storage: Save raw file
Storage-->>Ingestion: File URL
Ingestion->>Queue: Enqueue processing job
Ingestion-->>API: Document ID + status
API-->>Client: 202 Accepted
Responsibilities:
- File format validation
- Size limit enforcement
- Virus scanning (optional)
- Initial metadata extraction
- Queueing for async processing
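The responsibilities above could be combined roughly as in the sketch below. The allowed MIME types, the 50 MB limit, and the `ProcessingJob` fields are hypothetical values chosen for illustration; the source does not specify them. Virus scanning and the actual storage and queue calls are omitted.

```python
import hashlib
import uuid
from dataclasses import dataclass

# Hypothetical limits: the accepted formats and size cap are assumptions.
ALLOWED_MIME_TYPES = {"application/pdf", "image/png", "image/jpeg", "image/tiff"}
MAX_SIZE_BYTES = 50 * 1024 * 1024


class ValidationError(ValueError):
    """Raised when an upload fails format or size checks."""


@dataclass
class ProcessingJob:
    document_id: str
    filename: str
    mime_type: str
    size_bytes: int
    hash_sha256: str
    status: str = "queued"


def accept_document(filename: str, mime_type: str, data: bytes) -> ProcessingJob:
    """Validate an upload and build the job that would be enqueued."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValidationError(f"unsupported format: {mime_type}")
    if not data:
        raise ValidationError("empty file")
    if len(data) > MAX_SIZE_BYTES:
        raise ValidationError(f"file exceeds {MAX_SIZE_BYTES} bytes")
    # Initial metadata: size and content hash, matching the documents table columns.
    return ProcessingJob(
        document_id=str(uuid.uuid4()),
        filename=filename,
        mime_type=mime_type,
        size_bytes=len(data),
        hash_sha256=hashlib.sha256(data).hexdigest(),
    )
```

In the flow shown in the sequence diagram, the raw file would be saved to storage before the job is placed on the queue, and the API would then return the document ID with a 202 response.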
3. Processing Orchestrator
Purpose: Coordinate the document processing pipeline.
| Step | Worker | Input | Output |
|---|---|---|---|
| 1 | OCR Worker | Raw file | Extracted text |
| 2 | Embedding Worker | Text | Vector embedding |
| 3 | Dedup Worker | Hash + Embedding | Duplicate status |
| 4 | Classification Service | Text | Type + Category |
| 5 | Tagging Service | Text | Tags |
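Step 3 takes both a hash and an embedding as input. One plausible reading, sketched below, is an exact check on the content hash followed by a near-duplicate check on embedding similarity. The cosine metric, the 0.95 threshold, and the in-memory `known` mapping are assumptions; a real deployment would presumably query PostgreSQL for the hash and Qdrant for the nearest vectors.

```python
import hashlib
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(content: bytes, embedding, known, threshold=0.95):
    """Return (doc_id, reason) for the first match, or None.

    `known` maps doc_id -> (sha256_hex, embedding). `threshold` is a
    hypothetical similarity cut-off for near-duplicates.
    """
    digest = hashlib.sha256(content).hexdigest()
    # Pass 1: byte-identical files share the same hash.
    for doc_id, (known_hash, _) in known.items():
        if known_hash == digest:
            return doc_id, "exact"
    # Pass 2: pick the most similar document at or above the threshold.
    best_id, best_score = None, threshold
    for doc_id, (_, known_embedding) in known.items():
        score = cosine_similarity(embedding, known_embedding)
        if score >= best_score:
            best_id, best_score = doc_id, score
    return (best_id, "near") if best_id is not None else None
```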
State Machine:
stateDiagram-v2
[*] --> Queued
Queued --> OCR: Start processing
OCR --> Embedding: Text extracted
Embedding --> Dedup: Embedding computed
Dedup --> Classification: Not duplicate
Dedup --> Duplicate: Is duplicate
Classification --> Tagging: Classified
Tagging --> Indexing: Tagged
Indexing --> Ready: Indexed
Ready --> [*]
OCR --> Failed: Error
Embedding --> Failed: Error
Failed --> [*]
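The state machine above can be encoded as a transition table. The sketch below mirrors the diagram exactly, so only OCR and Embedding have failure transitions. The event names are hypothetical identifiers derived from the edge labels, and treating `Duplicate` as terminal is an assumption, since the diagram shows no outgoing edge from it.

```python
# (current state, event) -> next state, copied from the diagram's edges.
TRANSITIONS = {
    ("Queued", "start_processing"): "OCR",
    ("OCR", "text_extracted"): "Embedding",
    ("Embedding", "embedding_computed"): "Dedup",
    ("Dedup", "not_duplicate"): "Classification",
    ("Dedup", "is_duplicate"): "Duplicate",
    ("Classification", "classified"): "Tagging",
    ("Tagging", "tagged"): "Indexing",
    ("Indexing", "indexed"): "Ready",
    ("OCR", "error"): "Failed",
    ("Embedding", "error"): "Failed",
}

# Assumption: Duplicate is treated as terminal alongside Ready and Failed.
TERMINAL_STATES = {"Ready", "Failed", "Duplicate"}


def advance(state: str, event: str) -> str:
    """Return the next state, or raise ValueError for a transition not in the diagram."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state} on {event}") from None


def run(events, state: str = "Queued") -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = advance(state, event)
    return state
```

Keeping the table as data makes it straightforward to reject illegal status updates before they are written to the `status` column.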
4. OCR Worker
Purpose: Extract text from scanned documents and images.
| Engine | Use Case | Languages |
|---|---|---|
| Tesseract | General OCR | 100+ languages, including English and Hindi |
| docTR | Deep learning OCR | English (high accuracy) |
| EasyOCR | Multi-language | 80+ languages |
Configuration:
ocr:
  engine: tesseract
  languages: [eng, hin]
  timeout_seconds: 300
  max_pages: 100
  dpi: 300
5. Embedding Worker
Purpose: Generate vector embeddings for semantic search and duplicate detection.
| Model | Dimensions | Use Case |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, general purpose |
| all-mpnet-base-v2 | 768 | High accuracy |
| multilingual-e5-base | 768 | Multi-language |
Process:
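- Chunk text into 512-token segments (or shorter, if the selected model's maximum sequence length is below 512 tokens)
- Generate an embedding per chunk
- Mean-pool the chunk embeddings into a document-level embedding
- Store the vectors in the vector database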
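The chunk-and-pool steps above can be sketched without committing to a particular model. In this illustration, whitespace splitting stands in for the model's tokenizer and `embed_chunk` is a hypothetical callable that would wrap the chosen embedding model; both are assumptions for the example.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_tokens(tokens: Sequence[str], size: int = 512) -> List[Sequence[str]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_document(
    text: str,
    embed_chunk: Callable[[str], List[float]],
    size: int = 512,
) -> Tuple[List[List[float]], List[float]]:
    """Return (chunk-level vectors, document-level vector) for non-empty text."""
    tokens = text.split()  # assumption: whitespace split replaces the real tokenizer
    chunks = chunk_tokens(tokens, size)
    chunk_vectors = [embed_chunk(" ".join(chunk)) for chunk in chunks]
    return chunk_vectors, mean_pool(chunk_vectors)
```

The two return values line up with the two levels of vectors the pipeline stores: per-chunk vectors and one pooled vector per document.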
6. Classification Service
Purpose: Categorize documents automatically.
flowchart LR
A[Text] --> B[Tokenize]
B --> C[Feature Extraction]
C --> D[Type Classifier]
C --> E[Category Classifier]
D --> F[Type + Confidence]
E --> G[Category + Confidence]
Document Types:
| Type | Examples |
|---|---|
| Invoice | Bills, receipts |
| Contract | Agreements, NDAs |
| Report | Analysis, summaries |
| Correspondence | Letters, emails |
| Form | Applications, surveys |
| ID | Passports, licenses |
7. Search Service
Purpose: Enable document discovery through full-text, filtered, and semantic search.
| Search Type | Technology | Query Example |
|---|---|---|
| Full-text | Meilisearch | payment invoice 2024 |
| Filtered | PostgreSQL | type = 'invoice' |
| Semantic | Qdrant | "documents about renewals" |
Index Schema:
{
  "id": "doc_123",
  "title": "Invoice #456",
  "content": "Full extracted text...",
  "type": "invoice",
  "category": "finance",
  "tags": ["payment", "vendor", "Q1-2024"],
  "created_at": "2024-01-15T10:30:00Z",
  "embedding": [0.123, -0.456, ...]
}
Data Stores
Document Store (S3/MinIO)
| Item | Details |
|---|---|
| Purpose | Raw file storage |
| Format | Original files |
| Retention | Configurable |
| Structure | /{tenant}/{year}/{month}/{doc_id}/{filename} |
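The storage structure above can be produced by a small helper such as the one below. The function name, the zero-padded month, and the stripping of directory components from the filename are assumptions; the source only gives the path template.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def build_object_key(tenant: str, doc_id: str, filename: str, created_at: datetime) -> str:
    """Build a key following /{tenant}/{year}/{month}/{doc_id}/{filename}."""
    # Keep only the final path component so "../" cannot escape the document prefix.
    safe_name = PurePosixPath(filename).name
    if not safe_name or safe_name == "..":
        raise ValueError(f"invalid filename: {filename!r}")
    # Assumption: four-digit year and zero-padded two-digit month.
    return f"/{tenant}/{created_at:%Y}/{created_at:%m}/{doc_id}/{safe_name}"


key = build_object_key(
    "acme", "doc_123", "invoice.pdf",
    datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
)
# key == "/acme/2024/01/doc_123/invoice.pdf"
```

Including the document ID in the path keeps two uploads with the same filename from overwriting each other.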
Metadata DB (PostgreSQL)
Schema (simplified):
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  filename VARCHAR(255),
  mime_type VARCHAR(100),
  size_bytes BIGINT,
  hash_md5 CHAR(32),
  hash_sha256 CHAR(64),
  status VARCHAR(50),
  type VARCHAR(100),
  category VARCHAR(100),
  confidence DECIMAL(5,2),
  created_at TIMESTAMP,
  processed_at TIMESTAMP
);
CREATE TABLE tags (
  id UUID PRIMARY KEY,
  name VARCHAR(100) UNIQUE,
  created_at TIMESTAMP
);
CREATE TABLE document_tags (
  document_id UUID REFERENCES documents(id),
  tag_id UUID REFERENCES tags(id),
  source VARCHAR(50), -- 'auto' or 'manual'
  PRIMARY KEY (document_id, tag_id)
);
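To show how the three tables work together, the sketch below attaches tags to a document idempotently. It uses an in-memory SQLite database as a stand-in for PostgreSQL so the example runs anywhere, with a trimmed version of the schema. The helper names are hypothetical, and `INSERT OR IGNORE` is SQLite syntax; the PostgreSQL equivalent is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3
import uuid

# Trimmed copy of the schema; SQLite stands in for PostgreSQL here.
SCHEMA = """
CREATE TABLE documents (id TEXT PRIMARY KEY, filename TEXT, status TEXT);
CREATE TABLE tags (id TEXT PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE document_tags (
    document_id TEXT REFERENCES documents(id),
    tag_id TEXT REFERENCES tags(id),
    source TEXT,  -- 'auto' or 'manual'
    PRIMARY KEY (document_id, tag_id)
);
"""


def attach_tag(conn, document_id: str, name: str, source: str = "auto") -> None:
    """Create the tag if needed, then link it to the document (safe to repeat)."""
    conn.execute(
        "INSERT OR IGNORE INTO tags (id, name) VALUES (?, ?)",
        (str(uuid.uuid4()), name),
    )
    (tag_id,) = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO document_tags (document_id, tag_id, source) VALUES (?, ?, ?)",
        (document_id, tag_id, source),
    )


def tags_for(conn, document_id: str) -> list:
    """Return the document's tag names in alphabetical order."""
    rows = conn.execute(
        "SELECT t.name FROM tags t "
        "JOIN document_tags dt ON dt.tag_id = t.id "
        "WHERE dt.document_id = ? ORDER BY t.name",
        (document_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO documents VALUES (?, ?, ?)", ("doc_1", "invoice.pdf", "ready"))
attach_tag(conn, "doc_1", "payment")
attach_tag(conn, "doc_1", "vendor", source="manual")
attach_tag(conn, "doc_1", "payment")  # repeated call is a no-op
```

The composite primary key on `document_tags` is what makes the repeated call harmless: a document can carry each tag at most once.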
Vector Store (Qdrant)
| Collection | Purpose |
|---|---|
| document_embeddings | Full document vectors |
| chunk_embeddings | Paragraph-level vectors |
Scalability
Horizontal Scaling
| Component | Scaling Strategy |
|---|---|
| API Gateway | Multiple replicas behind a load balancer |
| Workers | Auto-scale based on queue depth |
| PostgreSQL | Read replicas |
| Qdrant | Sharding |
| Meilisearch | Sharding |
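Scaling workers on queue depth comes down to a simple proportional rule. The sketch below is one way to express it; the function name, the `jobs_per_worker` target, and the replica bounds are hypothetical parameters, not values from this document.

```python
import math


def desired_workers(
    queue_depth: int,
    jobs_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Workers needed so each handles about `jobs_per_worker` queued jobs, within bounds."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queue_depth / jobs_per_worker)
    # Clamp so the pool never scales to zero or beyond the configured ceiling.
    return max(min_workers, min(max_workers, wanted))
```

For example, with a target of 10 jobs per worker, a queue of 95 jobs yields 10 workers, while an empty queue falls back to the minimum of 1.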
Performance Targets
| Metric | Target |
|---|---|
| Ingestion | 1,000 docs/hour |
| OCR | 10 pages/second |
| Search latency | <200ms (p95) |
| API latency | <100ms (p95) |
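As a quick consistency check on the ingestion and OCR targets, the arithmetic below derives how many pages an average document can have before OCR becomes the bottleneck. It assumes the OCR figure is the aggregate rate across all workers, which the table does not state.

```python
OCR_PAGES_PER_SECOND = 10          # from the targets table
INGESTION_DOCS_PER_HOUR = 1_000    # from the targets table

# Pages the OCR stage can process in one hour: 10 * 3600 = 36,000.
ocr_pages_per_hour = OCR_PAGES_PER_SECOND * 3600

# Largest average document size that OCR can sustain at the ingestion target.
max_avg_pages_per_doc = ocr_pages_per_hour / INGESTION_DOCS_PER_HOUR

# Ingestion target expressed per second (roughly 0.28 documents per second).
docs_per_second = INGESTION_DOCS_PER_HOUR / 3600
```

Under that assumption, the two targets are compatible as long as documents average 36 pages or fewer.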
Security
| Layer | Controls |
|---|---|
| Network | TLS 1.3, VPC isolation |
| Authentication | JWT, API keys |
| Authorization | RBAC |
| Data | Encryption at rest (AES-256) |
| Audit | All API calls logged |
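To illustrate what JWT validation involves, the sketch below signs and verifies an HS256 token using only the standard library. The choice of HS256, the shared secret, and the function names are assumptions, since the source does not say which algorithm or library is used. It is meant to show the checks (algorithm, signature, expiry), and a production service would normally rely on a vetted JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(part: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Create a compact HS256 token (used here only to exercise the verifier)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the token is well-formed, correctly signed, and unexpired."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:
        raise ValueError("malformed token") from None
    # Pin the algorithm so a token cannot select a weaker one (such as "none").
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("invalid signature")
    if not isinstance(claims, dict):
        raise ValueError("malformed token")
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if exp is not None and current >= exp:
        raise ValueError("token expired")
    return claims
```

Once the claims are trusted, an RBAC check could compare a role claim against the permission required by the requested endpoint.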
Deployment Options
Option 1: Docker Compose (Development)
services:
  api:
    image: pebble-dms/api
  worker:
    image: pebble-dms/worker
  postgres:
    image: postgres:15
  qdrant:
    image: qdrant/qdrant
  meilisearch:
    image: getmeili/meilisearch
  minio:
    image: minio/minio
  redis:
    image: redis:7
Option 2: Kubernetes (Production)
- Helm chart for deployment
- HPA for worker scaling
- PVC for persistent storage
- Ingress for external access