
High-Level Architecture

This document describes the system architecture for Pebble DMS.


System Overview

Pebble DMS is built on a modular, event-driven architecture designed for:

  • Scale: Handle 1TB+ document collections
  • Reliability: Zero data loss; asynchronous, queue-backed processing
  • Extensibility: Easy to add new processors and integrations
  • Performance: Fast ingestion and search

Architecture Diagram

flowchart TB
    subgraph Clients
        A[Web UI]
        B[REST API Clients]
        C[Batch Upload CLI]
    end

    subgraph API Layer
        D[API Gateway / Load Balancer]
        E[Auth Service]
    end

    subgraph Application Services
        F[Ingestion Service]
        G[Processing Orchestrator]
        H[Classification Service]
        I[Tagging Service]
        J[Search Service]
    end

    subgraph Processing Workers
        K[OCR Worker]
        L[Embedding Worker]
        M[Dedup Worker]
    end

    subgraph Message Queue
        N[Job Queue - Redis/RabbitMQ]
    end

    subgraph Data Stores
        O[(Document Store - S3/MinIO)]
        P[(Metadata DB - PostgreSQL)]
        Q[(Vector Store - Qdrant)]
        R[(Search Index - Meilisearch)]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    E --> J

    F --> N
    N --> K
    N --> L
    N --> M

    K --> G
    L --> G
    M --> G

    G --> H
    G --> I

    H --> P
    I --> P

    F --> O
    G --> P
    L --> Q
    J --> R
    J --> Q

Component Details

1. API Gateway

Purpose: Single entry point for all client requests.

| Feature         | Implementation        |
|-----------------|-----------------------|
| Load balancing  | Nginx / Traefik       |
| Rate limiting   | Token bucket          |
| Authentication  | JWT validation        |
| Request routing | Path-based            |

Endpoints:

POST   /api/v1/documents          # Upload
GET    /api/v1/documents/:id      # Get metadata
GET    /api/v1/documents/:id/file # Download file
DELETE /api/v1/documents/:id      # Delete
POST   /api/v1/search             # Search
GET    /api/v1/tags               # List tags
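
The endpoints above can be wrapped in a thin client. A minimal sketch, assuming a hypothetical base URL, bearer-token auth, and search payload shape (the real API may differ); it only builds the requests, never sends them, so the routing stays visible:

```python
import json
import urllib.request

class PebbleClient:
    def __init__(self, base_url, token):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, method, path, body=None):
        # Build (but do not send) the request for the given endpoint.
        return urllib.request.Request(
            f"{self.base_url}/api/v1{path}",
            data=json.dumps(body).encode() if body is not None else None,
            method=method,
            headers={
                "Authorization": f"Bearer {self.token}",
                "Content-Type": "application/json",
            },
        )

    def get_document(self, doc_id):
        return self._request("GET", f"/documents/{doc_id}")

    def search(self, query):
        # The {"q": ...} payload is an assumption; the spec above only
        # names the endpoint, not its body.
        return self._request("POST", "/search", {"q": query})

client = PebbleClient("https://dms.example.com", "my-token")
req = client.get_document("doc_123")
print(req.get_method(), req.full_url)
```

A real client would pass the request to `urllib.request.urlopen` (or use an HTTP library) and decode the JSON response.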

2. Ingestion Service

Purpose: Accept and validate incoming documents.

sequenceDiagram
    participant Client
    participant API
    participant Ingestion
    participant Queue
    participant Storage

    Client->>API: POST /documents (file)
    API->>Ingestion: Validate & store
    Ingestion->>Storage: Save raw file
    Storage-->>Ingestion: File URL
    Ingestion->>Queue: Enqueue processing job
    Ingestion-->>API: Document ID + status
    API-->>Client: 202 Accepted

Responsibilities:

  • File format validation
  • Size limit enforcement
  • Virus scanning (optional)
  • Initial metadata extraction
  • Queueing for async processing
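
The sequence above can be sketched with in-memory stand-ins for the object store and job queue (the real system uses S3/MinIO and Redis/RabbitMQ); the allowed types and size limit are illustrative, not the service's actual policy:

```python
import hashlib
import uuid

ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # illustrative
MAX_SIZE_BYTES = 50 * 1024 * 1024                               # illustrative

storage = {}    # stand-in for S3/MinIO: doc_id -> raw bytes
job_queue = []  # stand-in for Redis/RabbitMQ: pending jobs

def ingest(filename, mime_type, data):
    # File format validation and size limit enforcement
    if mime_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {mime_type}")
    if len(data) > MAX_SIZE_BYTES:
        raise ValueError("file too large")
    doc_id = str(uuid.uuid4())
    storage[doc_id] = data              # "Save raw file"
    job_queue.append({                  # "Enqueue processing job"
        "doc_id": doc_id,
        "filename": filename,
        "sha256": hashlib.sha256(data).hexdigest(),
    })
    return {"id": doc_id, "status": "queued"}  # body of the 202 Accepted

result = ingest("invoice.pdf", "application/pdf", b"%PDF-1.4 ...")
print(result["status"])  # queued
```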

3. Processing Orchestrator

Purpose: Coordinate document processing pipeline.

| Step | Worker                 | Input            | Output           |
|------|------------------------|------------------|------------------|
| 1    | OCR Worker             | Raw file         | Extracted text   |
| 2    | Embedding Worker       | Text             | Vector embedding |
| 3    | Dedup Worker           | Hash + embedding | Duplicate status |
| 4    | Classification Service | Text             | Type + category  |
| 5    | Tagging Service        | Text             | Tags             |

State Machine:

stateDiagram-v2
    [*] --> Queued
    Queued --> OCR: Start processing
    OCR --> Embedding: Text extracted
    Embedding --> Dedup: Embedding computed
    Dedup --> Classification: Not duplicate
    Dedup --> Duplicate: Is duplicate
    Classification --> Tagging: Classified
    Tagging --> Indexing: Tagged
    Indexing --> Ready: Indexed
    Ready --> [*]

    OCR --> Failed: Error
    Embedding --> Failed: Error
    Failed --> [*]
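
The state machine above can be captured as a transition table; a hypothetical sketch of the idea, not the orchestrator's actual implementation:

```python
# (state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    ("queued", "start"): "ocr",
    ("ocr", "text_extracted"): "embedding",
    ("ocr", "error"): "failed",
    ("embedding", "embedding_computed"): "dedup",
    ("embedding", "error"): "failed",
    ("dedup", "not_duplicate"): "classification",
    ("dedup", "is_duplicate"): "duplicate",
    ("classification", "classified"): "tagging",
    ("tagging", "tagged"): "indexing",
    ("indexing", "indexed"): "ready",
}

def advance(state, event):
    # Reject transitions the diagram does not allow.
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} + {event}")

state = "queued"
for event in ["start", "text_extracted", "embedding_computed",
              "not_duplicate", "classified", "tagged", "indexed"]:
    state = advance(state, event)
print(state)  # ready
```

Making illegal transitions raise (rather than silently no-op) keeps worker bugs visible instead of leaving documents stuck in an inconsistent state.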

4. OCR Worker

Purpose: Extract text from scanned documents and images.

| Engine    | Use Case          | Languages               |
|-----------|-------------------|-------------------------|
| Tesseract | General OCR       | English, Hindi, 100+    |
| docTR     | Deep learning OCR | English (high accuracy) |
| EasyOCR   | Multi-language    | 80+ languages           |

Configuration:

ocr:
  engine: tesseract
  languages: [eng, hin]
  timeout_seconds: 300
  max_pages: 100
  dpi: 300
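
A worker might dispatch on this config roughly as follows. The engine functions here are stubs; a real worker would call pytesseract, docTR, or EasyOCR, and would also apply timeout_seconds and render pages at the configured dpi:

```python
def ocr_tesseract(page_bytes, languages):
    # Stub: a real implementation would invoke Tesseract here.
    return f"<tesseract:{'+'.join(languages)}>"

def ocr_easyocr(page_bytes, languages):
    # Stub: a real implementation would invoke EasyOCR here.
    return f"<easyocr:{'+'.join(languages)}>"

ENGINES = {"tesseract": ocr_tesseract, "easyocr": ocr_easyocr}

def run_ocr(config, pages):
    engine = ENGINES[config["engine"]]
    # Enforce max_pages from the config before doing any work.
    limited = pages[: config["max_pages"]]
    return [engine(p, config["languages"]) for p in limited]

config = {"engine": "tesseract", "languages": ["eng", "hin"],
          "timeout_seconds": 300, "max_pages": 100, "dpi": 300}
texts = run_ocr(config, [b"page1", b"page2"])
print(len(texts))  # 2
```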

5. Embedding Worker

Purpose: Generate vector embeddings for semantic operations.

| Model                | Dimensions | Use Case              |
|----------------------|------------|-----------------------|
| all-MiniLM-L6-v2     | 384        | Fast, general purpose |
| all-mpnet-base-v2    | 768        | High accuracy         |
| multilingual-e5-base | 768        | Multi-language        |

Process:

  1. Chunk text into 512-token segments
  2. Generate embedding per chunk
  3. Mean-pool for document-level embedding
  4. Store in vector database
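
The four steps above, sketched with a stub embedder (a real worker would call a sentence-transformers model such as all-MiniLM-L6-v2 and return 384- or 768-dimensional vectors):

```python
def chunk(tokens, size=512):
    # Step 1: split the token list into fixed-size segments.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(segment, dim=4):
    # Step 2 stub: encodes only the segment length, so the pooling
    # arithmetic below is easy to check by hand.
    return [float(len(segment))] * dim

def mean_pool(vectors):
    # Step 3: average chunk vectors into one document-level vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

tokens = ["tok"] * 1000            # pretend tokenized document
chunks = chunk(tokens)             # two chunks: 512 + 488 tokens
doc_vec = mean_pool([embed(c) for c in chunks])
print(len(chunks), doc_vec[0])  # 2 500.0
```

Step 4 would then upsert both the per-chunk vectors and `doc_vec` into the vector store.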

6. Classification Service

Purpose: Categorize documents automatically.

flowchart LR
    A[Text] --> B[Tokenize]
    B --> C[Feature Extraction]
    C --> D[Type Classifier]
    C --> E[Category Classifier]
    D --> F[Type + Confidence]
    E --> G[Category + Confidence]
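
A toy version of the two-headed flow above: one shared feature step feeding two independent classifiers. Keyword scoring stands in for real feature extraction and trained models, and the keyword sets are invented for illustration:

```python
TYPE_KEYWORDS = {
    "invoice": {"invoice", "payment", "total", "due"},
    "contract": {"agreement", "party", "hereby", "term"},
}
CATEGORY_KEYWORDS = {
    "finance": {"payment", "total", "amount", "tax"},
    "legal": {"agreement", "hereby", "clause"},
}

def features(text):
    # Stand-in for tokenization + feature extraction.
    return set(text.lower().split())

def classify(tokens, keyword_map):
    # Score each label by keyword overlap; confidence is the winning
    # label's share of all matches.
    scores = {label: len(tokens & kws) for label, kws in keyword_map.items()}
    label = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    return label, scores[label] / total

tokens = features("Invoice total payment due on receipt")
doc_type, type_conf = classify(tokens, TYPE_KEYWORDS)
category, cat_conf = classify(tokens, CATEGORY_KEYWORDS)
print(doc_type, category)  # invoice finance
```

The point of the shared `features` step is that both classifiers consume the same representation, so feature extraction runs once per document.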

Document Types:

| Type           | Examples             |
|----------------|----------------------|
| Invoice        | Bills, receipts      |
| Contract       | Agreements, NDAs     |
| Report         | Analysis, summaries  |
| Correspondence | Letters, emails      |
| Form           | Applications, surveys |
| ID             | Passports, licenses  |

7. Search Service

Purpose: Enable document discovery.

| Search Type | Technology  | Query Example             |
|-------------|-------------|---------------------------|
| Full-text   | Meilisearch | payment invoice 2024      |
| Filtered    | PostgreSQL  | type = 'invoice'          |
| Semantic    | Qdrant      | "documents about renewals" |

Index Schema:

{
  "id": "doc_123",
  "title": "Invoice #456",
  "content": "Full extracted text...",
  "type": "invoice",
  "category": "finance",
  "tags": ["payment", "vendor", "Q1-2024"],
  "created_at": "2024-01-15T10:30:00Z",
  "embedding": [0.123, -0.456, ...]
}
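
When a query runs against both Meilisearch and Qdrant, their result lists must be merged. The sketch below uses a reciprocal-rank-fusion-style merge; this is an illustrative choice, since the document does not specify the service's actual ranking:

```python
def merge_results(fulltext_ids, semantic_ids, k=60):
    # Each backend contributes 1 / (k + rank) per document; documents
    # ranked highly by both backends accumulate the largest score.
    scores = {}
    for ranked in (fulltext_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fulltext = ["doc_1", "doc_2", "doc_3"]   # e.g. from Meilisearch
semantic = ["doc_2", "doc_4", "doc_1"]   # e.g. from Qdrant
merged = merge_results(fulltext, semantic)
print(merged)  # doc_2 first: it ranks highly in both lists
```

The constant k damps the influence of top ranks so one backend cannot dominate the merged order.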

Data Stores

Document Store (S3/MinIO)

| Item      | Details                                  |
|-----------|------------------------------------------|
| Purpose   | Raw file storage                         |
| Format    | Original files                           |
| Retention | Configurable                             |
| Structure | /{tenant}/{year}/{month}/{doc_id}/(unknown) |
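
Building the object key from the layout above might look like this (the trailing filename segment is left out, since the exact naming scheme isn't specified here):

```python
from datetime import datetime, timezone

def object_key(tenant, doc_id, now=None):
    # Zero-pad the month so keys sort lexicographically by date.
    now = now or datetime.now(timezone.utc)
    return f"/{tenant}/{now.year}/{now.month:02d}/{doc_id}/"

key = object_key("acme", "doc_123", datetime(2024, 1, 15, tzinfo=timezone.utc))
print(key)  # /acme/2024/01/doc_123/
```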

Metadata DB (PostgreSQL)

Schema (simplified):

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    mime_type VARCHAR(100),
    size_bytes BIGINT,
    hash_md5 CHAR(32),
    hash_sha256 CHAR(64),
    status VARCHAR(50),
    type VARCHAR(100),
    category VARCHAR(100),
    confidence DECIMAL(5,2),
    created_at TIMESTAMP,
    processed_at TIMESTAMP
);

CREATE TABLE tags (
    id UUID PRIMARY KEY,
    name VARCHAR(100) UNIQUE,
    created_at TIMESTAMP
);

CREATE TABLE document_tags (
    document_id UUID REFERENCES documents(id),
    tag_id UUID REFERENCES tags(id),
    source VARCHAR(50), -- 'auto' or 'manual'
    PRIMARY KEY (document_id, tag_id)
);
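
A typical query over this schema joins `document_tags` back to `tags` to list a document's tags with their source. The demo below exercises a trimmed-down copy of the schema in sqlite3, which is only a stand-in for PostgreSQL here (types are interpreted loosely):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, filename TEXT, status TEXT);
CREATE TABLE tags (id TEXT PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE document_tags (
    document_id TEXT REFERENCES documents(id),
    tag_id TEXT REFERENCES tags(id),
    source TEXT,  -- 'auto' or 'manual'
    PRIMARY KEY (document_id, tag_id)
);
INSERT INTO documents VALUES ('d1', 'invoice.pdf', 'ready');
INSERT INTO tags VALUES ('t1', 'payment');
INSERT INTO document_tags VALUES ('d1', 't1', 'auto');
""")

# All tags on a document, with how each was assigned.
rows = conn.execute("""
    SELECT t.name, dt.source
    FROM document_tags dt
    JOIN tags t ON t.id = dt.tag_id
    WHERE dt.document_id = ?
""", ("d1",)).fetchall()
print(rows)  # [('payment', 'auto')]
```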

Vector Store (Qdrant)

| Collection          | Purpose                 |
|---------------------|-------------------------|
| document_embeddings | Full document vectors   |
| chunk_embeddings    | Paragraph-level vectors |

Scalability

Horizontal Scaling

| Component   | Scaling Strategy                |
|-------------|---------------------------------|
| API Gateway | Multiple replicas behind LB     |
| Workers     | Auto-scale based on queue depth |
| PostgreSQL  | Read replicas                   |
| Qdrant      | Sharding                        |
| Meilisearch | Sharding                        |
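
The "auto-scale based on queue depth" strategy for workers amounts to sizing the replica count from the backlog. A minimal sketch, assuming an illustrative per-worker throughput and replica bounds:

```python
import math

def desired_replicas(queue_depth, jobs_per_worker=100,
                     min_replicas=1, max_replicas=20):
    # Scale proportionally to the backlog, clamped to [min, max] so the
    # pool neither disappears when idle nor grows without bound.
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(0))     # 1
print(desired_replicas(850))   # 9
print(desired_replicas(5000))  # 20
```

In Kubernetes this maps naturally onto an HPA driven by a queue-depth metric rather than CPU.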

Performance Targets

| Metric         | Target           |
|----------------|------------------|
| Ingestion      | 1,000 docs/hour  |
| OCR            | 10 pages/second  |
| Search latency | <200 ms (p95)    |
| API latency    | <100 ms (p95)    |

Security

| Layer          | Controls                     |
|----------------|------------------------------|
| Network        | TLS 1.3, VPC isolation       |
| Authentication | JWT, API keys                |
| Authorization  | RBAC                         |
| Data           | Encryption at rest (AES-256) |
| Audit          | All API calls logged         |

Deployment Options

Option 1: Docker Compose (Development)

services:
  api:
    image: pebble-dms/api
  worker:
    image: pebble-dms/worker
  postgres:
    image: postgres:15
  qdrant:
    image: qdrant/qdrant
  meilisearch:
    image: getmeili/meilisearch
  minio:
    image: minio/minio
  redis:
    image: redis:7

Option 2: Kubernetes (Production)

  • Helm chart for deployment
  • Horizontal Pod Autoscaler (HPA) for worker scaling
  • PersistentVolumeClaims (PVCs) for persistent storage
  • Ingress for external access
