
High-Level Architecture

This document describes the system architecture for Pebble DMS.


System Overview

Pebble DMS is built on a modular, event-driven architecture designed for:

  • Scale: Handle 1TB+ document collections
  • Reliability: Zero data loss; asynchronous, queue-backed processing
  • Extensibility: Easy to add new processors and integrations
  • Performance: Fast ingestion and search

Architecture Diagram

flowchart TB
    subgraph Clients
        A[Web UI]
        B[REST API Clients]
        C[Batch Upload CLI]
    end

    subgraph API Layer
        D[API Gateway / Load Balancer]
        E[Auth Service]
    end

    subgraph Application Services
        F[Ingestion Service]
        G[Processing Orchestrator]
        H[Classification Service]
        I[Tagging Service]
        J[Search Service]
    end

    subgraph Processing Workers
        K[OCR Worker]
        L[Embedding Worker]
        M[Dedup Worker]
    end

    subgraph Message Queue
        N[Job Queue - Redis/RabbitMQ]
    end

    subgraph Data Stores
        O[(Document Store - S3/MinIO)]
        P[(Metadata DB - PostgreSQL)]
        Q[(Vector Store - Qdrant)]
        R[(Search Index - Meilisearch)]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    E --> J

    F --> N
    N --> K
    N --> L
    N --> M

    K --> G
    L --> G
    M --> G

    G --> H
    G --> I

    H --> P
    I --> P

    F --> O
    G --> P
    L --> Q
    J --> R
    J --> Q

Component Details

1. API Gateway

Purpose: Single entry point for all client requests.

| Feature         | Implementation        |
|-----------------|-----------------------|
| Load balancing  | Nginx / Traefik       |
| Rate limiting   | Token bucket          |
| Authentication  | JWT validation        |
| Request routing | Path-based            |

Endpoints:

POST   /api/v1/documents          # Upload
GET    /api/v1/documents/:id      # Get metadata
GET    /api/v1/documents/:id/file # Download file
DELETE /api/v1/documents/:id      # Delete
POST   /api/v1/search             # Search
GET    /api/v1/tags               # List tags
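
The endpoints above can be wrapped in a thin client. A minimal sketch, assuming a hypothetical base URL, bearer-token auth, and search payload shape (the real API may differ); it only builds the requests, never sends them, so the routing stays visible:

```python
import json
import urllib.request

class PebbleClient:
    def __init__(self, base_url, token):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, method, path, body=None):
        # Build (but do not send) the request for the given endpoint.
        return urllib.request.Request(
            f"{self.base_url}/api/v1{path}",
            data=json.dumps(body).encode() if body is not None else None,
            method=method,
            headers={
                "Authorization": f"Bearer {self.token}",
                "Content-Type": "application/json",
            },
        )

    def get_document(self, doc_id):
        return self._request("GET", f"/documents/{doc_id}")

    def search(self, query):
        # The {"q": ...} payload is an assumption; the spec above only
        # names the endpoint, not its body.
        return self._request("POST", "/search", {"q": query})

client = PebbleClient("https://dms.example.com", "my-token")
req = client.get_document("doc_123")
print(req.get_method(), req.full_url)
```

A real client would pass the request to `urllib.request.urlopen` (or use an HTTP library) and decode the JSON response.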

2. Ingestion Service

Purpose: Accept and validate incoming documents.

sequenceDiagram
    participant Client
    participant API
    participant Ingestion
    participant Queue
    participant Storage

    Client->>API: POST /documents (file)
    API->>Ingestion: Validate & store
    Ingestion->>Storage: Save raw file
    Storage-->>Ingestion: File URL
    Ingestion->>Queue: Enqueue processing job
    Ingestion-->>API: Document ID + status
    API-->>Client: 202 Accepted

Responsibilities:

  • File format validation
  • Size limit enforcement
  • Virus scanning (optional)
  • Initial metadata extraction
  • Queueing for async processing
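
The sequence above can be sketched with in-memory stand-ins for the object store and job queue (the real system uses S3/MinIO and Redis/RabbitMQ); the allowed types and size limit are illustrative, not the service's actual policy:

```python
import hashlib
import uuid

ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # illustrative
MAX_SIZE_BYTES = 50 * 1024 * 1024                               # illustrative

storage = {}    # stand-in for S3/MinIO: doc_id -> raw bytes
job_queue = []  # stand-in for Redis/RabbitMQ: pending jobs

def ingest(filename, mime_type, data):
    # File format validation and size limit enforcement
    if mime_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {mime_type}")
    if len(data) > MAX_SIZE_BYTES:
        raise ValueError("file too large")
    doc_id = str(uuid.uuid4())
    storage[doc_id] = data              # "Save raw file"
    job_queue.append({                  # "Enqueue processing job"
        "doc_id": doc_id,
        "filename": filename,
        "sha256": hashlib.sha256(data).hexdigest(),
    })
    return {"id": doc_id, "status": "queued"}  # body of the 202 Accepted

result = ingest("invoice.pdf", "application/pdf", b"%PDF-1.4 ...")
print(result["status"])  # queued
```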

3. Processing Orchestrator

Purpose: Coordinate document processing pipeline.

| Step | Worker                 | Input            | Output           |
|------|------------------------|------------------|------------------|
| 1    | OCR Worker             | Raw file         | Extracted text   |
| 2    | Embedding Worker       | Text             | Vector embedding |
| 3    | Dedup Worker           | Hash + embedding | Duplicate status |
| 4    | Classification Service | Text             | Type + category  |
| 5    | Tagging Service        | Text             | Tags             |

State Machine:

stateDiagram-v2
    [*] --> Queued
    Queued --> OCR: Start processing
    OCR --> Embedding: Text extracted
    Embedding --> Dedup: Embedding computed
    Dedup --> Classification: Not duplicate
    Dedup --> Duplicate: Is duplicate
    Classification --> Tagging: Classified
    Tagging --> Indexing: Tagged
    Indexing --> Ready: Indexed
    Ready --> [*]

    OCR --> Failed: Error
    Embedding --> Failed: Error
    Failed --> [*]
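
The state machine above can be captured as a transition table; a hypothetical sketch of the idea, not the orchestrator's actual implementation:

```python
# (state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    ("queued", "start"): "ocr",
    ("ocr", "text_extracted"): "embedding",
    ("ocr", "error"): "failed",
    ("embedding", "embedding_computed"): "dedup",
    ("embedding", "error"): "failed",
    ("dedup", "not_duplicate"): "classification",
    ("dedup", "is_duplicate"): "duplicate",
    ("classification", "classified"): "tagging",
    ("tagging", "tagged"): "indexing",
    ("indexing", "indexed"): "ready",
}

def advance(state, event):
    # Reject transitions the diagram does not allow.
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} + {event}")

state = "queued"
for event in ["start", "text_extracted", "embedding_computed",
              "not_duplicate", "classified", "tagged", "indexed"]:
    state = advance(state, event)
print(state)  # ready
```

Making illegal transitions raise (rather than silently no-op) keeps worker bugs visible instead of leaving documents stuck in an inconsistent state.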

4. OCR Worker

Purpose: Extract text from scanned documents and images.

| Engine    | Use Case          | Languages               |
|-----------|-------------------|-------------------------|
| Tesseract | General OCR       | English, Hindi, 100+    |
| docTR     | Deep learning OCR | English (high accuracy) |
| EasyOCR   | Multi-language    | 80+ languages           |

Configuration:

ocr:
  engine: tesseract
  languages: [eng, hin]
  timeout_seconds: 300
  max_pages: 100
  dpi: 300
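
A worker might dispatch on this config roughly as follows. The engine functions here are stubs; a real worker would call pytesseract, docTR, or EasyOCR, and would also apply timeout_seconds and render pages at the configured dpi:

```python
def ocr_tesseract(page_bytes, languages):
    # Stub: a real implementation would invoke Tesseract here.
    return f"<tesseract:{'+'.join(languages)}>"

def ocr_easyocr(page_bytes, languages):
    # Stub: a real implementation would invoke EasyOCR here.
    return f"<easyocr:{'+'.join(languages)}>"

ENGINES = {"tesseract": ocr_tesseract, "easyocr": ocr_easyocr}

def run_ocr(config, pages):
    engine = ENGINES[config["engine"]]
    # Enforce max_pages from the config before doing any work.
    limited = pages[: config["max_pages"]]
    return [engine(p, config["languages"]) for p in limited]

config = {"engine": "tesseract", "languages": ["eng", "hin"],
          "timeout_seconds": 300, "max_pages": 100, "dpi": 300}
texts = run_ocr(config, [b"page1", b"page2"])
print(len(texts))  # 2
```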

5. Embedding Worker

Purpose: Generate vector embeddings for semantic operations.

| Model                | Dimensions | Use Case              |
|----------------------|------------|-----------------------|
| all-MiniLM-L6-v2     | 384        | Fast, general purpose |
| all-mpnet-base-v2    | 768        | High accuracy         |
| multilingual-e5-base | 768        | Multi-language        |

Process:

  1. Chunk text into 512-token segments
  2. Generate embedding per chunk
  3. Mean-pool for document-level embedding
  4. Store in vector database
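
The four steps above, sketched with a stub embedder (a real worker would call a sentence-transformers model such as all-MiniLM-L6-v2 and return 384- or 768-dimensional vectors):

```python
def chunk(tokens, size=512):
    # Step 1: split the token list into fixed-size segments.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(segment, dim=4):
    # Step 2 stub: encodes only the segment length, so the pooling
    # arithmetic below is easy to check by hand.
    return [float(len(segment))] * dim

def mean_pool(vectors):
    # Step 3: average chunk vectors into one document-level vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

tokens = ["tok"] * 1000            # pretend tokenized document
chunks = chunk(tokens)             # two chunks: 512 + 488 tokens
doc_vec = mean_pool([embed(c) for c in chunks])
print(len(chunks), doc_vec[0])  # 2 500.0
```

Step 4 would then upsert both the per-chunk vectors and `doc_vec` into the vector store.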

6. Classification Service

Purpose: Categorize documents automatically.

flowchart LR
    A[Text] --> B[Tokenize]
    B --> C[Feature Extraction]
    C --> D[Type Classifier]
    C --> E[Category Classifier]
    D --> F[Type + Confidence]
    E --> G[Category + Confidence]
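
A toy version of the two-headed flow above: one shared feature step feeding two independent classifiers. Keyword scoring stands in for real feature extraction and trained models, and the keyword sets are invented for illustration:

```python
TYPE_KEYWORDS = {
    "invoice": {"invoice", "payment", "total", "due"},
    "contract": {"agreement", "party", "hereby", "term"},
}
CATEGORY_KEYWORDS = {
    "finance": {"payment", "total", "amount", "tax"},
    "legal": {"agreement", "hereby", "clause"},
}

def features(text):
    # Stand-in for tokenization + feature extraction.
    return set(text.lower().split())

def classify(tokens, keyword_map):
    # Score each label by keyword overlap; confidence is the winning
    # label's share of all matches.
    scores = {label: len(tokens & kws) for label, kws in keyword_map.items()}
    label = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    return label, scores[label] / total

tokens = features("Invoice total payment due on receipt")
doc_type, type_conf = classify(tokens, TYPE_KEYWORDS)
category, cat_conf = classify(tokens, CATEGORY_KEYWORDS)
print(doc_type, category)  # invoice finance
```

The point of the shared `features` step is that both classifiers consume the same representation, so feature extraction runs once per document.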

Document Types:

| Type           | Examples             |
|----------------|----------------------|
| Invoice        | Bills, receipts      |
| Contract       | Agreements, NDAs     |
| Report         | Analysis, summaries  |
| Correspondence | Letters, emails      |
| Form           | Applications, surveys |
| ID             | Passports, licenses  |

7. Search Service

Purpose: Enable document discovery.

| Search Type | Technology  | Query Example             |
|-------------|-------------|---------------------------|
| Full-text   | Meilisearch | payment invoice 2024      |
| Filtered    | PostgreSQL  | type = 'invoice'          |
| Semantic    | Qdrant      | "documents about renewals" |

Index Schema:

{
  "id": "doc_123",
  "title": "Invoice #456",
  "content": "Full extracted text...",
  "type": "invoice",
  "category": "finance",
  "tags": ["payment", "vendor", "Q1-2024"],
  "created_at": "2024-01-15T10:30:00Z",
  "embedding": [0.123, -0.456, ...]
}
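
When a query runs against both Meilisearch and Qdrant, their result lists must be merged. The sketch below uses a reciprocal-rank-fusion-style merge; this is an illustrative choice, since the document does not specify the service's actual ranking:

```python
def merge_results(fulltext_ids, semantic_ids, k=60):
    # Each backend contributes 1 / (k + rank) per document; documents
    # ranked highly by both backends accumulate the largest score.
    scores = {}
    for ranked in (fulltext_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fulltext = ["doc_1", "doc_2", "doc_3"]   # e.g. from Meilisearch
semantic = ["doc_2", "doc_4", "doc_1"]   # e.g. from Qdrant
merged = merge_results(fulltext, semantic)
print(merged)  # doc_2 first: it ranks highly in both lists
```

The constant k damps the influence of top ranks so one backend cannot dominate the merged order.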

Data Stores

Document Store (S3/MinIO)

| Item      | Details                                  |
|-----------|------------------------------------------|
| Purpose   | Raw file storage                         |
| Format    | Original files                           |
| Retention | Configurable                             |
| Structure | /{tenant}/{year}/{month}/{doc_id}/(unknown) |
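
Building the object key from the layout above might look like this (the trailing filename segment is left out, since the exact naming scheme isn't specified here):

```python
from datetime import datetime, timezone

def object_key(tenant, doc_id, now=None):
    # Zero-pad the month so keys sort lexicographically by date.
    now = now or datetime.now(timezone.utc)
    return f"/{tenant}/{now.year}/{now.month:02d}/{doc_id}/"

key = object_key("acme", "doc_123", datetime(2024, 1, 15, tzinfo=timezone.utc))
print(key)  # /acme/2024/01/doc_123/
```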

Metadata DB (PostgreSQL)

Schema (simplified):

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    mime_type VARCHAR(100),
    size_bytes BIGINT,
    hash_md5 CHAR(32),
    hash_sha256 CHAR(64),
    status VARCHAR(50),
    type VARCHAR(100),
    category VARCHAR(100),
    confidence DECIMAL(5,2),
    created_at TIMESTAMP,
    processed_at TIMESTAMP
);

CREATE TABLE tags (
    id UUID PRIMARY KEY,
    name VARCHAR(100) UNIQUE,
    created_at TIMESTAMP
);

CREATE TABLE document_tags (
    document_id UUID REFERENCES documents(id),
    tag_id UUID REFERENCES tags(id),
    source VARCHAR(50), -- 'auto' or 'manual'
    PRIMARY KEY (document_id, tag_id)
);
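
A typical query over this schema joins `document_tags` back to `tags` to list a document's tags with their source. The demo below exercises a trimmed-down copy of the schema in sqlite3, which is only a stand-in for PostgreSQL here (types are interpreted loosely):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, filename TEXT, status TEXT);
CREATE TABLE tags (id TEXT PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE document_tags (
    document_id TEXT REFERENCES documents(id),
    tag_id TEXT REFERENCES tags(id),
    source TEXT,  -- 'auto' or 'manual'
    PRIMARY KEY (document_id, tag_id)
);
INSERT INTO documents VALUES ('d1', 'invoice.pdf', 'ready');
INSERT INTO tags VALUES ('t1', 'payment');
INSERT INTO document_tags VALUES ('d1', 't1', 'auto');
""")

# All tags on a document, with how each was assigned.
rows = conn.execute("""
    SELECT t.name, dt.source
    FROM document_tags dt
    JOIN tags t ON t.id = dt.tag_id
    WHERE dt.document_id = ?
""", ("d1",)).fetchall()
print(rows)  # [('payment', 'auto')]
```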

Vector Store (Qdrant)

| Collection          | Purpose                 |
|---------------------|-------------------------|
| document_embeddings | Full document vectors   |
| chunk_embeddings    | Paragraph-level vectors |

Scalability

Horizontal Scaling

| Component   | Scaling Strategy                |
|-------------|---------------------------------|
| API Gateway | Multiple replicas behind LB     |
| Workers     | Auto-scale based on queue depth |
| PostgreSQL  | Read replicas                   |
| Qdrant      | Sharding                        |
| Meilisearch | Sharding                        |
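
The "auto-scale based on queue depth" strategy for workers amounts to sizing the replica count from the backlog. A minimal sketch, assuming an illustrative per-worker throughput and replica bounds:

```python
import math

def desired_replicas(queue_depth, jobs_per_worker=100,
                     min_replicas=1, max_replicas=20):
    # Scale proportionally to the backlog, clamped to [min, max] so the
    # pool neither disappears when idle nor grows without bound.
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(0))     # 1
print(desired_replicas(850))   # 9
print(desired_replicas(5000))  # 20
```

In Kubernetes this maps naturally onto an HPA driven by a queue-depth metric rather than CPU.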

Performance Targets

| Metric         | Target           |
|----------------|------------------|
| Ingestion      | 1,000 docs/hour  |
| OCR            | 10 pages/second  |
| Search latency | <200 ms (p95)    |
| API latency    | <100 ms (p95)    |

Security

| Layer          | Controls                     |
|----------------|------------------------------|
| Network        | TLS 1.3, VPC isolation       |
| Authentication | JWT, API keys                |
| Authorization  | RBAC                         |
| Data           | Encryption at rest (AES-256) |
| Audit          | All API calls logged         |

Deployment Options

Option 1: Docker Compose (Development)

services:
  api:
    image: pebble-dms/api
  worker:
    image: pebble-dms/worker
  postgres:
    image: postgres:15
  qdrant:
    image: qdrant/qdrant
  meilisearch:
    image: getmeili/meilisearch
  minio:
    image: minio/minio
  redis:
    image: redis:7

Option 2: Kubernetes (Production)

  • Helm chart for deployment
  • Horizontal Pod Autoscaler (HPA) for worker scaling
  • PersistentVolumeClaims (PVCs) for persistent storage
  • Ingress for external access
