Document Workflow Overview

This document describes the end-to-end processing pipeline for documents flowing through Pebble DMS.

Workflow Diagram

flowchart TB
    subgraph "1. Ingestion"
        A[📄 Document Upload] --> B[Format Detection]
        B --> C{Supported?}
        C -->|Yes| D[Metadata Extraction]
        C -->|No| E[Reject with Error]
    end

    subgraph "2. Text Extraction"
        D --> F{File Type?}
        F -->|PDF with text| G[Extract Text Layer]
        F -->|Scanned/Image| H[OCR Processing]
        G --> I[Raw Text]
        H --> I
    end

    subgraph "3. Deduplication"
        I --> J[Compute Hash]
        J --> K{Exact Match?}
        K -->|Yes| L[Link to Existing]
        K -->|No| M[Compute Embedding]
        M --> N{Near-Duplicate?}
        N -->|Yes, >95%| O[Flag for Review]
        N -->|No| P[New Document]
    end

    subgraph "4. Classification"
        P --> Q[Classify Type]
        O --> Q
        Q --> R[Classify Category]
        R --> S[Confidence Score]
        S --> T{High Confidence?}
        T -->|Yes, >80%| U[Auto-Apply]
        T -->|No| V[Queue for Review]
    end

    subgraph "5. Tagging"
        U --> W[NER Extraction]
        V --> W
        W --> X[Keyword Extraction]
        X --> Y[Topic Detection]
        Y --> Z[Apply Tags]
    end

    subgraph "6. Indexing"
        Z --> AA[Store Document]
        AA --> AB[Index Full Text]
        AB --> AC[Store Embeddings]
        AC --> AD[Update Metadata]
    end

    subgraph "7. Access"
        AD --> AE[Search API]
        AE --> AF[Web UI]
    end

Workflow Stages

Stage 1: Ingestion

Purpose: Accept documents and prepare them for processing.

Step	Description	Output
Upload	Accept file via API or UI	Raw file bytes
Format Detection	Identify PDF, PNG, JPG, TIFF, etc.	MIME type
Validation	Check that the file is not corrupt and is within size limits	Pass/Fail
Metadata Extraction	Filename, size, creation date	Initial metadata

Supported Formats:

PDF (native text, scanned, hybrid)
Images: PNG, JPG, JPEG, TIFF, BMP, WEBP
Future: DOCX, XLSX, PPTX
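
Format detection for the currently supported formats can be sketched by inspecting leading file signatures rather than trusting the file extension. This is a minimal illustration, not Pebble DMS's actual detector: the function name `detect_mime` and the choice to return `None` for unsupported input are assumptions made for the example.

```python
from typing import Optional

# Leading byte signatures for the supported formats listed above.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),   # covers both .jpg and .jpeg
    (b"II*\x00", "image/tiff"),        # little-endian TIFF
    (b"MM\x00*", "image/tiff"),        # big-endian TIFF
    (b"BM", "image/bmp"),
]


def detect_mime(data: bytes) -> Optional[str]:
    """Return the MIME type of a supported format, or None (hypothetical helper)."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```

A `None` result corresponds to the "Reject with Error" branch in the workflow diagram. A real validator would also parse the file to confirm it is not corrupt, since a matching signature alone does not prove that.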

Stage 2: Text Extraction

Purpose: Extract searchable text from all documents.

File Type	Method	Tool
PDF (text layer)	Direct extraction	PyMuPDF, pdfplumber
PDF (scanned)	OCR	Tesseract, docTR
Image	OCR	Tesseract, docTR

Language Support:

English (primary)
Hindi, Marathi, Tamil (planned)
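
The choice between direct extraction and OCR can be made per page, which also handles hybrid PDFs. The sketch below assumes the text layer of each page has already been read (for example with PyMuPDF or pdfplumber) and only shows the routing decision. The 20-character threshold and both function names are illustrative assumptions.

```python
from typing import List

MIN_TEXT_CHARS = 20  # assumed threshold; tune it on real documents


def pages_needing_ocr(page_texts: List[str], min_chars: int = MIN_TEXT_CHARS) -> List[int]:
    """Return zero-based indexes of pages whose text layer is too sparse to use."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def classify_pdf(page_texts: List[str]) -> str:
    """Label a PDF as 'native', 'scanned', or 'hybrid' from its per-page text."""
    sparse = pages_needing_ocr(page_texts)
    if not sparse:
        return "native"
    if len(sparse) == len(page_texts):
        return "scanned"
    return "hybrid"
```

For a hybrid file, only the pages returned by `pages_needing_ocr` would be sent to the OCR engine; the rest keep their extracted text layer.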

Stage 3: Deduplication

Purpose: Identify and handle duplicate documents.

flowchart LR
    A[Document] --> B[MD5/SHA256 Hash]
    B --> C{Exact Match?}
    C -->|Yes| D[Duplicate - Skip]
    C -->|No| E[Generate Embedding]
    E --> F[Vector Similarity Search]
    F --> G{>95% Similar?}
    G -->|Yes| H[Near-Duplicate]
    G -->|No| I[Unique Document]

Dedup Strategies:

Strategy	Use Case	Tool
Hash-based	Exact file duplicates	MD5, SHA256
Content-based	Same text, different file	Text hashing
Semantic	Similar meaning	Embeddings + cosine
Visual	Similar images	pHash, dHash
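
The hash-then-embedding check in the flowchart can be sketched as follows. This is an illustration under stated assumptions: the in-memory dictionaries stand in for whatever hash store and vector database the deployment uses, embeddings are assumed to be computed elsewhere, and `check_duplicate` is a hypothetical name.

```python
import hashlib
import math
from typing import Dict, Optional, Sequence, Tuple

NEAR_DUPLICATE_THRESHOLD = 0.95  # the ">95% similar" cut-off from the flowchart


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_duplicate(
    data: bytes,
    embedding: Sequence[float],
    known_hashes: Dict[str, str],                  # digest -> document ID
    known_embeddings: Dict[str, Sequence[float]],  # document ID -> embedding
) -> Tuple[str, Optional[str]]:
    """Return ('exact' | 'near_duplicate' | 'unique', matching document ID or None)."""
    digest = sha256_hex(data)
    if digest in known_hashes:
        return "exact", known_hashes[digest]
    # Linear scan for clarity; a vector index would replace this loop.
    best_id, best_score = None, 0.0
    for doc_id, other in known_embeddings.items():
        score = cosine_similarity(embedding, other)
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score > NEAR_DUPLICATE_THRESHOLD:
        return "near_duplicate", best_id
    return "unique", None
```

The cheap hash lookup runs first so that the embedding comparison is only paid for files that are not byte-for-byte identical.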

Stage 4: Classification

Purpose: Automatically categorize documents.

Document Types (examples):

Invoice
Contract
Report
Letter
Form
Receipt
ID Document
Medical Record
Legal Filing

Classification Model:

Input: Extracted text (first 512 tokens)
Model: Fine-tuned transformer or traditional ML
Output: Type + Category + Confidence (0-100%)

Confidence Handling:

Confidence	Action
>80%	Auto-apply classification
50-80%	Apply but flag for review
<50%	Queue for manual classification
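
The three confidence bands above map directly onto a small routing function. The sketch below assumes confidence is expressed as a percentage and that a score of exactly 80% falls into the middle band, as the table's ">80%" row implies. The function name and action labels are illustrative.

```python
def review_action(confidence: float) -> str:
    """Map a classifier confidence (0-100%) to the action from the table above."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence must be within 0-100, got {confidence}")
    if confidence > 80.0:
        return "auto_apply"
    if confidence >= 50.0:
        return "apply_and_flag"
    return "manual_queue"
```

For example, `review_action(92.0)` yields `"auto_apply"`, while `review_action(49.9)` sends the document to the manual classification queue.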

Stage 5: Tagging

Purpose: Extract and apply descriptive tags.

Tag Type	Method	Examples
Entities (NER)	Named Entity Recognition	Person names, Org names, Dates
Keywords	TF-IDF, RAKE, YAKE	"invoice", "payment", "contract"
Topics	Topic modeling	Finance, Legal, HR, Operations
Custom	User-defined rules	Project codes, client names
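
To make the keyword row concrete, here is a minimal TF-IDF ranking in plain Python. It is a sketch only: the tokenizer is deliberately naive, the smoothed IDF formula is one common variant rather than a confirmed project choice, and `top_keywords` is a hypothetical helper. A production system would more likely rely on a library implementation, or on RAKE or YAKE.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(doc: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank the terms of `doc` by TF-IDF against `corpus` (which should include `doc`)."""
    doc_tokens = tokenize(doc)
    term_counts = Counter(doc_tokens)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_terms)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in one document but rare across the corpus rank highest, which is why a word that appears in every document scores low even when it is common in the document being tagged.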

Stage 6: Indexing

Purpose: Make documents searchable.

Index Type	Purpose	Technology
Full-text	Keyword search	Elasticsearch, Meilisearch
Vector	Semantic search	Qdrant, Weaviate, Pinecone
Metadata	Filtered queries	PostgreSQL

Stage 7: Access

Purpose: Enable document retrieval.

Search Modes:

Mode	Query Example
Keyword	`invoice payment 2024`
Filter	`type:invoice AND date:2024-01`
Semantic	"documents about contract renewal"
Combined	`client:ACME type:contract "renewal terms"`
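
A combined query can be split into metadata filters, quoted phrases, and free keywords before it is dispatched to the individual indexes. The parser below is a sketch of that idea, not the Search API's documented grammar: `parse_query` is a hypothetical name, a bare `AND` is treated as an implicit conjunction and dropped, and quoted filter values such as `client:"ACME Corp"` are not handled.

```python
import re
from typing import Dict, List, Tuple

# Alternatives, in order: "quoted phrase", field:value, bare word.
_TOKEN = re.compile(r'"([^"]*)"|(\w+):(\S+)|(\S+)')


def parse_query(query: str) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Split a query into (filters, phrases, keywords)."""
    filters: Dict[str, str] = {}
    phrases: List[str] = []
    keywords: List[str] = []
    for match in _TOKEN.finditer(query):
        phrase, field, value, word = match.groups()
        if phrase is not None:
            phrases.append(phrase)
        elif field is not None:
            filters[field] = value
        elif word != "AND":  # assumed: AND only joins filters, so it is skipped
            keywords.append(word)
    return filters, phrases, keywords
```

In this sketch, filters would go to the metadata store, keywords to the full-text index, and phrases to either phrase matching or semantic search, depending on the mode selected.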

User Personas

1. Document Administrator

Goal: Bulk upload and organize documents
Key Workflows: Batch upload, review duplicates, correct classifications

2. Knowledge Worker

Goal: Find specific documents quickly
Key Workflows: Search, filter by tags, view document details

3. Data Engineer

Goal: Integrate DMS with other systems
Key Workflows: API integration, webhook setup, export data

4. ML Engineer

Goal: Improve classification accuracy
Key Workflows: Review low-confidence items, retrain models

Document States

stateDiagram-v2
    [*] --> Uploaded: File received
    Uploaded --> Processing: Queue picked up
    Processing --> TextExtracted: OCR/Extraction done
    TextExtracted --> Deduped: Dedup check passed
    Deduped --> Classified: Type assigned
    Classified --> Tagged: Tags applied
    Tagged --> Indexed: Search index updated
    Indexed --> Ready
    Ready --> [*]

    Processing --> Failed: Error
    Failed --> [*]

    Deduped --> Duplicate: Exact match found
    Duplicate --> [*]

    Deduped --> Review: Near-duplicate
    Review --> Ready: Confirmed unique
    Review --> Merged: Merged with existing
    Merged --> [*]
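
The state diagram can be enforced in code with a transition table, so that a document never jumps to a state the diagram does not allow. This is a minimal sketch: the state names mirror the diagram, treating Ready as a terminal state is an assumption, and `advance` is a hypothetical helper.

```python
from typing import Dict, Set

# Allowed transitions, copied from the state diagram above.
TRANSITIONS: Dict[str, Set[str]] = {
    "Uploaded": {"Processing"},
    "Processing": {"TextExtracted", "Failed"},
    "TextExtracted": {"Deduped"},
    "Deduped": {"Classified", "Duplicate", "Review"},
    "Classified": {"Tagged"},
    "Tagged": {"Indexed"},
    "Indexed": {"Ready"},
    "Review": {"Ready", "Merged"},
    # Terminal states (Ready is assumed to be terminal).
    "Ready": set(),
    "Failed": set(),
    "Duplicate": set(),
    "Merged": set(),
}


def advance(current: str, target: str) -> str:
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Calling `advance("Uploaded", "Processing")` succeeds, whereas `advance("Uploaded", "Ready")` raises `ValueError` because the diagram has no such edge.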

Processing Metrics

Metric	Target	Measurement
Ingestion → Ready	<5 min	End-to-end latency
OCR throughput	10 pages/sec	Pages processed/second
Dedup check	<1 sec	Hash + embedding lookup
Classification	<0.5 sec	Model inference time
Indexing	<2 sec	Search index update
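
These targets can be checked automatically against measured values. The sketch below encodes the table in seconds and reports which targets were missed; the metric keys and the `missed_targets` function are illustrative names, and how the measurements are collected is left open.

```python
from typing import Dict, List

# Latency targets from the table above, in seconds (each must be strictly under).
LATENCY_TARGETS_SECONDS: Dict[str, float] = {
    "ingestion_to_ready": 300.0,  # <5 min
    "dedup_check": 1.0,
    "classification": 0.5,
    "indexing": 2.0,
}
MIN_OCR_PAGES_PER_SECOND = 10.0


def missed_targets(latencies: Dict[str, float], ocr_pages_per_second: float) -> List[str]:
    """Return the names of the targets that the measurements fail to meet."""
    missed = sorted(
        name
        for name, seconds in latencies.items()
        if name in LATENCY_TARGETS_SECONDS and seconds >= LATENCY_TARGETS_SECONDS[name]
    )
    if ocr_pages_per_second < MIN_OCR_PAGES_PER_SECOND:
        missed.append("ocr_throughput")
    return missed
```

An empty result means every measured value is within its target; metrics that were not measured are simply not evaluated.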
