Document Workflow Overview

This document describes the end-to-end processing pipeline for documents flowing through Pebble DMS.


Workflow Diagram

flowchart TB
    subgraph "1. Ingestion"
        A[📄 Document Upload] --> B[Format Detection]
        B --> C{Supported?}
        C -->|Yes| D[Metadata Extraction]
        C -->|No| E[Reject with Error]
    end

    subgraph "2. Text Extraction"
        D --> F{File Type?}
        F -->|PDF with text| G[Extract Text Layer]
        F -->|Scanned/Image| H[OCR Processing]
        G --> I[Raw Text]
        H --> I
    end

    subgraph "3. Deduplication"
        I --> J[Compute Hash]
        J --> K{Exact Match?}
        K -->|Yes| L[Link to Existing]
        K -->|No| M[Compute Embedding]
        M --> N{Near-Duplicate?}
        N -->|Yes, >95%| O[Flag for Review]
        N -->|No| P[New Document]
    end

    subgraph "4. Classification"
        P --> Q[Classify Type]
        O --> Q
        Q --> R[Classify Category]
        R --> S[Confidence Score]
        S --> T{High Confidence?}
        T -->|Yes, >80%| U[Auto-Apply]
        T -->|No| V[Queue for Review]
    end

    subgraph "5. Tagging"
        U --> W[NER Extraction]
        V --> W
        W --> X[Keyword Extraction]
        X --> Y[Topic Detection]
        Y --> Z[Apply Tags]
    end

    subgraph "6. Indexing"
        Z --> AA[Store Document]
        AA --> AB[Index Full Text]
        AB --> AC[Store Embeddings]
        AC --> AD[Update Metadata]
    end

    subgraph "7. Access"
        AD --> AE[Search API]
        AE --> AF[Web UI]
    end

Workflow Stages

Stage 1: Ingestion

Purpose: Accept documents and prepare them for processing.

| Step | Description | Output |
|------|-------------|--------|
| Upload | Accept file via API or UI | Raw file bytes |
| Format Detection | Identify PDF, PNG, JPG, TIFF, etc. | MIME type |
| Validation | Check that the file is not corrupt and is within size limits | Pass/Fail |
| Metadata Extraction | Filename, size, creation date | Initial metadata |

Supported Formats:

  • PDF (native text, scanned, hybrid)
  • Images: PNG, JPG, JPEG, TIFF, BMP, WEBP
  • Future: DOCX, XLSX, PPTX
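
The format detection step can be illustrated by checking the leading bytes of the upload against known file signatures. This is a minimal sketch, not the Pebble DMS implementation: the function name `detect_mime` and the choice to return `None` for unsupported input are assumptions made for illustration.

```python
from typing import Optional

# Leading byte signatures for the currently supported formats.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),   # covers both .jpg and .jpeg
    (b"II*\x00", "image/tiff"),        # little-endian TIFF
    (b"MM\x00*", "image/tiff"),        # big-endian TIFF
    (b"BM", "image/bmp"),
]


def detect_mime(data: bytes) -> Optional[str]:
    """Return the MIME type of a supported format, or None if unrecognized.

    Hypothetical helper: the real pipeline may rely on a dedicated library.
    """
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```

A `None` result corresponds to the "Reject with Error" branch of the workflow diagram.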

Stage 2: Text Extraction

Purpose: Extract searchable text from all documents.

| File Type | Method | Tool |
|-----------|--------|------|
| PDF (text layer) | Direct extraction | PyMuPDF, pdfplumber |
| PDF (scanned) | OCR | Tesseract, docTR |
| Image | OCR | Tesseract, docTR |
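
The routing between direct extraction and OCR might look like the sketch below. It assumes the caller has already measured how many characters the PDF text layer contains. The function name and the `min_chars` threshold of 20 are hypothetical and would need tuning against real documents.

```python
def choose_extraction_method(mime_type: str, text_layer_chars: int,
                             min_chars: int = 20) -> str:
    """Return "text_layer" or "ocr" for a document.

    min_chars is an assumed cutoff: a PDF whose text layer is shorter than
    this is treated as scanned and sent to OCR.
    """
    is_pdf = mime_type == "application/pdf"
    if is_pdf and text_layer_chars >= min_chars:
        return "text_layer"
    if is_pdf or mime_type.startswith("image/"):
        return "ocr"
    raise ValueError(f"unsupported MIME type: {mime_type}")
```

Hybrid PDFs, which mix native and scanned pages, would likely need this decision made per page rather than per file.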

Language Support:

  • English (primary)
  • Hindi, Marathi, Tamil (planned)

Stage 3: Deduplication

Purpose: Identify and handle duplicate documents.

flowchart LR
    A[Document] --> B[MD5/SHA256 Hash]
    B --> C{Exact Match?}
    C -->|Yes| D[Duplicate - Skip]
    C -->|No| E[Generate Embedding]
    E --> F[Vector Similarity Search]
    F --> G{>95% Similar?}
    G -->|Yes| H[Near-Duplicate]
    G -->|No| I[Unique Document]

Dedup Strategies:

| Strategy | Use Case | Tool |
|----------|----------|------|
| Hash-based | Exact file duplicates | MD5, SHA256 |
| Content-based | Same text, different file | Text hashing |
| Semantic | Similar meaning | Embeddings + cosine |
| Visual | Similar images | pHash, dHash |
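
The two-step check in the flowchart (exact hash match first, then embedding similarity) can be sketched as follows. This sketch assumes embeddings are plain lists of floats and interprets ">95% similar" as a cosine similarity above 0.95. The function names and the in-memory lookups are illustrative only; a production system would query a vector index instead of scanning a list.

```python
import hashlib
import math
from typing import List, Sequence, Set


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the exact-duplicate key."""
    return hashlib.sha256(data).hexdigest()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup_decision(data: bytes, embedding: Sequence[float],
                   known_hashes: Set[str],
                   known_embeddings: List[Sequence[float]],
                   threshold: float = 0.95) -> str:
    """Return "exact_duplicate", "near_duplicate", or "unique"."""
    if sha256_hex(data) in known_hashes:
        return "exact_duplicate"
    # Linear scan stands in for a vector similarity search.
    if any(cosine_similarity(embedding, e) > threshold for e in known_embeddings):
        return "near_duplicate"
    return "unique"
```

Checking the hash first keeps the common case cheap, since the embedding only needs to be computed when no exact match exists.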

Stage 4: Classification

Purpose: Automatically categorize documents.

Document Types (examples):

  • Invoice
  • Contract
  • Report
  • Letter
  • Form
  • Receipt
  • ID Document
  • Medical Record
  • Legal Filing

Classification Model:

  • Input: Extracted text (first 512 tokens)
  • Model: Fine-tuned transformer or traditional ML
  • Output: Type + Category + Confidence (0-100%)

Confidence Handling:

| Confidence | Action |
|------------|--------|
| >80% | Auto-apply classification |
| 50-80% | Apply but flag for review |
| <50% | Queue for manual classification |
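
The confidence thresholds above translate into a small routing function. This is a sketch with assumed names. It treats exactly 80% as part of the middle band and exactly 50% as "apply but flag", following the table's boundaries.

```python
def confidence_action(confidence: float) -> str:
    """Map a confidence score (0-100) to a handling action.

    The action names are hypothetical labels for the three table rows.
    """
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence > 80:
        return "auto_apply"
    if confidence >= 50:
        return "apply_and_flag"
    return "manual_queue"
```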

Stage 5: Tagging

Purpose: Extract and apply descriptive tags.

| Tag Type | Method | Examples |
|----------|--------|----------|
| Entities (NER) | Named Entity Recognition | Person names, Org names, Dates |
| Keywords | TF-IDF, RAKE, YAKE | "invoice", "payment", "contract" |
| Topics | Topic modeling | Finance, Legal, HR, Operations |
| Custom | User-defined rules | Project codes, client names |
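
To make the keyword row concrete, here is a minimal TF-IDF sketch in pure Python. It is one possible reading of "TF-IDF" rather than the pipeline's actual extractor: the tokenizer, the smoothed IDF formula, and the function names are all assumptions, and no stop-word filtering is applied.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(doc: str, corpus: List[str], top_k: int = 3) -> List[str]:
    """Rank the terms of doc by term frequency times smoothed IDF."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        # Smoothed IDF so terms absent from the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Terms that appear in every corpus document, such as "the", score lowest, which is why they tend to fall out of the top results.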

Stage 6: Indexing

Purpose: Make documents searchable.

| Index Type | Purpose | Technology |
|------------|---------|------------|
| Full-text | Keyword search | Elasticsearch, Meilisearch |
| Vector | Semantic search | Qdrant, Weaviate, Pinecone |
| Metadata | Filtered queries | PostgreSQL |

Stage 7: Access

Purpose: Enable document retrieval.

Search Modes:

| Mode | Query Example |
|------|---------------|
| Keyword | invoice payment 2024 |
| Filter | type:invoice AND date:2024-01 |
| Semantic | "documents about contract renewal" |
| Combined | client:ACME type:contract "renewal terms" |
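
One way to support the combined mode is to split a query into its three kinds of parts before dispatching to the different indexes. The sketch below assumes that `key:value` tokens are metadata filters, quoted strings are phrases for semantic search, a bare `AND` is an implicit conjunction, and everything else is a keyword. The query grammar of the actual Search API is not specified here, so `ParsedQuery` and `parse_query` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# A token is either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


@dataclass
class ParsedQuery:
    filters: Dict[str, str] = field(default_factory=dict)
    phrases: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a search string into filters, quoted phrases, and keywords."""
    parsed = ParsedQuery()
    for phrase, word in _TOKEN.findall(query):
        if phrase:
            parsed.phrases.append(phrase)
        elif not word or word == "AND":
            continue  # empty quotes, or an explicit AND between filters
        else:
            key, sep, value = word.partition(":")
            if sep and key and value:
                parsed.filters[key] = value
            else:
                parsed.keywords.append(word)
    return parsed
```

For example, `parse_query('client:ACME type:contract "renewal terms"')` yields two filters and one phrase, which could then be sent to the metadata and vector indexes respectively. Quoted filter values such as `type:"legal filing"` are not handled by this sketch.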

User Personas

1. Document Administrator

  • Goal: Bulk upload and organize documents
  • Key Workflows: Batch upload, review duplicates, correct classifications

2. Knowledge Worker

  • Goal: Find specific documents quickly
  • Key Workflows: Search, filter by tags, view document details

3. Data Engineer

  • Goal: Integrate DMS with other systems
  • Key Workflows: API integration, webhook setup, export data

4. ML Engineer

  • Goal: Improve classification accuracy
  • Key Workflows: Review low-confidence items, retrain models

Document States

stateDiagram-v2
    [*] --> Uploaded: File received
    Uploaded --> Processing: Queue picked up
    Processing --> TextExtracted: OCR/Extraction done
    TextExtracted --> Deduped: Dedup check complete
    Deduped --> Classified: Type assigned
    Classified --> Tagged: Tags applied
    Tagged --> Indexed: Search index updated
    Indexed --> Ready
    Ready --> [*]

    Processing --> Failed: Error
    Failed --> [*]

    Deduped --> Duplicate: Exact match found
    Duplicate --> [*]

    Deduped --> Review: Near-duplicate
    Review --> Ready: Confirmed unique
    Review --> Merged: Merged with existing
    Merged --> [*]
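
The state diagram can be enforced in code with a transition table. The sketch below mirrors the transitions shown above. Treating `Ready` as a terminal state is an assumption, and the names `ALLOWED`, `TERMINAL`, and `transition` are illustrative.

```python
from typing import Dict, Set

# Allowed next states for each non-terminal state, taken from the diagram.
ALLOWED: Dict[str, Set[str]] = {
    "Uploaded": {"Processing"},
    "Processing": {"TextExtracted", "Failed"},
    "TextExtracted": {"Deduped"},
    "Deduped": {"Classified", "Duplicate", "Review"},
    "Classified": {"Tagged"},
    "Tagged": {"Indexed"},
    "Indexed": {"Ready"},
    "Review": {"Ready", "Merged"},
}

# States with no outgoing transitions (Ready is assumed to be terminal).
TERMINAL: Set[str] = {"Ready", "Failed", "Duplicate", "Merged"}


def transition(current: str, target: str) -> str:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rejecting illegal moves at this level keeps a document from, say, jumping from `Uploaded` straight to `Indexed` if a worker misbehaves.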

Processing Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Ingestion → Ready | <5 min | End-to-end latency |
| OCR throughput | 10 pages/sec | Pages processed/second |
| Dedup check | <1 sec | Hash + embedding lookup |
| Classification | <0.5 sec | Model inference time |
| Indexing | <2 sec | Search index update |
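
These targets can be checked automatically against measured values. The sketch below encodes the table as data. The metric keys and the function name are hypothetical. Latency targets are read as strict upper bounds, and OCR throughput as a minimum of 10 pages per second.

```python
from typing import Dict, List, Tuple

# (kind, limit): "max" means the value must stay strictly below the limit,
# "min" means the value must reach at least the limit.
TARGETS: Dict[str, Tuple[str, float]] = {
    "ingestion_to_ready_s": ("max", 300.0),  # <5 min
    "ocr_pages_per_s": ("min", 10.0),
    "dedup_check_s": ("max", 1.0),
    "classification_s": ("max", 0.5),
    "indexing_s": ("max", 2.0),
}


def missed_targets(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that miss their target.

    Raises KeyError for a metric name that has no target defined.
    """
    missed = []
    for name, value in measured.items():
        kind, limit = TARGETS[name]
        ok = value < limit if kind == "max" else value >= limit
        if not ok:
            missed.append(name)
    return missed
```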
