Document Workflow Overview
This document describes the end-to-end processing pipeline for documents flowing through Pebble DMS.
Workflow Diagram
flowchart TB
subgraph "1. Ingestion"
A[📄 Document Upload] --> B[Format Detection]
B --> C{Supported?}
C -->|Yes| D[Metadata Extraction]
C -->|No| E[Reject with Error]
end
subgraph "2. Text Extraction"
D --> F{File Type?}
F -->|PDF with text| G[Extract Text Layer]
F -->|Scanned/Image| H[OCR Processing]
G --> I[Raw Text]
H --> I
end
subgraph "3. Deduplication"
I --> J[Compute Hash]
J --> K{Exact Match?}
K -->|Yes| L[Link to Existing]
K -->|No| M[Compute Embedding]
M --> N{Near-Duplicate?}
N -->|Yes, >95%| O[Flag for Review]
N -->|No| P[New Document]
end
subgraph "4. Classification"
P --> Q[Classify Type]
O --> Q
Q --> R[Classify Category]
R --> S[Confidence Score]
S --> T{High Confidence?}
T -->|Yes, >80%| U[Auto-Apply]
T -->|No| V[Queue for Review]
end
subgraph "5. Tagging"
U --> W[NER Extraction]
V --> W
W --> X[Keyword Extraction]
X --> Y[Topic Detection]
Y --> Z[Apply Tags]
end
subgraph "6. Indexing"
Z --> AA[Store Document]
AA --> AB[Index Full Text]
AB --> AC[Store Embeddings]
AC --> AD[Update Metadata]
end
subgraph "7. Access"
AD --> AE[Search API]
AE --> AF[Web UI]
end
Workflow Stages
Stage 1: Ingestion
Purpose: Accept documents and prepare them for processing.
| Step | Description | Output |
|---|---|---|
| Upload | Accept file via API or UI | Raw file bytes |
| Format Detection | Identify PDF, PNG, JPG, TIFF, etc. | MIME type |
| Validation | Check that the file is not corrupt and is within size limits | Pass/Fail |
| Metadata Extraction | Filename, size, creation date | Initial metadata |
Supported Formats:
- PDF (native text, scanned, hybrid)
- Images: PNG, JPG, JPEG, TIFF, BMP, WEBP
- Future: DOCX, XLSX, PPTX
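The format detection and validation steps can be sketched with magic-byte checks from the standard library. This is a minimal illustration, not the Pebble DMS implementation: the function names `detect_mime` and `validate_upload` and the 50 MB default size limit are assumptions made for the example.

```python
from typing import Optional

# Leading-byte signatures for the supported formats listed above.
# Note: the BMP signature is only two bytes, so a production check
# would also validate the rest of the header.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"BM", "image/bmp"),
]


def detect_mime(data: bytes) -> Optional[str]:
    """Return the MIME type of a supported format, or None if unrecognized."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_upload(data: bytes, max_bytes: int = 50 * 1024 * 1024) -> dict:
    """Apply the size and format checks; max_bytes is a hypothetical limit."""
    if not data:
        return {"accepted": False, "mime": None, "reason": "empty file"}
    if len(data) > max_bytes:
        return {"accepted": False, "mime": None, "reason": "file too large"}
    mime = detect_mime(data)
    if mime is None:
        return {"accepted": False, "mime": None, "reason": "unsupported format"}
    return {"accepted": True, "mime": mime, "reason": None}
```

Checking content rather than the filename extension means a renamed file is still routed by what it actually contains.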
Stage 2: Text Extraction
Purpose: Extract searchable text from all documents.
| File Type | Method | Tool |
|---|---|---|
| PDF (text layer) | Direct extraction | PyMuPDF, pdfplumber |
| PDF (scanned) | OCR | Tesseract, docTR |
| Image | OCR | Tesseract, docTR |
Language Support:
- English (primary)
- Hindi, Marathi, Tamil (planned)
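For hybrid PDFs, one possible way to choose between the two methods is per page, based on how much text the text layer yields. The sketch below assumes the per-page character counts were already obtained from a text-layer extractor; the function name `plan_extraction` and the `min_chars` threshold of 20 are hypothetical.

```python
from typing import List


def plan_extraction(mime: str, chars_per_page: List[int],
                    min_chars: int = 20) -> List[str]:
    """Return 'text_layer' or 'ocr' for each page of a document.

    chars_per_page holds the number of characters the text layer
    produced for each page (assumed to be computed elsewhere).
    """
    if mime != "application/pdf":
        # Images carry no text layer, so the single page goes to OCR.
        return ["ocr"]
    return ["text_layer" if n >= min_chars else "ocr" for n in chars_per_page]
```

A three-page PDF whose middle page is a scan would be planned as `["text_layer", "ocr", "text_layer"]`.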
Stage 3: Deduplication
Purpose: Identify and handle duplicate documents.
flowchart LR
A[Document] --> B[MD5/SHA256 Hash]
B --> C{Exact Match?}
C -->|Yes| D[Duplicate - Skip]
C -->|No| E[Generate Embedding]
E --> F[Vector Similarity Search]
F --> G{>95% Similar?}
G -->|Yes| H[Near-Duplicate]
G -->|No| I[Unique Document]
Dedup Strategies:
| Strategy | Use Case | Tool |
|---|---|---|
| Hash-based | Exact file duplicates | MD5, SHA256 |
| Content-based | Same text, different file | Text hashing |
| Semantic | Similar meaning | Embeddings + cosine |
| Visual | Similar images | pHash, dHash |
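The hash-based, content-based, and semantic strategies can be sketched together using only the standard library. This is an illustration under stated assumptions: the in-memory dictionaries stand in for whatever hash store and vector index the system uses, embeddings are assumed to be computed elsewhere, and the 95% similarity threshold is interpreted as a cosine similarity of 0.95.

```python
import hashlib
import math
from typing import Dict, Optional, Sequence, Tuple


def sha256_hex(data: bytes) -> str:
    """Hash raw file bytes (exact file duplicates)."""
    return hashlib.sha256(data).hexdigest()


def text_hash(text: str) -> str:
    """Hash case- and whitespace-normalized text (same text, different file)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup_check(
    data: bytes,
    embedding: Sequence[float],
    known_hashes: Dict[str, str],
    known_embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.95,
) -> Tuple[str, Optional[str]]:
    """Return (verdict, matching document ID or None)."""
    digest = sha256_hex(data)
    if digest in known_hashes:
        return "exact_duplicate", known_hashes[digest]
    # Brute-force nearest neighbour; a vector database would replace this loop.
    best_id, best_sim = None, 0.0
    for doc_id, vec in known_embeddings.items():
        sim = cosine(embedding, vec)
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    if best_id is not None and best_sim > threshold:
        return "near_duplicate", best_id
    return "unique", None
```

Running the cheap hash lookup first avoids computing or searching embeddings for files that are byte-for-byte identical.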
Stage 4: Classification
Purpose: Automatically categorize documents.
Document Types (examples):
- Invoice
- Contract
- Report
- Letter
- Form
- Receipt
- ID Document
- Medical Record
- Legal Filing
Classification Model:
- Input: Extracted text (first 512 tokens)
- Model: Fine-tuned transformer or traditional ML
- Output: Type + Category + Confidence (0-100%)
Confidence Handling:
| Confidence | Action |
|---|---|
| >80% | Auto-apply classification |
| 50-80% | Apply but flag for review |
| <50% | Queue for manual classification |
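The confidence table above maps directly onto a small routing function. In this sketch the function name and the returned action labels are hypothetical; the boundaries follow the table, with exactly 80% and exactly 50% both falling into the middle band.

```python
def route_classification(confidence: float) -> str:
    """Map a confidence percentage (0-100) to a handling action."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence > 80.0:
        return "auto_apply"
    if confidence >= 50.0:
        return "apply_and_flag"
    return "manual_queue"
```

For example, `route_classification(92.0)` yields `"auto_apply"`, while `route_classification(80.0)` yields `"apply_and_flag"`.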
Stage 5: Tagging
Purpose: Extract and apply descriptive tags.
| Tag Type | Method | Examples |
|---|---|---|
| Entities (NER) | Named Entity Recognition | Person names, Org names, Dates |
| Keywords | TF-IDF, RAKE, YAKE | "invoice", "payment", "contract" |
| Topics | Topic modeling | Finance, Legal, HR, Operations |
| Custom | User-defined rules | Project codes, client names |
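As an illustration of the keyword row, the following is a minimal TF-IDF ranking written with the standard library only. It is a sketch rather than the production tagger: the tokenizer is a simple regular expression, no stop words are removed, and the smoothed IDF formula is one common variant chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(doc: str, corpus: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k terms of doc ranked by TF-IDF against corpus.

    corpus is expected to include doc itself.
    """
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for s in doc_sets if term in s)
        # Smoothed inverse document frequency, always positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]
```

Terms that are frequent in one document but rare across the corpus rank highest, which is why a word such as "invoice" surfaces for an invoice while a word shared by many documents scores lower.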
Stage 6: Indexing
Purpose: Make documents searchable.
| Index Type | Purpose | Technology |
|---|---|---|
| Full-text | Keyword search | Elasticsearch, Meilisearch |
| Vector | Semantic search | Qdrant, Weaviate, Pinecone |
| Metadata | Filtered queries | PostgreSQL |
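To make the full-text row concrete, here is a toy in-memory inverted index, the data structure that engines such as Elasticsearch and Meilisearch build on. It is only a sketch of the idea: the class name is hypothetical, queries use AND semantics across terms, and there is no ranking, stemming, or persistence.

```python
import re
from collections import defaultdict
from typing import Dict, Set


class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return IDs of documents containing every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        # Copy the first posting list so intersections do not mutate the index.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A vector index plays the same role for semantic search, with nearest-neighbour lookup over embeddings in place of term lookup.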
Stage 7: Access
Purpose: Enable document retrieval.
Search Modes:
| Mode | Query Example |
|---|---|
| Keyword | invoice payment 2024 |
| Filter | type:invoice AND date:2024-01 |
| Semantic | "documents about contract renewal" |
| Combined | client:ACME type:contract "renewal terms" |
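One way to support the combined mode is to split a query into field filters, quoted phrases, and free keywords before dispatching each part to the matching index. The parser below is a sketch of that idea, not the Search API's actual grammar: the `ParsedQuery` type is hypothetical, an uppercase `AND` is treated as an implicit operator and dropped, and quoted values inside a filter (such as `title:"a b"`) are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Either a double-quoted phrase or a run of non-whitespace characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


@dataclass
class ParsedQuery:
    filters: Dict[str, str] = field(default_factory=dict)
    phrases: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    parsed = ParsedQuery()
    for match in _TOKEN.finditer(query):
        phrase, word = match.group(1), match.group(2)
        if phrase is not None:
            if phrase:  # ignore empty quotes
                parsed.phrases.append(phrase)
        elif word == "AND":
            continue  # all parts are combined with AND anyway
        elif ":" in word and not word.startswith(":") and not word.endswith(":"):
            key, value = word.split(":", 1)
            parsed.filters[key] = value
        else:
            parsed.keywords.append(word)
    return parsed
```

With this split, filters could go to the metadata store, phrases to the vector or full-text index, and keywords to the full-text index.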
User Personas
1. Document Administrator
- Goal: Bulk upload and organize documents
- Key Workflows: Batch upload, review duplicates, correct classifications
2. Knowledge Worker
- Goal: Find specific documents quickly
- Key Workflows: Search, filter by tags, view document details
3. Data Engineer
- Goal: Integrate DMS with other systems
- Key Workflows: API integration, webhook setup, export data
4. ML Engineer
- Goal: Improve classification accuracy
- Key Workflows: Review low-confidence items, retrain models
Document States
stateDiagram-v2
[*] --> Uploaded: File received
Uploaded --> Processing: Queue picked up
Processing --> TextExtracted: OCR/Extraction done
TextExtracted --> Deduped: Dedup check passed
Deduped --> Classified: Type assigned
Classified --> Tagged: Tags applied
Tagged --> Indexed: Search index updated
Indexed --> Ready
Ready --> [*]
Processing --> Failed: Error
Failed --> [*]
Deduped --> Duplicate: Exact match found
Duplicate --> [*]
Deduped --> Review: Near-duplicate
Review --> Ready: Confirmed unique
Review --> Merged: Merged with existing
Merged --> [*]
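The state diagram above can be enforced in code with a transition table. This is a minimal sketch: the state names are taken from the diagram, while the `TRANSITIONS` mapping and the `advance` helper are hypothetical names introduced for the example.

```python
from typing import Dict, Set

# Allowed next states for each document state, following the diagram.
TRANSITIONS: Dict[str, Set[str]] = {
    "Uploaded": {"Processing"},
    "Processing": {"TextExtracted", "Failed"},
    "TextExtracted": {"Deduped"},
    "Deduped": {"Classified", "Duplicate", "Review"},
    "Classified": {"Tagged"},
    "Tagged": {"Indexed"},
    "Indexed": {"Ready"},
    "Review": {"Ready", "Merged"},
    # States with no further transitions.
    "Ready": set(),
    "Failed": set(),
    "Duplicate": set(),
    "Merged": set(),
}


def advance(current: str, target: str) -> str:
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rejecting illegal transitions at a single choke point keeps a document from, for instance, jumping from `Uploaded` straight to `Ready`.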
Processing Metrics
| Metric | Target | Measurement |
|---|---|---|
| Ingestion → Ready | <5 min | End-to-end latency |
| OCR throughput | 10 pages/sec | Pages processed/second |
| Dedup check | <1 sec | Hash + embedding lookup |
| Classification | <0.5 sec | Model inference time |
| Indexing | <2 sec | Search index update |
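A monitoring job could compare measured values against these targets along the following lines. The metric key names and the `find_breaches` helper are assumptions for illustration; note that OCR throughput is a minimum, whereas the other targets are upper bounds.

```python
from typing import Dict, List, Tuple

# (limit, kind): "max" means the value must stay below the limit,
# "min" means the value must reach at least the limit.
TARGETS: Dict[str, Tuple[float, str]] = {
    "ingestion_to_ready_sec": (300.0, "max"),  # <5 min
    "ocr_pages_per_sec": (10.0, "min"),        # 10 pages/sec
    "dedup_check_sec": (1.0, "max"),           # <1 sec
    "classification_sec": (0.5, "max"),        # <0.5 sec
    "indexing_sec": (2.0, "max"),              # <2 sec
}


def find_breaches(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that miss their target."""
    breaches = []
    for name, (limit, kind) in TARGETS.items():
        if name not in measured:
            continue  # metric not reported in this sample
        value = measured[name]
        ok = value < limit if kind == "max" else value >= limit
        if not ok:
            breaches.append(name)
    return breaches
```

Metrics absent from a sample are skipped rather than treated as failures, so partial reports can still be checked.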