
AI & ML Overview

This document describes the AI and machine learning components used in Pebble DMS for document processing, classification, and tagging.


AI Capabilities Summary

| Capability | Model Type | Purpose |
|---|---|---|
| OCR | Deep Learning | Text extraction from images |
| Classification | Transformer/ML | Document type detection |
| Embeddings | Sentence Transformer | Semantic similarity |
| NER | Sequence Labeling | Entity extraction |
| Keywords | Statistical/ML | Topic identification |

1. OCR (Optical Character Recognition)

Purpose

Extract searchable text from scanned PDFs and images.

Engine Options

| Engine | Type | Best For | Languages |
|---|---|---|---|
| Tesseract 5 | Traditional + LSTM | General, multi-language | 100+ |
| docTR | Deep Learning | High-accuracy English | 5 |
| EasyOCR | Deep Learning | Multi-script | 80+ |
| PaddleOCR | Deep Learning | Chinese + others | 80+ |
| Azure Form Recognizer | Cloud API | Forms, tables | 50+ |

Architecture

flowchart LR
    A[Image/PDF] --> B[Preprocessing]
    B --> C[Deskew]
    C --> D[Denoise]
    D --> E[Binarize]
    E --> F[OCR Engine]
    F --> G[Post-processing]
    G --> H[Text + Confidence]

Preprocessing Pipeline

| Step | Purpose | Tool |
|---|---|---|
| Deskew | Straighten rotated text | OpenCV |
| Denoise | Remove noise | OpenCV bilateral filter |
| Binarize | Convert to black and white | Otsu's threshold |
| DPI normalize | Consistent resolution | PIL/Pillow |
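
The steps above can be chained with OpenCV and Pillow roughly as follows. This is a minimal sketch, not the shipped pipeline: the function name, the 300 DPI target, and the source-DPI handling are illustrative assumptions, and deskew angle estimation is only indicated in a comment.

```python
import cv2
import numpy as np
from PIL import Image

TARGET_DPI = 300  # assumed normalization target; adjust to the scanner profile

def preprocess_for_ocr(path: str, source_dpi: int = 200) -> np.ndarray:
    """Denoise and binarize a scanned page before it reaches the OCR engine."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # DPI normalize: rescale so downstream engines see a consistent resolution
    if source_dpi != TARGET_DPI:
        scale = TARGET_DPI / source_dpi
        pil_img = Image.fromarray(gray)
        pil_img = pil_img.resize(
            (int(pil_img.width * scale), int(pil_img.height * scale)),
            Image.LANCZOS,
        )
        gray = np.array(pil_img)

    # Denoise while preserving edges (character strokes)
    denoised = cv2.bilateralFilter(gray, 9, 75, 75)

    # Binarize with Otsu's global threshold
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskew would go here: estimate the page angle (e.g. via cv2.minAreaRect
    # over the text pixels) and rotate with cv2.warpAffine before returning.
    return binary
```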

Quality Metrics

| Metric | Description | Target |
|---|---|---|
| Character accuracy | % correct characters | >95% |
| Word accuracy | % correct words | >90% |
| Confidence score | Engine confidence | >0.8 |
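
With Tesseract (the MVP engine), per-word confidence can be read back through pytesseract and checked against the targets above. A hedged sketch; the filtering and normalization choices are illustrative:

```python
import pytesseract
from pytesseract import Output

def ocr_with_confidence(image, lang: str = "eng"):
    """Run Tesseract and return (text, per-word confidences, mean confidence)."""
    data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)            # Tesseract reports -1 for non-text blocks
        if text.strip() and conf >= 0:
            words.append(text)
            confidences.append(conf / 100.0)  # normalize to the 0-1 scale above
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    # Pages with mean_conf below the 0.8 target can be flagged for re-OCR or review
    return " ".join(words), confidences, mean_conf
```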

2. Document Classification

Purpose

Automatically categorize documents by type and content.

Classification Hierarchy

graph TD
    A[Document] --> B{Type}
    B --> C[Invoice]
    B --> D[Contract]
    B --> E[Report]
    B --> F[Letter]
    B --> G[Form]

    C --> H{Category}
    H --> I[Finance]
    H --> J[Procurement]

    D --> K{Category}
    K --> L[Legal]
    K --> M[HR]
    K --> N[Vendor]

Model Options

| Model | Type | Accuracy | Speed |
|---|---|---|---|
| DistilBERT | Transformer | 92% | Fast |
| RoBERTa | Transformer | 95% | Medium |
| XGBoost + TF-IDF | Traditional ML | 88% | Very Fast |
| fastText | Shallow NN | 85% | Very Fast |

Training Pipeline

flowchart LR
    A[Labeled Docs] --> B[Text Extraction]
    B --> C[Preprocessing]
    C --> D[Train/Val Split]
    D --> E[Model Training]
    E --> F[Evaluation]
    F --> G{Accuracy OK?}
    G -->|Yes| H[Deploy]
    G -->|No| I[Tune/More Data]
    I --> E
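
For the traditional-ML option above (XGBoost + TF-IDF), the split/train/evaluate loop in the flowchart reduces to a few lines of scikit-learn and xgboost. A sketch assuming labeled documents are available as plain-text strings with integer class labels; hyperparameter values are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def train_classifier(texts: list[str], labels: list[int]):
    """Fit a TF-IDF + XGBoost document classifier and report validation accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42
    )

    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)

    model = XGBClassifier(n_estimators=300, max_depth=6)
    model.fit(X_train_vec, y_train)

    accuracy = accuracy_score(y_val, model.predict(X_val_vec))
    # If accuracy misses the target, tune hyperparameters or gather more data
    return vectorizer, model, accuracy
```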

Confidence Handling

| Score | Action |
|---|---|
| >0.90 | Auto-apply, high confidence |
| 0.70-0.90 | Auto-apply, medium confidence |
| 0.50-0.70 | Apply but flag for review |
| <0.50 | Queue for manual classification |
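
At inference time, the predicted label is routed according to the score bands above. A sketch using the Hugging Face transformers pipeline with a fine-tuned DistilBERT checkpoint; the checkpoint path, the input truncation, and the action names are assumptions:

```python
from transformers import pipeline

# "models/distilbert-doc-classifier" is a placeholder for the fine-tuned checkpoint
classifier = pipeline("text-classification", model="models/distilbert-doc-classifier")

def classify_document(text: str) -> dict:
    """Predict a document type and decide how to apply it based on confidence."""
    result = classifier(text[:2000], truncation=True)[0]  # {'label': ..., 'score': ...}
    score = result["score"]

    if score > 0.90:
        action = "auto_apply_high"
    elif score > 0.70:
        action = "auto_apply_medium"
    elif score > 0.50:
        action = "apply_and_flag"
    else:
        action = "manual_queue"

    return {"label": result["label"], "score": score, "action": action}
```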

3. Embedding Models

Purpose

Generate vector representations for semantic search and near-duplicate detection.

Model Comparison

| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| e5-base-v2 | 768 | Medium | Best |
| multilingual-e5-base | 768 | Medium | Multi-lang |

Embedding Strategy

flowchart TB
    A[Document Text] --> B{Length?}
    B -->|<512 tokens| C[Single Embedding]
    B -->|>512 tokens| D[Chunk into 512]
    D --> E[Embed Each Chunk]
    E --> F[Mean Pooling]
    F --> G[Document Embedding]
    C --> G
    G --> H[Store in Qdrant]
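
A sketch of the chunk-and-pool strategy with sentence-transformers. Splitting on words rather than exact tokens, and the chunk size of 350 words, are simplifications of the 512-token limit shown in the flowchart:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_document(text: str, chunk_words: int = 350) -> np.ndarray:
    """Embed a document, chunking long texts and mean-pooling the chunk vectors."""
    words = text.split()
    if len(words) <= chunk_words:
        return model.encode(text, normalize_embeddings=True)

    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    chunk_vectors = model.encode(chunks, normalize_embeddings=True)

    # Mean-pool chunk embeddings and re-normalize to unit length
    doc_vector = chunk_vectors.mean(axis=0)
    return doc_vector / np.linalg.norm(doc_vector)
```

The resulting vector is what gets upserted into Qdrant alongside the document ID.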

Use Cases

| Use Case | Similarity Threshold |
|---|---|
| Exact content duplicate | >0.99 |
| Near-duplicate | >0.95 |
| Similar documents | >0.85 |
| Related documents | >0.70 |
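
Because the document vectors are unit-normalized, cosine similarity is a plain dot product and the thresholds translate directly into a small relationship check. The band names below are illustrative, not fixed API values:

```python
import numpy as np

def relate(vec_a: np.ndarray, vec_b: np.ndarray) -> str:
    """Map cosine similarity between two document vectors to a relationship band."""
    similarity = float(np.dot(vec_a, vec_b))  # assumes unit-normalized vectors
    if similarity > 0.99:
        return "exact_duplicate"
    if similarity > 0.95:
        return "near_duplicate"
    if similarity > 0.85:
        return "similar"
    if similarity > 0.70:
        return "related"
    return "unrelated"
```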

4. Named Entity Recognition (NER)

Purpose

Extract structured entities from unstructured text.

Entity Types

| Entity | Examples | Use |
|---|---|---|
| PERSON | "John Smith" | Contact tagging |
| ORG | "Acme Corp" | Company tagging |
| DATE | "January 15, 2024" | Date filtering |
| MONEY | "$5,000" | Financial tagging |
| LOCATION | "New York, NY" | Geographic tagging |
| EMAIL | "john@acme.com" | Contact extraction |
| PHONE | "+1-555-1234" | Contact extraction |

Model Options

| Model | Type | Entities | Speed |
|---|---|---|---|
| spaCy en_core_web_lg | Statistical | 18 types | Fast |
| spaCy transformers | BERT-based | 18 types | Medium |
| Flair | BiLSTM-CRF | Custom | Medium |
| GLiNER | Zero-shot | Any | Slow |

Pipeline

flowchart LR
    A[Text] --> B[Tokenize]
    B --> C[NER Model]
    C --> D[Entity Spans]
    D --> E[Dedup & Clean]
    E --> F[Entity List]
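
A sketch of this pipeline with the spaCy en_core_web_lg model. spaCy's pretrained labels cover PERSON, ORG, DATE, MONEY, and GPE/LOC out of the box; EMAIL and PHONE would typically come from regex patterns or a custom component, which is omitted here:

```python
import spacy

nlp = spacy.load("en_core_web_lg")

WANTED = {"PERSON", "ORG", "DATE", "MONEY", "GPE", "LOC"}

def extract_entities(text: str) -> list[dict]:
    """Run NER, then deduplicate and clean the extracted spans."""
    doc = nlp(text)
    seen = set()
    entities = []
    for ent in doc.ents:
        if ent.label_ not in WANTED:
            continue
        key = (ent.label_, ent.text.strip().lower())
        if key in seen:  # drop duplicate mentions of the same entity
            continue
        seen.add(key)
        entities.append({"text": ent.text.strip(), "label": ent.label_})
    return entities
```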

5. Keyword Extraction

Purpose

Identify important terms and topics from document content.

Methods

| Method | Type | Best For |
|---|---|---|
| TF-IDF | Statistical | Corpus-level importance |
| YAKE | Statistical | Unsupervised, fast |
| RAKE | Statistical | Phrase extraction |
| KeyBERT | Neural | Semantic keywords |

KeyBERT Example

from keybert import KeyBERT

doc = "..."  # extracted document text (placeholder)

kw_model = KeyBERT()  # uses a sentence-transformers model under the hood
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words='english',
    top_n=10
)
# Returns e.g.: [('machine learning', 0.72), ('data processing', 0.68), ...]

6. Topic Modeling (Future)

Purpose

Discover latent topics across the document collection.

Methods

| Method | Type | Best For |
|---|---|---|
| LDA | Probabilistic | Traditional topics |
| BERTopic | Neural | Semantic topics |
| Top2Vec | Neural | Auto-clustering |
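
If BERTopic is adopted, it can reuse the same sentence-transformer model already configured for embeddings. A hedged sketch of what that might look like; this feature is not yet implemented and the parameter values are placeholders:

```python
from bertopic import BERTopic

def discover_topics(documents: list[str]):
    """Cluster documents into latent topics, reusing the MVP embedding model."""
    topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", min_topic_size=10)
    topics, _ = topic_model.fit_transform(documents)
    # get_topic_info() returns one row per topic with its size and top keywords
    return topic_model.get_topic_info(), topics
```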

Model Training & Evaluation

Training Data Requirements

| Task | Minimum Samples | Recommended |
|---|---|---|
| Classification (per class) | 50 | 200+ |
| NER (per entity type) | 100 | 500+ |
| Custom embeddings | 10,000 | 100,000+ |

Evaluation Metrics

| Task | Metrics |
|---|---|
| Classification | Accuracy, F1, Precision, Recall |
| NER | Entity-level F1 |
| Embeddings | Retrieval MRR, Recall@k |
| OCR | Character Error Rate (CER) |
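
For the classification row, scikit-learn covers everything in the table; a sketch of an evaluation helper is below. Entity-level F1 for NER and CER for OCR need their own tooling (e.g. seqeval and an edit-distance metric) and are not shown:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_classifier(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute the classification metrics tracked for each retraining run."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```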

Continuous Improvement

flowchart LR
    A[Production] --> B[Log Predictions]
    B --> C[Human Review]
    C --> D[Corrections]
    D --> E[Training Data]
    E --> F[Retrain Model]
    F --> G[Evaluate]
    G --> H{Better?}
    H -->|Yes| I[Deploy]
    H -->|No| J[Iterate]
    I --> A

MVP Configuration

ocr:
  engine: tesseract
  version: 5.x
  languages: [eng]

classification:
  model: distilbert-base-uncased
  framework: transformers

embeddings:
  model: all-MiniLM-L6-v2
  framework: sentence-transformers

ner:
  model: en_core_web_lg
  framework: spacy

keywords:
  method: keybert
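
A sketch of how a worker might read this configuration at startup. The config/ai.yaml path and the helper name are hypothetical; only the local models are instantiated here:

```python
import yaml
import spacy
from sentence_transformers import SentenceTransformer

def load_ai_components(path: str = "config/ai.yaml") -> dict:
    """Load the MVP configuration and instantiate the local models it names."""
    with open(path) as fh:
        config = yaml.safe_load(fh)

    return {
        "embedder": SentenceTransformer(config["embeddings"]["model"]),
        "ner": spacy.load(config["ner"]["model"]),
        "ocr_languages": config["ocr"]["languages"],
        "classifier_model": config["classification"]["model"],
    }
```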

Hardware Requirements

| Component | CPU | GPU | RAM |
|---|---|---|---|
| OCR Worker | 4 cores | Optional | 8GB |
| Classification | 2 cores | Recommended | 4GB |
| Embedding | 2 cores | Recommended | 4GB |
| NER | 2 cores | Optional | 4GB |
