
AI & ML Overview

This document describes the AI and machine learning components used in Pebble DMS for document processing, classification, and tagging.


AI Capabilities Summary

| Capability | Model Type | Purpose |
|---|---|---|
| OCR | Deep Learning | Text extraction from images |
| Classification | Transformer/ML | Document type detection |
| Embeddings | Sentence Transformer | Semantic similarity |
| NER | Sequence Labeling | Entity extraction |
| Keywords | Statistical/ML | Topic identification |

1. OCR (Optical Character Recognition)

Purpose

Extract searchable text from scanned PDFs and images.

Engine Options

| Engine | Type | Best For | Languages |
|---|---|---|---|
| Tesseract 5 | Traditional + LSTM | General, multi-language | 100+ |
| docTR | Deep Learning | High-accuracy English | 5 |
| EasyOCR | Deep Learning | Multi-script | 80+ |
| PaddleOCR | Deep Learning | Chinese + others | 80+ |
| Azure Form Recognizer | Cloud API | Forms, tables | 50+ |

Architecture

flowchart LR
    A[Image/PDF] --> B[Preprocessing]
    B --> C[Deskew]
    C --> D[Denoise]
    D --> E[Binarize]
    E --> F[OCR Engine]
    F --> G[Post-processing]
    G --> H[Text + Confidence]

Preprocessing Pipeline

| Step | Purpose | Tool |
|---|---|---|
| Deskew | Straighten rotated text | OpenCV |
| Denoise | Remove noise | OpenCV bilateral filter |
| Binarize | Convert to black and white | Otsu's threshold |
| DPI normalize | Consistent resolution | PIL/Pillow |
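
The steps above can be chained with OpenCV and Pillow roughly as follows. This is a minimal sketch, not the shipped pipeline: the function name, the 300 DPI target, and the source-DPI handling are illustrative assumptions, and deskew angle estimation is only indicated in a comment.

```python
import cv2
import numpy as np
from PIL import Image

TARGET_DPI = 300  # assumed normalization target; adjust to the scanner profile

def preprocess_for_ocr(path: str, source_dpi: int = 200) -> np.ndarray:
    """Denoise and binarize a scanned page before it reaches the OCR engine."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # DPI normalize: rescale so downstream engines see a consistent resolution
    if source_dpi != TARGET_DPI:
        scale = TARGET_DPI / source_dpi
        pil_img = Image.fromarray(gray)
        pil_img = pil_img.resize(
            (int(pil_img.width * scale), int(pil_img.height * scale)),
            Image.LANCZOS,
        )
        gray = np.array(pil_img)

    # Denoise while preserving edges (character strokes)
    denoised = cv2.bilateralFilter(gray, 9, 75, 75)

    # Binarize with Otsu's global threshold
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskew would go here: estimate the page angle (e.g. via cv2.minAreaRect
    # over the text pixels) and rotate with cv2.warpAffine before returning.
    return binary
```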

Quality Metrics

| Metric | Description | Target |
|---|---|---|
| Character accuracy | % correct characters | >95% |
| Word accuracy | % correct words | >90% |
| Confidence score | Engine confidence | >0.8 |
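
With Tesseract (the MVP engine), per-word confidence can be read back through pytesseract and checked against the targets above. A hedged sketch; the filtering and normalization choices are illustrative:

```python
import pytesseract
from pytesseract import Output

def ocr_with_confidence(image, lang: str = "eng"):
    """Run Tesseract and return (text, per-word confidences, mean confidence)."""
    data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)            # Tesseract reports -1 for non-text blocks
        if text.strip() and conf >= 0:
            words.append(text)
            confidences.append(conf / 100.0)  # normalize to the 0-1 scale above
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    # Pages with mean_conf below the 0.8 target can be flagged for re-OCR or review
    return " ".join(words), confidences, mean_conf
```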

2. Document Classification

Purpose

Automatically categorize documents by type and content.

Classification Hierarchy

graph TD
    A[Document] --> B{Type}
    B --> C[Invoice]
    B --> D[Contract]
    B --> E[Report]
    B --> F[Letter]
    B --> G[Form]

    C --> H{Category}
    H --> I[Finance]
    H --> J[Procurement]

    D --> K{Category}
    K --> L[Legal]
    K --> M[HR]
    K --> N[Vendor]

Model Options

| Model | Type | Accuracy | Speed |
|---|---|---|---|
| DistilBERT | Transformer | 92% | Fast |
| RoBERTa | Transformer | 95% | Medium |
| XGBoost + TF-IDF | Traditional ML | 88% | Very Fast |
| fastText | Shallow NN | 85% | Very Fast |

Training Pipeline

flowchart LR
    A[Labeled Docs] --> B[Text Extraction]
    B --> C[Preprocessing]
    C --> D[Train/Val Split]
    D --> E[Model Training]
    E --> F[Evaluation]
    F --> G{Accuracy OK?}
    G -->|Yes| H[Deploy]
    G -->|No| I[Tune/More Data]
    I --> E
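
For the traditional-ML option above (XGBoost + TF-IDF), the split/train/evaluate loop in the flowchart reduces to a few lines of scikit-learn and xgboost. A sketch assuming labeled documents are available as plain-text strings with integer class labels; hyperparameter values are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def train_classifier(texts: list[str], labels: list[int]):
    """Fit a TF-IDF + XGBoost document classifier and report validation accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42
    )

    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)

    model = XGBClassifier(n_estimators=300, max_depth=6)
    model.fit(X_train_vec, y_train)

    accuracy = accuracy_score(y_val, model.predict(X_val_vec))
    # If accuracy misses the target, tune hyperparameters or gather more data
    return vectorizer, model, accuracy
```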

Confidence Handling

| Score | Action |
|---|---|
| >0.90 | Auto-apply, high confidence |
| 0.70-0.90 | Auto-apply, medium confidence |
| 0.50-0.70 | Apply but flag for review |
| <0.50 | Queue for manual classification |
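
At inference time, the predicted label is routed according to the score bands above. A sketch using the Hugging Face transformers pipeline with a fine-tuned DistilBERT checkpoint; the checkpoint path, the input truncation, and the action names are assumptions:

```python
from transformers import pipeline

# "models/distilbert-doc-classifier" is a placeholder for the fine-tuned checkpoint
classifier = pipeline("text-classification", model="models/distilbert-doc-classifier")

def classify_document(text: str) -> dict:
    """Predict a document type and decide how to apply it based on confidence."""
    result = classifier(text[:2000], truncation=True)[0]  # {'label': ..., 'score': ...}
    score = result["score"]

    if score > 0.90:
        action = "auto_apply_high"
    elif score > 0.70:
        action = "auto_apply_medium"
    elif score > 0.50:
        action = "apply_and_flag"
    else:
        action = "manual_queue"

    return {"label": result["label"], "score": score, "action": action}
```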

3. Embedding Models

Purpose

Generate vector representations for semantic search and near-duplicate detection.

Model Comparison

| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| e5-base-v2 | 768 | Medium | Best |
| multilingual-e5-base | 768 | Medium | Multi-lang |

Embedding Strategy

flowchart TB
    A[Document Text] --> B{Length?}
    B -->|<512 tokens| C[Single Embedding]
    B -->|>512 tokens| D[Chunk into 512]
    D --> E[Embed Each Chunk]
    E --> F[Mean Pooling]
    F --> G[Document Embedding]
    C --> G
    G --> H[Store in Qdrant]
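
A sketch of the chunk-and-pool strategy with sentence-transformers. Splitting on words rather than exact tokens, and the chunk size of 350 words, are simplifications of the 512-token limit shown in the flowchart:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_document(text: str, chunk_words: int = 350) -> np.ndarray:
    """Embed a document, chunking long texts and mean-pooling the chunk vectors."""
    words = text.split()
    if len(words) <= chunk_words:
        return model.encode(text, normalize_embeddings=True)

    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    chunk_vectors = model.encode(chunks, normalize_embeddings=True)

    # Mean-pool chunk embeddings and re-normalize to unit length
    doc_vector = chunk_vectors.mean(axis=0)
    return doc_vector / np.linalg.norm(doc_vector)
```

The resulting vector is what gets upserted into Qdrant alongside the document ID.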

Use Cases

| Use Case | Similarity Threshold |
|---|---|
| Exact content duplicate | >0.99 |
| Near-duplicate | >0.95 |
| Similar documents | >0.85 |
| Related documents | >0.70 |
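
Because the document vectors are unit-normalized, cosine similarity is a plain dot product and the thresholds translate directly into a small relationship check. The band names below are illustrative, not fixed API values:

```python
import numpy as np

def relate(vec_a: np.ndarray, vec_b: np.ndarray) -> str:
    """Map cosine similarity between two document vectors to a relationship band."""
    similarity = float(np.dot(vec_a, vec_b))  # assumes unit-normalized vectors
    if similarity > 0.99:
        return "exact_duplicate"
    if similarity > 0.95:
        return "near_duplicate"
    if similarity > 0.85:
        return "similar"
    if similarity > 0.70:
        return "related"
    return "unrelated"
```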

4. Named Entity Recognition (NER)

Purpose

Extract structured entities from unstructured text.

Entity Types

| Entity | Examples | Use |
|---|---|---|
| PERSON | "John Smith" | Contact tagging |
| ORG | "Acme Corp" | Company tagging |
| DATE | "January 15, 2024" | Date filtering |
| MONEY | "$5,000" | Financial tagging |
| LOCATION | "New York, NY" | Geographic tagging |
| EMAIL | "john@acme.com" | Contact extraction |
| PHONE | "+1-555-1234" | Contact extraction |

Model Options

| Model | Type | Entities | Speed |
|---|---|---|---|
| spaCy en_core_web_lg | Statistical | 18 types | Fast |
| spaCy transformers | BERT-based | 18 types | Medium |
| Flair | BiLSTM-CRF | Custom | Medium |
| GLiNER | Zero-shot | Any | Slow |

Pipeline

flowchart LR
    A[Text] --> B[Tokenize]
    B --> C[NER Model]
    C --> D[Entity Spans]
    D --> E[Dedup & Clean]
    E --> F[Entity List]
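
A sketch of this pipeline with the spaCy en_core_web_lg model. spaCy's pretrained labels cover PERSON, ORG, DATE, MONEY, and GPE/LOC out of the box; EMAIL and PHONE would typically come from regex patterns or a custom component, which is omitted here:

```python
import spacy

nlp = spacy.load("en_core_web_lg")

WANTED = {"PERSON", "ORG", "DATE", "MONEY", "GPE", "LOC"}

def extract_entities(text: str) -> list[dict]:
    """Run NER, then deduplicate and clean the extracted spans."""
    doc = nlp(text)
    seen = set()
    entities = []
    for ent in doc.ents:
        if ent.label_ not in WANTED:
            continue
        key = (ent.label_, ent.text.strip().lower())
        if key in seen:  # drop duplicate mentions of the same entity
            continue
        seen.add(key)
        entities.append({"text": ent.text.strip(), "label": ent.label_})
    return entities
```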

5. Keyword Extraction

Purpose

Identify important terms and topics from document content.

Methods

| Method | Type | Best For |
|---|---|---|
| TF-IDF | Statistical | Corpus-level importance |
| YAKE | Statistical | Unsupervised, fast |
| RAKE | Statistical | Phrase extraction |
| KeyBERT | Neural | Semantic keywords |

KeyBERT Example

from keybert import KeyBERT

doc = "..."  # extracted document text (placeholder)

kw_model = KeyBERT()  # uses a sentence-transformers model under the hood
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words='english',
    top_n=10
)
# Returns e.g.: [('machine learning', 0.72), ('data processing', 0.68), ...]

6. Topic Modeling (Future)

Purpose

Discover latent topics across the document collection.

Methods

| Method | Type | Best For |
|---|---|---|
| LDA | Probabilistic | Traditional topics |
| BERTopic | Neural | Semantic topics |
| Top2Vec | Neural | Auto-clustering |
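
If BERTopic is adopted, it can reuse the same sentence-transformer model already configured for embeddings. A hedged sketch of what that might look like; this feature is not yet implemented and the parameter values are placeholders:

```python
from bertopic import BERTopic

def discover_topics(documents: list[str]):
    """Cluster documents into latent topics, reusing the MVP embedding model."""
    topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", min_topic_size=10)
    topics, _ = topic_model.fit_transform(documents)
    # get_topic_info() returns one row per topic with its size and top keywords
    return topic_model.get_topic_info(), topics
```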

Model Training & Evaluation

Training Data Requirements

| Task | Minimum Samples | Recommended |
|---|---|---|
| Classification (per class) | 50 | 200+ |
| NER (per entity type) | 100 | 500+ |
| Custom embeddings | 10,000 | 100,000+ |

Evaluation Metrics

| Task | Metrics |
|---|---|
| Classification | Accuracy, F1, Precision, Recall |
| NER | Entity-level F1 |
| Embeddings | Retrieval MRR, Recall@k |
| OCR | Character Error Rate (CER) |
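
For the classification row, scikit-learn covers everything in the table; a sketch of an evaluation helper is below. Entity-level F1 for NER and CER for OCR need their own tooling (e.g. seqeval and an edit-distance metric) and are not shown:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_classifier(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute the classification metrics tracked for each retraining run."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```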

Continuous Improvement

flowchart LR
    A[Production] --> B[Log Predictions]
    B --> C[Human Review]
    C --> D[Corrections]
    D --> E[Training Data]
    E --> F[Retrain Model]
    F --> G[Evaluate]
    G --> H{Better?}
    H -->|Yes| I[Deploy]
    H -->|No| J[Iterate]
    I --> A

MVP Configuration

ocr:
  engine: tesseract
  version: 5.x
  languages: [eng]

classification:
  model: distilbert-base-uncased
  framework: transformers

embeddings:
  model: all-MiniLM-L6-v2
  framework: sentence-transformers

ner:
  model: en_core_web_lg
  framework: spacy

keywords:
  method: keybert
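
A sketch of how a worker might read this configuration at startup. The config/ai.yaml path and the helper name are hypothetical; only the local models are instantiated here:

```python
import yaml
import spacy
from sentence_transformers import SentenceTransformer

def load_ai_components(path: str = "config/ai.yaml") -> dict:
    """Load the MVP configuration and instantiate the local models it names."""
    with open(path) as fh:
        config = yaml.safe_load(fh)

    return {
        "embedder": SentenceTransformer(config["embeddings"]["model"]),
        "ner": spacy.load(config["ner"]["model"]),
        "ocr_languages": config["ocr"]["languages"],
        "classifier_model": config["classification"]["model"],
    }
```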

Hardware Requirements

| Component | CPU | GPU | RAM |
|---|---|---|---|
| OCR Worker | 4 cores | Optional | 8GB |
| Classification | 2 cores | Recommended | 4GB |
| Embedding | 2 cores | Recommended | 4GB |
| NER | 2 cores | Optional | 4GB |
