# AI & ML Overview
This document describes the AI and machine learning components used in Pebble DMS for document processing, classification, and tagging.
## AI Capabilities Summary
| Capability | Model Type | Purpose |
|------------|------------|---------|
| OCR | Deep Learning | Text extraction from images |
| Classification | Transformer/ML | Document type detection |
| Embeddings | Sentence Transformer | Semantic similarity |
| NER | Sequence Labeling | Entity extraction |
| Keywords | Statistical/ML | Topic identification |
## 1. OCR (Optical Character Recognition)

### Purpose
Extract searchable text from scanned PDFs and images.
### Engine Options
| Engine | Type | Best For | Languages |
|--------|------|----------|-----------|
| Tesseract 5 | Traditional + LSTM | General, multi-language | 100+ |
| docTR | Deep Learning | High-accuracy English | 5 |
| EasyOCR | Deep Learning | Multi-script | 80+ |
| PaddleOCR | Deep Learning | Chinese + others | 80+ |
| Azure Form Recognizer | Cloud API | Forms, tables | 50+ |
### Architecture
```mermaid
flowchart LR
    A[Image/PDF] --> B[Preprocessing]
    B --> C[Deskew]
    C --> D[Denoise]
    D --> E[Binarize]
    E --> F[OCR Engine]
    F --> G[Post-processing]
    G --> H[Text + Confidence]
```
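The engine stage could look like the following pytesseract sketch (Tesseract 5 must be installed separately; the file name is a placeholder). `image_to_data` exposes per-word confidences, from which the document-level confidence score can be derived.

```python
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")  # hypothetical input file

# Full-page text extraction.
text = pytesseract.image_to_string(image, lang="eng")

# Per-word confidences; -1 marks non-word boxes and is skipped.
data = pytesseract.image_to_data(image, lang="eng",
                                 output_type=pytesseract.Output.DICT)
confidences = [float(c) for c in data["conf"] if float(c) >= 0]
mean_confidence = sum(confidences) / len(confidences) / 100 if confidences else 0.0
print(text, mean_confidence)
```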
### Preprocessing Pipeline
| Step | Purpose | Tool |
|------|---------|------|
| Deskew | Straighten rotated text | OpenCV |
| Denoise | Remove noise | OpenCV bilateral filter |
| Binarize | Convert to B/W | Otsu's threshold |
| DPI normalize | Consistent resolution | PIL/Pillow |
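A sketch of the first three steps with OpenCV; the parameter values and the deskew heuristic are illustrative defaults, not tuned production settings.

```python
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Deskew: estimate page rotation from the minimum-area rectangle around
    # the dark (text) pixels. minAreaRect's angle convention changed in
    # OpenCV 4.5; this handles the [0, 90) convention.
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, matrix, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

    # Denoise: bilateral filtering smooths speckle while keeping glyph edges.
    denoised = cv2.bilateralFilter(deskewed, d=9, sigmaColor=75, sigmaSpace=75)

    # Binarize: Otsu's method picks the global black/white threshold.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```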
### Quality Metrics
| Metric | Description | Target |
|--------|-------------|--------|
| Character accuracy | % correct characters | >95% |
| Word accuracy | % correct words | >90% |
| Confidence score | Engine confidence | >0.8 |
## 2. Document Classification

### Purpose
Automatically categorize documents by type and content.
### Classification Hierarchy
```mermaid
graph TD
    A[Document] --> B{Type}
    B --> C[Invoice]
    B --> D[Contract]
    B --> E[Report]
    B --> F[Letter]
    B --> G[Form]
    C --> H{Category}
    H --> I[Finance]
    H --> J[Procurement]
    D --> K{Category}
    K --> L[Legal]
    K --> M[HR]
    K --> N[Vendor]
```
### Model Options
| Model | Type | Accuracy | Speed |
|-------|------|----------|-------|
| DistilBERT | Transformer | 92% | Fast |
| RoBERTa | Transformer | 95% | Medium |
| XGBoost + TF-IDF | Traditional ML | 88% | Very Fast |
| fastText | Shallow NN | 85% | Very Fast |
### Training Pipeline
```mermaid
flowchart LR
    A[Labeled Docs] --> B[Text Extraction]
    B --> C[Preprocessing]
    C --> D[Train/Val Split]
    D --> E[Model Training]
    E --> F[Evaluation]
    F --> G{Accuracy OK?}
    G -->|Yes| H[Deploy]
    G -->|No| I[Tune/More Data]
    I --> E
```
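The pipeline can be made concrete with the XGBoost + TF-IDF baseline from the model table; this is a minimal sketch in which the inline corpus is a placeholder for real labeled documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder corpus standing in for the labeled document set.
texts = [
    "Invoice 1042: total amount due $5,000 by March 1",
    "Invoice 1043: amount payable $1,200 net 30",
    "Invoice 1044: remittance advice enclosed",
    "This agreement is entered into by Acme Corp and the Contractor",
    "The parties agree to the following terms and conditions",
    "This contract shall terminate on December 31, 2025",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = invoice, 1 = contract

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=1 / 3, stratify=labels, random_state=42
)

# TF-IDF features over unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

model = XGBClassifier(n_estimators=300, max_depth=6)
model.fit(X_train_vec, y_train)
print(classification_report(y_val, model.predict(X_val_vec)))
```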
### Confidence Handling
| Score | Action |
|-------|--------|
| >0.90 | Auto-apply, high confidence |
| 0.70-0.90 | Auto-apply, medium confidence |
| 0.50-0.70 | Apply but flag for review |
| <0.50 | Queue for manual classification |
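These thresholds could be encoded as a simple dispatch; the action names are illustrative, not an existing Pebble DMS API.

```python
def classification_action(score: float) -> str:
    """Map a classifier confidence score to a handling action."""
    if score > 0.90:
        return "auto_apply_high"
    if score > 0.70:
        return "auto_apply_medium"
    if score > 0.50:
        return "apply_and_flag"
    return "manual_queue"
```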
## 3. Embedding Models

### Purpose
Generate vector representations for semantic search and near-duplicate detection.
### Model Comparison
| Model | Dimensions | Speed | Quality |
|-------|------------|-------|---------|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| e5-base-v2 | 768 | Medium | Best |
| multilingual-e5-base | 768 | Medium | Multi-lang |
### Embedding Strategy
```mermaid
flowchart TB
    A[Document Text] --> B{Length?}
    B -->|<512 tokens| C[Single Embedding]
    B -->|>512 tokens| D[Chunk into 512]
    D --> E[Embed Each Chunk]
    E --> F[Mean Pooling]
    F --> G[Document Embedding]
    C --> G
    G --> H[Store in Qdrant]
```
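A sketch of the chunk-then-mean-pool strategy with sentence-transformers; the 512-token limit is approximated by a whitespace word count here, and the chunk size is an assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_document(text: str, chunk_words: int = 350) -> np.ndarray:
    words = text.split()
    if len(words) <= chunk_words:
        return model.encode(text, normalize_embeddings=True)
    # Split long documents into fixed-size word chunks.
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    vectors = model.encode(chunks, normalize_embeddings=True)  # one vector per chunk
    pooled = vectors.mean(axis=0)                              # mean pooling
    return pooled / np.linalg.norm(pooled)  # re-normalize for cosine search
```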
### Use Cases
| Use Case | Similarity Threshold |
|----------|----------------------|
| Exact content duplicate | >0.99 |
| Near-duplicate | >0.95 |
| Similar documents | >0.85 |
| Related documents | >0.70 |
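Given two document embeddings, the thresholds map onto cosine similarity like this; the function and category names are illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relationship(similarity: float) -> str | None:
    if similarity > 0.99:
        return "exact_duplicate"
    if similarity > 0.95:
        return "near_duplicate"
    if similarity > 0.85:
        return "similar"
    if similarity > 0.70:
        return "related"
    return None  # below all thresholds: unrelated
```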
## 4. Named Entity Recognition (NER)

### Purpose
Extract structured entities from unstructured text.
### Entity Types
| Entity | Examples | Use |
|--------|----------|-----|
| PERSON | "John Smith" | Contact tagging |
| ORG | "Acme Corp" | Company tagging |
| DATE | "January 15, 2024" | Date filtering |
| MONEY | "$5,000" | Financial tagging |
| LOCATION | "New York, NY" | Geographic tagging |
| EMAIL | "john@acme.com" | Contact extraction |
| PHONE | "+1-555-1234" | Contact extraction |
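Note that stock spaCy models do not emit EMAIL or PHONE labels, so those two types are usually handled with a pattern pass. The regexes below are deliberately simple illustrations, not production-grade validators.

```python
import re

# Deliberately loose patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def pattern_entities(text: str) -> list[tuple[str, str]]:
    entities = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    entities += [("PHONE", m.group()) for m in PHONE_RE.finditer(text)]
    return entities
```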
### Model Options
| Model | Type | Entities | Speed |
|-------|------|----------|-------|
| spaCy en_core_web_lg | Statistical | 18 types | Fast |
| spaCy transformers | BERT-based | 18 types | Medium |
| Flair | BiLSTM-CRF | Custom | Medium |
| GLiNER | Zero-shot | Any | Slow |
### Pipeline
```mermaid
flowchart LR
    A[Text] --> B[Tokenize]
    B --> C[NER Model]
    C --> D[Entity Spans]
    D --> E[Dedup & Clean]
    E --> F[Entity List]
```
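A minimal version of this pipeline with the en_core_web_lg model named above (installed via `python -m spacy download en_core_web_lg`); the dedup key is one reasonable choice, not a fixed spec.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def extract_entities(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    seen = set()
    entities = []
    for ent in doc.ents:
        key = (ent.label_, ent.text.strip().lower())  # dedup & clean
        if key not in seen:
            seen.add(key)
            entities.append((ent.label_, ent.text.strip()))
    return entities

print(extract_entities(
    "Acme Corp hired John Smith in New York on January 15, 2024 for $5,000."
))
```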
## 5. Keyword Extraction

### Purpose
Identify important terms and topics from document content.
### Methods
| Method | Type | Best For |
|--------|------|----------|
| TF-IDF | Statistical | Corpus-level importance |
| YAKE | Statistical | Unsupervised, fast |
| RAKE | Statistical | Phrase extraction |
| KeyBERT | Neural | Semantic keywords |
### KeyBERT Example
```python
from keybert import KeyBERT

# `doc` holds the extracted document text; a short placeholder is used here.
doc = (
    "Machine learning pipelines handle data processing, model training, "
    "and evaluation for document classification."
)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
    stop_words='english',
    top_n=10,
)
# Returns: [('machine learning', 0.72), ('data processing', 0.68), ...]
```
## 6. Topic Modeling (Future)

### Purpose
Discover latent topics across the document collection.
### Methods
| Method | Type | Best For |
|--------|------|----------|
| LDA | Probabilistic | Traditional topics |
| BERTopic | Neural | Semantic topics |
| Top2Vec | Neural | Auto-clustering |
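Since this capability is still on the roadmap, the following is only an orientation sketch of the BERTopic API; a real corpus of at least a few hundred documents is needed for stable topics, so a public dataset stands in here.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Demo corpus only; in Pebble DMS this would be the extracted document texts.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # one row per discovered topic
```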
## Model Training & Evaluation

### Training Data Requirements
| Task | Minimum Samples | Recommended |
|------|-----------------|-------------|
| Classification (per class) | 50 | 200+ |
| NER (per entity type) | 100 | 500+ |
| Custom embeddings | 10,000 | 100,000+ |
### Evaluation Metrics
| Task | Metrics |
|------|---------|
| Classification | Accuracy, F1, Precision, Recall |
| NER | Entity-level F1 |
| Embeddings | Retrieval MRR, Recall@k |
| OCR | Character Error Rate (CER) |
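For reference, CER is the edit distance between OCR output and the ground-truth text, divided by the ground-truth length; a self-contained version:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

assert cer("invoice", "inv0ice") == 1 / 7  # one substitution in seven characters
```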
### Continuous Improvement
```mermaid
flowchart LR
    A[Production] --> B[Log Predictions]
    B --> C[Human Review]
    C --> D[Corrections]
    D --> E[Training Data]
    E --> F[Retrain Model]
    F --> G[Evaluate]
    G --> H{Better?}
    H -->|Yes| I[Deploy]
    H -->|No| J[Iterate]
    I --> A
```
## Recommended Stack

### MVP Configuration
```yaml
ocr:
  engine: tesseract
  version: 5.x
  languages: [eng]

classification:
  model: distilbert-base-uncased
  framework: transformers

embeddings:
  model: all-MiniLM-L6-v2
  framework: sentence-transformers

ner:
  model: en_core_web_lg
  framework: spacy

keywords:
  method: keybert
```
### Hardware Requirements
| Component | CPU | GPU | RAM |
|-----------|-----|-----|-----|
| OCR Worker | 4 cores | Optional | 8GB |
| Classification | 2 cores | Recommended | 4GB |
| Embedding | 2 cores | Recommended | 4GB |
| NER | 2 cores | Optional | 4GB |