Classification Use Cases (CLS)
Module Purpose: Automatically categorize documents by type and content category using ML models. This module contains 7 use cases.
Use Case Quick Reference
| ID | Title | Priority |
| CLS-001 | Extract Text Content | P1 |
| CLS-002 | Classify Document Type | P1 |
| CLS-003 | Classify Content Category | P1 |
| CLS-004 | Set Confidence Score | P1 |
| CLS-005 | Flag Low-Confidence for Review | P2 |
| CLS-006 | Manual Classification Override | P2 |
| CLS-007 | Train Classification Model | P3 |
UC-CLS-001: Extract Text Content
Overview
| Field | Value |
| ID | CLS-001 |
| Title | Extract Text Content |
| Actor | System |
| Priority | P1 (MVP Phase 2) |
Description
Extract text content from documents for classification. The system uses native text extraction for PDFs that have a text layer and OCR output for scanned documents and images.
Steps
- Determine if PDF has text layer or is scanned
- For text PDFs: Extract using pdfplumber/PyMuPDF
- For scanned/images: Use OCR output (from OCR pipeline)
- Clean and normalize text (remove extra whitespace, fix encoding)
- Store extracted text in document record
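The steps above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: it assumes the per-page native text has already been pulled out by a PDF library, and the names `choose_extraction_method`, `normalize_text`, `build_extraction_record`, and the `MIN_CHARS_PER_PAGE` threshold are all hypothetical.

```python
import re
import unicodedata

# Hypothetical threshold: a PDF averaging fewer native characters per page
# than this is treated as having no usable text layer. Tune on real data.
MIN_CHARS_PER_PAGE = 20


def choose_extraction_method(native_page_texts):
    """Return "native" if the text layer looks usable, otherwise "ocr"."""
    if not native_page_texts:
        return "ocr"
    total_chars = sum(len(text.strip()) for text in native_page_texts)
    if total_chars / len(native_page_texts) >= MIN_CHARS_PER_PAGE:
        return "native"
    return "ocr"


def normalize_text(raw):
    """Normalize Unicode forms and collapse extra whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. ligatures, non-breaking spaces
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters other than newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def build_extraction_record(document_id, text, method, preview_chars=80):
    """Assemble a record shaped like the Output example in this use case."""
    return {
        "document_id": document_id,
        "text_length": len(text),
        "extraction_method": method,
        "preview": text[:preview_chars],
    }
```

The character-count heuristic is only one way to detect a scanned PDF; a production check might also look at embedded fonts or image coverage per page.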
Output
{
  "document_id": "doc_abc",
  "text_length": 2500,
  "extraction_method": "native",
  "preview": "INVOICE\n\nInvoice Number: 12345\nDate: January 15, 2024..."
}
Acceptance Criteria
UC-CLS-002: Classify Document Type
Overview
| Field | Value |
| ID | CLS-002 |
| Title | Classify Document Type |
| Actor | Classification Service |
| Priority | P1 (MVP Phase 2) |
Description
Predict the document type (Invoice, Contract, Report, etc.) using ML classification.
Document Types
| Type | Description |
| invoice | Bills, receipts, payment requests |
| contract | Agreements, NDAs, legal documents |
| report | Analysis, summaries, reviews |
| letter | Correspondence, memos |
| form | Applications, questionnaires |
| id_document | Passports, licenses, IDs |
| receipt | Purchase receipts |
| other | Unclassified |
Steps
- Retrieve extracted text
- Preprocess text (tokenize, truncate to 512 tokens)
- Run through type classifier model
- Return predicted type with confidence
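A rough sketch of how the preprocessing and result-shaping steps might fit together is shown below. It assumes the type classifier returns one raw score (logit) per document type. The whitespace tokenizer stands in for whatever tokenizer the real model uses, and `preprocess`, `softmax`, and `build_type_prediction` are hypothetical names.

```python
import math

MAX_TOKENS = 512  # truncation limit from the steps above


def preprocess(text, max_tokens=MAX_TOKENS):
    """Tokenize on whitespace (a stand-in for the model tokenizer) and truncate."""
    return text.split()[:max_tokens]


def softmax(logits):
    """Convert raw per-type scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def build_type_prediction(document_id, logits, top_alternatives=2):
    """Return the top type plus runner-up types, shaped like the Output example."""
    probs = softmax(logits)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    best_type, best_conf = ranked[0]
    runners_up = ranked[1:1 + top_alternatives]
    return {
        "document_id": document_id,
        "predicted_type": best_type,
        "confidence": round(best_conf, 2),
        "alternatives": [
            {"type": label, "confidence": round(prob, 2)}
            for label, prob in runners_up
        ],
    }
```

Truncating to the first 512 tokens keeps the part of the document where titles and headers usually appear, which tends to be the most informative region for type classification.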
Output
{
  "document_id": "doc_abc",
  "predicted_type": "invoice",
  "confidence": 0.94,
  "alternatives": [
    {"type": "receipt", "confidence": 0.04},
    {"type": "other", "confidence": 0.02}
  ]
}
Acceptance Criteria
UC-CLS-003: Classify Content Category
Overview
| Field | Value |
| ID | CLS-003 |
| Title | Classify Content Category |
| Actor | Classification Service |
| Priority | P1 (MVP Phase 2) |
Description
Classify the content category (Finance, Legal, HR, etc.) independent of document type.
Categories
| Category | Examples |
| finance | Invoices, budgets, financial reports |
| legal | Contracts, compliance documents |
| hr | Employee records, policies |
| operations | Procedures, manuals |
| marketing | Brochures, campaigns |
| technical | Specifications, documentation |
Output
{
  "document_id": "doc_abc",
  "predicted_category": "finance",
  "confidence": 0.88
}
Acceptance Criteria
UC-CLS-004: Set Confidence Score
Overview
| Field | Value |
| ID | CLS-004 |
| Title | Set Confidence Score |
| Actor | System |
| Priority | P1 (MVP Phase 2) |
Description
Calculate and store confidence scores for all classifications.
Confidence Levels
| Score | Level | Action |
| >0.90 | High | Auto-apply |
| 0.70-0.90 | Medium | Apply, optional review |
| 0.50-0.70 | Low | Apply but flag |
| <0.50 | Very Low | Queue for manual review |
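The table can be read as a simple threshold lookup, sketched below. The table does not say which band the exact boundary values belong to, so this sketch assumes that 0.90 and 0.70 fall in Medium and 0.50 falls in Low. The function name and the returned action identifiers are hypothetical.

```python
def confidence_level(score):
    """Map a confidence score in [0, 1] to a (level, action) pair.

    Boundary handling is an assumption: 0.90 and 0.70 are treated as
    Medium, and 0.50 is treated as Low.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    if score > 0.90:
        return ("high", "auto_apply")
    if score >= 0.70:
        return ("medium", "apply_optional_review")
    if score >= 0.50:
        return ("low", "apply_and_flag")
    return ("very_low", "queue_for_manual_review")
```

For example, the 0.94 invoice prediction from CLS-002 would be auto-applied, while a score of 0.45 would be queued for manual review.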
Acceptance Criteria
UC-CLS-005: Flag Low-Confidence for Review
Overview
| Field | Value |
| ID | CLS-005 |
| Title | Flag Low-Confidence for Review |
| Actor | System |
| Priority | P2 |
Description
Queue documents with low classification confidence for human review.
Steps
- Check confidence against threshold
- If below threshold, add to review queue
- Notify reviewers of pending items
- Track review status
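One possible in-memory shape for these steps is sketched below. The `ReviewQueue` class, its method names, and the default threshold of 0.50 (taken from the Very Low band in CLS-004) are assumptions for illustration; a real deployment would likely use a persistent queue and a real notification channel.

```python
from collections import deque
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.50  # assumption: matches the "Very Low" band in CLS-004


@dataclass
class ReviewQueue:
    threshold: float = REVIEW_THRESHOLD
    pending: deque = field(default_factory=deque)
    status: dict = field(default_factory=dict)
    notifications: list = field(default_factory=list)

    def submit(self, document_id, predicted_type, confidence):
        """Check the threshold and queue the document if it falls below it."""
        review_required = confidence < self.threshold
        result = {"document_id": document_id, "review_required": review_required}
        if review_required:
            result["reason"] = "confidence_below_threshold"
            self.pending.append(document_id)
            self.status[document_id] = "pending"
            # Stand-in for notifying reviewers (email, chat, dashboard, ...).
            self.notifications.append(f"Review pending for {document_id}")
        result["confidence"] = confidence
        result["predicted_type"] = predicted_type
        return result

    def complete(self, document_id):
        """Mark a queued document as reviewed."""
        self.pending.remove(document_id)
        self.status[document_id] = "reviewed"
```

Keeping the threshold as a field rather than a constant makes it easy to tighten or loosen the review volume without touching the queueing logic.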
Output
{
  "document_id": "doc_abc",
  "review_required": true,
  "reason": "confidence_below_threshold",
  "confidence": 0.45,
  "predicted_type": "contract"
}
Acceptance Criteria
UC-CLS-006: Manual Classification Override
Overview
| Field | Value |
| ID | CLS-006 |
| Title | Manual Classification Override |
| Actor | User |
| Priority | P2 |
Description
Allow users to manually correct or set document classification.
Steps
- User views document and current classification
- User selects correct type/category
- System updates classification
- System logs the correction for model retraining
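The override flow might look like the sketch below. The allowed values come from the Document Types and Categories tables in CLS-002 and CLS-003; the `Classification` record, the `apply_override` function, and the shape of the log entries are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone

# Allowed values, taken from the tables in CLS-002 and CLS-003.
DOCUMENT_TYPES = {
    "invoice", "contract", "report", "letter",
    "form", "id_document", "receipt", "other",
}
CATEGORIES = {"finance", "legal", "hr", "operations", "marketing", "technical"}


@dataclass(frozen=True)
class Classification:
    document_id: str
    doc_type: str
    category: str
    source: str = "model"  # "model" or "manual"


def apply_override(current, correction_log, user_id, new_type=None, new_category=None):
    """Return the corrected classification and append a log entry for retraining."""
    if new_type is not None and new_type not in DOCUMENT_TYPES:
        raise ValueError(f"unknown document type: {new_type}")
    if new_category is not None and new_category not in CATEGORIES:
        raise ValueError(f"unknown category: {new_category}")
    updated = replace(
        current,
        doc_type=new_type or current.doc_type,
        category=new_category or current.category,
        source="manual",
    )
    correction_log.append({
        "document_id": current.document_id,
        "user_id": user_id,
        "previous": {"type": current.doc_type, "category": current.category},
        "corrected": {"type": updated.doc_type, "category": updated.category},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return updated
```

Recording both the previous and corrected labels lets the retraining step in CLS-007 use each correction as a labeled example and also measure where the model was wrong.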
Acceptance Criteria
UC-CLS-007: Train Classification Model
Overview
| Field | Value |
| ID | CLS-007 |
| Title | Train Classification Model |
| Actor | ML Engineer, Admin |
| Priority | P3 |
Description
Retrain classification models using accumulated labeled data.
Steps
- Export labeled documents
- Split into train/validation sets
- Fine-tune model on new data
- Evaluate on holdout set
- If improved, deploy new model
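The split, evaluation, and deployment gate can be sketched without any ML framework, as below. Fine-tuning itself is omitted. The helper names, the 80/20 split, and the use of plain accuracy as the comparison metric are assumptions, not the module's documented behavior.

```python
import random


def split_labeled(examples, validation_fraction=0.2, seed=0):
    """Shuffle labeled examples reproducibly and split into train/validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, holdout):
    """Fraction of (text, label) pairs in the holdout set predicted correctly."""
    if not holdout:
        raise ValueError("holdout set is empty")
    correct = sum(1 for text, label in holdout if predict(text) == label)
    return correct / len(holdout)


def should_deploy(candidate_accuracy, current_accuracy, min_gain=0.0):
    """Deploy only if the candidate beats the current model by more than min_gain."""
    return candidate_accuracy > current_accuracy + min_gain
```

A typical use would evaluate both the current and the fine-tuned model on the same holdout set, then call `should_deploy` with the two accuracies. Setting `min_gain` above zero guards against deploying a model whose improvement is within noise.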
Acceptance Criteria