Classification Use Cases (CLS)

Module Purpose: Automatically categorize documents by type and content category using ML models. This module contains 7 use cases.


Use Case Quick Reference

ID       Title                           Priority
CLS-001  Extract Text Content            P1
CLS-002  Classify Document Type          P1
CLS-003  Classify Content Category       P1
CLS-004  Set Confidence Score            P1
CLS-005  Flag Low-Confidence for Review  P2
CLS-006  Manual Classification Override  P2
CLS-007  Train Classification Model      P3

UC-CLS-001: Extract Text Content

Overview

Field Value
ID CLS-001
Title Extract Text Content
Actor System
Priority P1 (MVP Phase 2)

Description

Extract text content from documents for classification. PDFs with a text layer use native text extraction; scanned documents use OCR output.

Steps

  1. Determine if PDF has text layer or is scanned
  2. For text PDFs: Extract using pdfplumber/PyMuPDF
  3. For scanned/images: Use OCR output (from OCR pipeline)
  4. Clean and normalize text (remove extra whitespace, fix encoding)
  5. Store extracted text in document record
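
The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: it assumes the per-page strings have already been pulled out by pdfplumber or PyMuPDF and that the OCR pipeline has supplied its text. The names `has_text_layer`, `normalize_text`, `build_extraction_record`, and the 20-characters-per-page cutoff are hypothetical.

```python
import re
import unicodedata

MIN_CHARS_PER_PAGE = 20  # assumed cutoff for "this PDF has a usable text layer"


def has_text_layer(page_texts, min_chars=MIN_CHARS_PER_PAGE):
    """Return True if the pages yield enough native text, on average, to skip OCR."""
    if not page_texts:
        return False
    total = sum(len(t.strip()) for t in page_texts)
    return total / len(page_texts) >= min_chars


def normalize_text(raw):
    """Normalize Unicode forms, unify newlines, and collapse extra whitespace."""
    text = unicodedata.normalize("NFKC", raw)       # e.g. ligatures, non-breaking spaces
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\f\v]+", " ", text)         # collapse horizontal whitespace
    text = re.sub(r" *\n *", "\n", text)            # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)          # keep at most one blank line
    return text.strip()


def build_extraction_record(document_id, native_pages, ocr_text, preview_chars=60):
    """Pick the extraction method, clean the text, and build the output record."""
    if has_text_layer(native_pages):
        method, raw = "native", "\n\n".join(native_pages)
    else:
        method, raw = "ocr", ocr_text
    text = normalize_text(raw)
    return {
        "document_id": document_id,
        "text_length": len(text),
        "extraction_method": method,
        "preview": text[:preview_chars],
    }
```

A character-count heuristic is only one way to detect a text layer; scanned PDFs with an embedded but low-quality OCR layer would need a stricter check.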

Output

{
  "document_id": "doc_abc",
  "text_length": 2500,
  "extraction_method": "native",
  "preview": "INVOICE\n\nInvoice Number: 12345\nDate: January 15, 2024..."
}

Acceptance Criteria

  • Text extracted from native PDFs
  • OCR text used for scanned documents
  • Text is cleaned and normalized

UC-CLS-002: Classify Document Type

Overview

Field Value
ID CLS-002
Title Classify Document Type
Actor Classification Service
Priority P1 (MVP Phase 2)

Description

Predict the document type (Invoice, Contract, Report, etc.) using ML classification.

Document Types

Type         Description
invoice      Bills, payment requests
contract     Agreements, NDAs, legal documents
report       Analysis, summaries, reviews
letter       Correspondence, memos
form         Applications, questionnaires
id_document  Passports, licenses, IDs
receipt      Purchase receipts
other        Unclassified

Steps

  1. Retrieve extracted text
  2. Preprocess text (tokenize, truncate to 512 tokens)
  3. Run through type classifier model
  4. Return predicted type with confidence

Output

{
  "document_id": "doc_abc",
  "predicted_type": "invoice",
  "confidence": 0.94,
  "alternatives": [
    {"type": "receipt", "confidence": 0.04},
    {"type": "other", "confidence": 0.02}
  ]
}
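
As a rough sketch of how the steps could produce the output above, the following turns raw classifier scores into a top-3 result. It assumes the model returns one logit per document type; the whitespace split is only a stand-in for the model's real tokenizer, and `truncate_tokens`, `softmax`, and `top_k_result` are hypothetical helper names.

```python
import math

MAX_TOKENS = 512  # truncation limit from the preprocessing step
LABELS = ["invoice", "contract", "report", "letter",
          "form", "id_document", "receipt", "other"]


def truncate_tokens(text, max_tokens=MAX_TOKENS):
    """Keep the first max_tokens tokens (whitespace split is a placeholder tokenizer)."""
    return text.split()[:max_tokens]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_result(document_id, logits, labels=LABELS, k=3):
    """Return the best prediction plus the next k-1 alternatives."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)[:k]
    (best_type, best_conf), rest = ranked[0], ranked[1:]
    return {
        "document_id": document_id,
        "predicted_type": best_type,
        "confidence": round(best_conf, 2),
        "alternatives": [
            {"type": t, "confidence": round(c, 2)} for t, c in rest
        ],
    }
```

With `k=3`, the result carries one primary prediction and two alternatives, which matches the "top 3 predictions" criterion.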

Acceptance Criteria

  • Classification accuracy >85% on test set
  • Processing time <500ms per document
  • Top 3 predictions returned

UC-CLS-003: Classify Content Category

Overview

Field Value
ID CLS-003
Title Classify Content Category
Actor Classification Service
Priority P1 (MVP Phase 2)

Description

Classify the content category (Finance, Legal, HR, etc.) independent of document type.

Categories

Category    Examples
finance     Invoices, budgets, financial reports
legal       Contracts, compliance documents
hr          Employee records, policies
operations  Procedures, manuals
marketing   Brochures, campaigns
technical   Specifications, documentation

Output

{
  "document_id": "doc_abc",
  "predicted_category": "finance",
  "confidence": 0.88
}

Acceptance Criteria

  • Multi-label classification supported
  • Hierarchical categories supported (future)
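
One common way to support multi-label classification is to score each category independently with a sigmoid and keep every category above a cutoff. The sketch below illustrates that idea only; the spec does not say how the model does it, and `multi_label_categories` and the 0.5 cutoff are assumptions.

```python
import math

CATEGORIES = ["finance", "legal", "hr", "operations", "marketing", "technical"]


def sigmoid(x):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_categories(logits, labels=CATEGORIES, threshold=0.5):
    """Return every category whose independent score reaches the threshold."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    selected = [
        {"category": label, "confidence": round(p, 2)}
        for label, p in scored
        if p >= threshold
    ]
    # Highest-confidence category first
    selected.sort(key=lambda item: item["confidence"], reverse=True)
    return selected
```

Unlike a softmax over all categories, the scores here do not sum to 1, so a document such as a legally reviewed budget can be both `finance` and `legal`.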

UC-CLS-004: Set Confidence Score

Overview

Field Value
ID CLS-004
Title Set Confidence Score
Actor System
Priority P1 (MVP Phase 2)

Description

Calculate and store confidence scores for all classifications.

Confidence Levels

Score           Level     Action
> 0.90          High      Auto-apply
0.70 to 0.90    Medium    Apply, optional review
0.50 to < 0.70  Low       Apply but flag
< 0.50          Very Low  Queue for manual review
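
The table can be expressed as a small lookup with configurable thresholds. This is a sketch under assumptions: the `Thresholds` class, the action strings, and the boundary handling (a score of exactly 0.90 or 0.70 maps to Medium, exactly 0.50 to Low) are illustrative choices rather than part of the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Configurable cut points; defaults follow the confidence table."""
    high: float = 0.90
    medium: float = 0.70
    low: float = 0.50


def confidence_level(score, t=Thresholds()):
    """Map a confidence score in [0, 1] to a (level, action) pair."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score > t.high:
        return ("high", "auto_apply")
    if score >= t.medium:
        return ("medium", "apply_optional_review")
    if score >= t.low:
        return ("low", "apply_and_flag")
    return ("very_low", "manual_review")
```

Passing a different `Thresholds` instance, for example `Thresholds(high=0.95)`, changes the cut points without touching the mapping logic.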

Acceptance Criteria

  • Confidence is calibrated (e.g., about 80% of predictions made at 0.80 confidence are correct)
  • Thresholds are configurable
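
One way to check the calibration criterion is expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence with its observed accuracy. The spec does not name a metric, so the choice of ECE, the function name, and the 10 equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins.

    confidences: predicted confidence per document, each in [0, 1]
    correct:     matching booleans, True where the prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means stated confidence tracks observed accuracy; a model that reports 0.90 but is right half the time would score about 0.40.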

UC-CLS-005: Flag Low-Confidence for Review

Overview

Field Value
ID CLS-005
Title Flag Low-Confidence for Review
Actor System
Priority P2

Description

Queue documents with low classification confidence for human review.

Steps

  1. Check classification confidence against the configured review threshold (see CLS-004)
  2. If confidence is below the threshold, add the document to the review queue
  3. Notify reviewers of pending items
  4. Track review status

Output

{
  "document_id": "doc_abc",
  "review_required": true,
  "reason": "confidence_below_threshold",
  "confidence": 0.45,
  "predicted_type": "contract"
}
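
A minimal in-memory sketch of the flow above, producing the same record shape. It assumes a review threshold of 0.50 (the Very Low cutoff from CLS-004) and leaves out reviewer notification; `review_decision` and `ReviewQueue` are hypothetical names, and a real queue would be persisted.

```python
from collections import deque

REVIEW_THRESHOLD = 0.50  # assumed default, taken from the CLS-004 table


def review_decision(document_id, predicted_type, confidence,
                    threshold=REVIEW_THRESHOLD):
    """Build the review record for one classified document."""
    needs_review = confidence < threshold
    return {
        "document_id": document_id,
        "review_required": needs_review,
        "reason": "confidence_below_threshold" if needs_review else None,
        "confidence": confidence,
        "predicted_type": predicted_type,
    }


class ReviewQueue:
    """FIFO queue of pending reviews with per-document status tracking."""

    def __init__(self):
        self._pending = deque()
        self.status = {}  # document_id -> "pending" | "approved" | "corrected"

    def submit(self, decision):
        """Queue the document if it needs review; return whether it was queued."""
        if not decision["review_required"]:
            return False
        self._pending.append(decision)
        self.status[decision["document_id"]] = "pending"
        return True

    def next_item(self):
        """Hand the oldest pending item to a reviewer, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def resolve(self, document_id, outcome):
        """Record the reviewer's outcome for a pending document."""
        if outcome not in ("approved", "corrected"):
            raise ValueError(f"unknown outcome: {outcome}")
        if self.status.get(document_id) != "pending":
            raise KeyError(f"{document_id} is not pending review")
        self.status[document_id] = outcome
```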

Acceptance Criteria

  • Low-confidence items queued automatically
  • Review queue UI accessible
  • Reviewers can approve or correct

UC-CLS-006: Manual Classification Override

Overview

Field Value
ID CLS-006
Title Manual Classification Override
Actor User
Priority P2

Description

Allow users to manually correct or set document classification.

Steps

  1. User views document and current classification
  2. User selects correct type/category
  3. System updates classification
  4. Correction logged for model retraining
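
The override flow could be modeled along these lines. The sketch keeps the history in memory on the record itself; the class, field names, and log entry layout are assumptions, and a real system would persist both the classification and the audit trail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DocumentClassification:
    """Current classification of one document plus its override history."""
    document_id: str
    doc_type: str
    category: str
    history: list = field(default_factory=list)

    def override(self, user_id, new_type=None, new_category=None):
        """Apply a manual correction and log the old and new values."""
        if new_type is None and new_category is None:
            raise ValueError("provide new_type, new_category, or both")
        entry = {
            "user_id": user_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "old": {"type": self.doc_type, "category": self.category},
            "new": {
                "type": new_type or self.doc_type,
                "category": new_category or self.category,
            },
        }
        self.doc_type = entry["new"]["type"]
        self.category = entry["new"]["category"]
        self.history.append(entry)
        return entry


def retraining_examples(docs):
    """Collect (document_id, type, category) for documents with a manual correction."""
    return [(d.document_id, d.doc_type, d.category) for d in docs if d.history]
```

Keeping the old and new values in each entry gives both the override history and the labeled corrections that retraining (CLS-007) can consume.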

Acceptance Criteria

  • Manual override possible for any document
  • Override history tracked
  • Corrections available for retraining

UC-CLS-007: Train Classification Model

Overview

Field Value
ID CLS-007
Title Train Classification Model
Actor ML Engineer, Admin
Priority P3

Description

Retrain classification models using accumulated labeled data.

Steps

  1. Export labeled documents
  2. Split into train, validation, and holdout sets
  3. Fine-tune model on new data
  4. Evaluate on holdout set
  5. If the new model outperforms the current model on the holdout set, deploy it
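
The data split and the deploy decision might look like the sketch below. Fine-tuning itself is out of scope here; the split fractions, the fixed seed, and the `min_gain` margin are assumptions, and `predict` stands for any callable that maps text to a label.

```python
import random


def split_dataset(examples, val_frac=0.1, holdout_frac=0.1, seed=42):
    """Shuffle reproducibly, then split into train, validation, and holdout sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_holdout = int(len(items) * holdout_frac)
    n_val = int(len(items) * val_frac)
    holdout = items[:n_holdout]
    val = items[n_holdout:n_holdout + n_val]
    train = items[n_holdout + n_val:]
    return train, val, holdout


def accuracy(predict, examples):
    """Fraction of (text, label) pairs that predict() labels correctly."""
    if not examples:
        return 0.0
    hits = sum(1 for text, label in examples if predict(text) == label)
    return hits / len(examples)


def should_deploy(candidate_acc, current_acc, min_gain=0.0):
    """Deploy only if the candidate beats the current model by more than min_gain."""
    return candidate_acc > current_acc + min_gain
```

Both models should be scored on the same holdout set, which neither has seen during fine-tuning, so the comparison is fair.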

Acceptance Criteria

  • Automated training pipeline
  • Model versioning
  • A/B testing between models
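
For A/B testing, one simple approach is to route each document to a model version by hashing its ID, so the same document always goes to the same model. This is a sketch of that idea only; the version names `model_v1` and `model_v2` and the 10% candidate share are hypothetical.

```python
import hashlib


def assign_model(document_id, candidate_share=0.1,
                 current="model_v1", candidate="model_v2"):
    """Deterministically route a document to the current or candidate model."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be in [0, 1]")
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return candidate if bucket < candidate_share else current
```

Hash-based routing needs no stored assignment table, and raising `candidate_share` keeps documents already on the candidate model where they are.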
