Deduplication Use Cases (DUP)

Module Purpose: Identify and handle duplicate documents using hash-based, content-based, and visual similarity methods. This module contains 8 use cases.


Use Case Quick Reference

ID       Title                               Priority
DUP-001  Compute File Hash (MD5/SHA256)      P1
DUP-002  Check Exact Duplicate               P1
DUP-003  Compute Document Embedding          P1
DUP-004  Find Near-Duplicates                P1
DUP-005  Compute Visual Similarity (Images)  P2
DUP-006  Merge Duplicate Records             P2
DUP-007  Archive/Delete Duplicates           P2
DUP-008  Generate Dedup Report               P2

UC-DUP-001: Compute File Hash

Overview

ID        DUP-001
Title     Compute File Hash (MD5/SHA256)
Actor     System
Priority  P1 (MVP Phase 1)

Description

Calculate cryptographic hashes of uploaded files for exact duplicate detection.

Steps

  1. Read the file in 64 KB chunks
  2. Update the MD5 and SHA-256 hash objects with each chunk
  3. Finalize both digests and store them (see the sketch below)
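
The hashing loop is a single pass over the file. A minimal sketch in Python, using only the standard-library hashlib module:

import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KB, per step 1

def compute_file_hashes(path: str) -> dict:
    """Stream the file once, feeding each chunk to both hash objects."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "hash_md5": md5.hexdigest(),
        "hash_sha256": sha256.hexdigest(),
    }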

Output

{
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
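
(For reference, the sample values above are the MD5 and SHA-256 digests of an empty file.)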

Acceptance Criteria

  • Hash computed for all uploaded files
  • Processing time <1s for 100MB files
  • Both MD5 and SHA256 stored

UC-DUP-002: Check Exact Duplicate

Overview

ID        DUP-002
Title     Check Exact Duplicate
Actor     System
Priority  P1 (MVP Phase 1)

Description

Check if a file with the same hash already exists in the system.

Steps

  1. Query the database for an existing hash match
  2. If a match is found:
       • Return the existing document ID
       • Link the new upload to the existing document
  3. If no match is found, proceed with normal processing
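
A minimal sketch of the lookup, assuming a relational documents table with a hash_sha256 column (an illustrative schema, not one specified here):

import sqlite3

def check_exact_duplicate(conn: sqlite3.Connection, sha256: str) -> dict:
    """Return duplicate info for a hash, mirroring the Output shape below."""
    row = conn.execute(
        "SELECT id FROM documents WHERE hash_sha256 = ?", (sha256,)
    ).fetchone()
    if row is None:
        return {"is_duplicate": False}
    return {
        "is_duplicate": True,
        "existing_document_id": row[0],
        "match_type": "exact_hash",
    }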

Output

{
  "is_duplicate": true,
  "existing_document_id": "doc_xyz",
  "match_type": "exact_hash"
}

Acceptance Criteria

  • Exact duplicates are detected before storage
  • Duplicate uploads are linked, not stored twice
  • User is notified of duplicate status

UC-DUP-003: Compute Document Embedding

Overview

ID        DUP-003
Title     Compute Document Embedding
Actor     Embedding Worker
Priority  P1 (MVP Phase 2)

Description

Generate vector embedding from document text for semantic similarity detection.

Steps

  1. Retrieve extracted text from document
  2. Truncate/chunk if >512 tokens
  3. Pass through embedding model
  4. Store embedding in vector database
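
A sketch of steps 3-4, assuming the sentence-transformers and qdrant-client packages; the collection name "documents" and the point ID scheme are illustrative:

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, mean pooling
client = QdrantClient(url="http://localhost:6333")

def embed_and_store(document_id: str, text: str, point_id: int) -> list[float]:
    # The model truncates input beyond its max sequence length (step 2).
    vector = model.encode(text).tolist()
    client.upsert(
        collection_name="documents",  # illustrative collection name
        points=[PointStruct(id=point_id, vector=vector,
                            payload={"document_id": document_id})],
    )
    return vector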

Model Configuration

Model       all-MiniLM-L6-v2
Dimensions  384
Max tokens  512
Pooling     Mean

Output

{
  "document_id": "doc_abc",
  "embedding_id": "emb_123",
  "dimensions": 384,
  "model": "all-MiniLM-L6-v2"
}

Acceptance Criteria

  • Embeddings generated for all text-extracted documents
  • Stored in Qdrant for similarity search
  • Processing time <2s per document

UC-DUP-004: Find Near-Duplicates

Overview

ID        DUP-004
Title     Find Near-Duplicates
Actor     System
Priority  P1 (MVP Phase 2)

Description

Identify documents with similar content using embedding similarity.

Steps

  1. Retrieve document embedding
  2. Query vector store for similar documents
  3. Apply similarity threshold (configurable)
  4. Return ranked list of candidates
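
A sketch of the query, reusing the illustrative "documents" collection from DUP-003 and assuming it was created with cosine distance, so scores line up with the thresholds below:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def find_near_duplicates(vector: list[float], threshold: float = 0.95) -> list[dict]:
    """Rank stored documents by similarity, keeping those above the cutoff."""
    hits = client.search(
        collection_name="documents",  # illustrative collection name
        query_vector=vector,
        limit=10,
        score_threshold=threshold,    # step 3: configurable cutoff
    )
    return [
        {"id": h.payload["document_id"], "similarity": h.score}
        for h in hits
    ]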

Similarity Thresholds

Threshold Interpretation
>0.99 Exact content duplicate
0.95-0.99 Near-duplicate (minor edits)
0.85-0.95 Similar document
<0.85 Different document

Output

{
  "document_id": "doc_abc",
  "near_duplicates": [
    {
      "id": "doc_xyz",
      "similarity": 0.97,
      "title": "Invoice #123 (v2)"
    },
    {
      "id": "doc_def",
      "similarity": 0.92,
      "title": "Invoice #123 (draft)"
    }
  ]
}

Acceptance Criteria

  • Near-duplicates detected with >90% precision
  • Query time <100ms
  • Threshold is configurable

UC-DUP-005: Compute Visual Similarity

Overview

ID        DUP-005
Title     Compute Visual Similarity (Images)
Actor     System
Priority  P2 (MVP Phase 3)

Description

Detect duplicate images even with different resolutions or minor edits using perceptual hashing.

Hash Types

Algorithm  Description
pHash      DCT-based perceptual hash; most robust to resizing and re-compression
dHash      Gradient-based difference hash; fast
aHash      Average-brightness hash; simplest and fastest, least robust

Steps

  1. Resize the image to a small fixed size (e.g. 8x8 for aHash, 9x8 for dHash, 32x32 for pHash)
  2. Convert to grayscale
  3. Compute perceptual hash
  4. Store hash for comparison
  5. Query for similar hashes (Hamming distance)
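
A sketch using the third-party imagehash library, which performs the resize and grayscale steps internally; the file names and distance threshold are illustrative:

from PIL import Image
import imagehash

def perceptual_hash(path: str) -> imagehash.ImageHash:
    """Compute a 64-bit pHash for the image at the given path."""
    return imagehash.phash(Image.open(path))

h1 = perceptual_hash("invoice_scan.png")        # illustrative file names
h2 = perceptual_hash("invoice_scan_small.png")
distance = h1 - h2            # Hamming distance between the two 64-bit hashes
similarity = 1 - distance / 64.0  # e.g. distance 2 -> 0.97, as in Output below
is_similar = distance <= 4    # illustrative threshold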

Output

{
  "document_id": "doc_img123",
  "phash": "8f0f0f0f0f0f0f0f",
  "similar_images": [
    {
      "id": "doc_img456",
      "hamming_distance": 2,
      "similarity": 0.97
    }
  ]
}

Acceptance Criteria

  • Detects resized duplicates
  • Tolerates light cropping (perceptual hashes degrade under heavy crops)
  • Tolerates minor quality changes

UC-DUP-006: Merge Duplicate Records

Overview

ID        DUP-006
Title     Merge Duplicate Records
Actor     User
Priority  P2

Description

Allow users to manually merge duplicate documents, combining metadata and tags.

Steps

  1. User selects documents to merge
  2. Choose primary document
  3. Merge tags from all documents
  4. Update references to point to primary
  5. Archive or delete secondary documents
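
An in-memory sketch of the tag-merge step (step 3). The document dicts and their tags field are illustrative; the reference updates and archival in steps 4-5 belong to the persistence layer:

def merge_documents(primary: dict, secondaries: list[dict]) -> dict:
    """Combine tags into the primary document without duplicates."""
    merged_tags = set(primary.get("tags", []))
    for doc in secondaries:
        merged_tags.update(doc.get("tags", []))
    primary["tags"] = sorted(merged_tags)
    # Steps 4-5 (repointing references, archiving secondaries) would be
    # transactional database updates in a real implementation.
    return primary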

Acceptance Criteria

  • Tags are combined without duplicates
  • References are updated
  • Merge history is logged

UC-DUP-007: Archive/Delete Duplicates

Overview

ID        DUP-007
Title     Archive/Delete Duplicates
Actor     User, Admin
Priority  P2

Description

Remove or archive confirmed duplicate documents.

Options

Action   Behavior
Archive  Move to archive, retain metadata
Delete   Soft delete, recoverable
Purge    Hard delete, permanent
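
A minimal sketch of the Delete and Purge behaviors, assuming a documents table with a nullable deleted_at column (an assumed schema, not one specified here):

import sqlite3
from datetime import datetime, timezone

def soft_delete(conn: sqlite3.Connection, document_id: str) -> None:
    # Delete: mark the row but keep the data, so the action is reversible.
    conn.execute(
        "UPDATE documents SET deleted_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), document_id),
    )

def purge(conn: sqlite3.Connection, document_id: str) -> None:
    # Purge: remove the row permanently; blob storage would also be freed here.
    conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))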

Acceptance Criteria

  • Duplicates can be archived
  • Deletion is reversible (soft delete)
  • Storage is reclaimed after purge

UC-DUP-008: Generate Dedup Report

Overview

ID        DUP-008
Title     Generate Dedup Report
Actor     User, Admin
Priority  P2

Description

Generate a report of detected duplicates and storage savings.

Report Contents

Section           Details
Summary           Total docs, duplicates found, storage saved
Exact Duplicates  List with file sizes
Near-Duplicates   List with similarity scores
Recommendations   Suggested actions

Output

{
  "generated_at": "2024-01-15T10:00:00Z",
  "summary": {
    "total_documents": 10000,
    "exact_duplicates": 1500,
    "near_duplicates": 800,
    "potential_savings_gb": 45.2
  },
  "exact_duplicate_groups": [...],
  "near_duplicate_groups": [...]
}
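
A sketch of the savings calculation: within each exact-duplicate group, every copy after the first is reclaimable. The size_bytes field is an assumed attribute of each document record:

def potential_savings_gb(exact_duplicate_groups: list[list[dict]]) -> float:
    """Sum the sizes of all redundant copies across duplicate groups."""
    saved_bytes = sum(
        sum(doc["size_bytes"] for doc in group[1:])  # keep the first copy
        for group in exact_duplicate_groups
    )
    return round(saved_bytes / 1024**3, 1)  # bytes -> GiB, one decimal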

Acceptance Criteria

  • Report includes all duplicate types
  • Storage savings calculated
  • Export to CSV/PDF available
