Deduplication Use Cases (DUP)

Module Purpose: Identify and handle duplicate documents using hash-based, content-based, and visual similarity methods. This module contains 8 use cases.


Use Case Quick Reference

ID       Title                               Priority
DUP-001  Compute File Hash (MD5/SHA256)      P1
DUP-002  Check Exact Duplicate               P1
DUP-003  Compute Document Embedding          P1
DUP-004  Find Near-Duplicates                P1
DUP-005  Compute Visual Similarity (Images)  P2
DUP-006  Merge Duplicate Records             P2
DUP-007  Archive/Delete Duplicates           P2
DUP-008  Generate Dedup Report               P2

UC-DUP-001: Compute File Hash

Overview

ID        DUP-001
Title     Compute File Hash (MD5/SHA256)
Actor     System
Priority  P1 (MVP Phase 1)

Description

Calculate cryptographic hashes of uploaded files for exact duplicate detection.

Steps

  1. Read the file in 64 KB chunks
  2. Update the MD5 and SHA-256 hash objects with each chunk
  3. Finalize both digests and store them (see the sketch below)
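
The hashing loop is a single pass over the file. A minimal sketch in Python, using only the standard-library hashlib module:

import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KB, per step 1

def compute_file_hashes(path: str) -> dict:
    """Stream the file once, feeding each chunk to both hash objects."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "hash_md5": md5.hexdigest(),
        "hash_sha256": sha256.hexdigest(),
    }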

Output

{
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
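
(For reference, the sample values above are the MD5 and SHA-256 digests of an empty file.)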

Acceptance Criteria

  • Hash computed for all uploaded files
  • Processing time <1s for 100MB files
  • Both MD5 and SHA256 stored

UC-DUP-002: Check Exact Duplicate

Overview

ID        DUP-002
Title     Check Exact Duplicate
Actor     System
Priority  P1 (MVP Phase 1)

Description

Check if a file with the same hash already exists in the system.

Steps

  1. Query the database for an existing hash match
  2. If a match is found:
       • Return the existing document ID
       • Link the new upload to the existing document
  3. If no match is found, proceed with normal processing
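
A minimal sketch of the lookup, assuming a relational documents table with a hash_sha256 column (an illustrative schema, not one specified here):

import sqlite3

def check_exact_duplicate(conn: sqlite3.Connection, sha256: str) -> dict:
    """Return duplicate info for a hash, mirroring the Output shape below."""
    row = conn.execute(
        "SELECT id FROM documents WHERE hash_sha256 = ?", (sha256,)
    ).fetchone()
    if row is None:
        return {"is_duplicate": False}
    return {
        "is_duplicate": True,
        "existing_document_id": row[0],
        "match_type": "exact_hash",
    }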

Output

{
  "is_duplicate": true,
  "existing_document_id": "doc_xyz",
  "match_type": "exact_hash"
}

Acceptance Criteria

  • Exact duplicates are detected before storage
  • Duplicate uploads are linked, not stored twice
  • User is notified of duplicate status

UC-DUP-003: Compute Document Embedding

Overview

ID        DUP-003
Title     Compute Document Embedding
Actor     Embedding Worker
Priority  P1 (MVP Phase 2)

Description

Generate vector embedding from document text for semantic similarity detection.

Steps

  1. Retrieve extracted text from document
  2. Truncate/chunk if >512 tokens
  3. Pass through embedding model
  4. Store embedding in vector database
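
A sketch of steps 3-4, assuming the sentence-transformers and qdrant-client packages; the collection name "documents" and the point ID scheme are illustrative:

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, mean pooling
client = QdrantClient(url="http://localhost:6333")

def embed_and_store(document_id: str, text: str, point_id: int) -> list[float]:
    # The model truncates input beyond its max sequence length (step 2).
    vector = model.encode(text).tolist()
    client.upsert(
        collection_name="documents",  # illustrative collection name
        points=[PointStruct(id=point_id, vector=vector,
                            payload={"document_id": document_id})],
    )
    return vector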

Model Configuration

Model       all-MiniLM-L6-v2
Dimensions  384
Max tokens  512
Pooling     Mean

Output

{
  "document_id": "doc_abc",
  "embedding_id": "emb_123",
  "dimensions": 384,
  "model": "all-MiniLM-L6-v2"
}

Acceptance Criteria

  • Embeddings generated for all text-extracted documents
  • Stored in Qdrant for similarity search
  • Processing time <2s per document

UC-DUP-004: Find Near-Duplicates

Overview

ID        DUP-004
Title     Find Near-Duplicates
Actor     System
Priority  P1 (MVP Phase 2)

Description

Identify documents with similar content using embedding similarity.

Steps

  1. Retrieve document embedding
  2. Query vector store for similar documents
  3. Apply similarity threshold (configurable)
  4. Return ranked list of candidates
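
A sketch of the query, reusing the illustrative "documents" collection from DUP-003 and assuming it was created with cosine distance, so scores line up with the thresholds below:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def find_near_duplicates(vector: list[float], threshold: float = 0.95) -> list[dict]:
    """Rank stored documents by similarity, keeping those above the cutoff."""
    hits = client.search(
        collection_name="documents",  # illustrative collection name
        query_vector=vector,
        limit=10,
        score_threshold=threshold,    # step 3: configurable cutoff
    )
    return [
        {"id": h.payload["document_id"], "similarity": h.score}
        for h in hits
    ]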

Similarity Thresholds

Threshold Interpretation
>0.99 Exact content duplicate
0.95-0.99 Near-duplicate (minor edits)
0.85-0.95 Similar document
<0.85 Different document

Output

{
  "document_id": "doc_abc",
  "near_duplicates": [
    {
      "id": "doc_xyz",
      "similarity": 0.97,
      "title": "Invoice #123 (v2)"
    },
    {
      "id": "doc_def",
      "similarity": 0.92,
      "title": "Invoice #123 (draft)"
    }
  ]
}

Acceptance Criteria

  • Near-duplicates detected with >90% precision
  • Query time <100ms
  • Threshold is configurable

UC-DUP-005: Compute Visual Similarity

Overview

ID        DUP-005
Title     Compute Visual Similarity (Images)
Actor     System
Priority  P2 (MVP Phase 3)

Description

Detect duplicate images even with different resolutions or minor edits using perceptual hashing.

Hash Types

Algorithm  Description
pHash      DCT-based perceptual hash; most robust to resizing and re-compression
dHash      Gradient-based difference hash; fast
aHash      Average-brightness hash; simplest and fastest, least robust

Steps

  1. Resize the image to a small fixed size (e.g. 8x8 for aHash, 9x8 for dHash, 32x32 for pHash)
  2. Convert to grayscale
  3. Compute perceptual hash
  4. Store hash for comparison
  5. Query for similar hashes (Hamming distance)
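
A sketch using the third-party imagehash library, which performs the resize and grayscale steps internally; the file names and distance threshold are illustrative:

from PIL import Image
import imagehash

def perceptual_hash(path: str) -> imagehash.ImageHash:
    """Compute a 64-bit pHash for the image at the given path."""
    return imagehash.phash(Image.open(path))

h1 = perceptual_hash("invoice_scan.png")        # illustrative file names
h2 = perceptual_hash("invoice_scan_small.png")
distance = h1 - h2            # Hamming distance between the two 64-bit hashes
similarity = 1 - distance / 64.0  # e.g. distance 2 -> 0.97, as in Output below
is_similar = distance <= 4    # illustrative threshold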

Output

{
  "document_id": "doc_img123",
  "phash": "8f0f0f0f0f0f0f0f",
  "similar_images": [
    {
      "id": "doc_img456",
      "hamming_distance": 2,
      "similarity": 0.97
    }
  ]
}

Acceptance Criteria

  • Detects resized duplicates
  • Tolerates light cropping (perceptual hashes degrade under heavy crops)
  • Tolerates minor quality changes

UC-DUP-006: Merge Duplicate Records

Overview

ID        DUP-006
Title     Merge Duplicate Records
Actor     User
Priority  P2

Description

Allow users to manually merge duplicate documents, combining metadata and tags.

Steps

  1. User selects documents to merge
  2. Choose primary document
  3. Merge tags from all documents
  4. Update references to point to primary
  5. Archive or delete secondary documents
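
An in-memory sketch of the tag-merge step (step 3). The document dicts and their tags field are illustrative; the reference updates and archival in steps 4-5 belong to the persistence layer:

def merge_documents(primary: dict, secondaries: list[dict]) -> dict:
    """Combine tags into the primary document without duplicates."""
    merged_tags = set(primary.get("tags", []))
    for doc in secondaries:
        merged_tags.update(doc.get("tags", []))
    primary["tags"] = sorted(merged_tags)
    # Steps 4-5 (repointing references, archiving secondaries) would be
    # transactional database updates in a real implementation.
    return primary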

Acceptance Criteria

  • Tags are combined without duplicates
  • References are updated
  • Merge history is logged

UC-DUP-007: Archive/Delete Duplicates

Overview

ID        DUP-007
Title     Archive/Delete Duplicates
Actor     User, Admin
Priority  P2

Description

Remove or archive confirmed duplicate documents.

Options

Action   Behavior
Archive  Move to archive, retain metadata
Delete   Soft delete, recoverable
Purge    Hard delete, permanent
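
A minimal sketch of the Delete and Purge behaviors, assuming a documents table with a nullable deleted_at column (an assumed schema, not one specified here):

import sqlite3
from datetime import datetime, timezone

def soft_delete(conn: sqlite3.Connection, document_id: str) -> None:
    # Delete: mark the row but keep the data, so the action is reversible.
    conn.execute(
        "UPDATE documents SET deleted_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), document_id),
    )

def purge(conn: sqlite3.Connection, document_id: str) -> None:
    # Purge: remove the row permanently; blob storage would also be freed here.
    conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))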

Acceptance Criteria

  • Duplicates can be archived
  • Deletion is reversible (soft delete)
  • Storage is reclaimed after purge

UC-DUP-008: Generate Dedup Report

Overview

ID        DUP-008
Title     Generate Dedup Report
Actor     User, Admin
Priority  P2

Description

Generate a report of detected duplicates and storage savings.

Report Contents

Section           Details
Summary           Total docs, duplicates found, storage saved
Exact Duplicates  List with file sizes
Near-Duplicates   List with similarity scores
Recommendations   Suggested actions

Output

{
  "generated_at": "2024-01-15T10:00:00Z",
  "summary": {
    "total_documents": 10000,
    "exact_duplicates": 1500,
    "near_duplicates": 800,
    "potential_savings_gb": 45.2
  },
  "exact_duplicate_groups": [...],
  "near_duplicate_groups": [...]
}
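
A sketch of the savings calculation: within each exact-duplicate group, every copy after the first is reclaimable. The size_bytes field is an assumed attribute of each document record:

def potential_savings_gb(exact_duplicate_groups: list[list[dict]]) -> float:
    """Sum the sizes of all redundant copies across duplicate groups."""
    saved_bytes = sum(
        sum(doc["size_bytes"] for doc in group[1:])  # keep the first copy
        for group in exact_duplicate_groups
    )
    return round(saved_bytes / 1024**3, 1)  # bytes -> GiB, one decimal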

Acceptance Criteria

  • Report includes all duplicate types
  • Storage savings calculated
  • Export to CSV/PDF available
