Deduplication Use Cases (DUP)
Module Purpose: Identify and handle duplicate documents using hash-based, content-based, and visual similarity methods. This module contains 8 use cases.
Use Case Quick Reference
| ID | Title | Priority |
|---|---|---|
| DUP-001 | Compute File Hash (MD5/SHA256) | P1 |
| DUP-002 | Check Exact Duplicate | P1 |
| DUP-003 | Compute Document Embedding | P1 |
| DUP-004 | Find Near-Duplicates | P1 |
| DUP-005 | Compute Visual Similarity (Images) | P2 |
| DUP-006 | Merge Duplicate Records | P2 |
| DUP-007 | Archive/Delete Duplicates | P2 |
| DUP-008 | Generate Dedup Report | P2 |
UC-DUP-001: Compute File Hash
Overview
| Field | Value |
|---|---|
| ID | DUP-001 |
| Title | Compute File Hash (MD5/SHA256) |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Calculate cryptographic hashes of uploaded files for exact duplicate detection.
Steps
- Read file in chunks (64KB)
- Update MD5 and SHA256 hash objects
- Finalize and store both hashes
Output
```json
{
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
```
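The chunked hashing steps above can be sketched as follows; the function name and return shape are illustrative, not the module's actual API:

```python
import hashlib

def compute_file_hashes(path: str, chunk_size: int = 64 * 1024) -> dict:
    """One pass over the file, feeding each 64 KB chunk to both digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return {"hash_md5": md5.hexdigest(), "hash_sha256": sha256.hexdigest()}
```

Streaming in chunks keeps memory flat regardless of file size; updating both hash objects in the same pass avoids reading the file twice.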
Acceptance Criteria
UC-DUP-002: Check Exact Duplicate
Overview
| Field | Value |
|---|---|
| ID | DUP-002 |
| Title | Check Exact Duplicate |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Check if a file with the same hash already exists in the system.
Steps
- Query database for existing hash match
- If match found:
  - Return existing document ID
  - Link new upload to existing
- If no match, proceed with processing
Output
```json
{
  "is_duplicate": true,
  "existing_document_id": "doc_xyz",
  "match_type": "exact_hash"
}
```
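A minimal sketch of the lookup, assuming a `documents` table with `id` and `hash_sha256` columns (the schema is hypothetical, shown here with SQLite for self-containment):

```python
import sqlite3

def check_exact_duplicate(conn: sqlite3.Connection, sha256: str) -> dict:
    """Look up the hash; on a hit, return the existing document so the
    new upload can be linked to it instead of being reprocessed."""
    row = conn.execute(
        "SELECT id FROM documents WHERE hash_sha256 = ?", (sha256,)
    ).fetchone()
    if row is not None:
        return {"is_duplicate": True,
                "existing_document_id": row[0],
                "match_type": "exact_hash"}
    return {"is_duplicate": False,
            "existing_document_id": None,
            "match_type": None}
```

An index on `hash_sha256` keeps this check O(log n) even with millions of documents.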
Acceptance Criteria
UC-DUP-003: Compute Document Embedding
Overview
| Field | Value |
|---|---|
| ID | DUP-003 |
| Title | Compute Document Embedding |
| Actor | Embedding Worker |
| Priority | P1 (MVP Phase 2) |
Description
Generate vector embedding from document text for semantic similarity detection.
Steps
- Retrieve extracted text from document
- Truncate/chunk if >512 tokens
- Pass through embedding model
- Store embedding in vector database
Model Configuration
| Setting | Value |
|---|---|
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Max tokens | 512 |
| Pooling | Mean |
Output
```json
{
  "document_id": "doc_abc",
  "embedding_id": "emb_123",
  "dimensions": 384,
  "model": "all-MiniLM-L6-v2"
}
```
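The truncation and pooling steps can be sketched as below. `token_encoder` is a stand-in for the all-MiniLM-L6-v2 forward pass (in practice the `sentence-transformers` library handles tokenization, truncation, and pooling internally); the whitespace tokenizer here is a deliberate simplification:

```python
import numpy as np

MAX_TOKENS = 512   # model input limit
DIMENSIONS = 384   # all-MiniLM-L6-v2 output width

def embed_document(text: str, token_encoder) -> np.ndarray:
    """`token_encoder` maps a token list to an (n_tokens, 384) array
    of per-token vectors; mean pooling collapses them to one vector."""
    tokens = text.split()[:MAX_TOKENS]      # naive truncation to the limit
    token_vecs = token_encoder(tokens)      # shape: (n_tokens, DIMENSIONS)
    return token_vecs.mean(axis=0)          # mean pooling -> (DIMENSIONS,)
```

For documents longer than 512 tokens, chunking and averaging the chunk embeddings is a common alternative to plain truncation, at the cost of blurring distinct sections together.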
Acceptance Criteria
UC-DUP-004: Find Near-Duplicates
Overview
| Field | Value |
|---|---|
| ID | DUP-004 |
| Title | Find Near-Duplicates |
| Actor | System |
| Priority | P1 (MVP Phase 2) |
Description
Identify documents with similar content using embedding similarity.
Steps
- Retrieve document embedding
- Query vector store for similar documents
- Apply similarity threshold (configurable)
- Return ranked list of candidates
Similarity Thresholds
| Threshold | Interpretation |
|---|---|
| >0.99 | Exact content duplicate |
| 0.95-0.99 | Near-duplicate (minor edits) |
| 0.85-0.95 | Similar document |
| <0.85 | Different document |
Output
```json
{
  "document_id": "doc_abc",
  "near_duplicates": [
    {
      "id": "doc_xyz",
      "similarity": 0.97,
      "title": "Invoice #123 (v2)"
    },
    {
      "id": "doc_def",
      "similarity": 0.92,
      "title": "Invoice #123 (draft)"
    }
  ]
}
```
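The search and thresholding steps reduce to a cosine-similarity ranking. A brute-force sketch (a real deployment would delegate this to the vector store's ANN index; the function signature is illustrative):

```python
import numpy as np

def find_near_duplicates(query: np.ndarray,
                         corpus: dict,
                         threshold: float = 0.95) -> list:
    """Rank corpus documents by cosine similarity to the query embedding,
    keeping only those at or above the configurable threshold."""
    q = query / np.linalg.norm(query)
    hits = []
    for doc_id, vec in corpus.items():
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim >= threshold:
            hits.append({"id": doc_id, "similarity": round(sim, 4)})
    return sorted(hits, key=lambda h: h["similarity"], reverse=True)
```

Normalizing both vectors first makes the dot product equal to cosine similarity, matching the threshold table above.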
Acceptance Criteria
UC-DUP-005: Compute Visual Similarity
Overview
| Field | Value |
|---|---|
| ID | DUP-005 |
| Title | Compute Visual Similarity (Images) |
| Actor | System |
| Priority | P2 (MVP Phase 3) |
Description
Detect duplicate images even with different resolutions or minor edits using perceptual hashing.
Hash Types
| Algorithm | Use Case |
|---|---|
| pHash | DCT-based perceptual hash; most robust to minor edits |
| dHash | Difference (gradient) hash; fast with good accuracy |
| aHash | Average hash; simplest and fastest, least robust |
Steps
- Resize image to standard size (8x8 or 16x16)
- Convert to grayscale
- Compute perceptual hash
- Store hash for comparison
- Query for similar hashes (Hamming distance)
Output
```json
{
  "document_id": "doc_img123",
  "phash": "8f0f0f0f0f0f0f0f",
  "similar_images": [
    {
      "id": "doc_img456",
      "hamming_distance": 2,
      "similarity": 0.97
    }
  ]
}
```
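The hashing and comparison steps can be sketched with aHash, the simplest of the three. This assumes the image has already been resized and converted to grayscale (steps 1-2, e.g. via Pillow); the input here is just the resulting pixel grid:

```python
def average_hash(gray: list) -> int:
    """aHash over an already-resized grayscale grid (e.g. 8x8):
    each bit is 1 if that pixel is above the mean intensity."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances mean visually similar."""
    return bin(a ^ b).count("1")
```

Because the hash is computed on a fixed small grid, two copies of the same photo at different resolutions produce identical or near-identical hashes, which is exactly what exact-hash comparison (DUP-001) cannot do.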
Acceptance Criteria
UC-DUP-006: Merge Duplicate Records
Overview
| Field | Value |
|---|---|
| ID | DUP-006 |
| Title | Merge Duplicate Records |
| Actor | User |
| Priority | P2 |
Description
Allow users to manually merge duplicate documents, combining metadata and tags.
Steps
- User selects documents to merge
- Choose primary document
- Merge tags from all documents
- Update references to point to primary
- Archive or delete secondary documents
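The merge steps can be sketched over plain document records; the field names (`tags`, `merged_into`, `status`) are illustrative, not the module's actual schema:

```python
def merge_documents(primary: dict, secondaries: list) -> dict:
    """Union all tag sets into the primary and mark each secondary as
    merged; rewriting inbound references is left to the caller."""
    tags = set(primary.get("tags", []))
    for doc in secondaries:
        tags.update(doc.get("tags", []))
        doc["merged_into"] = primary["id"]   # repoint to the primary
        doc["status"] = "archived"
    primary["tags"] = sorted(tags)
    return primary
```

Keeping the secondaries as archived stubs that point at the primary preserves an audit trail and lets old links resolve via redirect.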
Acceptance Criteria
UC-DUP-007: Archive/Delete Duplicates
Overview
| Field | Value |
|---|---|
| ID | DUP-007 |
| Title | Archive/Delete Duplicates |
| Actor | User, Admin |
| Priority | P2 |
Description
Remove or archive confirmed duplicate documents.
Options
| Action | Behavior |
|---|---|
| Archive | Move to archive, retain metadata |
| Delete | Soft delete, recoverable |
| Purge | Hard delete, permanent |
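The three options map naturally onto an enum; this is a hypothetical sketch of the dispatch, not the module's API:

```python
from enum import Enum

class DedupAction(Enum):
    ARCHIVE = "archive"  # move to archive, retain metadata
    DELETE = "delete"    # soft delete, recoverable
    PURGE = "purge"      # hard delete, permanent

def apply_action(doc: dict, action: DedupAction):
    """Return the updated record, or None once the record is purged."""
    if action is DedupAction.PURGE:
        return None                      # hard delete: record is gone
    doc["status"] = action.value + "d"   # "archived" / "deleted"
    return doc
```

Soft-deleted records keep their row (and hashes), so a re-upload of the same file is still caught by DUP-002 until the record is purged.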
Acceptance Criteria
UC-DUP-008: Generate Dedup Report
Overview
| Field | Value |
|---|---|
| ID | DUP-008 |
| Title | Generate Dedup Report |
| Actor | User, Admin |
| Priority | P2 |
Description
Generate a report of detected duplicates and storage savings.
Report Contents
| Section | Details |
|---|---|
| Summary | Total docs, duplicates found, storage saved |
| Exact Duplicates | List with file sizes |
| Near-Duplicates | List with similarity scores |
| Recommendations | Suggested actions |
Output
```json
{
  "generated_at": "2024-01-15T10:00:00Z",
  "summary": {
    "total_documents": 10000,
    "exact_duplicates": 1500,
    "near_duplicates": 800,
    "potential_savings_gb": 45.2
  },
  "exact_duplicate_groups": [...],
  "near_duplicate_groups": [...]
}
```
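The summary figures reduce to a simple aggregation over duplicate groups. A sketch, assuming each group lists the original document first followed by its duplicate copies (the group layout and `size_bytes` field are illustrative):

```python
def summarize(duplicate_groups: list) -> dict:
    """Count duplicate copies and sum their sizes; only the copies
    beyond the first document in each group count as savings."""
    dup_count = 0
    savings_bytes = 0
    for group in duplicate_groups:
        for doc in group[1:]:            # skip the original in each group
            dup_count += 1
            savings_bytes += doc["size_bytes"]
    return {"exact_duplicates": dup_count,
            "potential_savings_gb": round(savings_bytes / 1024**3, 2)}
```

Counting only `group[1:]` avoids the classic off-by-one where the kept original is reported as reclaimable storage.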
Acceptance Criteria
← Back to Use Cases | Previous: Ingestion | Next: Classification →