Ingestion Use Cases (ING)¶

Module Purpose: Document upload, validation, and preparation for processing. This module contains six use cases that together form the entry point through which every document enters the system.

Use Case Quick Reference¶

ID	Title	Priority
ING-001	Upload Single Document	P1
ING-002	Batch Upload Documents	P1
ING-003	Detect File Format	P1
ING-004	Extract Document Metadata	P1
ING-005	Queue Document for Processing	P1
ING-006	Handle Upload Failure	P2

UC-ING-001: Upload Single Document¶

Overview¶

Field	Value
ID	ING-001
Title	Upload Single Document
Actor	User, API Client
Priority	P1 (MVP Phase 1)

Description¶

Accept a single document upload via REST API or web UI, validate the file, and store it for processing.

Preconditions¶

User is authenticated
File size ≤ 100MB
File format is supported (PDF, PNG, JPG, TIFF)

Input¶

Field	Type	Required	Description
file	Binary	Yes	Document file
filename	String	Yes	Original filename
tags	Array[String]	No	Optional initial tags

Steps¶

Receive file via multipart/form-data
Validate file size
Detect file format (→ ING-003)
Compute file hash (→ DUP-001)
Check for exact duplicate (→ DUP-002)
Store file in object storage
Create document record in database
Queue for processing (→ ING-005)
Return document ID and status

Output¶

{
  "id": "doc_abc123",
  "filename": "invoice.pdf",
  "size_bytes": 125000,
  "mime_type": "application/pdf",
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "status": "queued",
  "created_at": "2024-01-15T10:30:00Z"
}

Error Handling¶

Error	HTTP Status	Message
File too large	413	File exceeds 100MB limit
Unsupported format	415	File format not supported
Duplicate found	409	Exact duplicate exists (doc_xyz)
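The size check, hashing, and duplicate check from the steps above, together with the status codes in the error table, could be sketched roughly as follows. This is a minimal illustration, not the system's implementation: `UploadError`, `validate_and_hash`, and the in-memory `known_hashes` set are hypothetical names, and reading "100MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
import hashlib

# Assumption: "100MB" is read as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


class UploadError(Exception):
    """Hypothetical error type carrying the HTTP status from the error table."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status
        self.message = message


def validate_and_hash(data: bytes, known_hashes: set,
                      max_size: int = MAX_SIZE_BYTES) -> str:
    """Validate size, compute the MD5 hash, and reject exact duplicates.

    Runs before the file is stored, as the acceptance criteria require.
    `known_hashes` stands in for the real duplicate lookup (DUP-002).
    """
    if len(data) > max_size:
        raise UploadError(413, "File exceeds 100MB limit")
    digest = hashlib.md5(data).hexdigest()
    if digest in known_hashes:
        raise UploadError(409, "Exact duplicate exists")
    return digest
```

A caller would store the file and create the document record only after this function returns without raising.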

Acceptance Criteria¶

File uploads complete in <30s for files ≤50MB
All supported formats are accepted
Unsupported formats return 415
Duplicate check is performed before storage
Document record is created with correct metadata

UC-ING-002: Batch Upload Documents¶

Overview¶

Field	Value
ID	ING-002
Title	Batch Upload Documents
Actor	User, CLI
Priority	P1 (MVP Phase 1)

Description¶

Accept multiple documents in a single request or ZIP archive for bulk processing.

Preconditions¶

User is authenticated
Total size ≤ 1GB
Max 100 documents per batch

Input¶

Field	Type	Required	Description
files	Array[Binary]	Yes, unless archive is provided	Multiple files
archive	Binary	No	ZIP archive (alternative to files)

Steps¶

Receive files or extract from archive
Validate total size and count
For each file, execute ING-001 (single upload) and track its success or failure
Return batch summary

Output¶

{
  "batch_id": "batch_xyz789",
  "total": 50,
  "succeeded": 48,
  "failed": 2,
  "documents": [
    {"id": "doc_001", "status": "queued"},
    {"id": "doc_002", "status": "queued"},
    ...
  ],
  "errors": [
    {"filename": "corrupt.pdf", "error": "Invalid PDF format"},
    {"filename": "huge.png", "error": "File exceeds size limit"}
  ]
}
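The partial-success behaviour of a batch could be sketched as below. This is an illustration under stated assumptions: `upload_batch` is a hypothetical name, `upload_one` stands in for ING-001 and is assumed to return a document ID or raise on failure, and "1GB" is read as 1024³ bytes. Generating the `batch_id` is left out.

```python
from typing import Callable, Dict

MAX_BATCH_FILES = 100               # from the preconditions
MAX_BATCH_BYTES = 1024 ** 3         # assumption: "1GB" read as 1024^3 bytes


def upload_batch(files: Dict[str, bytes],
                 upload_one: Callable[[str, bytes], str]) -> dict:
    """Run the single-upload step per file and collect a batch summary."""
    if len(files) > MAX_BATCH_FILES:
        raise ValueError("Batch exceeds 100 documents")
    if sum(len(data) for data in files.values()) > MAX_BATCH_BYTES:
        raise ValueError("Batch exceeds 1GB total size")

    documents, errors = [], []
    for filename, data in files.items():
        try:
            doc_id = upload_one(filename, data)
            documents.append({"id": doc_id, "status": "queued"})
        except Exception as exc:
            # One failing file must not abort the rest of the batch.
            errors.append({"filename": filename, "error": str(exc)})

    return {
        "total": len(files),
        "succeeded": len(documents),
        "failed": len(errors),
        "documents": documents,
        "errors": errors,
    }
```

Catching `Exception` broadly keeps the sketch short. A real implementation would probably catch a narrower upload error type so that programming errors still surface.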

Acceptance Criteria¶

Supports multipart upload of multiple files
Supports ZIP archive upload
Partial success is allowed (some files fail, others succeed)
Batch status tracks overall progress

UC-ING-003: Detect File Format¶

Overview¶

Field	Value
ID	ING-003
Title	Detect File Format
Actor	System
Priority	P1 (MVP Phase 1)

Description¶

Identify the actual file format from the file's magic bytes rather than relying on the file extension alone.

Supported Formats¶

Format	MIME Type	Magic Bytes
PDF	application/pdf	`%PDF`
PNG	image/png	`89 50 4E 47`
JPEG	image/jpeg	`FF D8 FF`
TIFF	image/tiff	`49 49 2A 00` or `4D 4D 00 2A`
GIF	image/gif	`47 49 46 38`

Steps¶

Read the first 8 bytes of the file
Match them against the known magic bytes
If a signature matches, return the detected MIME type
If no signature matches, attempt extension-based detection
If the format is still unknown, return an error
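The detection steps could be sketched as follows, using the signatures from the Supported Formats table. The function name `detect_format`, the extension map, and the `"low"` / `"extension"` values returned by the fallback are assumptions made for illustration. The source only shows the `"high"` / `"magic_bytes"` case.

```python
import os

# Signatures taken from the Supported Formats table.
MAGIC_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"\x89PNG", "image/png"),        # 89 50 4E 47
    (b"\xff\xd8\xff", "image/jpeg"),  # FF D8 FF
    (b"II*\x00", "image/tiff"),       # 49 49 2A 00 (little-endian)
    (b"MM\x00*", "image/tiff"),       # 4D 4D 00 2A (big-endian)
    (b"GIF8", "image/gif"),           # 47 49 46 38
]

# Assumed extension map, used only as a fallback.
EXTENSION_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".gif": "image/gif",
}


def detect_format(header: bytes, filename: str = "") -> dict:
    """Detect the MIME type from magic bytes, then from the extension."""
    head = header[:8]
    for signature, mime in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return {"detected_type": mime, "confidence": "high",
                    "method": "magic_bytes"}

    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_TYPES:
        return {"detected_type": EXTENSION_TYPES[ext], "confidence": "low",
                "method": "extension"}

    raise ValueError("File format not supported")
```

Because magic bytes are checked first, a PDF renamed to `scan.png` is still reported as `application/pdf`.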

Output¶

{
  "detected_type": "application/pdf",
  "confidence": "high",
  "method": "magic_bytes"
}

Acceptance Criteria¶

Correctly identifies all supported formats
Does not rely solely on file extension
Returns error for unsupported formats

UC-ING-004: Extract Document Metadata¶

Overview¶

Field	Value
ID	ING-004
Title	Extract Document Metadata
Actor	System
Priority	P1 (MVP Phase 1)

Description¶

Extract metadata from document files (PDF info, EXIF data, etc.).

Extracted Fields¶

Source	Fields
PDF	Title, Author, Subject, Creator, Creation Date, Page Count
Image	Width, Height, DPI, Color Space, EXIF (camera, date, GPS)

Steps¶

Open file for reading
Select the appropriate extraction library based on the detected format
Extract available metadata
Normalize to common schema
Store in document record
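The normalization step could look something like the sketch below for PDF files. It assumes the extraction library has already produced a raw info dictionary keyed by PDF names such as `/Title` and `/CreationDate`. That input shape, and the function names, are assumptions for illustration. Missing or unparseable values become `None` instead of raising, in line with the requirement to handle corrupt metadata gracefully.

```python
from datetime import datetime
from typing import Optional


def parse_pdf_date(raw: Optional[str]) -> Optional[str]:
    """Convert a PDF date string such as 'D:20240110093000Z' to 'YYYY-MM-DD'.

    Returns None when the value is missing or cannot be parsed.
    """
    if not raw:
        return None
    text = raw[2:] if raw.startswith("D:") else raw
    try:
        return datetime.strptime(text[:8], "%Y%m%d").date().isoformat()
    except ValueError:
        return None


def normalize_pdf_metadata(info: dict, page_count: int) -> dict:
    """Map a raw PDF info dictionary onto the common schema."""
    return {
        "title": info.get("/Title"),
        "author": info.get("/Author"),
        "creation_date": parse_pdf_date(info.get("/CreationDate")),
        "page_count": page_count,
    }
```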

Output¶

{
  "pdf_metadata": {
    "title": "Invoice #12345",
    "author": "Accounting Dept",
    "creation_date": "2024-01-10",
    "page_count": 3
  },
  "file_metadata": {
    "size_bytes": 125000,
    "mime_type": "application/pdf"
  }
}

Acceptance Criteria¶

Extracts PDF metadata fields
Extracts image EXIF data
Handles missing/corrupt metadata gracefully

UC-ING-005: Queue Document for Processing¶

Overview¶

Field	Value
ID	ING-005
Title	Queue Document for Processing
Actor	System
Priority	P1 (MVP Phase 1)

Description¶

Add document to processing queue for OCR, classification, and tagging.

Steps¶

Create processing job record
Determine processing pipeline based on file type
Enqueue to appropriate worker queue
Update document status to "queued"

Queue Routing¶

File Type	Queue	Workers
Scanned PDF	ocr_queue	OCR Workers
Native PDF	text_queue	Text Extraction
Image	ocr_queue	OCR Workers
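The routing table could be expressed as a small function. How a scanned PDF is told apart from a native one is not specified here, so the `has_text_layer` flag below is an assumption, as is the function name `select_queue`.

```python
def select_queue(mime_type: str, has_text_layer: bool = False) -> str:
    """Pick the worker queue for a document, following the routing table.

    `has_text_layer` is a hypothetical flag: True for a native PDF with
    extractable text, False for a scanned PDF.
    """
    if mime_type == "application/pdf":
        return "text_queue" if has_text_layer else "ocr_queue"
    if mime_type.startswith("image/"):
        return "ocr_queue"
    raise ValueError(f"No queue configured for {mime_type}")
```

For example, `select_queue("application/pdf", has_text_layer=True)` routes to `text_queue`, while any `image/*` type goes to `ocr_queue`.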

Output¶

{
  "job_id": "job_123",
  "document_id": "doc_abc",
  "queue": "ocr_queue",
  "position": 42,
  "estimated_wait": "2 minutes"
}

Acceptance Criteria¶

Jobs are created for all uploaded documents
Correct queue is selected based on file type
Job status is trackable

UC-ING-006: Handle Upload Failure¶

Overview¶

Field	Value
ID	ING-006
Title	Handle Upload Failure
Actor	System
Priority	P2 (MVP Phase 1)

Description¶

Gracefully handle upload failures with proper cleanup and error reporting.

Failure Scenarios¶

Scenario	Action
Connection lost mid-upload	Discard partial file
Virus detected	Quarantine and alert
Storage full	Return 507 error
Timeout	Allow resume if supported

Steps¶

Detect failure condition
Log failure details
Clean up partial resources
Notify user of failure
Provide recovery options if applicable
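For the "connection lost mid-upload" scenario, the detect, log, and clean-up steps could be sketched as follows. This is one possible approach, not a description of the actual system: it assumes local file storage, writes to a hypothetical `.part` file, and renames it into place only when the upload completes. Object storage would need its own equivalent.

```python
import logging
import os
from typing import Iterable

logger = logging.getLogger("ingestion")


def store_upload(chunks: Iterable[bytes], dest_path: str) -> int:
    """Write an upload to disk, leaving no partial file behind on failure."""
    partial_path = dest_path + ".part"
    written = 0
    try:
        with open(partial_path, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
                written += len(chunk)
        # Rename only after the full upload has been written.
        os.replace(partial_path, dest_path)
        return written
    except Exception:
        # Log the failure, discard the partial file, and let the caller
        # turn the exception into an error response for the user.
        logger.exception("Upload to %s failed after %d bytes", dest_path, written)
        if os.path.exists(partial_path):
            os.remove(partial_path)
        raise
```

Because the final path only appears after a successful rename, readers never see a half-written document.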

Acceptance Criteria¶

No orphaned files on failure
Clear error messages returned
Failures are logged for debugging
