Ingestion Use Cases (ING)
Module Purpose: Document upload, validation, and preparation for processing. This module contains 6 use cases covering the initial entry point for all documents into the system.
Use Case Quick Reference
| ID | Title | Priority |
|----|-------|----------|
| ING-001 | Upload Single Document | P1 |
| ING-002 | Batch Upload Documents | P1 |
| ING-003 | Detect File Format | P1 |
| ING-004 | Extract Document Metadata | P1 |
| ING-005 | Queue Document for Processing | P1 |
| ING-006 | Handle Upload Failure | P2 |
UC-ING-001: Upload Single Document
Overview
| Field | Value |
|-------|-------|
| ID | ING-001 |
| Title | Upload Single Document |
| Actor | User, API Client |
| Priority | P1 (MVP Phase 1) |
Description
Accept a single document upload via REST API or web UI, validate the file, and store it for processing.
Preconditions
- User is authenticated
- File size ≤ 100MB
- File format is supported (PDF, PNG, JPG, TIFF)
Input
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| file | Binary | Yes | Document file |
| filename | String | Yes | Original filename |
| tags | Array[String] | No | Optional initial tags |
Steps
- Receive file via multipart/form-data
- Validate file size
- Detect file format (→ ING-003)
- Compute file hash (→ DUP-001)
- Check for exact duplicate (→ DUP-002)
- Store file in object storage
- Create document record in database
- Queue for processing (→ ING-005)
- Return document ID and status
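The size check and hash computation in the steps above can share a single pass over the upload. The following is a minimal sketch, not the system's actual implementation: the names `hash_and_measure`, `FileTooLargeError`, and `MAX_SIZE_BYTES` are hypothetical, and it assumes "100MB" means 100 × 1024 × 1024 bytes. MD5 is used only because the sample output reports a `hash_md5` field.

```python
import hashlib
import io

# Assumption: the 100MB precondition is interpreted as binary megabytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


class FileTooLargeError(Exception):
    """Hypothetical error type; a caller would translate it to HTTP 413."""


def hash_and_measure(stream, chunk_size=64 * 1024):
    """Read the stream once and return (size_bytes, md5_hex).

    Raises FileTooLargeError as soon as the running size passes the limit,
    so an oversized upload is rejected without reading it to the end.
    """
    digest = hashlib.md5()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        if size > MAX_SIZE_BYTES:
            raise FileTooLargeError("File exceeds 100MB limit")
        digest.update(chunk)
    return size, digest.hexdigest()


# Usage: any object with a .read(n) method works, e.g. an uploaded file handle.
size, md5_hex = hash_and_measure(io.BytesIO(b"hello"))
```

Reading in chunks keeps memory use flat regardless of file size, which matters when the limit is as high as 100MB.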
Output
{
  "id": "doc_abc123",
  "filename": "invoice.pdf",
  "size_bytes": 125000,
  "mime_type": "application/pdf",
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "status": "queued",
  "created_at": "2024-01-15T10:30:00Z"
}
Error Handling
| Error | HTTP Status | Message |
|-------|-------------|---------|
| File too large | 413 | File exceeds 100MB limit |
| Unsupported format | 415 | File format not supported |
| Duplicate found | 409 | Exact duplicate exists (doc_xyz) |
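One way to keep the status codes and messages from this table in a single place is an enumeration. This is an illustrative sketch only: the `UploadError` enum, its `response` method, and the shape of the returned body (`error` and `message` keys) are assumptions, since the source does not specify the error response format.

```python
from enum import Enum


class UploadError(Enum):
    """Hypothetical catalogue of the errors listed in the table above."""

    FILE_TOO_LARGE = (413, "File exceeds 100MB limit")
    UNSUPPORTED_FORMAT = (415, "File format not supported")
    DUPLICATE_FOUND = (409, "Exact duplicate exists ({existing_id})")

    def response(self, **details):
        """Return (http_status, body); the body layout is an assumption."""
        status, template = self.value
        body = {"error": self.name.lower(), "message": template.format(**details)}
        return status, body


# Usage: the duplicate message carries the ID of the existing document.
status, body = UploadError.DUPLICATE_FOUND.response(existing_id="doc_xyz")
```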
Acceptance Criteria
UC-ING-002: Batch Upload Documents
Overview
| Field | Value |
|-------|-------|
| ID | ING-002 |
| Title | Batch Upload Documents |
| Actor | User, CLI |
| Priority | P1 (MVP Phase 1) |
Description
Accept multiple documents in a single request or ZIP archive for bulk processing.
Preconditions
- User is authenticated
- Total size ≤ 1GB
- Max 100 documents per batch
Input
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| files | Array[Binary] | Yes | Multiple files |
| archive | Binary | No | ZIP archive (alternative) |
Steps
- Receive files or extract from archive
- Validate total size and count
- For each file:
  - Execute ING-001 (single upload)
  - Track success/failure
- Return batch summary
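The per-file loop and summary can be sketched as follows. This is a hedged illustration rather than the actual implementation: `upload_batch` and its `upload_one` callback (standing in for ING-001) are hypothetical names, the `batch_` ID format is invented for the example, and "1GB" is assumed to mean 1024³ bytes.

```python
import uuid

MAX_BATCH_DOCUMENTS = 100
MAX_BATCH_BYTES = 1024 ** 3  # Assumption: 1GB interpreted as binary gigabyte.


def upload_batch(files, upload_one):
    """Upload each (filename, data) pair and collect a batch summary.

    upload_one is a hypothetical stand-in for ING-001: it returns a
    document ID on success and raises an exception on failure.
    """
    files = list(files)
    if len(files) > MAX_BATCH_DOCUMENTS:
        raise ValueError("Batch exceeds 100 documents")
    if sum(len(data) for _, data in files) > MAX_BATCH_BYTES:
        raise ValueError("Batch exceeds 1GB total size")

    documents, errors = [], []
    for filename, data in files:
        try:
            doc_id = upload_one(filename, data)
            documents.append({"id": doc_id, "status": "queued"})
        except Exception as exc:
            # One bad file must not abort the rest of the batch.
            errors.append({"filename": filename, "error": str(exc)})

    return {
        "batch_id": f"batch_{uuid.uuid4().hex[:8]}",  # ID format is illustrative.
        "total": len(files),
        "succeeded": len(documents),
        "failed": len(errors),
        "documents": documents,
        "errors": errors,
    }
```

Catching exceptions per file is what lets the summary report partial success, as in the sample output where 48 of 50 documents succeed.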
Output
{
  "batch_id": "batch_xyz789",
  "total": 50,
  "succeeded": 48,
  "failed": 2,
  "documents": [
    {"id": "doc_001", "status": "queued"},
    {"id": "doc_002", "status": "queued"},
    ...
  ],
  "errors": [
    {"filename": "corrupt.pdf", "error": "Invalid PDF format"},
    {"filename": "huge.png", "error": "File exceeds size limit"}
  ]
}
Acceptance Criteria
UC-ING-003: Detect File Format
Overview
| Field | Value |
|-------|-------|
| ID | ING-003 |
| Title | Detect File Format |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Identify the actual file format using magic bytes, not just file extension.
| Format | MIME Type | Magic Bytes |
|--------|-----------|-------------|
| PDF | application/pdf | %PDF |
| PNG | image/png | 89 50 4E 47 |
| JPEG | image/jpeg | FF D8 FF |
| TIFF | image/tiff | 49 49 2A 00 or 4D 4D 00 2A |
| GIF | image/gif | 47 49 46 38 |
Steps
- Read first 8 bytes of file
- Match against known magic bytes
- Return detected MIME type
- If unknown, attempt extension-based detection
- If still unknown, return error
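The detection steps above can be sketched as a prefix match against the signatures in the table, with the file extension as a fallback. The function name `detect_format`, the extension map, and the `"low"` / `"extension"` values returned by the fallback path are assumptions; the source only shows the `"high"` / `"magic_bytes"` result.

```python
import os

# Signatures taken from the table above, written as byte strings.
MAGIC_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"\x89PNG", "image/png"),        # 89 50 4E 47
    (b"\xff\xd8\xff", "image/jpeg"),  # FF D8 FF
    (b"II*\x00", "image/tiff"),       # 49 49 2A 00 (little-endian)
    (b"MM\x00*", "image/tiff"),       # 4D 4D 00 2A (big-endian)
    (b"GIF8", "image/gif"),           # 47 49 46 38
]

# Assumption: extension fallback map; not specified in the source.
EXTENSION_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".gif": "image/gif",
}


def detect_format(header, filename=""):
    """Detect a MIME type from the first bytes of a file, then its extension."""
    head = header[:8]  # Only the first 8 bytes are inspected.
    for signature, mime_type in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return {"detected_type": mime_type, "confidence": "high", "method": "magic_bytes"}

    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSION_TYPES:
        # "low" and "extension" are illustrative values for the fallback path.
        return {"detected_type": EXTENSION_TYPES[extension], "confidence": "low", "method": "extension"}

    raise ValueError("Unknown file format")
```

Checking magic bytes first means a file named `report.pdf` that actually contains PNG data is reported as `image/png`, which is the point of not trusting the extension.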
Output
{
  "detected_type": "application/pdf",
  "confidence": "high",
  "method": "magic_bytes"
}
Acceptance Criteria
UC-ING-004: Extract Document Metadata
Overview
| Field | Value |
|-------|-------|
| ID | ING-004 |
| Title | Extract Document Metadata |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Extract metadata from document files (PDF info, EXIF data, etc.).
| Source | Fields |
|--------|--------|
| PDF | Title, Author, Subject, Creator, Creation Date, Page Count |
| Image | Width, Height, DPI, Color Space, EXIF (camera, date, GPS) |
Steps
- Open file for reading
- Based on format, use appropriate library
- Extract available metadata
- Normalize to common schema
- Store in document record
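The normalization step can be illustrated for the PDF case. This sketch assumes a PDF library has already produced the raw document information dictionary (with standard keys such as `/Title` and `/CreationDate`) and a page count; it does not call any specific library. The function names are hypothetical, and the output layout follows the sample output of this use case.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Convert a PDF date string such as 'D:20240110093000Z' to 'YYYY-MM-DD'.

    Returns None when the value is missing or lacks a full year-month-day.
    """
    digits = value[2:] if value.startswith("D:") else value
    try:
        return datetime.strptime(digits[:8], "%Y%m%d").date().isoformat()
    except ValueError:
        return None


def normalize_pdf_metadata(info, page_count, size_bytes):
    """Map a raw PDF info dictionary onto the common metadata schema."""
    return {
        "pdf_metadata": {
            "title": info.get("/Title"),
            "author": info.get("/Author"),
            "creation_date": parse_pdf_date(info.get("/CreationDate", "")),
            "page_count": page_count,
        },
        "file_metadata": {
            "size_bytes": size_bytes,
            "mime_type": "application/pdf",
        },
    }


# Usage with a hand-written info dictionary standing in for library output.
raw_info = {
    "/Title": "Invoice #12345",
    "/Author": "Accounting Dept",
    "/CreationDate": "D:20240110093000Z",
}
metadata = normalize_pdf_metadata(raw_info, page_count=3, size_bytes=125000)
```

Missing fields come back as `None` rather than raising, since many PDFs carry incomplete or empty information dictionaries.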
Output
{
  "pdf_metadata": {
    "title": "Invoice #12345",
    "author": "Accounting Dept",
    "creation_date": "2024-01-10",
    "page_count": 3
  },
  "file_metadata": {
    "size_bytes": 125000,
    "mime_type": "application/pdf"
  }
}
Acceptance Criteria
UC-ING-005: Queue Document for Processing
Overview
| Field | Value |
|-------|-------|
| ID | ING-005 |
| Title | Queue Document for Processing |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Add document to processing queue for OCR, classification, and tagging.
Steps
- Create processing job record
- Determine processing pipeline based on file type
- Enqueue to appropriate worker queue
- Update document status to "queued"
Queue Routing
| File Type | Queue | Workers |
|-----------|-------|---------|
| Scanned PDF | ocr_queue | OCR Workers |
| Native PDF | text_queue | Text Extraction |
| Image | ocr_queue | OCR Workers |
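The routing table can be expressed as a small lookup. This is a sketch under one notable assumption: the source does not say how a scanned PDF is told apart from a native one, so the example takes a hypothetical `has_text_layer` flag that some earlier step would have to supply. The function names are likewise illustrative.

```python
# Queue names come from the routing table above.
QUEUE_ROUTES = {
    "scanned_pdf": "ocr_queue",
    "native_pdf": "text_queue",
    "image": "ocr_queue",
}


def classify_file_type(mime_type, has_text_layer=False):
    """Return the routing category for a document.

    has_text_layer is a hypothetical input: True for a PDF with
    extractable text, False for a scanned (image-only) PDF.
    """
    if mime_type == "application/pdf":
        return "native_pdf" if has_text_layer else "scanned_pdf"
    if mime_type.startswith("image/"):
        return "image"
    raise ValueError(f"No processing pipeline for {mime_type}")


def select_queue(mime_type, has_text_layer=False):
    """Pick the worker queue for a document based on its file type."""
    return QUEUE_ROUTES[classify_file_type(mime_type, has_text_layer)]
```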
Output
{
  "job_id": "job_123",
  "document_id": "doc_abc",
  "queue": "ocr_queue",
  "position": 42,
  "estimated_wait": "2 minutes"
}
Acceptance Criteria
UC-ING-006: Handle Upload Failure
Overview
| Field | Value |
|-------|-------|
| ID | ING-006 |
| Title | Handle Upload Failure |
| Actor | System |
| Priority | P2 (MVP Phase 1) |
Description
Gracefully handle upload failures with proper cleanup and error reporting.
Failure Scenarios
| Scenario | Action |
|----------|--------|
| Connection lost mid-upload | Discard partial file |
| Virus detected | Quarantine and alert |
| Storage full | Return 507 error |
| Timeout | Allow resume if supported |
Steps
- Detect failure condition
- Log failure details
- Clean up partial resources
- Notify user of failure
- Provide recovery options if applicable
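The "discard partial file" action and the log-then-clean-up steps can be sketched for local disk storage. This is an illustration under assumptions: the source stores files in object storage, whose client API is not specified, so the example writes to a temporary `.part` file in a directory and promotes it only on success. `store_upload` is a hypothetical name.

```python
import logging
import os
import tempfile

logger = logging.getLogger("ingestion")


def store_upload(chunks, dest_dir):
    """Write an upload to dest_dir, leaving no partial file behind on failure.

    chunks is any iterable of bytes; it may raise mid-stream, for example
    when the client connection is lost.
    """
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
    except Exception:
        # Log failure details, clean up the partial resource, then re-raise
        # so the caller can notify the user.
        logger.exception("Upload failed; discarding partial file %s", tmp_path)
        os.remove(tmp_path)
        raise

    # Promote the completed file by dropping the .part suffix.
    final_path = tmp_path[: -len(".part")]
    os.replace(tmp_path, final_path)
    return final_path
```

Writing to a temporary name first means readers of the storage directory never observe a half-written document, whether the upload succeeds or fails.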
Acceptance Criteria