Ingestion Use Cases (ING)

Module Purpose: Document upload, validation, and preparation for processing. This module contains six use cases that together form the entry point for all documents into the system.


Use Case Quick Reference

ID | Title | Priority
ING-001 | Upload Single Document | P1
ING-002 | Batch Upload Documents | P1
ING-003 | Detect File Format | P1
ING-004 | Extract Document Metadata | P1
ING-005 | Queue Document for Processing | P1
ING-006 | Handle Upload Failure | P2

UC-ING-001: Upload Single Document

Overview

Field | Value
ID | ING-001
Title | Upload Single Document
Actor | User, API Client
Priority | P1 (MVP Phase 1)

Description

Accept a single document upload via REST API or web UI, validate the file, and store it for processing.

Preconditions

  • User is authenticated
  • File size ≤ 100MB
  • File format is supported (PDF, PNG, JPG, TIFF)

Input

Field | Type | Required | Description
file | Binary | Yes | Document file
filename | String | Yes | Original filename
tags | Array[String] | No | Optional initial tags

Steps

  1. Receive file via multipart/form-data
  2. Validate file size
  3. Detect file format (→ ING-003)
  4. Compute file hash (→ DUP-001)
  5. Check for exact duplicate (→ DUP-002)
  6. Store file in object storage
  7. Create document record in database
  8. Queue for processing (→ ING-005)
  9. Return document ID and status
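
Steps 2, 4, and 5 above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the names `validate_and_hash`, `UploadResult`, `FileTooLargeError`, and `DuplicateDocumentError` are hypothetical, the in-memory `known_hashes` set stands in for whatever lookup DUP-002 actually performs, and the 100MB limit is assumed to mean 100 × 1024 × 1024 bytes. MD5 is used only because the output below exposes a `hash_md5` field.

```python
import hashlib
from dataclasses import dataclass

# Assumption: "100MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


class FileTooLargeError(Exception):
    """Hypothetical error for oversized payloads (would map to HTTP 413)."""


class DuplicateDocumentError(Exception):
    """Hypothetical error for exact duplicates (would map to HTTP 409)."""


@dataclass
class UploadResult:
    filename: str
    size_bytes: int
    hash_md5: str
    status: str


def validate_and_hash(filename: str, data: bytes, known_hashes: set) -> UploadResult:
    # Step 2: validate file size before doing any further work.
    if len(data) > MAX_SIZE_BYTES:
        raise FileTooLargeError("File exceeds 100MB limit")

    # Step 4: compute the content hash (MD5, matching the hash_md5 field).
    digest = hashlib.md5(data).hexdigest()

    # Step 5: reject exact duplicates before anything is stored.
    if digest in known_hashes:
        raise DuplicateDocumentError("Exact duplicate exists")

    # Stand-in for steps 6-8 (storage, database record, queueing).
    known_hashes.add(digest)
    return UploadResult(filename, len(data), digest, "queued")
```

Running the duplicate check on the hash before step 6 keeps a rejected upload from ever reaching object storage, which is what the acceptance criterion "Duplicate check is performed before storage" asks for.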

Output

{
  "id": "doc_abc123",
  "filename": "invoice.pdf",
  "size_bytes": 125000,
  "mime_type": "application/pdf",
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "status": "queued",
  "created_at": "2024-01-15T10:30:00Z"
}

Error Handling

Error | HTTP Status | Message
File too large | 413 | File exceeds 100MB limit
Unsupported format | 415 | File format not supported
Duplicate found | 409 | Exact duplicate exists (doc_xyz)
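
One way to keep these status codes and messages consistent across the API is a single lookup table. The sketch below is an assumption about how that could look: the error keys (`file_too_large` and so on), the `error_response` helper, and the shape of the response body are all hypothetical; only the status codes and message texts come from the table above.

```python
from http import HTTPStatus

# Error key -> (HTTP status, message template). Keys are hypothetical names.
ERROR_RESPONSES = {
    "file_too_large": (HTTPStatus.REQUEST_ENTITY_TOO_LARGE, "File exceeds 100MB limit"),
    "unsupported_format": (HTTPStatus.UNSUPPORTED_MEDIA_TYPE, "File format not supported"),
    "duplicate_found": (HTTPStatus.CONFLICT, "Exact duplicate exists ({existing_id})"),
}


def error_response(code: str, **context: str):
    """Return (status_code, body) for a known error key."""
    status, template = ERROR_RESPONSES[code]
    # Templates without placeholders ignore any extra context values.
    return int(status), {"error": code, "message": template.format(**context)}
```

For example, `error_response("duplicate_found", existing_id="doc_xyz")` yields status 409 with the message "Exact duplicate exists (doc_xyz)".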

Acceptance Criteria

  • File uploads complete in <30s for files ≤50MB
  • All supported formats are accepted
  • Unsupported formats return 415
  • Duplicate check is performed before storage
  • Document record is created with correct metadata

UC-ING-002: Batch Upload Documents

Overview

Field | Value
ID | ING-002
Title | Batch Upload Documents
Actor | User, CLI
Priority | P1 (MVP Phase 1)

Description

Accept multiple documents in a single request or ZIP archive for bulk processing.

Preconditions

  • User is authenticated
  • Total size ≤ 1GB
  • Max 100 documents per batch

Input

Field | Type | Required | Description
files | Array[Binary] | Yes | Multiple files
archive | Binary | No | ZIP archive (alternative)

Steps

  1. Receive files or extract from archive
  2. Validate total size and count
  3. For each file:
       a. Execute ING-001 (single upload)
       b. Track success/failure
  4. Return batch summary
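
The batch loop with partial success might look like the following sketch. It is illustrative only: `process_batch` and the `upload_one` callback (standing in for ING-001) are hypothetical names, 1GB is assumed to mean 1024³ bytes, and ZIP extraction and `batch_id` generation are left out.

```python
from typing import Callable, List, Tuple

MAX_BATCH_FILES = 100          # "Max 100 documents per batch"
MAX_BATCH_BYTES = 1024 ** 3    # Assumption: 1GB interpreted as 1024^3 bytes


def process_batch(
    files: List[Tuple[str, bytes]],
    upload_one: Callable[[str, bytes], str],
) -> dict:
    # Step 2: validate count and total size for the whole batch up front.
    if len(files) > MAX_BATCH_FILES:
        raise ValueError("Batch exceeds 100 documents")
    if sum(len(data) for _, data in files) > MAX_BATCH_BYTES:
        raise ValueError("Batch exceeds 1GB total size")

    documents, errors = [], []
    for filename, data in files:
        try:
            # Step 3a: delegate to the single-upload flow (ING-001).
            doc_id = upload_one(filename, data)
            documents.append({"id": doc_id, "status": "queued"})
        except Exception as exc:
            # Step 3b: record the failure and continue with the next file.
            errors.append({"filename": filename, "error": str(exc)})

    # Step 4: summary in the shape of the Output section (minus batch_id).
    return {
        "total": len(files),
        "succeeded": len(documents),
        "failed": len(errors),
        "documents": documents,
        "errors": errors,
    }
```

Catching the exception per file, rather than around the whole loop, is what allows some files to fail while others succeed.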

Output

{
  "batch_id": "batch_xyz789",
  "total": 50,
  "succeeded": 48,
  "failed": 2,
  "documents": [
    {"id": "doc_001", "status": "queued"},
    {"id": "doc_002", "status": "queued"},
    ...
  ],
  "errors": [
    {"filename": "corrupt.pdf", "error": "Invalid PDF format"},
    {"filename": "huge.png", "error": "File exceeds size limit"}
  ]
}

Acceptance Criteria

  • Supports multipart upload of multiple files
  • Supports ZIP archive upload
  • Partial success is allowed (some files fail, others succeed)
  • Batch status tracks overall progress

UC-ING-003: Detect File Format

Overview

Field | Value
ID | ING-003
Title | Detect File Format
Actor | System
Priority | P1 (MVP Phase 1)

Description

Identify the actual file format using magic bytes, not just file extension.

Supported Formats

Format | MIME Type | Magic Bytes
PDF | application/pdf | %PDF
PNG | image/png | 89 50 4E 47
JPEG | image/jpeg | FF D8 FF
TIFF | image/tiff | 49 49 2A 00 or 4D 4D 00 2A
GIF | image/gif | 47 49 46 38

Steps

  1. Read first 8 bytes of file
  2. Match against known magic bytes
  3. Return detected MIME type
  4. If unknown, attempt extension-based detection
  5. If still unknown, return error
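
A minimal sketch of these steps, using only the Python standard library, is shown below. The signatures are transcribed from the table above; the function name `detect_format`, the `"low"` confidence value, and the `"extension"` method label are assumptions, since the document only shows the `"high"` / `"magic_bytes"` case.

```python
import mimetypes

# (signature bytes, MIME type), transcribed from the Supported Formats table.
MAGIC_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"\x89PNG", "image/png"),         # 89 50 4E 47
    (b"\xff\xd8\xff", "image/jpeg"),   # FF D8 FF
    (b"II*\x00", "image/tiff"),        # 49 49 2A 00 (little-endian)
    (b"MM\x00*", "image/tiff"),        # 4D 4D 00 2A (big-endian)
    (b"GIF8", "image/gif"),            # 47 49 46 38
]
KNOWN_TYPES = {mime for _, mime in MAGIC_SIGNATURES}


def detect_format(header: bytes, filename: str = "") -> dict:
    # Step 1: only the first 8 bytes are needed for these signatures.
    head = header[:8]

    # Steps 2-3: match against known magic bytes.
    for signature, mime in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return {"detected_type": mime, "confidence": "high", "method": "magic_bytes"}

    # Step 4: fall back to the file extension (assumed to be lower confidence).
    guessed, _ = mimetypes.guess_type(filename)
    if guessed in KNOWN_TYPES:
        return {"detected_type": guessed, "confidence": "low", "method": "extension"}

    # Step 5: still unknown.
    raise ValueError("Unsupported or unrecognized file format")
```

Because the magic-byte check runs first, a file named `report.pdf` that actually starts with `89 50 4E 47` is reported as `image/png`, which is the behavior the description asks for.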

Output

{
  "detected_type": "application/pdf",
  "confidence": "high",
  "method": "magic_bytes"
}

Acceptance Criteria

  • Correctly identifies all supported formats
  • Does not rely solely on file extension
  • Returns error for unsupported formats

UC-ING-004: Extract Document Metadata

Overview

Field | Value
ID | ING-004
Title | Extract Document Metadata
Actor | System
Priority | P1 (MVP Phase 1)

Description

Extract metadata from document files (PDF info, EXIF data, etc.).

Extracted Fields

Source | Fields
PDF | Title, Author, Subject, Creator, Creation Date, Page Count
Image | Width, Height, DPI, Color Space, EXIF (camera, date, GPS)

Steps

  1. Open file for reading
  2. Based on format, use appropriate library
  3. Extract available metadata
  4. Normalize to common schema
  5. Store in document record
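
Step 4 (normalizing to a common schema) is sketched below for the PDF case. The sketch assumes some PDF library has already returned the raw document information dictionary with keys such as `/Title` and `/CreationDate`, and that dates arrive in the PDF `D:YYYYMMDDHHmmSS` form; the helper names `parse_pdf_date` and `normalize_pdf_metadata` are hypothetical. Missing or corrupt values become `None` rather than raising.

```python
from datetime import datetime
from typing import Optional


def parse_pdf_date(raw: object) -> Optional[str]:
    """Convert a PDF date string such as 'D:20240110103000Z' to 'YYYY-MM-DD'."""
    if not isinstance(raw, str):
        return None
    digits = raw[2:] if raw.startswith("D:") else raw
    day = digits[:8]
    # Require exactly eight digits so truncated values are not misparsed.
    if len(day) != 8 or not day.isdigit():
        return None
    try:
        return datetime.strptime(day, "%Y%m%d").date().isoformat()
    except ValueError:
        # For example, month 13 or day 32.
        return None


def normalize_pdf_metadata(raw_info: dict, page_count: int) -> dict:
    """Map a raw PDF info dictionary onto the pdf_metadata output schema."""
    return {
        "title": raw_info.get("/Title"),
        "author": raw_info.get("/Author"),
        "creation_date": parse_pdf_date(raw_info.get("/CreationDate")),
        "page_count": page_count,
    }
```

With this shape, a document that lacks an `/Author` entry still produces a complete record with `"author": None`, which is one way to satisfy the "handles missing/corrupt metadata gracefully" criterion.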

Output

{
  "pdf_metadata": {
    "title": "Invoice #12345",
    "author": "Accounting Dept",
    "creation_date": "2024-01-10",
    "page_count": 3
  },
  "file_metadata": {
    "size_bytes": 125000,
    "mime_type": "application/pdf"
  }
}

Acceptance Criteria

  • Extracts PDF metadata fields
  • Extracts image EXIF data
  • Handles missing/corrupt metadata gracefully

UC-ING-005: Queue Document for Processing

Overview

Field | Value
ID | ING-005
Title | Queue Document for Processing
Actor | System
Priority | P1 (MVP Phase 1)

Description

Add document to processing queue for OCR, classification, and tagging.

Steps

  1. Create processing job record
  2. Determine processing pipeline based on file type
  3. Enqueue to appropriate worker queue
  4. Update document status to "queued"

Queue Routing

File Type | Queue | Workers
Scanned PDF | ocr_queue | OCR Workers
Native PDF | text_queue | Text Extraction
Image | ocr_queue | OCR Workers
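
The routing table above reduces to a small function. In this sketch, `select_queue` is a hypothetical name, and the `has_text_layer` flag is an assumption: the document does not say how a scanned PDF is told apart from a native one, so the caller is expected to supply that answer.

```python
def select_queue(mime_type: str, has_text_layer: bool = False) -> str:
    """Pick a worker queue name from the Queue Routing table."""
    if mime_type == "application/pdf":
        # Native PDFs carry extractable text; scanned PDFs need OCR.
        return "text_queue" if has_text_layer else "ocr_queue"
    if mime_type.startswith("image/"):
        return "ocr_queue"
    raise ValueError(f"No processing pipeline for {mime_type}")
```

Defaulting `has_text_layer` to `False` sends a PDF of unknown kind to OCR, which is slower but does not silently skip text in a scanned document.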

Output

{
  "job_id": "job_123",
  "document_id": "doc_abc",
  "queue": "ocr_queue",
  "position": 42,
  "estimated_wait": "2 minutes"
}

Acceptance Criteria

  • Jobs are created for all uploaded documents
  • Correct queue is selected based on file type
  • Job status is trackable

UC-ING-006: Handle Upload Failure

Overview

Field | Value
ID | ING-006
Title | Handle Upload Failure
Actor | System
Priority | P2 (MVP Phase 1)

Description

Gracefully handle upload failures with proper cleanup and error reporting.

Failure Scenarios

Scenario | Action
Connection lost mid-upload | Discard partial file
Virus detected | Quarantine and alert
Storage full | Return 507 error
Timeout | Allow resume if supported

Steps

  1. Detect failure condition
  2. Log failure details
  3. Clean up partial resources
  4. Notify user of failure
  5. Provide recovery options if applicable
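
For the "connection lost mid-upload" scenario, steps 1 to 3 can be illustrated with a write-to-temporary-file pattern. This is a sketch under assumptions: it targets a local filesystem rather than the object storage the system uses, and `store_upload` and the `"ingestion"` logger name are hypothetical.

```python
import logging
import os
import tempfile
from typing import Iterable

logger = logging.getLogger("ingestion")


def store_upload(chunks: Iterable[bytes], dest_path: str) -> None:
    """Write chunks to dest_path, leaving no partial file behind on failure."""
    dest_dir = os.path.dirname(dest_path) or "."
    # Write to a temporary ".part" file in the destination directory first.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        # Only a complete upload is moved to its final name.
        os.replace(tmp_path, dest_path)
    except Exception:
        # Step 2: log the failure details with a traceback.
        logger.exception("Upload to %s failed; discarding partial file", dest_path)
        # Step 3: clean up the partial resource.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        # Re-raise so the caller can notify the user (step 4).
        raise
```

Because the final name only appears after `os.replace`, a reader of the destination directory never sees a half-written document, and a failure leaves nothing to orphan.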

Acceptance Criteria

  • No orphaned files on failure
  • Clear error messages returned
  • Failures are logged for debugging
