Implementation Overview
This document outlines the development approach and technical implementation details for Pebble DMS.
Technology Stack
Backend
| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.11+ | Core services |
| API Framework | FastAPI | REST API |
| Task Queue | Celery + Redis | Async processing |
| ORM | SQLAlchemy | Database access |
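A minimal sketch of how these pieces might be wired together in api/main.py, following the project structure below: the `/api/v1` prefix matches the example tests later in this document, while the assumption that each route module exposes a `router` is illustrative.

# api/main.py -- wiring sketch; assumes each route module exposes a `router`
from fastapi import FastAPI

from api.routes import documents, search, tags as tag_routes

app = FastAPI(title="Pebble DMS API")

# Mount resource routers under a versioned prefix
app.include_router(documents.router, prefix="/api/v1/documents", tags=["documents"])
app.include_router(search.router, prefix="/api/v1/search", tags=["search"])
app.include_router(tag_routes.router, prefix="/api/v1/tags", tags=["tags"])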
AI/ML
| Component | Technology |
|---|---|
| OCR | Tesseract 5, Doctr |
| Embeddings | sentence-transformers |
| Classification | Hugging Face transformers |
| NER | spaCy |
| Keywords | KeyBERT |
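As a sketch of how the NER and keyword components could feed automatic tagging (e.g. in workers/tagging.py), the snippet below combines spaCy entities with KeyBERT keyphrases; the spaCy model name, entity labels, and the `suggest_tags` helper are illustrative assumptions.

# workers/tagging.py -- illustrative sketch combining NER and keyword extraction
import spacy
from keybert import KeyBERT

nlp = spacy.load("en_core_web_sm")  # assumed model; any spaCy pipeline works
kw_model = KeyBERT()                # defaults to a sentence-transformers backbone

def suggest_tags(text: str, top_n: int = 5) -> list[str]:
    """Return candidate tags from named entities and keyphrases."""
    entities = {ent.text.lower() for ent in nlp(text).ents
                if ent.label_ in {"ORG", "PERSON", "GPE", "DATE"}}
    keywords = {kw for kw, _score in kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1, 2), top_n=top_n)}
    return sorted(entities | keywords)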
Data Stores
| Store | Technology | Purpose |
|---|---|---|
| Documents | MinIO / S3 | File storage |
| Metadata | PostgreSQL 15 | Relational data |
| Vectors | Qdrant | Embeddings |
| Search | Meilisearch | Full-text search |
| Cache | Redis | Caching, queues |
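The embedding worker later in this document upserts into a Qdrant collection named `documents`, which has to be created once. A bootstrap sketch is shown below; the script path and URL are assumptions, and the 384-dimension vector size matches the all-MiniLM-L6-v2 model used for embeddings.

# scripts/bootstrap_qdrant.py -- one-time collection setup sketch (path and URL assumed)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 dims: all-MiniLM-L6-v2
)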
Frontend
| Technology | Purpose |
|---|---|
| React 18 | UI framework |
| TypeScript | Type safety |
| TanStack Query | Data fetching |
| Tailwind CSS | Styling |
Project Structure
pebble-dms/
├── api/ # FastAPI application
│ ├── routes/
│ │ ├── documents.py
│ │ ├── search.py
│ │ └── tags.py
│ ├── models/ # SQLAlchemy models
│ ├── schemas/ # Pydantic schemas
│ ├── services/ # Business logic
│ └── main.py
├── workers/ # Celery workers
│ ├── ocr.py
│ ├── embedding.py
│ ├── classification.py
│ └── tagging.py
├── ml/ # ML models and training
│ ├── classification/
│ ├── embedding/
│ └── ner/
├── web/ # React frontend
│ ├── src/
│ │ ├── components/
│ │ ├── pages/
│ │ └── api/
│ └── package.json
├── tests/
├── docker-compose.yml
└── pyproject.toml
Development Workflow
Git Flow
gitGraph
commit id: "main"
branch develop
commit id: "feature-base"
branch feature/ingestion
commit id: "upload-api"
commit id: "batch-upload"
checkout develop
merge feature/ingestion
branch feature/dedup
commit id: "hash-dedup"
commit id: "content-dedup"
checkout develop
merge feature/dedup
checkout main
merge develop tag: "v0.1.0"
Branching Strategy
| Branch | Purpose | Merge To |
|---|---|---|
| main | Production-ready | - |
| develop | Integration | main |
| feature/* | New features | develop |
| fix/* | Bug fixes | develop |
| release/* | Release prep | main |
API Design
RESTful Conventions
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /documents | Upload document |
| GET | /documents | List documents |
| GET | /documents/:id | Get document |
| DELETE | /documents/:id | Delete document |
| POST | /search | Search documents |
| GET | /tags | List tags |
// POST /documents
// Request (multipart/form-data)
{
"file": "<binary>",
"tags": ["manual-tag-1"]
}
// Response
{
"id": "doc_abc123",
"filename": "invoice.pdf",
"status": "processing",
"created_at": "2024-01-15T10:30:00Z"
}
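A sketch of the corresponding upload handler follows; `store_file`, `create_document`, and `enqueue_processing` are illustrative placeholders for the service layer, and the 202 status code matches the example tests later in this document.

# api/routes/documents.py -- upload handler sketch matching the shapes above
# (store_file / create_document / enqueue_processing are illustrative helpers)
from fastapi import APIRouter, File, Form, UploadFile

router = APIRouter()

@router.post("", status_code=202)
async def upload_document(file: UploadFile = File(...), tags: list[str] = Form(default=[])):
    storage_path = await store_file(file)                     # push bytes to MinIO/S3
    doc = create_document(file.filename, storage_path, tags)  # insert metadata row
    enqueue_processing(doc.id)                                # kick off OCR -> embedding chain
    return {"id": doc.id, "filename": doc.filename,
            "status": "processing", "created_at": doc.created_at}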
Error Handling
{
"error": {
"code": "DOCUMENT_NOT_FOUND",
"message": "Document with ID 'doc_xyz' not found",
"details": {}
}
}
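One way to produce this envelope is a FastAPI exception handler along the lines of the sketch below; `DocumentNotFound` is an illustrative application exception, and `app` is the FastAPI instance from api/main.py.

# api/main.py (excerpt) -- exception handler sketch emitting the error envelope
from fastapi import Request
from fastapi.responses import JSONResponse

class DocumentNotFound(Exception):  # illustrative application exception
    def __init__(self, document_id: str):
        self.document_id = document_id

@app.exception_handler(DocumentNotFound)
async def document_not_found_handler(request: Request, exc: DocumentNotFound):
    return JSONResponse(
        status_code=404,
        content={"error": {
            "code": "DOCUMENT_NOT_FOUND",
            "message": f"Document with ID '{exc.document_id}' not found",
            "details": {},
        }},
    )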
Database Schema
Core Tables
-- Documents
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
filename VARCHAR(255) NOT NULL,
original_filename VARCHAR(255),
mime_type VARCHAR(100),
size_bytes BIGINT,
storage_path TEXT,
hash_md5 CHAR(32),
hash_sha256 CHAR(64),
status VARCHAR(50) DEFAULT 'pending',
doc_type VARCHAR(100),
category VARCHAR(100),
confidence DECIMAL(5,2),
extracted_text TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Tags
CREATE TABLE tags (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(100) UNIQUE NOT NULL,
type VARCHAR(50) DEFAULT 'custom',
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Document-Tag relationship
CREATE TABLE document_tags (
document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
tag_id UUID REFERENCES tags(id) ON DELETE CASCADE,
source VARCHAR(50) DEFAULT 'manual',
confidence DECIMAL(5,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (document_id, tag_id)
);
-- Processing jobs
CREATE TABLE processing_jobs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID REFERENCES documents(id),
job_type VARCHAR(50),
status VARCHAR(50) DEFAULT 'pending',
result JSONB,
error TEXT,
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Indexes
CREATE INDEX idx_documents_status ON documents(status);
CREATE INDEX idx_documents_type ON documents(doc_type);
CREATE INDEX idx_documents_hash_md5 ON documents(hash_md5);
CREATE INDEX idx_documents_created ON documents(created_at);
CREATE INDEX idx_tags_name ON tags(name);
Worker Implementation
OCR Worker
# workers/ocr.py
from celery import shared_task
import pytesseract
from PIL import Image

from workers.embedding import process_embedding
# update_document: project persistence helper (not shown in this document)

@shared_task(bind=True, max_retries=3)
def process_ocr(self, document_id: str, file_path: str):
    try:
        # Load the page image (PDFs are rasterized to images first; see the PDF sketch below)
        image = Image.open(file_path)

        # Run OCR
        text = pytesseract.image_to_string(
            image,
            lang='eng',
            config='--oem 3 --psm 6'
        )

        # Save result
        update_document(document_id, extracted_text=text)

        # Trigger next step
        process_embedding.delay(document_id)
    except Exception as e:
        raise self.retry(exc=e, countdown=60)
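PIL's `Image.open` handles raster images but not PDFs, so PDF inputs need to be rasterized before OCR. A sketch using pdf2image (which requires the poppler system package) is shown below; the helper name and DPI are illustrative.

# workers/ocr.py (excerpt) -- PDF handling sketch: rasterize pages, then OCR each one
from pdf2image import convert_from_path
import pytesseract

def extract_text_from_pdf(file_path: str) -> str:
    pages = convert_from_path(file_path, dpi=300)  # one PIL.Image per page
    return "\n".join(
        pytesseract.image_to_string(page, lang='eng', config='--oem 3 --psm 6')
        for page in pages
    )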
Embedding Worker
# workers/embedding.py
from celery import shared_task
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

model = SentenceTransformer('all-MiniLM-L6-v2')
qdrant_client = QdrantClient(url="http://qdrant:6333")  # service name/port from docker-compose

@shared_task
def process_embedding(document_id: str):
    doc = get_document(document_id)  # project persistence helper (not shown in this document)

    # Generate embedding
    embedding = model.encode(doc.extracted_text)

    # Store in Qdrant
    qdrant_client.upsert(
        collection_name="documents",
        points=[PointStruct(
            id=document_id,
            vector=embedding.tolist(),
            payload={"filename": doc.filename}
        )]
    )

    # Trigger dedup check
    check_duplicates.delay(document_id)
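The `check_duplicates` task triggered above is not defined in this document; a sketch of the hash-based pass, using the `hash_sha256` column from the schema, might look like the following. The module path and the `find_documents_by_hash` / `mark_duplicate` helpers are illustrative.

# workers/dedup.py -- illustrative sketch of the hash-based duplicate check
# (find_documents_by_hash / mark_duplicate are placeholder persistence helpers)
from celery import shared_task

@shared_task
def check_duplicates(document_id: str):
    doc = get_document(document_id)
    # Exact-duplicate pass: any earlier document with the same SHA-256 content hash
    matches = find_documents_by_hash(doc.hash_sha256, exclude_id=document_id)
    if matches:
        mark_duplicate(document_id, duplicate_of=matches[0].id)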
Testing Strategy
Test Pyramid
| Level | Coverage | Tools |
|---|---|---|
| Unit | 80%+ | pytest |
| Integration | 60%+ | pytest + testcontainers |
| E2E | Critical paths | Playwright |
Example Tests
# tests/test_ingestion.py
import pytest
from fastapi.testclient import TestClient

def test_upload_pdf(client: TestClient, sample_pdf):
    response = client.post(
        "/api/v1/documents",
        files={"file": sample_pdf}
    )
    assert response.status_code == 202
    assert "id" in response.json()
    assert response.json()["status"] == "processing"

def test_upload_unsupported_format(client: TestClient):
    response = client.post(
        "/api/v1/documents",
        files={"file": ("test.exe", b"binary", "application/x-msdownload")}
    )
    assert response.status_code == 400
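These tests assume `client` and `sample_pdf` fixtures; a minimal tests/conftest.py sketch is shown below. The api.main import follows the project structure, and the PDF bytes are a placeholder.

# tests/conftest.py -- fixture sketch for the tests above (client, sample_pdf)
import pytest
from fastapi.testclient import TestClient

from api.main import app  # FastAPI instance from the project structure

@pytest.fixture
def client() -> TestClient:
    return TestClient(app)

@pytest.fixture
def sample_pdf():
    # (filename, bytes, content type) tuple as expected by TestClient's `files=`
    return ("invoice.pdf", b"%PDF-1.4 minimal placeholder", "application/pdf")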
Deployment
Docker Compose (Development)
version: '3.8'

services:
  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:pass@postgres/pebble
      - REDIS_URL=redis://redis:6379
    depends_on:
      - postgres
      - redis

  worker:
    build: ./api
    command: celery -A workers worker -l INFO
    depends_on:
      - redis

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: pebble
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass

  redis:
    image: redis:7

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

  meilisearch:
    image: getmeili/meilisearch
    ports:
      - "7700:7700"

  minio:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
CI/CD Pipeline
flowchart LR
A[Push] --> B[Lint]
B --> C[Unit Tests]
C --> D[Build Images]
D --> E[Integration Tests]
E --> F{Branch?}
F -->|develop| G[Deploy Staging]
F -->|main| H[Deploy Production]