Paperless-ngx Tutorial: Build a Paperless Office Hub with Docker

Scan, OCR, and automatically organize your receipts and documents. Deploy Paperless-ngx on a VPS using Docker Compose for digital document storage.

How to Self-Host Paperless-ngx on a VPS with Docker Compose

Paperless-ngx is an open-source document management system (DMS) that turns scanned paper documents into a searchable, organized digital archive. SQLite is adequate for testing, but a production deployment on a VPS is better served by a PostgreSQL database for metadata, a Redis instance for the task queue, and a reverse proxy for TLS termination.

This guide provides a production-hardened docker-compose.yml configuration and explains the technical mechanics of document archiving, database bindings, OCR optimization, and backup strategies.


Architectural Overview

A robust Paperless-ngx deployment consists of three primary components communicating over an isolated Docker bridge network:

  1. Webserver & Worker (The Core Application): Runs Django, Gunicorn, and Celery. It handles the web UI, processes OCR tasks, parses metadata, and manages document storage.
  2. Database (PostgreSQL): Stores document metadata, user settings, tagging configurations, and indexing references. PostgreSQL is preferred over SQLite for parallel write handling and crash resilience.
  3. Message Broker & Cache (Redis): Handles Celery task distribution (e.g., queuing OCR tasks for the workers) and carries the status messages that drive live progress updates in the web UI.
                  +----------------------------------+
                  |           Reverse Proxy          |
                  |         (Nginx / Caddy)          |
                  +----------------------------------+
                                   | (Port 8000)
                                   v
                  +----------------------------------+
                  |      Paperless-ngx Container     |
                  |      (Django, Gunicorn, Celery)  |
                  +----------------------------------+
                    /              |               \
                   /               |               \
   +--------------------+  +---------------+  +--------------------+
   |  PostgreSQL (DB)   |  | Redis (Queue) |  |   Host Filesystem  |
   | (Port 5432 - Int)  |  | (Port 6379)   |  | (Consume, Media)   |
   +--------------------+  +---------------+  +--------------------+

Directory Structure and Permissions

Before deploying, organize the host file system. Paperless-ngx processes files from a "consume" folder and saves the processed results into a "media" folder.

To prevent ownership conflicts between the host user and the Docker container processes, identify your host user's UID and GID:

id -u  # Typically 1000
id -g  # Typically 1000

Create the directory structure on the VPS:

sudo mkdir -p /opt/paperless/{config,data,media,consume,export,pgdata,redisdata}
sudo chown -R 1000:1000 /opt/paperless
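  • /opt/paperless/consume: Place files here to be automatically ingested.
  • /opt/paperless/media: Stores original documents and generated PDF/A files.
  • /opt/paperless/config: Application data directory (classification model, task schedule, logs); mounted at /usr/src/paperless/data.
  • /opt/paperless/data: Search index files; mounted at /usr/src/paperless/data/index.
  • /opt/paperless/export: Backup destination.
  • /opt/paperless/pgdata: Persistent PostgreSQL data.
  • /opt/paperless/redisdata: Persistent Redis data.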
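
Ownership mismatches tend to surface later as failed consumption tasks, so it can help to check the directories before the first start. The following is a minimal sketch, assuming GNU stat as found on most Linux VPS images; the check_owner helper is hypothetical and not part of Paperless-ngx.

```shell
# Hypothetical helper: report directories whose owner differs from the
# expected UID:GID. Returns non-zero if any directory does not match.
check_owner() {
  want="$1"
  shift
  rc=0
  for dir in "$@"; do
    # stat -c '%u:%g' prints the numeric owner and group (GNU coreutils).
    have="$(stat -c '%u:%g' "$dir")" || { rc=1; continue; }
    if [ "$have" != "$want" ]; then
      echo "$dir is owned by $have, expected $want" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (run on the VPS):
#   check_owner 1000:1000 /opt/paperless/consume /opt/paperless/media \
#     /opt/paperless/config /opt/paperless/data /opt/paperless/export
```

The pgdata and redisdata directories are left out of the example on purpose: the PostgreSQL and Redis images change the ownership of their data directories to their own service users on first start.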

Production Docker Compose Configuration

The following docker-compose.yml defines the multi-container setup. It includes strict health checks to guarantee that dependent services (PostgreSQL and Redis) are fully operational before the Paperless web server initializes.

Create /opt/paperless/docker-compose.yml:

version: '3.8'

services:
  redis:
    image: docker.io/library/redis:7-alpine
    container_name: paperless-redis
    restart: unless-stopped
    volumes:
      - /opt/paperless/redisdata:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  db:
    image: docker.io/library/postgres:16-alpine
    container_name: paperless-db
    restart: unless-stopped
    volumes:
      - /opt/paperless/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless_usr
      # Set a strong password in production
      POSTGRES_PASSWORD: super_secret_db_password_here
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U paperless_usr -d paperless"]
      interval: 10s
      timeout: 5s
      retries: 5

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.9.0
    container_name: paperless-webserver
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - /opt/paperless/config:/usr/src/paperless/data
      - /opt/paperless/data:/usr/src/paperless/data/index
      - /opt/paperless/media:/usr/src/paperless/media
      - /opt/paperless/consume:/usr/src/paperless/consume
      - /opt/paperless/export:/usr/src/paperless/export
    environment:
      # PUID/PGID to match host permissions
      USERMAP_UID: 1000
      USERMAP_GID: 1000

      # Database Configuration
      PAPERLESS_DBENGINE: postgresql
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPORT: 5432
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless_usr
      PAPERLESS_DBPASS: super_secret_db_password_here

      # Broker & Cache
      PAPERLESS_REDIS: redis://redis:6379

      # Security & URL Bindings
      # Change this to your actual sub-domain
      PAPERLESS_URL: https://paperless.yourdomain.com
      # Generate a random 64-character alphanumeric string for SECRET_KEY
      PAPERLESS_SECRET_KEY: "change_me_to_a_random_long_secret_key"
      PAPERLESS_TIME_ZONE: "UTC"
      PAPERLESS_OCR_LANGUAGE: "eng"

      # OCR Tuning
      PAPERLESS_OCR_MODE: skip_noarchive
      PAPERLESS_TASK_WORKERS: 2
      PAPERLESS_THREADS_PER_WORKER: 1
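
The compose file above carries two placeholder secrets: the database password (used twice, as POSTGRES_PASSWORD and PAPERLESS_DBPASS) and PAPERLESS_SECRET_KEY. One way to generate replacements is sketched below; the gen_secret helper is hypothetical, and it emits hexadecimal characters, which is a subset of the alphanumeric string the comment in the file asks for.

```shell
# Hypothetical helper: print N random hexadecimal characters (N should be even).
gen_secret() {
  # Read N/2 random bytes and print each as two hex digits.
  head -c "$(( ${1:-64} / 2 ))" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

PAPERLESS_SECRET_KEY="$(gen_secret 64)"
POSTGRES_PASSWORD="$(gen_secret 32)"

# Print the values so they can be pasted into docker-compose.yml.
printf 'PAPERLESS_SECRET_KEY: "%s"\n' "$PAPERLESS_SECRET_KEY"
printf 'POSTGRES_PASSWORD / PAPERLESS_DBPASS: %s\n' "$POSTGRES_PASSWORD"
```

The same password must be entered in both the db and webserver services. Note that the postgres image only applies POSTGRES_PASSWORD when it initializes an empty data directory, so set it before the first start.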

Technical Deep-Dive: Database and Broker Bindings

Redis Connection Protocol

Paperless-ngx relies on Redis as the Celery broker. PAPERLESS_REDIS takes a URI: redis://redis:6379. Docker's embedded DNS server resolves Compose service names (as well as container names) to internal IPs on the project network, so the hostname redis, which is the service name in the compose file, resolves to the Redis container.

For security, the Redis container does not expose port 6379 to the host system. It is only accessible to containers on the shared bridge network.

PostgreSQL Connection Lifecycle

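During boot, the webserver service waits for db to pass its health check (pg_isready). Once the database is healthy, the container's startup routine runs the Django migrations to build or update the schema before the web server accepts requests. Django then opens database connections as requests and background tasks need them; this configuration does not set up connection pooling, and a small deployment does not need it.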
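
To check by hand what the health check waits for, a small retry loop can mirror the interval and retries settings from the compose file. This is a sketch only; retry_until is a hypothetical helper, not a Docker or Paperless-ngx command.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Succeeds as soon as the command succeeds.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (run from /opt/paperless): same probe, interval and retries
# as the db health check in the compose file.
#   retry_until 5 10 docker compose exec -T db pg_isready -U paperless_usr -d paperless
```

If the loop gives up, docker compose logs db usually shows why PostgreSQL is not accepting connections.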


Document Archiving and OCR Internals

When a PDF or image lands in the consume directory (/opt/paperless/consume on the host), the consumer queues a task that a Celery worker picks up. The core pipeline consists of:

  1. Extraction: Reading textual content if it already exists.
  2. OCR (Tesseract): If no text is found, Tesseract analyzes the document.
  3. Archiving: Creating a standardized PDF/A version.
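
How the pipeline treats a given file depends on PAPERLESS_OCR_MODE and on whether the file already carries text. The sketch below models that decision for a whole document. It is a simplification written for this guide (the real pipeline decides per page), and plan_document is a hypothetical name.

```shell
# Hypothetical model of the pipeline decision for one document.
# $1 = PAPERLESS_OCR_MODE value, $2 = "yes" if the file already contains text.
plan_document() {
  mode="$1"
  has_text="$2"
  ocr=yes
  archive=yes
  case "$mode" in
    skip|skip_noarchive|redo|force) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [ "$has_text" = "yes" ]; then
    case "$mode" in
      skip)           ocr=no ;;              # keep existing text, still archive
      skip_noarchive) ocr=no; archive=no ;;  # keep existing text, no archive copy
    esac
  fi
  echo "ocr=$ocr archive=$archive"
}

plan_document skip yes            # prints: ocr=no archive=yes
plan_document skip_noarchive yes  # prints: ocr=no archive=no
plan_document skip_noarchive no   # prints: ocr=yes archive=yes
```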

Critical OCR Configurations

To control OCR behavior and VPS resource usage, configure the following environment variables:

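  • PAPERLESS_OCR_MODE:
    • skip: OCR only pages that have no text yet (e.g., leave native digital documents alone) and still generate a PDF/A archive version. This is the default.
    • redo: Re-run OCR on scanned pages and replace an existing OCR text layer. Use this for scans that arrived with poor OCR.
    • force: Rasterize every page and run OCR on it, discarding all existing text.
    • skip_noarchive: Like skip, but no PDF/A archive version is created for documents that already contain text, so only the original is stored. This is the fastest mode and the one the compose file above uses. Paperless-ngx 2.x flags this value as deprecated in favor of PAPERLESS_OCR_SKIP_ARCHIVE_FILE (never, with_text, always) combined with skip.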
  • PAPERLESS_OCR_USER_ARGS: A JSON object of extra arguments passed to OCRmyPDF, the library that drives Tesseract. The keys are OCRmyPDF options, not Tesseract variables. For example: {"deskew": true, "optimize": 3} straightens skewed scans and applies the strongest file-size optimization.
  • PAPERLESS_FILENAME_FORMAT: Allows dynamic directory structuring. Example: {created_year}/{correspondent}/{title}. If undefined, Paperless stores documents flat in the media folder using database-indexed names.
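
To see what a PAPERLESS_FILENAME_FORMAT template produces, the placeholder substitution can be imitated in the shell. This is only an illustration: render_filename is a hypothetical helper that handles three placeholders, whereas Paperless-ngx itself also sanitizes the values and appends the file extension.

```shell
# Hypothetical helper: substitute three placeholders in a filename template.
# Values containing "|" or "&" would confuse sed and are not handled here.
render_filename() {
  template="$1"
  created_year="$2"
  correspondent="$3"
  title="$4"
  printf '%s\n' "$template" | sed \
    -e "s|{created_year}|$created_year|g" \
    -e "s|{correspondent}|$correspondent|g" \
    -e "s|{title}|$title|g"
}

render_filename '{created_year}/{correspondent}/{title}' 2024 'ACME Corp' 'Invoice 1042'
# prints: 2024/ACME Corp/Invoice 1042
```

The resulting path is relative to the originals folder inside the media directory.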

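Optimizing Resources on Single-Core / Low-RAM VPSs

By default, Paperless-ngx runs one task worker and sizes the OCR thread count from the number of CPU cores, so a single OCR job can occupy every core. The compose file above also raises PAPERLESS_TASK_WORKERS to 2. On a small VPS (1-2 GB RAM) this can exhaust memory and get the container killed. Limit resource usage using: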

PAPERLESS_TASK_WORKERS: 1
PAPERLESS_THREADS_PER_WORKER: 1
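
For larger machines, a starting value can be derived from the hardware. The sketch below assumes roughly 1 GB of RAM per concurrent OCR task; that figure is a rule of thumb chosen for this example, not a number from the Paperless-ngx documentation, and suggest_task_workers is a hypothetical helper.

```shell
# Hypothetical helper: suggest a PAPERLESS_TASK_WORKERS value.
# $1 = CPU cores, $2 = total RAM in MB.
suggest_task_workers() {
  cores="$1"
  mem_mb="$2"
  by_mem=$(( mem_mb / 1024 ))   # assumption: about 1 GB RAM per OCR task
  if [ "$by_mem" -lt 1 ]; then
    by_mem=1
  fi
  # Use whichever limit is lower: cores or memory.
  if [ "$cores" -lt "$by_mem" ]; then
    echo "$cores"
  else
    echo "$by_mem"
  fi
}

cores="$(nproc)"
mem_mb="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)"
echo "PAPERLESS_TASK_WORKERS: $(suggest_task_workers "$cores" "$mem_mb")"
echo "PAPERLESS_THREADS_PER_WORKER: 1"
```

Treat the output as a starting point and watch memory use while a large scan is being processed.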

Reverse Proxy Configuration (Nginx)

To secure Paperless-ngx with Let's Encrypt SSL, configure Nginx on your host VPS to proxy requests to container port 8000.

Save the following configuration to /etc/nginx/sites-available/paperless:

server {
    listen 80;
    server_name paperless.yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name paperless.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/paperless.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/paperless.yourdomain.com/privkey.pem;

    # Crucial for large document uploads
    client_max_body_size 100M;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for paperless live progress bars
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
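
A file in sites-available is not active until it is linked into sites-enabled. The sketch below assumes the Debian/Ubuntu Nginx layout; enable_site is a hypothetical helper, and its optional second argument exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: link a site from sites-available into sites-enabled.
# $1 = site name, $2 = Nginx config root (defaults to /etc/nginx).
enable_site() {
  site="$1"
  root="${2:-/etc/nginx}"
  if [ ! -f "$root/sites-available/$site" ]; then
    echo "missing: $root/sites-available/$site" >&2
    return 1
  fi
  ln -sfn "$root/sites-available/$site" "$root/sites-enabled/$site"
}

# Example (run as root on the VPS): enable the site, validate the
# configuration, then reload Nginx.
#   enable_site paperless && nginx -t && systemctl reload nginx
```

The ssl_certificate paths in the configuration must already exist, for example from a certificate issued by certbot, or nginx -t will fail.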

Backup and Maintenance Workflows

A complete backup covers the database contents, the application data directory, and the media files.

1. Document Exporter Tool

Paperless-ngx provides an internal command that dumps files, database entries, and configs into a structure optimized for portability.

Run the exporter container command:

docker compose exec webserver document_exporter ../export

2. Database Dumps

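Alternatively, back up the raw database directly using pg_dump. Create the target directory first, and pass -T so that no TTY is allocated and the redirected dump is written unaltered:

mkdir -p /opt/paperless/backup
docker compose exec -T db pg_dump -U paperless_usr paperless > /opt/paperless/backup/db_backup_$(date +%F).sql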

Copy /opt/paperless/media, /opt/paperless/config, /opt/paperless/export, and the database dumps to external storage, for example with Restic or rsync, or to object storage such as AWS S3.
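
The two backup steps can be combined into one script with simple retention. The following is a sketch under stated assumptions: the script name, the BACKUP_DIR and KEEP_DAYS variables, and the 14-day retention are all choices made for this example.

```shell
BACKUP_DIR="${BACKUP_DIR:-/opt/paperless/backup}"
KEEP_DAYS="${KEEP_DAYS:-14}"

# Delete SQL dumps in directory $1 that are older than $2 days.
prune_dumps() {
  find "$1" -maxdepth 1 -type f -name 'db_backup_*.sql' -mtime +"$2" -delete
}

# Run the exporter, dump the database, then prune old dumps.
run_backup() {
  mkdir -p "$BACKUP_DIR" || return 1
  cd /opt/paperless || return 1
  docker compose exec -T webserver document_exporter ../export || return 1
  docker compose exec -T db pg_dump -U paperless_usr paperless \
    > "$BACKUP_DIR/db_backup_$(date +%F).sql" || return 1
  prune_dumps "$BACKUP_DIR" "$KEEP_DAYS"
}

# Only act when called as: ./paperless-backup.sh run
if [ "${1:-}" = "run" ]; then
  run_backup
fi
```

A nightly cron entry that calls the script with the run argument would keep the dumps current; syncing the result off the VPS remains a separate step.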