Tabular generate replaced with a durable job queue and worker pool

The biggest structural change in the tabular review subsystem: +1650/-475 in a single commit, swapping the SSE-streaming generate handler for a jobs table with leased worker claims. The author's stated target is 5K-10K-document reviews; the proximate problem was that proxy idle timeouts, tab closes, and backend restarts all killed runs with no recovery path.

Migration 005_tabular_jobs.sql introduces three objects: tabular_jobs (per-review job with status and counters), tabular_job_items (one row per document-to-process), and claim_tabular_job_item(lease_seconds) - a SQL RPC wrapping SELECT ... FOR UPDATE SKIP LOCKED so multiple backend instances can claim items without races. Items whose lease_expires_at has passed are treated as unclaimed, so a worker that crashes releases its items automatically when the next scan runs. The partial index on tabular_job_items(created_at) WHERE status IN ('pending','running') keeps the claim query fast as completed rows accumulate.
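The lease rule described above can be modelled in a few lines. This is an in-memory sketch only, not the migration's SQL: the real claim runs inside the database under FOR UPDATE SKIP LOCKED, and the type names, the `done`/`failed` statuses, and the `claimNext` helper here are hypothetical stand-ins chosen for illustration.

```typescript
// In-memory model of the claim rule. The actual claim happens in SQL;
// every name below is an assumption made for this sketch.
type ItemStatus = "pending" | "running" | "done" | "failed";

interface JobItem {
  id: number;
  status: ItemStatus;
  attemptCount: number;
  leaseExpiresAt: number | null; // epoch milliseconds; null means never claimed
}

// An item is claimable if it is pending, or if it is running but its
// lease has already expired (the worker holding it presumably died).
function isClaimable(item: JobItem, nowMs: number): boolean {
  if (item.status === "pending") return true;
  return (
    item.status === "running" &&
    item.leaseExpiresAt !== null &&
    item.leaseExpiresAt <= nowMs
  );
}

// Claim the first claimable item and stamp a fresh lease on it.
// Returns null when nothing is available.
function claimNext(
  items: JobItem[],
  leaseSeconds: number,
  nowMs: number,
): JobItem | null {
  const item = items.find((candidate) => isClaimable(candidate, nowMs));
  if (!item) return null;
  item.status = "running";
  item.attemptCount += 1;
  item.leaseExpiresAt = nowMs + leaseSeconds * 1000;
  return item;
}
```

Nothing in this model ever explicitly releases a crashed worker's item. The item simply becomes claimable again once the clock passes its lease, which is why no separate cleanup process is needed.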

Business logic moved from routes/tabular.ts into lib/tabularJobs.ts (832 LOC) so the TabularWorkerPool class doesn't import a route file - a clean separation that matters because the pool is started from index.ts after app.listen() and stopped on SIGTERM/SIGINT. The generate endpoint now creates a job and returns immediately; four new endpoints handle polling and control: GET /jobs/:id, GET /jobs/:id/cells, POST /jobs/:id/cancel, and GET /reviews/:id/active-job. The last one is what enables resume-on-page-reload.
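A minimal sketch of the in-process pool shape, assuming a claim function that returns an item id or null and a process function that handles one item. The class name `WorkerPoolSketch` and its constructor parameters are hypothetical; the commit's actual TabularWorkerPool may be structured differently, and whether its shutdown waits for in-flight items is not stated in the source.

```typescript
// Hypothetical function shapes for this sketch.
type ClaimFn = () => Promise<string | null>;
type ProcessFn = (itemId: string) => Promise<void>;

class WorkerPoolSketch {
  private running = false;
  private loops: Promise<void>[] = [];
  private readonly claim: ClaimFn;
  private readonly processItem: ProcessFn;
  private readonly concurrency: number;
  private readonly idleMs: number;

  constructor(claim: ClaimFn, processItem: ProcessFn, concurrency: number, idleMs: number) {
    this.claim = claim;
    this.processItem = processItem;
    this.concurrency = concurrency;
    this.idleMs = idleMs;
  }

  // Start `concurrency` independent claim-and-process loops.
  start(): void {
    if (this.running) return;
    this.running = true;
    for (let i = 0; i < this.concurrency; i++) {
      this.loops.push(this.loop());
    }
  }

  // Ask every loop to exit after its current step, then wait for them.
  async stop(): Promise<void> {
    this.running = false;
    await Promise.all(this.loops);
    this.loops = [];
  }

  private async loop(): Promise<void> {
    while (this.running) {
      const itemId = await this.claim();
      if (itemId === null) {
        // Nothing to do: sleep briefly before scanning again.
        await new Promise<void>((resolve) => setTimeout(resolve, this.idleMs));
        continue;
      }
      try {
        await this.processItem(itemId);
      } catch {
        // Swallow the error so the loop survives. The item keeps its
        // lease until expiry, after which another claim can pick it up.
      }
    }
  }
}
```

In a server entry point, `start()` would be called once the HTTP listener is up and `stop()` wired to the SIGTERM and SIGINT handlers. If the process is killed before `stop()` resolves, the lease expiry covers whatever was still in flight.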

The frontend replaces an EventSource with a pollJob loop in TabularReviewView.tsx. Live progress shows as a "12/200" counter. On remount, the active-job endpoint is checked first; if a job is running it resumes polling from wherever it left off.
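The polling loop can be sketched independently of React. The job shape, the status names, and the injected `fetchJob` callback below are assumptions for illustration; the real pollJob in TabularReviewView.tsx calls the mikeApi.ts helpers and may track different fields.

```typescript
// Hypothetical job snapshot, loosely following the columns named in the commit.
type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

interface JobSnapshot {
  status: JobStatus;
  totalItems: number;
  completedItems: number;
}

// A job in a terminal state never changes again, so polling can stop.
function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed" || status === "cancelled";
}

// Render the "12/200" style counter.
function formatProgress(job: JobSnapshot): string {
  return `${job.completedItems}/${job.totalItems}`;
}

// Poll until the job is terminal, reporting progress after every fetch.
async function pollJobSketch(
  fetchJob: () => Promise<JobSnapshot>,
  onProgress: (label: string) => void,
  pollMs: number,
): Promise<JobSnapshot> {
  for (;;) {
    const job = await fetchJob();
    onProgress(formatProgress(job));
    if (isTerminal(job.status)) return job;
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Because each poll reads the job's current counters from the server, a remounted page needs no saved client state: it asks for the active job and starts the same loop again.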

Configurable via env: TABULAR_GENERATE_CONCURRENCY (default 10), TABULAR_JOB_LEASE_SECONDS (300), TABULAR_WORKER_IDLE_MS (500), NEXT_PUBLIC_TABULAR_POLL_MS (1500). One acknowledged tradeoff: per-cell streaming is gone - a document's full row of cells appears together when the item finishes rather than cell-by-cell.
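One way to read these knobs with their documented defaults is sketched below. The variable names and default values come from the commit; the `readTabularConfig` and `readPositiveInt` helpers, and the choice to fall back on unparsable or non-positive values, are assumptions. The sketch takes the environment as a plain map for simplicity, whereas the real frontend variable (NEXT_PUBLIC_TABULAR_POLL_MS) would be read on the Next.js side rather than in the backend.

```typescript
interface TabularConfig {
  concurrency: number;  // number of worker loops
  leaseSeconds: number; // how long a claimed item stays leased
  workerIdleMs: number; // sleep between empty claim scans
  pollMs: number;       // frontend polling interval
}

// Parse a positive integer, falling back to the default when the value
// is missing, not a number, or not greater than zero.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function readTabularConfig(env: Record<string, string | undefined>): TabularConfig {
  return {
    concurrency: readPositiveInt(env.TABULAR_GENERATE_CONCURRENCY, 10),
    leaseSeconds: readPositiveInt(env.TABULAR_JOB_LEASE_SECONDS, 300),
    workerIdleMs: readPositiveInt(env.TABULAR_WORKER_IDLE_MS, 500),
    pollMs: readPositiveInt(env.NEXT_PUBLIC_TABULAR_POLL_MS, 1500),
  };
}
```

When tuning, note that the lease should comfortably exceed the time one document takes to process. A lease that is too short lets a second worker reclaim an item that is still being worked on.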

So what: Worth importing if your fork has any long-running per-document LLM workload. The jobs+items+leased-claims+in-process-pool+frontend-polling pattern generalizes cleanly, and the SIGTERM shutdown story is exactly the part that's easy to get wrong the first time. This is also the foundation everything else in nwhitehouse's tabular cluster builds on: column reprocess (feat-021), RAG embedding (feat-024), and the polling-flush fix all depend on this job machinery. Don't import later features without this one first.

Commits in this thread

1 commit from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`a663f8df`	[bug-007] Tabular generate as durable job + worker pool (5K-10K-doc scale)	Nick Whitehouse	2026-05-07
commit body

Replaces the previous SSE-streaming generate handler with a durable job
table + in-process worker pool + frontend polling. The previous design
worked at 200-doc scale but fell over at the actual product target
(5K-10K-doc tabular projects):
  - vLLM cannot serve N concurrent inference requests at that scale
  - the request handler tied up an Express worker for hours
  - SSE was a single point of failure (proxy idle, browser tab close,
    backend restart all killed the run with no recovery)
  - in-flight progress was lost on a restart

Schema (migration 005):
  - tabular_jobs(id, review_id, status, total_items, started_at,
    completed_at, cancel_requested_at, error, ...)
  - tabular_job_items(id, job_id, document_id, status, attempt_count,
    lease_expires_at, error, ...)
  - claim_tabular_job_item(lease_seconds) RPC: atomic worker claim via
    FOR UPDATE SKIP LOCKED. Multi-instance safe.
  - RLS via the existing can_access_review() predicate.

Backend:
  - lib/tabularJobs.ts: extraction + LLM helpers moved here so the
    worker doesn't import a route file (no circular deps); added
    createGenerateJob, claimNextItem, processOneJobItem,
    maybeFinalizeJob, TabularWorkerPool.
  - routes/tabular.ts: POST /generate now creates a job and returns
    immediately. New endpoints: GET /jobs/:id, GET /jobs/:id/cells,
    POST /jobs/:id/cancel, GET /reviews/:id/active-job.
  - index.ts: TabularWorkerPool started after app.listen(); SIGTERM/
    SIGINT shutdown stops the loops gracefully (in-flight items
    expire their lease and the next worker reclaims them).

Frontend:
  - mikeApi.ts: removed streamTabularGeneration; added
    startTabularGenerate, getTabularJob, getTabularJobCells,
    cancelTabularJob, getActiveTabularJob.
  - TabularReviewView.tsx: EventSource reader replaced with a
    pollJob loop that surfaces 12/200 progress live and resumes
    automatically on remount via getActiveTabularJob.

Env knobs: TABULAR_GENERATE_CONCURRENCY (workers, default 10),
TABULAR_JOB_LEASE_SECONDS (300), TABULAR_WORKER_IDLE_MS (500),
NEXT_PUBLIC_TABULAR_POLL_MS (1500).

Verified:
  - tsc --noEmit clean (backend + frontend)
  - all 16 backend tests pass (no regressions)
  - migration applied to local Supabase; 6 RLS policies + claim RPC
    + tables + indexes in place.

Known limitation: in-flight cells (worker mid-way through one doc's
columns) aren't surfaced to the frontend until the item reaches a
terminal state. Matches the user's mental model that a doc's row of
cells appears together when its turn finishes. If live per-cell
streaming becomes a requirement, add tabular_cells.updated_at + a
separate query path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
