RAG sidecar squeezed into 320MB on Render's free tier

Davemaina1 pulled embedding and search out of Node entirely, built a FastAPI sidecar, then spent three commits cutting its memory footprint from roughly 1.5GB down to 320MB peak RSS -- small enough to run on Render's 512MB free tier with 192MB to spare.

search, infrastructure

The starting point (c0684a4) is an architectural split: kenyaLawSearch.ts goes from ~200 lines of ONNX-in-Node to a thin HTTP client, and all the ML work moves into rag-service/app.py. The motivation was real -- the in-process ONNX session was causing silent segfaults and tsx-watch failures under concurrent inference because @xenova/transformers v2 holds a single ONNX session that isn't safe to call in parallel. Moving it to Python sidesteps that entirely.
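
The shape of that split can be sketched in a few lines. This is an illustration only: it swaps in Python's standard-library `http.server` for FastAPI so the example runs without dependencies, and the `/search` route, the `query`/`top_k` fields, and the stubbed `search` function are all hypothetical rather than taken from rag-service/app.py.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def search(query: str, top_k: int) -> list[dict]:
    """Stub standing in for the real embedding + vector search."""
    return [{"rank": i + 1, "text": f"result {i + 1} for {query!r}"}
            for i in range(top_k)]


class SearchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/search":  # hypothetical route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        results = search(payload.get("query", ""), int(payload.get("top_k", 5)))
        body = json.dumps({"results": results}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


def start_server() -> HTTPServer:
    """Start the sidecar on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_sidecar(base_url: str, query: str, top_k: int = 3) -> list[dict]:
    """The 'thin client' side: one JSON POST, no ML code in the caller."""
    req = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps({"query": query, "top_k": top_k}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any system proxy settings for the local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.load(resp)["results"]
```

The point of the design is that the caller only ever sees JSON over HTTP, so the inference runtime can be swapped (as the later commits do) without touching the client.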

The memory work unfolds across three later commits. First (7a0da671), torch (~1.5GB) and sentence-transformers get replaced with onnxruntime + tokenizers + huggingface-hub, running the same all-MiniLM-L6-v2 via its ONNX export (~90MB). The CrossEncoder reranker gets dropped in favor of RRF fusion alone. The commit estimates the result at around 340MB.
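
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the general technique: each document scores 1/(k + rank) in every ranked list it appears in, and the scores are summed. The function name, the document IDs, and the constant k=60 (a conventional default) are assumptions for illustration; the commit does not say which constant the service uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    k=60 is a common default, assumed here, not taken from the repo.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from a semantic search and a BM25 search.
semantic = ["a", "b", "c"]
bm25 = ["b", "d", "a"]
fused = rrf_fuse([semantic, bm25])
```

Because RRF only needs rank positions, it fuses lists whose raw scores are not comparable (cosine similarity versus BM25) without loading a second model, which is what makes it a cheap stand-in for a reranker.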

The second cut eliminates ChromaDB. The 86K-chunk corpus is pre-exported to .npz files on a Supabase Storage public bucket. At startup the service downloads ~50MB, dequantizes int8 embeddings back to float32, builds a BM25 index in memory, and answers queries via brute-force numpy dot product -- around 180ms per query at that scale.
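
A rough sketch of that load-and-scan path follows. The quantization scheme (one float32 scale per vector), the array names `embeddings` and `scales`, and the helper names are all assumptions; the commit only says the embeddings are stored as int8 and dequantized at startup. For scale: at 384 dimensions, the output size of all-MiniLM-L6-v2, 86K float32 vectors come to roughly 132MB, versus about 33MB as int8.

```python
import io

import numpy as np


def quantize_int8(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Assumed scheme: symmetric int8 with one scale per vector."""
    scales = np.abs(embeddings).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid dividing by zero for all-zero rows
    quantized = np.round(embeddings / scales).astype(np.int8)
    return quantized, scales.astype(np.float32)


def export_npz(embeddings: np.ndarray) -> bytes:
    """Offline step: pack quantized vectors into a compressed .npz blob."""
    quantized, scales = quantize_int8(embeddings)
    buf = io.BytesIO()
    np.savez_compressed(buf, embeddings=quantized, scales=scales)
    return buf.getvalue()


def load_npz(blob: bytes) -> np.ndarray:
    """Startup step: dequantize int8 back to float32."""
    with np.load(io.BytesIO(blob)) as data:
        return data["embeddings"].astype(np.float32) * data["scales"]


def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 5):
    """Brute-force search: one matrix-vector product, then pick the best k."""
    scores = matrix @ query_vec
    k = min(k, scores.shape[0])
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    idx = idx[np.argsort(-scores[idx])]        # order them best-first
    return idx, scores[idx]
```

With unit-normalized vectors the dot product equals cosine similarity, so no extra normalization is needed at query time.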

The final pass (3ec3223) writes corpus files to /tmp, memory-maps the embedding array instead of loading it into RAM, and reads metadata only for the top-K results of each query. BM25 gets dropped entirely (saving about 150MB), leaving semantic-only search. Peak RSS is measured at 320MB, and chromadb and rank-bm25 disappear from requirements.
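
The memory-mapping step might look something like the sketch below. It is not the commit's code: the function names and the chunk size are hypothetical, and it assumes the array is written as a plain `.npy` file, because as far as I know NumPy's `mmap_mode` applies to `.npy` files and not to members of a `.npz` archive. Scanning in row blocks keeps only one slice of the mapped file in play per step, with the OS paging data in on demand.

```python
import os
import tempfile

import numpy as np


def write_embeddings(path: str, embeddings: np.ndarray) -> None:
    """Write float32 embeddings as a plain .npy file (assumed on-disk format)."""
    np.save(path, embeddings.astype(np.float32))


def open_embeddings(path: str) -> np.ndarray:
    """Map the file read-only; pages are faulted in lazily by the OS."""
    return np.load(path, mmap_mode="r")


def search_chunked(query_vec, matrix, k: int = 5, chunk_rows: int = 8192):
    """Brute-force top-k over a memory-mapped matrix, one row block at a time."""
    best_scores = np.empty(0, dtype=np.float32)
    best_idx = np.empty(0, dtype=np.int64)
    for start in range(0, matrix.shape[0], chunk_rows):
        block = matrix[start:start + chunk_rows]
        scores = np.asarray(block @ query_vec)
        idx = np.arange(start, start + scores.shape[0])
        # Merge this block's candidates with the running top-k.
        best_scores = np.concatenate([best_scores, scores])
        best_idx = np.concatenate([best_idx, idx])
        keep = np.argsort(-best_scores)[:k]
        best_scores, best_idx = best_scores[keep], best_idx[keep]
    return best_idx, best_scores
```

A usage sketch, with a temporary directory standing in for /tmp: call `write_embeddings(os.path.join(tmpdir, "embeddings.npy"), vectors)` once after download, then `open_embeddings(...)` at startup and `search_chunked(query_vec, mapped)` per query.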

So what

Worth a close look if you're deploying a legal-document search feature on constrained infrastructure and a managed vector DB is overhead you'd rather avoid. The pattern -- int8-quantized `.npz` on object storage, mmap on startup, brute-force dot product -- is clean and replicable. A brute-force scan grows linearly with corpus size, so the 180ms/query measured at 86K chunks should stay workable up to a few hundred thousand chunks; past that you'll need an ANN index. Davemaina1 explicitly marks this a "testing phase" approach, so don't treat the corpus size as validated for production scale.



Commits in this thread

5 commits from Davemaina1/iroh_, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`c0684a41`	RAG sidecar: Python service for embedding+BM25+rerank; Node becomes HTTP client	Davemaina1	2026-05-13
`bed9cc2f`	fix(rag): pin Python 3.11, relax torch/numpy version constraints	Davemaina1	2026-05-14
Commit body: Render defaults to Python 3.14 which doesn't have torch 2.5.1 wheels. Pin to 3.11 via .python-version and allow any torch 2.x. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`7a0da671`	feat(rag): replace torch with onnxruntime to fit 512MB RAM	Davemaina1	2026-05-14
Commit body: Eliminates torch (1.5GB) and sentence-transformers entirely. Uses onnxruntime + tokenizers + huggingface-hub to run the same all-MiniLM-L6-v2 model via its ONNX export (~90MB). Drops CrossEncoder reranker - RRF fusion alone is sufficient for the testing phase. Estimated memory: ~340MB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`0b5251e8`	feat(rag): eliminate ChromaDB dependency, load corpus from Supabase Storage	Davemaina1	2026-05-14
Commit body: Removes both torch/sentence-transformers AND ChromaDB from production. Corpus (86K chunks, embeddings, metadata) is pre-exported to .npz files hosted on Supabase Storage (public bucket). On startup, the service downloads ~50MB, dequantizes int8 embeddings, and builds a BM25 index. Semantic search is brute-force numpy dot product (~180ms/query for 86K vectors). Total runtime memory: ~350MB (fits in Render's 512MB free tier). Zero additional services required. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`3ec32238`	feat(rag): mmap embeddings + on-demand metadata - peak 320MB RSS	Davemaina1	2026-05-14
Commit body: Downloads corpus files to /tmp, memory-maps the embeddings (zero RAM cost), and reads metadata only for the top-K results on each query. Drops BM25 (150MB overhead) - semantic-only search is good enough for testing phase. Removes chromadb and rank-bm25 dependencies entirely. Measured peak RSS: 320MB (well within Render's 512MB free tier). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
