AI · 2025
DocLens.
Hybrid retrieval over enterprise docs — FAISS dense vectors fused with BM25, then re-ranked.
Role
Solo Engineer
Team
Solo project
Stack
Python · FAISS · rank-bm25 · sentence-transformers · FastAPI
Year
2025
01 / The problem
Why this needed building.
Pure vector search misses literal-term queries; pure BM25 misses paraphrased intent. Enterprise documentation queries shift between both styles within a single session: the name of an internal system in one query, a conceptual question in the next. The retrieval layer has to handle both without per-query hand-tuning.
02 / Approach
How I broke it down.
- 01
Chunk documents with a semantic-aware splitter (token budget + paragraph boundary fallback) so neither index over-fragments a coherent argument.
- 02
Build a dense index in FAISS from sentence-transformers MPNet embeddings and a sparse index with rank-bm25 over the same chunk set, so both retrievers rank the same candidates.
- 03
Fuse at query time using reciprocal rank fusion (RRF). It works on ranks rather than scores, so I don't need to normalize across two very different score distributions.
- 04
Re-rank the fused top-k with a cross-encoder to produce the final list; this is cheap because k is small after fusion.
03 / System
The pipeline, stage by stage.
Each stage is small on its own; what matters is the composition.
STAGE / 01
Chunk.
A semantic-aware splitter respects the token budget while keeping paragraph boundaries intact. Coherent arguments stay together.
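A minimal sketch of this stage, under stated assumptions: paragraphs are separated by blank lines, whitespace-separated words stand in for real tokenizer tokens, and the `chunk_text` name and the `max_tokens` default are hypothetical. It packs whole paragraphs into a chunk until the budget would be exceeded, and hard-splits only a paragraph that is larger than the budget on its own. It is an illustration, not the project's actual splitter.

```python
import re


def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    def flush() -> None:
        # Emit the paragraphs collected so far as one chunk.
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0

    for para in paragraphs:
        tokens = para.split()  # assumption: whitespace words approximate tokens
        if len(tokens) > max_tokens:
            # Oversized paragraph: close the open chunk, then split by budget.
            flush()
            for i in range(0, len(tokens), max_tokens):
                chunks.append(" ".join(tokens[i:i + max_tokens]))
            continue
        if current_len + len(tokens) > max_tokens:
            flush()
        current.append(para)
        current_len += len(tokens)
    flush()
    return chunks
```

Swapping the whitespace split for the embedding model's tokenizer would make the budget match what the dense index actually sees.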
STAGE / 02
Dense index.
FAISS over MPNet embeddings. Captures paraphrased and conceptual intent; weak on literal-term lookups.
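To make the dense lookup concrete, here is a dependency-free stand-in that performs an exact cosine top-k search over chunk vectors. This is the same computation a flat inner-product FAISS index (for example `IndexFlatIP`) performs over L2-normalized vectors; the write-up does not say which FAISS index type is used, so that is an assumption. The tiny 2-D vectors are placeholders for real MPNet embeddings, and `dense_search` is a hypothetical name.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dense_search(
    query_vec: list[float],
    chunk_vecs: list[list[float]],
    top_k: int = 3,
) -> list[tuple[int, float]]:
    """Return (chunk_index, cosine_similarity) pairs, best first."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Only the ordering of chunk indices is needed downstream, since fusion works on ranks rather than on the similarity values.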
STAGE / 03
Sparse index.
rank-bm25 over the same chunk set so the two retrievers stay aligned. Captures literal names, acronyms, IDs.
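The following sketch shows why the sparse side catches literal tokens. It is a textbook BM25 scorer with a non-negative IDF variant, written out for illustration; it is not rank-bm25's exact implementation, and the `k1` and `b` defaults, the lowercase whitespace tokenizer, and the function names are all assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; keeps IDs such as 'inc-2041' intact."""
    return text.lower().split()


def bm25_scores(
    query_tokens: list[str],
    tokenized_chunks: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score every chunk against the query with textbook BM25."""
    if not tokenized_chunks:
        return []
    n = len(tokenized_chunks)
    avg_len = sum(len(c) for c in tokenized_chunks) / n
    doc_freq: Counter = Counter()
    for chunk in tokenized_chunks:
        doc_freq.update(set(chunk))  # count each term once per chunk

    scores = []
    for chunk in tokenized_chunks:
        tf = Counter(chunk)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(chunk) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def sparse_rank(query: str, chunks: list[str]) -> list[int]:
    """Return chunk indices ordered by BM25 score, best first."""
    scores = bm25_scores(tokenize(query), [tokenize(c) for c in chunks])
    return sorted(range(len(chunks)), key=lambda i: -scores[i])
```

A query made of a single identifier scores zero everywhere except in chunks that contain that exact token, which is the behaviour the dense index lacks.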
STAGE / 04
Fuse.
Reciprocal rank fusion at query time. Score-agnostic — no normalization needed across two very different distributions.
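Reciprocal rank fusion can be sketched in a few lines. Each retriever contributes 1 / (k + rank) for every chunk it returns, and the contributions are summed per chunk. The constant `k = 60` is the value commonly used in the RRF literature; the write-up does not state which value DocLens uses, so treat it as an assumption, along with the `rrf_fuse` name and the sample chunk IDs.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of chunk IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking exact ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c3", "c1", "c2"]   # hypothetical dense ranking
sparse = ["c1", "c4", "c3"]  # hypothetical sparse ranking
fused = rrf_fuse([dense, sparse])  # -> ['c1', 'c3', 'c4', 'c2']
```

Because only positions are used, a cosine similarity near 0.8 and a BM25 score near 12 never have to be put on a common scale.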
STAGE / 05
Re-rank.
Cross-encoder over the top-k. Small k keeps the cost bounded; this is where the final ordering is earned.
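A sketch of the re-rank step with the scorer left pluggable. In the real pipeline the scorer would be a cross-encoder (in sentence-transformers, `CrossEncoder.predict` scores query and passage pairs); here a toy word-overlap function stands in so the example runs without a model download. The `rerank` and `toy_overlap_score` names and the `top_k` default are hypothetical.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Score only the first top_k fused candidates and reorder them."""
    head = list(candidates[:top_k])  # bounded cost: at most top_k scorer calls
    scored = [(score_pair(query, text), text) for text in head]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: number of distinct query words found in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

The scorer runs once per candidate, so the cost of this stage grows linearly with `top_k` and is independent of corpus size.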
04 / Outcomes
What it ended up being good at.
Recall@10 improved over the dense-only baseline on a hand-built eval set of 80 queries spanning both literal and conceptual styles.
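For reference, Recall@k for one query is the fraction of that query's relevant chunks that appear in the top k results, averaged over the eval set. The sketch below assumes each eval entry pairs a ranked list of chunk IDs with a set of relevant IDs; the function names and data shape are hypothetical, and no numbers from the actual evaluation are reproduced here.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant chunk IDs found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Running the same function over the dense-only ranking and the fused ranking gives a like-for-like comparison on the same queries.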
p95 latency stayed under 350 ms end-to-end on a single machine; the cross-encoder dominates the cost, not the fusion.
I'm currently exploring learned fusion weights to replace the static 0.5/0.5 weighting in RRF; I'm treating this as DocLens v2.
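One way to read the 0.5/0.5 setting is as weighted RRF with equal per-retriever weights. That reading is an assumption, as are the symbols below: w_r is the weight of retriever r, rank_r(d) is the position of chunk d in that retriever's list, and k is the RRF smoothing constant.

```latex
\mathrm{score}(d) \;=\; \sum_{r \in \{\mathrm{dense},\, \mathrm{sparse}\}} \frac{w_r}{k + \mathrm{rank}_r(d)},
\qquad w_{\mathrm{dense}} = w_{\mathrm{sparse}} = 0.5
```

With equal weights this is plain RRF scaled by a constant, so the ordering is unchanged. Learned weights would let the two retrievers contribute unequally.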
