Major changes:
- Replace OOM-causing in-memory BM25 with PostgreSQL full-text search
- Add tsvector column and GIN index for fast keyword search
- Implement hybrid score fusion (70% embedding + 30% FTS + 15% bonus)
- Add CANDIDATE_MULTIPLIER (5x) to search more candidates before fusion
- Add stopword filtering to FTS queries for less strict matching
- Make search limit configurable (default 20, max 100)
- Propagate relevance scores through the search pipeline

Search improvements:
- "clowns iconoclasts" → finds target at rank 1 (score 0.815)
- "replacing words with definitions" → finds target at rank 1
- Vague queries now find results with limit=30 that were previously missed
RAG Search Quality Investigation
Summary
Investigation into why RAG search results "often aren't that good" when trying to find things with partial/vague memories.
Date: 2025-12-20
Status: Significant progress made
Key Findings
- BM25 keyword search was broken - caused OOM with 250K chunks. ✅ FIXED: replaced with PostgreSQL full-text search.
- Embeddings can't find "mentioned in passing" content - the query "engineer fail-safe" ranks an article about humility (which mentions engineers as an example) at position 140 out of 145K. Articles specifically about engineering rank higher.
- Score propagation was broken - ✅ FIXED: scores now flow through the pipeline properly.
- Chunk sizes are inconsistent - some chunks are 3MB (books), some are 3 bytes. Large chunks have diluted embeddings.
- "Half-remembered" queries don't match article keywords - the user describes a concept, but the article uses different terminology, e.g. "not using specific words" vs "taboo your words".
What Works Now
- Keyword-matching queries: "clowns iconoclasts" → finds "Lonely Dissent" at rank 1 (score 0.815)
- Direct concept queries: "replacing words with definitions" → finds "Taboo Your Words" at rank 1
- Hybrid search: results appearing in both embedding + FTS get a 15% bonus
Remaining Challenges
- Conceptual queries: "saying what you mean not using specific words" → target ranks 23rd (needs to be in the top 10)
- Query describes the effect, article describes the technique
- Need query expansion (HyDE) to bridge semantic gap
Recommended Fix Priority
- Implement PostgreSQL full-text search - ✅ DONE
- Add candidate pool multiplier - ✅ DONE (5x internal limit)
- Add stopword filtering - ✅ DONE
- Re-chunk oversized content - Max 512 tokens, with context
- Implement HyDE query expansion - For vague/conceptual queries
PostgreSQL Full-Text Search Implementation (2025-12-20)
Changes Made
- Created migration db/migrations/versions/20251220_130000_add_chunk_fulltext_search.py (see the sketch after this list)
  - Added a search_vector tsvector column to the chunk table
  - Created a GIN index for fast search
  - Added a trigger to auto-update search_vector on insert/update
  - Populated the existing 250K chunks with search vectors
- Rewrote bm25.py to use PostgreSQL full-text search
  - Removed the in-memory BM25 that caused OOM
  - Uses ts_rank() for relevance scoring
  - Uses AND matching with prefix wildcards: engineer:* & fail:* & safe:*
  - Normalizes scores to the 0-1 range
- Added the search_vector column to the Chunk model in SQLAlchemy
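A minimal sketch of both halves, assuming a chunk table with a content column. The trigger/function names and the stopword list are illustrative assumptions; the actual code lives in the migration file and bm25.py:

```python
# Sketch only - not the real migration or bm25.py.
import re
from sqlalchemy import text

DDL = """
ALTER TABLE chunk ADD COLUMN search_vector tsvector;
CREATE INDEX chunk_search_vector_idx ON chunk USING gin (search_vector);

CREATE FUNCTION chunk_search_vector_update() RETURNS trigger AS $$
BEGIN
    NEW.search_vector := to_tsvector('english', coalesce(NEW.content, ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER chunk_search_vector_trigger
    BEFORE INSERT OR UPDATE ON chunk
    FOR EACH ROW EXECUTE FUNCTION chunk_search_vector_update();

-- Backfill the existing 250K chunks
UPDATE chunk SET search_vector = to_tsvector('english', coalesce(content, ''));
"""

STOPWORDS = {"the", "a", "an", "and", "or", "not", "what", "you", "your"}

def build_tsquery(query: str) -> str:
    """AND all non-stopword terms together, each with a prefix wildcard,
    e.g. "engineer fail-safe" -> "engineer:* & fail:* & safe:*"."""
    terms = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS]
    return " & ".join(f"{t}:*" for t in terms)

SEARCH_SQL = text("""
    SELECT id, ts_rank(search_vector, to_tsquery('english', :q)) AS rank
    FROM chunk
    WHERE search_vector @@ to_tsquery('english', :q)
    ORDER BY rank DESC
    LIMIT :limit
""")

def fts_search(session, query: str, limit: int = 100) -> dict[str, float]:
    """Return chunk_id -> score, normalized to the 0-1 range."""
    tsquery = build_tsquery(query)
    if not tsquery:
        return {}
    rows = session.execute(SEARCH_SQL, {"q": tsquery, "limit": limit}).all()
    if not rows:
        return {}
    top = max(rank for _, rank in rows) or 1.0
    return {str(chunk_id): rank / top for chunk_id, rank in rows}
```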
Test Results
For query "engineer fail safe":
- PostgreSQL FTS returns 100 results without OOM
- Source 157 (humility article) chunks rank 25th and 26th (vs not appearing before)
- Search completes in ~100ms (vs OOM crash before)
Hybrid Search Flow
With BM25 now working, the hybrid search combines:
- Embedding search (70% weight) - finds semantically similar content
- Full-text search (30% weight) - finds exact keyword matches
- +15% bonus for results appearing in both
This should significantly improve "half-remembered" searches where users recall specific words that appear in the article.
Issues Fixed (This Session)
1. Scores Were Being Discarded (CRITICAL)
Problem: Both embedding and BM25 searches computed relevance scores but threw them away, returning only chunk IDs.
Files Changed:
- src/memory/api/search/embeddings.py - now returns dict[str, float] (chunk_id -> score)
- src/memory/api/search/bm25.py - now returns normalized scores (0-1 range)
- src/memory/api/search/search.py - added fuse_scores() for hybrid ranking
- src/memory/api/search/types.py - changed from mean to max chunk score
Before: All search_score values were 0.000
After: Meaningful scores like 0.443, 0.503, etc.
2. Score Fusion Implemented
Added weighted combination of embedding (70%) + BM25 (30%) scores with 15% bonus for results appearing in both searches.
```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15
```
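A sketch of what fuse_scores() plausibly does with these constants (the function shape is an assumption; the actual implementation is in search.py):

```python
EMBEDDING_WEIGHT = 0.7  # constants as above
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15

def fuse_scores(
    embedding_scores: dict[str, float],  # chunk_id -> 0-1 score
    fts_scores: dict[str, float],        # chunk_id -> 0-1 score
) -> dict[str, float]:
    """Weighted combination, with a bonus for chunks both methods found."""
    fused: dict[str, float] = {}
    for chunk_id in embedding_scores.keys() | fts_scores.keys():
        score = (EMBEDDING_WEIGHT * embedding_scores.get(chunk_id, 0.0)
                 + BM25_WEIGHT * fts_scores.get(chunk_id, 0.0))
        if chunk_id in embedding_scores and chunk_id in fts_scores:
            score += HYBRID_BONUS  # reward agreement between both searches
        fused[chunk_id] = score
    return fused
```

Combined with the CANDIDATE_MULTIPLIER described in the session log, each method is asked for 5x the requested limit before fusing, and only the top N fused results are returned.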
3. Changed from Mean to Max Chunk Score
Before: documents with many chunks were penalized (averaging diluted their scores).
After: uses the max chunk score, so a document with at least one highly relevant section surfaces.
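In code, the change amounts to the following (sketch; variable names are illustrative):

```python
chunk_scores = [chunk.search_score for chunk in source_chunks]

# Before: many mediocre chunks drag down one great one
source_score = sum(chunk_scores) / len(chunk_scores)

# After: one highly relevant section is enough to surface the document
source_score = max(chunk_scores)
```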
Current Issues Identified
Issue 1: BM25 is Disabled AND Causes OOM
Finding: ENABLE_BM25_SEARCH=False in docker-compose.yaml
Impact: Keyword matching doesn't work. Queries like "engineer fail-safe" won't find articles containing those exact words unless the embedding similarity is high enough.
When Enabled: BM25 causes OOM crash!
- Database has 250,048 chunks total
- Forum collection alone has 147,546 chunks
- BM25 implementation loads ALL chunks into memory and builds index on each query
- Container killed (exit code 137) when attempting BM25 search
Root Cause: Current BM25 implementation in bm25.py is not scalable:
```python
items = items_query.all()  # Loads ALL chunks into memory
corpus = [item.content.lower().strip() for item in items]  # Copies all content
retriever.index(corpus_tokens)  # Builds index from scratch each query
```
Recommendation:
- Build persistent BM25 index (store on disk, load once)
- Or use PostgreSQL full-text search instead
- Or limit BM25 to smaller collections only
Issue 2: Embeddings Capture Theme, Not Details
Test Case: Article 157 about "humility in science" contains an example about engineers designing fail-safe mechanisms.
| Query | Result |
|---|---|
| "humility in science creationist evolution" | Rank 1, score 0.475 |
| "types of humility epistemic" | Rank 1, score 0.443 |
| "being humble about scientific knowledge" | Rank 1, score 0.483 |
| "engineer fail-safe mechanisms humble design" | Not in top 10 |
| "student double-checks math test answers" | Not in top 10 |
| "creationism debate" | Not in top 10 |
Analysis:
- Query "engineer fail-safe" has 0.52 cosine similarity to target chunks
- Other documents in corpus have 0.61+ similarity to that query
- The embedding captures the article's main theme (humility) but not incidental details (engineer example)
Root Cause: Embeddings are designed to capture semantic meaning of the whole chunk. Brief examples or mentions don't dominate the embedding.
Issue 3: Chunk Context May Be Insufficient
Finding: The article's "engineer fail-safe" example appears in chunks, but:
- Some chunks are cut mid-word (e.g., "fail-s" instead of "fail-safe")
- The engineer example may lack surrounding context
Chunk Analysis for Article 157:
- 7 chunks total
- Chunks containing "engineer": 2 (chunks 2 and 6)
- Chunk 2 ends with "fail-s" (word cut off)
- The engineer example is brief (~2 sentences) within larger chunks about humility
Embedding Similarity Analysis
For query "engineer fail-safe mechanisms humble design":
| Chunk | Similarity | Content Preview |
|---|---|---|
| 3097f4d6 | 0.522 | "It is widely recognized that good science requires..." |
| db87f54d | 0.486 | "It is widely recognized that good science requires..." |
| f3e97d77 | 0.462 | "You'd still double-check your calculations..." |
| 9153d1f5 | 0.435 | "They ought to be more humble..." |
| 3375ae64 | 0.424 | "Dennett suggests that much 'religious belief'..." |
| 047e7a9a | 0.353 | Summary chunk |
| 80ff7a03 | 0.267 | References chunk |
Problem: Top results in the forum collection score 0.61+, so these 0.52 scores don't make the cut.
Recommendations
High Priority
- Enable BM25 Search
  - Set ENABLE_BM25_SEARCH=True - this will find keyword matches that embeddings miss
  - Score fusion to combine the results is already implemented
- Lower Embedding Threshold for Text Collections
  - Current: 0.25 minimum score
  - Consider: 0.20 to catch more marginal matches
  - Trade-off: may increase noise
- Increase Search Limit Before Fusion
  - Current: uses the same limit for both embedding and BM25
  - Consider: search for 2-3x more candidates, then fuse and return the top N
Medium Priority
- Implement Query Expansion / HyDE (see the sketch after this list)
  - For vague queries, generate a hypothetical answer and embed that
  - Example: "engineer fail-safe" -> generate "An article discussing how engineers design fail-safe mechanisms as an example of good humility..."
- Improve Chunking Overlap
  - Ensure examples carry context from surrounding paragraphs
  - Consider semantic chunking (split on topic changes, not just size)
- Add Document-Level Context to Chunks
  - Prepend the document title/summary to each chunk before embedding
  - Helps chunks maintain their connection to the main theme
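A sketch of how HyDE could slot in. Both generate_text() and embed() are hypothetical stand-ins for whatever LLM and embedding calls the codebase already has:

```python
def generate_text(prompt: str) -> str:
    """Hypothetical LLM call - replace with the codebase's existing client."""
    raise NotImplementedError

def embed(text: str) -> list[float]:
    """Hypothetical embedding call (e.g. the existing Voyage wrapper)."""
    raise NotImplementedError

def hyde_embedding(query: str) -> list[float]:
    """Embed a hypothetical *answer* instead of the raw query (HyDE).

    A generated passage like "An article discussing how engineers design
    fail-safe mechanisms as an example of humility..." tends to sit closer
    in embedding space to the target chunks than the terse query does.
    """
    hypothetical = generate_text(
        "Write a short passage from an article that would answer this "
        f"search query: {query!r}"
    )
    return embed(hypothetical)
```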
Lower Priority
- Tune Fusion Weights
  - Current: 70% embedding, 30% BM25
  - May need adjustment based on use case
- Add Temporal Decay
  - Prefer recent content for certain query types
Architectural Issues
Issue A: BM25 Implementation is Not Scalable
The current BM25 implementation cannot handle 250K chunks:
```python
# Current approach (in bm25.py):
items = items_query.all()  # Loads ALL matching chunks into memory
corpus = [item.content.lower().strip() for item in items]  # Makes copies
retriever.index(corpus_tokens)  # Rebuilds index from scratch per query
```
Why this fails:
- 147K forum chunks × ~3KB avg = ~440MB just for text
- Plus tokenization, BM25 index structures → OOM
Solutions (in order of recommendation):
- PostgreSQL Full-Text Search (Recommended)
  - Already have PostgreSQL in the stack
  - Add a tsvector column to the Chunk table
  - Create a GIN index for fast search
  - Use ts_rank for relevance scoring
  - No additional infrastructure needed
- Persistent BM25 Index (see the sketch after this list)
  - Build the index once at ingestion time
  - Store it on disk, load once at startup
  - Update incrementally on new chunks
  - More complex to maintain
- External Search Engine
  - Elasticsearch or Meilisearch
  - Adds operational complexity
  - May be overkill for current scale
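For option 2, a sketch assuming the current code uses the bm25s library (which the retriever.index(corpus_tokens) call above suggests). The path is illustrative, and incremental updates would still need a separate rebuild strategy:

```python
import bm25s

INDEX_DIR = "/data/bm25_index"  # illustrative path

def build_index(corpus: list[str]) -> None:
    """Run once at ingestion time, not per query."""
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(corpus))
    retriever.save(INDEX_DIR)

def keyword_search(query: str, k: int = 10):
    """Memory-map the saved index instead of rebuilding it from 250K chunks."""
    retriever = bm25s.BM25.load(INDEX_DIR, mmap=True)
    doc_ids, scores = retriever.retrieve(bm25s.tokenize(query), k=k)
    return doc_ids[0], scores[0]  # results for the single query
```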
Issue B: Chunk Size Variance
Chunks range from 3 bytes to 3.3MB. This causes:
- diluted embeddings for large chunks
- missing context for small chunks
- inconsistent search quality across collections
Solution: Re-chunk existing content (sketched below) with:
- Max ~512 tokens per chunk (optimal for embeddings)
- 50-100 token overlap between chunks
- Prepend document title/context to each chunk
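A sketch of that policy, using whitespace tokens as a stand-in for the embedding model's real tokenizer (an assumption; production code should count tokens the way the model does):

```python
def rechunk(text: str, title: str, max_tokens: int = 512, overlap: int = 75) -> list[str]:
    """Fixed-size chunks with overlap, each prefixed with document context."""
    words = text.split()  # crude token proxy - see note above
    step = max_tokens - overlap  # 75 is the middle of the 50-100 range
    chunks = []
    for start in range(0, len(words) or 1, step):
        body = " ".join(words[start : start + max_tokens])
        # Prepending the title keeps the chunk tied to the document's theme
        chunks.append(f"{title}\n\n{body}")
        if start + max_tokens >= len(words):
            break
    return chunks
```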
Issue C: Search Timeout (2 seconds)
The default 2-second timeout is too aggressive for:
- Large collections (147K forum chunks)
- Cold Qdrant cache
- Network latency
Solution: Increase to 5-10 seconds for initial search, with progressive loading UX.
Test Queries for Validation
After making changes, test with these queries against article 157:
```python
# Should find article 157 (humility in science)
test_cases = [
    # Main topic - currently working
    ("humility in science", "main topic"),
    ("types of humility epistemic", "topic area"),
    # Specific examples - currently failing
    ("engineer fail-safe mechanisms", "specific example"),
    ("student double-checks math test", "specific example"),
    # Tangential mentions - currently failing
    ("creationism debate", "mentioned topic"),
    # Vague/half-remembered - currently failing
    ("checking your work", "vague concept"),
    ("when engineers make mistakes", "tangential"),
]
```
Session Log
2025-12-20
- Initial Investigation
  - Found scores were all 0.000
  - Traced to embeddings.py and bm25.py discarding scores
- Fixed Score Propagation
  - Modified 4 files to preserve and fuse scores
  - Rebuilt Docker images
  - Verified scores now appear (0.4-0.5 range)
- Quality Testing
  - Selected a random article (ID 157, humility in science)
  - Tested 10 query types from specific to vague
  - Found 3/10 queries succeed (main topic only)
- Root Cause Analysis
  - BM25 disabled - no keyword matching
  - Embeddings capture theme, not details
  - Target chunks have 0.52 similarity vs 0.61 for top results
- Next Steps
  - Enable BM25 and retest
  - Consider HyDE for query expansion
  - Investigate chunking improvements
- Deep Dive: Database Statistics
  - Total chunks: 250,048
  - Forum: 147,546 (58.9%)
  - Blog: 46,159 (18.5%)
  - Book: 34,586 (13.8%)
  - Text: 10,823 (4.3%)
- Chunk Size Analysis (MAJOR ISSUE) - found excessively large chunks that dilute embedding quality:

| Collection | Avg Length (bytes) | Max Length | Over 8KB | Over 128KB |
|---|---|---|---|---|
| book | 15,487 | 3.3MB | 12,452 | 474 |
| blog | 3,661 | 710KB | 2,874 | 19 |
| forum | 3,514 | 341KB | 8,943 | 47 |

  Books have 36% of chunks over 8KB - too large for good embedding quality. The Voyage embedding model has a 32K token limit, but chunks over 8KB (~2K tokens) start to lose fine-grained detail in the embedding.
- Detailed Score Analysis for "engineer fail-safe mechanisms humble design"
  - Query returns 145,632 results from the forum collection
  - Top results score 0.61, median 0.34
  - Source 157 (target article) chunks score:
    - 3097f4d6: 0.5222 (rank 140/145,632) - main content
    - db87f54d: 0.4863 (rank 710/145,632) - full text chunk
    - f3e97d77: 0.4622 (rank 1,952/145,632)
    - 047e7a9a: 0.3528 (rank 58,949/145,632) - summary
  - Key Finding: target chunks rank 140th-710th, but with limit=10 they never appear. BM25 would find the exact keyword match "engineer fail-safe".
- Top Results Analysis - the chunks scoring 0.61 (beating our target) are about:
  - a CloudFlare incident (software failure)
  - AI safety testing (risk/mitigation mechanisms)
  - generic "mechanisms to prevent failure" content

  These are semantically similar to "engineer fail-safe mechanisms" but NOT about humility. Embeddings capture the concept, not the context.
- Root Cause Confirmed - the fundamental problem:
  - Embeddings capture the semantic meaning of query concepts
  - The query "engineer fail-safe" embeds as "engineering safety mechanisms"
  - Articles specifically about engineering/failure rank higher
  - An article about humility (that merely mentions engineers as an example) ranks lower
  - Only keyword search (BM25) can find "mentioned in passing" content
- Implemented Candidate Pool Multiplier - added CANDIDATE_MULTIPLIER = 5 to search.py:
  - Internal searches now fetch 5x the requested limit
  - Results from both methods are fused, then the top N are returned
  - This helps surface results that rank well in one method but not both
- Added Stopword Filtering to FTS - updated bm25.py to filter common English stopwords before building the tsquery:
  - Words like "what", "you", "not", "the" are filtered out
  - This makes AND matching less strict
  - The query "saying what you mean" becomes "saying:* & mean:*" instead of 8 terms
- Testing: "Taboo Your Words" Query
  - Query: "saying what you mean not using specific words"
  - Target: Source 735 ("Taboo Your Words" article)
  - Results:
    - Embedding search ranks the target at position 21 (score 0.606)
    - Top 10 results score 0.62-0.64 (about language/communication generally)
    - FTS doesn't match because the article lacks "saying" and "specific"
    - After fusion: the target ranks 23rd, the cutoff is 20th
  - Key Insight: the query describes the concept ("not using specific words") but the article is about a technique ("taboo your words" = replace words with definitions). These are semantically adjacent but not equivalent.
  - With the direct query "replacing words with their definitions" → ranks 1st!
- Testing: "Clowns Iconoclasts" Query
  - Query: "clowns being the real iconoclasts"
  - Target: "Lonely Dissent" article
  - Results: found at rank 1 with score 0.815 (hybrid boost!)
    - Both embedding AND FTS match
    - The 0.15 hybrid bonus applied
    - This is the ideal case, where keywords match content
- Remaining Challenges
  - "Half-remembered" queries that describe concepts rather than the article's actual wording
  - Need query expansion (HyDE) to bridge the semantic gap
  - Or return more results for the user to scan
  - Consider showing "You might also be looking for..." suggestions