memory/docs/SEARCH_INVESTIGATION.md
mruwnik e414c3311c Improve RAG search quality with PostgreSQL FTS and hybrid scoring
Major changes:
- Replace OOM-causing in-memory BM25 with PostgreSQL full-text search
- Add tsvector column and GIN index for fast keyword search
- Implement hybrid score fusion (70% embedding + 30% FTS + 15% bonus)
- Add CANDIDATE_MULTIPLIER (5x) to search more candidates before fusion
- Add stopword filtering to FTS queries for less strict matching
- Make search limit configurable (default 20, max 100)
- Propagate relevance scores through the search pipeline

Search improvements:
- "clowns iconoclasts" → finds target at rank 1 (score 0.815)
- "replacing words with definitions" → finds target at rank 1
- Vague queries now find results with limit=30 that were previously missed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 15:54:30 +00:00


# RAG Search Quality Investigation
## Summary
Investigation into why RAG search results "often aren't that good" when trying to find things with partial/vague memories.
**Date:** 2025-12-20
**Status:** Significant Progress Made
### Key Findings
1. **BM25 keyword search was broken** - Caused OOM with 250K chunks. ✅ FIXED: Replaced with PostgreSQL full-text search.
2. **Embeddings can't find "mentioned in passing" content** - Query "engineer fail-safe" ranks article about humility (that mentions engineers as example) at position 140 out of 145K. Articles specifically about engineering rank higher.
3. **Score propagation was broken** - ✅ FIXED: Scores now flow through properly.
4. **Chunk sizes are inconsistent** - Some chunks are 3MB (books), some are 3 bytes. Large chunks have diluted embeddings.
5. **"Half-remembered" queries don't match article keywords** - User describes concept, but article uses different terminology. E.g., "not using specific words" vs "taboo your words".
### What Works Now
- **Keyword-matching queries**: "clowns iconoclasts" → finds "Lonely Dissent" at rank 1 (score 0.815)
- **Direct concept queries**: "replacing words with definitions" → finds "Taboo Your Words" at rank 1
- **Hybrid search**: Results appearing in both embedding + FTS get 15% bonus
### Remaining Challenges
- **Conceptual queries**: "saying what you mean not using specific words" → target ranks 23rd (needs top 10)
- Query describes the *effect*, article describes the *technique*
- Need query expansion (HyDE) to bridge semantic gap
### Recommended Fix Priority
1. **Implement PostgreSQL full-text search** - ✅ DONE
2. **Add candidate pool multiplier** - ✅ DONE (5x internal limit)
3. **Add stopword filtering** - ✅ DONE
4. **Re-chunk oversized content** - Max 512 tokens, with context
5. **Implement HyDE query expansion** - For vague/conceptual queries
---
## PostgreSQL Full-Text Search Implementation (2025-12-20)
### Changes Made
1. **Created migration** `db/migrations/versions/20251220_130000_add_chunk_fulltext_search.py`
- Added `search_vector` tsvector column to chunk table
- Created GIN index for fast search
- Added trigger to auto-update search_vector on insert/update
- Populated existing 250K chunks with search vectors
2. **Rewrote bm25.py** to use PostgreSQL full-text search
- Removed in-memory BM25 that caused OOM
- Uses `ts_rank()` for relevance scoring
- Uses AND matching with prefix wildcards: `engineer:* & fail:* & safe:*`
- Normalized scores to 0-1 range
3. **Added search_vector column** to Chunk model in SQLAlchemy
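The rewritten keyword search can be sketched roughly as follows. The table and column names (`chunk`, `search_vector`) follow the migration, but the exact SQL and the `normalize` helper are illustrative, not copied from `bm25.py`:

```python
# Illustrative sketch of the FTS query and score normalization in the
# rewritten bm25.py; everything beyond chunk/search_vector is assumed.
FTS_SQL = """
SELECT id, ts_rank(search_vector, query) AS rank
FROM chunk, to_tsquery('english', :tsquery) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT :limit
"""

def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale raw ts_rank scores into the 0-1 range so they can be
    fused with embedding scores."""
    top = max(scores.values(), default=0.0)
    return {cid: s / top for cid, s in scores.items()} if top else dict(scores)
```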
### Test Results
For query "engineer fail safe":
- PostgreSQL FTS returns 100 results without OOM
- Source 157 (humility article) chunks rank **25th and 26th** (vs not appearing before)
- Search completes in ~100ms (vs OOM crash before)
### Hybrid Search Flow
With BM25 now working, the hybrid search combines:
- Embedding search (70% weight) - finds semantically similar content
- Full-text search (30% weight) - finds exact keyword matches
- +15% bonus for results appearing in both
This should significantly improve "half-remembered" searches where users recall specific words that appear in the article.
---
## Issues Fixed (This Session)
### 1. Scores Were Being Discarded (CRITICAL)
**Problem:** Both embedding and BM25 searches computed relevance scores but threw them away, returning only chunk IDs.
**Files Changed:**
- `src/memory/api/search/embeddings.py` - Now returns `dict[str, float]` (chunk_id -> score)
- `src/memory/api/search/bm25.py` - Now returns normalized scores (0-1 range)
- `src/memory/api/search/search.py` - Added `fuse_scores()` for hybrid ranking
- `src/memory/api/search/types.py` - Changed from mean to max chunk score
**Before:** All `search_score` values were 0.000
**After:** Meaningful scores like 0.443, 0.503, etc.
### 2. Score Fusion Implemented
Added weighted combination of embedding (70%) + BM25 (30%) scores with 15% bonus for results appearing in both searches.
```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15
```
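A minimal sketch of what the fusion does with these constants; the actual `fuse_scores` signature in `search.py` may differ:

```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15

def fuse_scores(
    embedding: dict[str, float],  # chunk_id -> embedding score in [0, 1]
    bm25: dict[str, float],       # chunk_id -> normalized FTS score in [0, 1]
) -> dict[str, float]:
    """Weighted sum of both scores; chunks found by both methods get a bonus."""
    fused: dict[str, float] = {}
    for cid in embedding.keys() | bm25.keys():
        score = (EMBEDDING_WEIGHT * embedding.get(cid, 0.0)
                 + BM25_WEIGHT * bm25.get(cid, 0.0))
        if cid in embedding and cid in bm25:
            score += HYBRID_BONUS
        fused[cid] = score
    return fused
```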
### 3. Changed from Mean to Max Chunk Score
**Before:** Documents with many chunks were penalized (averaging across chunks diluted the score of the best match)
**After:** Uses max chunk score - finds documents with at least one highly relevant section
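A toy illustration of the difference:

```python
# One highly relevant chunk among several weak ones: mean buries the
# document, max surfaces it.
chunk_scores = [0.92, 0.18, 0.11, 0.09]

mean_score = sum(chunk_scores) / len(chunk_scores)  # ~0.325 - looks mediocre
max_score = max(chunk_scores)                       # 0.92 - clearly relevant
```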
---
## Current Issues Identified
### Issue 1: BM25 is Disabled AND Causes OOM
**Finding:** `ENABLE_BM25_SEARCH=False` in docker-compose.yaml
**Impact:** Keyword matching doesn't work. Queries like "engineer fail-safe" won't find articles containing those exact words unless the embedding similarity is high enough.
**When Enabled:** BM25 causes OOM crash!
- Database has 250,048 chunks total
- Forum collection alone has 147,546 chunks
- The BM25 implementation loads ALL chunks into memory and rebuilds the index from scratch on every query
- Container killed (exit code 137) when attempting BM25 search
**Root Cause:** Current BM25 implementation in `bm25.py` is not scalable:
```python
items = items_query.all() # Loads ALL chunks into memory
corpus = [item.content.lower().strip() for item in items] # Copies all content
retriever.index(corpus_tokens) # Builds index from scratch each query
```
**Recommendation:**
1. Build persistent BM25 index (store on disk, load once)
2. Or use PostgreSQL full-text search instead
3. Or limit BM25 to smaller collections only
### Issue 2: Embeddings Capture Theme, Not Details
**Test Case:** Article 157 about "humility in science" contains an example about engineers designing fail-safe mechanisms.
| Query | Result |
|-------|--------|
| "humility in science creationist evolution" | Rank 1, score 0.475 |
| "types of humility epistemic" | Rank 1, score 0.443 |
| "being humble about scientific knowledge" | Rank 1, score 0.483 |
| "engineer fail-safe mechanisms humble design" | Not in top 10 |
| "student double-checks math test answers" | Not in top 10 |
| "creationism debate" | Not in top 10 |
**Analysis:**
- Query "engineer fail-safe" has 0.52 cosine similarity to target chunks
- Other documents in corpus have 0.61+ similarity to that query
- The embedding captures the article's main theme (humility) but not incidental details (engineer example)
**Root Cause:** Embeddings are designed to capture semantic meaning of the whole chunk. Brief examples or mentions don't dominate the embedding.
### Issue 3: Chunk Context May Be Insufficient
**Finding:** The article's "engineer fail-safe" example appears in chunks, but:
- Some chunks are cut mid-word (e.g., "fail\-s" instead of "fail-safe")
- The engineer example may lack surrounding context
**Chunk Analysis for Article 157:**
- 7 chunks total
- Chunks containing "engineer": 2 (chunks 2 and 6)
- Chunk 2 ends with "fail\-s" (word cut off)
- The engineer example is brief (~2 sentences) within larger chunks about humility
---
## Embedding Similarity Analysis
For query "engineer fail-safe mechanisms humble design":
| Chunk | Similarity | Content Preview |
|-------|------------|-----------------|
| 3097f4d6 | 0.522 | "It is widely recognized that good science requires..." |
| db87f54d | 0.486 | "It is widely recognized that good science requires..." |
| f3e97d77 | 0.462 | "You'd still double-check your calculations..." |
| 9153d1f5 | 0.435 | "They ought to be more humble..." |
| 3375ae64 | 0.424 | "Dennett suggests that much 'religious belief'..." |
| 047e7a9a | 0.353 | Summary chunk |
| 80ff7a03 | 0.267 | References chunk |
**Problem:** Top results in the forum collection score 0.61+, so these 0.52 scores don't make the cut.
---
## Recommendations
### High Priority
1. **Enable BM25 Search**
- Set `ENABLE_BM25_SEARCH=True`
- This will find keyword matches that embeddings miss
- Already implemented score fusion to combine results
2. **Lower Embedding Threshold for Text Collections**
- Current: 0.25 minimum score
- Consider: 0.20 to catch more marginal matches
- Trade-off: May increase noise
3. **Increase Search Limit Before Fusion**
- Current: Uses same `limit` for both embedding and BM25
- Consider: Search for 2-3x more candidates, then fuse and return top N
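The over-fetch-then-fuse flow recommended here can be sketched as follows. The backend callables are injected to keep the sketch self-contained; all names are assumptions rather than the actual `search.py` API:

```python
CANDIDATE_MULTIPLIER = 5  # the multiplier eventually adopted

def search(query: str, limit: int,
           embedding_search, fulltext_search, fuse_scores) -> list[str]:
    """Over-fetch candidates from each backend, fuse, then trim to `limit`."""
    pool = limit * CANDIDATE_MULTIPLIER
    emb = embedding_search(query, pool)  # chunk_id -> score
    fts = fulltext_search(query, pool)   # chunk_id -> score
    fused = fuse_scores(emb, fts)
    return sorted(fused, key=fused.get, reverse=True)[:limit]
```

This helps surface results that rank well in one method but sit just below the cutoff in the other.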
### Medium Priority
4. **Implement Query Expansion / HyDE**
- For vague queries, generate a hypothetical answer and embed that
- Example: "engineer fail-safe" -> generate "An article discussing how engineers design fail-safe mechanisms as an example of good humility..."
5. **Improve Chunking Overlap**
- Ensure examples carry context from surrounding paragraphs
- Consider semantic chunking (split on topic changes, not just size)
6. **Add Document-Level Context to Chunks**
- Prepend document title/summary to each chunk before embedding
- Helps chunks maintain connection to main theme
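The HyDE expansion in item 4 could be sketched like this; the `complete` callable is a stand-in for whatever LLM helper the codebase exposes, not an existing API:

```python
def hyde_expand(query: str, complete) -> str:
    """Generate a hypothetical answer passage for `query`; embedding this
    passage instead of the raw query should land closer to the target
    article's own vocabulary."""
    prompt = (
        "Write a short passage that a document answering this query would "
        f"contain, using that document's own vocabulary:\n\n{query}"
    )
    return complete(prompt)  # hypothetical LLM call

# Usage idea: embed(hyde_expand("engineer fail-safe", complete=llm)) and
# feed the result into the existing embedding search.
```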
### Lower Priority
7. **Tune Fusion Weights**
- Current: 70% embedding, 30% BM25
- May need adjustment based on use case
8. **Add Temporal Decay**
- Prefer recent content for certain query types
---
## Architectural Issues
### Issue A: BM25 Implementation is Not Scalable
The current BM25 implementation cannot handle 250K chunks:
```python
# Current approach (in bm25.py):
items = items_query.all() # Loads ALL matching chunks into memory
corpus = [item.content.lower().strip() for item in items] # Makes copies
retriever.index(corpus_tokens) # Rebuilds index from scratch per query
```
**Why this fails:**
- 147K forum chunks × ~3KB avg = ~440MB just for text
- Plus tokenization, BM25 index structures → OOM
**Solutions (in order of recommendation):**
1. **PostgreSQL Full-Text Search** (Recommended)
- Already have PostgreSQL in stack
- Add `tsvector` column to Chunk table
- Create GIN index for fast search
- Use `ts_rank` for relevance scoring
- No additional infrastructure needed
2. **Persistent BM25 Index**
- Build index once at ingestion time
- Store on disk, load once at startup
- Update incrementally on new chunks
- More complex to maintain
3. **External Search Engine**
- Elasticsearch or Meilisearch
- Adds operational complexity
- May be overkill for current scale
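Option 1 could look roughly like the following DDL (a sketch with assumed names, not the actual migration):

```sql
-- Sketch of the DDL such a migration applies; index/trigger names assumed.
ALTER TABLE chunk ADD COLUMN search_vector tsvector;

UPDATE chunk SET search_vector = to_tsvector('english', content);

CREATE INDEX ix_chunk_search_vector ON chunk USING GIN (search_vector);

CREATE TRIGGER chunk_search_vector_update
    BEFORE INSERT OR UPDATE ON chunk
    FOR EACH ROW
    EXECUTE FUNCTION tsvector_update_trigger(
        search_vector, 'pg_catalog.english', content);
```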
### Issue B: Chunk Size Variance
Chunks range from 3 bytes to 3.3MB. This causes:
- Large chunks have diluted embeddings
- Small chunks lack context
- Inconsistent search quality across collections
**Solution:** Re-chunk existing content with:
- Max ~512 tokens per chunk (optimal for embeddings)
- 50-100 token overlap between chunks
- Prepend document title/context to each chunk
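A sliding-window re-chunking along these lines could look like this (a sketch: token lists stand in for a real tokenizer, and the 64-token overlap is just one value in the suggested 50-100 range):

```python
def rechunk(title: str, tokens: list[str],
            max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split a token stream into overlapping windows, prepending the
    document title to each chunk for context."""
    chunks: list[str] = []
    step = max_tokens - overlap
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + max_tokens]
        if not window:
            break
        chunks.append(f"{title}\n\n" + " ".join(window))
    return chunks
```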
### Issue C: Search Timeout (2 seconds)
The default 2-second timeout is too aggressive for:
- Large collections (147K forum chunks)
- Cold Qdrant cache
- Network latency
**Solution:** Increase to 5-10 seconds for the initial search, with a progressive-loading UX.
---
## Test Queries for Validation
After making changes, test with these queries against article 157:
```python
# Should find article 157 (humility in science)
test_cases = [
    # Main topic - currently working
    ("humility in science", "main topic"),
    ("types of humility epistemic", "topic area"),
    # Specific examples - currently failing
    ("engineer fail-safe mechanisms", "specific example"),
    ("student double-checks math test", "specific example"),
    # Tangential mentions - currently failing
    ("creationism debate", "mentioned topic"),
    # Vague/half-remembered - currently failing
    ("checking your work", "vague concept"),
    ("when engineers make mistakes", "tangential"),
]
```
---
## Session Log
### 2025-12-20
1. **Initial Investigation**
- Found scores were all 0.000
- Traced to embeddings.py and bm25.py discarding scores
2. **Fixed Score Propagation**
- Modified 4 files to preserve and fuse scores
- Rebuilt Docker images
- Verified scores now appear (0.4-0.5 range)
3. **Quality Testing**
- Selected random article (ID 157, humility in science)
- Tested 10 query types from specific to vague
- Found 3/10 queries succeed (main topic only)
4. **Root Cause Analysis**
- BM25 disabled - no keyword matching
- Embeddings capture theme, not details
- Target chunks have 0.52 similarity vs 0.61 for top results
5. **Next Steps**
- Enable BM25 and retest
- Consider HyDE for query expansion
- Investigate chunking improvements
6. **Deep Dive: Database Statistics**
- Total chunks: 250,048
- Forum: 147,546 (58.9%)
- Blog: 46,159 (18.5%)
- Book: 34,586 (13.8%)
- Text: 10,823 (4.3%)
7. **Chunk Size Analysis (MAJOR ISSUE)**
Found excessively large chunks that dilute embedding quality:
| Collection | Avg Length | Max Length | Over 8KB | Over 128KB |
|------------|------------|------------|----------|------------|
| book | 15,487 | 3.3MB | 12,452 | 474 |
| blog | 3,661 | 710KB | 2,874 | 19 |
| forum | 3,514 | 341KB | 8,943 | 47 |
Books have 36% of chunks over 8KB - too large for good embedding quality.
The Voyage embedding model has 32K token limit, but chunks over 8KB (~2K tokens)
start to lose fine-grained detail in the embedding.
8. **Detailed Score Analysis for "engineer fail-safe mechanisms humble design"**
- Query returns 145,632 results from forum collection
- Top results score 0.61, median 0.34
- Source 157 (target article) chunks score:
  - 3097f4d6: 0.5222 (rank 140/145,632) - main content
  - db87f54d: 0.4863 (rank 710/145,632) - full text chunk
  - f3e97d77: 0.4622 (rank 1,952/145,632)
  - 047e7a9a: 0.3528 (rank 58,949/145,632) - summary
**Key Finding:** Target chunks rank 140th-710th, but with limit=10,
they never appear. BM25 would find exact keyword match "engineer fail-safe".
9. **Top Results Analysis**
The chunks scoring 0.61 (beating our target) are about:
- CloudFlare incident (software failure)
- AI safety testing (risk/mitigation mechanisms)
- Generic "mechanisms to prevent failure" content
These are semantically similar to "engineer fail-safe mechanisms"
but NOT about humility. Embeddings capture concept, not context.
10. **Root Cause Confirmed**
The fundamental problem is:
1. Embeddings capture semantic meaning of query concepts
2. Query "engineer fail-safe" embeds as "engineering safety mechanisms"
3. Articles specifically about engineering/failure rank higher
4. Article about humility (that merely mentions engineers as example) ranks lower
5. Only keyword search (BM25) can find "mentioned in passing" content
11. **Implemented Candidate Pool Multiplier**
Added `CANDIDATE_MULTIPLIER = 5` to search.py:
- Internal searches now fetch 5x the requested limit
- Results from both methods are fused, then top N returned
- This helps surface results that rank well in one method but not both
12. **Added Stopword Filtering to FTS**
Updated bm25.py to filter common English stopwords before building tsquery:
- Words like "what", "you", "not", "the" are filtered out
- This makes AND matching less strict
- Query "saying what you mean" becomes "saying:* & mean:*" instead of 8 terms
13. **Testing: "Taboo Your Words" Query**
Query: "saying what you mean not using specific words"
Target: Source 735 ("Taboo Your Words" article)
Results:
- Embedding search ranks target at position 21 (score 0.606)
- Top 10 results score 0.62-0.64 (about language/communication generally)
- FTS doesn't match because article lacks "saying" and "specific"
- After fusion: target ranks 23rd, cutoff is 20th
**Key Insight:** The query describes the *concept* ("not using specific words")
but the article is about a *technique* ("taboo your words = replace with definitions").
These are semantically adjacent but not equivalent.
With direct query "replacing words with their definitions" → ranks 1st!
14. **Testing: "Clowns Iconoclasts" Query**
Query: "clowns being the real iconoclasts"
Target: "Lonely Dissent" article
Results: Found at rank 1 with score 0.815 (hybrid boost!)
- Both embedding AND FTS match
- 0.15 hybrid bonus applied
- This is an ideal case where keywords match content
15. **Remaining Challenges**
- "Half-remembered" queries describing concepts vs actual content
- Need query expansion (HyDE) to bridge semantic gap
- Or return more results for user to scan
- Consider showing "You might also be looking for..." suggestions
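The stopword filtering from step 12 can be sketched like this; the abbreviated stopword set below is an assumption, so the real filter in `bm25.py` may keep or drop different terms:

```python
# Abbreviated stopword set (assumption - the real list is longer).
STOPWORDS = {
    "a", "an", "and", "are", "is", "it", "not", "of", "or",
    "the", "to", "using", "what", "when", "with", "you", "your",
}

def to_tsquery_terms(query: str) -> str:
    """Drop stopwords, then AND the remaining terms with prefix wildcards."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    return " & ".join(f"{w}:*" for w in words)
```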