# RAG Search Quality Investigation

## Summary

Investigation into why RAG search results "often aren't that good" when trying to find things with partial/vague memories.

**Date:** 2025-12-20
**Status:** Significant Progress Made

### Key Findings

1. **BM25 keyword search was broken** - It caused OOM with 250K chunks. ✅ FIXED: Replaced with PostgreSQL full-text search.

2. **Embeddings can't find "mentioned in passing" content** - The query "engineer fail-safe" ranks an article about humility (which mentions engineers as an example) at position 140 out of 145K. Articles specifically about engineering rank higher.

3. **Score propagation was broken** - ✅ FIXED: Scores now flow through properly.

4. **Chunk sizes are inconsistent** - Some chunks are 3MB (books), some are 3 bytes. Large chunks have diluted embeddings.

5. **"Half-remembered" queries don't match article keywords** - The user describes a concept, but the article uses different terminology, e.g. "not using specific words" vs "taboo your words".

### What Works Now

- **Keyword-matching queries**: "clowns iconoclasts" → finds "Lonely Dissent" at rank 1 (score 0.815)
- **Direct concept queries**: "replacing words with definitions" → finds "Taboo Your Words" at rank 1
- **Hybrid search**: Results appearing in both embedding + FTS get 15% bonus

### Remaining Challenges

- **Conceptual queries**: "saying what you mean not using specific words" → target ranks 23rd (needs top 10)
  - Query describes the *effect*, article describes the *technique*
  - Need query expansion (HyDE) to bridge semantic gap

### Recommended Fix Priority

1. **Implement PostgreSQL full-text search** - ✅ DONE
2. **Add candidate pool multiplier** - ✅ DONE (5x internal limit)
3. **Add stopword filtering** - ✅ DONE
4. **Re-chunk oversized content** - Max 512 tokens, with context
5. **Implement HyDE query expansion** - For vague/conceptual queries

---

## PostgreSQL Full-Text Search Implementation (2025-12-20)

### Changes Made

1. **Created migration** `db/migrations/versions/20251220_130000_add_chunk_fulltext_search.py`
   - Added `search_vector` tsvector column to chunk table
   - Created GIN index for fast search
   - Added trigger to auto-update search_vector on insert/update
   - Populated existing 250K chunks with search vectors

2. **Rewrote bm25.py** to use PostgreSQL full-text search
   - Removed in-memory BM25 that caused OOM
   - Uses `ts_rank()` for relevance scoring
   - Uses AND matching with prefix wildcards: `engineer:* & fail:* & safe:*`
   - Normalized scores to 0-1 range

3. **Added search_vector column** to Chunk model in SQLAlchemy

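The AND-with-prefix-wildcards query format described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `bm25.py`: the function name `build_prefix_tsquery`, the tokenization regex, the `id` column, and the `'english'` text search configuration are all assumptions.

```python
import re

def build_prefix_tsquery(query: str) -> str:
    """Turn free text into an AND-ed tsquery string with prefix wildcards."""
    # Assumption: split on anything that is not a letter or digit,
    # so "fail-safe" contributes the two terms "fail" and "safe".
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return " & ".join(f"{term}:*" for term in terms)

# Hypothetical SQL for the ranked lookup. Only `search_vector`, `ts_rank`
# and the chunk table come from the notes; the rest is assumed.
FTS_SQL = """
SELECT id, ts_rank(search_vector, to_tsquery('english', :tsquery)) AS rank
FROM chunk
WHERE search_vector @@ to_tsquery('english', :tsquery)
ORDER BY rank DESC
LIMIT :limit
"""
```

Called with `"engineer fail-safe"`, the helper produces the same `engineer:* & fail:* & safe:*` string quoted above. A caller would probably want to skip the database round trip when the helper returns an empty string.
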
### Test Results

For query "engineer fail safe":
- PostgreSQL FTS returns 100 results without OOM
- Source 157 (humility article) chunks rank **25th and 26th** (vs not appearing before)
- Search completes in ~100ms (vs OOM crash before)

### Hybrid Search Flow

With keyword search now working (via PostgreSQL full-text search in place of in-memory BM25), the hybrid search combines:
- Embedding search (70% weight) - finds semantically similar content
- Full-text search (30% weight) - finds exact keyword matches
- +15% bonus for results appearing in both

This should significantly improve "half-remembered" searches where users recall specific words that appear in the article.

---

## Issues Fixed (This Session)

### 1. Scores Were Being Discarded (CRITICAL)

**Problem:** Both embedding and BM25 searches computed relevance scores but threw them away, returning only chunk IDs.

**Files Changed:**
- `src/memory/api/search/embeddings.py` - Now returns `dict[str, float]` (chunk_id -> score)
- `src/memory/api/search/bm25.py` - Now returns normalized scores (0-1 range)
- `src/memory/api/search/search.py` - Added `fuse_scores()` for hybrid ranking
- `src/memory/api/search/types.py` - Changed from mean to max chunk score

**Before:** All `search_score` values were 0.000
**After:** Meaningful scores like 0.443, 0.503, etc.

### 2. Score Fusion Implemented

Added weighted combination of embedding (70%) + BM25 (30%) scores with 15% bonus for results appearing in both searches.

```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15
```

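How those three constants might combine is sketched below. The real `fuse_scores()` in `search.py` is not shown in these notes, so the body is a guess: in particular, treating the bonus as additive and capping the result at 1.0 are assumptions.

```python
from __future__ import annotations

EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15

def fuse_scores(
    embedding_scores: dict[str, float],
    bm25_scores: dict[str, float],
) -> dict[str, float]:
    """Combine two {chunk_id: score} maps into one fused score per chunk."""
    fused: dict[str, float] = {}
    for chunk_id in embedding_scores.keys() | bm25_scores.keys():
        # A chunk missing from one search contributes 0 for that side.
        emb = embedding_scores.get(chunk_id, 0.0)
        bm25 = bm25_scores.get(chunk_id, 0.0)
        score = EMBEDDING_WEIGHT * emb + BM25_WEIGHT * bm25
        if chunk_id in embedding_scores and chunk_id in bm25_scores:
            score += HYBRID_BONUS  # assumption: additive bonus
        fused[chunk_id] = min(score, 1.0)  # assumption: cap at 1.0
    return fused
```

Under these assumptions a chunk found only by embeddings can score at most 0.7, so a strong keyword match plus the bonus is what lifts a result above semantically similar but off-topic chunks.
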
### 3. Changed from Mean to Max Chunk Score

**Before:** Documents with many chunks were penalized (averaging diluted scores)
**After:** Uses max chunk score - finds documents with at least one highly relevant section

---

## Current Issues Identified

### Issue 1: BM25 is Disabled AND Causes OOM

**Finding:** `ENABLE_BM25_SEARCH=False` in docker-compose.yaml

**Impact:** Keyword matching doesn't work. Queries like "engineer fail-safe" won't find articles containing those exact words unless the embedding similarity is high enough.

**When Enabled:** BM25 causes OOM crash!
- Database has 250,048 chunks total
- Forum collection alone has 147,546 chunks
- BM25 implementation loads ALL chunks into memory and builds index on each query
- Container killed (exit code 137) when attempting BM25 search

**Root Cause:** Current BM25 implementation in `bm25.py` is not scalable:
```python
items = items_query.all()  # Loads ALL chunks into memory
corpus = [item.content.lower().strip() for item in items]  # Copies all content
retriever.index(corpus_tokens)  # Builds index from scratch each query
```

**Recommendation:**
1. Build persistent BM25 index (store on disk, load once)
2. Or use PostgreSQL full-text search instead
3. Or limit BM25 to smaller collections only

### Issue 2: Embeddings Capture Theme, Not Details

**Test Case:** Article 157 about "humility in science" contains an example about engineers designing fail-safe mechanisms.

| Query | Result |
|-------|--------|
| "humility in science creationist evolution" | Rank 1, score 0.475 |
| "types of humility epistemic" | Rank 1, score 0.443 |
| "being humble about scientific knowledge" | Rank 1, score 0.483 |
| "engineer fail-safe mechanisms humble design" | Not in top 10 |
| "student double-checks math test answers" | Not in top 10 |
| "creationism debate" | Not in top 10 |

**Analysis:**
- Query "engineer fail-safe" has 0.52 cosine similarity to target chunks
- Other documents in corpus have 0.61+ similarity to that query
- The embedding captures the article's main theme (humility) but not incidental details (engineer example)

**Root Cause:** Embeddings are designed to capture semantic meaning of the whole chunk. Brief examples or mentions don't dominate the embedding.

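The cutoff effect in the analysis above comes down to simple ranking arithmetic: a chunk is dropped whenever at least `limit` other chunks score higher, however relevant it is in absolute terms. A minimal sketch (both function names are hypothetical and not taken from the codebase):

```python
from typing import Sequence

def rank_of(target_score: float, all_scores: Sequence[float]) -> int:
    """1-based rank of a score among competitors (higher score = better rank)."""
    return 1 + sum(1 for score in all_scores if score > target_score)

def survives_cutoff(target_score: float, all_scores: Sequence[float], limit: int) -> bool:
    """True if the target would still be returned when only `limit` results are kept."""
    return rank_of(target_score, all_scores) <= limit
```

With a target at 0.52 and more than ten competitors at 0.61, `survives_cutoff` is false for `limit=10` regardless of how well the chunk actually answers the query.
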
### Issue 3: Chunk Context May Be Insufficient

**Finding:** The article's "engineer fail-safe" example appears in chunks, but:
- Some chunks are cut mid-word (e.g., "fail\-s" instead of "fail-safe")
- The engineer example may lack surrounding context

**Chunk Analysis for Article 157:**
- 7 chunks total
- Chunks containing "engineer": 2 (chunks 2 and 6)
- Chunk 2 ends with "fail\-s" (word cut off)
- The engineer example is brief (~2 sentences) within larger chunks about humility

---

## Embedding Similarity Analysis

For query "engineer fail-safe mechanisms humble design":

| Chunk | Similarity | Content Preview |
|-------|------------|-----------------|
| 3097f4d6 | 0.522 | "It is widely recognized that good science requires..." |
| db87f54d | 0.486 | "It is widely recognized that good science requires..." |
| f3e97d77 | 0.462 | "You'd still double-check your calculations..." |
| 9153d1f5 | 0.435 | "They ought to be more humble..." |
| 3375ae64 | 0.424 | "Dennett suggests that much 'religious belief'..." |
| 047e7a9a | 0.353 | Summary chunk |
| 80ff7a03 | 0.267 | References chunk |

**Problem:** Top results in the forum collection score 0.61+, so these 0.52 scores don't make the cut.

---

## Recommendations

### High Priority

1. **Enable BM25 Search**
   - Set `ENABLE_BM25_SEARCH=True`
   - This will find keyword matches that embeddings miss
   - Already implemented score fusion to combine results

2. **Lower Embedding Threshold for Text Collections**
   - Current: 0.25 minimum score
   - Consider: 0.20 to catch more marginal matches
   - Trade-off: May increase noise

3. **Increase Search Limit Before Fusion**
   - Current: Uses same `limit` for both embedding and BM25
   - Consider: Search for 2-3x more candidates, then fuse and return top N

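Recommendation 3 (over-fetch, fuse, then truncate) might look roughly like the sketch below. The session log reports that `CANDIDATE_MULTIPLIER = 5` was eventually added to `search.py`; everything else here (the function name, the callable signatures, passing the fusion step in as a parameter) is assumed for illustration.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
SearchFn = Callable[[str, int], Scores]  # (query, limit) -> {chunk_id: score}

CANDIDATE_MULTIPLIER = 5  # value reported for search.py in the session log

def search_with_candidate_pool(
    query: str,
    limit: int,
    embedding_search: SearchFn,
    keyword_search: SearchFn,
    fuse: Callable[[Scores, Scores], Scores],
) -> List[Tuple[str, float]]:
    """Fetch a larger candidate pool from each method, fuse, return the top `limit`."""
    internal_limit = limit * CANDIDATE_MULTIPLIER
    embedding_scores = embedding_search(query, internal_limit)
    keyword_scores = keyword_search(query, internal_limit)
    fused = fuse(embedding_scores, keyword_scores)
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

The point of the larger pool is that a chunk ranked, say, 30th by embeddings but 1st by keyword search still enters the fusion step when the caller asks for only 10 results.
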
### Medium Priority

4. **Implement Query Expansion / HyDE**
   - For vague queries, generate a hypothetical answer and embed that
   - Example: "engineer fail-safe" -> generate "An article discussing how engineers design fail-safe mechanisms as an example of good humility..."

5. **Improve Chunking Overlap**
   - Ensure examples carry context from surrounding paragraphs
   - Consider semantic chunking (split on topic changes, not just size)

6. **Add Document-Level Context to Chunks**
   - Prepend document title/summary to each chunk before embedding
   - Helps chunks maintain connection to main theme

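The HyDE idea in item 4 can be sketched without committing to any particular LLM or embedding provider. Both `generate` and `embed` are hypothetical callables supplied by the caller, and the prompt wording is only an example; nothing here reflects an existing implementation in the repo.

```python
from typing import Callable, List, Sequence

def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> List[float]:
    """Embed a generated hypothetical passage instead of the raw query."""
    # Assumption: a simple instruction prompt; real prompts would need tuning.
    prompt = (
        "Write a short passage from an article that would match this search: "
        f"{query}"
    )
    hypothetical_passage = generate(prompt)
    # The passage is phrased like a document, so its embedding should sit
    # closer to real document chunks than the terse query would.
    return list(embed(hypothetical_passage))
```

The returned vector would then be used for the embedding search in place of the query vector. The extra LLM call adds latency, which may matter given the search timeout discussed under Architectural Issues.
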
### Lower Priority

7. **Tune Fusion Weights**
   - Current: 70% embedding, 30% BM25
   - May need adjustment based on use case

8. **Add Temporal Decay**
   - Prefer recent content for certain query types

---

## Architectural Issues

### Issue A: BM25 Implementation is Not Scalable

The current BM25 implementation cannot handle 250K chunks:

```python
# Current approach (in bm25.py):
items = items_query.all()  # Loads ALL matching chunks into memory
corpus = [item.content.lower().strip() for item in items]  # Makes copies
retriever.index(corpus_tokens)  # Rebuilds index from scratch per query
```

**Why this fails:**
- 147K forum chunks × ~3KB avg = ~440MB just for text
- Plus tokenization, BM25 index structures → OOM

**Solutions (in order of recommendation):**

1. **PostgreSQL Full-Text Search** (Recommended)
   - Already have PostgreSQL in stack
   - Add `tsvector` column to Chunk table
   - Create GIN index for fast search
   - Use `ts_rank` for relevance scoring
   - No additional infrastructure needed

2. **Persistent BM25 Index**
   - Build index once at ingestion time
   - Store on disk, load once at startup
   - Update incrementally on new chunks
   - More complex to maintain

3. **External Search Engine**
   - Elasticsearch or Meilisearch
   - Adds operational complexity
   - May be overkill for current scale

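The schema side of solution 1 could be expressed as a handful of SQL statements, sketched here as a Python list so they can be run from a migration. This is not the contents of the actual migration file: the index and trigger names, the `content` source column, and the `'english'` configuration are assumptions. `tsvector_update_trigger` is a built-in PostgreSQL function, and the `EXECUTE FUNCTION` syntax assumes PostgreSQL 11 or newer.

```python
from typing import Callable, List

# Assumed names: chunk_search_vector_idx, chunk_search_vector_update, content.
FTS_MIGRATION_SQL: List[str] = [
    # 1. Column holding the pre-computed search document
    "ALTER TABLE chunk ADD COLUMN search_vector tsvector",
    # 2. GIN index so @@ matches do not scan every row
    "CREATE INDEX chunk_search_vector_idx ON chunk USING GIN (search_vector)",
    # 3. Keep the column in sync on insert/update
    """
    CREATE TRIGGER chunk_search_vector_update
    BEFORE INSERT OR UPDATE ON chunk
    FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(search_vector, 'pg_catalog.english', content)
    """,
    # 4. Backfill rows that existed before the trigger
    "UPDATE chunk SET search_vector = to_tsvector('english', coalesce(content, ''))",
]

def upgrade(execute: Callable[[str], None]) -> None:
    """Run each statement in order; in Alembic, `execute` would be `op.execute`."""
    for statement in FTS_MIGRATION_SQL:
        execute(statement)
```

On a table of roughly 250K rows, the backfill in step 4 is likely the slow part, so it may be worth running it in batches.
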
### Issue B: Chunk Size Variance

Chunks range from 3 bytes to 3.3MB. This causes:
- Large chunks have diluted embeddings
- Small chunks lack context
- Inconsistent search quality across collections

**Solution:** Re-chunk existing content with:
- Max ~512 tokens per chunk (optimal for embeddings)
- 50-100 token overlap between chunks
- Prepend document title/context to each chunk

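A minimal sketch of the proposed re-chunking, assuming whitespace-separated words stand in for tokens (a real implementation would count tokens with the embedding model's tokenizer). The function name and the default overlap of 64 are illustrative choices within the 50-100 range suggested above.

```python
from typing import List

def chunk_with_overlap(
    text: str,
    title: str,
    max_tokens: int = 512,
    overlap: int = 64,
) -> List[str]:
    """Split text into overlapping windows, each prefixed with the document title."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        body = " ".join(tokens[start:start + max_tokens])
        chunks.append(f"{title}\n\n{body}")
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Note that the prepended title is not counted against `max_tokens` here, so the effective chunk size is slightly larger than the limit.
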
### Issue C: Search Timeout (2 seconds)

The default 2-second timeout is too aggressive for:
- Large collections (147K forum chunks)
- Cold Qdrant cache
- Network latency

**Solution:** Increase to 5-10 seconds for initial search, with progressive loading UX.

---

## Test Queries for Validation

After making changes, test with these queries against article 157:

```python
# Should find article 157 (humility in science)
test_cases = [
    # Main topic - currently working
    ("humility in science", "main topic"),
    ("types of humility epistemic", "topic area"),

    # Specific examples - currently failing
    ("engineer fail-safe mechanisms", "specific example"),
    ("student double-checks math test", "specific example"),

    # Tangential mentions - currently failing
    ("creationism debate", "mentioned topic"),

    # Vague/half-remembered - currently failing
    ("checking your work", "vague concept"),
    ("when engineers make mistakes", "tangential"),
]
```

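One way to run the cases above is a small harness that records, per query, whether the target article appears in the top results. This is only a sketch: `search` is a hypothetical callable returning source IDs in rank order, and `evaluate_queries` does not exist in the codebase.

```python
from typing import Callable, Dict, Sequence, Tuple

def evaluate_queries(
    search: Callable[[str, int], Sequence[int]],
    test_cases: Sequence[Tuple[str, str]],
    target_id: int = 157,
    top_k: int = 10,
) -> Dict[str, bool]:
    """Map each query to True if target_id is among the first top_k results."""
    results: Dict[str, bool] = {}
    for query, _label in test_cases:
        returned_ids = search(query, top_k)
        # Slice defensively in case the search returns more than top_k items.
        results[query] = target_id in list(returned_ids)[:top_k]
    return results
```
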
---

## Session Log

### 2025-12-20

1. **Initial Investigation**
   - Found scores were all 0.000
   - Traced to embeddings.py and bm25.py discarding scores

2. **Fixed Score Propagation**
   - Modified 4 files to preserve and fuse scores
   - Rebuilt Docker images
   - Verified scores now appear (0.4-0.5 range)

3. **Quality Testing**
   - Selected random article (ID 157, humility in science)
   - Tested 10 query types from specific to vague
   - Found 3/10 queries succeed (main topic only)

4. **Root Cause Analysis**
   - BM25 disabled - no keyword matching
   - Embeddings capture theme, not details
   - Target chunks have 0.52 similarity vs 0.61 for top results

5. **Next Steps**
   - Enable BM25 and retest
   - Consider HyDE for query expansion
   - Investigate chunking improvements

6. **Deep Dive: Database Statistics**
   - Total chunks: 250,048
   - Forum: 147,546 (59.0%)
   - Blog: 46,159 (18.5%)
   - Book: 34,586 (13.8%)
   - Text: 10,823 (4.3%)

7. **Chunk Size Analysis (MAJOR ISSUE)**
   Found excessively large chunks that dilute embedding quality:

   | Collection | Avg Length | Max Length | Over 8KB | Over 128KB |
   |------------|------------|------------|----------|------------|
   | book | 15,487 | 3.3MB | 12,452 | 474 |
   | blog | 3,661 | 710KB | 2,874 | 19 |
   | forum | 3,514 | 341KB | 8,943 | 47 |

   Books have 36% of chunks over 8KB - too large for good embedding quality.
   The Voyage embedding model has 32K token limit, but chunks over 8KB (~2K tokens)
   start to lose fine-grained detail in the embedding.

8. **Detailed Score Analysis for "engineer fail-safe mechanisms humble design"**
   - Query returns 145,632 results from forum collection
   - Top results score 0.61, median 0.34
   - Source 157 (target article) chunks score:
     - 3097f4d6: 0.5222 (rank 140/145,632) - main content
     - db87f54d: 0.4863 (rank 710/145,632) - full text chunk
     - f3e97d77: 0.4622 (rank 1,952/145,632)
     - 047e7a9a: 0.3528 (rank 58,949/145,632) - summary

   **Key Finding:** Target chunks rank 140th-710th, but with limit=10,
   they never appear. BM25 would find exact keyword match "engineer fail-safe".

9. **Top Results Analysis**
   The chunks scoring 0.61 (beating our target) are about:
   - CloudFlare incident (software failure)
   - AI safety testing (risk/mitigation mechanisms)
   - Generic "mechanisms to prevent failure" content

   These are semantically similar to "engineer fail-safe mechanisms"
   but NOT about humility. Embeddings capture concept, not context.

10. **Root Cause Confirmed**
    The fundamental problem is:
    1. Embeddings capture semantic meaning of query concepts
    2. Query "engineer fail-safe" embeds as "engineering safety mechanisms"
    3. Articles specifically about engineering/failure rank higher
    4. Article about humility (that merely mentions engineers as example) ranks lower
    5. Only keyword search (BM25) can find "mentioned in passing" content

11. **Implemented Candidate Pool Multiplier**
    Added `CANDIDATE_MULTIPLIER = 5` to search.py:
    - Internal searches now fetch 5x the requested limit
    - Results from both methods are fused, then top N returned
    - This helps surface results that rank well in one method but not both

12. **Added Stopword Filtering to FTS**
    Updated bm25.py to filter common English stopwords before building tsquery:
    - Words like "what", "you", "not", "the" are filtered out
    - This makes AND matching less strict
    - For example, the fragment "saying what you mean" becomes "saying:* & mean:*"; the full 8-word test query is shortened the same way

13. **Testing: "Taboo Your Words" Query**
    Query: "saying what you mean not using specific words"
    Target: Source 735 ("Taboo Your Words" article)

    Results:
    - Embedding search ranks target at position 21 (score 0.606)
    - Top 10 results score 0.62-0.64 (about language/communication generally)
    - FTS doesn't match because article lacks "saying" and "specific"
    - After fusion: target ranks 23rd, cutoff is 20th

    **Key Insight:** The query describes the *concept* ("not using specific words")
    but the article is about a *technique* ("taboo your words = replace with definitions").
    These are semantically adjacent but not equivalent.

    With direct query "replacing words with their definitions" → ranks 1st!

14. **Testing: "Clowns Iconoclasts" Query**
    Query: "clowns being the real iconoclasts"
    Target: "Lonely Dissent" article

    Results: Found at rank 1 with score 0.815 (hybrid boost!)
    - Both embedding AND FTS match
    - 0.15 hybrid bonus applied
    - This is an ideal case where keywords match content

15. **Remaining Challenges**
    - "Half-remembered" queries describing concepts vs actual content
    - Need query expansion (HyDE) to bridge semantic gap
    - Or return more results for user to scan
    - Consider showing "You might also be looking for..." suggestions

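The stopword filtering described in item 12 of the session log can be sketched as below. The stopword set here is a small illustrative subset, not the list actually used in `bm25.py`, and falling back to the unfiltered terms when every word is a stopword is an assumption.

```python
import re

# Illustrative subset only; the real list in bm25.py is not shown in these notes.
STOPWORDS = frozenset(
    {"a", "an", "and", "in", "is", "it", "not", "of", "the", "to", "what", "you"}
)

def build_filtered_tsquery(query: str) -> str:
    """Build an AND-ed prefix tsquery after dropping common stopwords."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    kept = [term for term in terms if term not in STOPWORDS]
    if not kept:
        kept = terms  # assumption: keep everything rather than send an empty query
    return " & ".join(f"{term}:*" for term in kept)
```

For the fragment quoted in item 12, `build_filtered_tsquery("saying what you mean")` returns `saying:* & mean:*`. Fewer AND-ed terms means fewer ways for a single missing word to exclude an otherwise relevant chunk.
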