memory/docs/SEARCH_INVESTIGATION.md
mruwnik e414c3311c Improve RAG search quality with PostgreSQL FTS and hybrid scoring
Major changes:
- Replace OOM-causing in-memory BM25 with PostgreSQL full-text search
- Add tsvector column and GIN index for fast keyword search
- Implement hybrid score fusion (70% embedding + 30% FTS + 15% bonus)
- Add CANDIDATE_MULTIPLIER (5x) to search more candidates before fusion
- Add stopword filtering to FTS queries for less strict matching
- Make search limit configurable (default 20, max 100)
- Propagate relevance scores through the search pipeline

Search improvements:
- "clowns iconoclasts" → finds target at rank 1 (score 0.815)
- "replacing words with definitions" → finds target at rank 1
- Vague queries now find results with limit=30 that were previously missed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 15:54:30 +00:00


# RAG Search Quality Investigation
## Summary
Investigation into why RAG search results "often aren't that good" when trying to find things with partial/vague memories.
**Date:** 2025-12-20
**Status:** Significant Progress Made
### Key Findings
1. **BM25 keyword search was broken** - Caused OOM with 250K chunks. ✅ FIXED: Replaced with PostgreSQL full-text search.
2. **Embeddings can't find "mentioned in passing" content** - Query "engineer fail-safe" ranks article about humility (that mentions engineers as example) at position 140 out of 145K. Articles specifically about engineering rank higher.
3. **Score propagation was broken** - ✅ FIXED: Scores now flow through properly.
4. **Chunk sizes are inconsistent** - Some chunks are 3MB (books), some are 3 bytes. Large chunks have diluted embeddings.
5. **"Half-remembered" queries don't match article keywords** - User describes concept, but article uses different terminology. E.g., "not using specific words" vs "taboo your words".
### What Works Now
- **Keyword-matching queries**: "clowns iconoclasts" → finds "Lonely Dissent" at rank 1 (score 0.815)
- **Direct concept queries**: "replacing words with definitions" → finds "Taboo Your Words" at rank 1
- **Hybrid search**: Results appearing in both embedding + FTS get 15% bonus
### Remaining Challenges
- **Conceptual queries**: "saying what you mean not using specific words" → target ranks 23rd (needs top 10)
- Query describes the *effect*, article describes the *technique*
- Need query expansion (HyDE) to bridge semantic gap
### Recommended Fix Priority
1. **Implement PostgreSQL full-text search** - ✅ DONE
2. **Add candidate pool multiplier** - ✅ DONE (5x internal limit)
3. **Add stopword filtering** - ✅ DONE
4. **Re-chunk oversized content** - Max 512 tokens, with context
5. **Implement HyDE query expansion** - For vague/conceptual queries
---
## PostgreSQL Full-Text Search Implementation (2025-12-20)
### Changes Made
1. **Created migration** `db/migrations/versions/20251220_130000_add_chunk_fulltext_search.py`
- Added `search_vector` tsvector column to chunk table
- Created GIN index for fast search
- Added trigger to auto-update search_vector on insert/update
- Populated existing 250K chunks with search vectors
2. **Rewrote bm25.py** to use PostgreSQL full-text search
- Removed in-memory BM25 that caused OOM
- Uses `ts_rank()` for relevance scoring
- Uses AND matching with prefix wildcards: `engineer:* & fail:* & safe:*`
- Normalized scores to 0-1 range
3. **Added search_vector column** to Chunk model in SQLAlchemy
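The rewritten keyword search can be sketched roughly as follows. The table and column names (`chunk`, `search_vector`) follow the migration, but the exact SQL and the `normalize` helper are illustrative, not copied from `bm25.py`:

```python
# Illustrative sketch of the FTS query and score normalization in the
# rewritten bm25.py; everything beyond chunk/search_vector is assumed.
FTS_SQL = """
SELECT id, ts_rank(search_vector, query) AS rank
FROM chunk, to_tsquery('english', :tsquery) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT :limit
"""

def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale raw ts_rank scores into the 0-1 range so they can be
    fused with embedding scores."""
    top = max(scores.values(), default=0.0)
    return {cid: s / top for cid, s in scores.items()} if top else dict(scores)
```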
### Test Results
For query "engineer fail safe":
- PostgreSQL FTS returns 100 results without OOM
- Source 157 (humility article) chunks rank **25th and 26th** (vs not appearing before)
- Search completes in ~100ms (vs OOM crash before)
### Hybrid Search Flow
With BM25 now working, the hybrid search combines:
- Embedding search (70% weight) - finds semantically similar content
- Full-text search (30% weight) - finds exact keyword matches
- +15% bonus for results appearing in both
This should significantly improve "half-remembered" searches where users recall specific words that appear in the article.
---
## Issues Fixed (This Session)
### 1. Scores Were Being Discarded (CRITICAL)
**Problem:** Both embedding and BM25 searches computed relevance scores but threw them away, returning only chunk IDs.
**Files Changed:**
- `src/memory/api/search/embeddings.py` - Now returns `dict[str, float]` (chunk_id -> score)
- `src/memory/api/search/bm25.py` - Now returns normalized scores (0-1 range)
- `src/memory/api/search/search.py` - Added `fuse_scores()` for hybrid ranking
- `src/memory/api/search/types.py` - Changed from mean to max chunk score
**Before:** All `search_score` values were 0.000
**After:** Meaningful scores like 0.443, 0.503, etc.
### 2. Score Fusion Implemented
Added weighted combination of embedding (70%) + BM25 (30%) scores with 15% bonus for results appearing in both searches.
```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15
```
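A minimal sketch of what the fusion does with these constants; the actual `fuse_scores` signature in `search.py` may differ:

```python
EMBEDDING_WEIGHT = 0.7
BM25_WEIGHT = 0.3
HYBRID_BONUS = 0.15

def fuse_scores(
    embedding: dict[str, float],  # chunk_id -> embedding score in [0, 1]
    bm25: dict[str, float],       # chunk_id -> normalized FTS score in [0, 1]
) -> dict[str, float]:
    """Weighted sum of both scores; chunks found by both methods get a bonus."""
    fused: dict[str, float] = {}
    for cid in embedding.keys() | bm25.keys():
        score = (EMBEDDING_WEIGHT * embedding.get(cid, 0.0)
                 + BM25_WEIGHT * bm25.get(cid, 0.0))
        if cid in embedding and cid in bm25:
            score += HYBRID_BONUS
        fused[cid] = score
    return fused
```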
### 3. Changed from Mean to Max Chunk Score
**Before:** Documents with many chunks were penalized (averaging across chunks diluted the score of the best match)
**After:** Uses max chunk score - finds documents with at least one highly relevant section
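A toy illustration of the difference:

```python
# One highly relevant chunk among several weak ones: mean buries the
# document, max surfaces it.
chunk_scores = [0.92, 0.18, 0.11, 0.09]

mean_score = sum(chunk_scores) / len(chunk_scores)  # ~0.325 - looks mediocre
max_score = max(chunk_scores)                       # 0.92 - clearly relevant
```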
---
## Current Issues Identified
### Issue 1: BM25 is Disabled AND Causes OOM
**Finding:** `ENABLE_BM25_SEARCH=False` in docker-compose.yaml
**Impact:** Keyword matching doesn't work. Queries like "engineer fail-safe" won't find articles containing those exact words unless the embedding similarity is high enough.
**When Enabled:** BM25 causes OOM crash!
- Database has 250,048 chunks total
- Forum collection alone has 147,546 chunks
- The BM25 implementation loads ALL chunks into memory and rebuilds the index from scratch on every query
- Container killed (exit code 137) when attempting BM25 search
**Root Cause:** Current BM25 implementation in `bm25.py` is not scalable:
```python
items = items_query.all() # Loads ALL chunks into memory
corpus = [item.content.lower().strip() for item in items] # Copies all content
retriever.index(corpus_tokens) # Builds index from scratch each query
```
**Recommendation:**
1. Build persistent BM25 index (store on disk, load once)
2. Or use PostgreSQL full-text search instead
3. Or limit BM25 to smaller collections only
### Issue 2: Embeddings Capture Theme, Not Details
**Test Case:** Article 157 about "humility in science" contains an example about engineers designing fail-safe mechanisms.
| Query | Result |
|-------|--------|
| "humility in science creationist evolution" | Rank 1, score 0.475 |
| "types of humility epistemic" | Rank 1, score 0.443 |
| "being humble about scientific knowledge" | Rank 1, score 0.483 |
| "engineer fail-safe mechanisms humble design" | Not in top 10 |
| "student double-checks math test answers" | Not in top 10 |
| "creationism debate" | Not in top 10 |
**Analysis:**
- Query "engineer fail-safe" has 0.52 cosine similarity to target chunks
- Other documents in corpus have 0.61+ similarity to that query
- The embedding captures the article's main theme (humility) but not incidental details (engineer example)
**Root Cause:** Embeddings are designed to capture semantic meaning of the whole chunk. Brief examples or mentions don't dominate the embedding.
### Issue 3: Chunk Context May Be Insufficient
**Finding:** The article's "engineer fail-safe" example appears in chunks, but:
- Some chunks are cut mid-word (e.g., "fail\-s" instead of "fail-safe")
- The engineer example may lack surrounding context
**Chunk Analysis for Article 157:**
- 7 chunks total
- Chunks containing "engineer": 2 (chunks 2 and 6)
- Chunk 2 ends with "fail\-s" (word cut off)
- The engineer example is brief (~2 sentences) within larger chunks about humility
---
## Embedding Similarity Analysis
For query "engineer fail-safe mechanisms humble design":
| Chunk | Similarity | Content Preview |
|-------|------------|-----------------|
| 3097f4d6 | 0.522 | "It is widely recognized that good science requires..." |
| db87f54d | 0.486 | "It is widely recognized that good science requires..." |
| f3e97d77 | 0.462 | "You'd still double-check your calculations..." |
| 9153d1f5 | 0.435 | "They ought to be more humble..." |
| 3375ae64 | 0.424 | "Dennett suggests that much 'religious belief'..." |
| 047e7a9a | 0.353 | Summary chunk |
| 80ff7a03 | 0.267 | References chunk |
**Problem:** Top results in the forum collection score 0.61+, so these 0.52 scores don't make the cut.
---
## Recommendations
### High Priority
1. **Enable BM25 Search**
- Set `ENABLE_BM25_SEARCH=True`
- This will find keyword matches that embeddings miss
- Already implemented score fusion to combine results
2. **Lower Embedding Threshold for Text Collections**
- Current: 0.25 minimum score
- Consider: 0.20 to catch more marginal matches
- Trade-off: May increase noise
3. **Increase Search Limit Before Fusion**
- Current: Uses same `limit` for both embedding and BM25
- Consider: Search for 2-3x more candidates, then fuse and return top N
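The over-fetch-then-fuse flow recommended here can be sketched as follows. The backend callables are injected to keep the sketch self-contained; all names are assumptions rather than the actual `search.py` API:

```python
CANDIDATE_MULTIPLIER = 5  # the multiplier eventually adopted

def search(query: str, limit: int,
           embedding_search, fulltext_search, fuse_scores) -> list[str]:
    """Over-fetch candidates from each backend, fuse, then trim to `limit`."""
    pool = limit * CANDIDATE_MULTIPLIER
    emb = embedding_search(query, pool)  # chunk_id -> score
    fts = fulltext_search(query, pool)   # chunk_id -> score
    fused = fuse_scores(emb, fts)
    return sorted(fused, key=fused.get, reverse=True)[:limit]
```

This helps surface results that rank well in one method but sit just below the cutoff in the other.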
### Medium Priority
4. **Implement Query Expansion / HyDE**
- For vague queries, generate a hypothetical answer and embed that
- Example: "engineer fail-safe" -> generate "An article discussing how engineers design fail-safe mechanisms as an example of good humility..."
5. **Improve Chunking Overlap**
- Ensure examples carry context from surrounding paragraphs
- Consider semantic chunking (split on topic changes, not just size)
6. **Add Document-Level Context to Chunks**
- Prepend document title/summary to each chunk before embedding
- Helps chunks maintain connection to main theme
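The HyDE expansion in item 4 could be sketched like this; the `complete` callable is a stand-in for whatever LLM helper the codebase exposes, not an existing API:

```python
def hyde_expand(query: str, complete) -> str:
    """Generate a hypothetical answer passage for `query`; embedding this
    passage instead of the raw query should land closer to the target
    article's own vocabulary."""
    prompt = (
        "Write a short passage that a document answering this query would "
        f"contain, using that document's own vocabulary:\n\n{query}"
    )
    return complete(prompt)  # hypothetical LLM call

# Usage idea: embed(hyde_expand("engineer fail-safe", complete=llm)) and
# feed the result into the existing embedding search.
```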
### Lower Priority
7. **Tune Fusion Weights**
- Current: 70% embedding, 30% BM25
- May need adjustment based on use case
8. **Add Temporal Decay**
- Prefer recent content for certain query types
---
## Architectural Issues
### Issue A: BM25 Implementation is Not Scalable
The current BM25 implementation cannot handle 250K chunks:
```python
# Current approach (in bm25.py):
items = items_query.all() # Loads ALL matching chunks into memory
corpus = [item.content.lower().strip() for item in items] # Makes copies
retriever.index(corpus_tokens) # Rebuilds index from scratch per query
```
**Why this fails:**
- 147K forum chunks × ~3KB avg = ~440MB just for text
- Plus tokenization, BM25 index structures → OOM
**Solutions (in order of recommendation):**
1. **PostgreSQL Full-Text Search** (Recommended)
- Already have PostgreSQL in stack
- Add `tsvector` column to Chunk table
- Create GIN index for fast search
- Use `ts_rank` for relevance scoring
- No additional infrastructure needed
2. **Persistent BM25 Index**
- Build index once at ingestion time
- Store on disk, load once at startup
- Update incrementally on new chunks
- More complex to maintain
3. **External Search Engine**
- Elasticsearch or Meilisearch
- Adds operational complexity
- May be overkill for current scale
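Option 1 could look roughly like the following DDL (a sketch with assumed names, not the actual migration):

```sql
-- Sketch of the DDL such a migration applies; index/trigger names assumed.
ALTER TABLE chunk ADD COLUMN search_vector tsvector;

UPDATE chunk SET search_vector = to_tsvector('english', content);

CREATE INDEX ix_chunk_search_vector ON chunk USING GIN (search_vector);

CREATE TRIGGER chunk_search_vector_update
    BEFORE INSERT OR UPDATE ON chunk
    FOR EACH ROW
    EXECUTE FUNCTION tsvector_update_trigger(
        search_vector, 'pg_catalog.english', content);
```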
### Issue B: Chunk Size Variance
Chunks range from 3 bytes to 3.3MB. This causes:
- Large chunks have diluted embeddings
- Small chunks lack context
- Inconsistent search quality across collections
**Solution:** Re-chunk existing content with:
- Max ~512 tokens per chunk (optimal for embeddings)
- 50-100 token overlap between chunks
- Prepend document title/context to each chunk
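A sliding-window re-chunking along these lines could look like this (a sketch: token lists stand in for a real tokenizer, and the 64-token overlap is just one value in the suggested 50-100 range):

```python
def rechunk(title: str, tokens: list[str],
            max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split a token stream into overlapping windows, prepending the
    document title to each chunk for context."""
    chunks: list[str] = []
    step = max_tokens - overlap
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + max_tokens]
        if not window:
            break
        chunks.append(f"{title}\n\n" + " ".join(window))
    return chunks
```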
### Issue C: Search Timeout (2 seconds)
The default 2-second timeout is too aggressive for:
- Large collections (147K forum chunks)
- Cold Qdrant cache
- Network latency
**Solution:** Increase to 5-10 seconds for the initial search, with a progressive-loading UX.
---
## Test Queries for Validation
After making changes, test with these queries against article 157:
```python
# Should find article 157 (humility in science)
test_cases = [
    # Main topic - currently working
    ("humility in science", "main topic"),
    ("types of humility epistemic", "topic area"),
    # Specific examples - currently failing
    ("engineer fail-safe mechanisms", "specific example"),
    ("student double-checks math test", "specific example"),
    # Tangential mentions - currently failing
    ("creationism debate", "mentioned topic"),
    # Vague/half-remembered - currently failing
    ("checking your work", "vague concept"),
    ("when engineers make mistakes", "tangential"),
]
```
---
## Session Log
### 2025-12-20
1. **Initial Investigation**
- Found scores were all 0.000
- Traced to embeddings.py and bm25.py discarding scores
2. **Fixed Score Propagation**
- Modified 4 files to preserve and fuse scores
- Rebuilt Docker images
- Verified scores now appear (0.4-0.5 range)
3. **Quality Testing**
- Selected random article (ID 157, humility in science)
- Tested 10 query types from specific to vague
- Found 3/10 queries succeed (main topic only)
4. **Root Cause Analysis**
- BM25 disabled - no keyword matching
- Embeddings capture theme, not details
- Target chunks have 0.52 similarity vs 0.61 for top results
5. **Next Steps**
- Enable BM25 and retest
- Consider HyDE for query expansion
- Investigate chunking improvements
6. **Deep Dive: Database Statistics**
- Total chunks: 250,048
- Forum: 147,546 (58.9%)
- Blog: 46,159 (18.5%)
- Book: 34,586 (13.8%)
- Text: 10,823 (4.3%)
7. **Chunk Size Analysis (MAJOR ISSUE)**
Found excessively large chunks that dilute embedding quality:
| Collection | Avg Length | Max Length | Over 8KB | Over 128KB |
|------------|------------|------------|----------|------------|
| book | 15,487 | 3.3MB | 12,452 | 474 |
| blog | 3,661 | 710KB | 2,874 | 19 |
| forum | 3,514 | 341KB | 8,943 | 47 |
Books have 36% of chunks over 8KB - too large for good embedding quality.
The Voyage embedding model has 32K token limit, but chunks over 8KB (~2K tokens)
start to lose fine-grained detail in the embedding.
8. **Detailed Score Analysis for "engineer fail-safe mechanisms humble design"**
- Query returns 145,632 results from forum collection
- Top results score 0.61, median 0.34
- Source 157 (target article) chunks score:
  - 3097f4d6: 0.5222 (rank 140/145,632) - main content
  - db87f54d: 0.4863 (rank 710/145,632) - full text chunk
  - f3e97d77: 0.4622 (rank 1,952/145,632)
  - 047e7a9a: 0.3528 (rank 58,949/145,632) - summary
**Key Finding:** Target chunks rank 140th-710th, but with limit=10,
they never appear. BM25 would find exact keyword match "engineer fail-safe".
9. **Top Results Analysis**
The chunks scoring 0.61 (beating our target) are about:
- CloudFlare incident (software failure)
- AI safety testing (risk/mitigation mechanisms)
- Generic "mechanisms to prevent failure" content
These are semantically similar to "engineer fail-safe mechanisms"
but NOT about humility. Embeddings capture concept, not context.
10. **Root Cause Confirmed**
The fundamental problem is:
1. Embeddings capture semantic meaning of query concepts
2. Query "engineer fail-safe" embeds as "engineering safety mechanisms"
3. Articles specifically about engineering/failure rank higher
4. Article about humility (that merely mentions engineers as example) ranks lower
5. Only keyword search (BM25) can find "mentioned in passing" content
11. **Implemented Candidate Pool Multiplier**
Added `CANDIDATE_MULTIPLIER = 5` to search.py:
- Internal searches now fetch 5x the requested limit
- Results from both methods are fused, then top N returned
- This helps surface results that rank well in one method but not both
12. **Added Stopword Filtering to FTS**
Updated bm25.py to filter common English stopwords before building tsquery:
- Words like "what", "you", "not", "the" are filtered out
- This makes AND matching less strict
- Query "saying what you mean" becomes "saying:* & mean:*" instead of 8 terms
13. **Testing: "Taboo Your Words" Query**
Query: "saying what you mean not using specific words"
Target: Source 735 ("Taboo Your Words" article)
Results:
- Embedding search ranks target at position 21 (score 0.606)
- Top 10 results score 0.62-0.64 (about language/communication generally)
- FTS doesn't match because article lacks "saying" and "specific"
- After fusion: target ranks 23rd, cutoff is 20th
**Key Insight:** The query describes the *concept* ("not using specific words")
but the article is about a *technique* ("taboo your words = replace with definitions").
These are semantically adjacent but not equivalent.
With direct query "replacing words with their definitions" → ranks 1st!
14. **Testing: "Clowns Iconoclasts" Query**
Query: "clowns being the real iconoclasts"
Target: "Lonely Dissent" article
Results: Found at rank 1 with score 0.815 (hybrid boost!)
- Both embedding AND FTS match
- 0.15 hybrid bonus applied
- This is an ideal case where keywords match content
15. **Remaining Challenges**
- "Half-remembered" queries describing concepts vs actual content
- Need query expansion (HyDE) to bridge semantic gap
- Or return more results for user to scan
- Consider showing "You might also be looking for..." suggestions
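The stopword filtering from step 12 can be sketched like this; the abbreviated stopword set below is an assumption, so the real filter in `bm25.py` may keep or drop different terms:

```python
# Abbreviated stopword set (assumption - the real list is longer).
STOPWORDS = {
    "a", "an", "and", "are", "is", "it", "not", "of", "or",
    "the", "to", "using", "what", "when", "with", "you", "your",
}

def to_tsquery_terms(query: str) -> str:
    """Drop stopwords, then AND the remaining terms with prefix wildcards."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    return " & ".join(f"{w}:*" for w in words)
```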