5 Commits

Author SHA1 Message Date
5b997cc397 Fix search bugs: query terms, index validation, chunk loss
- Include 2-letter terms (AI, ML) in query term extraction (was > 2, now >= 2)
- Add guard for empty data before accessing data[0].data[0] in scorer
- Preserve chunks without content in reranking instead of silently dropping
- Remove legacy wrapper functions (apply_title_boost, apply_popularity_boost)
- Update tests to use apply_source_boosts directly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 15:01:03 +00:00
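Note on 5b997cc397: the first bullet changes the length filter in query term extraction from `> 2` to `>= 2`. The sketch below shows the effect under stated assumptions. The stopword set, the tokenizing regex, and the function signature are illustrative and are not taken from the repository; only the `>= 2` comparison comes from the commit message.

```python
import re

# Hypothetical stand-in for the project's STOPWORDS constant.
STOPWORDS = frozenset({"a", "an", "the", "is", "in", "on", "of", "to", "and", "what"})


def extract_query_terms(query: str, min_length: int = 2) -> set[str]:
    """Return lowercase query terms, keeping short acronyms such as AI and ML."""
    # Assumed tokenization: runs of ASCII letters and digits.
    words = re.findall(r"[a-z0-9]+", query.lower())
    # The fix described in the commit: `>= 2` instead of `> 2`,
    # so 2-letter terms survive unless they are stopwords.
    return {w for w in words if len(w) >= min_length and w not in STOPWORDS}
```

With the old `> 2` comparison, a query such as "What is AI in ML research" would have produced only `research`; with `>= 2` it also yields `ai` and `ml`, while 2-letter stopwords like "is" and "in" are still removed.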
782b56939f Refactor search: add LLM query analysis, extract constants
- Add query_analysis.py for LLM-based query preprocessing
  - Detects modalities from natural language ("on lesswrong" -> forum)
  - Cleans meta-language ("I remember reading..." -> core query)
  - Generates query variants for better recall
  - Dynamically discovers modalities and domains from database

- Extract constants to constants.py
  - STOPWORDS, RRF_K, boost values, etc.
  - Cleaner separation of configuration from logic

- Refactor search_chunks into focused helper functions
  - _run_llm_analysis: parallel query analysis + HyDE
  - _apply_query_analysis: apply analysis results
  - _build_search_data: construct search data with variants
  - _run_searches: embedding + BM25 with RRF fusion
  - _fetch_chunks: database retrieval with scoring
  - _apply_boosts: title, popularity, recency boosts
  - _apply_reranking: cross-encoder reranking

- Remove redundant regex-based modality detection
- Remove static QUERY_EXPANSIONS (LLM handles this better)
- Add comprehensive tests for query_analysis module

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 14:43:17 +00:00
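Note on 782b56939f: `_run_searches` is described as combining embedding and BM25 results with RRF (reciprocal rank fusion). A minimal sketch of standard RRF follows. The function name, its signature, and the value `RRF_K = 60` are assumptions for illustration; the commit only says that an `RRF_K` constant exists, not what it is set to.

```python
from collections import defaultdict

RRF_K = 60  # Assumed value; 60 is a commonly used default, not confirmed by the commit.


def fuse_rankings_rrf(rankings: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank)."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    return dict(fused)
```

A chunk that appears in both lists accumulates two terms, so it can outrank a chunk that tops only one list. For example, fusing `["a", "b", "c"]` with `["b", "d"]` ranks `b` first.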
d9fcfe3878 more search improvements
2025-12-21 12:29:44 +00:00
f3d8b6602b Add popularity boosting to search based on karma
- Add `popularity` property to SourceItem base class (default 1.0)
- Override in ForumPost with karma-based calculation:
  - Uses KARMA_REFERENCES dict mapping URL patterns to reference values
  - LessWrong: 100 (90th percentile from actual data)
  - Reference karma gives popularity=2.0, caps at 2.5
- Add apply_popularity_boost() to search pipeline
- POPULARITY_BOOST = 0.02 (2% score adjustment per popularity unit)
- Add comprehensive tests for popularity boost

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 22:44:06 +00:00
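Note on f3d8b6602b: the message gives three fixed points for `popularity` (default 1.0, 2.0 at the reference karma, cap at 2.5) and a `POPULARITY_BOOST` of 0.02. The sketch below fills in the rest with assumptions: a linear interpolation between the fixed points, negative karma treated as zero, a substring match on a hypothetical `lesswrong.com` pattern, and a multiplicative boost that leaves a default-popularity item unchanged. None of those details are confirmed by the commit.

```python
POPULARITY_BOOST = 0.02      # From the commit message.
DEFAULT_POPULARITY = 1.0     # From the commit message.
MAX_POPULARITY = 2.5         # From the commit message.
# Only the LessWrong reference value (100) is given; the key format is hypothetical.
KARMA_REFERENCES = {"lesswrong.com": 100}


def karma_popularity(url: str, karma: int) -> float:
    """Map karma to popularity: 0 -> 1.0, reference -> 2.0, capped at 2.5 (assumed linear)."""
    for pattern, reference in KARMA_REFERENCES.items():
        if pattern in url:
            return min(MAX_POPULARITY, DEFAULT_POPULARITY + max(karma, 0) / reference)
    return DEFAULT_POPULARITY  # Unknown site: no popularity signal.


def boosted_score(score: float, popularity: float) -> float:
    """Assumed form: 2% adjustment per popularity unit above the default."""
    return score * (1.0 + POPULARITY_BOOST * (popularity - DEFAULT_POPULARITY))
```

Under these assumptions a LessWrong post at 100 karma gets popularity 2.0 and a 2% score increase, and a post at 1000 karma is capped at popularity 2.5, which corresponds to a 3% increase.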
09215adf9a Add comprehensive tests for search improvements
- Add tests for extract_query_terms (stopword filtering, short words)
- Add tests for apply_query_term_boost (boost calculations, edge cases)
- Add tests for deduplicate_by_source (keeps highest per source)
- Add tests for apply_title_boost (title matching with mocked DB)
- Add tests for fuse_scores_rrf (RRF score fusion, ranking behavior)
- Add tests for rerank module (VoyageAI reranker mocking)

Uses pytest.mark.parametrize for concise, data-driven tests.
77 tests total covering all new search functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 22:26:16 +00:00
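Note on 09215adf9a: the tests for `deduplicate_by_source` are described as checking that the highest-scoring chunk per source is kept. The sketch below shows one way such a function could behave. The `Chunk` dataclass, its field names, and the descending sort of the result are assumptions for illustration, not the repository's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical minimal chunk record: a source ID and a relevance score."""
    source_id: int
    score: float


def deduplicate_by_source(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only the highest-scoring chunk for each source, best first."""
    best: dict[int, Chunk] = {}
    for chunk in chunks:
        current = best.get(chunk.source_id)
        # Strict comparison: on a tie, the chunk seen first is kept.
        if current is None or chunk.score > current.score:
            best[chunk.source_id] = chunk
    return sorted(best.values(), key=lambda c: c.score, reverse=True)
```

Cases like this one (several chunks from one source, a single chunk from another, an empty input) map naturally onto the `pytest.mark.parametrize` style the commit mentions, with one input list and one expected list per case.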