78 Commits

Author SHA1 Message Date
f042f9aed8 proactive stuff 2025-12-29 14:07:12 +00:00
47180e1e71 fixes 2025-12-24 14:52:12 +00:00
5d79fa349e synch people 2025-12-24 14:38:14 +00:00
47629fc5fb add PRs and People 2025-12-24 13:25:34 +00:00
526bfa5f6b more github ingesting 2025-12-23 20:02:10 +00:00
5b997cc397 Fix search bugs: query terms, index validation, chunk loss
- Include 2-letter terms (AI, ML) in query term extraction (was > 2, now >= 2)
- Add guard for empty data before accessing data[0].data[0] in scorer
- Preserve chunks without content in reranking instead of silently dropping
- Remove legacy wrapper functions (apply_title_boost, apply_popularity_boost)
- Update tests to use apply_source_boosts directly
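
The term-length change above can be illustrated with a minimal sketch. The function name `extract_query_terms` appears elsewhere in this log, but the signature, the tokenizing regex, and the `STOPWORDS` contents below are assumptions, not the project's actual code:

```python
import re

# Hypothetical stopword set; the project's real list is not shown in this log.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "and", "to", "for"}


def extract_query_terms(query: str) -> set[str]:
    """Return lowercase, non-stopword terms that are at least 2 characters long."""
    words = re.findall(r"\w+", query.lower())
    # `>= 2` keeps short but meaningful terms such as "ai" and "ml";
    # the earlier `> 2` comparison dropped them.
    return {w for w in words if len(w) >= 2 and w not in STOPWORDS}
```

With this threshold, a query like "AI in the ML era" keeps "ai" and "ml" alongside "era".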

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 15:01:03 +00:00
782b56939f Refactor search: add LLM query analysis, extract constants
- Add query_analysis.py for LLM-based query preprocessing
  - Detects modalities from natural language ("on lesswrong" -> forum)
  - Cleans meta-language ("I remember reading..." -> core query)
  - Generates query variants for better recall
  - Dynamically discovers modalities and domains from database

- Extract constants to constants.py
  - STOPWORDS, RRF_K, boost values, etc.
  - Cleaner separation of configuration from logic

- Refactor search_chunks into focused helper functions
  - _run_llm_analysis: parallel query analysis + HyDE
  - _apply_query_analysis: apply analysis results
  - _build_search_data: construct search data with variants
  - _run_searches: embedding + BM25 with RRF fusion
  - _fetch_chunks: database retrieval with scoring
  - _apply_boosts: title, popularity, recency boosts
  - _apply_reranking: cross-encoder reranking

- Remove redundant regex-based modality detection
- Remove static QUERY_EXPANSIONS (LLM handles this better)
- Add comprehensive tests for query_analysis module
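
For readers unfamiliar with the RRF fusion step mentioned for `_run_searches`, here is a minimal sketch of reciprocal rank fusion. The name `fuse_scores_rrf` appears elsewhere in this log, but the signature and the value of `RRF_K` below are assumptions (60 is a commonly used default, not a value confirmed by the commit):

```python
from collections import defaultdict

RRF_K = 60  # assumed value; the real constant lives in constants.py


def fuse_scores_rrf(rankings: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Fuse several ranked lists of ids into a single score per id.

    Each list contributes 1 / (k + rank) for every id it contains,
    where rank starts at 1 for the best hit.
    """
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return dict(fused)


# Usage: fuse an embedding ranking with a BM25 ranking.
embedding_hits = ["a", "b", "c"]
bm25_hits = ["b", "c", "a"]
fused = fuse_scores_rrf([embedding_hits, bm25_hits])
ordered = sorted(fused, key=fused.get, reverse=True)
```

Because only ranks are used, the two retrievers do not need comparable raw scores.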

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 14:43:17 +00:00
d9fcfe3878 more search improvements 2025-12-21 12:29:44 +00:00
f3d8b6602b Add popularity boosting to search based on karma
- Add `popularity` property to SourceItem base class (default 1.0)
- Override in ForumPost with karma-based calculation:
  - Uses KARMA_REFERENCES dict mapping URL patterns to reference values
  - LessWrong: 100 (90th percentile from actual data)
  - Reference karma gives popularity=2.0, caps at 2.5
- Add apply_popularity_boost() to search pipeline
- POPULARITY_BOOST = 0.02 (2% score adjustment per popularity unit)
- Add comprehensive tests for popularity boost
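
One way to read the numbers above, as a sketch only: the commit states the default (1.0), the value at the reference karma (2.0), the cap (2.5), and the boost constant (0.02), but not the curve between those points or how the boost is applied. The linear formula, the URL pattern, and both helper names below are assumptions:

```python
POPULARITY_BOOST = 0.02  # from the commit message
KARMA_REFERENCES = {"lesswrong.com": 100}  # pattern string is hypothetical
MAX_POPULARITY = 2.5


def karma_popularity(karma: int, reference: int) -> float:
    """Assumed linear mapping: 0 karma -> 1.0, reference karma -> 2.0, capped at 2.5."""
    return min(1.0 + max(karma, 0) / reference, MAX_POPULARITY)


def boost_score(score: float, popularity: float) -> float:
    """Assumed application: 2% adjustment per popularity unit above the 1.0 default."""
    return score * (1.0 + POPULARITY_BOOST * (popularity - 1.0))
```

Under these assumptions a post at the reference karma gets a 2% lift, and no post can gain more than 3%.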

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 22:44:06 +00:00
09215adf9a Add comprehensive tests for search improvements
- Add tests for extract_query_terms (stopword filtering, short words)
- Add tests for apply_query_term_boost (boost calculations, edge cases)
- Add tests for deduplicate_by_source (keeps highest per source)
- Add tests for apply_title_boost (title matching with mocked DB)
- Add tests for fuse_scores_rrf (RRF score fusion, ranking behavior)
- Add tests for rerank module (VoyageAI reranker mocking)
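
As an illustration of the "keeps highest per source" behaviour these tests cover, here is a minimal sketch. The dict-based chunk shape and the `source_id` and `score` keys are assumptions; the project's real chunk type is not shown in this log:

```python
def deduplicate_by_source(chunks: list[dict]) -> list[dict]:
    """Keep only the highest-scoring chunk for each source, best first."""
    best: dict = {}
    for chunk in chunks:
        source_id = chunk["source_id"]
        if source_id not in best or chunk["score"] > best[source_id]["score"]:
            best[source_id] = chunk
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```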

Uses pytest.mark.parametrize for concise, data-driven tests.
77 tests total covering all new search functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 22:26:16 +00:00
d644281b26 Fix 5 security and quality bugs
BUG-030: Add rate limiting via slowapi middleware
- Added slowapi to requirements
- Configurable limits: 100/min default, 30/min search, 10/min auth
- Rate limit settings in settings.py

BUG-028: Fix filter validation in embeddings.py
- Unknown filter keys now logged and ignored instead of passed through
- Prevents potential filter injection
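
The "log and ignore" behaviour can be sketched as an allow-list check. The function name and the set of allowed keys below are hypothetical; only the behaviour (unknown keys are logged and dropped rather than passed through) comes from the commit message:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical allow-list; the real set of supported filter keys may differ.
ALLOWED_FILTER_KEYS = {"tags", "source_ids", "min_size", "max_size"}


def sanitize_filters(filters: dict) -> dict:
    """Return only allow-listed filters, logging any key that is dropped."""
    clean = {}
    for key, value in filters.items():
        if key in ALLOWED_FILTER_KEYS:
            clean[key] = value
        else:
            logger.warning("Ignoring unknown filter key: %r", key)
    return clean
```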

BUG-034: Fix timezone handling in oauth_provider.py
- Now uses timezone-aware UTC comparison for refresh tokens
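
A minimal sketch of a timezone-aware expiry check follows. The helper name is hypothetical, and treating naive timestamps as UTC is an assumption about how stored values should be interpreted:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_token_expired(expires_at: datetime, now: Optional[datetime] = None) -> bool:
    """Compare expiry against the current time using aware UTC datetimes."""
    now = now or datetime.now(timezone.utc)
    if expires_at.tzinfo is None:
        # Assumption: naive timestamps were stored as UTC.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now >= expires_at
```

Comparing a naive datetime with an aware one raises `TypeError` in Python, which is why both sides are normalized first.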

BUG-050: Fix SQL injection in test database handling
- Added validate_db_identifier() function
- Validates database names contain only safe characters
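
A sketch of what such a validator might look like. The name `validate_db_identifier()` is from the commit, but the exact character rules and the choice to raise `ValueError` are assumptions:

```python
import re

# Assumed rule: a letter or underscore, then letters, digits, or underscores.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_db_identifier(name: str) -> str:
    """Return `name` unchanged if it is safe to interpolate into SQL, else raise."""
    if not _SAFE_IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe database identifier: {name!r}")
    return name
```

Identifiers such as database names cannot be bound as query parameters, so validating them before string interpolation is the usual mitigation.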

Also:
- Updated tests for bcrypt password format
- Updated test for filter validation behavior
- Updated INVESTIGATION.md with fix status

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 21:41:16 +00:00
Daniel O'Connell
28bc10df92 Fix break_chunk() appending wrong object (BUG-007)
The function was appending the entire DataChunk object instead of
the individual item when processing non-string data (e.g., images).

Bug: `result.append(chunk)` should have been `result.append(c)`

This caused:
- Type mismatches (returning DataChunk instead of MulitmodalChunk)
- Potential circular references
- Embedding failures for mixed content

Fixed by appending the individual item `c` instead of the parent `chunk`.
Updated existing test and added new test to verify behavior.
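
The bug pattern can be reproduced in a simplified sketch. The `DataChunk` shape, the fixed-width string splitting, and the `chunk_size` parameter are assumptions for illustration; only the `append(chunk)` versus `append(c)` distinction comes from the commit:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataChunk:
    """Simplified stand-in for the project's DataChunk."""
    data: list[Any] = field(default_factory=list)


def break_chunk(chunk: DataChunk, chunk_size: int = 10) -> list[Any]:
    """Split string items into pieces and pass non-string items through."""
    result: list[Any] = []
    for c in chunk.data:
        if isinstance(c, str):
            result.extend(c[i:i + chunk_size] for i in range(0, len(c), chunk_size))
        else:
            # The fix: append the individual item `c`, not the parent `chunk`.
            result.append(c)
    return result
```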

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 18:27:13 +01:00
Daniel O'Connell
21dedbeb61 Fix search score aggregation to use mean instead of sum
BUG-004: Score aggregation was broken - documents with more chunks
would always rank higher regardless of relevance because scores were
summed instead of averaged.

Changes:
- Changed score calculation from sum() to mean()
- Added comprehensive tests for SearchResult.from_source_item()
- Added tests for elide_content helper

This ensures search results are ranked by actual relevance rather
than by the number of chunks in the document.
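
A small worked example of why summing favours long documents. The scores are made up for illustration:

```python
from statistics import mean

# Made-up chunk scores: a long, weakly relevant document
# and a short, highly relevant one.
long_doc = [0.2] * 10
short_doc = [0.9, 0.8]

# Summing rewards chunk count: the long document wins.
sum_prefers_long = sum(long_doc) > sum(short_doc)

# Averaging compares per-chunk relevance: the short document wins.
mean_prefers_short = mean(short_doc) > mean(long_doc)
```

Here the sums are roughly 2.0 against 1.7, while the means are roughly 0.2 against 0.85.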

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 18:25:15 +01:00
Daniel O'Connell
93b77a16d6 Add pytest markers for fast/slow test separation
- Add --run-slow flag to optionally include slow tests
- Auto-detect tests that use db_session, test_db, db_engine, or qdrant fixtures
- Skip slow tests by default for faster development iteration
- Usage: pytest (fast only) or pytest --run-slow (all tests)
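
A minimal `conftest.py` sketch of this behaviour, assuming the detection is done by fixture name at collection time. The helper name `uses_slow_fixture` and the skip reason are hypothetical; the fixture names and the `--run-slow` flag are from the commit:

```python
SLOW_FIXTURES = {"db_session", "test_db", "db_engine", "qdrant"}


def uses_slow_fixture(fixturenames) -> bool:
    """True if a test requests any fixture that needs a database or Qdrant."""
    return bool(SLOW_FIXTURES & set(fixturenames))


def pytest_addoption(parser):
    parser.addoption(
        "--run-slow",
        action="store_true",
        default=False,
        help="also run tests that need a database or Qdrant",
    )


def pytest_collection_modifyitems(config, items):
    import pytest  # imported here so the helper above works without pytest

    if config.getoption("--run-slow"):
        return  # run everything
    skip_slow = pytest.mark.skip(reason="slow test: pass --run-slow to include")
    for item in items:
        if uses_slow_fixture(getattr(item, "fixturenames", ())):
            item.add_marker(skip_slow)
```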

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 18:21:41 +01:00
56ed7b7d8f fix scheduler 2025-11-04 12:46:38 +00:00
Daniel O'Connell
ad6510bd17 add a bunch of tests 2025-11-03 23:23:41 +01:00
a5bc53326d backups 2025-11-02 00:01:35 +00:00
Daniel O'Connell
814090dccb use db bots 2025-11-01 18:52:37 +01:00
Daniel O'Connell
9639fa3dd7 use usage tracker 2025-11-01 18:49:06 +01:00
Daniel O'Connell
8af07f0dac add slash commands for discord 2025-11-01 18:04:38 +01:00
Daniel O'Connell
07852f9ee7 Base usage tracker 2025-11-01 16:22:40 +01:00
Daniel O'Connell
1a3cf9c931 add tests 2025-10-20 21:10:39 +02:00
Daniel O'Connell
1606348d8b discord integration 2025-10-20 03:47:13 +02:00
Daniel O'Connell
99d3843f47 move to general LLM providers 2025-10-13 03:23:20 +02:00
Daniel O'Connell
f454aa9afa change schedule call signature 2025-10-12 10:17:22 +02:00
Daniel O'Connell
a3544222e7 add scheduled calls 2025-08-12 23:37:54 +00:00
Daniel O'Connell
b68e15d3ab add blogs 2025-08-09 02:07:49 +02:00
Daniel O'Connell
beb94375da fix tests 2025-07-24 23:34:10 +02:00
Daniel O'Connell
50601ad930 proper notes path 2025-07-06 13:53:29 +02:00
Daniel O'Connell
288c2995e5 synch notes 2025-07-05 23:58:47 +02:00
Daniel O'Connell
8eb6374cac second pass in search 2025-06-28 20:59:15 +02:00
Daniel O'Connell
01ccea2733 add missing tests 2025-06-28 02:30:54 +02:00
Daniel O'Connell
a3daea883b fix tests 2025-06-26 14:12:42 +02:00
Daniel O'Connell
0e574542d5 fix tests 2025-06-10 15:32:34 +02:00
Daniel O'Connell
3e4e5872d1 search filters 2025-06-10 12:16:54 +02:00
Daniel O'Connell
780e27ba04 better emails embedding + format search results 2025-06-09 13:51:58 +02:00
Daniel O'Connell
4d057d1ec6 discord notification on error 2025-06-05 02:21:52 +02:00
Daniel O'Connell
e5da3714de multiple dimensions for confidence values 2025-06-03 12:18:20 +02:00
Daniel O'Connell
a40e0b50fa editable notes 2025-06-02 22:24:19 +02:00
Daniel O'Connell
ac3b48a04c notes and observations triggered as jobs 2025-06-02 14:34:39 +02:00
Daniel O'Connell
29b8ce6860 Fix search + proper integration tests 2025-06-02 02:53:32 +02:00
Daniel O'Connell
1dd93929c1 Add embedding for observations 2025-05-31 16:51:55 +02:00
Daniel O'Connell
004bd39987 Add observations model 2025-05-31 16:15:30 +02:00
Daniel O'Connell
e505f9b53c summarize before chunking 2025-05-29 01:26:10 +02:00
Daniel O'Connell
ed8033bdd3 Add LessWrong tasks + reindexer 2025-05-28 03:14:27 +02:00
Daniel O'Connell
ab87bced81 fix linting 2025-05-27 23:19:28 +02:00
Daniel O'Connell
1291ca9d08 better handling of errors 2025-05-27 22:39:24 +02:00
Daniel O'Connell
f5c3e458d7 move parsers 2025-05-27 21:53:31 +02:00
Daniel O'Connell
0f15e4e410 Check all feeds work 2025-05-27 01:42:22 +02:00
Daniel O'Connell
876fa87725 Add archives fetcher 2025-05-27 01:24:57 +02:00