From adff8662bbec0900c511899e590d20b1180eed57 Mon Sep 17 00:00:00 2001 From: Daniel O'Connell Date: Fri, 19 Dec 2025 19:24:17 +0100 Subject: [PATCH] Add investigation findings documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documents 100+ issues found across: - Data layer (10 issues) - Content processing (12 issues) - Search system (14 issues) - API layer (12 issues) - Worker tasks (20 issues) - Infrastructure (12 issues) - Code quality (20+ issues) Includes database statistics and improvement suggestions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 --- INVESTIGATION.md | 369 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 369 insertions(+) create mode 100644 INVESTIGATION.md diff --git a/INVESTIGATION.md b/INVESTIGATION.md new file mode 100644 index 0000000..0405eec --- /dev/null +++ b/INVESTIGATION.md @@ -0,0 +1,369 @@ +# Memory System Investigation + +## Investigation Status +- **Started:** 2025-12-19 +- **Status:** Complete +- **Total Issues Found:** 100+ + +--- + +## Executive Summary + +This investigation identified **100+ issues** across 7 areas of the memory system. The most critical findings are: + +1. **Security vulnerabilities** (path traversal, CORS, API key logging) +2. **Data integrity issues** (1,338 items unsearchable due to collection mismatch) +3. **Search system bugs** (BM25 filters ignored, broken score aggregation) +4. **Worker reliability** (no retries, silent failures, race conditions) +5. **Code quality concerns** (bare exceptions, type safety gaps) + +--- + +## Critical Bugs (Immediate Action Required) + +### BUG-001: Path Traversal Vulnerabilities +- **Severity:** CRITICAL +- **Area:** API Security +- **Files:** + - `src/memory/api/app.py:54-64` - `/files/{path}` endpoint + - `src/memory/api/MCP/memory.py:355-365` - `fetch_file` tool + - `src/memory/api/MCP/memory.py:335-352` - `note_files` tool +- **Description:** No validation that requested files are within allowed directories +- **Impact:** Arbitrary file read on server filesystem +- **Fix:** Add path resolution validation with `.resolve()` and prefix check + +### BUG-002: Collection Mismatch (1,338 items) +- **Severity:** CRITICAL +- **Area:** Data/Embedding Pipeline +- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection +- **Impact:** Items completely unsearchable +- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text +- **Fix:** Re-sync vectors or update chunk collection_name + +### BUG-003: BM25 Filters Completely Ignored +- **Severity:** CRITICAL +- **Area:** Search System +- **File:** `src/memory/api/search/bm25.py:32-43` +- **Description:** BM25 search ignores tags, dates, size filters - only applies source_ids +- **Impact:** Filter results diverge between BM25 and vector search +- **Fix:** Apply all filters consistently in BM25 search + +### BUG-004: Search Score Aggregation Broken +- **Severity:** CRITICAL +- **Area:** Search System +- **File:** `src/memory/api/search/types.py:44-45` +- **Description:** Scores are summed across chunks instead of averaged +- **Impact:** Documents with more chunks always rank higher regardless of relevance +- **Fix:** Change to mean() or max-based ranking + +### BUG-005: Registration Always Enabled +- **Severity:** CRITICAL +- **Area:** Configuration/Security +- **File:** `src/memory/common/settings.py:178` +- **Description:** Logic error: `REGISTER_ENABLED = boolean_env(...) or True` always evaluates to True +- **Impact:** Open registration regardless of configuration +- **Fix:** Remove `or True` + +### BUG-006: API Key Logged in Plain Text +- **Severity:** CRITICAL +- **Area:** Security +- **File:** `src/memory/discord/api.py:63` +- **Description:** Bot API key logged in error message +- **Impact:** Credentials exposed in logs +- **Fix:** Remove API key from log message + +--- + +## High Severity Bugs + +### BUG-007: Wrong Object Appended in break_chunk() +- **File:** `src/memory/common/embedding.py:57` +- **Description:** Appends entire `chunk` object instead of individual item `c` +- **Impact:** Circular references, type mismatches, embedding failures + +### BUG-008: Oversized Chunks Exceed Token Limits +- **File:** `src/memory/common/chunker.py:109-112` +- **Description:** When overlap <= 0, chunks yielded without size validation +- **Impact:** 483 chunks >10K chars (should be ~2K) + +### BUG-009: Scheduled Call Race Condition +- **File:** `src/memory/workers/tasks/scheduled_calls.py:145-163` +- **Description:** No DB lock when querying due calls - multiple workers can execute same call +- **Impact:** Duplicate LLM calls and Discord messages + +### BUG-010: Missing MCP Servers Relationship +- **File:** `src/memory/common/db/models/discord.py:74-76` +- **Description:** `self.mcp_servers` referenced in `to_xml()` but no relationship defined +- **Impact:** Runtime AttributeError + +### BUG-011: User ID Type Mismatch +- **Files:** `users.py:47`, `scheduled_calls.py:23` +- **Description:** `ScheduledLLMCall.user_id` is BigInteger but `User.id` is Integer +- **Impact:** Foreign key constraint violations + +### BUG-012: Inverted Min Score Thresholds +- **File:** `src/memory/api/search/embeddings.py:186-207` +- **Description:** Multimodal uses 0.25, text uses 0.4 - should be reversed +- **Impact:** Multimodal results artificially boosted + +### BUG-013: No Error Handling in Embedding Pipeline +- **File:** `src/memory/common/embedding.py` +- **Description:** No try-except blocks around Voyage AI API calls +- **Impact:** Entire content processing fails on API error + +### BUG-014: Unrestricted CORS Configuration +- **File:** `src/memory/api/app.py:36-42` +- **Description:** `allow_origins=["*"]` with `allow_credentials=True` +- **Impact:** CSRF attacks enabled + +### BUG-015: Missing Retry Configuration +- **Files:** All task files +- **Description:** No `autoretry_for`, `max_retries` on any Celery tasks +- **Impact:** Transient failures lost without retry + +### BUG-016: Silent Task Failures +- **File:** `src/memory/workers/tasks/content_processing.py:258-296` +- **Description:** `safe_task_execution` catches all exceptions, returns as dict +- **Impact:** Failed tasks can't be retried by Celery + +--- + +## Medium Severity Bugs + +### Data Layer +- BUG-017: Missing `collection_name` index on Chunk table (`source_item.py:165-168`) +- BUG-018: AgentObservation dead code for future embedding types (`source_items.py:1005-1028`) +- BUG-019: Embed status never set to STORED after push (`content_processing.py:125`) +- BUG-020: Missing server_id index on DiscordMessage (`source_items.py:426-435`) + +### Content Processing +- BUG-021: No chunk validation after break_chunk (`embedding.py:49-58`) +- BUG-022: Ebook extraction creates single massive chunk (`extract.py:218-230`) +- BUG-023: SHA256-only deduplication misses semantic duplicates (`source_item.py:51-91`) +- BUG-024: Email hash inconsistency with markdown conversion (`email.py:171-185`) +- BUG-025: Token approximation uses fixed 4-char ratio (`tokens.py:8-12`) + +### Search System +- BUG-026: BM25 scores calculated then discarded (`bm25.py:66-70`) +- BUG-027: LLM score fallback missing - defaults to 0.0 (`scorer.py:55-60`) +- BUG-028: Missing filter validation (`embeddings.py:130-131`) +- BUG-029: Hardcoded min_score thresholds (`embeddings.py:186,202`) + +### API Layer +- BUG-030: Missing rate limiting (global) +- BUG-031: No SearchConfig limits - can request millions of results (`types.py:73-78`) +- BUG-032: No CSRF protection (`auth.py:50-86`) +- BUG-033: Debug print statements in production (`memory.py:363-370`) +- BUG-034: Timezone handling issues (`oauth_provider.py:83-87`) + +### Worker Tasks +- BUG-035: No task time limits (global) +- BUG-036: Database integrity errors not properly handled (`discord.py:310-321`) +- BUG-037: Timezone bug in scheduled calls (`scheduled_calls.py:152-153`) +- BUG-038: Beat schedule not thread-safe for distributed deployment (`ingest.py:19-56`) +- BUG-039: Email sync fails entire account on single folder error (`email.py:84-152`) + +### Infrastructure +- BUG-040: Missing resource limits for postgres, redis, qdrant, api (`docker-compose.yaml`) +- BUG-041: Backup encryption silently disabled if key missing (`settings.py:215-216`) +- BUG-042: Restore scripts don't validate database integrity (`restore_databases.sh:79`) +- BUG-043: Health check doesn't check dependencies (`app.py:87-92`) +- BUG-044: Uvicorn trusts all proxy headers (`docker/api/Dockerfile:63`) + +### Code Quality +- BUG-045: 183 unsafe cast() operations (various files) +- BUG-046: 21 type:ignore comments (various files) +- BUG-047: 32 bare except Exception blocks (various files) +- BUG-048: 13 exception swallowing with pass (various files) +- BUG-049: Missing CSRF in OAuth callback (`auth.py`) +- BUG-050: SQL injection in test database handling (`tests/conftest.py:94`) + +--- + +## Low Severity Bugs + +- BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks) +- BUG-052: Garbage content in text collection +- BUG-053: No vector freshness index (`source_item.py:157`) +- BUG-054: OAuthToken missing Base inheritance (`users.py:215-228`) +- BUG-055: collection_model returns "unknown" (`collections.py:140`) +- BUG-056: Unused "appuser" in API Dockerfile (`docker/api/Dockerfile:48`) +- BUG-057: Build dependencies not cleaned up (`docker/api/Dockerfile:7-12`) +- BUG-058: Typos in log messages (`tests/conftest.py:63`) +- BUG-059: MockRedis overly simplistic (`tests/conftest.py:24-46`) +- BUG-060: Print statement in ebook.py:192 + +--- + +## Improvement Suggestions + +### High Priority +1. **Implement proper retry logic** for all Celery tasks with exponential backoff +2. **Add comprehensive health checks** that validate all service dependencies +3. **Fix score aggregation** to use mean/max instead of sum +4. **Add rate limiting** to prevent DoS attacks +5. **Implement proper CSRF protection** for OAuth flows +6. **Add resource limits** to all Docker services +7. **Implement centralized logging** with ELK or Grafana Loki + +### Medium Priority +1. **Re-chunk oversized content** - add validation to enforce size limits +2. **Add chunk deduplication** based on content hash within same source +3. **Preserve BM25 scores** for hybrid search weighting +4. **Add task progress tracking** for long-running operations +5. **Implement distributed beat lock** for multi-worker deployments +6. **Add backup verification tests** - periodically test restore +7. **Replace cast() with type guards** throughout codebase + +### Lower Priority +1. **Add Prometheus metrics** for observability +2. **Implement structured JSON logging** with correlation IDs +3. **Add graceful shutdown handling** to workers +4. **Document configuration requirements** more thoroughly +5. **Add integration tests** for critical workflows +6. **Remove dead code** and TODO comments in production + +--- + +## Feature Ideas + +### Search Enhancements +1. **Hybrid score weighting** - configurable balance between BM25 and vector +2. **Query expansion** - automatic synonym/related term expansion +3. **Faceted search** - filter by date ranges, sources, tags with counts +4. **Search result highlighting** - show matched terms in context +5. **Saved searches** - store and re-run common queries + +### Content Management +1. **Content quality scoring** - automatic assessment of chunk quality +2. **Duplicate detection UI** - show and merge semantic duplicates +3. **Re-indexing queue** - prioritize content for re-embedding +4. **Content archiving** - move old content to cold storage +5. **Bulk operations** - tag, delete, re-process multiple items + +### Email Management +1. **Email filtering rules** - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders) +2. **Email source classification** - auto-detect email types (transactional, marketing, personal, receipts) +3. **Smart email retention** - keep "useful" emails (orders, bookings, confirmations) while filtering noise + +### User Experience +1. **Search analytics** - track what users search for +2. **Relevance feedback** - let users rate results to improve ranking +3. **Personal knowledge graph** - visualize connections between content +4. **Smart summaries** - auto-generate summaries of search results +5. **Email digest** - scheduled summary of new content + +### Infrastructure +1. **Auto-scaling workers** - scale based on queue depth +2. **Multi-tenant support** - isolate data by user/org +3. **Backup scheduling UI** - configure backup frequency +4. **Monitoring dashboard** - Grafana-style metrics visualization +5. **Audit logging** - track all data access and modifications + +--- + +## Investigation Log + +### 2025-12-19 - Complete Investigation + +**Data Layer (10 issues)** +- Missing relationships (mcp_servers) +- Type mismatches (User.id) +- Missing indexes (collection_name, server_id) +- Dead code (AgentObservation) + +**Content Processing (12 issues)** +- Critical: break_chunk bug appends wrong object +- Critical: Oversized chunks exceed limits +- Deduplication only on SHA256 +- Ebook creates single massive chunk + +**Search System (14 issues)** +- Critical: BM25 ignores filters +- Critical: Score aggregation broken (sum vs mean) +- Inverted min_score thresholds +- BM25 scores discarded + +**API Layer (12 issues)** +- Critical: Path traversal vulnerabilities (3 endpoints) +- CORS misconfiguration +- Missing rate limiting +- Debug print statements + +**Worker Tasks (20 issues)** +- No retry configuration +- Silent task failures +- Race condition in scheduled calls +- No task timeouts + +**Infrastructure (12 issues)** +- Missing resource limits +- Backup encryption issues +- Health check incomplete +- No centralized logging + +**Code Quality (20+ issues)** +- 183 unsafe casts +- 32 bare exception blocks +- Registration always enabled bug +- API key logging + +--- + +## Database Statistics + +``` +Sources by Modality: + forum: 981 + mail: 665 + text: 165 + comic: 115 + doc: 102 + book: 78 + observation: 26 + note: 3 + photo: 2 + blog: 1 + +Chunks by Collection: + forum: 8786 + text: 1843 + mail: 1418 + doc: 312 + book: 156 + semantic: 84 + comic: 49 + temporal: 26 + blog: 7 + photo: 2 + +Vectors in Qdrant: + forum: 8778 + mail: 2756 (1338 orphaned!) + text: 505 (1338 missing!) + doc: 312 + book: 156 + semantic: 84 + comic: 49 + temporal: 26 + blog: 7 + photo: 2 + +Embed Status: + STORED: 2056 + FAILED: 81 + RAW: 1 +``` + +--- + +## Next Steps + +1. **Immediate:** Fix path traversal vulnerabilities (security critical) +2. **Immediate:** Fix registration always enabled (security critical) +3. **Immediate:** Remove API key from logs (security critical) +4. **This Week:** Fix collection mismatch for 1,338 items +5. **This Week:** Fix BM25 filter application +6. **This Sprint:** Add retry logic to all tasks +7. **This Sprint:** Add resource limits to Docker services +8. **This Sprint:** Fix score aggregation