Add investigation findings documentation

Documents 100+ issues found across: - Data layer (10 issues) - Content processing (12 issues) - Search system (14 issues) - API layer (12 issues) - Worker tasks (20 issues) - Infrastructure (12 issues) - Code quality (20+ issues) Includes database statistics and improvement suggestions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-02 17:22:58 +01:00 · 2025-12-19 19:24:17 +01:00 · 2025-12-19 19:24:17 +01:00 · adff8662bb
commit adff8662bb
parent 52274f82a6
1 changed files with 369 additions and 0 deletions
--- a/INVESTIGATION.md
+++ b/INVESTIGATION.md
@ -0,0 +1,369 @@
+# Memory System Investigation
+
+## Investigation Status
+- **Started:** 2025-12-19
+- **Status:** Complete
+- **Total Issues Found:** 100+
+
+---
+
+## Executive Summary
+
+This investigation identified **100+ issues** across 7 areas of the memory system. The most critical findings are:
+
+1. **Security vulnerabilities** (path traversal, CORS, API key logging)
+2. **Data integrity issues** (1,338 items unsearchable due to collection mismatch)
+3. **Search system bugs** (BM25 filters ignored, broken score aggregation)
+4. **Worker reliability** (no retries, silent failures, race conditions)
+5. **Code quality concerns** (bare exceptions, type safety gaps)
+
+---
+
+## Critical Bugs (Immediate Action Required)
+
+### BUG-001: Path Traversal Vulnerabilities
+- **Severity:** CRITICAL
+- **Area:** API Security
+- **Files:**
+  - `src/memory/api/app.py:54-64` - `/files/{path}` endpoint
+  - `src/memory/api/MCP/memory.py:355-365` - `fetch_file` tool
+  - `src/memory/api/MCP/memory.py:335-352` - `note_files` tool
+- **Description:** No validation that requested files are within allowed directories
+- **Impact:** Arbitrary file read on server filesystem
+- **Fix:** Add path resolution validation with `.resolve()` and prefix check
+
+### BUG-002: Collection Mismatch (1,338 items)
+- **Severity:** CRITICAL
+- **Area:** Data/Embedding Pipeline
+- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection
+- **Impact:** Items completely unsearchable
+- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text
+- **Fix:** Re-sync vectors or update chunk collection_name
+
+### BUG-003: BM25 Filters Completely Ignored
+- **Severity:** CRITICAL
+- **Area:** Search System
+- **File:** `src/memory/api/search/bm25.py:32-43`
+- **Description:** BM25 search ignores tags, dates, size filters - only applies source_ids
+- **Impact:** Filter results diverge between BM25 and vector search
+- **Fix:** Apply all filters consistently in BM25 search
+
+### BUG-004: Search Score Aggregation Broken
+- **Severity:** CRITICAL
+- **Area:** Search System
+- **File:** `src/memory/api/search/types.py:44-45`
+- **Description:** Scores are summed across chunks instead of averaged
+- **Impact:** Documents with more chunks always rank higher regardless of relevance
+- **Fix:** Change to mean() or max-based ranking
+
+### BUG-005: Registration Always Enabled
+- **Severity:** CRITICAL
+- **Area:** Configuration/Security
+- **File:** `src/memory/common/settings.py:178`
+- **Description:** Logic error: `REGISTER_ENABLED = boolean_env(...) or True` always evaluates to True
+- **Impact:** Open registration regardless of configuration
+- **Fix:** Remove `or True`
+
+### BUG-006: API Key Logged in Plain Text
+- **Severity:** CRITICAL
+- **Area:** Security
+- **File:** `src/memory/discord/api.py:63`
+- **Description:** Bot API key logged in error message
+- **Impact:** Credentials exposed in logs
+- **Fix:** Remove API key from log message
+
+---
+
+## High Severity Bugs
+
+### BUG-007: Wrong Object Appended in break_chunk()
+- **File:** `src/memory/common/embedding.py:57`
+- **Description:** Appends entire `chunk` object instead of individual item `c`
+- **Impact:** Circular references, type mismatches, embedding failures
+
+### BUG-008: Oversized Chunks Exceed Token Limits
+- **File:** `src/memory/common/chunker.py:109-112`
+- **Description:** When overlap <= 0, chunks yielded without size validation
+- **Impact:** 483 chunks >10K chars (should be ~2K)
+
+### BUG-009: Scheduled Call Race Condition
+- **File:** `src/memory/workers/tasks/scheduled_calls.py:145-163`
+- **Description:** No DB lock when querying due calls - multiple workers can execute same call
+- **Impact:** Duplicate LLM calls and Discord messages
+
+### BUG-010: Missing MCP Servers Relationship
+- **File:** `src/memory/common/db/models/discord.py:74-76`
+- **Description:** `self.mcp_servers` referenced in `to_xml()` but no relationship defined
+- **Impact:** Runtime AttributeError
+
+### BUG-011: User ID Type Mismatch
+- **Files:** `users.py:47`, `scheduled_calls.py:23`
+- **Description:** `ScheduledLLMCall.user_id` is BigInteger but `User.id` is Integer
+- **Impact:** Foreign key constraint violations
+
+### BUG-012: Inverted Min Score Thresholds
+- **File:** `src/memory/api/search/embeddings.py:186-207`
+- **Description:** Multimodal uses 0.25, text uses 0.4 - should be reversed
+- **Impact:** Multimodal results artificially boosted
+
+### BUG-013: No Error Handling in Embedding Pipeline
+- **File:** `src/memory/common/embedding.py`
+- **Description:** No try-except blocks around Voyage AI API calls
+- **Impact:** Entire content processing fails on API error
+
+### BUG-014: Unrestricted CORS Configuration
+- **File:** `src/memory/api/app.py:36-42`
+- **Description:** `allow_origins=["*"]` with `allow_credentials=True`
+- **Impact:** CSRF attacks enabled
+
+### BUG-015: Missing Retry Configuration
+- **Files:** All task files
+- **Description:** No `autoretry_for`, `max_retries` on any Celery tasks
+- **Impact:** Transient failures lost without retry
+
+### BUG-016: Silent Task Failures
+- **File:** `src/memory/workers/tasks/content_processing.py:258-296`
+- **Description:** `safe_task_execution` catches all exceptions, returns as dict
+- **Impact:** Failed tasks can't be retried by Celery
+
+---
+
+## Medium Severity Bugs
+
+### Data Layer
+- BUG-017: Missing `collection_name` index on Chunk table (`source_item.py:165-168`)
+- BUG-018: AgentObservation dead code for future embedding types (`source_items.py:1005-1028`)
+- BUG-019: Embed status never set to STORED after push (`content_processing.py:125`)
+- BUG-020: Missing server_id index on DiscordMessage (`source_items.py:426-435`)
+
+### Content Processing
+- BUG-021: No chunk validation after break_chunk (`embedding.py:49-58`)
+- BUG-022: Ebook extraction creates single massive chunk (`extract.py:218-230`)
+- BUG-023: SHA256-only deduplication misses semantic duplicates (`source_item.py:51-91`)
+- BUG-024: Email hash inconsistency with markdown conversion (`email.py:171-185`)
+- BUG-025: Token approximation uses fixed 4-char ratio (`tokens.py:8-12`)
+
+### Search System
+- BUG-026: BM25 scores calculated then discarded (`bm25.py:66-70`)
+- BUG-027: LLM score fallback missing - defaults to 0.0 (`scorer.py:55-60`)
+- BUG-028: Missing filter validation (`embeddings.py:130-131`)
+- BUG-029: Hardcoded min_score thresholds (`embeddings.py:186,202`)
+
+### API Layer
+- BUG-030: Missing rate limiting (global)
+- BUG-031: No SearchConfig limits - can request millions of results (`types.py:73-78`)
+- BUG-032: No CSRF protection (`auth.py:50-86`)
+- BUG-033: Debug print statements in production (`memory.py:363-370`)
+- BUG-034: Timezone handling issues (`oauth_provider.py:83-87`)
+
+### Worker Tasks
+- BUG-035: No task time limits (global)
+- BUG-036: Database integrity errors not properly handled (`discord.py:310-321`)
+- BUG-037: Timezone bug in scheduled calls (`scheduled_calls.py:152-153`)
+- BUG-038: Beat schedule not thread-safe for distributed deployment (`ingest.py:19-56`)
+- BUG-039: Email sync fails entire account on single folder error (`email.py:84-152`)
+
+### Infrastructure
+- BUG-040: Missing resource limits for postgres, redis, qdrant, api (`docker-compose.yaml`)
+- BUG-041: Backup encryption silently disabled if key missing (`settings.py:215-216`)
+- BUG-042: Restore scripts don't validate database integrity (`restore_databases.sh:79`)
+- BUG-043: Health check doesn't check dependencies (`app.py:87-92`)
+- BUG-044: Uvicorn trusts all proxy headers (`docker/api/Dockerfile:63`)
+
+### Code Quality
+- BUG-045: 183 unsafe cast() operations (various files)
+- BUG-046: 21 type:ignore comments (various files)
+- BUG-047: 32 bare except Exception blocks (various files)
+- BUG-048: 13 exception swallowing with pass (various files)
+- BUG-049: Missing CSRF in OAuth callback (`auth.py`)
+- BUG-050: SQL injection in test database handling (`tests/conftest.py:94`)
+
+---
+
+## Low Severity Bugs
+
+- BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks)
+- BUG-052: Garbage content in text collection
+- BUG-053: No vector freshness index (`source_item.py:157`)
+- BUG-054: OAuthToken missing Base inheritance (`users.py:215-228`)
+- BUG-055: collection_model returns "unknown" (`collections.py:140`)
+- BUG-056: Unused "appuser" in API Dockerfile (`docker/api/Dockerfile:48`)
+- BUG-057: Build dependencies not cleaned up (`docker/api/Dockerfile:7-12`)
+- BUG-058: Typos in log messages (`tests/conftest.py:63`)
+- BUG-059: MockRedis overly simplistic (`tests/conftest.py:24-46`)
+- BUG-060: Print statement in ebook.py:192
+
+---
+
+## Improvement Suggestions
+
+### High Priority
+1. **Implement proper retry logic** for all Celery tasks with exponential backoff
+2. **Add comprehensive health checks** that validate all service dependencies
+3. **Fix score aggregation** to use mean/max instead of sum
+4. **Add rate limiting** to prevent DoS attacks
+5. **Implement proper CSRF protection** for OAuth flows
+6. **Add resource limits** to all Docker services
+7. **Implement centralized logging** with ELK or Grafana Loki
+
+### Medium Priority
+1. **Re-chunk oversized content** - add validation to enforce size limits
+2. **Add chunk deduplication** based on content hash within same source
+3. **Preserve BM25 scores** for hybrid search weighting
+4. **Add task progress tracking** for long-running operations
+5. **Implement distributed beat lock** for multi-worker deployments
+6. **Add backup verification tests** - periodically test restore
+7. **Replace cast() with type guards** throughout codebase
+
+### Lower Priority
+1. **Add Prometheus metrics** for observability
+2. **Implement structured JSON logging** with correlation IDs
+3. **Add graceful shutdown handling** to workers
+4. **Document configuration requirements** more thoroughly
+5. **Add integration tests** for critical workflows
+6. **Remove dead code** and TODO comments in production
+
+---
+
+## Feature Ideas
+
+### Search Enhancements
+1. **Hybrid score weighting** - configurable balance between BM25 and vector
+2. **Query expansion** - automatic synonym/related term expansion
+3. **Faceted search** - filter by date ranges, sources, tags with counts
+4. **Search result highlighting** - show matched terms in context
+5. **Saved searches** - store and re-run common queries
+
+### Content Management
+1. **Content quality scoring** - automatic assessment of chunk quality
+2. **Duplicate detection UI** - show and merge semantic duplicates
+3. **Re-indexing queue** - prioritize content for re-embedding
+4. **Content archiving** - move old content to cold storage
+5. **Bulk operations** - tag, delete, re-process multiple items
+
+### Email Management
+1. **Email filtering rules** - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders)
+2. **Email source classification** - auto-detect email types (transactional, marketing, personal, receipts)
+3. **Smart email retention** - keep "useful" emails (orders, bookings, confirmations) while filtering noise
+
+### User Experience
+1. **Search analytics** - track what users search for
+2. **Relevance feedback** - let users rate results to improve ranking
+3. **Personal knowledge graph** - visualize connections between content
+4. **Smart summaries** - auto-generate summaries of search results
+5. **Email digest** - scheduled summary of new content
+
+### Infrastructure
+1. **Auto-scaling workers** - scale based on queue depth
+2. **Multi-tenant support** - isolate data by user/org
+3. **Backup scheduling UI** - configure backup frequency
+4. **Monitoring dashboard** - Grafana-style metrics visualization
+5. **Audit logging** - track all data access and modifications
+
+---
+
+## Investigation Log
+
+### 2025-12-19 - Complete Investigation
+
+**Data Layer (10 issues)**
+- Missing relationships (mcp_servers)
+- Type mismatches (User.id)
+- Missing indexes (collection_name, server_id)
+- Dead code (AgentObservation)
+
+**Content Processing (12 issues)**
+- Critical: break_chunk bug appends wrong object
+- Critical: Oversized chunks exceed limits
+- Deduplication only on SHA256
+- Ebook creates single massive chunk
+
+**Search System (14 issues)**
+- Critical: BM25 ignores filters
+- Critical: Score aggregation broken (sum vs mean)
+- Inverted min_score thresholds
+- BM25 scores discarded
+
+**API Layer (12 issues)**
+- Critical: Path traversal vulnerabilities (3 endpoints)
+- CORS misconfiguration
+- Missing rate limiting
+- Debug print statements
+
+**Worker Tasks (20 issues)**
+- No retry configuration
+- Silent task failures
+- Race condition in scheduled calls
+- No task timeouts
+
+**Infrastructure (12 issues)**
+- Missing resource limits
+- Backup encryption issues
+- Health check incomplete
+- No centralized logging
+
+**Code Quality (20+ issues)**
+- 183 unsafe casts
+- 32 bare exception blocks
+- Registration always enabled bug
+- API key logging
+
+---
+
+## Database Statistics
+
+```
+Sources by Modality:
+  forum: 981
+  mail: 665
+  text: 165
+  comic: 115
+  doc: 102
+  book: 78
+  observation: 26
+  note: 3
+  photo: 2
+  blog: 1
+
+Chunks by Collection:
+  forum: 8786
+  text: 1843
+  mail: 1418
+  doc: 312
+  book: 156
+  semantic: 84
+  comic: 49
+  temporal: 26
+  blog: 7
+  photo: 2
+
+Vectors in Qdrant:
+  forum: 8778
+  mail: 2756 (1338 orphaned!)
+  text: 505 (1338 missing!)
+  doc: 312
+  book: 156
+  semantic: 84
+  comic: 49
+  temporal: 26
+  blog: 7
+  photo: 2
+
+Embed Status:
+  STORED: 2056
+  FAILED: 81
+  RAW: 1
+```
+
+---
+
+## Next Steps
+
+1. **Immediate:** Fix path traversal vulnerabilities (security critical)
+2. **Immediate:** Fix registration always enabled (security critical)
+3. **Immediate:** Remove API key from logs (security critical)
+4. **This Week:** Fix collection mismatch for 1,338 items
+5. **This Week:** Fix BM25 filter application
+6. **This Sprint:** Add retry logic to all tasks
+7. **This Sprint:** Add resource limits to Docker services
+8. **This Sprint:** Fix score aggregation