memory/INVESTIGATION.md
Daniel O'Connell adff8662bb Add investigation findings documentation
Documents 100+ issues found across:
- Data layer (10 issues)
- Content processing (12 issues)
- Search system (14 issues)
- API layer (12 issues)
- Worker tasks (20 issues)
- Infrastructure (12 issues)
- Code quality (20+ issues)

Includes database statistics and improvement suggestions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 19:24:17 +01:00

14 KiB

Memory System Investigation

Investigation Status

  • Started: 2025-12-19
  • Status: Complete
  • Total Issues Found: 100+

Executive Summary

This investigation identified 100+ issues across 7 areas of the memory system. The most critical findings are:

  1. Security vulnerabilities (path traversal, CORS, API key logging)
  2. Data integrity issues (1,338 items unsearchable due to collection mismatch)
  3. Search system bugs (BM25 filters ignored, broken score aggregation)
  4. Worker reliability (no retries, silent failures, race conditions)
  5. Code quality concerns (bare exceptions, type safety gaps)

Critical Bugs (Immediate Action Required)

BUG-001: Path Traversal Vulnerabilities

  • Severity: CRITICAL
  • Area: API Security
  • Files:
    • src/memory/api/app.py:54-64 - /files/{path} endpoint
    • src/memory/api/MCP/memory.py:355-365 - fetch_file tool
    • src/memory/api/MCP/memory.py:335-352 - note_files tool
  • Description: No validation that requested files are within allowed directories
  • Impact: Arbitrary file read on server filesystem
  • Fix: Add path resolution validation with .resolve() and prefix check

BUG-002: Collection Mismatch (1,338 items)

  • Severity: CRITICAL
  • Area: Data/Embedding Pipeline
  • Description: Mail items have chunks with collection_name='text' but vectors stored in Qdrant's mail collection
  • Impact: Items completely unsearchable
  • Evidence: 1,338 orphaned vectors in mail, 1,338 missing in text
  • Fix: Re-sync vectors or update chunk collection_name

BUG-003: BM25 Filters Completely Ignored

  • Severity: CRITICAL
  • Area: Search System
  • File: src/memory/api/search/bm25.py:32-43
  • Description: BM25 search ignores tags, dates, size filters - only applies source_ids
  • Impact: Filter results diverge between BM25 and vector search
  • Fix: Apply all filters consistently in BM25 search

BUG-004: Search Score Aggregation Broken

  • Severity: CRITICAL
  • Area: Search System
  • File: src/memory/api/search/types.py:44-45
  • Description: Scores are summed across chunks instead of averaged
  • Impact: Documents with more chunks always rank higher regardless of relevance
  • Fix: Change to mean() or max-based ranking

BUG-005: Registration Always Enabled

  • Severity: CRITICAL
  • Area: Configuration/Security
  • File: src/memory/common/settings.py:178
  • Description: Logic error: REGISTER_ENABLED = boolean_env(...) or True always evaluates to True
  • Impact: Open registration regardless of configuration
  • Fix: Remove or True

BUG-006: API Key Logged in Plain Text

  • Severity: CRITICAL
  • Area: Security
  • File: src/memory/discord/api.py:63
  • Description: Bot API key logged in error message
  • Impact: Credentials exposed in logs
  • Fix: Remove API key from log message

High Severity Bugs

BUG-007: Wrong Object Appended in break_chunk()

  • File: src/memory/common/embedding.py:57
  • Description: Appends entire chunk object instead of individual item c
  • Impact: Circular references, type mismatches, embedding failures

BUG-008: Oversized Chunks Exceed Token Limits

  • File: src/memory/common/chunker.py:109-112
  • Description: When overlap <= 0, chunks yielded without size validation
  • Impact: 483 chunks >10K chars (should be ~2K)

BUG-009: Scheduled Call Race Condition

  • File: src/memory/workers/tasks/scheduled_calls.py:145-163
  • Description: No DB lock when querying due calls - multiple workers can execute same call
  • Impact: Duplicate LLM calls and Discord messages

BUG-010: Missing MCP Servers Relationship

  • File: src/memory/common/db/models/discord.py:74-76
  • Description: self.mcp_servers referenced in to_xml() but no relationship defined
  • Impact: Runtime AttributeError

BUG-011: User ID Type Mismatch

  • Files: users.py:47, scheduled_calls.py:23
  • Description: ScheduledLLMCall.user_id is BigInteger but User.id is Integer
  • Impact: Foreign key constraint violations

BUG-012: Inverted Min Score Thresholds

  • File: src/memory/api/search/embeddings.py:186-207
  • Description: Multimodal uses 0.25, text uses 0.4 - should be reversed
  • Impact: Multimodal results artificially boosted

BUG-013: No Error Handling in Embedding Pipeline

  • File: src/memory/common/embedding.py
  • Description: No try-except blocks around Voyage AI API calls
  • Impact: Entire content processing fails on API error

BUG-014: Unrestricted CORS Configuration

  • File: src/memory/api/app.py:36-42
  • Description: allow_origins=["*"] with allow_credentials=True
  • Impact: CSRF attacks enabled

BUG-015: Missing Retry Configuration

  • Files: All task files
  • Description: No autoretry_for, max_retries on any Celery tasks
  • Impact: Transient failures lost without retry

BUG-016: Silent Task Failures

  • File: src/memory/workers/tasks/content_processing.py:258-296
  • Description: safe_task_execution catches all exceptions, returns as dict
  • Impact: Failed tasks can't be retried by Celery

Medium Severity Bugs

Data Layer

  • BUG-017: Missing collection_name index on Chunk table (source_item.py:165-168)
  • BUG-018: AgentObservation dead code for future embedding types (source_items.py:1005-1028)
  • BUG-019: Embed status never set to STORED after push (content_processing.py:125)
  • BUG-020: Missing server_id index on DiscordMessage (source_items.py:426-435)

Content Processing

  • BUG-021: No chunk validation after break_chunk (embedding.py:49-58)
  • BUG-022: Ebook extraction creates single massive chunk (extract.py:218-230)
  • BUG-023: SHA256-only deduplication misses semantic duplicates (source_item.py:51-91)
  • BUG-024: Email hash inconsistency with markdown conversion (email.py:171-185)
  • BUG-025: Token approximation uses fixed 4-char ratio (tokens.py:8-12)

Search System

  • BUG-026: BM25 scores calculated then discarded (bm25.py:66-70)
  • BUG-027: LLM score fallback missing - defaults to 0.0 (scorer.py:55-60)
  • BUG-028: Missing filter validation (embeddings.py:130-131)
  • BUG-029: Hardcoded min_score thresholds (embeddings.py:186,202)

API Layer

  • BUG-030: Missing rate limiting (global)
  • BUG-031: No SearchConfig limits - can request millions of results (types.py:73-78)
  • BUG-032: No CSRF protection (auth.py:50-86)
  • BUG-033: Debug print statements in production (memory.py:363-370)
  • BUG-034: Timezone handling issues (oauth_provider.py:83-87)

Worker Tasks

  • BUG-035: No task time limits (global)
  • BUG-036: Database integrity errors not properly handled (discord.py:310-321)
  • BUG-037: Timezone bug in scheduled calls (scheduled_calls.py:152-153)
  • BUG-038: Beat schedule not thread-safe for distributed deployment (ingest.py:19-56)
  • BUG-039: Email sync fails entire account on single folder error (email.py:84-152)

Infrastructure

  • BUG-040: Missing resource limits for postgres, redis, qdrant, api (docker-compose.yaml)
  • BUG-041: Backup encryption silently disabled if key missing (settings.py:215-216)
  • BUG-042: Restore scripts don't validate database integrity (restore_databases.sh:79)
  • BUG-043: Health check doesn't check dependencies (app.py:87-92)
  • BUG-044: Uvicorn trusts all proxy headers (docker/api/Dockerfile:63)

Code Quality

  • BUG-045: 183 unsafe cast() operations (various files)
  • BUG-046: 21 type:ignore comments (various files)
  • BUG-047: 32 bare except Exception blocks (various files)
  • BUG-048: 13 exception swallowing with pass (various files)
  • BUG-049: Missing CSRF in OAuth callback (auth.py)
  • BUG-050: SQL injection in test database handling (tests/conftest.py:94)

Low Severity Bugs

  • BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks)
  • BUG-052: Garbage content in text collection
  • BUG-053: No vector freshness index (source_item.py:157)
  • BUG-054: OAuthToken missing Base inheritance (users.py:215-228)
  • BUG-055: collection_model returns "unknown" (collections.py:140)
  • BUG-056: Unused "appuser" in API Dockerfile (docker/api/Dockerfile:48)
  • BUG-057: Build dependencies not cleaned up (docker/api/Dockerfile:7-12)
  • BUG-058: Typos in log messages (tests/conftest.py:63)
  • BUG-059: MockRedis overly simplistic (tests/conftest.py:24-46)
  • BUG-060: Print statement in ebook.py:192

Improvement Suggestions

High Priority

  1. Implement proper retry logic for all Celery tasks with exponential backoff
  2. Add comprehensive health checks that validate all service dependencies
  3. Fix score aggregation to use mean/max instead of sum
  4. Add rate limiting to prevent DoS attacks
  5. Implement proper CSRF protection for OAuth flows
  6. Add resource limits to all Docker services
  7. Implement centralized logging with ELK or Grafana Loki

Medium Priority

  1. Re-chunk oversized content - add validation to enforce size limits
  2. Add chunk deduplication based on content hash within same source
  3. Preserve BM25 scores for hybrid search weighting
  4. Add task progress tracking for long-running operations
  5. Implement distributed beat lock for multi-worker deployments
  6. Add backup verification tests - periodically test restore
  7. Replace cast() with type guards throughout codebase

Lower Priority

  1. Add Prometheus metrics for observability
  2. Implement structured JSON logging with correlation IDs
  3. Add graceful shutdown handling to workers
  4. Document configuration requirements more thoroughly
  5. Add integration tests for critical workflows
  6. Remove dead code and TODO comments in production

Feature Ideas

Search Enhancements

  1. Hybrid score weighting - configurable balance between BM25 and vector
  2. Query expansion - automatic synonym/related term expansion
  3. Faceted search - filter by date ranges, sources, tags with counts
  4. Search result highlighting - show matched terms in context
  5. Saved searches - store and re-run common queries

Content Management

  1. Content quality scoring - automatic assessment of chunk quality
  2. Duplicate detection UI - show and merge semantic duplicates
  3. Re-indexing queue - prioritize content for re-embedding
  4. Content archiving - move old content to cold storage
  5. Bulk operations - tag, delete, re-process multiple items

Email Management

  1. Email filtering rules - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders)
  2. Email source classification - auto-detect email types (transactional, marketing, personal, receipts)
  3. Smart email retention - keep "useful" emails (orders, bookings, confirmations) while filtering noise

User Experience

  1. Search analytics - track what users search for
  2. Relevance feedback - let users rate results to improve ranking
  3. Personal knowledge graph - visualize connections between content
  4. Smart summaries - auto-generate summaries of search results
  5. Email digest - scheduled summary of new content

Infrastructure

  1. Auto-scaling workers - scale based on queue depth
  2. Multi-tenant support - isolate data by user/org
  3. Backup scheduling UI - configure backup frequency
  4. Monitoring dashboard - Grafana-style metrics visualization
  5. Audit logging - track all data access and modifications

Investigation Log

2025-12-19 - Complete Investigation

Data Layer (10 issues)

  • Missing relationships (mcp_servers)
  • Type mismatches (User.id)
  • Missing indexes (collection_name, server_id)
  • Dead code (AgentObservation)

Content Processing (12 issues)

  • Critical: break_chunk bug appends wrong object
  • Critical: Oversized chunks exceed limits
  • Deduplication only on SHA256
  • Ebook creates single massive chunk

Search System (14 issues)

  • Critical: BM25 ignores filters
  • Critical: Score aggregation broken (sum vs mean)
  • Inverted min_score thresholds
  • BM25 scores discarded

API Layer (12 issues)

  • Critical: Path traversal vulnerabilities (3 endpoints)
  • CORS misconfiguration
  • Missing rate limiting
  • Debug print statements

Worker Tasks (20 issues)

  • No retry configuration
  • Silent task failures
  • Race condition in scheduled calls
  • No task timeouts

Infrastructure (12 issues)

  • Missing resource limits
  • Backup encryption issues
  • Health check incomplete
  • No centralized logging

Code Quality (20+ issues)

  • 183 unsafe casts
  • 32 bare exception blocks
  • Registration always enabled bug
  • API key logging

Database Statistics

Sources by Modality:
  forum: 981
  mail: 665
  text: 165
  comic: 115
  doc: 102
  book: 78
  observation: 26
  note: 3
  photo: 2
  blog: 1

Chunks by Collection:
  forum: 8786
  text: 1843
  mail: 1418
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Vectors in Qdrant:
  forum: 8778
  mail: 2756 (1338 orphaned!)
  text: 505 (1338 missing!)
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Embed Status:
  STORED: 2056
  FAILED: 81
  RAW: 1

Next Steps

  1. Immediate: Fix path traversal vulnerabilities (security critical)
  2. Immediate: Fix registration always enabled (security critical)
  3. Immediate: Remove API key from logs (security critical)
  4. This Week: Fix collection mismatch for 1,338 items
  5. This Week: Fix BM25 filter application
  6. This Sprint: Add retry logic to all tasks
  7. This Sprint: Add resource limits to Docker services
  8. This Sprint: Fix score aggregation