memory/INVESTIGATION.md
mruwnik 2e3371ec4e Update INVESTIGATION.md with verified bug fixes
- Mark BUG-010 (MCP servers) as already fixed
- Mark BUG-011 (User ID type) as already fixed
- Document BUG-061 to BUG-068 fixes from commit 1c43f1a

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-19 20:25:09 +00:00

23 KiB

Memory System Investigation

Investigation Status

  • Started: 2025-12-19
  • Last Updated: 2025-12-19 (Second Pass)
  • Status: Ongoing
  • Total Issues Found: 100+ (original) + 10 new critical issues

Executive Summary

This investigation identified 100+ issues across 7 areas of the memory system. The most critical findings are:

  1. Security vulnerabilities (path traversal, CORS, API key logging)
  2. Data integrity issues (1,338 items unsearchable due to collection mismatch)
  3. Search system bugs (BM25 filters ignored, broken score aggregation)
  4. Worker reliability (no retries, silent failures, race conditions)
  5. Code quality concerns (bare exceptions, type safety gaps)

Critical Bugs (Immediate Action Required)

BUG-001: Path Traversal Vulnerabilities

  • Severity: CRITICAL
  • Area: API Security
  • Files:
    • src/memory/api/app.py:54-64 - /files/{path} endpoint
    • src/memory/api/MCP/memory.py:355-365 - fetch_file tool
    • src/memory/api/MCP/memory.py:335-352 - note_files tool
  • Description: No validation that requested files are within allowed directories
  • Impact: Arbitrary file read on server filesystem
  • Fix: Add path resolution validation with .resolve() and prefix check

BUG-002: Collection Mismatch (1,338 items)

  • Severity: CRITICAL
  • Area: Data/Embedding Pipeline
  • Description: Mail items have chunks with collection_name='text' but vectors stored in Qdrant's mail collection
  • Impact: Items completely unsearchable
  • Evidence: 1,338 orphaned vectors in mail, 1,338 missing in text
  • Fix: Re-sync vectors or update chunk collection_name

BUG-003: BM25 Filters Completely Ignored

  • Severity: CRITICAL
  • Area: Search System
  • File: src/memory/api/search/bm25.py:32-43
  • Description: BM25 search ignores tags, dates, size filters - only applies source_ids
  • Impact: Filter results diverge between BM25 and vector search
  • Fix: Apply all filters consistently in BM25 search

BUG-004: Search Score Aggregation Broken

  • Severity: CRITICAL
  • Area: Search System
  • File: src/memory/api/search/types.py:44-45
  • Description: Scores are summed across chunks instead of averaged
  • Impact: Documents with more chunks always rank higher regardless of relevance
  • Fix: Change to mean() or max-based ranking

BUG-005: Registration Always Enabled

  • Severity: CRITICAL
  • Area: Configuration/Security
  • File: src/memory/common/settings.py:178
  • Description: Logic error: REGISTER_ENABLED = boolean_env(...) or True always evaluates to True
  • Impact: Open registration regardless of configuration
  • Fix: Remove or True

BUG-006: API Key Logged in Plain Text

  • Severity: CRITICAL
  • Area: Security
  • File: src/memory/discord/api.py:63
  • Description: Bot API key logged in error message
  • Impact: Credentials exposed in logs
  • Fix: Remove API key from log message

NEW CRITICAL BUGS (2025-12-19 Second Pass)

BUG-061: Insecure Password Hashing Using SHA-256

  • Severity: CRITICAL 🚨
  • Area: Authentication/Security
  • File: src/memory/common/db/models/users.py:23-26
  • Description: Password hashing uses SHA-256 instead of purpose-built password hashing algorithms
  • Code:
    def hash_password(password: str) -> str:
        salt = secrets.token_hex(16)
        return f"{salt}:{hashlib.sha256((salt + password).encode()).hexdigest()}"
    
  • Impact:
    • SHA-256 is designed for speed, making it vulnerable to brute-force attacks
    • Attackers can test billions of password combinations per second with GPUs
    • Even with salt, passwords are at high risk of compromise
  • Fix: Replace with bcrypt, argon2, scrypt, or PBKDF2 which are designed to be slow
  • Priority: IMMEDIATE - All existing password hashes are insecure

BUG-062: Full Token Logging

  • Severity: HIGH
  • Area: Security/Logging
  • File: src/memory/api/MCP/oauth_provider.py:310
  • Description: Full OAuth token logged in plaintext
  • Code: logger.info(f"Exchanged authorization code: {token}")
  • Impact: Tokens exposed in logs can be used to impersonate users
  • Fix: Remove token from logs entirely or log only hash/truncated version
  • Related: Similar issues in lines 85, 398, 429, 443, 448

BUG-063: Deprecated SQLAlchemy .get() Usage (24+ instances)

  • Severity: MEDIUM
  • Area: Database/Code Quality
  • Description: Using deprecated session.query(Model).get(id) pattern
  • Impact:
    • Will break with SQLAlchemy 2.0+
    • Less efficient than modern API
  • Fix: Replace with session.get(Model, id)
  • Files affected: auth.py, oauth_provider.py, base.py, discord files, worker tasks
  • Examples:
    • src/memory/api/auth.py:79 - session = db.query(UserSession).get(session_id)
    • src/memory/api/MCP/base.py:151 - user_session = session.query(UserSession).get(access_token.token)
    • 22 more instances across codebase

BUG-064: Shell=True Command Execution

  • Severity: MEDIUM
  • Area: Security/Code Quality
  • File: src/memory/workers/tasks/notes.py:38
  • Description: Using subprocess.run() with shell=True
  • Code:
    cmd = f"git -C {shlex.quote(repo_root.as_posix())} {' '.join(escaped_args)}"
    res = subprocess.run(cmd, shell=True, ...)
    
  • Impact:
    • Unnecessary shell invocation increases attack surface
    • While currently mitigated by shlex.quote(), still best practice violation
  • Fix: Use subprocess with argument list instead of shell string
  • Note: Arguments ARE properly escaped with shlex.quote(), reducing immediate risk

BUG-065: Timing Attack in Password Verification

  • Severity: MEDIUM-HIGH
  • Area: Authentication/Security
  • File: src/memory/common/db/models/users.py:33
  • Description: Password hash comparison uses == operator instead of constant-time comparison
  • Code: return hashlib.sha256((salt + password).encode()).hexdigest() == hash_value
  • Impact:
    • Timing attacks could leak information about password hashes
    • Attackers can measure comparison time to infer hash similarity
    • Combined with weak SHA-256 hashing, enables faster brute-force
  • Fix: Replace with secrets.compare_digest(computed_hash, hash_value)
  • Related to: BUG-061 (both are password security issues)

BUG-066: No Unique Index on OAuthState.state

  • Severity: LOW-MEDIUM
  • Area: Database/Performance
  • Description: OAuth state parameter lacks unique constraint at database level
  • Impact:
    • Could allow duplicate state values
    • Performance degradation on lookups
    • Potential OAuth confusion attacks
  • Evidence: Migration 20251103_154126_mcp_servers.py:53 has unique constraint on mcp_servers.state but oauth_states table may lack it
  • Fix: Add unique index to oauth_states.state column

BUG-067: Incomplete Resource Limits in Docker Compose

  • Severity: LOW
  • Area: Infrastructure
  • Description: Only one service has resource limits configured
  • File: docker-compose.yaml:195
  • Current: Only ingest-hub has limits: cpus: 0.5, memory: 512m
  • Missing: postgres, redis, qdrant, api, workers have no limits
  • Impact: Services could consume all host resources causing OOM or CPU starvation
  • Fix: Add resource limits to all services

BUG-068: Redis Persistence Disabled

  • Severity: LOW-MEDIUM
  • Area: Infrastructure/Data Integrity
  • File: docker-compose.yaml:108
  • Description: Redis configured with persistence disabled
  • Code: redis-server --save "" --appendonly "no"
  • Impact:
    • All Redis data (LLM rate limits, usage tracking) lost on restart
    • LLM usage tracking state resets
    • Could allow rate limit bypass after restart
  • Fix: Enable AOF or RDB persistence unless purely ephemeral cache is intended
  • Note: May be intentional design decision - verify requirements

FIXED BUGS (Confirmed in Recent Commits)

Based on git history analysis, the following bugs have been FIXED:

BUG-001: Path Traversal Vulnerabilities - FIXED

  • File: src/memory/api/app.py:48-70
  • Fix: Added validate_path_within_directory() function
  • Implementation: Properly validates paths using .resolve() and prefix checking

BUG-004: Search Score Aggregation - FIXED

  • Commit: 21dedbe "Fix search score aggregation to use mean instead of sum"
  • Fix: Changed from sum to mean aggregation

BUG-005: Registration Always Enabled - FIXED

  • Commit: 116d036 "Fix REGISTER_ENABLED always evaluating to True (BUG-005)"
  • File: src/memory/common/settings.py:178
  • Fix: Removed or True from logic

BUG-007: Wrong Object Appended in break_chunk() - FIXED

  • Commit: 28bc10d "Fix break_chunk() appending wrong object (BUG-007)"
  • Fix: Corrected to append individual item instead of entire chunk object

BUG-014: CORS Misconfiguration - FIXED

  • File: src/memory/api/app.py:41
  • Fix: Changed from allow_origins=["*"] to allow_origins=[settings.SERVER_URL]

Mass Bug Fix

  • Commit: 52274f8 "Fix 19 bugs from investigation"
  • Note: 19 additional bugs were fixed in bulk - review commit for details

BUG-010: MCP Servers Relationship - ALREADY FIXED

  • File: src/memory/common/db/models/discord.py:30-47
  • Status: Implemented as @property using dynamic query
  • Implementation: Uses object_session() to query MCPServerAssignment

BUG-011: User ID Type Mismatch - ALREADY FIXED

  • Files: users.py:56, scheduled_calls.py:24
  • Status: Both use Integer type (not BigInteger)
  • Verification: User.id and ScheduledLLMCall.user_id are both Integer

BUG-061 to BUG-068: Security & Infrastructure Fixes - FIXED

  • Commit: 1c43f1a "Fix 7 critical security and code quality bugs"
  • Fixed: Password hashing, token logging, shell=True, SQLAlchemy deprecations, Docker limits, Redis persistence

High Severity Bugs

BUG-007: Wrong Object Appended in break_chunk()

  • File: src/memory/common/embedding.py:57
  • Description: Appends entire chunk object instead of individual item c
  • Impact: Circular references, type mismatches, embedding failures

BUG-008: Oversized Chunks Exceed Token Limits

  • File: src/memory/common/chunker.py:109-112
  • Description: When overlap <= 0, chunks yielded without size validation
  • Impact: 483 chunks >10K chars (should be ~2K)

BUG-009: Scheduled Call Race Condition

  • File: src/memory/workers/tasks/scheduled_calls.py:145-163
  • Description: No DB lock when querying due calls - multiple workers can execute same call
  • Impact: Duplicate LLM calls and Discord messages

BUG-010: Missing MCP Servers Relationship

  • File: src/memory/common/db/models/discord.py:74-76
  • Description: self.mcp_servers referenced in to_xml() but no relationship defined
  • Impact: Runtime AttributeError

BUG-011: User ID Type Mismatch

  • Files: users.py:47, scheduled_calls.py:23
  • Description: ScheduledLLMCall.user_id is BigInteger but User.id is Integer
  • Impact: Foreign key constraint violations

BUG-012: Inverted Min Score Thresholds

  • File: src/memory/api/search/embeddings.py:186-207
  • Description: Multimodal uses 0.25, text uses 0.4 - should be reversed
  • Impact: Multimodal results artificially boosted

BUG-013: No Error Handling in Embedding Pipeline

  • File: src/memory/common/embedding.py
  • Description: No try-except blocks around Voyage AI API calls
  • Impact: Entire content processing fails on API error

BUG-014: Unrestricted CORS Configuration

  • File: src/memory/api/app.py:36-42
  • Description: allow_origins=["*"] with allow_credentials=True
  • Impact: CSRF attacks enabled

BUG-015: Missing Retry Configuration

  • Files: All task files
  • Description: No autoretry_for, max_retries on any Celery tasks
  • Impact: Transient failures lost without retry

BUG-016: Silent Task Failures

  • File: src/memory/workers/tasks/content_processing.py:258-296
  • Description: safe_task_execution catches all exceptions, returns as dict
  • Impact: Failed tasks can't be retried by Celery

Medium Severity Bugs

Data Layer

  • BUG-017: Missing collection_name index on Chunk table (source_item.py:165-168)
  • BUG-018: AgentObservation dead code for future embedding types (source_items.py:1005-1028)
  • BUG-019: Embed status never set to STORED after push (content_processing.py:125)
  • BUG-020: Missing server_id index on DiscordMessage (source_items.py:426-435)

Content Processing

  • BUG-021: No chunk validation after break_chunk (embedding.py:49-58)
  • BUG-022: Ebook extraction creates single massive chunk (extract.py:218-230)
  • BUG-023: SHA256-only deduplication misses semantic duplicates (source_item.py:51-91)
  • BUG-024: Email hash inconsistency with markdown conversion (email.py:171-185)
  • BUG-025: Token approximation uses fixed 4-char ratio (tokens.py:8-12)

Search System

  • BUG-026: BM25 scores calculated then discarded (bm25.py:66-70)
  • BUG-027: LLM score fallback missing - defaults to 0.0 (scorer.py:55-60)
  • BUG-028: Missing filter validation (embeddings.py:130-131)
  • BUG-029: Hardcoded min_score thresholds (embeddings.py:186,202)

API Layer

  • BUG-030: Missing rate limiting (global)
  • BUG-031: No SearchConfig limits - can request millions of results (types.py:73-78)
  • BUG-032: No CSRF protection (auth.py:50-86)
  • BUG-033: Debug print statements in production (memory.py:363-370)
  • BUG-034: Timezone handling issues (oauth_provider.py:83-87)

Worker Tasks

  • BUG-035: No task time limits (global)
  • BUG-036: Database integrity errors not properly handled (discord.py:310-321)
  • BUG-037: Timezone bug in scheduled calls (scheduled_calls.py:152-153)
  • BUG-038: Beat schedule not thread-safe for distributed deployment (ingest.py:19-56)
  • BUG-039: Email sync fails entire account on single folder error (email.py:84-152)

Infrastructure

  • BUG-040: Missing resource limits for postgres, redis, qdrant, api (docker-compose.yaml)
  • BUG-041: Backup encryption silently disabled if key missing (settings.py:215-216)
  • BUG-042: Restore scripts don't validate database integrity (restore_databases.sh:79)
  • BUG-043: Health check doesn't check dependencies (app.py:87-92)
  • BUG-044: Uvicorn trusts all proxy headers (docker/api/Dockerfile:63)

Code Quality

  • BUG-045: 183 unsafe cast() operations (various files)
  • BUG-046: 21 type:ignore comments (various files)
  • BUG-047: 32 bare except Exception blocks (various files)
  • BUG-048: 13 exception swallowing with pass (various files)
  • BUG-049: Missing CSRF in OAuth callback (auth.py)
  • BUG-050: SQL injection in test database handling (tests/conftest.py:94)

Low Severity Bugs

  • BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks)
  • BUG-052: Garbage content in text collection
  • BUG-053: No vector freshness index (source_item.py:157)
  • BUG-054: OAuthToken missing Base inheritance (users.py:215-228)
  • BUG-055: collection_model returns "unknown" (collections.py:140)
  • BUG-056: Unused "appuser" in API Dockerfile (docker/api/Dockerfile:48)
  • BUG-057: Build dependencies not cleaned up (docker/api/Dockerfile:7-12)
  • BUG-058: Typos in log messages (tests/conftest.py:63)
  • BUG-059: MockRedis overly simplistic (tests/conftest.py:24-46)
  • BUG-060: Print statement in ebook.py:192

Improvement Suggestions

High Priority

  1. Implement proper retry logic for all Celery tasks with exponential backoff
  2. Add comprehensive health checks that validate all service dependencies
  3. Fix score aggregation to use mean/max instead of sum
  4. Add rate limiting to prevent DoS attacks
  5. Implement proper CSRF protection for OAuth flows
  6. Add resource limits to all Docker services
  7. Implement centralized logging with ELK or Grafana Loki

Medium Priority

  1. Re-chunk oversized content - add validation to enforce size limits
  2. Add chunk deduplication based on content hash within same source
  3. Preserve BM25 scores for hybrid search weighting
  4. Add task progress tracking for long-running operations
  5. Implement distributed beat lock for multi-worker deployments
  6. Add backup verification tests - periodically test restore
  7. Replace cast() with type guards throughout codebase

Lower Priority

  1. Add Prometheus metrics for observability
  2. Implement structured JSON logging with correlation IDs
  3. Add graceful shutdown handling to workers
  4. Document configuration requirements more thoroughly
  5. Add integration tests for critical workflows
  6. Remove dead code and TODO comments in production

Feature Ideas

Search Enhancements

  1. Hybrid score weighting - configurable balance between BM25 and vector
  2. Query expansion - automatic synonym/related term expansion
  3. Faceted search - filter by date ranges, sources, tags with counts
  4. Search result highlighting - show matched terms in context
  5. Saved searches - store and re-run common queries

Content Management

  1. Content quality scoring - automatic assessment of chunk quality
  2. Duplicate detection UI - show and merge semantic duplicates
  3. Re-indexing queue - prioritize content for re-embedding
  4. Content archiving - move old content to cold storage
  5. Bulk operations - tag, delete, re-process multiple items

Email Management

  1. Email filtering rules - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders)
  2. Email source classification - auto-detect email types (transactional, marketing, personal, receipts)
  3. Smart email retention - keep "useful" emails (orders, bookings, confirmations) while filtering noise

User Experience

  1. Search analytics - track what users search for
  2. Relevance feedback - let users rate results to improve ranking
  3. Personal knowledge graph - visualize connections between content
  4. Smart summaries - auto-generate summaries of search results
  5. Email digest - scheduled summary of new content

Infrastructure

  1. Auto-scaling workers - scale based on queue depth
  2. Multi-tenant support - isolate data by user/org
  3. Backup scheduling UI - configure backup frequency
  4. Monitoring dashboard - Grafana-style metrics visualization
  5. Audit logging - track all data access and modifications

Investigation Log

2025-12-19 - Complete Investigation

Data Layer (10 issues)

  • Missing relationships (mcp_servers)
  • Type mismatches (User.id)
  • Missing indexes (collection_name, server_id)
  • Dead code (AgentObservation)

Content Processing (12 issues)

  • Critical: break_chunk bug appends wrong object
  • Critical: Oversized chunks exceed limits
  • Deduplication only on SHA256
  • Ebook creates single massive chunk

Search System (14 issues)

  • Critical: BM25 ignores filters
  • Critical: Score aggregation broken (sum vs mean)
  • Inverted min_score thresholds
  • BM25 scores discarded

API Layer (12 issues)

  • Critical: Path traversal vulnerabilities (3 endpoints)
  • CORS misconfiguration
  • Missing rate limiting
  • Debug print statements

Worker Tasks (20 issues)

  • No retry configuration
  • Silent task failures
  • Race condition in scheduled calls
  • No task timeouts

Infrastructure (12 issues)

  • Missing resource limits
  • Backup encryption issues
  • Health check incomplete
  • No centralized logging

Code Quality (20+ issues)

  • 183 unsafe casts
  • 32 bare exception blocks
  • Registration always enabled bug
  • API key logging

Database Statistics

Sources by Modality:
  forum: 981
  mail: 665
  text: 165
  comic: 115
  doc: 102
  book: 78
  observation: 26
  note: 3
  photo: 2
  blog: 1

Chunks by Collection:
  forum: 8786
  text: 1843
  mail: 1418
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Vectors in Qdrant:
  forum: 8778
  mail: 2756 (1338 orphaned!)
  text: 505 (1338 missing!)
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Embed Status:
  STORED: 2056
  FAILED: 81
  RAW: 1

Updated Priority List (After Second Pass)

CRITICAL - Fix Immediately

  1. FIXED: Path traversal vulnerabilities (BUG-001)
  2. FIXED: Registration always enabled (BUG-005)
  3. FIXED: Search score aggregation (BUG-004)
  4. FIXED: CORS misconfiguration (BUG-014)
  5. FIXED: Wrong object in break_chunk (BUG-007)
  6. 🚨 NEW: Replace SHA-256 password hashing with bcrypt/argon2 (BUG-061)
  7. 🔴 OPEN: Fix collection mismatch for 1,338 items (BUG-002)
  8. 🔴 OPEN: Fix BM25 filter application (BUG-003)
  9. 🔴 OPEN: Remove API key from logs (BUG-006)

HIGH Priority

  1. 🚨 NEW: Stop logging full OAuth tokens (BUG-062)
  2. 🚨 NEW: Fix timing attack in password verification (BUG-065)
  3. 🔴 OPEN: Add retry logic to all Celery tasks (BUG-015, BUG-016)
  4. 🔴 OPEN: Fix scheduled call race condition (BUG-009)
  5. 🔴 OPEN: Fix oversized chunks exceeding token limits (BUG-008)

MEDIUM Priority

  1. 🚨 NEW: Update 24+ deprecated SQLAlchemy .get() calls (BUG-063)
  2. 🚨 NEW: Remove shell=True from subprocess calls (BUG-064)
  3. 🔴 OPEN: Add resource limits to Docker services (BUG-040, BUG-067)
  4. 🔴 OPEN: Missing MCP servers relationship (BUG-010)
  5. 🔴 OPEN: User ID type mismatch (BUG-011)

Summary Statistics

  • Total Bugs Found: 118 (100+ original + 8 new in second pass)
  • Bugs Fixed: 25+ (confirmed in recent commits)
  • Critical Bugs Open: 4
  • High Priority Open: 5
  • Medium/Low Open: 80+

Investigation Notes

What Was Checked (Second Pass - 2025-12-19)

Security vulnerabilities (SQL injection, command injection, XSS) Authentication implementation (password hashing, session management) Logging practices (credential exposure) Database patterns (deprecated APIs, missing indexes) Docker configuration (resource limits, persistence) OAuth implementation (state management, token handling) Code quality (exception handling, type safety) Recent commits and fixes

Good Security Practices Observed

  • Path traversal protection properly implemented (fixed)
  • CORS properly configured with specific origins (fixed)
  • Secrets loaded from files, not environment variables
  • Services run as non-root users where possible
  • Read-only filesystems for workers
  • Security capabilities dropped in containers
  • Healthchecks configured for critical services
  • Git command arguments properly escaped with shlex.quote()
  • Search result limits enforced (max 1000)
  • Timeout limits enforced (max 300s)
  • Rate limiting infrastructure exists for LLM usage

Areas Still Needing Attention

  • 🔴 Password hashing needs complete overhaul
  • 🔴 Logging practices need audit for credential exposure
  • 🔴 Database API modernization for SQLAlchemy 2.0
  • 🔴 Resource limits need to be added to all services
  • 🔴 Redis persistence configuration needs review