mirror of
https://github.com/mruwnik/memory.git
synced 2026-01-02 09:12:58 +01:00
Documents 100+ issues found across: - Data layer (10 issues) - Content processing (12 issues) - Search system (14 issues) - API layer (12 issues) - Worker tasks (20 issues) - Infrastructure (12 issues) - Code quality (20+ issues) Includes database statistics and improvement suggestions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
14 KiB
14 KiB
Memory System Investigation
Investigation Status
- Started: 2025-12-19
- Status: Complete
- Total Issues Found: 100+
Executive Summary
This investigation identified 100+ issues across 7 areas of the memory system. The most critical findings are:
- Security vulnerabilities (path traversal, CORS, API key logging)
- Data integrity issues (1,338 items unsearchable due to collection mismatch)
- Search system bugs (BM25 filters ignored, broken score aggregation)
- Worker reliability (no retries, silent failures, race conditions)
- Code quality concerns (bare exceptions, type safety gaps)
Critical Bugs (Immediate Action Required)
BUG-001: Path Traversal Vulnerabilities
- Severity: CRITICAL
- Area: API Security
- Files:
src/memory/api/app.py:54-64-/files/{path}endpointsrc/memory/api/MCP/memory.py:355-365-fetch_filetoolsrc/memory/api/MCP/memory.py:335-352-note_filestool
- Description: No validation that requested files are within allowed directories
- Impact: Arbitrary file read on server filesystem
- Fix: Add path resolution validation with
.resolve()and prefix check
BUG-002: Collection Mismatch (1,338 items)
- Severity: CRITICAL
- Area: Data/Embedding Pipeline
- Description: Mail items have chunks with
collection_name='text'but vectors stored in Qdrant'smailcollection - Impact: Items completely unsearchable
- Evidence: 1,338 orphaned vectors in mail, 1,338 missing in text
- Fix: Re-sync vectors or update chunk collection_name
BUG-003: BM25 Filters Completely Ignored
- Severity: CRITICAL
- Area: Search System
- File:
src/memory/api/search/bm25.py:32-43 - Description: BM25 search ignores tags, dates, size filters - only applies source_ids
- Impact: Filter results diverge between BM25 and vector search
- Fix: Apply all filters consistently in BM25 search
BUG-004: Search Score Aggregation Broken
- Severity: CRITICAL
- Area: Search System
- File:
src/memory/api/search/types.py:44-45 - Description: Scores are summed across chunks instead of averaged
- Impact: Documents with more chunks always rank higher regardless of relevance
- Fix: Change to mean() or max-based ranking
BUG-005: Registration Always Enabled
- Severity: CRITICAL
- Area: Configuration/Security
- File:
src/memory/common/settings.py:178 - Description: Logic error:
REGISTER_ENABLED = boolean_env(...) or Truealways evaluates to True - Impact: Open registration regardless of configuration
- Fix: Remove
or True
BUG-006: API Key Logged in Plain Text
- Severity: CRITICAL
- Area: Security
- File:
src/memory/discord/api.py:63 - Description: Bot API key logged in error message
- Impact: Credentials exposed in logs
- Fix: Remove API key from log message
High Severity Bugs
BUG-007: Wrong Object Appended in break_chunk()
- File:
src/memory/common/embedding.py:57 - Description: Appends entire
chunkobject instead of individual itemc - Impact: Circular references, type mismatches, embedding failures
BUG-008: Oversized Chunks Exceed Token Limits
- File:
src/memory/common/chunker.py:109-112 - Description: When overlap <= 0, chunks yielded without size validation
- Impact: 483 chunks >10K chars (should be ~2K)
BUG-009: Scheduled Call Race Condition
- File:
src/memory/workers/tasks/scheduled_calls.py:145-163 - Description: No DB lock when querying due calls - multiple workers can execute same call
- Impact: Duplicate LLM calls and Discord messages
BUG-010: Missing MCP Servers Relationship
- File:
src/memory/common/db/models/discord.py:74-76 - Description:
self.mcp_serversreferenced into_xml()but no relationship defined - Impact: Runtime AttributeError
BUG-011: User ID Type Mismatch
- Files:
users.py:47,scheduled_calls.py:23 - Description:
ScheduledLLMCall.user_idis BigInteger butUser.idis Integer - Impact: Foreign key constraint violations
BUG-012: Inverted Min Score Thresholds
- File:
src/memory/api/search/embeddings.py:186-207 - Description: Multimodal uses 0.25, text uses 0.4 - should be reversed
- Impact: Multimodal results artificially boosted
BUG-013: No Error Handling in Embedding Pipeline
- File:
src/memory/common/embedding.py - Description: No try-except blocks around Voyage AI API calls
- Impact: Entire content processing fails on API error
BUG-014: Unrestricted CORS Configuration
- File:
src/memory/api/app.py:36-42 - Description:
allow_origins=["*"]withallow_credentials=True - Impact: CSRF attacks enabled
BUG-015: Missing Retry Configuration
- Files: All task files
- Description: No
autoretry_for,max_retrieson any Celery tasks - Impact: Transient failures lost without retry
BUG-016: Silent Task Failures
- File:
src/memory/workers/tasks/content_processing.py:258-296 - Description:
safe_task_executioncatches all exceptions, returns as dict - Impact: Failed tasks can't be retried by Celery
Medium Severity Bugs
Data Layer
- BUG-017: Missing
collection_nameindex on Chunk table (source_item.py:165-168) - BUG-018: AgentObservation dead code for future embedding types (
source_items.py:1005-1028) - BUG-019: Embed status never set to STORED after push (
content_processing.py:125) - BUG-020: Missing server_id index on DiscordMessage (
source_items.py:426-435)
Content Processing
- BUG-021: No chunk validation after break_chunk (
embedding.py:49-58) - BUG-022: Ebook extraction creates single massive chunk (
extract.py:218-230) - BUG-023: SHA256-only deduplication misses semantic duplicates (
source_item.py:51-91) - BUG-024: Email hash inconsistency with markdown conversion (
email.py:171-185) - BUG-025: Token approximation uses fixed 4-char ratio (
tokens.py:8-12)
Search System
- BUG-026: BM25 scores calculated then discarded (
bm25.py:66-70) - BUG-027: LLM score fallback missing - defaults to 0.0 (
scorer.py:55-60) - BUG-028: Missing filter validation (
embeddings.py:130-131) - BUG-029: Hardcoded min_score thresholds (
embeddings.py:186,202)
API Layer
- BUG-030: Missing rate limiting (global)
- BUG-031: No SearchConfig limits - can request millions of results (
types.py:73-78) - BUG-032: No CSRF protection (
auth.py:50-86) - BUG-033: Debug print statements in production (
memory.py:363-370) - BUG-034: Timezone handling issues (
oauth_provider.py:83-87)
Worker Tasks
- BUG-035: No task time limits (global)
- BUG-036: Database integrity errors not properly handled (
discord.py:310-321) - BUG-037: Timezone bug in scheduled calls (
scheduled_calls.py:152-153) - BUG-038: Beat schedule not thread-safe for distributed deployment (
ingest.py:19-56) - BUG-039: Email sync fails entire account on single folder error (
email.py:84-152)
Infrastructure
- BUG-040: Missing resource limits for postgres, redis, qdrant, api (
docker-compose.yaml) - BUG-041: Backup encryption silently disabled if key missing (
settings.py:215-216) - BUG-042: Restore scripts don't validate database integrity (
restore_databases.sh:79) - BUG-043: Health check doesn't check dependencies (
app.py:87-92) - BUG-044: Uvicorn trusts all proxy headers (
docker/api/Dockerfile:63)
Code Quality
- BUG-045: 183 unsafe cast() operations (various files)
- BUG-046: 21 type:ignore comments (various files)
- BUG-047: 32 bare except Exception blocks (various files)
- BUG-048: 13 exception swallowing with pass (various files)
- BUG-049: Missing CSRF in OAuth callback (
auth.py) - BUG-050: SQL injection in test database handling (
tests/conftest.py:94)
Low Severity Bugs
- BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks)
- BUG-052: Garbage content in text collection
- BUG-053: No vector freshness index (
source_item.py:157) - BUG-054: OAuthToken missing Base inheritance (
users.py:215-228) - BUG-055: collection_model returns "unknown" (
collections.py:140) - BUG-056: Unused "appuser" in API Dockerfile (
docker/api/Dockerfile:48) - BUG-057: Build dependencies not cleaned up (
docker/api/Dockerfile:7-12) - BUG-058: Typos in log messages (
tests/conftest.py:63) - BUG-059: MockRedis overly simplistic (
tests/conftest.py:24-46) - BUG-060: Print statement in ebook.py:192
Improvement Suggestions
High Priority
- Implement proper retry logic for all Celery tasks with exponential backoff
- Add comprehensive health checks that validate all service dependencies
- Fix score aggregation to use mean/max instead of sum
- Add rate limiting to prevent DoS attacks
- Implement proper CSRF protection for OAuth flows
- Add resource limits to all Docker services
- Implement centralized logging with ELK or Grafana Loki
Medium Priority
- Re-chunk oversized content - add validation to enforce size limits
- Add chunk deduplication based on content hash within same source
- Preserve BM25 scores for hybrid search weighting
- Add task progress tracking for long-running operations
- Implement distributed beat lock for multi-worker deployments
- Add backup verification tests - periodically test restore
- Replace cast() with type guards throughout codebase
Lower Priority
- Add Prometheus metrics for observability
- Implement structured JSON logging with correlation IDs
- Add graceful shutdown handling to workers
- Document configuration requirements more thoroughly
- Add integration tests for critical workflows
- Remove dead code and TODO comments in production
Feature Ideas
Search Enhancements
- Hybrid score weighting - configurable balance between BM25 and vector
- Query expansion - automatic synonym/related term expansion
- Faceted search - filter by date ranges, sources, tags with counts
- Search result highlighting - show matched terms in context
- Saved searches - store and re-run common queries
Content Management
- Content quality scoring - automatic assessment of chunk quality
- Duplicate detection UI - show and merge semantic duplicates
- Re-indexing queue - prioritize content for re-embedding
- Content archiving - move old content to cold storage
- Bulk operations - tag, delete, re-process multiple items
Email Management
- Email filtering rules - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders)
- Email source classification - auto-detect email types (transactional, marketing, personal, receipts)
- Smart email retention - keep "useful" emails (orders, bookings, confirmations) while filtering noise
User Experience
- Search analytics - track what users search for
- Relevance feedback - let users rate results to improve ranking
- Personal knowledge graph - visualize connections between content
- Smart summaries - auto-generate summaries of search results
- Email digest - scheduled summary of new content
Infrastructure
- Auto-scaling workers - scale based on queue depth
- Multi-tenant support - isolate data by user/org
- Backup scheduling UI - configure backup frequency
- Monitoring dashboard - Grafana-style metrics visualization
- Audit logging - track all data access and modifications
Investigation Log
2025-12-19 - Complete Investigation
Data Layer (10 issues)
- Missing relationships (mcp_servers)
- Type mismatches (User.id)
- Missing indexes (collection_name, server_id)
- Dead code (AgentObservation)
Content Processing (12 issues)
- Critical: break_chunk bug appends wrong object
- Critical: Oversized chunks exceed limits
- Deduplication only on SHA256
- Ebook creates single massive chunk
Search System (14 issues)
- Critical: BM25 ignores filters
- Critical: Score aggregation broken (sum vs mean)
- Inverted min_score thresholds
- BM25 scores discarded
API Layer (12 issues)
- Critical: Path traversal vulnerabilities (3 endpoints)
- CORS misconfiguration
- Missing rate limiting
- Debug print statements
Worker Tasks (20 issues)
- No retry configuration
- Silent task failures
- Race condition in scheduled calls
- No task timeouts
Infrastructure (12 issues)
- Missing resource limits
- Backup encryption issues
- Health check incomplete
- No centralized logging
Code Quality (20+ issues)
- 183 unsafe casts
- 32 bare exception blocks
- Registration always enabled bug
- API key logging
Database Statistics
Sources by Modality:
forum: 981
mail: 665
text: 165
comic: 115
doc: 102
book: 78
observation: 26
note: 3
photo: 2
blog: 1
Chunks by Collection:
forum: 8786
text: 1843
mail: 1418
doc: 312
book: 156
semantic: 84
comic: 49
temporal: 26
blog: 7
photo: 2
Vectors in Qdrant:
forum: 8778
mail: 2756 (1338 orphaned!)
text: 505 (1338 missing!)
doc: 312
book: 156
semantic: 84
comic: 49
temporal: 26
blog: 7
photo: 2
Embed Status:
STORED: 2056
FAILED: 81
RAW: 1
Next Steps
- Immediate: Fix path traversal vulnerabilities (security critical)
- Immediate: Fix registration always enabled (security critical)
- Immediate: Remove API key from logs (security critical)
- This Week: Fix collection mismatch for 1,338 items
- This Week: Fix BM25 filter application
- This Sprint: Add retry logic to all tasks
- This Sprint: Add resource limits to Docker services
- This Sprint: Fix score aggregation