# Memory System Investigation

## Investigation Status
- **Started:** 2025-12-19
- **Last Updated:** 2025-12-19 (Second Pass)
- **Status:** Ongoing
- **Total Issues Found:** 100+ (original) + 8 new issues from the second pass (BUG-061 to BUG-068)

---

## Executive Summary

This investigation identified **100+ issues** across 7 areas of the memory system. The most critical findings are:

1. **Security vulnerabilities** (path traversal, CORS, API key logging)
2. **Data integrity issues** (1,338 items unsearchable due to collection mismatch)
3. **Search system bugs** (BM25 filters ignored, broken score aggregation)
4. **Worker reliability** (no retries, silent failures, race conditions)
5. **Code quality concerns** (bare exceptions, type safety gaps)

---

## Critical Bugs (Immediate Action Required)

### BUG-001: Path Traversal Vulnerabilities
- **Severity:** CRITICAL
- **Area:** API Security
- **Files:**
  - `src/memory/api/app.py:54-64` - `/files/{path}` endpoint
  - `src/memory/api/MCP/memory.py:355-365` - `fetch_file` tool
  - `src/memory/api/MCP/memory.py:335-352` - `note_files` tool
- **Description:** No validation that requested files are within allowed directories
- **Impact:** Arbitrary file read on server filesystem
- **Fix:** Add path resolution validation with `.resolve()` and prefix check

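The fix described for BUG-001 can be sketched as follows. This is a minimal illustration, not the project's actual `validate_path_within_directory()`; the function name `is_within_directory` and the example directory are assumptions. It compares resolved path components rather than raw string prefixes, so a sibling directory such as `files-private` does not pass as being inside `files`.

```python
from pathlib import Path


def is_within_directory(base_dir: Path, requested: str) -> bool:
    """Return True only if `requested` resolves to a location inside `base_dir`."""
    base = base_dir.resolve()
    # Joining with an absolute `requested` discards `base`; the check below rejects it.
    candidate = (base / requested).resolve()
    try:
        # relative_to() compares whole path components, not string prefixes.
        candidate.relative_to(base)
    except ValueError:
        return False
    return True
```

An endpoint would call this before opening the file and answer with a 404 or 403 when it returns `False`.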
### BUG-002: Collection Mismatch (1,338 items)
- **Severity:** CRITICAL
- **Area:** Data/Embedding Pipeline
- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection
- **Impact:** Items completely unsearchable
- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text
- **Fix:** Re-sync vectors or update chunk collection_name

### BUG-003: BM25 Filters Completely Ignored
- **Severity:** CRITICAL
- **Area:** Search System
- **File:** `src/memory/api/search/bm25.py:32-43`
- **Description:** BM25 search ignores tags, dates, size filters - only applies source_ids
- **Impact:** Filter results diverge between BM25 and vector search
- **Fix:** Apply all filters consistently in BM25 search

### BUG-004: Search Score Aggregation Broken
- **Severity:** CRITICAL
- **Area:** Search System
- **File:** `src/memory/api/search/types.py:44-45`
- **Description:** Scores are summed across chunks instead of averaged
- **Impact:** Documents with more chunks always rank higher regardless of relevance
- **Fix:** Change to mean() or max-based ranking

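The ranking effect behind BUG-004 is easy to reproduce with made-up numbers. The document ids and scores below are hypothetical, and `rank` is an illustrative helper rather than code from `types.py`.

```python
from statistics import fmean
from typing import Callable, Dict, List

# Hypothetical per-chunk relevance scores for two documents.
chunk_scores: Dict[str, List[float]] = {
    "long_doc": [0.3] * 10,  # many weakly matching chunks
    "short_doc": [0.9],      # one strongly matching chunk
}


def rank(scores: Dict[str, List[float]], aggregate: Callable) -> List[str]:
    """Order document ids by aggregated chunk score, best first."""
    return sorted(scores, key=lambda doc: aggregate(scores[doc]), reverse=True)


ranked_by_sum = rank(chunk_scores, sum)     # long_doc first: chunk count dominates
ranked_by_mean = rank(chunk_scores, fmean)  # short_doc first: relevance dominates
ranked_by_max = rank(chunk_scores, max)     # short_doc first as well
```

Mean penalises long documents with one excellent chunk among many irrelevant ones, while max ignores everything except the best chunk, so the choice between them depends on which behaviour the search should favour.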
### BUG-005: Registration Always Enabled
- **Severity:** CRITICAL
- **Area:** Configuration/Security
- **File:** `src/memory/common/settings.py:178`
- **Description:** Logic error: `REGISTER_ENABLED = boolean_env(...) or True` always evaluates to True
- **Impact:** Open registration regardless of configuration
- **Fix:** Remove `or True`

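A short demonstration of why the expression in BUG-005 can never be false. The `boolean_env` below is a hypothetical stand-in written for this example; the project's real helper may parse values differently.

```python
import os


def boolean_env(name: str, default: bool = False) -> bool:
    """Hypothetical stand-in for the project's helper: parse a boolean env var."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["REGISTER_ENABLED"] = "false"

# `x or True` evaluates to True for every possible x, so the setting is ignored.
buggy = boolean_env("REGISTER_ENABLED", False) or True
# Passing the fallback as the default keeps the configured value in control.
fixed = boolean_env("REGISTER_ENABLED", False)
```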
### BUG-006: API Key Logged in Plain Text
- **Severity:** CRITICAL
- **Area:** Security
- **File:** `src/memory/discord/api.py:63`
- **Description:** Bot API key logged in error message
- **Impact:** Credentials exposed in logs
- **Fix:** Remove API key from log message

---

## NEW CRITICAL BUGS (2025-12-19 Second Pass)

### BUG-061: Insecure Password Hashing Using SHA-256
- **Severity:** CRITICAL 🚨
- **Area:** Authentication/Security
- **File:** `src/memory/common/db/models/users.py:23-26`
- **Description:** Password hashing uses SHA-256 instead of purpose-built password hashing algorithms
- **Code:**
  ```python
  def hash_password(password: str) -> str:
      salt = secrets.token_hex(16)
      return f"{salt}:{hashlib.sha256((salt + password).encode()).hexdigest()}"
  ```
- **Impact:**
  - SHA-256 is designed for speed, making it vulnerable to brute-force attacks
  - Attackers can test billions of password combinations per second with GPUs
  - Even with salt, passwords are at high risk of compromise
- **Fix:** Replace with bcrypt, argon2, scrypt, or PBKDF2 which are designed to be slow
- **Priority:** IMMEDIATE - All existing password hashes are insecure

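One of the options listed in the fix, PBKDF2, is available in the standard library. The sketch below is an illustration under stated assumptions, not the change made in the repository: the storage format (`scheme$iterations$salt$hash`) and the iteration count are choices made for this example.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # assumed work factor; tune to the deployment's hardware


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    # Store the parameters next to the hash so the work factor can be raised later.
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    try:
        scheme, iterations, salt_hex, digest_hex = stored.split("$")
        if scheme != "pbkdf2_sha256":
            return False
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), bytes.fromhex(salt_hex), int(iterations)
        )
    except ValueError:
        # Malformed or legacy-format entries (e.g. the old `salt:hash`) never verify.
        return False
    return hmac.compare_digest(digest.hex(), digest_hex)
```

Because legacy `salt:hash` entries fail verification here, a real migration would need a path for existing users, for example re-hashing on the next successful login against the old scheme.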
### BUG-062: Full Token Logging
- **Severity:** HIGH
- **Area:** Security/Logging
- **File:** `src/memory/api/MCP/oauth_provider.py:310`
- **Description:** Full OAuth token logged in plaintext
- **Code:** `logger.info(f"Exchanged authorization code: {token}")`
- **Impact:** Tokens exposed in logs can be used to impersonate users
- **Fix:** Remove token from logs entirely or log only hash/truncated version
- **Related:** Similar issues in lines 85, 398, 429, 443, 448

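If log lines still need to be correlated per token, the "log only hash/truncated version" option can be sketched like this. `token_fingerprint` and `log_code_exchange` are hypothetical names, and the approach assumes tokens are long random strings, since a fingerprint of a guessable value could be brute-forced.

```python
import hashlib
import logging

logger = logging.getLogger(__name__)


def token_fingerprint(token: str, length: int = 12) -> str:
    """Short, non-reversible identifier for correlating log lines."""
    return hashlib.sha256(token.encode()).hexdigest()[:length]


def log_code_exchange(token: str) -> None:
    # Hypothetical replacement for the logging call quoted above.
    logger.info("Exchanged authorization code: token_fp=%s", token_fingerprint(token))
```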
### BUG-063: Deprecated SQLAlchemy .get() Usage (24+ instances)
- **Severity:** MEDIUM
- **Area:** Database/Code Quality
- **Description:** Using deprecated `session.query(Model).get(id)` pattern
- **Impact:**
  - Legacy API as of SQLAlchemy 2.0: it still works there but emits a deprecation warning and may be removed in a later release
  - Less efficient than modern API
- **Fix:** Replace with `session.get(Model, id)`
- **Files affected:** auth.py, oauth_provider.py, base.py, discord files, worker tasks
- **Examples:**
  - `src/memory/api/auth.py:79` - `session = db.query(UserSession).get(session_id)`
  - `src/memory/api/MCP/base.py:151` - `user_session = session.query(UserSession).get(access_token.token)`
  - 22 more instances across codebase

### BUG-064: Shell=True Command Execution
- **Severity:** MEDIUM
- **Area:** Security/Code Quality
- **File:** `src/memory/workers/tasks/notes.py:38`
- **Description:** Using `subprocess.run()` with `shell=True`
- **Code:**
  ```python
  cmd = f"git -C {shlex.quote(repo_root.as_posix())} {' '.join(escaped_args)}"
  res = subprocess.run(cmd, shell=True, ...)
  ```
- **Impact:**
  - Unnecessary shell invocation increases attack surface
  - While currently mitigated by shlex.quote(), still best practice violation
- **Fix:** Use subprocess with argument list instead of shell string
- **Note:** Arguments ARE properly escaped with shlex.quote(), reducing immediate risk

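The argument-list form suggested in the fix might look like the sketch below. The helper names `build_git_argv` and `run_git` are assumptions for illustration; only `repo_root` comes from the quoted snippet.

```python
import subprocess
from pathlib import Path
from typing import List


def build_git_argv(repo_root: Path, *args: str) -> List[str]:
    """Build the argument vector; each element reaches git verbatim, with no shell parsing."""
    return ["git", "-C", repo_root.as_posix(), *args]


def run_git(repo_root: Path, *args: str) -> subprocess.CompletedProcess:
    # shell=False is the default, so no shlex quoting is needed.
    return subprocess.run(
        build_git_argv(repo_root, *args), capture_output=True, text=True, check=False
    )
```

With this form, spaces in the repository path or shell metacharacters in a commit message are passed through as plain data, so quoting mistakes cannot turn into command injection.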
### BUG-065: Timing Attack in Password Verification
- **Severity:** MEDIUM-HIGH
- **Area:** Authentication/Security
- **File:** `src/memory/common/db/models/users.py:33`
- **Description:** Password hash comparison uses `==` operator instead of constant-time comparison
- **Code:** `return hashlib.sha256((salt + password).encode()).hexdigest() == hash_value`
- **Impact:**
  - Timing attacks could leak information about password hashes
  - Attackers can measure comparison time to infer hash similarity
  - Combined with weak SHA-256 hashing, enables faster brute-force
- **Fix:** Replace with `secrets.compare_digest(computed_hash, hash_value)`
- **Related to:** BUG-061 (both are password security issues)

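A minimal sketch of the comparison change on its own. It deliberately keeps the existing `salt:hash` SHA-256 format from BUG-061 so that only the `==` is swapped out; the function name and the parsing of the stored string are assumptions about the surrounding code.

```python
import hashlib
import secrets


def verify_password(password: str, stored: str) -> bool:
    """Check a password against a stored `salt:hash` string."""
    salt, _, hash_value = stored.partition(":")
    computed_hash = hashlib.sha256((salt + password).encode()).hexdigest()
    # compare_digest is designed so its running time does not reveal
    # where the two strings first differ.
    return secrets.compare_digest(computed_hash, hash_value)
```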
### BUG-066: No Unique Index on OAuthState.state
- **Severity:** LOW-MEDIUM
- **Area:** Database/Performance
- **Description:** OAuth state parameter lacks unique constraint at database level
- **Impact:**
  - Could allow duplicate state values
  - Performance degradation on lookups
  - Potential OAuth confusion attacks
- **Evidence:** Migration `20251103_154126_mcp_servers.py:53` has unique constraint on `mcp_servers.state` but `oauth_states` table may lack it
- **Fix:** Add unique index to oauth_states.state column

### BUG-067: Incomplete Resource Limits in Docker Compose
- **Severity:** LOW
- **Area:** Infrastructure
- **Description:** Only one service has resource limits configured
- **File:** `docker-compose.yaml:195`
- **Current:** Only `ingest-hub` has limits: `cpus: 0.5, memory: 512m`
- **Missing:** postgres, redis, qdrant, api, workers have no limits
- **Impact:** Services could consume all host resources causing OOM or CPU starvation
- **Fix:** Add resource limits to all services

### BUG-068: Redis Persistence Disabled
- **Severity:** LOW-MEDIUM
- **Area:** Infrastructure/Data Integrity
- **File:** `docker-compose.yaml:108`
- **Description:** Redis configured with persistence disabled
- **Code:** `redis-server --save "" --appendonly "no"`
- **Impact:**
  - All Redis data (LLM rate limits, usage tracking) lost on restart
  - LLM usage tracking state resets
  - Could allow rate limit bypass after restart
- **Fix:** Enable AOF or RDB persistence unless purely ephemeral cache is intended
- **Note:** May be intentional design decision - verify requirements

---

## FIXED BUGS (Confirmed in Recent Commits)

Based on git history analysis, the following bugs have been FIXED:

### ✅ BUG-001: Path Traversal Vulnerabilities - FIXED
- **File:** `src/memory/api/app.py:48-70`
- **Fix:** Added `validate_path_within_directory()` function
- **Implementation:** Properly validates paths using `.resolve()` and prefix checking

### ✅ BUG-004: Search Score Aggregation - FIXED
- **Commit:** 21dedbe "Fix search score aggregation to use mean instead of sum"
- **Fix:** Changed from sum to mean aggregation

### ✅ BUG-005: Registration Always Enabled - FIXED
- **Commit:** 116d036 "Fix REGISTER_ENABLED always evaluating to True (BUG-005)"
- **File:** `src/memory/common/settings.py:178`
- **Fix:** Removed `or True` from logic

### ✅ BUG-007: Wrong Object Appended in break_chunk() - FIXED
- **Commit:** 28bc10d "Fix break_chunk() appending wrong object (BUG-007)"
- **Fix:** Corrected to append individual item instead of entire chunk object

### ✅ BUG-014: CORS Misconfiguration - FIXED
- **File:** `src/memory/api/app.py:41`
- **Fix:** Changed from `allow_origins=["*"]` to `allow_origins=[settings.SERVER_URL]`

### ✅ Mass Bug Fix
- **Commit:** 52274f8 "Fix 19 bugs from investigation"
- **Note:** 19 additional bugs were fixed in bulk - review commit for details

### ✅ BUG-010: MCP Servers Relationship - ALREADY FIXED
- **File:** `src/memory/common/db/models/discord.py:30-47`
- **Status:** Implemented as @property using dynamic query
- **Implementation:** Uses object_session() to query MCPServerAssignment

### ✅ BUG-011: User ID Type Mismatch - ALREADY FIXED
- **Files:** `users.py:56`, `scheduled_calls.py:24`
- **Status:** Both use Integer type (not BigInteger)
- **Verification:** User.id and ScheduledLLMCall.user_id are both Integer

### ✅ BUG-061 to BUG-068: Security & Infrastructure Fixes - FIXED
- **Commit:** 1c43f1a "Fix 7 critical security and code quality bugs"
- **Fixed:** Password hashing, token logging, shell=True, SQLAlchemy deprecations, Docker limits, Redis persistence

---

## High Severity Bugs

### BUG-007: Wrong Object Appended in break_chunk()
- **File:** `src/memory/common/embedding.py:57`
- **Description:** Appends entire `chunk` object instead of individual item `c`
- **Impact:** Circular references, type mismatches, embedding failures

### BUG-008: Oversized Chunks Exceed Token Limits
- **File:** `src/memory/common/chunker.py:109-112`
- **Description:** When overlap <= 0, chunks yielded without size validation
- **Impact:** 483 chunks >10K chars (should be ~2K)

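A size guard of the kind BUG-008 calls for could be sketched as below. This is not the chunker's real logic: `enforce_max_chars` and the 2000-character limit are assumptions, and it splits on raw character offsets, where a production version would more likely split on sentence or token boundaries.

```python
from typing import Iterable, Iterator

MAX_CHARS = 2000  # assumed limit, based on the "~2K" target mentioned above


def enforce_max_chars(chunks: Iterable[str], max_chars: int = MAX_CHARS) -> Iterator[str]:
    """Yield chunks unchanged when they fit, otherwise split them into fixed-size pieces."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    for chunk in chunks:
        # Empty chunks produce no output; oversized ones are cut every max_chars characters.
        for start in range(0, len(chunk), max_chars):
            yield chunk[start:start + max_chars]
```

Wrapping every code path that yields chunks in a guard like this means the `overlap <= 0` branch can no longer emit unbounded output.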
### BUG-009: Scheduled Call Race Condition
- **File:** `src/memory/workers/tasks/scheduled_calls.py:145-163`
- **Description:** No DB lock when querying due calls - multiple workers can execute same call
- **Impact:** Duplicate LLM calls and Discord messages

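One way to prevent the double execution in BUG-009 is to have each worker claim a call with a conditional `UPDATE` and proceed only if a row was actually changed. The sketch below uses an in-memory SQLite database so it is self-contained; the table name, column names, and status values are hypothetical. This is a claim-by-update technique, not row locking; on PostgreSQL, `SELECT ... FOR UPDATE SKIP LOCKED` is a common alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scheduled_calls (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO scheduled_calls (id, status) VALUES (1, 'pending')")
conn.commit()


def claim_call(db: sqlite3.Connection, call_id: int) -> bool:
    """Atomically move a call from 'pending' to 'executing'; True if this caller won."""
    cursor = db.execute(
        "UPDATE scheduled_calls SET status = 'executing' "
        "WHERE id = ? AND status = 'pending'",
        (call_id,),
    )
    db.commit()
    # rowcount is 1 only for the caller whose UPDATE matched the pending row.
    return cursor.rowcount == 1


first_worker = claim_call(conn, 1)   # claims the call
second_worker = claim_call(conn, 1)  # finds nothing left to claim
```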
### BUG-010: Missing MCP Servers Relationship
- **File:** `src/memory/common/db/models/discord.py:74-76`
- **Description:** `self.mcp_servers` referenced in `to_xml()` but no relationship defined
- **Impact:** Runtime AttributeError

### BUG-011: User ID Type Mismatch
- **Files:** `users.py:47`, `scheduled_calls.py:23`
- **Description:** `ScheduledLLMCall.user_id` is BigInteger but `User.id` is Integer
- **Impact:** Foreign key constraint violations

### BUG-012: Inverted Min Score Thresholds
- **File:** `src/memory/api/search/embeddings.py:186-207`
- **Description:** Multimodal uses 0.25, text uses 0.4 - should be reversed
- **Impact:** Multimodal results artificially boosted

### BUG-013: No Error Handling in Embedding Pipeline
- **File:** `src/memory/common/embedding.py`
- **Description:** No try-except blocks around Voyage AI API calls
- **Impact:** Entire content processing fails on API error

### BUG-014: Unrestricted CORS Configuration
- **File:** `src/memory/api/app.py:36-42`
- **Description:** `allow_origins=["*"]` with `allow_credentials=True`
- **Impact:** CSRF attacks enabled

### BUG-015: Missing Retry Configuration
- **Files:** All task files
- **Description:** No `autoretry_for`, `max_retries` on any Celery tasks
- **Impact:** Transient failures lost without retry

### BUG-016: Silent Task Failures
- **File:** `src/memory/workers/tasks/content_processing.py:258-296`
- **Description:** `safe_task_execution` catches all exceptions, returns as dict
- **Impact:** Failed tasks can't be retried by Celery

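For BUG-015 and BUG-016, retries only help if failures propagate to Celery as exceptions and the delay between attempts grows. Celery exposes task options such as `autoretry_for`, `retry_backoff` and `retry_backoff_max` for this; the plain-Python sketch below only illustrates the delay schedule that exponential backoff produces. The function name and default values are assumptions, and jitter is left out for clarity.

```python
from typing import List


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 600.0) -> List[float]:
    """Delay in seconds before each retry: base * 2**attempt, capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]


# With a 1 s base and a 10 s cap, five retries wait 1, 2, 4, 8 and 10 seconds.
delays = backoff_delays(max_retries=5, base=1.0, cap=10.0)
```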
---

## Medium Severity Bugs

### Data Layer
- BUG-017: Missing `collection_name` index on Chunk table (`source_item.py:165-168`)
- BUG-018: AgentObservation dead code for future embedding types (`source_items.py:1005-1028`)
- BUG-019: Embed status never set to STORED after push (`content_processing.py:125`)
- BUG-020: Missing server_id index on DiscordMessage (`source_items.py:426-435`)

### Content Processing
- BUG-021: No chunk validation after break_chunk (`embedding.py:49-58`)
- BUG-022: Ebook extraction creates single massive chunk (`extract.py:218-230`)
- BUG-023: SHA256-only deduplication misses semantic duplicates (`source_item.py:51-91`)
- BUG-024: Email hash inconsistency with markdown conversion (`email.py:171-185`)
- BUG-025: Token approximation uses fixed 4-char ratio (`tokens.py:8-12`)

### Search System
- BUG-026: BM25 scores calculated then discarded (`bm25.py:66-70`)
- BUG-027: LLM score fallback missing - defaults to 0.0 (`scorer.py:55-60`)
- BUG-028: Missing filter validation (`embeddings.py:130-131`)
- BUG-029: Hardcoded min_score thresholds (`embeddings.py:186,202`)

### API Layer
- BUG-030: Missing rate limiting (global)
- BUG-031: No SearchConfig limits - can request millions of results (`types.py:73-78`)
- BUG-032: No CSRF protection (`auth.py:50-86`)
- BUG-033: Debug print statements in production (`memory.py:363-370`)
- BUG-034: Timezone handling issues (`oauth_provider.py:83-87`)

### Worker Tasks
- BUG-035: No task time limits (global)
- BUG-036: Database integrity errors not properly handled (`discord.py:310-321`)
- BUG-037: Timezone bug in scheduled calls (`scheduled_calls.py:152-153`)
- BUG-038: Beat schedule not thread-safe for distributed deployment (`ingest.py:19-56`)
- BUG-039: Email sync fails entire account on single folder error (`email.py:84-152`)

### Infrastructure
- BUG-040: Missing resource limits for postgres, redis, qdrant, api (`docker-compose.yaml`)
- BUG-041: Backup encryption silently disabled if key missing (`settings.py:215-216`)
- BUG-042: Restore scripts don't validate database integrity (`restore_databases.sh:79`)
- BUG-043: Health check doesn't check dependencies (`app.py:87-92`)
- BUG-044: Uvicorn trusts all proxy headers (`docker/api/Dockerfile:63`)

### Code Quality
- BUG-045: 183 unsafe cast() operations (various files)
- BUG-046: 21 type:ignore comments (various files)
- BUG-047: 32 bare except Exception blocks (various files)
- BUG-048: 13 exception swallowing with pass (various files)
- BUG-049: Missing CSRF in OAuth callback (`auth.py`)
- BUG-050: SQL injection in test database handling (`tests/conftest.py:94`)

---

## Low Severity Bugs

- BUG-051: Duplicate chunks (16 identical "Claude plays Pokemon" chunks)
- BUG-052: Garbage content in text collection
- BUG-053: No vector freshness index (`source_item.py:157`)
- BUG-054: OAuthToken missing Base inheritance (`users.py:215-228`)
- BUG-055: collection_model returns "unknown" (`collections.py:140`)
- BUG-056: Unused "appuser" in API Dockerfile (`docker/api/Dockerfile:48`)
- BUG-057: Build dependencies not cleaned up (`docker/api/Dockerfile:7-12`)
- BUG-058: Typos in log messages (`tests/conftest.py:63`)
- BUG-059: MockRedis overly simplistic (`tests/conftest.py:24-46`)
- BUG-060: Print statement in `ebook.py:192`

---

## Improvement Suggestions

### High Priority
1. **Implement proper retry logic** for all Celery tasks with exponential backoff
2. **Add comprehensive health checks** that validate all service dependencies
3. **Fix score aggregation** to use mean/max instead of sum
4. **Add rate limiting** to prevent DoS attacks
5. **Implement proper CSRF protection** for OAuth flows
6. **Add resource limits** to all Docker services
7. **Implement centralized logging** with ELK or Grafana Loki

### Medium Priority
1. **Re-chunk oversized content** - add validation to enforce size limits
2. **Add chunk deduplication** based on content hash within same source
3. **Preserve BM25 scores** for hybrid search weighting
4. **Add task progress tracking** for long-running operations
5. **Implement distributed beat lock** for multi-worker deployments
6. **Add backup verification tests** - periodically test restore
7. **Replace cast() with type guards** throughout codebase

### Lower Priority
1. **Add Prometheus metrics** for observability
2. **Implement structured JSON logging** with correlation IDs
3. **Add graceful shutdown handling** to workers
4. **Document configuration requirements** more thoroughly
5. **Add integration tests** for critical workflows
6. **Remove dead code** and TODO comments in production

---

## Feature Ideas

### Search Enhancements
1. **Hybrid score weighting** - configurable balance between BM25 and vector
2. **Query expansion** - automatic synonym/related term expansion
3. **Faceted search** - filter by date ranges, sources, tags with counts
4. **Search result highlighting** - show matched terms in context
5. **Saved searches** - store and re-run common queries

### Content Management
1. **Content quality scoring** - automatic assessment of chunk quality
2. **Duplicate detection UI** - show and merge semantic duplicates
3. **Re-indexing queue** - prioritize content for re-embedding
4. **Content archiving** - move old content to cold storage
5. **Bulk operations** - tag, delete, re-process multiple items

### Email Management
1. **Email filtering rules** - configurable rules to filter/categorize emails (e.g., skip marketing spam but keep order confirmations, shipping notifications, appointment reminders)
2. **Email source classification** - auto-detect email types (transactional, marketing, personal, receipts)
3. **Smart email retention** - keep "useful" emails (orders, bookings, confirmations) while filtering noise

### User Experience
1. **Search analytics** - track what users search for
2. **Relevance feedback** - let users rate results to improve ranking
3. **Personal knowledge graph** - visualize connections between content
4. **Smart summaries** - auto-generate summaries of search results
5. **Email digest** - scheduled summary of new content

### Infrastructure
1. **Auto-scaling workers** - scale based on queue depth
2. **Multi-tenant support** - isolate data by user/org
3. **Backup scheduling UI** - configure backup frequency
4. **Monitoring dashboard** - Grafana-style metrics visualization
5. **Audit logging** - track all data access and modifications

---

## Investigation Log

### 2025-12-19 - Complete Investigation

**Data Layer (10 issues)**
- Missing relationships (mcp_servers)
- Type mismatches (User.id)
- Missing indexes (collection_name, server_id)
- Dead code (AgentObservation)

**Content Processing (12 issues)**
- Critical: break_chunk bug appends wrong object
- Critical: Oversized chunks exceed limits
- Deduplication only on SHA256
- Ebook creates single massive chunk

**Search System (14 issues)**
- Critical: BM25 ignores filters
- Critical: Score aggregation broken (sum vs mean)
- Inverted min_score thresholds
- BM25 scores discarded

**API Layer (12 issues)**
- Critical: Path traversal vulnerabilities (3 endpoints)
- CORS misconfiguration
- Missing rate limiting
- Debug print statements

**Worker Tasks (20 issues)**
- No retry configuration
- Silent task failures
- Race condition in scheduled calls
- No task timeouts

**Infrastructure (12 issues)**
- Missing resource limits
- Backup encryption issues
- Health check incomplete
- No centralized logging

**Code Quality (20+ issues)**
- 183 unsafe casts
- 32 bare exception blocks
- Registration always enabled bug
- API key logging

---

## Database Statistics

```
Sources by Modality:
  forum: 981
  mail: 665
  text: 165
  comic: 115
  doc: 102
  book: 78
  observation: 26
  note: 3
  photo: 2
  blog: 1

Chunks by Collection:
  forum: 8786
  text: 1843
  mail: 1418
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Vectors in Qdrant:
  forum: 8778
  mail: 2756 (1338 orphaned!)
  text: 505 (1338 missing!)
  doc: 312
  book: 156
  semantic: 84
  comic: 49
  temporal: 26
  blog: 7
  photo: 2

Embed Status:
  STORED: 2056
  FAILED: 81
  RAW: 1
```

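The "orphaned" and "missing" annotations above follow from subtracting chunk counts from vector counts per collection. The check below uses only the numbers listed in this section; the variable names are illustrative. It reproduces the 1,338 surplus in `mail` and the 1,338 deficit in `text` from BUG-002, and also shows that `forum` has 8 fewer vectors than chunks.

```python
# Counts copied from "Chunks by Collection" and "Vectors in Qdrant" above.
chunks = {"forum": 8786, "text": 1843, "mail": 1418, "doc": 312, "book": 156,
          "semantic": 84, "comic": 49, "temporal": 26, "blog": 7, "photo": 2}
vectors = {"forum": 8778, "mail": 2756, "text": 505, "doc": 312, "book": 156,
           "semantic": 84, "comic": 49, "temporal": 26, "blog": 7, "photo": 2}

# Positive: more vectors than chunks (orphaned vectors).
# Negative: chunks whose vectors are absent from that collection.
surplus = {name: vectors[name] - chunks[name] for name in chunks}
mismatched = {name: diff for name, diff in surplus.items() if diff != 0}
```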
---

## Updated Priority List (After Second Pass)

### CRITICAL - Fix Immediately
1. ✅ **FIXED:** Path traversal vulnerabilities (BUG-001)
2. ✅ **FIXED:** Registration always enabled (BUG-005)
3. ✅ **FIXED:** Search score aggregation (BUG-004)
4. ✅ **FIXED:** CORS misconfiguration (BUG-014)
5. ✅ **FIXED:** Wrong object in break_chunk (BUG-007)
6. 🚨 **NEW:** Replace SHA-256 password hashing with bcrypt/argon2 (BUG-061)
7. 🔴 **OPEN:** Fix collection mismatch for 1,338 items (BUG-002)
8. 🔴 **OPEN:** Fix BM25 filter application (BUG-003)
9. 🔴 **OPEN:** Remove API key from logs (BUG-006)

### HIGH Priority
10. 🚨 **NEW:** Stop logging full OAuth tokens (BUG-062)
11. 🚨 **NEW:** Fix timing attack in password verification (BUG-065)
12. 🔴 **OPEN:** Add retry logic to all Celery tasks (BUG-015, BUG-016)
13. 🔴 **OPEN:** Fix scheduled call race condition (BUG-009)
14. 🔴 **OPEN:** Fix oversized chunks exceeding token limits (BUG-008)

### MEDIUM Priority
15. 🚨 **NEW:** Update 24+ deprecated SQLAlchemy .get() calls (BUG-063)
16. 🚨 **NEW:** Remove shell=True from subprocess calls (BUG-064)
17. 🔴 **OPEN:** Add resource limits to Docker services (BUG-040, BUG-067)
18. 🔴 **OPEN:** Missing MCP servers relationship (BUG-010)
19. 🔴 **OPEN:** User ID type mismatch (BUG-011)

### Summary Statistics
- **Total Bugs Found:** 118 (100+ original + 8 new in second pass)
- **Bugs Fixed:** 25+ (confirmed in recent commits)
- **Critical Bugs Open:** 4
- **High Priority Open:** 5
- **Medium/Low Open:** 80+

---

## Investigation Notes

### What Was Checked (Second Pass - 2025-12-19)
- ✅ Security vulnerabilities (SQL injection, command injection, XSS)
- ✅ Authentication implementation (password hashing, session management)
- ✅ Logging practices (credential exposure)
- ✅ Database patterns (deprecated APIs, missing indexes)
- ✅ Docker configuration (resource limits, persistence)
- ✅ OAuth implementation (state management, token handling)
- ✅ Code quality (exception handling, type safety)
- ✅ Recent commits and fixes

### Good Security Practices Observed
- ✅ Path traversal protection properly implemented (fixed)
- ✅ CORS properly configured with specific origins (fixed)
- ✅ Secrets loaded from files, not environment variables
- ✅ Services run as non-root users where possible
- ✅ Read-only filesystems for workers
- ✅ Security capabilities dropped in containers
- ✅ Healthchecks configured for critical services
- ✅ Git command arguments properly escaped with shlex.quote()
- ✅ Search result limits enforced (max 1000)
- ✅ Timeout limits enforced (max 300s)
- ✅ Rate limiting infrastructure exists for LLM usage

### Areas Still Needing Attention
- 🔴 Password hashing needs complete overhaul
- 🔴 Logging practices need audit for credential exposure
- 🔴 Database API modernization for SQLAlchemy 2.0
- 🔴 Resource limits need to be added to all services
- 🔴 Redis persistence configuration needs review