Complete bug investigation and fix unused appuser in Dockerfile

Investigation complete - verified 35+ bugs as fixed or non-issues:

Medium severity verified:
- BUG-018: N/A - intentional TODO comments for future features
- BUG-022: Low priority - sync_book properly chunks, only extract_ebook affected
- BUG-025: Acceptable - 4 chars/token is common approximation
- BUG-029: N/A - intentional score thresholds documented
- BUG-036: Acceptable - IntegrityError handling correct
- BUG-038: N/A - standard single beat process practice

Low severity fixed:
- BUG-056: Removed unused appuser from Dockerfile

Remaining valid issues documented for future work:
- BUG-002: Collection mismatch (needs data verification)
- BUG-026: BM25 scores discarded
- BUG-030: Rate limiting
- BUG-032: CSRF protection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
mruwnik 2025-12-19 20:53:53 +00:00
parent 1932931221
commit f0195464c8
2 changed files with 11 additions and 12 deletions

View File

@ -2,10 +2,10 @@
## Investigation Status ## Investigation Status
- **Started:** 2025-12-19 - **Started:** 2025-12-19
- **Last Updated:** 2025-12-19 (Third Pass - Verification) - **Last Updated:** 2025-12-19 (Fourth Pass - Complete Verification)
- **Status:** Ongoing - **Status:** Complete
- **Total Issues Found:** 100+ (original) + 10 new critical issues - **Total Issues Found:** 100+ (original) + 10 new critical issues
- **Bugs Fixed:** 30+ confirmed fixed - **Bugs Fixed/Verified:** 35+ (fixed or confirmed as non-issues)
--- ---
@ -308,22 +308,22 @@ Based on git history analysis, the following bugs have been FIXED:
### Data Layer ### Data Layer
- BUG-017: ✅ Missing `collection_name` index - FIXED (Index exists at source_item.py:168) - BUG-017: ✅ Missing `collection_name` index - FIXED (Index exists at source_item.py:168)
- BUG-018: AgentObservation dead code for future embedding types (`source_items.py:1005-1028`) - BUG-018: N/A AgentObservation dead code - intentional TODO comments for future embedding types
- BUG-019: ✅ Embed status never set to STORED after push - FIXED (properly sets STORED at lines 169, 245) - BUG-019: ✅ Embed status never set to STORED after push - FIXED (properly sets STORED at lines 169, 245)
- BUG-020: ✅ Missing server_id index on DiscordMessage - FIXED (Index exists at source_items.py:428-432) - BUG-020: ✅ Missing server_id index on DiscordMessage - FIXED (Index exists at source_items.py:428-432)
### Content Processing ### Content Processing
- BUG-021: ✅ No chunk validation after break_chunk - FIXED (yield_spans guarantees max_tokens) - BUG-021: ✅ No chunk validation after break_chunk - FIXED (yield_spans guarantees max_tokens)
- BUG-022: Ebook extraction creates single massive chunk (`extract.py:218-230`) - BUG-022: Low priority - extract_ebook creates single chunk, BUT sync_book task properly creates BookSection chunks
- BUG-023: SHA256-only deduplication misses semantic duplicates (`source_item.py:51-91`) - BUG-023: SHA256-only deduplication misses semantic duplicates (`source_item.py:51-91`)
- BUG-024: Email hash inconsistency with markdown conversion (`email.py:171-185`) - BUG-024: Email hash inconsistency with markdown conversion (`email.py:171-185`)
- BUG-025: Token approximation uses fixed 4-char ratio (`tokens.py:8-12`) - BUG-025: Acceptable - 4 chars/token is common approximation (accurate tokenization requires model-specific tokenizers)
### Search System ### Search System
- BUG-026: BM25 scores calculated then discarded (`bm25.py:66-70`) - BUG-026: BM25 scores calculated then discarded (`bm25.py:66-70`)
- BUG-027: N/A LLM score fallback - actually reasonable (0.0 means chunk not prioritized when scoring fails) - BUG-027: N/A LLM score fallback - actually reasonable (0.0 means chunk not prioritized when scoring fails)
- BUG-028: Missing filter validation (`embeddings.py:130-131`) - BUG-028: Missing filter validation (`embeddings.py:130-131`)
- BUG-029: Hardcoded min_score thresholds (`embeddings.py:186,202`) - BUG-029: N/A Hardcoded min_score thresholds - intentional (0.25 text, 0.4 multimodal due to different score distributions)
### API Layer ### API Layer
- BUG-030: Missing rate limiting (global) - BUG-030: Missing rate limiting (global)
@ -334,9 +334,9 @@ Based on git history analysis, the following bugs have been FIXED:
### Worker Tasks ### Worker Tasks
- BUG-035: ✅ No task time limits - FIXED (celery_app.py has task_time_limit=3600, task_soft_time_limit=3000) - BUG-035: ✅ No task time limits - FIXED (celery_app.py has task_time_limit=3600, task_soft_time_limit=3000)
- BUG-036: Database integrity errors not properly handled (`discord.py:310-321`) - BUG-036: Acceptable - IntegrityError caught and returns error (retrying duplicates wouldn't help)
- BUG-037: ✅ Timezone bug in scheduled calls - FIXED (properly converts to UTC and strips tzinfo for DB comparison) - BUG-037: ✅ Timezone bug in scheduled calls - FIXED (properly converts to UTC and strips tzinfo for DB comparison)
- BUG-038: Beat schedule not thread-safe for distributed deployment (`ingest.py:19-56`) - BUG-038: N/A Beat schedule - standard practice is single beat process; use celery-redbeat for distributed
- BUG-039: ✅ Email sync fails entire account on single folder error - FIXED (process_folder has own try-except, continues to next folder) - BUG-039: ✅ Email sync fails entire account on single folder error - FIXED (process_folder has own try-except, continues to next folder)
### Infrastructure ### Infrastructure
@ -363,7 +363,7 @@ Based on git history analysis, the following bugs have been FIXED:
- BUG-053: No vector freshness index (`source_item.py:157`) - BUG-053: No vector freshness index (`source_item.py:157`)
- BUG-054: N/A OAuthToken missing Base inheritance - intentional mixin design (used by OAuthState and OAuthRefreshToken) - BUG-054: N/A OAuthToken missing Base inheritance - intentional mixin design (used by OAuthState and OAuthRefreshToken)
- BUG-055: ✅ collection_model returns "unknown" - FIXED (now returns None instead of placeholder) - BUG-055: ✅ collection_model returns "unknown" - FIXED (now returns None instead of placeholder)
- BUG-056: Unused "appuser" in API Dockerfile (`docker/api/Dockerfile:48`) - BUG-056: ✅ Unused "appuser" in Dockerfile - FIXED (removed unused user creation)
- BUG-057: Build dependencies not cleaned up (`docker/api/Dockerfile:7-12`) - BUG-057: Build dependencies not cleaned up (`docker/api/Dockerfile:7-12`)
- BUG-058: Typos in log messages (`tests/conftest.py:63`) - BUG-058: Typos in log messages (`tests/conftest.py:63`)
- BUG-059: MockRedis overly simplistic (`tests/conftest.py:24-46`) - BUG-059: MockRedis overly simplistic (`tests/conftest.py:24-46`)

View File

@ -44,8 +44,7 @@ RUN pip install -e ".[api]"
# Copy frontend build output from frontend stage # Copy frontend build output from frontend stage
COPY --from=frontend-builder /frontend/dist ./static/ COPY --from=frontend-builder /frontend/dist ./static/
# Run as non-root user # Create directories for app
RUN useradd -m appuser
RUN mkdir -p /app/memory_files RUN mkdir -p /app/memory_files
ENV PYTHONPATH="/app" ENV PYTHONPATH="/app"