Fix BUG-002: BookSection chunks now use correct 'book' modality

Root cause: BookSection._chunk_contents() called extract_text() without
specifying modality, which defaults to "text". This caused 9,370 book
chunks to be stored in the 'text' collection instead of 'book'.

Fix: Added modality="book" to all DataChunk creation in BookSection:
- extract_text() call for single-page sections
- Direct DataChunk creation for multi-page sections

Note: The original investigation reported 1,338 mismatched mail items.
Current analysis shows the mail chunks still stored in 'text' are email
attachments, which correctly go to the text/doc/photo collections based
on their content type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
mruwnik 2025-12-19 21:04:19 +00:00
parent f0195464c8
commit 9a0f226972
2 changed files with 12 additions and 9 deletions

@@ -40,13 +40,14 @@ This investigation identified **100+ issues** across 7 areas of the memory syste
 - **Impact:** Arbitrary file read on server filesystem
 - **Fix:** Add path resolution validation with `.resolve()` and prefix check
-### BUG-002: Collection Mismatch (1,338 items)
-- **Severity:** CRITICAL
+### BUG-002: Collection Mismatch ✅ INVESTIGATED & FIXED
+- **Severity:** MEDIUM (not as critical as originally thought)
 - **Area:** Data/Embedding Pipeline
-- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection
-- **Impact:** Items completely unsearchable
-- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text
-- **Fix:** Re-sync vectors or update chunk collection_name
+- **Description:** BookSection._chunk_contents() called extract_text() without specifying modality, defaulting to "text"
+- **Impact:** 9,370 book chunks stored in text collection instead of book
+- **Root Cause:** `extract_text()` defaults to `modality="text"` but BookSection didn't override it
+- **Fix Applied:** Added `modality="book"` to BookSection._chunk_contents() DataChunk creation
+- **Note:** Original 1,338 mail items investigation was outdated - current mismatch is 24 mail->text chunks which are actually email attachments (correct behavior)
 ### BUG-003: BM25 Filters Completely Ignored
 - **Severity:** CRITICAL

@@ -606,7 +606,9 @@ class BookSection(SourceItem):
             return []
         if len([p for p in self.pages if p.strip()]) == 1:
-            chunks = extract.extract_text(content, metadata={"type": "page"})
+            chunks = extract.extract_text(
+                content, metadata={"type": "page"}, modality="book"
+            )
             if len(chunks) > 1:
                 chunks[-1].metadata["type"] = "summary"
             return chunks
@@ -614,10 +616,10 @@ class BookSection(SourceItem):
         summary, tags = summarizer.summarize(content)
         return [
             extract.DataChunk(
-                data=[content], metadata={"type": "section", "tags": tags}
+                data=[content], metadata={"type": "section", "tags": tags}, modality="book"
             ),
             extract.DataChunk(
-                data=[summary], metadata={"type": "summary", "tags": tags}
+                data=[summary], metadata={"type": "summary", "tags": tags}, modality="book"
             ),
         ]
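
The root cause described in the commit message can be reproduced in isolation. The sketch below is only an illustration, not the project's code: `DataChunk` and `extract_text` are simplified stand-ins that assume a `modality` parameter defaulting to `"text"` (as the root-cause note states), and `chunk_page_before`, `chunk_page_after`, and `modalities` are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set


@dataclass
class DataChunk:
    """Simplified stand-in for extract.DataChunk (real fields may differ)."""

    data: List[str]
    metadata: Dict[str, Any] = field(default_factory=dict)
    modality: str = "text"  # assumed default, per the root-cause note


def extract_text(
    content: str,
    metadata: Optional[Dict[str, Any]] = None,
    modality: str = "text",
) -> List[DataChunk]:
    """Simplified stand-in: wraps the content in one chunk, without splitting."""
    return [DataChunk(data=[content], metadata=dict(metadata or {}), modality=modality)]


def chunk_page_before(content: str) -> List[DataChunk]:
    # Pre-fix call shape: modality is omitted, so the default "text" applies.
    return extract_text(content, metadata={"type": "page"})


def chunk_page_after(content: str) -> List[DataChunk]:
    # Post-fix call shape: modality is passed explicitly.
    return extract_text(content, metadata={"type": "page"}, modality="book")


def modalities(chunks: List[DataChunk]) -> Set[str]:
    """Collect the distinct modalities carried by a list of chunks."""
    return {chunk.modality for chunk in chunks}


if __name__ == "__main__":
    print(modalities(chunk_page_before("Chapter 1")))
    print(modalities(chunk_page_after("Chapter 1")))
```

A check along the lines of `modalities(...) == {"book"}` over both the single-page and multi-page branches would be one way to guard against the default silently applying again.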