Mirror of https://github.com/mruwnik/memory.git (synced 2026-01-02 09:12:58 +01:00)
Fix BUG-002: BookSection chunks now use correct 'book' modality
Root cause: BookSection._chunk_contents() called extract_text() without specifying modality, which defaults to "text". This caused 9,370 book chunks to be stored in the 'text' collection instead of 'book'.

Fix: Added modality="book" to all DataChunk creation in BookSection:
- extract_text() call for single-page sections
- Direct DataChunk creation for multi-page sections

Note: The original investigation reported 1,338 mail items, but current analysis shows those are actually email attachments which correctly go to text/doc/photo collections based on their content type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
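The root cause described in the commit message can be illustrated with a minimal sketch. `DataChunk` and `extract_text` below are simplified stand-ins written for this example, not the repository's actual implementations; only the `modality` keyword and its `"text"` default are taken from the commit message.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DataChunk:
    """Simplified stand-in: a chunk plus the collection (modality) it targets."""
    data: list
    metadata: dict = field(default_factory=dict)
    modality: str = "text"  # assumed default, mirroring the commit message


def extract_text(content, metadata=None, modality="text"):
    """Stand-in extractor: wraps the content in one chunk with the given modality."""
    return [DataChunk(data=[content], metadata=dict(metadata or {}), modality=modality)]


def collection_counts(chunks):
    """Count how many chunks would land in each collection."""
    return Counter(chunk.modality for chunk in chunks)


# Before the fix: the caller omits modality, so the default silently applies.
before = extract_text("Chapter 1", metadata={"type": "page"})

# After the fix: the caller states the modality explicitly.
after = extract_text("Chapter 1", metadata={"type": "page"}, modality="book")

print(collection_counts(before))
print(collection_counts(after))
```

Because nothing fails when the keyword is omitted, this kind of misrouting only shows up later, when the per-collection counts are compared.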
parent f0195464c8
commit 9a0f226972
@@ -40,13 +40,14 @@ This investigation identified **100+ issues** across 7 areas of the memory syste
 - **Impact:** Arbitrary file read on server filesystem
 - **Fix:** Add path resolution validation with `.resolve()` and prefix check
 
-### BUG-002: Collection Mismatch (1,338 items)
-- **Severity:** CRITICAL
+### BUG-002: Collection Mismatch ✅ INVESTIGATED & FIXED
+- **Severity:** MEDIUM (not as critical as originally thought)
 - **Area:** Data/Embedding Pipeline
-- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection
-- **Impact:** Items completely unsearchable
-- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text
-- **Fix:** Re-sync vectors or update chunk collection_name
+- **Description:** BookSection._chunk_contents() called extract_text() without specifying modality, defaulting to "text"
+- **Impact:** 9,370 book chunks stored in text collection instead of book
+- **Root Cause:** `extract_text()` defaults to `modality="text"` but BookSection didn't override it
+- **Fix Applied:** Added `modality="book"` to BookSection._chunk_contents() DataChunk creation
+- **Note:** Original 1,338 mail items investigation was outdated - current mismatch is 24 mail->text chunks which are actually email attachments (correct behavior)
 
 ### BUG-003: BM25 Filters Completely Ignored
 - **Severity:** CRITICAL
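A mismatch of the kind recorded under BUG-002 could be counted with a sketch like the following. It assumes two hypothetical mappings, one from chunk ID to the collection recorded on the chunk and one from chunk ID to the collection that actually holds its vector. The function name and sample IDs are illustrative and do not come from the repository.

```python
from collections import Counter


def count_mismatches(recorded, stored):
    """Count chunks whose recorded collection differs from where the vector lives.

    Returns a Counter keyed by (recorded_collection, stored_collection).
    A chunk with no vector at all is keyed with stored_collection=None.
    """
    mismatches = Counter()
    for chunk_id, expected in recorded.items():
        actual = stored.get(chunk_id)
        if actual != expected:
            mismatches[(expected, actual)] += 1
    return mismatches


# Hypothetical sample data: c1 is misrouted, c4 has no vector.
recorded = {"c1": "book", "c2": "book", "c3": "text", "c4": "mail"}
stored = {"c1": "text", "c2": "book", "c3": "text"}

report = count_mismatches(recorded, stored)
for (expected, actual), n in sorted(report.items(), key=str):
    print(f"{expected} -> {actual}: {n}")
```

Grouping by the (recorded, stored) pair rather than reporting one total makes it easier to tell a genuine routing bug from expected behavior, such as attachments that legitimately belong to a different collection than their parent item.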
@@ -606,7 +606,9 @@ class BookSection(SourceItem):
             return []
 
         if len([p for p in self.pages if p.strip()]) == 1:
-            chunks = extract.extract_text(content, metadata={"type": "page"})
+            chunks = extract.extract_text(
+                content, metadata={"type": "page"}, modality="book"
+            )
             if len(chunks) > 1:
                 chunks[-1].metadata["type"] = "summary"
             return chunks
@@ -614,10 +616,10 @@ class BookSection(SourceItem):
         summary, tags = summarizer.summarize(content)
         return [
             extract.DataChunk(
-                data=[content], metadata={"type": "section", "tags": tags}
+                data=[content], metadata={"type": "section", "tags": tags}, modality="book"
             ),
             extract.DataChunk(
-                data=[summary], metadata={"type": "summary", "tags": tags}
+                data=[summary], metadata={"type": "summary", "tags": tags}, modality="book"
             ),
         ]
 
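A regression check covering both patched branches might look like the sketch below. `chunk_section`, `fake_summarize`, and this `DataChunk` are hypothetical stand-ins that only mirror the shape of the diff: the real method reads `self.content` and `self.pages`, and its single-page branch goes through `extract.extract_text()`, which may return more than one chunk.

```python
from dataclasses import dataclass, field


@dataclass
class DataChunk:
    """Simplified stand-in for the chunk type used in the diff."""
    data: list
    metadata: dict = field(default_factory=dict)
    modality: str = "text"


def chunk_section(pages, summarize):
    """Stand-in for the patched logic: every branch sets modality="book"."""
    non_empty = [p for p in pages if p.strip()]
    content = "\n\n".join(non_empty).strip()
    if not content:
        return []

    # Single-page branch (the real code calls extract_text here).
    if len(non_empty) == 1:
        return [DataChunk([content], {"type": "page"}, modality="book")]

    # Multi-page branch: one chunk for the section, one for its summary.
    summary, tags = summarize(content)
    return [
        DataChunk([content], {"type": "section", "tags": tags}, modality="book"),
        DataChunk([summary], {"type": "summary", "tags": tags}, modality="book"),
    ]


def fake_summarize(text):
    """Hypothetical summarizer: returns a truncated text and a fixed tag list."""
    return text[:20], ["example"]


single = chunk_section(["only page"], fake_summarize)
multi = chunk_section(["page one", "page two"], fake_summarize)

# The property the fix is meant to guarantee: no chunk falls back to "text".
assert all(chunk.modality == "book" for chunk in single + multi)
```

Asserting on the modality of every returned chunk, rather than on one branch, guards against a future branch being added without the explicit keyword.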