mirror of
https://github.com/mruwnik/memory.git
synced 2026-01-02 17:22:58 +01:00
Fix BUG-002: BookSection chunks now use correct 'book' modality
Root cause: BookSection._chunk_contents() called extract_text() without specifying modality, which defaults to "text". This caused 9,370 book chunks to be stored in the 'text' collection instead of 'book'.

Fix: Added modality="book" to all DataChunk creation in BookSection:
- extract_text() call for single-page sections
- Direct DataChunk creation for multi-page sections

Note: The original investigation reported 1,338 mail items, but current analysis shows those are actually email attachments which correctly go to text/doc/photo collections based on their content type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent f0195464c8
commit 9a0f226972
@@ -40,13 +40,14 @@ This investigation identified **100+ issues** across 7 areas of the memory syste
 - **Impact:** Arbitrary file read on server filesystem
 - **Fix:** Add path resolution validation with `.resolve()` and prefix check
 
-### BUG-002: Collection Mismatch (1,338 items)
-- **Severity:** CRITICAL
+### BUG-002: Collection Mismatch ✅ INVESTIGATED & FIXED
+- **Severity:** MEDIUM (not as critical as originally thought)
 - **Area:** Data/Embedding Pipeline
-- **Description:** Mail items have chunks with `collection_name='text'` but vectors stored in Qdrant's `mail` collection
-- **Impact:** Items completely unsearchable
-- **Evidence:** 1,338 orphaned vectors in mail, 1,338 missing in text
-- **Fix:** Re-sync vectors or update chunk collection_name
+- **Description:** BookSection._chunk_contents() called extract_text() without specifying modality, defaulting to "text"
+- **Impact:** 9,370 book chunks stored in text collection instead of book
+- **Root Cause:** `extract_text()` defaults to `modality="text"` but BookSection didn't override it
+- **Fix Applied:** Added `modality="book"` to BookSection._chunk_contents() DataChunk creation
+- **Note:** Original 1,338 mail items investigation was outdated - current mismatch is 24 mail->text chunks which are actually email attachments (correct behavior)
 
 ### BUG-003: BM25 Filters Completely Ignored
 - **Severity:** CRITICAL
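The BUG-001 context lines in this hunk recommend path resolution validation with `.resolve()` and a prefix check. A minimal sketch of that idea follows. It is not code from the repository: `resolve_inside`, `base_dir`, and `user_path` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it escapes base_dir.

    Hypothetical helper: the name and signature are assumptions.
    """
    base = Path(base_dir).resolve()
    # Join first, then resolve, so that ".." segments and symlinks are
    # collapsed before the containment check runs. An absolute user_path
    # replaces base in the join and is then rejected below.
    candidate = (base / user_path).resolve()
    # Compare path components rather than raw strings, so a sibling such as
    # "/srv/files-evil" is not mistaken for a child of "/srv/files".
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Under these assumptions, `resolve_inside("/srv/files", "notes/a.txt")` returns a path below the base directory, while `"../secret.txt"` or `"/etc/passwd"` raises `ValueError`.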
@@ -606,7 +606,9 @@ class BookSection(SourceItem):
             return []
 
         if len([p for p in self.pages if p.strip()]) == 1:
-            chunks = extract.extract_text(content, metadata={"type": "page"})
+            chunks = extract.extract_text(
+                content, metadata={"type": "page"}, modality="book"
+            )
             if len(chunks) > 1:
                 chunks[-1].metadata["type"] = "summary"
             return chunks
@@ -614,10 +616,10 @@ class BookSection(SourceItem):
         summary, tags = summarizer.summarize(content)
         return [
             extract.DataChunk(
-                data=[content], metadata={"type": "section", "tags": tags}
+                data=[content], metadata={"type": "section", "tags": tags}, modality="book"
             ),
             extract.DataChunk(
-                data=[summary], metadata={"type": "summary", "tags": tags}
+                data=[summary], metadata={"type": "summary", "tags": tags}, modality="book"
             ),
         ]
 
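The root cause described in the commit message is a keyword argument whose default silently picks the wrong collection. The sketch below reproduces that mechanism in isolation. `DataChunk` and `extract_text` here are simplified stand-ins written for this example, not the repository's `extract` module; the `"text"` default on the `DataChunk` stand-in in particular is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataChunk:
    # Simplified stand-in for extract.DataChunk (assumed shape).
    data: List[Any]
    metadata: Dict[str, Any] = field(default_factory=dict)
    modality: str = "text"  # assumed default, mirroring extract_text


def extract_text(
    content: str,
    metadata: Optional[Dict[str, Any]] = None,
    modality: str = "text",
) -> List[DataChunk]:
    # Stand-in for extract.extract_text: callers that omit `modality`
    # get chunks tagged "text", as the commit message describes.
    return [DataChunk(data=[content], metadata=dict(metadata or {}), modality=modality)]


# Before the fix: the caller relies on the default.
before = extract_text("Chapter 1", metadata={"type": "page"})
# After the fix: the caller states the modality explicitly.
after = extract_text("Chapter 1", metadata={"type": "page"}, modality="book")

print(before[0].modality, after[0].modality)  # prints: text book
```

Because the default is a valid value, nothing fails at call time; the mismatch only shows up later as chunks counted in the wrong collection, which is how the 9,370 misplaced book chunks were found.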