feat(kb): add Markdown-aware chunker for structured documents #8151
Open
fausalors wants to merge 2 commits into
Conversation
Contributor
Hey - I've found 2 issues and left some high-level feedback:
- The Markdown heading detection uses a simple regex over the whole text and will also match `#` lines inside fenced code blocks or inline examples; consider skipping fenced code blocks (e.g., track ```/~~~ sections) before applying the heading regex to avoid spurious section splits.
- In `upload_document`, a new `MarkdownChunker` is instantiated for every Markdown upload instead of reusing or deriving from the existing `self.chunker` configuration; consider reusing a shared instance or a small factory to avoid per-upload allocation and to keep configuration (like `include_heading_context`/`max_heading_depth`) consistently customizable.
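A minimal standalone sketch of the fence-aware detection suggested above (the regexes and function name are illustrative, not the PR's actual code):

```python
import re

# Illustrative patterns; the PR's real heading regex may differ
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE_RE = re.compile(r"^(```|~~~)")


def find_headings(text: str) -> list[tuple[int, str]]:
    """Return (line_number, title) pairs, skipping '#' lines that sit
    inside ``` or ~~~ fenced code blocks."""
    headings: list[tuple[int, str]] = []
    in_fence = False
    for i, line in enumerate(text.splitlines()):
        if FENCE_RE.match(line.strip()):
            # Toggle on every fence marker; a fuller version would pair
            # opening and closing markers of the same kind.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        m = HEADING_RE.match(line)
        if m:
            headings.append((i, m.group(2).strip()))
    return headings
```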
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="110-119" />
<code_context>
+                chunk_text = sub_chunk
+            raw_chunks.append((chunk_text.strip(), True))
+
+        # Merge chunks without a substantive body into the next chunk that has one
+        merged = []
+        pending = ""
+        for chunk_text, has_body in raw_chunks:
+            if not chunk_text:
+                continue
+            if not has_body:
+                # Heading-only section: hold it until the next chunk with content
+                pending += chunk_text + "\n\n"
+            else:
+                if pending:
+                    combined = pending + chunk_text
+                    # Merge if the combined text still fits within chunk_size
+                    if len(combined) <= chunk_size:
+                        merged.append(combined.strip())
+                    else:
+                        # Too long: keep them as separate chunks
+                        merged.append(pending.strip())
+                        merged.append(chunk_text.strip())
+                    pending = ""
+                else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pending header-only chunks can exceed `chunk_size` without being further split
Because `pending` is only size-checked when combined with a `has_body=True` chunk, a run of many or very long header-only chunks can make `pending` exceed `chunk_size`. In that case, the `else` path appends `pending`/`pending.strip()` directly to `merged` (and similarly in the tail handling), producing chunks larger than `chunk_size`. Consider either enforcing a size cap as `pending` grows or passing `pending` through the fallback chunker before appending, so all emitted chunks stay within `chunk_size`.
</issue_to_address>
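One way to enforce the cap, shown as a standalone sketch (the function name and signature are hypothetical, not from the diff): grow `pending` through a helper that flushes before the buffer can exceed `chunk_size`.

```python
def cap_pending(pending: str, heading_chunk: str, chunk_size: int, merged: list[str]) -> str:
    """Accumulate a heading-only chunk into `pending`, flushing the buffer to
    `merged` whenever adding the new text would exceed chunk_size, so emitted
    chunks stay bounded. Hypothetical helper, not the PR's code."""
    candidate = pending + heading_chunk + "\n\n"
    if len(candidate) > chunk_size and pending:
        merged.append(pending.strip())
        return heading_chunk + "\n\n"
    return candidate
```

A single heading longer than `chunk_size` would still need the fallback chunker; this only bounds runs of ordinary headings.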
### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="13" />
<code_context>
+from .recursive import RecursiveCharacterChunker
+
+
+class MarkdownChunker(BaseChunker):
+    """Markdown-aware chunker
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `MarkdownChunker` to introduce small typed helpers (e.g., dataclasses and merge functions) so `chunk()` becomes a clearer high-level orchestration rather than a large stateful method.
You can keep all current behavior but reduce complexity by introducing small typed objects and extracting the “phases” of `chunk()` into helpers. That will make the control flow much easier to follow and maintain.
### 1. Use a `Section` dataclass instead of generic dicts
Encoding the structure of `_parse_sections` in a dataclass removes a lot of implicit knowledge from `chunk()`:
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    heading_path: List[str]
    text: str
    has_body: bool
```
Update `_parse_sections` to return `list[Section]`:
```python
def _parse_sections(self, text: str) -> list[Section]:
    ...
    sections: list[Section] = []
    preamble = text[: headings[0]["start"]].strip()
    if preamble:
        sections.append(Section(heading_path=[], text=preamble, has_body=True))
    heading_stack: list[dict] = []
    for i, heading in enumerate(headings):
        ...
        section_text = heading_line
        if body:
            section_text += "\n" + body
        heading_path = [h["title"] for h in heading_stack[:-1]]
        sections.append(
            Section(
                heading_path=heading_path,
                text=section_text,
                has_body=bool(body),
            )
        )
    return sections
```
This lets `chunk()` work with `Section` objects directly, instead of re‑deriving meanings from dict keys.
### 2. Extract heading-context / continuation formatting
Move the continuation-label formatting out of the main loop so `chunk()` only decides *when* to create continuation chunks:
```python
def _apply_heading_context(
    self, heading_path: list[str], content: str, is_continuation: bool
) -> str:
    if not self.include_heading_context or not heading_path:
        return content.strip()
    title = " > ".join(heading_path)
    if is_continuation:
        return f"[续] {title}\n\n{content}".strip()
    else:
        return f"{title}\n\n{content}".strip()
```
Use it inside your fallback chunking:
```python
sub_chunks = await self._fallback_chunker.chunk(
    section_text, chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
for i, sub_chunk in enumerate(sub_chunks):
    raw_chunks.append(
        (
            self._apply_heading_context(
                heading_path, sub_chunk, is_continuation=(i > 0)
            ),
            True,
        )
    )
```
This removes a nested conditional from the main orchestration logic.
### 3. Factor out the two merge passes into helpers
You currently have two separate passes with different ad‑hoc state (`pending`, `buf`). You can keep behavior but make it clearer by using small helpers over a simple chunk type.
Introduce a light `Chunk` type:
```python
@dataclass
class Chunk:
    text: str
    has_body: bool  # for heading-only merge; ignored by size merge
```
Then:
```python
def _merge_heading_only_chunks(
    self, chunks: list[Chunk], chunk_size: int
) -> list[str]:
    merged: list[str] = []
    pending = ""
    for c in chunks:
        if not c.text:
            continue
        if not c.has_body:
            pending += c.text + "\n\n"
            continue
        if pending:
            combined = pending + c.text
            if len(combined) <= chunk_size:
                merged.append(combined.strip())
            else:
                merged.append(pending.strip())
                merged.append(c.text.strip())
            pending = ""
        else:
            merged.append(c.text.strip())
    if pending:
        if merged and len(merged[-1] + "\n\n" + pending.strip()) <= chunk_size:
            merged[-1] = merged[-1] + "\n\n" + pending.strip()
        else:
            merged.append(pending.strip())
    return [m for m in merged if m.strip()]
```
And a separate helper that only cares about size, not headings:
```python
def _merge_short_chunks_by_size(
    self, chunks: list[str], chunk_size: int, min_chunk_size: int
) -> list[str]:
    if min_chunk_size <= 0 or len(chunks) <= 1:
        return chunks
    final: list[str] = []
    buf = ""
    for c in chunks:
        if buf:
            combined = buf + "\n\n" + c
            if len(combined) <= chunk_size:
                buf = combined
            else:
                final.append(buf)
                buf = c if len(c) < min_chunk_size else ""
                if len(c) >= min_chunk_size:
                    final.append(c)
        elif len(c) < min_chunk_size:
            buf = c
        else:
            final.append(c)
    if buf:
        if final and len(final[-1] + "\n\n" + buf) <= chunk_size:
            final[-1] = final[-1] + "\n\n" + buf
        else:
            final.append(buf)
    return final
```
Then `chunk()` becomes a higher‑level orchestration:
```python
sections = self._parse_sections(text)
...
raw_chunks: list[Chunk] = self._sections_to_chunks(
    sections, chunk_size, chunk_overlap
)
merged = self._merge_heading_only_chunks(raw_chunks, chunk_size)
merged = self._merge_short_chunks_by_size(
    merged, chunk_size, self.min_chunk_size
)
return merged
```
Where `_sections_to_chunks` contains only the “section → raw chunk” logic (heading context + recursion), independent from merging rules.
This keeps all current behavior but flattens the nested state machines into small, named helpers that are easier to reason about and test independently.
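As a rough illustration of that parsing/merging split, a standalone version of the `_sections_to_chunks` phase might look like this (heading-context formatting omitted; the `split` callback stands in for the fallback chunker, and all names here are sketches rather than the PR's API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Section:
    heading_path: list[str]
    text: str
    has_body: bool


@dataclass
class Chunk:
    text: str
    has_body: bool


def sections_to_chunks(
    sections: list[Section], chunk_size: int, split: Callable[[str, int], list[str]]
) -> list[Chunk]:
    """Sections that fit become one chunk each; oversized sections are cut by
    the supplied fallback `split`, and every resulting piece counts as a body
    chunk. Sketch only; the PR's method is async and adds heading context."""
    chunks: list[Chunk] = []
    for s in sections:
        if len(s.text) <= chunk_size:
            chunks.append(Chunk(text=s.text.strip(), has_body=s.has_body))
        else:
            for piece in split(s.text, chunk_size):
                chunks.append(Chunk(text=piece.strip(), has_body=True))
    return chunks
```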
</issue_to_address>
Contributor
Code Review
This pull request introduces a MarkdownChunker to provide structure-aware document splitting for Markdown files, preserving heading context across chunks and integrating it into the document upload process. Feedback focuses on improving the robustness of the chunker, specifically by clamping max_heading_depth to prevent regex errors, adjusting sub-chunk sizes to account for prefix lengths, and refining header detection to ignore matches within code blocks. Additionally, it is recommended to avoid hardcoded language-specific strings like "[续]" to support multi-language environments.
- Clamp max_heading_depth to 1-6 to prevent regex errors
- Deduct prefix length when splitting oversized sections
- Replace hardcoded "[续]" with configurable continuation_prefix
- Skip fenced code blocks when detecting headings
- Cap pending size to prevent chunks exceeding chunk_size
- Refactor into dataclass + helper methods
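The depth clamp from the first item above can be sketched as follows (a standalone illustration; the function name is hypothetical):

```python
import re


def heading_pattern(max_heading_depth: int) -> re.Pattern:
    """Clamp the configured depth to Markdown's valid 1-6 range before it is
    interpolated into the regex, so values like 0 or 99 cannot produce a
    malformed or overly broad quantifier. Illustrative sketch."""
    depth = min(max(max_heading_depth, 1), 6)
    return re.compile(rf"^(#{{1,{depth}}})\s+(.*)$", re.MULTILINE)
```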
The existing `RecursiveCharacterChunker` splits Markdown documents purely by character count, which breaks the heading hierarchy: a single chunk may contain the tail of one section and the head of the next, producing semantically incoherent chunks and degrading knowledge-base retrieval quality. This PR adds `MarkdownChunker`, which splits documents along heading levels so each section stays semantically intact. It is enabled automatically when a `.md` file is uploaded and has no effect on non-Markdown files.

Modifications / 改动点

- Added `astrbot/core/knowledge_base/chunking/markdown.py`: a Markdown-aware chunker that splits by heading hierarchy, merges short chunks, and falls back to recursive splitting for oversized sections.
- Modified `astrbot/core/knowledge_base/chunking/__init__.py`: export `MarkdownChunker`.
- Modified `astrbot/core/knowledge_base/kb_helper.py`: detect the `.md`/`.markdown`/`.mkd`/`.mdx` suffixes on upload and switch to `MarkdownChunker` automatically.

This is NOT a breaking change.
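The suffix-based dispatch in kb_helper.py can be pictured as this standalone sketch (names are illustrative; the real code works on `self.chunker` inside `upload_document`):

```python
from pathlib import Path

MARKDOWN_SUFFIXES = {".md", ".markdown", ".mkd", ".mdx"}


def select_chunker(file_path: str, default_chunker, markdown_chunker):
    """Use the Markdown-aware chunker for Markdown files and the configured
    default for everything else. Hypothetical helper, not the PR's API."""
    if Path(file_path).suffix.lower() in MARKDOWN_SUFFIXES:
        return markdown_chunker
    return default_chunker
```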
Screenshots or Test Results / 运行截图或测试结果

Benchmark: 3 documents (distributed systems design + ML engineering practice + CCF LLM certification syllabus; 11,669 characters in total), 20 queries, chunk_size=256.

bge-m3 (Pro/BAAI/bge-m3, 1024d): Recall@1 +40%, Precision@1 reaches 1.0.

mxbai-embed-large (local, 1024d): Recall@3 +55%.
Checklist / 检查清单

- 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
- 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
- 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in `requirements.txt` and `pyproject.toml`.
- 😮 My changes do not introduce malicious code.
Summary by Sourcery
Introduce a Markdown-aware chunker for knowledge base documents and automatically apply it to Markdown uploads to preserve section structure during chunking.
New Features:
Enhancements: