
feat(kb): add Markdown-aware chunker for structured documents #8151

Open
fausalors wants to merge 2 commits into AstrBotDevs:master from fausalors:feat/markdown-chunker

Conversation


@fausalors fausalors commented May 11, 2026

The existing RecursiveCharacterChunker splits Markdown documents purely by character count, which breaks the heading hierarchy: a single chunk may contain the tail of one section and the head of the next, producing semantically incoherent chunks and degrading knowledge-base retrieval quality.

This PR adds MarkdownChunker, which splits documents along the heading hierarchy so each section stays semantically intact. It is enabled automatically when a .md file is uploaded and has no effect on non-Markdown files.
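The heading-hierarchy splitting idea can be sketched as follows (a minimal illustration, not the PR's actual `MarkdownChunker`, which additionally handles heading context, short-chunk merging, and a recursive fallback):

```python
import re

# Match ATX headings (1-6 '#' characters) at the start of a line.
HEADING_RE = re.compile(r"^#{1,6}\s+.*$", re.MULTILINE)

def split_by_headings(text: str) -> list[str]:
    """Split Markdown into one section per heading, plus any preamble."""
    matches = list(HEADING_RE.finditer(text))
    if not matches:
        return [text.strip()] if text.strip() else []
    sections = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        sections.append(preamble)
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append(text[m.start():end].strip())
    return sections
```

Each returned section starts at a heading and runs to the next heading, so a chunk never straddles a section boundary.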

Modifications

  • Add astrbot/core/knowledge_base/chunking/markdown.py: a Markdown-aware chunker that splits by heading hierarchy, merges short chunks, and falls back to recursive splitting for oversized sections

  • Modify astrbot/core/knowledge_base/chunking/__init__.py: export MarkdownChunker

  • Modify astrbot/core/knowledge_base/kb_helper.py: detect the .md/.markdown/.mkd/.mdx suffixes at upload time and switch to MarkdownChunker automatically

  • This is NOT a breaking change.
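The extension-based dispatch described above can be sketched like this (hypothetical helper name; the actual wiring in `kb_helper.py` may differ):

```python
from pathlib import Path

MARKDOWN_EXTS = {".md", ".markdown", ".mkd", ".mdx"}

def pick_chunker_name(filename: str) -> str:
    """Choose a chunker for an uploaded file based on its suffix (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    return "MarkdownChunker" if suffix in MARKDOWN_EXTS else "RecursiveCharacterChunker"
```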

Screenshots or Test Results

Benchmark: 3 documents (distributed systems design, ML engineering practice, and a CCF large-model certification syllabus; 11,669 characters in total), 20 queries, chunk_size=256.

bge-m3 (Pro/BAAI/bge-m3, 1024d)

| Strategy | Recall@1 | Precision@1 | Recall@3 | Precision@3 |
| --- | --- | --- | --- | --- |
| RecursiveCharacterChunker (baseline) | 0.5292 | 0.9500 | 0.7458 | 0.5000 |
| MarkdownChunker | 0.7433 | 1.0000 | 0.7850 | 0.3667 |

Recall@1 +40%, Precision@1 reaches 1.0

mxbai-embed-large (local, 1024d)

| Strategy | Recall@1 | Precision@1 | Recall@3 | Precision@3 |
| --- | --- | --- | --- | --- |
| RecursiveCharacterChunker (baseline) | 0.2667 | 0.5000 | 0.3792 | 0.2500 |
| MarkdownChunker | 0.4350 | 0.5500 | 0.5867 | 0.2833 |

Recall@3 +55%
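For reference, the Recall@k and Precision@k numbers above follow the usual per-query definitions, which can be sketched as (the actual benchmark script is not part of this PR):

```python
def recall_precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Recall@k: fraction of relevant chunks that appear in the top-k results.
    Precision@k: fraction of the top-k results that are relevant."""
    hits = sum(1 for c in retrieved[:k] if c in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / k
    return recall, precision
```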


Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Introduce a Markdown-aware chunker for knowledge base documents and automatically apply it to Markdown uploads to preserve section structure during chunking.

New Features:

  • Add a MarkdownChunker that splits documents by Markdown heading hierarchy with optional heading context and recursive fallback for oversized sections.

Enhancements:

  • Export MarkdownChunker from the chunking package and select it dynamically in document upload flow for Markdown file extensions to improve retrieval quality for Markdown knowledge sources.

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 11, 2026
@auto-assign auto-assign Bot requested review from LIghtJUNction and Raven95676 May 11, 2026 10:26
@dosubot dosubot Bot added the feature:knowledge-base The bug / feature is about knowledge base label May 11, 2026
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • The Markdown heading detection uses a simple regex over the whole text and will also match # lines inside fenced code blocks or inline examples; consider skipping fenced code blocks (e.g., track ```/~~~ sections) before applying the heading regex to avoid spurious section splits.
  • In upload_document, a new MarkdownChunker is instantiated for every Markdown upload instead of reusing or deriving from the existing self.chunker configuration; consider reusing a shared instance or a small factory to avoid per-upload allocation and to keep configuration (like include_heading_context/max_heading_depth) consistently customizable.
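The first point, masking fenced code blocks before running the heading regex, could look roughly like this (a sketch assuming fences toggle on lines that start with ```/~~~; it does not handle every CommonMark edge case):

```python
def mask_fenced_blocks(text: str) -> str:
    """Blank out lines inside fenced code blocks so '#' lines there are not
    mistaken for headings; line lengths are preserved, so regex match offsets
    computed on the masked text remain valid in the original text."""
    out = []
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence  # opening or closing fence line
            out.append(line)
        elif in_fence:
            body = line.rstrip("\n")
            out.append(" " * len(body) + line[len(body):])  # spaces + original newline
        else:
            out.append(line)
    return "".join(out)
```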
## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="110-119" />
<code_context>
+                        chunk_text = sub_chunk
+                    raw_chunks.append((chunk_text.strip(), True))
+
+        # merge chunks without real body text into the next chunk that has a body
+        merged = []
+        pending = ""
+        for chunk_text, has_body in raw_chunks:
+            if not chunk_text:
+                continue
+            if not has_body:
+                # heading-only section: buffer it until the next chunk with content
+                pending += chunk_text + "\n\n"
+            else:
+                if pending:
+                    combined = pending + chunk_text
+                    # merge if the combined text fits within chunk_size
+                    if len(combined) <= chunk_size:
+                        merged.append(combined.strip())
+                    else:
+                        # too long; keep them as separate chunks
+                        merged.append(pending.strip())
+                        merged.append(chunk_text.strip())
+                    pending = ""
+                else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pending header-only chunks can exceed `chunk_size` without being further split

Because `pending` is only size-checked when combined with a `has_body=True` chunk, a run of many or very long header-only chunks can make `pending` exceed `chunk_size`. In that case, the `else` path appends `pending`/`pending.strip()` directly to `merged` (and similarly in the tail handling), producing chunks larger than `chunk_size`. Consider either enforcing a size cap as `pending` grows or passing `pending` through the fallback chunker before appending, so all emitted chunks stay within `chunk_size`.
</issue_to_address>
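One way to enforce the cap, hypothetically, is to route `pending` through the fallback splitter whenever it is flushed (`fallback_split` below stands in for the PR's `RecursiveCharacterChunker`):

```python
from typing import Callable

def flush_pending(pending: str, chunk_size: int,
                  fallback_split: Callable[[str, int], list[str]]) -> list[str]:
    """Emit buffered heading-only text as one chunk, or split it when oversized,
    so no emitted chunk exceeds chunk_size."""
    pending = pending.strip()
    if not pending:
        return []
    if len(pending) <= chunk_size:
        return [pending]
    return fallback_split(pending, chunk_size)
```

In the merge loop, `merged.extend(flush_pending(pending, chunk_size, fallback_split))` would then replace the bare `merged.append(pending.strip())` calls.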

### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="13" />
<code_context>
+from .recursive import RecursiveCharacterChunker
+
+
+class MarkdownChunker(BaseChunker):
+    """Markdown-aware chunker
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `MarkdownChunker` to introduce small typed helpers (e.g., dataclasses and merge functions) so `chunk()` becomes a clearer high-level orchestration rather than a large stateful method.

You can keep all current behavior but reduce complexity by introducing small typed objects and extracting the “phases” of `chunk()` into helpers. That will make the control flow much easier to follow and maintain.

### 1. Use a `Section` dataclass instead of generic dicts

Encoding the structure of `_parse_sections` in a dataclass removes a lot of implicit knowledge from `chunk()`:

```python
from dataclasses import dataclass

@dataclass
class Section:
    heading_path: list[str]
    text: str
    has_body: bool
```

Update `_parse_sections` to return `list[Section]`:

```python
def _parse_sections(self, text: str) -> list[Section]:
    ...
    sections: list[Section] = []

    preamble = text[: headings[0]["start"]].strip()
    if preamble:
        sections.append(Section(heading_path=[], text=preamble, has_body=True))

    heading_stack: list[dict] = []

    for i, heading in enumerate(headings):
        ...
        section_text = heading_line
        if body:
            section_text += "\n" + body

        heading_path = [h["title"] for h in heading_stack[:-1]]

        sections.append(
            Section(
                heading_path=heading_path,
                text=section_text,
                has_body=bool(body),
            )
        )

    return sections
```

This lets `chunk()` work with `Section` objects directly, instead of re‑deriving meanings from dict keys.

### 2. Extract heading-context / continuation formatting

Move the continuation-label formatting out of the main loop so `chunk()` only decides *when* to create continuation chunks:

```python
def _apply_heading_context(
    self, heading_path: list[str], content: str, is_continuation: bool
) -> str:
    if not self.include_heading_context or not heading_path:
        return content.strip()

    title = " > ".join(heading_path)
    if is_continuation:
        return f"[续] {title}\n\n{content}".strip()
    else:
        return f"{title}\n\n{content}".strip()
```

Use it inside your fallback chunking:

```python
sub_chunks = await self._fallback_chunker.chunk(
    section_text, chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
for i, sub_chunk in enumerate(sub_chunks):
    raw_chunks.append(
        (
            self._apply_heading_context(
                heading_path, sub_chunk, is_continuation=(i > 0)
            ),
            True,
        )
    )
```

This removes a nested conditional from the main orchestration logic.

### 3. Factor out the two merge passes into helpers

You currently have two separate passes with different ad‑hoc state (`pending`, `buf`). You can keep behavior but make it clearer by using small helpers over a simple chunk type.

Introduce a light `Chunk` type:

```python
@dataclass
class Chunk:
    text: str
    has_body: bool  # for heading-only merge; ignored by size merge
```

Then:

```python
def _merge_heading_only_chunks(
    self, chunks: list[Chunk], chunk_size: int
) -> list[str]:
    merged: list[str] = []
    pending = ""

    for c in chunks:
        if not c.text:
            continue
        if not c.has_body:
            pending += c.text + "\n\n"
            continue

        if pending:
            combined = pending + c.text
            if len(combined) <= chunk_size:
                merged.append(combined.strip())
            else:
                merged.append(pending.strip())
                merged.append(c.text.strip())
            pending = ""
        else:
            merged.append(c.text.strip())

    if pending:
        if merged and len(merged[-1] + "\n\n" + pending.strip()) <= chunk_size:
            merged[-1] = merged[-1] + "\n\n" + pending.strip()
        else:
            merged.append(pending.strip())

    return [m for m in merged if m.strip()]
```

And a separate helper that only cares about size, not headings:

```python
def _merge_short_chunks_by_size(
    self, chunks: list[str], chunk_size: int, min_chunk_size: int
) -> list[str]:
    if min_chunk_size <= 0 or len(chunks) <= 1:
        return chunks

    final: list[str] = []
    buf = ""

    for c in chunks:
        if buf:
            combined = buf + "\n\n" + c
            if len(combined) <= chunk_size:
                buf = combined
            else:
                final.append(buf)
                if len(c) < min_chunk_size:
                    buf = c
                else:
                    buf = ""
                    final.append(c)
        elif len(c) < min_chunk_size:
            buf = c
        else:
            final.append(c)

    if buf:
        if final and len(final[-1] + "\n\n" + buf) <= chunk_size:
            final[-1] = final[-1] + "\n\n" + buf
        else:
            final.append(buf)

    return final
```

Then `chunk()` becomes a higher‑level orchestration:

```python
sections = self._parse_sections(text)
...
raw_chunks: list[Chunk] = self._sections_to_chunks(
    sections, chunk_size, chunk_overlap
)
merged = self._merge_heading_only_chunks(raw_chunks, chunk_size)
merged = self._merge_short_chunks_by_size(
    merged, chunk_size, self.min_chunk_size
)
return merged
```

Where `_sections_to_chunks` contains only the “section → raw chunk” logic (heading context + recursion), independent from merging rules.

This keeps all current behavior but flattens the nested state machines into small, named helpers that are easier to reason about and test independently.
</issue_to_address>


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a MarkdownChunker to provide structure-aware document splitting for Markdown files, preserving heading context across chunks and integrating it into the document upload process. Feedback focuses on improving the robustness of the chunker, specifically by clamping max_heading_depth to prevent regex errors, adjusting sub-chunk sizes to account for prefix lengths, and refining header detection to ignore matches within code blocks. Additionally, it is recommended to avoid hardcoded language-specific strings like "[续]" to support multi-language environments.

- Clamp max_heading_depth to 1-6 to prevent regex errors
- Deduct prefix length when splitting oversized sections
- Replace hardcoded "[续]" with configurable continuation_prefix
- Skip fenced code blocks when detecting headings
- Cap pending size to prevent chunks exceeding chunk_size
- Refactor into dataclass + helper methods
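The depth-clamping fix in the first bullet can be sketched as (parameter name taken from the review discussion; the committed code may differ):

```python
import re

def heading_pattern(max_heading_depth: int) -> re.Pattern:
    """Build the ATX heading regex, clamping depth to the valid 1-6 range
    so out-of-range values cannot produce a malformed quantifier."""
    depth = max(1, min(6, max_heading_depth))
    return re.compile(rf"^(#{{1,{depth}}})\s+(.+)$", re.MULTILINE)
```

Because the quantifier is bounded and `\s` must follow the matched run of `#`, a depth-7 heading line is not matched even partially when the clamp is 6 or below.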