Hybrid chunking using OpenAI Tokenizer #260

@mohammedfaisal

The documentation explains how to configure and use the OpenAI tokenizer with the hybrid chunker in Python:

    import tiktoken

    from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
    from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

    tokenizer = OpenAITokenizer(
        tokenizer=tiktoken.encoding_for_model("gpt-4o"),
        max_tokens=128 * 1024,  # context window length required for OpenAI tokenizers
    )

    chunker = HybridChunker(
        tokenizer=tokenizer,
        merge_peers=True,  # optional, defaults to True
    )

    # doc is a previously converted DoclingDocument
    chunk_iter = chunker.chunk(dl_doc=doc)
    chunks = list(chunk_iter)

How can I do the same with docling-java?
HybridChunkerOptions.builder().tokenizer() seems to support only HuggingFace models.
