A C++ library for text (and maybe image) embeddings, focusing on efficient inference of BERT-like (and maybe CLIP-like) models.
Many existing GGML-based text embedding libraries have limited support for Chinese text processing due to their custom tokenizer implementations. This project addresses this limitation by leveraging Hugging Face's Rust tokenizer implementation, wrapped with a C++ API to ensure consistency with the Python transformers library while providing native performance.
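Because the C++ side wraps the same Rust tokenizer, reference token IDs can be produced from Python for comparison. A minimal sketch (requires `pip install tokenizers`; the model choice is just an example from the tested list below):

```python
# Token IDs from the Hugging Face tokenizer that the C++ wrapper loads;
# the C++ tokenizer is expected to reproduce these exactly.
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("BAAI/bge-m3")
print(tok.encode("你好,世界").ids)
```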
While currently focused on BERT-like text embedding models, the project aims to support image embedding models in the future (Work in Progress).
Note: This is an experimental and educational project. It is not recommended for production use at this time.
The following models have been tested and verified:
- BAAI/bge-m3
- BAAI/bge-base-zh-v1.5
- shibing624/text2vec-base-multilingual
- Snowflake/snowflake-arctic-embed-m-v2.0
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
The C++ implementation is checked against Python transformers CPU output. Models also supported by Hugging Face text-embeddings-inference can be checked against TEI as a third implementation. For repeatable correctness and performance runs, see scripts/ALIGNMENT_README.md.
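As a quick spot check outside those scripts, C++ embeddings can be compared against transformers CPU output with cosine similarity. A minimal sketch, assuming the model is loadable via `embeddings_cpp.load()` (shown in the Python section below) and uses CLS pooling with L2 normalization, as the BGE family does:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

from embeddings_cpp import load

model_id = "BAAI/bge-base-zh-v1.5"
texts = ["hello world", "你好,世界"]

# Reference embeddings: CLS token, L2-normalized (BGE-style pooling).
tok = AutoTokenizer.from_pretrained(model_id)
hf = AutoModel.from_pretrained(model_id).eval()
with torch.no_grad():
    out = hf(**tok(texts, padding=True, return_tensors="pt"))
ref = torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1).numpy()

# C++ embeddings; re-normalizing is harmless if they are already unit-length.
ours = np.asarray(load(model_id).batch_encode(texts))
ours = ours / np.linalg.norm(ours, axis=1, keepdims=True)

print((ref * ours).sum(axis=1))  # per-row cosine; expect values near 1.0
```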
First, install the required dependencies:
uv pip install --torch-backend cpu -r scripts/requirements.txt

Then convert the models to GGUF format:
# Convert BGE-M3 model
uv run scripts/convert.py BAAI/bge-m3 ./models/bge-m3.fp16.gguf f16
# Convert BGE-Base Chinese v1.5 model
uv run scripts/convert.py BAAI/bge-base-zh-v1.5 ./models/bge-base-zh-v1.5.fp16.gguf f16
uv run scripts/convert.py Snowflake/snowflake-arctic-embed-m-v2.0 ./models/snowflake-arctic-embed-m-v2.0.fp16.gguf f16
# Convert Text2Vec multilingual model
uv run scripts/convert.py shibing624/text2vec-base-multilingual ./models/text2vec-base-multilingual.fp16.gguf f16
uv run scripts/convert.py sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 ./models/paraphrase-multilingual-MiniLM-L12-v2.fp16.gguf f16

After converting models to GGUF format, you can quantize them to reduce memory usage and improve inference speed:
# Build the quantization tool
cmake --build build --target quantize
# Quantize a model (example with different quantization types)
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q4_k.gguf q4_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q6_k.gguf q6_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q8_0.gguf q8_0
# On Windows
.\build\Release\quantize.exe .\models\bge-m3.fp16.gguf .\models\bge-m3.q4_k.gguf q4_k

Supported quantization types:
- q4_k: 4-bit quantization with K-means clustering (good balance of size and quality)
- q6_k: 6-bit quantization with K-means clustering (higher quality, larger size)
- q8_0: 8-bit quantization (minimal quality loss, moderate size reduction)
- Other GGML quantization types as supported by the library
Usage:

quantize <input_model.gguf> <output_model.gguf> <qtype>
The quantization tool will:
- Load the input GGUF model
- Quantize eligible tensors (typically weight matrices)
- Preserve metadata and non-quantizable tensors
- Output size comparison and compression statistics
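To verify which tensors were actually converted, the output file's tensor dtypes can be listed with the gguf Python package. A minimal sketch (requires `pip install gguf`; the path is an example from above):

```python
from gguf import GGUFReader

reader = GGUFReader("./models/bge-m3.q4_k.gguf")
for t in reader.tensors:
    # Weight matrices should report the quantized type (e.g. Q4_K),
    # while norms and biases typically remain F32.
    print(f"{t.name:60s} {t.tensor_type.name}")
```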
Before running, install embeddings.cpp:
# Use CMAKE_ARGS to pass extra CMake settings (PowerShell syntax shown)
$env:CMAKE_ARGS="-DGGML_VULKAN=ON"
# Install the package
pip install .
# Generate Python stub files
cd build && make stub
# on Windows
pip install pybind11-stubgen
# then
pybind11-stubgen embeddings_cpp -o .
python tests/test_tokenizer.py

Run correctness checks for every model mentioned in this README:
uv run scripts/alignment.py --convert-missing

Include CPU performance comparisons:
uv run scripts/alignment.py --convert-missing --benchmark

Pin the C++ CPU thread count while tuning:
uv run scripts/alignment.py --benchmark --cpp-threads 8

For models also supported by text-embeddings-inference, start TEI as an additional comparator:
uv run scripts/alignment.py \
--models Snowflake/snowflake-arctic-embed-m-v2.0 \
--convert-missing \
--tei-start \
  --benchmark

For registry-driven Snowflake checks against the optimized mixed GGUF:
uv run scripts/correctness.py --model-id Snowflake/snowflake-arctic-embed-m-v2.0 --benchmark
uv run scripts/benchmark.py \
--model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --gguf-path models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf

Known optimized GGUF artifacts are listed in embeddings_cpp/registry.json.
The default Snowflake artifact is published under the chux0519 Hugging Face
namespace.
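To inspect what is registered without guessing the file's schema, the packaged registry can be dumped directly. A minimal sketch using importlib.resources:

```python
import json
from importlib import resources

# Pretty-print the registry bundled with the installed package.
raw = resources.files("embeddings_cpp").joinpath("registry.json").read_text()
print(json.dumps(json.loads(raw), indent=2))
```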
from embeddings_cpp import load
model = load("Snowflake/snowflake-arctic-embed-m-v2.0")
vectors = model.batch_encode(["hello world", "你好,世界"])

By default, CPU inference uses the detected CPU concurrency. Pin EMBEDDINGS_CPP_THREADS=N only after measuring a specific host or container CPU quota.
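If you do pin it, set the variable before the model is created, for example from Python. A sketch, assuming the value is read at load time:

```python
import os

# Assumption: EMBEDDINGS_CPP_THREADS is read when the model is loaded,
# so it must be set before calling load().
os.environ["EMBEDDINGS_CPP_THREADS"] = "8"

from embeddings_cpp import load

model = load("Snowflake/snowflake-arctic-embed-m-v2.0")
```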
Install the optional Hugging Face dependency when downloading from the Hub:
pip install "embeddings-cpp[hub]"The server can load a registered model from Hugging Face or a local GGUF path:
python -m embeddings_cpp.server \
--model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --port 8080

Endpoints:
- `GET /health`
- `POST /embed` with `{"inputs": ["hello", "world"]}`
- `POST /v1/embeddings` with an OpenAI-compatible embeddings request
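Once the server is up, both embedding endpoints can be exercised with the requests library. A minimal sketch; host and port match the command above, and the /v1/embeddings body follows the standard OpenAI request shape:

```python
import requests

base = "http://localhost:8080"

print(requests.get(f"{base}/health").status_code)

# Native endpoint; request shape taken from this README.
r = requests.post(f"{base}/embed", json={"inputs": ["hello", "world"]})
print(r.json())

# OpenAI-compatible endpoint.
r = requests.post(
    f"{base}/v1/embeddings",
    json={"model": "Snowflake/snowflake-arctic-embed-m-v2.0",
          "input": ["hello", "world"]},
)
print(r.json())
```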
Container images can be published to GHCR with
.github/workflows/publish-server-image.yml.
Configure and build with Metal support:
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DGGML_METAL=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
  -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..

If you run into an OpenMP issue, try:
brew install libomp
export OpenMP_ROOT=$(brew --prefix)/opt/libomp
Build with Vulkan support:
cmake -DGGML_VULKAN=ON -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..
# If you encounter any issues, ensure that your graphics driver and Vulkan SDK versions are compatible.
# You can also add -DGGML_VULKAN_DEBUG=ON -DGGML_VULKAN_VALIDATE=ON for debugging

GGML debug support is now enabled by default in the vendored version. This provides better debugging capabilities for CPU backend operations without requiring additional patches.
For more information about GGML debugging features, see: ggml-org/ggml#655