GitHub - oidlabs-com/Lexoid: The open-source universal adapter for LLMs. Turn messy real-world data into clean, agent-ready context.

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

Take advantage of the multimodal capabilities of modern LLMs
Make document parsing convenient for users
Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables, or create a .env file with the same definitions:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
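Before running an LLM-based parse, it can help to confirm that the keys are actually visible to the process. The following is a minimal sketch using only the standard library; `missing_api_keys` is a hypothetical helper, not part of Lexoid, and it checks only the two variables listed above.

```python
import os

# Variable names taken from the definitions above.
LISTED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=LISTED_KEYS):
    """Return the names of keys that are unset or empty.

    Illustrative helper only; this is not a Lexoid function.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All listed API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.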

For local inference with Ollama, no API key is required. Install Ollama, pull the target model, and keep the local server running:

ollama pull gemma4
export OLLAMA_BASE_URL=127.0.0.1:11434
ollama list
ollama serve

Docker
Reference: https://docs.ollama.com/docker#run-model-locally
CPU example (will most likely be slower; remember to adjust `OLLAMA_TIMEOUT` as needed):

docker run -d -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_BASE_URL=0.0.0.0 -e OLLAMA_TIMEOUT=240 --name ollama ollama/ollama
docker exec -it ollama ollama pull gemma4:latest

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building `.whl` from source

Note

Installing the package from within the virtual environment could cause unexpected behavior, as Lexoid creates and activates its own environment in order to build the wheel.

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import ParserType, parse

# Parse a web page and let Lexoid choose the parser
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="AUTO")["raw"]
# or parse a local PDF with the LLM-based parser
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]
# or parse the same PDF with the static (non-LLM) parser
parsed_md = parse(pdf_path, parser_type="STATIC_PARSE")["raw"]

print(parsed_md)

Parameters

path (str): The file path or URL.
parser_type (str, optional): The type of parser to use ("AUTO", "LLM_PARSE", or "STATIC_PARSE"). Defaults to "AUTO".
pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
max_processes (int, optional): Maximum number of processes for parallel processing. Defaults to 4.
**kwargs: Additional arguments for the parser.
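To make the meaning of pages_per_split concrete, the sketch below shows how a document might be divided into chunks of at most four pages. It is only an illustration of the parameter; `split_page_ranges` is a hypothetical helper, and how Lexoid splits documents internally is not stated here.

```python
def split_page_ranges(total_pages, pages_per_split=4):
    """Return 1-based inclusive (start, end) page ranges.

    Each range covers at most `pages_per_split` pages.
    Illustrative only; not Lexoid's internal implementation.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_split)
    ]


if __name__ == "__main__":
    # A 10-page document with the default of 4 pages per split.
    print(split_page_ranges(10))  # [(1, 4), (5, 8), (9, 10)]
```

Under this reading, smaller values produce more chunks, which gives parallel workers more units to share.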

Command Line Usage

Lexoid provides a command-line interface for document parsing without writing Python code.

Installation

The CLI is automatically available after installing Lexoid:

pip install lexoid
lexoid --help

Alternatively, use with Python module syntax:

python -m lexoid --help

Parse Documents

Convert documents to markdown or JSON:

# Parse to stdout (default markdown)
lexoid parse --input document.pdf

# Save to file
lexoid parse --input document.pdf --output output.md

# Output as JSON (includes metadata, segments, token usage)
lexoid parse --input document.pdf --format json --output result.json

# Use specific parser (STATIC_PARSE, LLM_PARSE, or AUTO)
lexoid parse --input document.pdf --parser-type STATIC_PARSE

# Use specific LLM model
lexoid parse --input document.pdf --model gpt-4o

# Enable verbose logging
lexoid parse --input document.pdf --verbose

Extract Structured Data with JSON Schema

Extract data conforming to a JSON schema:

# Inline schema
lexoid schema \
  --input document.pdf \
  --schema '{"type": "object", "properties": {"title": {"type": "string"}}}' \
  --output result.json

# Schema from file
lexoid schema \
  --input document.pdf \
  --schema schema.json \
  --output result.json

# Specify LLM provider
lexoid schema \
  --input document.pdf \
  --schema schema.json \
  --api openai \
  --model gpt-4o
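The schema.json file used in the commands above can be written by hand or generated. As a minimal sketch, the following writes the same schema shown in the inline example to a file using the standard library; `write_schema` is a hypothetical helper name, not part of Lexoid.

```python
import json
from pathlib import Path

# Same schema as the inline example above.
SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
}


def write_schema(path, schema=SCHEMA):
    """Serialize `schema` as JSON to `path` and return the path."""
    path = Path(path)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(write_schema("schema.json"))
```

The resulting file can then be passed with --schema schema.json.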

Convert to LaTeX

Convert documents to LaTeX format:

# Convert to stdout
lexoid latex --input document.pdf

# Save to file
lexoid latex --input document.pdf --output output.tex

# Use specific model
lexoid latex --input document.pdf --model gpt-4o

Command-line Options

Common Options

--input, -i: Input file path (required) - Supports PDF, images, HTML, DOCX, XLSX, PPTX, or URLs
--output, -o: Output file path (optional) - If not specified, output is printed to stdout
--verbose, -v: Enable detailed logging

Parse Command

lexoid parse --help

--parser-type, -p: Parser type - AUTO (default), LLM_PARSE, or STATIC_PARSE
--model, -m: LLM model name (default: gemini-2.5-flash)
--pages-per-split: Pages per chunk (default: 4)
--max-processes: Parallel processes (default: 4)
--framework: Static parser framework - pdfplumber or paddleocr
--format: Output format - markdown (default, plain markdown text) or json (full result with metadata, segments, token usage)
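When the CLI is driven from a script, building the argument list programmatically avoids shell-quoting mistakes. The sketch below assembles a lexoid parse command from the options listed above; `build_parse_command` is a hypothetical helper, and it only constructs the command without executing it.

```python
import shlex


def build_parse_command(input_path, output=None, fmt=None,
                        parser_type=None, model=None):
    """Build an argument list for `lexoid parse`.

    Only flags documented above are used. Options left as None are
    omitted so the CLI falls back to its own defaults.
    """
    cmd = ["lexoid", "parse", "--input", str(input_path)]
    if output is not None:
        cmd += ["--output", str(output)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if parser_type is not None:
        cmd += ["--parser-type", parser_type]
    if model is not None:
        cmd += ["--model", model]
    return cmd


if __name__ == "__main__":
    cmd = build_parse_command("document.pdf", output="result.json", fmt="json")
    # Print a shell-safe rendering of the command.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once Lexoid is installed.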

Schema Command

lexoid schema --help

--schema, -s: JSON schema (file path or inline JSON, required)
--model, -m: LLM model (default: gpt-4o-mini)
--api: API provider - openai, gemini, anthropic, ollama, etc. (auto-detected if not specified)
--example-schema: Provide example data for the schema
--fill-single-schema: Auto-fill single schemas

LaTeX Command

lexoid latex --help

--model, -m: LLM model (default: gpt-4o-mini)
--api: API provider - openai, gemini, anthropic, ollama, etc. (auto-detected if not specified)

Supported API Providers

Google
OpenAI
Hugging Face
Together AI
OpenRouter
Fireworks
Ollama

Ollama Local Parsing

Lexoid supports local LLM_PARSE inference through Ollama. The initial recommended model is gemma4:latest.

from lexoid.api import parse

result = parse(
    "path/to/document.pdf",
    parser_type="LLM_PARSE",
    api_provider="ollama",
    model="gemma4:latest",
    max_processes=1,
)

print(result["raw"])

Notes:

Ollama uses the default local endpoint http://localhost:11434 unless OLLAMA_BASE_URL is set.
Lexoid forces max_processes=1 for Ollama-backed parsing to avoid local multiprocess contention.
AUTO routing does not select Ollama in this first version; choose it explicitly with api_provider="ollama".
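The first two notes can be read as simple rules. The sketch below restates them in code for illustration; both function names are hypothetical, and this is not Lexoid's actual implementation.

```python
import os

# Default endpoint as stated in the notes above.
DEFAULT_OLLAMA_BASE_URL = "http://localhost:11434"


def resolve_ollama_base_url(env=None):
    """Return OLLAMA_BASE_URL if set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_BASE_URL") or DEFAULT_OLLAMA_BASE_URL


def effective_max_processes(api_provider, requested):
    """Return the process count that applies for a provider.

    Ollama-backed parsing is limited to a single process; other
    providers keep the requested value.
    """
    return 1 if api_provider == "ollama" else requested


if __name__ == "__main__":
    print(resolve_ollama_base_url())
    print(effective_max_processes("ollama", 4))
```

In practice this means a larger max_processes value has no effect when api_provider="ollama".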

Benchmark

Results aggregated across 14 documents.

Note: Benchmarks are currently done in the zero-shot setting.

Rank	Model	SequenceMatcher Similarity	TFIDF Similarity	Time (s)	Cost ($)
1	gemini-3-pro-preview	0.917 (±0.127)	0.943 (±0.159)	46.92	0.06288
2	gemini-3.5-flash	0.914 (±0.138)	0.989 (±0.016)	16.70	0.02936
3	AUTO	0.901 (±0.134)	0.988 (±0.016)	11.53	0.02327
4	gemini-3.1-pro-preview	0.900 (±0.183)	0.978 (±0.043)	45.49	0.02892
5	AUTO (with auto-selected model)	0.899 (±0.131)	0.960 (±0.066)	21.17	0.00066
6	gpt-5.2	0.890 (±0.193)	0.975 (±0.036)	33.32	0.03959
7	gemini-2.5-flash	0.886 (±0.164)	0.986 (±0.027)	52.55	0.01226
8	mistral-ocr-latest	0.882 (±0.106)	0.932 (±0.091)	5.75	0.00121
9	gemini-2.5-pro	0.876 (±0.195)	0.976 (±0.049)	22.65	0.02408
10	gemini-2.0-flash	0.875 (±0.148)	0.977 (±0.037)	11.96	0.00079
11	gpt-5.5	0.874 (±0.209)	0.939 (±0.138)	72.11	0.14495
12	gemini-3.1-flash-lite	0.869 (±0.211)	0.969 (±0.050)	14.98	0.00288
13	claude-3-5-sonnet-20241022	0.858 (±0.184)	0.930 (±0.098)	17.32	0.01804
14	gemini-1.5-flash	0.842 (±0.214)	0.969 (±0.037)	15.58	0.00043
15	gpt-5.4-mini	0.835 (±0.210)	0.948 (±0.066)	13.14	0.00902
16	gpt-5-mini	0.819 (±0.201)	0.917 (±0.104)	52.84	0.00811
17	gpt-5	0.807 (±0.215)	0.919 (±0.088)	98.12	0.05505
18	gpt-5.4	0.803 (±0.238)	0.936 (±0.150)	31.98	0.03887
19	claude-sonnet-4-20250514	0.801 (±0.188)	0.905 (±0.136)	22.02	0.02056
20	claude-opus-4-20250514	0.789 (±0.220)	0.886 (±0.148)	29.55	0.09513
21	accounts/fireworks/models/llama4-maverick-instruct-basic	0.772 (±0.203)	0.930 (±0.117)	16.02	0.00147
22	gemini-1.5-pro	0.767 (±0.309)	0.865 (±0.230)	24.77	0.01139
23	gemini-3-flash-preview	0.766 (±0.293)	0.858 (±0.210)	39.38	0.00969
24	claude-opus-4-8	0.764 (±0.254)	0.863 (±0.154)	11.10	0.03195
25	claude-sonnet-4-6	0.757 (±0.302)	0.843 (±0.206)	16.50	0.01804
26	gpt-4.1-mini	0.754 (±0.249)	0.803 (±0.193)	23.28	0.00347
27	accounts/fireworks/models/llama4-scout-instruct-basic	0.754 (±0.243)	0.942 (±0.063)	13.36	0.00087
28	gpt-4o	0.752 (±0.269)	0.896 (±0.123)	28.87	0.01469
29	gpt-4o-mini	0.728 (±0.241)	0.850 (±0.128)	18.96	0.00609
30	claude-haiku-4-5-20251001	0.683 (±0.300)	0.841 (±0.187)	7.86	0.00504
31	claude-3-7-sonnet-20250219	0.646 (±0.397)	0.758 (±0.297)	57.96	0.01730
32	gpt-4.1	0.637 (±0.301)	0.787 (±0.185)	35.37	0.01498
33	google/gemma-3-27b-it	0.604 (±0.342)	0.788 (±0.297)	23.16	0.00020
34	ds4sd/SmolDocling-256M-preview	0.603 (±0.292)	0.705 (±0.262)	507.74	0.00000
35	gpt-5.4-nano	0.600 (±0.309)	0.856 (±0.119)	22.51	0.00321
36	microsoft/phi-4-multimodal-instruct	0.589 (±0.273)	0.820 (±0.197)	14.00	0.00045
37	qwen/qwen-2.5-vl-7b-instruct	0.498 (±0.378)	0.630 (±0.445)	14.73	0.00056
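For readers unfamiliar with the first metric, the sketch below shows one way a SequenceMatcher similarity and its mean and spread could be computed. It assumes the column refers to the ratio from Python's difflib and that the ± values are standard deviations. The benchmark's text normalization and the kind of deviation it uses (population or sample) are not stated here, so this is an approximation, not the benchmark code.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def sequence_similarity(expected, parsed):
    """Return difflib's similarity ratio in [0, 1] for two strings."""
    return SequenceMatcher(None, expected, parsed).ratio()


def summarize(scores):
    """Return (mean, population standard deviation) of per-document scores.

    Using the population standard deviation is an assumption.
    """
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Hypothetical (ground truth, parser output) pairs.
    pairs = [
        ("# Title\n\nSome text.", "# Title\n\nSome text."),
        ("abcd", "abef"),
    ]
    scores = [sequence_similarity(truth, out) for truth, out in pairs]
    print(scores)  # [1.0, 0.5]
    print(summarize(scores))
```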

Citation

If you use Lexoid in production or publications, please cite accordingly and acknowledge usage. We appreciate the support 🙏
