Skip to content

feat: add [ LiteLLM AI Gateway ] for provider independence#186

Merged
ekeith (evanmkeith) merged 3 commits into
braintrustdata:mainfrom
RheagalFire:feat/add-litellm-provider
Jun 4, 2026
Merged

feat: add [ LiteLLM AI Gateway ] for provider independence#186
ekeith (evanmkeith) merged 3 commits into
braintrustdata:mainfrom
RheagalFire:feat/add-litellm-provider

Conversation

@RheagalFire

@RheagalFire Aarish Alam (RheagalFire) commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add LiteLLMClient / AsyncLiteLLMClient in py/autoevals/litellm.py: OpenAI-compatible adapters backed by litellm.completion() / litellm.acompletion() (plus embeddings and moderation).
  • Export both from py/autoevals/init.py so users can do init(client=LiteLLMClient()).
  • Add litellm to extras_require: install with pip install 'autoevals[litellm]'.
  • Add py/autoevals/test_litellm.py with 9 mocked unit tests covering chat, embeddings, moderation, async, end-to-end init() wiring, and the Responses-API shim.
  • Followup commit adds a Responses-API shim in LiteLLMClient.responses.create / AsyncLiteLLMClient.responses.create. Without it, init(client=LiteLLMClient()) with autoevals' default gpt-5-mini model would crash: oai.py routes gpt-5 models through prepare_responses_params which sends input=... and a flat tool schema, but litellm.completion expects messages=... with nested tool schema. The shim translates back.

Fits cleanly into the existing LLMClient architecture (py/autoevals/oai.py:129) which is duck-typed on the OpenAI v1 protocol. The adapter implements that surface; no changes to core.

Changes

  • py/autoevals/litellm.py: LiteLLMClient / AsyncLiteLLMClient + _LiteLLMResponses adapter that translates Responses-API params (input=, flat tool schema) back to Chat-Completions params (messages=, nested tool schema) before calling litellm.completion.

  • py/autoevals/init.py: re-exports the new clients.

  • setup.py: litellm optional extra.

  • py/autoevals/test_litellm.py: 9 mocked tests (adds coverage for Responses-API shim input→messages translation and flat→nested tool-schema translation).

    Testing & Usage

Unit tests (all pass):

  $ pytest py/autoevals/test_litellm.py -v
  py/autoevals/test_litellm.py::test_litellm_client_exposes_openai_v1_surface PASSED                                                                                                                                                                                                                                                                                      
  py/autoevals/test_litellm.py::test_litellm_chat_completions_forwards_to_litellm PASSED                                                                                                                                                                                                                                                                                  
  py/autoevals/test_litellm.py::test_litellm_client_without_api_key_does_not_forward_key PASSED                                                                                                                                                                                                                                                                           
  py/autoevals/test_litellm.py::test_litellm_embeddings_forwards_to_litellm PASSED                                                                                                                                                                                                                                                                                        
  py/autoevals/test_litellm.py::test_litellm_moderations_forwards_to_litellm PASSED                                                                                                                                                                                                                                                                                       
  py/autoevals/test_litellm.py::test_litellm_responses_create_translates_input_to_messages PASSED                                                                                                                                                                                                                                                                         
  py/autoevals/test_litellm.py::test_litellm_responses_create_translates_responses_api_tool_schema PASSED                                                                                                                                                                                                                                                                 
  py/autoevals/test_litellm.py::test_async_litellm_chat_completions_forwards PASSED                                                                                                                                                                                                                                                                                       
  py/autoevals/test_litellm.py::test_init_accepts_litellm_client PASSED                                                                                                                                                                                                                                                                                                   
  ============================== 9 passed in 0.61s ===============================                                                                                                                                                                                                                                                                                           

Live end-to-end smoke test against Azure OpenAI (azure/gpt-4o):

  [Test 1] LiteLLMClient.chat.completions.create, model=azure/gpt-4o                                                                                                                                                                                                                                                                                                      
    response: '4'
  [Test 2] Factuality scorer with init(client=LiteLLMClient())
    score: 0.6
    metadata: {'choice': 'B', 'rationale': 'Step 1: The expert answer states "George Washington." ... Step 3: Therefore, the submitted answer includes the information found in the expert answer and expresses it in a broader form, but remains fully consistent with the expert answer. Conclusion: The submitted answer is a superset of the expert answer and is
  fully consistent with it.'}
  
  [Test 3] Responses-API shim: client.responses.create(input=..., model=azure/gpt-4o)                                                                                                                                                                                                                                                                                     
           (Path autoevals takes for gpt-5 models. Shim translates input=                                                                                                                                                                                                                                                                                                 
           back to messages= before calling litellm.completion.)                                                                                                                                                                                                                                                                                                          
    response: '10'                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                          
  autoevals LiteLLM live test PASSED (chat + scorer + responses-shim).                                                                                                                                                                                                                                                                                                                                                

This exercised three paths. (1) raw chat.completions.create routed to litellm.completion. (2) full scorer path init(client=LiteLLMClient()) → Factuality.eval() → LLMClient.complete → shim → litellm.completion → parsed score with rationale. (3) Responses-API shim with input=... kwarg, which translates to messages=... before reaching LiteLLM (exercises the fix for the default gpt-5-mini routing).

Example usage

from autoevals import init
from autoevals.litellm import LiteLLMClient
from autoevals.llm import Factuality

init(
client=LiteLLMClient(),
default_model="anthropic/claude-3-5-sonnet-20241022",
)

evaluator = Factuality()
result = evaluator.eval(input="...", output="...", expected="...")

init(client=LiteLLMClient(), default_model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")
init(client=LiteLLMClient(), default_model="gemini/gemini-1.5-pro")
init(client=LiteLLMClient(), default_model="ollama/llama3")

from autoevals.litellm import AsyncLiteLLMClient
init(client=AsyncLiteLLMClient(), default_model="openai/gpt-4o-mini")

@RheagalFire

Copy link
Copy Markdown
Contributor Author

cc Ankur Goyal (@ankrgyl) Olmo Maldonado (@ibolmo). would like your review here.

@RheagalFire

Copy link
Copy Markdown
Contributor Author

ekeith (@evanmkeith) do you have any update on this PR?

@evanmkeith ekeith (evanmkeith) merged commit 0278eff into braintrustdata:main Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Braintrust eval report

Autoevals (main-1780605537)

Score Average Improvements Regressions
NumericDiff 76.3% (-1pp) 8 🟢 10 🔴
Time_to_first_token 11.83tok (-0.12tok) 39 🟢 77 🔴
Llm_calls 1.09 (-0.45) - 100 🔴
Tool_calls 0 (+0) - -
Errors 0 (+0) - -
Llm_errors 0 (+0) - -
Tool_errors 0 (+0) - -
Prompt_tokens 308.38tok (-220.04tok) 103 🟢 -
Prompt_cached_tokens 0tok (+0tok) - -
Prompt_cache_creation_tokens 0tok (+0tok) - -
Prompt_cache_creation_5m_tokens 0tok (+0tok) - -
Prompt_cache_creation_1h_tokens 0tok (+0tok) - -
Completion_tokens 257.38tok (-226.07tok) 157 🟢 52 🔴
Completion_reasoning_tokens 0tok (-371.2tok) 219 🟢 -
Total_tokens 565.76tok (-446.11tok) 157 🟢 52 🔴
Estimated_cost 0$ (0$) 52 🟢 51 🔴
Duration 11.57s (-0.38s) 64 🟢 152 🔴
Llm_duration 13.1s (-1.15s) 83 🟢 35 🔴

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants