feat: default max_new_tokens to remaining max_req_total_len budget #1279
Conversation
Use `-1` as a sentinel when `max_new_tokens` is not explicitly provided, and resolve it to `max_req_total_len - prompt_tokens` during length validation so requests can output up to the full remaining budget.
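A minimal sketch of that resolution step, reusing the error text from the diff below; the function name and structure are illustrative, not the PR's actual code:

```python
UNSET_SENTINEL = -1  # illustrative alias for the -1 sentinel

def resolve_output_budget(max_new_tokens: int, prompt_tokens: int, max_req_total_len: int) -> int:
    """Resolve the -1 sentinel to the remaining token budget."""
    remaining = max_req_total_len - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"the input prompt token len {prompt_tokens} >= max_req_total_len:"
            f"{max_req_total_len}, no space left for output"
        )
    if max_new_tokens == UNSET_SENTINEL:
        return remaining  # default to the full remaining budget
    return max_new_tokens
```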
Code Review
This pull request introduces a sentinel value of `-1` for `max_new_tokens`, allowing the system to dynamically calculate the maximum output length based on the remaining budget within `max_req_total_len`. The feedback identifies a potential issue where the `prompt_tokens` metric might become stale if the prompt is truncated, suggesting it be updated after the repair call. Additionally, it recommends capping `min_new_tokens` by the resolved `max_new_tokens` to maintain consistency and avoid invariant violations when the sentinel value is used.
```python
prompt_tokens = len(prompt_ids)
prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
```
The `prompt_tokens` variable is calculated before the call to `_check_and_repair_length`. If the prompt is truncated within `_check_and_repair_length` (e.g., due to `long_truncation_mode`), `prompt_tokens` becomes stale. This affects the `lightllm_request_input_length` metric recorded on line 342, which will reflect the original length instead of the actual length processed by the model. It should be recomputed after the repair call so the metric stays accurate.
Suggested change:

```python
prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
prompt_tokens = len(prompt_ids)
```
| f"the input prompt token len {prompt_tokens} >= max_req_total_len:" | ||
| f"{self.max_req_total_len}, no space left for output" | ||
| ) | ||
| sampling_params.max_new_tokens = remaining |
When `max_new_tokens` is resolved from the sentinel value `-1` to the remaining budget, the resulting value can be smaller than `min_new_tokens` (if the user explicitly set a high `min_new_tokens`). This would violate the invariant `min_new_tokens <= max_new_tokens`, which is normally checked during initialization but skipped when `max_new_tokens` is `-1`. To maintain consistency and avoid issues in the inference engine, `min_new_tokens` should be capped by the resolved `max_new_tokens`.
Suggested change:

```python
sampling_params.max_new_tokens = remaining
sampling_params.min_new_tokens = min(sampling_params.min_new_tokens, sampling_params.max_new_tokens)
```
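A worked example, with illustrative numbers, of the invariant violation the cap prevents:

```python
max_req_total_len = 8192   # illustrative server limit
prompt_tokens = 8000
min_new_tokens = 500       # explicitly set by the user
max_new_tokens = -1        # sentinel: not provided

remaining = max_req_total_len - prompt_tokens         # 192
max_new_tokens = remaining                            # sentinel resolved to 192
# Without the cap: min_new_tokens (500) > max_new_tokens (192).
min_new_tokens = min(min_new_tokens, max_new_tokens)  # capped to 192
assert min_new_tokens <= max_new_tokens               # invariant restored
```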
Summary
- Use `-1` as a sentinel when `max_new_tokens` is not explicitly provided by the request.
- Resolve the sentinel to `max_req_total_len - prompt_tokens` during length validation, so requests can output up to the full remaining budget instead of the previous hard-coded `16384`.
- Relax `max_new_tokens` validation in both `py_sampling_params.py` and `sampling_params.py` to allow the `-1` sentinel (see the sketch after this list).
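A hedged sketch of what the relaxed check might look like; the actual code in `py_sampling_params.py` and `sampling_params.py` may differ in structure and error text:

```python
def verify_max_new_tokens(max_new_tokens: int) -> None:
    # -1 is the "unset" sentinel and is resolved against the remaining
    # budget later, so it must be allowed through this early check.
    if max_new_tokens == -1:
        return
    if max_new_tokens < 1:
        raise ValueError(f"max_new_tokens must be -1 (unset) or >= 1, got {max_new_tokens}")
```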
Test plan

- Send a request without `max_new_tokens` and verify output is allowed up to `max_req_total_len - prompt_tokens`.
- Send a request where `prompt_tokens >= max_req_total_len` and verify the new "no space left for output" error is raised.
- Send a request with an explicit `max_new_tokens` and verify behavior is unchanged.
- With `min_new_tokens` set and `max_new_tokens` unset, verify validation passes.
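Parts of this checklist could be automated against a standalone resolver such as the hypothetical `resolve_output_budget` sketch earlier in this thread:

```python
import pytest

# Assumes the resolve_output_budget sketch from earlier in this thread.
MAX_REQ_TOTAL_LEN = 2048  # illustrative config value

def test_sentinel_resolves_to_remaining_budget():
    assert resolve_output_budget(-1, prompt_tokens=48, max_req_total_len=MAX_REQ_TOTAL_LEN) == 2000

def test_no_space_left_for_output_raises():
    with pytest.raises(ValueError, match="no space left for output"):
        resolve_output_budget(-1, prompt_tokens=MAX_REQ_TOTAL_LEN, max_req_total_len=MAX_REQ_TOTAL_LEN)

def test_explicit_max_new_tokens_is_unchanged():
    assert resolve_output_budget(512, prompt_tokens=48, max_req_total_len=MAX_REQ_TOTAL_LEN) == 512
```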