
feat: default max_new_tokens to remaining max_req_total_len budget#1279

Open
sufubao wants to merge 1 commit into main from feat/default-max-new-tokens-budget

Conversation

@sufubao (Collaborator) commented Apr 20, 2026

Summary

  • Use -1 as a sentinel when max_new_tokens is not explicitly provided by the request.
  • Resolve the sentinel to max_req_total_len - prompt_tokens during length validation, so requests can output up to the full remaining budget instead of the previous hard-coded 16384.
  • Relax max_new_tokens validation in both py_sampling_params.py and sampling_params.py to allow the -1 sentinel.
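For illustration, a minimal sketch of the resolution logic described above (the function name and structure are illustrative; in the PR the logic lives in the length-validation path and the two sampling-params modules):

```python
UNSET_MAX_NEW_TOKENS = -1  # sentinel: request did not set max_new_tokens


def resolve_max_new_tokens(prompt_tokens: int, max_req_total_len: int,
                           max_new_tokens: int) -> int:
    """Resolve the -1 sentinel to the remaining token budget."""
    if max_new_tokens != UNSET_MAX_NEW_TOKENS:
        # Explicitly provided by the request: behavior unchanged.
        return max_new_tokens
    remaining = max_req_total_len - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"the input prompt token len {prompt_tokens} >= max_req_total_len:"
            f"{max_req_total_len}, no space left for output"
        )
    return remaining
```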

Test plan

  • Send a request without max_new_tokens and verify output is allowed up to max_req_total_len - prompt_tokens.
  • Send a request with prompt_tokens >= max_req_total_len and verify the new "no space left for output" error is raised.
  • Send a request with an explicit max_new_tokens and verify behavior is unchanged.
  • Send a request with min_new_tokens set and max_new_tokens unset, verify validation passes.
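The first two items can be exercised with a request sketch like the following, assuming a locally running lightllm api_server with a /generate HTTP endpoint and TGI-style payload (adjust URL, port, and parameter names to the actual deployment):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local lightllm api_server

# 1) Omit max_new_tokens: the server should resolve the -1 sentinel and
#    allow output up to max_req_total_len - prompt_tokens.
resp = requests.post(f"{BASE_URL}/generate", json={
    "inputs": "What is the capital of France?",
    "parameters": {},  # max_new_tokens omitted on purpose
})
print(resp.status_code, resp.text)

# 2) Oversized prompt: expect the new "no space left for output" error.
resp = requests.post(f"{BASE_URL}/generate", json={
    "inputs": "word " * 200_000,  # long enough to exceed max_req_total_len
    "parameters": {},
})
print(resp.status_code, resp.text)
```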

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a -1 sentinel for max_new_tokens, letting the system derive the maximum output length from the remaining budget within max_req_total_len. The review flags a potential issue: the prompt_tokens value can go stale if the prompt is truncated, so it should be recomputed after the repair call. It also recommends capping min_new_tokens by the resolved max_new_tokens to preserve the min_new_tokens <= max_new_tokens invariant when the sentinel is used.


prompt_tokens = len(prompt_ids)
prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
@gemini-code-assist (bot) commented (severity: medium)

The prompt_tokens variable is calculated before the call to _check_and_repair_length. If the prompt is truncated within _check_and_repair_length (e.g., due to long_truncation_mode), the prompt_tokens variable will become stale. This affects the lightllm_request_input_length metric recorded on line 342, which will reflect the original length instead of the actual length processed by the model. It should be updated after the repair call to ensure metrics are accurate.

Suggested change:
- prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
+ prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
+ prompt_tokens = len(prompt_ids)
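To see why the order matters: if the repair step truncates the prompt, measuring before it records a length the model never sees. A toy illustration (the slice stands in for _check_and_repair_length):

```python
prompt_ids = list(range(20_000))
prompt_tokens = len(prompt_ids)   # 20000 -- stale once the prompt is cut
prompt_ids = prompt_ids[:16_384]  # stand-in for _check_and_repair_length
prompt_tokens = len(prompt_ids)   # 16384 -- matches what the model processes
```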

f"the input prompt token len {prompt_tokens} >= max_req_total_len:"
f"{self.max_req_total_len}, no space left for output"
)
sampling_params.max_new_tokens = remaining
@gemini-code-assist (bot) commented (severity: medium)

When max_new_tokens is resolved from the sentinel value -1 to the remaining budget, it is possible that the resulting value is smaller than min_new_tokens (if the user explicitly set a high min_new_tokens). This would violate the invariant min_new_tokens <= max_new_tokens, which is normally checked during initialization but skipped when max_new_tokens is -1. To maintain consistency and avoid potential issues in the inference engine, min_new_tokens should be capped by the resolved max_new_tokens.

Suggested change:
- sampling_params.max_new_tokens = remaining
+ sampling_params.max_new_tokens = remaining
+ sampling_params.min_new_tokens = min(sampling_params.min_new_tokens, sampling_params.max_new_tokens)
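Put together, the sentinel resolution plus the suggested clamp might look like this (sampling_params stands in for the real object; names follow the PR, the function itself is illustrative):

```python
def finalize_new_token_budget(sampling_params, prompt_tokens: int,
                              max_req_total_len: int) -> None:
    """Resolve the -1 sentinel and keep min_new_tokens <= max_new_tokens."""
    if sampling_params.max_new_tokens == -1:
        remaining = max_req_total_len - prompt_tokens
        if remaining <= 0:
            raise ValueError("no space left for output")
        sampling_params.max_new_tokens = remaining
    # Clamp so the invariant holds even when the remaining budget is small.
    sampling_params.min_new_tokens = min(
        sampling_params.min_new_tokens, sampling_params.max_new_tokens
    )
```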

