feat: default max_new_tokens to remaining max_req_total_len budget #1279
Conversation
Use `-1` as a sentinel when `max_new_tokens` is not explicitly provided, and resolve it to `max_req_total_len - prompt_tokens` during length validation so requests can output up to the full remaining budget.
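A minimal sketch of that resolution step, reusing the error text from the diff below; the function name and structure are illustrative, not the PR's actual code:

```python
UNSET_SENTINEL = -1  # illustrative alias for the -1 sentinel

def resolve_output_budget(max_new_tokens: int, prompt_tokens: int, max_req_total_len: int) -> int:
    """Resolve the -1 sentinel to the remaining token budget."""
    remaining = max_req_total_len - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"the input prompt token len {prompt_tokens} >= max_req_total_len:"
            f"{max_req_total_len}, no space left for output"
        )
    if max_new_tokens == UNSET_SENTINEL:
        return remaining  # default to the full remaining budget
    return max_new_tokens
```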
Code Review
This pull request introduces a sentinel value of `-1` for `max_new_tokens`, allowing the system to dynamically calculate the maximum output length based on the remaining budget within `max_req_total_len`. The feedback identifies a potential issue where the `prompt_tokens` metric might become stale if the prompt is truncated, suggesting it be updated after the repair call. Additionally, it recommends capping `min_new_tokens` by the resolved `max_new_tokens` to maintain consistency and avoid invariant violations when the sentinel value is used.
```python
prompt_tokens = len(prompt_ids)
prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
```
The `prompt_tokens` variable is calculated before the call to `_check_and_repair_length`. If the prompt is truncated within `_check_and_repair_length` (e.g., due to `long_truncation_mode`), `prompt_tokens` becomes stale. This affects the `lightllm_request_input_length` metric recorded on line 342, which will reflect the original length instead of the actual length processed by the model. It should be recomputed after the repair call so the metric stays accurate.
Suggested change:

```python
prompt_ids = await self._check_and_repair_length(prompt_ids, sampling_params)
prompt_tokens = len(prompt_ids)
```
| f"the input prompt token len {prompt_tokens} >= max_req_total_len:" | ||
| f"{self.max_req_total_len}, no space left for output" | ||
| ) | ||
| sampling_params.max_new_tokens = remaining |
When `max_new_tokens` is resolved from the sentinel value `-1` to the remaining budget, the resulting value can be smaller than `min_new_tokens` (if the user explicitly set a high `min_new_tokens`). This would violate the invariant `min_new_tokens <= max_new_tokens`, which is normally checked during initialization but skipped when `max_new_tokens` is `-1`. To maintain consistency and avoid issues in the inference engine, `min_new_tokens` should be capped by the resolved `max_new_tokens`.
Suggested change:

```python
sampling_params.max_new_tokens = remaining
sampling_params.min_new_tokens = min(sampling_params.min_new_tokens, sampling_params.max_new_tokens)
```
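A worked example, with illustrative numbers, of the invariant violation the cap prevents:

```python
max_req_total_len = 8192   # illustrative server limit
prompt_tokens = 8000
min_new_tokens = 500       # explicitly set by the user
max_new_tokens = -1        # sentinel: not provided

remaining = max_req_total_len - prompt_tokens         # 192
max_new_tokens = remaining                            # sentinel resolved to 192
# Without the cap: min_new_tokens (500) > max_new_tokens (192).
min_new_tokens = min(min_new_tokens, max_new_tokens)  # capped to 192
assert min_new_tokens <= max_new_tokens               # invariant restored
```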
Summary
- Use `-1` as a sentinel when `max_new_tokens` is not explicitly provided by the request.
- Resolve the sentinel to `max_req_total_len - prompt_tokens` during length validation, so requests can output up to the full remaining budget instead of the previous hard-coded `16384`.
- Relax `max_new_tokens` validation in both `py_sampling_params.py` and `sampling_params.py` to allow the `-1` sentinel (see the sketch after this list).
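A hedged sketch of what the relaxed check might look like; the actual code in `py_sampling_params.py` and `sampling_params.py` may differ in structure and error text:

```python
def verify_max_new_tokens(max_new_tokens: int) -> None:
    # -1 is the "unset" sentinel and is resolved against the remaining
    # budget later, so it must be allowed through this early check.
    if max_new_tokens == -1:
        return
    if max_new_tokens < 1:
        raise ValueError(f"max_new_tokens must be -1 (unset) or >= 1, got {max_new_tokens}")
```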
Test plan

- Send a request without `max_new_tokens` and verify output is allowed up to `max_req_total_len - prompt_tokens`.
- Send a request where `prompt_tokens >= max_req_total_len` and verify the new "no space left for output" error is raised.
- Send a request with an explicit `max_new_tokens` and verify behavior is unchanged.
- With `min_new_tokens` set and `max_new_tokens` unset, verify validation passes.
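Parts of this checklist could be automated against a standalone resolver such as the hypothetical `resolve_output_budget` sketch earlier in this thread:

```python
import pytest

# Assumes the resolve_output_budget sketch from earlier in this thread.
MAX_REQ_TOTAL_LEN = 2048  # illustrative config value

def test_sentinel_resolves_to_remaining_budget():
    assert resolve_output_budget(-1, prompt_tokens=48, max_req_total_len=MAX_REQ_TOTAL_LEN) == 2000

def test_no_space_left_for_output_raises():
    with pytest.raises(ValueError, match="no space left for output"):
        resolve_output_budget(-1, prompt_tokens=MAX_REQ_TOTAL_LEN, max_req_total_len=MAX_REQ_TOTAL_LEN)

def test_explicit_max_new_tokens_is_unchanged():
    assert resolve_output_budget(512, prompt_tokens=48, max_req_total_len=MAX_REQ_TOTAL_LEN) == 512
```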