Add optional --skip-validation flag to benchmark recipes and XPK workload creation #3708
    "--skip-validation",
    action="store_true",
    default=False,
    help="Skip validation during workload creation in xpk.",
Nit: Can you please add some info on what validation is being skipped here? It looks like this is passing a pathways-specific flag through. If that is the case, maybe this should just also say something about this being a pathways-specific flag?
Actually, this isn't Pathways-specific. It's a general flag for the xpk workload create command.
In environments like Airflow/Composer, we only use workload create and create-pathway commands and don't actually need system dependencies like docker or kubectl-kjob. Currently, xpk validation fails if those tools aren't installed. Adding this flag allows us to skip those non-essential checks, significantly reducing the complexity and 'workarounds' needed for our CI/CD integration.
I've updated the help text to reflect that this skips system dependency and health validation.
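For readers following the thread, the flag under discussion can be sketched as a standalone argparse definition. This is a minimal illustration, not the recipe's actual parser: the `build_parser` function and the exact help wording are assumptions based on the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the benchmark recipe's argument parser."""
    parser = argparse.ArgumentParser(description="Illustrative recipe CLI")
    parser.add_argument(
        "--skip-validation",
        action="store_true",  # the flag takes no value; presence means True
        default=False,        # validation stays on unless the flag is passed
        # Assumed wording, paraphrasing the updated help text mentioned above.
        help="Skip system dependency and health validation during "
        "workload creation in xpk.",
    )
    return parser


# argparse exposes the dashed flag as the attribute `skip_validation`.
default_args = build_parser().parse_args([])
flagged_args = build_parser().parse_args(["--skip-validation"])
print(default_args.skip_validation, flagged_args.skip_validation)
```

Because the flag uses `store_true` with a `False` default, callers that never pass it see no change in behavior.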
Codecov Report: All modified and coverable lines are covered by tests.
baa2f1c to 9a513e1
Description
Context
Previously, benchmark recipe interfaces (e.g., pw_mcjax_benchmark_recipe) lacked a mechanism to propagate the --skip-validation intent to the underlying generated commands. This prevented users from bypassing validation steps during benchmark execution.
Solution
Introduced an optional --skip-validation flag to the benchmark recipe interface. This flag is threaded down to the xpk workload create command to bypass system dependency checks (such as docker or kubectl-kjob) and resource validation. This is specifically required for Airflow/Composer integration, where these tools are not available or necessary for launching workloads, thus avoiding unnecessary environment complexity.
Default Behavior: If the flag is omitted, the existing validation logic remains active, ensuring backward compatibility.
New Behavior: When the flag is provided, the validation phase is skipped during workload creation.
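The threading described above can be sketched roughly as follows. This is an illustrative sketch only: `build_workload_create_command` is a hypothetical helper, not the function in this PR, and the real recipe passes many more arguments to xpk than the two shown here.

```python
import shlex
from typing import List


def build_workload_create_command(
    cluster: str, workload: str, skip_validation: bool = False
) -> List[str]:
    """Hypothetical helper that assembles an xpk workload create command."""
    cmd = [
        "xpk",
        "workload",
        "create",
        f"--cluster={cluster}",
        f"--workload={workload}",
    ]
    # Only append the flag when requested, so the default command is
    # unchanged and existing validation stays active.
    if skip_validation:
        cmd.append("--skip-validation")
    return cmd


# Cluster and workload names are taken from the test command in this PR.
default_cmd = build_workload_create_command("pw-v6e-32x4", "roo-pw-default-1-ha5")
skip_cmd = build_workload_create_command(
    "pw-v6e-32x4", "roo-pw-default-1-ha5", skip_validation=True
)
print(shlex.join(default_cmd))
print(shlex.join(skip_cmd))
```

Appending the flag conditionally, rather than always emitting something like a true/false value, keeps the generated command identical to the pre-change output when the flag is omitted.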
Tests
Validated the flag propagation by executing the pw_mcjax_benchmark_recipe with the new parameter:
Execution Command:
python3 -m benchmarks.recipes.pw_mcjax_benchmark_recipe --user=root --cluster_name=pw-v6e-32x4 --project=cienet-cmcs --zone=us-central1-b --benchmark_steps=20 --num_slices_list=1 --server_image=us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:latest --proxy_image=us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:latest --runner=gcr.io/tpu-prod-env-multipod/skip_val:latest --selected_model_framework=pathways --selected_model_names=default_basic_1 --priority=medium --max_restarts=1 --bq_enable=False --bq_db_project=cloud-tpu-multipod-dev --bq_db_dataset=chzheng_test_100steps --workload_id=roo-pw-default-1-ha5 --device_type=v6e-32 --skip-validation
Verification:
Execution logs confirm that the flag was correctly parsed and passed to the xpk command.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.