[ML] Fix PyTorch allowlist validation timeout on HF download stall #3022
edsavage wants to merge 7 commits into elastic:main
Conversation
Build elastic#2497 timed out because a HuggingFace model download stalled at 0% for 58 minutes (unauthenticated rate limiting). Two fixes:
1. Add HF_TOKEN injection to the validation step via the post-checkout hook, reading from vault (secret/ci/elastic-ml-cpp/huggingface/hf_token). Authenticated requests get higher rate limits and more reliable downloads from HuggingFace Hub.
2. Add a per-model timeout (default 10 minutes, configurable via --model-timeout) using SIGALRM. Models that cannot be downloaded and traced within the timeout are skipped rather than consuming the entire step timeout. This prevents a single stalled download from failing the whole validation run.

Made-with: Cursor
✅ Snyk checks have passed. No issues have been found so far.
When the vault secret was missing or the read failed, HF_TOKEN was set to an empty string. HuggingFace's client then sent 'Authorization: Bearer ' (no token), which is an invalid HTTP header, causing all model loads to fail with 'Illegal header value'. Two fixes:
- post-checkout: only export HF_TOKEN if the vault read returned a non-empty value
- torchscript_utils.py: treat the empty string as None so HuggingFace falls back to unauthenticated requests

Made-with: Cursor
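The `torchscript_utils.py` side of this fix amounts to normalising the token before it reaches the HuggingFace client. This is a minimal sketch with an illustrative helper name, not the exact code from the PR:

```python
import os


def hf_token_or_none():
    """Return HF_TOKEN only when it is set and non-empty.

    An empty string would make the HuggingFace client send
    'Authorization: Bearer ' (no token), which is rejected as an
    illegal header value; unset and "" both fall back to None so
    requests go out unauthenticated instead.
    """
    token = os.environ.get("HF_TOKEN")
    return token if token else None  # "" and unset both become None
```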
transformers 5.x introduced Python type annotations (set[str] | list[str] | None) that TorchScript in PyTorch 2.7.1 cannot resolve, causing all model tracing to fail with "Unsupported annotation" errors. Pin to 4.x which is compatible with torch.jit.trace. Made-with: Cursor
The hf-xet Rust-based download accelerator connects to cas-bridge.xethub.hf.co, which is unreachable or throttled from the CI Kubernetes cluster, causing model downloads to stall indefinitely. The native C-level I/O also prevents Python SIGALRM from interrupting stalled downloads. Setting HF_HUB_DISABLE_XET=1 forces huggingface_hub to use the standard HTTPS requests-based downloader, which:
- Downloads directly from huggingface.co (no xet CAS)
- Is interruptible by SIGALRM (Python-level I/O)
- Is more reliable in restricted network environments

Made-with: Cursor
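In Python terms the guard is a one-liner, with the caveat that it must run before huggingface_hub starts any transfer; here it is sketched as a process-level default (the PR sets it in the Buildkite step environment instead):

```python
import os

# Must be set before any huggingface_hub download begins, otherwise the
# xet code path may already have been selected for the transfer.
os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
```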
When the build uses the pytorch_latest Docker image (nightly PyTorch builds), the validation step now installs nightly torch wheels from PyTorch's CPU index instead of the pinned 2.7.1. This ensures the allowlist is validated against the same PyTorch version that libtorch was built from, catching op changes early. For normal PR builds, torch==2.7.1 continues to be used (matching the production libtorch version). Made-with: Cursor
Instead of installing a separate torch wheel in a python:3.12 container, run the validation step inside the same Docker image that builds pytorch_inference. This gives exact version parity between the libtorch that the C++ binary links against and the torch used for model tracing/validation.

Docker changes (pytorch_linux_image only):
- Keep Python 3.12 and torch site-packages in the final stage (~100MB incremental, vs ~200-800MB wheel download per run)
- Only the nightly image needs this since the validation step only runs during run_pytorch_tests builds

Validation step changes:
- Use DOCKER_IMAGE (or default build image) instead of python:3.12
- Only install non-torch deps (transformers, sentencepiece, protobuf) since torch is already in the image
- Remove the nightly wheels logic; the image has the correct torch
- Print torch version at start for visibility

Made-with: Cursor
Force-pushed from b4f2f46 to 1765007
Two optimizations to reduce nightly build time:
1. Skip-if-unchanged: before building, compare the current viable/strict HEAD SHA against the pytorch.commit label baked into the last published image. If identical, skip the build entirely. Saves ~1 hour on quiet days / weekends.
2. sccache with GCS backend: the PyTorch compilation step uses sccache (CMAKE_C/CXX_COMPILER_LAUNCHER=sccache) backed by the existing elastic-ml-cpp-sccache GCS bucket. The cache persists across builds regardless of which Kubernetes node runs the job. GCS credentials are passed via BuildKit --mount=type=secret to avoid baking them into the image.

Other improvements:
- Split the Dockerfile RUN into clone + build steps so sccache only applies to the compilation
- Added a pytorch.commit label to the final image for traceability
- Enabled DOCKER_BUILDKIT=1 in the build script
- Inject GCS credentials for the build_pytorch_docker_image step in the post-checkout hook
- sccache stats printed at end of build for visibility
- Graceful fallback: builds without cache if no GCS credentials are available

Made-with: Cursor
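The skip-if-unchanged decision reduces to a string comparison once the two commits are in hand. The helper below is an illustrative sketch; how the viable/strict HEAD and the image's pytorch.commit label are actually fetched (e.g. via `git ls-remote` and `docker inspect`) is CI-specific and left out:

```python
def should_skip_build(viable_strict_head, last_image_commit_label):
    """True when the last published image was already built from this commit.

    A missing label (e.g. no previously published image) must never
    skip the build, so None always means "build".
    """
    return (
        last_image_commit_label is not None
        and viable_strict_head == last_image_commit_label
    )
```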
Force-pushed from d5d70e3 to 8353956
Pull request overview
This PR hardens the PyTorch allowlist validation CI step against HuggingFace download stalls and version mismatches by adding per-model timeouts, improving HF token handling, and adjusting the CI runtime environment to use the build Docker image.
Changes:
- Add per-model SIGALRM timeout (configurable) to avoid stalled HF downloads consuming the full CI step timeout.
- Improve HF token handling (treat empty HF_TOKEN as unset; inject the token only when present).
- Update the Buildkite validation step to run in the build image, disable the hf-xet downloader, and pin transformers to <5.0.0.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| dev-tools/extract_model_ops/validate_allowlist.py | Adds per-model timeout support and a CLI flag for timeout configuration. |
| dev-tools/extract_model_ops/torchscript_utils.py | Treats empty HF_TOKEN as None to avoid invalid auth headers. |
| dev-tools/extract_model_ops/requirements.txt | Pins transformers to <5.0.0 for TorchScript compatibility. |
| dev-tools/docker/pytorch_linux_image/Dockerfile | Adds sccache + attempts to carry the Python/torch runtime into the final image for validation parity. |
| dev-tools/docker/build_pytorch_linux_build_image.sh | Adds skip-if-unchanged logic, BuildKit + secret-based sccache configuration, and commit labeling. |
| .buildkite/pipelines/validate_pytorch_allowlist.yml.sh | Switches validation to the build image, disables hf-xet, avoids reinstalling torch, and prints the torch version. |
| .buildkite/hooks/post-checkout | Injects HF_TOKEN for the validation step and expands GCS credential injection to the PyTorch image build step. |
```dockerfile
# Install sccache for GCS-backed compilation caching across builds
ARG SCCACHE_VERSION=v0.14.0
RUN curl -fsSL "https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl.tar.gz" \
    | tar xz -C /usr/local/bin --strip-components=1 "sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl/sccache" && \
    chmod +x /usr/local/bin/sccache
```
This stage uses curl to download sccache, but curl is not installed in the preceding dnf install list (only wget is). On minimal Rocky images this will fail the build. Install curl explicitly or switch the download to wget to make the build deterministic.
```sh
#
# Optimizations:
# 1. Skip if viable/strict hasn't moved since the last build
# 2. ccache via BuildKit cache mount for fast incremental rebuilds
```
The comment says “ccache via BuildKit cache mount”, but the implementation here is using sccache + a Docker secret for the GCS backend. Please update the comment to match the actual mechanism so future readers don’t go looking for a ccache configuration that doesn’t exist.
Suggested change:
```diff
- # 2. ccache via BuildKit cache mount for fast incremental rebuilds
+ # 2. sccache using a Docker secret for the GCS backend for fast incremental rebuilds
```
```yaml
- "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
```
This step assumes the agent image provides python3 and pip3. The current dev-tools/docker/linux_image/Dockerfile final stage only copies /usr/local/gcc133 and does not include Python, so ml-linux-build:34 may not have these binaries. Consider invoking /usr/local/bin/python3.12 + /usr/local/bin/pip3.12 explicitly, or ensure the image guarantees python3/pip3 symlinks.
Suggested change:
```diff
- - "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- - "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- - "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
+ - "/usr/local/bin/python3.12 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
+ - "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | /usr/local/bin/pip3.12 install -r /dev/stdin"
+ - "/usr/local/bin/python3.12 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
```
```python
old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
signal.alarm(timeout)
```
signal.SIGALRM/signal.alarm() are not available on Windows. Since this is a dev-tool script (and the repo has Windows CI), running it on Windows will currently crash at handler installation. Consider guarding with hasattr(signal, "SIGALRM") and falling back to running without per-model timeouts (or using a cross-platform timeout mechanism).
```dockerfile
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
    ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
```
The final image copies the Python 3.12 binaries + stdlib directory, but not the native shared libs that the Python build may depend on (notably OpenSSL installed under /usr/local in the dependency-builder image). If /usr/local/lib64 (libssl/libcrypto) isn’t present, import ssl / pip / HF downloads can break at runtime. Consider also copying the required /usr/local/lib* runtime libs (or installing the needed packages in the final stage) and adding a quick sanity check (e.g., python3.12 -c 'import ssl') during the build.
Suggested change:
```diff
- RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
-     ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
+ COPY --from=builder /usr/local/lib/*.so* /usr/local/lib/
+ COPY --from=builder /usr/local/lib64 /usr/local/lib64
+ RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
+     ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3 && \
+     ldconfig && \
+     python3.12 -c 'import ssl'
```
Summary
Fixes and improves the PyTorch allowlist validation CI step, addressing multiple issues encountered in recent builds (#2497, #2504, #2506, #2508).
Changes
1. Use build Docker image for validation (exact torch version match)
Instead of installing a separate torch wheel in a `python:3.12` container, the validation step now runs inside the same Docker image that builds `pytorch_inference`. This gives exact version parity between the libtorch that the C++ binary links against and the torch used for model tracing.

Docker changes (`linux_image/Dockerfile`, `pytorch_linux_image/Dockerfile`):
- Symlink `python3` and `pip3` for convenience

Validation step changes (`validate_pytorch_allowlist.yml.sh`):
- Use `DOCKER_IMAGE` (or the default build image) instead of `python:3.12`
- Applies to both production (`ml-linux-build:34`) and nightly (`pytorch_latest`) builds

2. HF_TOKEN injection
- The `post-checkout` hook reads `HF_TOKEN` from vault for the validation step
- The token is only exported when non-empty (an empty value produced the invalid `Authorization: Bearer ` header)
- `torchscript_utils.py`: treats the empty string as `None` (defence-in-depth)

3. Disable hf-xet native downloader
- Sets `HF_HUB_DISABLE_XET=1` in the validation step environment
- hf-xet connects to `cas-bridge.xethub.hf.co`, which is unreachable from the CI Kubernetes cluster

4. Per-model timeout
- SIGALRM-based, default 10 minutes, configurable via `--model-timeout`
- A stalled model is skipped rather than consuming the entire step timeout

5. Pin transformers<5.0.0
- `transformers` 5.x has type annotations (`set[str] | list[str] | None`) that TorchScript in PyTorch 2.7.1 cannot resolve
- 4.x is compatible with `torch.jit.trace`

Test results
- `# HF_TOKEN added` in the CI log, authenticated downloads

Docker image rebuild required
The Dockerfile changes require rebuilding the Docker images:
- `ml-linux-build:34` → rebuild (or bump to `:35`) for production
- `pytorch_latest` → rebuilt automatically by the nightly pipeline