
[ML] Fix PyTorch allowlist validation timeout on HF download stall #3022

Open
edsavage wants to merge 7 commits into elastic:main from edsavage:fix/hf-token-for-validation

Conversation

edsavage (Contributor) commented Apr 12, 2026

Summary

Fixes and improves the PyTorch allowlist validation CI step, addressing multiple issues encountered in recent builds (#2497, #2504, #2506, #2508).

Changes

1. Use build Docker image for validation (exact torch version match)

Instead of installing a separate torch wheel in a python:3.12 container, the validation step now runs inside the same Docker image that builds pytorch_inference. This gives exact version parity between the libtorch that the C++ binary links against and the torch used for model tracing.

Docker changes (linux_image/Dockerfile, pytorch_linux_image/Dockerfile):

  • Keep Python 3.12 and torch site-packages in the final stage (~100MB incremental, vs ~200-800MB wheel downloaded every run previously)
  • Symlink python3 and pip3 for convenience

Validation step changes (validate_pytorch_allowlist.yml.sh):

  • Uses DOCKER_IMAGE (or default build image) instead of python:3.12
  • Only installs non-torch deps (transformers, sentencepiece, protobuf)
  • No more nightly wheel logic — both production and nightly images have the correct torch built in
  • Works automatically for both production (ml-linux-build:34) and nightly (pytorch_latest) builds

2. HF_TOKEN injection

  • post-checkout hook reads HF_TOKEN from vault for the validation step
  • Only exports if non-empty (prevents Authorization: Bearer invalid header)
  • torchscript_utils.py: treats empty string as None (defence-in-depth)

3. Disable hf-xet native downloader

  • Sets HF_HUB_DISABLE_XET=1 in the validation step environment
  • The hf-xet Rust downloader connects to cas-bridge.xethub.hf.co which is unreachable from the CI Kubernetes cluster
  • The standard Python requests-based downloader is also interruptible by SIGALRM
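
Because huggingface_hub typically evaluates this environment variable when it is first imported, the override has to happen before the import. A minimal sketch of the idea:

```python
import os

# Disable the hf-xet native downloader. This must run before
# huggingface_hub is imported, since the library generally reads
# HF_HUB_DISABLE_XET at import time.
os.environ["HF_HUB_DISABLE_XET"] = "1"

# ...only now import huggingface_hub / transformers and start downloads.
```

In the CI step itself the same effect is achieved by exporting the variable in the step environment before Python starts.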

4. Per-model timeout

  • Each model download/trace is wrapped in a 10-minute SIGALRM timeout (configurable via --model-timeout)
  • Stalled downloads are skipped rather than consuming the full 60-minute step timeout
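
The per-model timeout mechanism can be sketched with a small context manager (illustrative names; the real validate_allowlist.py wiring, including the `--model-timeout` flag, may differ):

```python
import signal
from contextlib import contextmanager

class ModelTimeout(Exception):
    """Raised when a single model download/trace exceeds its budget."""

@contextmanager
def model_timeout(seconds: int):
    # SIGALRM is POSIX-only; on platforms without it (e.g. Windows)
    # run without a per-model timeout rather than crashing.
    if not hasattr(signal, "SIGALRM"):
        yield
        return

    def _handler(signum, frame):
        raise ModelTimeout(f"model step exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Wrapping each model's download-and-trace call in `with model_timeout(600):` and catching `ModelTimeout` lets the runner record the model as skipped, so one stalled download cannot consume the 60-minute step budget.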

5. Pin transformers<5.0.0

  • transformers 5.x has type annotations (set[str] | list[str] | None) that TorchScript in PyTorch 2.7.1 cannot resolve
  • Pinned to 4.x which is compatible with torch.jit.trace

Test results

  • Local validation: 28 passed, 0 failed, 4 skipped (macOS, no FBGEMM)
  • CI build #2514: 33 passed, 0 failed, 2 skipped (Linux, with FBGEMM quantization)
  • HF_TOKEN working: `# HF_TOKEN added` appears in the CI log, and downloads are authenticated
  • Downloads completing: no stalls with xet disabled
  • Transformers 4.48.0 compatible with torch.jit.trace

Docker image rebuild required

The Dockerfile changes require rebuilding the Docker images:

  • ml-linux-build:34 → rebuild (or bump to :35) for production
  • pytorch_latest → rebuilt automatically by the nightly pipeline

Build elastic#2497 timed out because a HuggingFace model download stalled at
0% for 58 minutes (unauthenticated rate limiting). Two fixes:

1. Add HF_TOKEN injection to the validation step via post-checkout hook,
   reading from vault (secret/ci/elastic-ml-cpp/huggingface/hf_token).
   Authenticated requests get higher rate limits and more reliable
   downloads from HuggingFace Hub.

2. Add per-model timeout (default 10 minutes, configurable via
   --model-timeout) using SIGALRM. Models that can't be downloaded
   and traced within the timeout are skipped rather than consuming
   the entire step timeout. This prevents a single stalled download
   from failing the whole validation run.

Made-with: Cursor
prodsecmachine commented Apr 12, 2026

Snyk checks have passed. No issues have been found so far.

Scan Engine            Critical  High  Medium  Low  Total
Open Source Security   0         0     0       0    0 issues
Licenses               0         0     0       0    0 issues


When the vault secret is missing or the read fails, HF_TOKEN was set
to an empty string. HuggingFace's client then sent 'Authorization:
Bearer ' (no token) which is an invalid HTTP header, causing all
model loads to fail with 'Illegal header value'.

Two fixes:
- post-checkout: only export HF_TOKEN if the vault read returned
  a non-empty value
- torchscript_utils.py: treat empty string as None so HuggingFace
  falls back to unauthenticated requests

Made-with: Cursor
transformers 5.x introduced Python type annotations (set[str] |
list[str] | None) that TorchScript in PyTorch 2.7.1 cannot resolve,
causing all model tracing to fail with "Unsupported annotation"
errors. Pin to 4.x which is compatible with torch.jit.trace.

Made-with: Cursor
The hf-xet Rust-based download accelerator connects to
cas-bridge.xethub.hf.co which is unreachable or throttled from
the CI Kubernetes cluster, causing model downloads to stall
indefinitely. The native C-level I/O also prevents Python SIGALRM
from interrupting stalled downloads.

Setting HF_HUB_DISABLE_XET=1 forces huggingface_hub to use the
standard HTTPS requests-based downloader which:
- Downloads directly from huggingface.co (no xet CAS)
- Is interruptible by SIGALRM (Python-level I/O)
- Is more reliable in restricted network environments

Made-with: Cursor
When the build uses the pytorch_latest Docker image (nightly
PyTorch builds), the validation step now installs nightly torch
wheels from PyTorch's CPU index instead of the pinned 2.7.1.
This ensures the allowlist is validated against the same PyTorch
version that libtorch was built from, catching op changes early.

For normal PR builds, torch==2.7.1 continues to be used (matching
the production libtorch version).

Made-with: Cursor
Instead of installing a separate torch wheel in a python:3.12
container, run the validation step inside the same Docker image
that builds pytorch_inference. This gives exact version parity
between the libtorch that the C++ binary links against and the
torch used for model tracing/validation.

Docker changes (pytorch_linux_image only):
- Keep Python 3.12 and torch site-packages in the final stage
  (~100MB incremental, vs ~200-800MB wheel download per run)
- Only the nightly image needs this since the validation step
  only runs during run_pytorch_tests builds

Validation step changes:
- Use DOCKER_IMAGE (or default build image) instead of python:3.12
- Only install non-torch deps (transformers, sentencepiece, protobuf)
  since torch is already in the image
- Remove nightly wheels logic — the image has the correct torch
- Print torch version at start for visibility

Made-with: Cursor
@edsavage edsavage force-pushed the fix/hf-token-for-validation branch from b4f2f46 to 1765007 Compare April 14, 2026 01:59
Two optimizations to reduce nightly build time:

1. Skip-if-unchanged: before building, compare the current
   viable/strict HEAD SHA against the pytorch.commit label baked
   into the last published image. If identical, skip the build
   entirely. Saves ~1 hour on quiet days / weekends.

2. sccache with GCS backend: the PyTorch compilation step uses
   sccache (CMAKE_C/CXX_COMPILER_LAUNCHER=sccache) backed by the
   existing elastic-ml-cpp-sccache GCS bucket. The cache persists
   across builds regardless of which Kubernetes node runs the job.
   GCS credentials are passed via BuildKit --mount=type=secret
   to avoid baking them into the image.

Other improvements:
- Split Dockerfile RUN into clone + build steps so sccache only
  applies to the compilation
- Added pytorch.commit label to the final image for traceability
- Enabled DOCKER_BUILDKIT=1 in the build script
- Inject GCS credentials for build_pytorch_docker_image step
  in post-checkout hook
- sccache stats printed at end of build for visibility
- Graceful fallback: builds without cache if no GCS credentials

Made-with: Cursor
@edsavage edsavage force-pushed the fix/hf-token-for-validation branch from d5d70e3 to 8353956 Compare April 14, 2026 02:56
@edsavage edsavage requested review from Copilot and valeriy42 and removed request for valeriy42 April 16, 2026 00:41
Copilot AI left a comment

Pull request overview

This PR hardens the PyTorch allowlist validation CI step against HuggingFace download stalls and version mismatches by adding per-model timeouts, improving HF token handling, and adjusting the CI runtime environment to use the build Docker image.

Changes:

  • Add per-model SIGALRM timeout (configurable) to avoid stalled HF downloads consuming the full CI step timeout.
  • Improve HF token handling (treat empty HF_TOKEN as unset; inject token only when present).
  • Update Buildkite validation step to run in the build image, disable hf-xet downloader, and pin transformers to <5.0.0.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
dev-tools/extract_model_ops/validate_allowlist.py Adds per-model timeout support and CLI flag for timeout configuration.
dev-tools/extract_model_ops/torchscript_utils.py Treats empty HF_TOKEN as None to avoid invalid auth headers.
dev-tools/extract_model_ops/requirements.txt Pins transformers to <5.0.0 for TorchScript compatibility.
dev-tools/docker/pytorch_linux_image/Dockerfile Adds sccache + attempts to carry Python/torch runtime into final image for validation parity.
dev-tools/docker/build_pytorch_linux_build_image.sh Adds skip-if-unchanged logic, BuildKit + secret-based sccache configuration, and commit labeling.
.buildkite/pipelines/validate_pytorch_allowlist.yml.sh Switches validation to the build image, disables hf-xet, avoids reinstalling torch, and prints torch version.
.buildkite/hooks/post-checkout Injects HF_TOKEN for the validation step and expands GCS credential injection to the PyTorch image build step.


Comment on lines +23 to +27
# Install sccache for GCS-backed compilation caching across builds
ARG SCCACHE_VERSION=v0.14.0
RUN curl -fsSL "https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl.tar.gz" \
| tar xz -C /usr/local/bin --strip-components=1 "sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl/sccache" && \
chmod +x /usr/local/bin/sccache
Copilot AI Apr 16, 2026

This stage uses curl to download sccache, but curl is not installed in the preceding dnf install list (only wget is). On minimal Rocky images this will fail the build. Install curl explicitly or switch the download to wget to make the build deterministic.

Copilot uses AI. Check for mistakes.
#
# Optimizations:
# 1. Skip if viable/strict hasn't moved since the last build
# 2. ccache via BuildKit cache mount for fast incremental rebuilds
Copilot AI Apr 16, 2026

The comment says “ccache via BuildKit cache mount”, but the implementation here is using sccache + a Docker secret for the GCS backend. Please update the comment to match the actual mechanism so future readers don’t go looking for a ccache configuration that doesn’t exist.

Suggested change
# 2. ccache via BuildKit cache mount for fast incremental rebuilds
# 2. sccache using a Docker secret for the GCS backend for fast incremental rebuilds

Comment on lines +25 to 27
- "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
Copilot AI Apr 16, 2026

This step assumes the agent image provides python3 and pip3. The current dev-tools/docker/linux_image/Dockerfile final stage only copies /usr/local/gcc133 and does not include Python, so ml-linux-build:34 may not have these binaries. Consider invoking /usr/local/bin/python3.12 + /usr/local/bin/pip3.12 explicitly, or ensure the image guarantees python3/pip3 symlinks.

Suggested change
- "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
- "/usr/local/bin/python3.12 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | /usr/local/bin/pip3.12 install -r /dev/stdin"
- "/usr/local/bin/python3.12 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"

Comment on lines +135 to +136
old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
signal.alarm(timeout)
Copilot AI Apr 16, 2026

signal.SIGALRM/signal.alarm() are not available on Windows. Since this is a dev-tool script (and the repo has Windows CI), running it on Windows will currently crash at handler installation. Consider guarding with hasattr(signal, "SIGALRM") and falling back to running without per-model timeouts (or using a cross-platform timeout mechanism).

Comment on lines +105 to +106
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
Copilot AI Apr 16, 2026

The final image copies the Python 3.12 binaries + stdlib directory, but not the native shared libs that the Python build may depend on (notably OpenSSL installed under /usr/local in the dependency-builder image). If /usr/local/lib64 (libssl/libcrypto) isn’t present, import ssl / pip / HF downloads can break at runtime. Consider also copying the required /usr/local/lib* runtime libs (or installing the needed packages in the final stage) and adding a quick sanity check (e.g., python3.12 -c 'import ssl') during the build.

Suggested change
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
COPY --from=builder /usr/local/lib/*.so* /usr/local/lib/
COPY --from=builder /usr/local/lib64 /usr/local/lib64
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3 && \
ldconfig && \
python3.12 -c 'import ssl'
