
[ML] Fix PyTorch allowlist validation timeout on HF download stall #3022

Open
edsavage wants to merge 7 commits into elastic:main from edsavage:fix/hf-token-for-validation

Conversation

edsavage (Contributor) commented Apr 12, 2026

Summary

Fixes and improves the PyTorch allowlist validation CI step, addressing multiple issues encountered in recent builds (#2497, #2504, #2506, #2508).

Changes

1. Use build Docker image for validation (exact torch version match)

Instead of installing a separate torch wheel in a python:3.12 container, the validation step now runs inside the same Docker image that builds pytorch_inference. This gives exact version parity between the libtorch that the C++ binary links against and the torch used for model tracing.

Docker changes (linux_image/Dockerfile, pytorch_linux_image/Dockerfile):

  • Keep Python 3.12 and torch site-packages in the final stage (~100MB incremental, vs ~200-800MB wheel downloaded every run previously)
  • Symlink python3 and pip3 for convenience

Validation step changes (validate_pytorch_allowlist.yml.sh):

  • Uses DOCKER_IMAGE (or default build image) instead of python:3.12
  • Only installs non-torch deps (transformers, sentencepiece, protobuf)
  • No more nightly wheel logic — both production and nightly images have the correct torch built in
  • Works automatically for both production (ml-linux-build:34) and nightly (pytorch_latest) builds

2. HF_TOKEN injection

  • post-checkout hook reads HF_TOKEN from vault for the validation step
  • Only exports if non-empty (prevents Authorization: Bearer invalid header)
  • torchscript_utils.py: treats empty string as None (defence-in-depth)

3. Disable hf-xet native downloader

  • Sets HF_HUB_DISABLE_XET=1 in the validation step environment
  • The hf-xet Rust downloader connects to cas-bridge.xethub.hf.co which is unreachable from the CI Kubernetes cluster
  • The standard Python requests-based downloader is also interruptible by SIGALRM
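
Because huggingface_hub typically evaluates this environment variable when it is first imported, the override has to happen before the import. A minimal sketch of the idea:

```python
import os

# Disable the hf-xet native downloader. This must run before
# huggingface_hub is imported, since the library generally reads
# HF_HUB_DISABLE_XET at import time.
os.environ["HF_HUB_DISABLE_XET"] = "1"

# ...only now import huggingface_hub / transformers and start downloads.
```

In the CI step itself the same effect is achieved by exporting the variable in the step environment before Python starts.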

4. Per-model timeout

  • Each model download/trace is wrapped in a 10-minute SIGALRM timeout (configurable via --model-timeout)
  • Stalled downloads are skipped rather than consuming the full 60-minute step timeout
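
The per-model timeout mechanism can be sketched with a small context manager (illustrative names; the real validate_allowlist.py wiring, including the `--model-timeout` flag, may differ):

```python
import signal
from contextlib import contextmanager

class ModelTimeout(Exception):
    """Raised when a single model download/trace exceeds its budget."""

@contextmanager
def model_timeout(seconds: int):
    # SIGALRM is POSIX-only; on platforms without it (e.g. Windows)
    # run without a per-model timeout rather than crashing.
    if not hasattr(signal, "SIGALRM"):
        yield
        return

    def _handler(signum, frame):
        raise ModelTimeout(f"model step exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Wrapping each model's download-and-trace call in `with model_timeout(600):` and catching `ModelTimeout` lets the runner record the model as skipped, so one stalled download cannot consume the 60-minute step budget.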

5. Pin transformers<5.0.0

  • transformers 5.x has type annotations (set[str] | list[str] | None) that TorchScript in PyTorch 2.7.1 cannot resolve
  • Pinned to 4.x which is compatible with torch.jit.trace

Test results

  • Local validation: 28 passed, 0 failed, 4 skipped (macOS, no FBGEMM)
  • CI build #2514: 33 passed, 0 failed, 2 skipped (Linux, with FBGEMM quantization)
  • HF_TOKEN working: `# HF_TOKEN added` appears in the CI log, and downloads are authenticated
  • Downloads completing: no stalls with xet disabled
  • Transformers 4.48.0 compatible with torch.jit.trace

Docker image rebuild required

The Dockerfile changes require rebuilding the Docker images:

  • ml-linux-build:34 → rebuild (or bump to :35) for production
  • pytorch_latest → rebuilt automatically by the nightly pipeline

Build elastic#2497 timed out because a HuggingFace model download stalled at
0% for 58 minutes (unauthenticated rate limiting). Two fixes:

1. Add HF_TOKEN injection to the validation step via post-checkout hook,
   reading from vault (secret/ci/elastic-ml-cpp/huggingface/hf_token).
   Authenticated requests get higher rate limits and more reliable
   downloads from HuggingFace Hub.

2. Add per-model timeout (default 10 minutes, configurable via
   --model-timeout) using SIGALRM. Models that can't be downloaded
   and traced within the timeout are skipped rather than consuming
   the entire step timeout. This prevents a single stalled download
   from failing the whole validation run.

Made-with: Cursor
prodsecmachine commented Apr 12, 2026

Snyk checks have passed. No issues have been found so far.

Scan Engine            Critical  High  Medium  Low  Total
Open Source Security   0         0     0       0    0 issues
Licenses               0         0     0       0    0 issues


When the vault secret is missing or the read fails, HF_TOKEN was set
to an empty string. HuggingFace's client then sent 'Authorization:
Bearer ' (no token) which is an invalid HTTP header, causing all
model loads to fail with 'Illegal header value'.

Two fixes:
- post-checkout: only export HF_TOKEN if the vault read returned
  a non-empty value
- torchscript_utils.py: treat empty string as None so HuggingFace
  falls back to unauthenticated requests

Made-with: Cursor
transformers 5.x introduced Python type annotations (set[str] |
list[str] | None) that TorchScript in PyTorch 2.7.1 cannot resolve,
causing all model tracing to fail with "Unsupported annotation"
errors. Pin to 4.x which is compatible with torch.jit.trace.

Made-with: Cursor
The hf-xet Rust-based download accelerator connects to
cas-bridge.xethub.hf.co which is unreachable or throttled from
the CI Kubernetes cluster, causing model downloads to stall
indefinitely. The native C-level I/O also prevents Python SIGALRM
from interrupting stalled downloads.

Setting HF_HUB_DISABLE_XET=1 forces huggingface_hub to use the
standard HTTPS requests-based downloader which:
- Downloads directly from huggingface.co (no xet CAS)
- Is interruptible by SIGALRM (Python-level I/O)
- Is more reliable in restricted network environments

Made-with: Cursor
When the build uses the pytorch_latest Docker image (nightly
PyTorch builds), the validation step now installs nightly torch
wheels from PyTorch's CPU index instead of the pinned 2.7.1.
This ensures the allowlist is validated against the same PyTorch
version that libtorch was built from, catching op changes early.

For normal PR builds, torch==2.7.1 continues to be used (matching
the production libtorch version).

Made-with: Cursor
Instead of installing a separate torch wheel in a python:3.12
container, run the validation step inside the same Docker image
that builds pytorch_inference. This gives exact version parity
between the libtorch that the C++ binary links against and the
torch used for model tracing/validation.

Docker changes (pytorch_linux_image only):
- Keep Python 3.12 and torch site-packages in the final stage
  (~100MB incremental, vs ~200-800MB wheel download per run)
- Only the nightly image needs this since the validation step
  only runs during run_pytorch_tests builds

Validation step changes:
- Use DOCKER_IMAGE (or default build image) instead of python:3.12
- Only install non-torch deps (transformers, sentencepiece, protobuf)
  since torch is already in the image
- Remove nightly wheels logic — the image has the correct torch
- Print torch version at start for visibility

Made-with: Cursor
@edsavage edsavage force-pushed the fix/hf-token-for-validation branch from b4f2f46 to 1765007 Compare April 14, 2026 01:59
Two optimizations to reduce nightly build time:

1. Skip-if-unchanged: before building, compare the current
   viable/strict HEAD SHA against the pytorch.commit label baked
   into the last published image. If identical, skip the build
   entirely. Saves ~1 hour on quiet days / weekends.

2. sccache with GCS backend: the PyTorch compilation step uses
   sccache (CMAKE_C/CXX_COMPILER_LAUNCHER=sccache) backed by the
   existing elastic-ml-cpp-sccache GCS bucket. The cache persists
   across builds regardless of which Kubernetes node runs the job.
   GCS credentials are passed via BuildKit --mount=type=secret
   to avoid baking them into the image.

Other improvements:
- Split Dockerfile RUN into clone + build steps so sccache only
  applies to the compilation
- Added pytorch.commit label to the final image for traceability
- Enabled DOCKER_BUILDKIT=1 in the build script
- Inject GCS credentials for build_pytorch_docker_image step
  in post-checkout hook
- sccache stats printed at end of build for visibility
- Graceful fallback: builds without cache if no GCS credentials

Made-with: Cursor
@edsavage edsavage force-pushed the fix/hf-token-for-validation branch from d5d70e3 to 8353956 Compare April 14, 2026 02:56
@edsavage edsavage requested review from Copilot and valeriy42 and removed request for valeriy42 April 16, 2026 00:41
Copilot AI left a comment

Pull request overview

This PR hardens the PyTorch allowlist validation CI step against HuggingFace download stalls and version mismatches by adding per-model timeouts, improving HF token handling, and adjusting the CI runtime environment to use the build Docker image.

Changes:

  • Add per-model SIGALRM timeout (configurable) to avoid stalled HF downloads consuming the full CI step timeout.
  • Improve HF token handling (treat empty HF_TOKEN as unset; inject token only when present).
  • Update Buildkite validation step to run in the build image, disable hf-xet downloader, and pin transformers to <5.0.0.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
dev-tools/extract_model_ops/validate_allowlist.py Adds per-model timeout support and CLI flag for timeout configuration.
dev-tools/extract_model_ops/torchscript_utils.py Treats empty HF_TOKEN as None to avoid invalid auth headers.
dev-tools/extract_model_ops/requirements.txt Pins transformers to <5.0.0 for TorchScript compatibility.
dev-tools/docker/pytorch_linux_image/Dockerfile Adds sccache + attempts to carry Python/torch runtime into final image for validation parity.
dev-tools/docker/build_pytorch_linux_build_image.sh Adds skip-if-unchanged logic, BuildKit + secret-based sccache configuration, and commit labeling.
.buildkite/pipelines/validate_pytorch_allowlist.yml.sh Switches validation to the build image, disables hf-xet, avoids reinstalling torch, and prints torch version.
.buildkite/hooks/post-checkout Injects HF_TOKEN for the validation step and expands GCS credential injection to the PyTorch image build step.


Comment on lines +23 to +27
# Install sccache for GCS-backed compilation caching across builds
ARG SCCACHE_VERSION=v0.14.0
RUN curl -fsSL "https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl.tar.gz" \
| tar xz -C /usr/local/bin --strip-components=1 "sccache-${SCCACHE_VERSION}-x86_64-unknown-linux-musl/sccache" && \
chmod +x /usr/local/bin/sccache
Copilot AI Apr 16, 2026

This stage uses curl to download sccache, but curl is not installed in the preceding dnf install list (only wget is). On minimal Rocky images this will fail the build. Install curl explicitly or switch the download to wget to make the build deterministic.

Copilot uses AI. Check for mistakes.
#
# Optimizations:
# 1. Skip if viable/strict hasn't moved since the last build
# 2. ccache via BuildKit cache mount for fast incremental rebuilds
Copilot AI Apr 16, 2026

The comment says “ccache via BuildKit cache mount”, but the implementation here is using sccache + a Docker secret for the GCS backend. Please update the comment to match the actual mechanism so future readers don’t go looking for a ccache configuration that doesn’t exist.

Suggested change
# 2. ccache via BuildKit cache mount for fast incremental rebuilds
# 2. sccache using a Docker secret for the GCS backend for fast incremental rebuilds

Comment on lines +25 to 27
- "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
Copilot AI Apr 16, 2026

This step assumes the agent image provides python3 and pip3. The current dev-tools/docker/linux_image/Dockerfile final stage only copies /usr/local/gcc133 and does not include Python, so ml-linux-build:34 may not have these binaries. Consider invoking /usr/local/bin/python3.12 + /usr/local/bin/pip3.12 explicitly, or ensure the image guarantees python3/pip3 symlinks.

Suggested change
- "python3 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | pip3 install -r /dev/stdin"
- "python3 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"
- "/usr/local/bin/python3.12 -c \"import torch; print(f'PyTorch version: {torch.__version__}')\""
- "grep -v '^torch==' dev-tools/extract_model_ops/requirements.txt | /usr/local/bin/pip3.12 install -r /dev/stdin"
- "/usr/local/bin/python3.12 dev-tools/extract_model_ops/validate_allowlist.py --config dev-tools/extract_model_ops/validation_models.json --pt-dir dev-tools/extract_model_ops/es_it_models --verbose"

Comment on lines +135 to +136
old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
signal.alarm(timeout)
Copilot AI Apr 16, 2026

signal.SIGALRM/signal.alarm() are not available on Windows. Since this is a dev-tool script (and the repo has Windows CI), running it on Windows will currently crash at handler installation. Consider guarding with hasattr(signal, "SIGALRM") and falling back to running without per-model timeouts (or using a cross-platform timeout mechanism).

Comment on lines +105 to +106
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
Copilot AI Apr 16, 2026

The final image copies the Python 3.12 binaries + stdlib directory, but not the native shared libs that the Python build may depend on (notably OpenSSL installed under /usr/local in the dependency-builder image). If /usr/local/lib64 (libssl/libcrypto) isn’t present, import ssl / pip / HF downloads can break at runtime. Consider also copying the required /usr/local/lib* runtime libs (or installing the needed packages in the final stage) and adding a quick sanity check (e.g., python3.12 -c 'import ssl') during the build.

Suggested change
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3
COPY --from=builder /usr/local/lib/*.so* /usr/local/lib/
COPY --from=builder /usr/local/lib64 /usr/local/lib64
RUN ln -sf /usr/local/bin/python3.12 /usr/local/bin/python3 && \
ln -sf /usr/local/bin/pip3.12 /usr/local/bin/pip3 && \
ldconfig && \
python3.12 -c 'import ssl'
