
Add NVIDIA DRA test cases and common DRA utilities #785

Open
nakshah87 wants to merge 1 commit into aws:main from nakshah87:nvidia-dra

Issue #, if available:

Description of changes:
Add NVIDIA DRA e2e test suite

This PR adds end-to-end tests for the NVIDIA DRA (Dynamic Resource Allocation) driver, validating that NVIDIA GPU and EFA RDMA devices are correctly
allocated to multi-node MPI workloads via the Kubernetes DRA framework. It also refactors shared logic from the existing Neuron DRA tests into
reusable common utilities.

What's included

Test framework (test/cases/nvidia-dra/)

  • main_test.go — Test harness that orchestrates setup/teardown of all dependencies: MPI operator, dranet DaemonSet, ResourceClaimTemplates, NVIDIA DRA driver (installed via Helm), and node labeling (nvidia.com/gpu.present=true). Dynamically builds the manifest and setup function list based on the instance family's RDMA type. Independent setup steps run concurrently.
  • nvidia_dra_test.go — Data-driven test runner that discovers test cases from embedded YAML files, computes MPIJob parameters from ResourceClaimTemplate specs, and runs both positive (MPIJob succeeds) and negative (workers remain Pending) assertions.
  • topology.go — Instance topology registry mapping instance families (p5) to their GPU counts, RDMA types, and test case directories. Also contains parameter computation and NCCL MPIJob template rendering logic.
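
The concurrent setup described for main_test.go could look roughly like the sketch below. It is not the harness code from this PR: `setupStep`, `runConcurrently`, `stepOK`, and `stepFail` are hypothetical names, and the real steps presumably take a context and talk to the cluster.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// setupStep is a hypothetical signature for one independent setup step
// (for example "deploy MPI operator" or "install DRA driver via Helm").
type setupStep func() error

// runConcurrently starts every step in its own goroutine, waits for all of
// them to finish, and returns the joined errors (nil if every step succeeded).
func runConcurrently(steps ...setupStep) error {
	var wg sync.WaitGroup
	errs := make([]error, len(steps)) // one slot per step, so no mutex is needed
	for i, step := range steps {
		wg.Add(1)
		go func(i int, step setupStep) {
			defer wg.Done()
			errs[i] = step()
		}(i, step)
	}
	wg.Wait()
	return errors.Join(errs...) // errors.Join drops nil entries
}

// stepOK and stepFail are stand-ins for real setup work.
func stepOK() error   { return nil }
func stepFail() error { return errors.New("helm install failed") }

func main() {
	fmt.Println(runConcurrently(stepOK, stepOK))
	fmt.Println(runConcurrently(stepOK, stepFail))
}
```

Collecting every error instead of stopping at the first one means a failed run reports all broken dependencies at once, which is usually more useful in an e2e setup phase.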

MPIJob template (templates/nccl-test-mpijob.yaml.tmpl)

  • Go template for rendering MPIJob manifests. Parameterized on slots-per-worker, total processes, worker replicas, container image, and resource claims. The launcher runs the all_reduce_perf NCCL test across all workers; workers expose SSH and configure RDMA networking via EFA.
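
As a rough illustration of the rendering step, the sketch below fills a heavily abbreviated stand-in template with Go's text/template package. The field names in `mpiJobParams` and the template body are assumptions made for this example; they are not the contents of nccl-test-mpijob.yaml.tmpl.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// mpiJobParams holds hypothetical template inputs; the real field names may differ.
type mpiJobParams struct {
	SlotsPerWorker int
	TotalProcesses int
	WorkerReplicas int
	Image          string
	ResourceClaims []string
}

// jobTmpl is an abbreviated stand-in, not the real MPIJob manifest template.
const jobTmpl = `slotsPerWorker: {{ .SlotsPerWorker }}
np: {{ .TotalProcesses }}
workers: {{ .WorkerReplicas }}
image: {{ .Image }}
resourceClaims:
{{- range .ResourceClaims }}
- name: {{ . }}
{{- end }}
`

// renderJob parses the template and executes it against the given parameters.
func renderJob(p mpiJobParams) (string, error) {
	t, err := template.New("mpijob").Parse(jobTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderJob(mpiJobParams{
		SlotsPerWorker: 8,
		TotalProcesses: 16,
		WorkerReplicas: 2,
		Image:          "img:tag", // placeholder image reference
		ResourceClaims: []string{"gpus", "efas"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```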

Test cases (testcases/p5/)

  • all-efas-all-gpus.yaml — Positive test: allocate all GPU devices and all EFA interfaces per node.
  • five-efas-one-gpu-negative-test.yaml — Negative test: request a mismatched device group constraint (5 EFAs + 1 GPU with PCIe root matching) that should be unschedulable.

ResourceClaimTemplates (rcts/p5/)

  • rct-all-efas.yaml — Claims all EFA devices with allocationMode: All.
  • rct-all-gpus.yaml — Claims all GPU devices with allocationMode: All.
  • rct-five-efas-one-gpu.yaml — Claims 5 EFAs + 1 GPU with an intentionally wrong matchAttribute constraint.

Shared DRA utilities (test/common/)

  • dra_types.go — Extracted shared types (TestCaseSpec, ResourceClaimTemplateSpec, ResourceClaimRef) and helpers (ParseTestCaseSpec, LoadRCTIndex, LoadRCTManifests, ExtractFamily, SplitImageRepoTag, ValidateRequiredFlags, IsYAMLFile) from the Neuron DRA tests into reusable common code.
  • dra_features.go — Extracted BuildPositiveFeature, BuildNegativeFeature, and DiscoverAndBuildFeatures to eliminate duplicated test discovery and feature construction logic across DRA test suites.
  • dra.go — Added CountNodesByType helper for counting nodes matching a given instance type.
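
For readers unfamiliar with the helpers, here is a guess at what two of them might do, based only on their names. The lowercase `extractFamily` and `splitImageRepoTag` below are hypothetical re-implementations; the signatures and edge-case behavior of the real ExtractFamily and SplitImageRepoTag in test/common are not shown in this description.

```go
package main

import (
	"fmt"
	"strings"
)

// extractFamily returns the part of an EC2 instance type before the first dot,
// e.g. "p5" for "p5.48xlarge". Input without a dot is returned unchanged.
func extractFamily(instanceType string) string {
	family, _, _ := strings.Cut(instanceType, ".")
	return family
}

// splitImageRepoTag splits "repo:tag" into its two parts. A colon that is
// followed by a slash belongs to a registry port, so it is not treated as a
// tag separator. Digest references (repo@sha256:...) are not handled here.
func splitImageRepoTag(image string) (repo, tag string) {
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i+1:], "/") {
		return image, ""
	}
	return image[:i], image[i+1:]
}

func main() {
	fmt.Println(extractFamily("p5.48xlarge"))
	fmt.Println(splitImageRepoTag("public.ecr.aws/example/nccl:v1")) // placeholder image name
	fmt.Println(splitImageRepoTag("localhost:5000/nccl"))
}
```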

Refactored Neuron DRA tests (test/cases/neuron-dra/)

  • Updated to consume shared types and helpers from test/common instead of defining them locally, removing ~344 lines of duplicated code.

How it works

  1. The test harness concurrently deploys the MPI operator, dranet (for EFA-based families), and the RCTs, labels the nodes, and installs the NVIDIA DRA driver via Helm.
  2. Test cases are YAML files that reference RCTs by name. The framework resolves each RCT to compute GPU counts, slots-per-worker, and total MPI processes.
  3. An MPIJob is rendered from the nccl-test-mpijob.yaml.tmpl Go template and applied to the cluster.
  4. Positive tests wait for the MPIJob to succeed; negative tests assert that worker pods remain Pending.
  5. Teardown uninstalls the Helm release and deletes all applied manifests in reverse order.
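
Step 2 can be sketched as follows, under two assumptions that this description does not confirm: each MPI rank uses one GPU (so slots-per-worker equals the GPUs claimed per worker), and total processes equals slots-per-worker times the number of workers. The types and the `computeMPIParams` function are hypothetical; only the allocationMode values "All" and "ExactCount" come from the Kubernetes DRA API.

```go
package main

import "fmt"

// deviceRequest is a hypothetical, trimmed view of one GPU request in an RCT.
type deviceRequest struct {
	AllocationMode string // "All" or "ExactCount"
	Count          int    // only meaningful with ExactCount
}

// mpiParams are the derived values that would feed the MPIJob template.
type mpiParams struct {
	SlotsPerWorker int
	TotalProcesses int
}

// computeMPIParams derives MPI parameters from the GPU requests of one worker.
// gpusPerNode resolves allocationMode "All"; workers is the replica count.
func computeMPIParams(reqs []deviceRequest, gpusPerNode, workers int) (mpiParams, error) {
	slots := 0
	for _, r := range reqs {
		switch r.AllocationMode {
		case "All":
			slots += gpusPerNode
		case "ExactCount":
			slots += r.Count
		default:
			return mpiParams{}, fmt.Errorf("unsupported allocationMode %q", r.AllocationMode)
		}
	}
	if slots == 0 {
		return mpiParams{}, fmt.Errorf("no GPUs requested")
	}
	return mpiParams{SlotsPerWorker: slots, TotalProcesses: slots * workers}, nil
}

func main() {
	// Two workers, each claiming all GPUs on a node with 8 GPUs.
	p, err := computeMPIParams([]deviceRequest{{AllocationMode: "All"}}, 8, 2)
	fmt.Println(p, err)
}
```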

Adding new test cases
To add a test for a new device/EFA combination:

  1. Add a ResourceClaimTemplate YAML under rcts//.
  2. Add a test case YAML under testcases// referencing the RCT name. Set expectFailure: true for negative tests.

To add support for a new instance family, add an entry to instanceTopologies in topology.go.

Testing
The test can be invoked as follows:
./nvidia-dra.test --test.timeout=60m \
--test.v \
-rdmaDeviceDraDriverImage= \
-acceleratorDraDriverImage= \
-containerTestImage= \
-nodeType=p5.48xlarge

The acceleratorDraDriverImage flag is optional. If it is omitted, the driver is installed with the image specified in the Helm chart.
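
The flag handling could be validated along these lines. This is a sketch, not the PR's ValidateRequiredFlags: it assumes the three flags other than acceleratorDraDriverImage are required, and `config` and `parseFlags` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// config collects the flag values shown in the invocation above.
type config struct {
	RDMAImage        string
	AcceleratorImage string // optional; empty means "use the Helm chart's image"
	TestImage        string
	NodeType         string
}

// parseFlags parses args and reports every required flag that is missing.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("nvidia-dra", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of test logs
	fs.StringVar(&c.RDMAImage, "rdmaDeviceDraDriverImage", "", "RDMA device DRA driver image")
	fs.StringVar(&c.AcceleratorImage, "acceleratorDraDriverImage", "", "optional accelerator DRA driver image")
	fs.StringVar(&c.TestImage, "containerTestImage", "", "container image used by the MPIJob")
	fs.StringVar(&c.NodeType, "nodeType", "", "instance type, e.g. p5.48xlarge")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// Assumption: everything except acceleratorDraDriverImage is required.
	required := []struct{ name, value string }{
		{"rdmaDeviceDraDriverImage", c.RDMAImage},
		{"containerTestImage", c.TestImage},
		{"nodeType", c.NodeType},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return c, fmt.Errorf("missing required flags: %s", strings.Join(missing, ", "))
	}
	return c, nil
}

func main() {
	_, err := parseFlags([]string{"-nodeType=p5.48xlarge"})
	fmt.Println(err)
}
```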

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
