Support metadata resolution for multi-cloud envs #2032

Open
movence wants to merge 17 commits into feature-multi-cloud from hsookim/multi-cloud/metadata
movence (Contributor) commented Feb 18, 2026

Description of changes

Adds multi-cloud support for metric decoration and placeholder resolution. Azure (and future cloud providers)
use OTel's resourcedetectionprocessor for metadata, while EC2 continues using ec2tagger. Disk/volume tagging
is extracted into a standalone disktagger processor that supports both AWS and Azure.

Also includes test fixes from #2010 and a minor update that removes an unnecessary wrapper function (ref).

Cloud metadata (internal/cloudmetadata/)

  • Cloud-agnostic Provider interface with AWS and Azure implementations
  • Wraps re-exported metadataproviders from otel-contrib (no custom IMDS code)
  • Singleton with auto-detection (AWS → Azure), used by config-downloader and config-translator
  • Replaces ec2util.GetEC2UtilSingleton() in all callers

Cloud provider enum (internal/cloudprovider/)

  • Shared CloudProvider type (Unknown, AWS, Azure)

Re-export wrappers (otel-contrib)

  • resourcedetectionprocessor/metadataproviders/{ec2,azure}/ exposes internal packages for external use

resourcedetection processor translator

  • Added for non-EC2 pipelines (Azure, GCP)
  • Reads append_dimensions and enables only requested attributes
  • Supports OTel-style placeholders (${host.id}, ${azure.vm.size})

ec2tagger changes

  • Removed all volume/disk tagging code (moved to disktagger)
  • Supports OTel placeholders (${host.id}) alongside legacy (${aws:InstanceId})

disktagger processor (plugins/processors/disktagger/)

  • Extracted from ec2tagger as standalone processor
  • AWS: reuses existing volume package (DescribeVolumes + NVMe serial)
  • Azure: IMDS storageProfile (OS + data disks) + /dev/disk/azure/ symlinks with sysfs SCSI fallback
  • Prefix matching for partition names (e.g. sda1 matches sda)
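
The partition-name matching in the last bullet can be illustrated with a minimal sketch. This is not the PR's code: `lookupDiskID` and the sample map are hypothetical, and choosing the longest matching prefix is an assumption made here so that a short device name cannot shadow a longer one.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupDiskID (hypothetical helper) returns the disk ID for a device name.
// It tries an exact match first, then falls back to the longest known device
// name that is a prefix of the requested one, so "sda1" resolves via "sda".
func lookupDiskID(deviceToDisk map[string]string, device string) (string, bool) {
	if id, ok := deviceToDisk[device]; ok {
		return id, true
	}
	best := ""
	for known := range deviceToDisk {
		if strings.HasPrefix(device, known) && len(known) > len(best) {
			best = known
		}
	}
	if best == "" {
		return "", false
	}
	return deviceToDisk[best], true
}

func main() {
	// Sample mapping; the disk IDs are made up for illustration.
	disks := map[string]string{"sda": "os-disk", "sdb": "data-disk-1"}
	id, ok := lookupDiskID(disks, "sda1")
	fmt.Println(id, ok)
}
```

A device with no known prefix (for example `sdc1` with the map above) returns `"", false`, which a caller could treat as "leave the metric untagged".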

Pipeline translator

  • EC2 mode → ec2tagger (unchanged)
  • Non-EC2 → resourcedetectionprocessor
  • disktagger added when disk VolumeId/DiskId dimension is configured

Placeholder resolution

  • Runtime placeholders (${aws:VolumeId}, ${disk.id}) stripped during translation
  • cloudmetadata.GetProvider() replaces ec2util for metadata
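
As a rough sketch of the stripping step, the translator could split an `append_dimensions` map into entries it can resolve now and entries that must wait until runtime. The function name `splitDimensions` and the interpretation of "stripped" as "removed from the translated map" are assumptions, not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// runtimePlaceholders lists the values named in the PR description that are
// only known at runtime (they are filled in by disktagger).
var runtimePlaceholders = map[string]bool{
	"${aws:VolumeId}": true,
	"${disk.id}":      true,
}

// splitDimensions (hypothetical) separates dimensions that can be resolved at
// translation time from those whose value is a runtime placeholder. It
// returns the static dimensions and the sorted keys that were deferred.
func splitDimensions(dims map[string]string) (map[string]string, []string) {
	static := make(map[string]string, len(dims))
	var deferred []string
	for key, value := range dims {
		if runtimePlaceholders[value] {
			deferred = append(deferred, key)
			continue
		}
		static[key] = value
	}
	sort.Strings(deferred)
	return static, deferred
}

func main() {
	dims := map[string]string{
		"InstanceId": "${host.id}",
		"Region":     "${cloud.region}",
		"VolumeId":   "${disk.id}",
	}
	static, deferred := splitDimensions(dims)
	fmt.Println(len(static), deferred)
}
```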

Config examples

EC2 (backward compatible):

{"InstanceId": "${aws:InstanceId}", "VolumeId": "${aws:VolumeId}"}

EC2 (OTel style):

{"InstanceId": "${host.id}", "VolumeId": "${aws:VolumeId}"}

Azure:

{"InstanceId": "${host.id}", "Region": "${cloud.region}", "VolumeId": "${disk.id}"}

Known gap

Attribute renaming for non-EC2 clouds: resourcedetectionprocessor adds OTel semantic convention attributes (e.g. host.id, cloud.region, azure.vm.size) to the Resource. These are not automatically renamed to CloudWatch dimension names (e.g. InstanceId, Region, VMSize). The resource_to_telemetry_conversion setting copies them onto metrics, but they still reach the CW exporters under their OTel names.
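
To make the gap concrete, here is a sketch of the kind of rename step that is currently missing. Nothing like this exists in the PR; `otelToCloudWatch` and `renameAttributes` are hypothetical, and the three pairs simply follow the examples in the paragraph above.

```go
package main

import "fmt"

// otelToCloudWatch is a hypothetical mapping from OTel semantic convention
// attribute names to CloudWatch-style dimension names.
var otelToCloudWatch = map[string]string{
	"host.id":       "InstanceId",
	"cloud.region":  "Region",
	"azure.vm.size": "VMSize",
}

// renameAttributes returns a copy of attrs with known OTel names replaced by
// their CloudWatch equivalents. Unknown attributes are passed through as-is.
func renameAttributes(attrs map[string]string) map[string]string {
	out := make(map[string]string, len(attrs))
	for name, value := range attrs {
		if cw, ok := otelToCloudWatch[name]; ok {
			name = cw
		}
		out[name] = value
	}
	return out
}

func main() {
	// Sample attribute values are made up for illustration.
	attrs := map[string]string{"host.id": "vm-123", "cloud.region": "eastus"}
	fmt.Println(renameAttributes(attrs))
}
```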

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested on Ubuntu hosts in AWS and Azure by emitting CPU/MEM/DISK metrics

  • AWS: [Screenshot 2026-02-18 at 4:24:58 PM]
  • Azure: [Screenshot 2026-02-18 at 4:25:05 PM]

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

movence requested a review from a team as a code owner February 18, 2026 21:28
movence force-pushed the hsookim/multi-cloud/metadata branch from 32e08fe to 55df7e2 February 20, 2026 15:47
movence added the ready for testing label Feb 20, 2026
Comment thread internal/cloudmetadata/factory.go Outdated
// on hosts where IMDS is unreachable.
func GetProvider() Provider {
	once.Do(func() {
		ctx, cancel := context.WithTimeout(context.Background(), detectTimeout)

Review comment (Contributor):

The same context is used for both the aws.NewProvider and the azure.NewProvider. If aws.NewProvider takes 3 seconds and times out, then the azure.NewProvider will never get run, since the context will already have been cancelled by the time it gets there.

They cannot share contexts.

Comment on lines +56 to +83
func createDiskProvider(ctx context.Context, set processor.Settings) DiskProvider {
	p := cloudmetadata.GetProvider()
	if p == nil {
		set.Logger.Warn("No cloud provider detected, disktagger will not tag disks")
		return nil
	}

	switch p.CloudProvider() {
	case cloudprovider.AWS:
		credConfig := &configaws.CredentialsConfig{
			Region: p.Region(),
		}
		awsCfg, err := credConfig.LoadConfig(ctx)
		if err != nil {
			set.Logger.Warn("Failed to load AWS config for disktagger", zap.Error(err))
			return nil
		}
		set.Logger.Info("disktagger: using AWS EBS provider", zap.String("instanceID", p.InstanceID()), zap.String("region", p.Region()))
		return awsprovider.NewProvider(ec2.NewFromConfig(awsCfg), p.InstanceID())
	case cloudprovider.Azure:
		set.Logger.Info("disktagger: using Azure managed disk provider")
		ap := azureprovider.NewProvider()
		return newMapProvider(ap.DeviceToDiskID)
	default:
		set.Logger.Warn("Unsupported cloud provider for disktagger")
		return nil
	}
}

Review comment (Contributor):

It looks like cloudmetadata.GetProvider() is only being used here to determine which cloud the agent is running in. Isn't this something that could be determined at translation time and configured on this component?

It looks like all other cloudmetadata.GetProvider() usages are at translation time. Don't think it's worth calling and caching the IMDS info just for this switch case at runtime. We would also be able to add a validation check for the config.

	next consumer.Metrics,
) (processor.Metrics, error) {
	c := cfg.(*Config)
	provider := createDiskProvider(ctx, set)

Review comment (Contributor):

nit: This should be created as part of the processor's constructor.

Comment on lines +12 to +15
type Config struct {
	RefreshInterval  time.Duration `mapstructure:"refresh_interval"`
	DiskDeviceTagKey string        `mapstructure:"disk_device_tag_key"`
}

Review comment (Contributor):

Can we add godocs for these fields?

return ""
}
assert.ErrorIs(t, c.Refresh(), errNoProviders)
assert.ErrorIs(t, c.Refresh(context.Background()), errNoProviders)

Review comment (Contributor):

nit: Could use t.Context()

cfg.Override = false

requested := collectRequestedAttributes(conf)
configureEC2Attributes(cfg, requested)

Review comment (Contributor):

Is this too pre-emptive? We haven't switched the EC2 metadata over to the resourcedetection processor yet.

Review comment (Contributor):

Missing unit tests for factory functions.

}

// Determine which device resolution method to use.
useSymlinks := symlinkAvailable()

Review comment (Contributor):

Does this need to happen on each refresh?

}
}

func (p *Provider) DeviceToDiskID(ctx context.Context) (map[string]string, error) {

Review comment (Contributor):

Missing unit testing. Consider following something similar to internal/volume/host_linux.go which allows overrides for the os functions.

//
// Example: /dev/disk/azure/root → ../../sda → "sda"
func resolveSymlink(path string) string {
	for _, prefix := range []string{"", "/rootfs"} {

Review comment (Contributor):

Can we determine the prefix once (in the constructor perhaps) and store it on the provider? Seems like a waste to try "" every time if it's never going to be right. Other option is to make it configurable.
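
A minimal sketch of the first option: probe the candidate prefixes once and keep the result for later lookups. `detectRootPrefix`, the injected `exists` function, and the probed directory are assumptions made for illustration. The candidate list mirrors the `""` and `"/rootfs"` values in the snippet above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// azureDiskDir is the symlink directory mentioned in the PR description.
const azureDiskDir = "/dev/disk/azure"

// detectRootPrefix (hypothetical) returns the first candidate prefix under
// which the Azure symlink directory exists, and false if none does.
// exists is injected so the check can be faked in tests.
func detectRootPrefix(exists func(path string) bool, candidates ...string) (string, bool) {
	for _, prefix := range candidates {
		if exists(filepath.Join(prefix, azureDiskDir)) {
			return prefix, true
		}
	}
	return "", false
}

// dirExists is the real filesystem check used outside of tests.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func main() {
	// Called once, for example in the provider constructor; the result would
	// then be stored on the provider and reused by every refresh.
	prefix, found := detectRootPrefix(dirExists, "", "/rootfs")
	fmt.Printf("prefix=%q found=%v\n", prefix, found)
}
```

The same stored result could also answer the earlier question about whether symlink availability needs to be re-checked on every refresh.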

github-actions (Contributor) commented:

This PR was marked stale due to lack of activity.

github-actions bot added the Stale label Mar 12, 2026