Support metadata resolution for multi-cloud envs#2032
Support metadata resolution for multi-cloud envs#2032movence wants to merge 17 commits intofeature-multi-cloudfrom
Conversation
32e08fe to
55df7e2
Compare
| // on hosts where IMDS is unreachable. | ||
| func GetProvider() Provider { | ||
| once.Do(func() { | ||
| ctx, cancel := context.WithTimeout(context.Background(), detectTimeout) |
There was a problem hiding this comment.
The same context is used for both the aws.NewProvider and the azure.NewProvider. If aws.NewProvider takes 3 seconds and times out, then the azure.NewProvider will never get run, since the context will already have been cancelled by the time it gets there.
They cannot share contexts.
| func createDiskProvider(ctx context.Context, set processor.Settings) DiskProvider { | ||
| p := cloudmetadata.GetProvider() | ||
| if p == nil { | ||
| set.Logger.Warn("No cloud provider detected, disktagger will not tag disks") | ||
| return nil | ||
| } | ||
|
|
||
| switch p.CloudProvider() { | ||
| case cloudprovider.AWS: | ||
| credConfig := &configaws.CredentialsConfig{ | ||
| Region: p.Region(), | ||
| } | ||
| awsCfg, err := credConfig.LoadConfig(ctx) | ||
| if err != nil { | ||
| set.Logger.Warn("Failed to load AWS config for disktagger", zap.Error(err)) | ||
| return nil | ||
| } | ||
| set.Logger.Info("disktagger: using AWS EBS provider", zap.String("instanceID", p.InstanceID()), zap.String("region", p.Region())) | ||
| return awsprovider.NewProvider(ec2.NewFromConfig(awsCfg), p.InstanceID()) | ||
| case cloudprovider.Azure: | ||
| set.Logger.Info("disktagger: using Azure managed disk provider") | ||
| ap := azureprovider.NewProvider() | ||
| return newMapProvider(ap.DeviceToDiskID) | ||
| default: | ||
| set.Logger.Warn("Unsupported cloud provider for disktagger") | ||
| return nil | ||
| } | ||
| } |
There was a problem hiding this comment.
It looks like cloudmetadata.GetProvider() is only being used here to determine which cloud the agent is running in. Isn't this something that could be determined at translation time and configured on this component?
It looks like all other cloudmetadata.GetProvider() usages are at translation time. Don't think it's worth calling and caching the IMDS info just for this switch case at runtime. We would also be able to add a validation check for the config.
| next consumer.Metrics, | ||
| ) (processor.Metrics, error) { | ||
| c := cfg.(*Config) | ||
| provider := createDiskProvider(ctx, set) |
There was a problem hiding this comment.
nit: This should be created as part of the processor's constructor.
| type Config struct { | ||
| RefreshInterval time.Duration `mapstructure:"refresh_interval"` | ||
| DiskDeviceTagKey string `mapstructure:"disk_device_tag_key"` | ||
| } |
There was a problem hiding this comment.
Can we add godocs for these fields?
| return "" | ||
| } | ||
| assert.ErrorIs(t, c.Refresh(), errNoProviders) | ||
| assert.ErrorIs(t, c.Refresh(context.Background()), errNoProviders) |
There was a problem hiding this comment.
nit: Could use t.Context()
| cfg.Override = false | ||
|
|
||
| requested := collectRequestedAttributes(conf) | ||
| configureEC2Attributes(cfg, requested) |
There was a problem hiding this comment.
Is this too pre-emptive? We haven't switched the EC2 metadata over to the resourcedetection processor yet.
There was a problem hiding this comment.
Missing unit tests for factory functions.
| } | ||
|
|
||
| // Determine which device resolution method to use. | ||
| useSymlinks := symlinkAvailable() |
There was a problem hiding this comment.
Does this need to happen on each refresh?
| } | ||
| } | ||
|
|
||
| func (p *Provider) DeviceToDiskID(ctx context.Context) (map[string]string, error) { |
There was a problem hiding this comment.
Missing unit testing. Consider following something similar to internal/volume/host_linux.go which allows overrides for the os functions.
| // | ||
| // Example: /dev/disk/azure/root → ../../sda → "sda" | ||
| func resolveSymlink(path string) string { | ||
| for _, prefix := range []string{"", "/rootfs"} { |
There was a problem hiding this comment.
Can we determine the prefix once (in the constructor perhaps) and store it on the provider? Seems like a waste to try "" every time if it's never going to be right. Other option is to make it configurable.
|
This PR was marked stale due to lack of activity. |
Description of changes
Adds multi-cloud support for metric decoration and placeholder resolution. Azure (and future cloud providers)
use OTel's resourcedetectionprocessor for metadata, while EC2 continues using ec2tagger. Disk/volume tagging
is extracted into a standalone
disktaggerprocessor that supports both AWS and Azure.Also include test fixes from #2010 and minor update to remove an unnecessary wrapper function (ref).
Cloud metadata (internal/cloudmetadata/)
Cloud provider enum (internal/cloudprovider/)
Re-export wrappers (otel-contrib)
resourcedetection processor translator
ec2tagger changes
disktagger processor (plugins/processors/disktagger/)
Pipeline translator
Placeholder resolution
Config examples
EC2 (backward compatible):
EC2 (OTel style):
Azure:
Known gap
Attribute renaming for non-EC2 clouds:
resourcedetectionprocessoradds OTel semantic convention attributes (e.g.
host.id,cloud.region,azure.vm.size) to the Resource. These are not automatically renamed to CloudWatchdimension names (e.g.
InstanceId,Region,VMSize). Theresource_to_telemetry_conversionsetting copies them to metrics, but with OTel names for CW exporters.License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Tested on Ubuntu hosts in AWS and Azure by emitting CPU/MEM/DISK metrics
Requirements
Before commiting your code, please do the following steps.
make fmtandmake fmt-shmake lintIntegration Tests
To run integration tests against this PR, add the
ready for testinglabel.