Implement Runtime NVMe Instance Storage Discovery Using AWS EBS Symlinks#396
Implement Runtime NVMe Instance Storage Discovery Using AWS EBS Symlinks#396neddp wants to merge 15 commits into
Conversation
rkoster
left a comment
There was a problem hiding this comment.
In general I would have expected this logic to go into the https://github.com/cloudfoundry/bosh-agent/tree/main/infrastructure/devicepathresolver package.
Thank you for the review! That's was a big oversight on my end, I'll look into it. |
|
No worries 🙂 |
|
We discussed this during the FI WG meeting and this have to relay on the stemcell agent settings and agent strategy for disc handling. |
|
As discussed during the working group meeting, focus is now on validating: cloudfoundry/bosh-aws-cpi-release#196 (comment) |
|
As per: cloudfoundry/bosh-aws-cpi-release#196 (comment) this change is still needed. Please continue reviewing. |
* Refactor instance storage discovery into configurable component Implement auto-detection for instance storage disk type * Fix windows tests * Fix windows tests (but for real this time)
e7d00b4 to
dcd857a
Compare
|
@neddp could you take a look at these failing unit tests? |
|
Hi @rkoster, We still haven't had the time to test the changes on an actual deployment. I will move the PR to draft until we can confirm everything is working fine. We'll address the tests as well. |
* Make implementation iaas-agnostic * Rename storage resolver files * Fix tests * Remove instance storage resolver * Don't use the aws pattern as default * Refactor NVMe instance storage discovery and remove unused symlink patterns * Enhance NVMe instance storage discovery with managed volume pattern support * Fix unit tests * Don't run windows unit tests when not supported * Simplify FakeDevicePathResolver by removing unused fields and methods * Wait for udev to settle before resolving EBS symlinks * Add debug logs * Import udev and add comment about why it's needed
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a SymlinkDeviceResolver (NVMe constants, constructor) with ResolveSymlinksToDevices, GetDevicesByPattern, and FilterDevices. Implements udev trigger/settle, globbing, read-and-follow link resolution, and exclusion filtering. Adds Ginkgo tests for resolving, glob errors, and filtering. Wires a SymlinkDeviceResolver into NewProvider and passes it into NewLinuxPlatform to enable NVMe instance-storage discovery. Refactors SetupRawEphemeralDisks to discover instance-storage device paths (NVMe glob + symlink exclusion or identity resolution), sorts and validates discovered devices, and partitions discovered devices rather than per-CPI-disk resolved paths. Also updates fake resolver recording and platform tests to inject and use the new resolver. Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@infrastructure/devicepathresolver/symlink_device_resolver_test.go`:
- Around line 88-90: The test file symlink_device_resolver_test.go has a
formatting/indentation issue around the assertion block that calls
resolver.ResolveSymlinksToDevices and checks the error; run goimports (or gofmt)
on the file to normalize imports and indentation so the block with
ResolveSymlinksToDevices, Expect(err).To(HaveOccurred()), and
Expect(err.Error()).To(ContainSubstring("nvme-invalid")) is properly formatted.
Ensure the file compiles and lints cleanly after running the formatter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9d923cdb-2fa3-4c19-bc3c-66c23b259996
📒 Files selected for processing (4)
infrastructure/devicepathresolver/symlink_device_resolver.goinfrastructure/devicepathresolver/symlink_device_resolver_test.goplatform/linux_platform.goplatform/linux_platform_test.go
ResolveSymlinksToDevices now logs a warning and continues when a symlink cannot be resolved (e.g. stale/broken symlinks in /dev/disk/by-id/). This prevents unnecessary deploy failures while the count validation in discoverNVMeInstanceStorage still catches any real mismatches. Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds runtime NVMe instance-storage discovery on Nitro-based clouds by enumerating NVMe block devices and excluding IaaS-managed volumes via /dev/disk/by-id symlinks, then partitions only the discovered instance-storage devices (with updated tests and fakes).
Changes:
- Injects a new
SymlinkDeviceResolverintoLinuxPlatformto support symlink-based device discovery. - Updates
SetupRawEphemeralDisksto discover and partition instance-storage NVMe devices by filtering out managed volumes, with strict count validation. - Adds/updates Ginkgo tests (including new resolver tests) and adjusts the fake device path resolver to track multiple calls.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| platform/provider.go | Constructs and passes a symlink-based resolver into Linux platforms. |
| platform/linux_platform.go | Implements NVMe instance-storage discovery and uses it during raw ephemeral disk setup. |
| platform/linux_platform_test.go | Adds coverage for NVMe discovery scenarios and keeps existing raw ephemeral disk behavior covered. |
| infrastructure/devicepathresolver/symlink_device_resolver.go | New resolver: glob devices, resolve managed-volume symlinks, filter devices, and udev trigger/settle. |
| infrastructure/devicepathresolver/symlink_device_resolver_test.go | Unit tests for symlink resolution, globbing, and filtering behavior. |
| infrastructure/devicepathresolver/fakes/fake_device_path_resolver.go | Changes fake to record all GetRealDevicePath calls (slice), enabling multi-call assertions. |
| infrastructure/devicepathresolver/virtio_device_path_resolver_test.go | Updates assertions to match fake’s new call-recording behavior. |
| infrastructure/devicepathresolver/scsi_device_path_resolver_test.go | Updates assertions to match fake’s new call-recording behavior. |
| infrastructure/devicepathresolver/fallback_device_path_resolver_test.go | Updates assertions to match fake’s new call-recording behavior. |
| agent/bootstrap_test.go | Updates platform construction in tests to provide the new symlink resolver dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@neddp could you look at these copilot suggestions? |
aramprice
left a comment
There was a problem hiding this comment.
Can you take a look at the copilot suggestions.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
DiskSettings does not implement fmt.Stringer, so using %s produced malformed %!s(...) output. Switch to %+v for actionable error messages.
|
Both suggestions were addressed. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
platform/linux_platform.go (1)
859-875:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFail closed when managed-volume symlink resolution is incomplete.
managedDevicesis treated as authoritative here, but the resolver upstream currently skips unreadable symlinks. If one EBS symlink is missed and the filtered count still matcheslen(devices), the code below canmklabela managed volume or even the root disk. Please make unresolved managed-volume symlinks abort discovery instead of continuing with a partial exclusion set.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/linux_platform.go` around lines 859 - 875, ResolveSymlinksToDevices currently may skip unreadable symlinks, letting managedDevices be incomplete; change the resolver to surface skipped/unresolved symlinks (e.g., change ResolveSymlinksToDevices to return (devices []string, skipped int, err error) or return an error when any symlink cannot be read) and update this call site in linux_platform.go to treat any skipped/unresolved count or non-nil error as fatal: after calling p.symlinkDeviceResolver.ResolveSymlinksToDevices, if skipped>0 (or err != nil) return an error instead of continuing, so managedDevices cannot be partial before calling FilterDevices/instanceStorage and proceeding with mklabel operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@platform/linux_platform.go`:
- Around line 883-888: The loop calling devicePathResolver.GetRealDevicePath
ignores the boolean timedOut return; update the loop in linux_platform.go (where
devicePathResolver.GetRealDevicePath is invoked) to check the timedOut flag and
treat it as an explicit failure: if timedOut or realPath == "" return a wrapped
error (similar to other call sites) instead of continuing, so the function
returns a clear timeout error rather than allowing an empty path to be passed to
parted; use the same bosherr.WrapErrorf pattern and include context mentioning
the device and that resolution timed out.
---
Outside diff comments:
In `@platform/linux_platform.go`:
- Around line 859-875: ResolveSymlinksToDevices currently may skip unreadable
symlinks, letting managedDevices be incomplete; change the resolver to surface
skipped/unresolved symlinks (e.g., change ResolveSymlinksToDevices to return
(devices []string, skipped int, err error) or return an error when any symlink
cannot be read) and update this call site in linux_platform.go to treat any
skipped/unresolved count or non-nil error as fatal: after calling
p.symlinkDeviceResolver.ResolveSymlinksToDevices, if skipped>0 (or err != nil)
return an error instead of continuing, so managedDevices cannot be partial
before calling FilterDevices/instanceStorage and proceeding with mklabel
operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ab6aff62-71d1-4d01-9a14-e284e52205e7
📒 Files selected for processing (1)
platform/linux_platform.go
Silently skipping a symlink that cannot be resolved leaves the managed device exclusion set incomplete. If an EBS volume's by-id symlink is broken, FilterDevices would not exclude it and the device could be misidentified as instance storage, potentially causing data loss. Return a wrapped error instead of continuing, so callers can propagate the failure rather than proceeding with a partial exclusion set.
The platform-level test still expected the old skip-and-continue behavior. Updated to assert that a broken managed volume symlink returns an error.
|
/coderabbitai review |
Problem
On AWS Nitro-based instances with NVMe devices, the kernel's PCIe enumeration order is non-deterministic. This means:
/dev/nvme0n1could be the root EBS volume OR instance storage/dev/nvme1n1could be instance storage OR the root EBS volumeSolution
Implemented runtime discovery to reliably identify instance storage by excluding EBS volumes.
Discovery Algorithm
Why EBS Symlinks Are Reliable
AWS automatically creates persistent symlinks for all EBS volumes via udev rules:
/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol{volume_id}Backwards Compatibility
Non-NVMe instances: No changes to behavior
/dev/xvdb,/dev/sdb) use CPI paths directlyThis must be merged together with the CPI changes - cloudfoundry/bosh-aws-cpi-release#196
Pair @Ivaylogi98