Ensure we use a stable DUID for DHCPv6 exchanges #267

Open
bnaecker wants to merge 1 commit into main from ben/consistent-dhcpv6-duids

@bnaecker bnaecker commented May 8, 2026

  • Package ndpd.conf in the switch zone with defaults that preclude DHCPv6 on any interface.
  • After fetching the correct, stable MAC addresses from the switch SP, dpd now uses the base MAC to write out a DUID to a file where illumos's dhcpagent can pick it up and use it later in exchanges.
  • For the VLANs representing the techports, tfportd now creates a link-local address and also allows DHCPv6 to run on the interface. This should trigger dhcpagent, which would pick up the stable DUID from the previous item.
  • Some misc cleanup, logging improvements, IdOrdMap over BTreeMap

@bnaecker bnaecker marked this pull request as draft May 8, 2026 18:10
@bnaecker bnaecker force-pushed the ben/consistent-dhcpv6-duids branch 4 times, most recently from eb9ed95 to e6e15de Compare May 14, 2026 22:51
@bnaecker bnaecker force-pushed the ben/consistent-dhcpv6-duids branch 4 times, most recently from 69cca33 to 46c1981 Compare May 19, 2026 04:20
@AlejandroME AlejandroME added this to the 20 milestone May 19, 2026
@bnaecker
Contributor Author

This appears to work, barring the child-reaping issue discussed in oxidecomputer/tofino-sde#15. But the actual functionality of writing the DUID file and starting DHCP on the techports works. I've confirmed that they start to send Solicit messages, and that they receive a response, albeit with no addresses because the lab environment has none. I'll add more testing details once my build without the reaping problem works its way through the pipeline.

I should note that we're not doing the "usual" thing here of waiting for an NDP RA with the "Managed Configuration" bit set before starting DHCPv6 -- we start it unconditionally. We could in theory wait for the RA, by adding a line to ndpd.conf and then restarting in.ndpd. I chose the unconditional approach because it seemed less intrusive, but I'm open to the other route too.

@bnaecker bnaecker marked this pull request as ready for review May 20, 2026 01:03
@bnaecker
Contributor Author

I talked this through with @rmustacc earlier today. I think it's actually better to take the other path I mentioned above: restarting in.ndpd after reworking its configuration file to allow DHCPv6 on these interfaces. That does require a restart of that one daemon, but it also ensures that, after that restart, the rest of the system continues to work as usual. We don't need dpd to continually ensure that DHCPv6 is running, for example; that's in.ndpd's job.

I'm going to rework that bit of the code to reflect that now, and give this another test on a racklet.
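
As a rough illustration of the reworked configuration, here is a minimal sketch that renders an `ndpd.conf` with DHCPv6 disabled by default and enabled only on named interfaces. The line format follows the `ndpd.conf` contents shown in the test output later in this thread; the function name `render_ndpd_conf` is hypothetical and is not taken from the dpd source.

```rust
/// Render an `ndpd.conf` that disables stateful address configuration
/// (DHCPv6) by default and enables it only on the named interfaces.
///
/// Hypothetical helper for illustration; not the actual dpd code.
pub fn render_ndpd_conf(dhcpv6_interfaces: &[&str]) -> String {
    // Default for every interface: do not start DHCPv6.
    let mut conf = String::from("ifdefault StatefulAddrConf false\n");
    // Per-interface overrides for the interfaces that should run DHCPv6.
    for iface in dhcpv6_interfaces {
        conf.push_str(&format!("if {iface} StatefulAddrConf true\n"));
    }
    conf
}
```

Calling `render_ndpd_conf(&["techport0", "techport1"])` yields the three-line file seen on the switch zone in the later test run.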

Contributor

@rcgoodfellow rcgoodfellow left a comment

Thanks @bnaecker. Comments follow.

Comment thread .github/buildomat/common.sh Outdated
Comment thread dpd/src/dhcpv6/illumos.rs Outdated
Comment thread dpd/src/dhcpv6/illumos.rs Outdated
Comment thread tfportd/src/vlans.rs
Comment thread dpd/src/dhcpv6/illumos.rs Outdated
/// FMRI for the service running `in.ndpd`
const NDPD_FMRI: &str = "svc:/network/routing/ndp:default";

#[link(name = "scf")]
Contributor

Contributor Author

That doesn't expose smf_restart_instance(), which is what I intended to use. AFAICT, in.ndpd doesn't handle a refresh, only a restart.

Copy link
Copy Markdown

@jgallagher has worked on https://github.com/oxidecomputer/scuffle which is a more rusty way to do this and does expose this I believe.

Contributor

@jgallagher jgallagher May 20, 2026

It doesn't expose smf_restart_instance() directly, but the functionality is there, yeah. It'd be something like this:

let scf = Scf::connect_global_zone()?;
let instance = scf.instance_from_fmri("your-instance-fmri")?;
instance.smf_refresh()?;

I need to get back to scuffle and actually publish it on crates.io, or you could pull it in as a git dependency for now if you'd like?

Contributor Author

Thanks @rmustacc and @jgallagher. It looks like restarting the service is exposed through this method, at least on Helios v3 bits. I think all the racklets are on that, as well as Omicron, so it should be safe. I'll give it a whirl in any case.

Contributor Author

I've switched to scuffle in 3864764

Contributor Author

After some fallout moving CI in this repo over to Helios v3, things are building again. I'm going to take this for one more lap on berlin with scuffle doing the SMF work.

Contributor Author

I spoke too soon

Contributor Author

Alright, with @citrus-it's help, I've gotten past the Helios v2 -> v3 transition. I'm going for another spin on berlin shortly.

Comment thread dpd/src/dhcpv6/illumos.rs
Comment thread .github/buildomat/illumos.sh Outdated
@bnaecker
Contributor Author

Alrighty, I've gone through one more round of rack-setup on berlin, and things look correct. Dendrite started up in 1986, and we see that it got the base MAC addresses from the Sidecar SP before time synced:

root@oxz_switch0:~# grep "base MAC" $(svcs -L dendrite) | looker
00:02:00.745Z INFO dpd: no base MAC address found, fetching from Sidecar FRUID
00:02:06.666Z INFO dpd: resetting base MAC address
    new = a8:40:25:05:35:02
    old = Temporary(a8:40:25:74:d6:f0)

Next, the DHCPv6 task that sets up this whole Rube Goldberg machine starts. It writes the DUID to disk, updates the NDP configuration, and restarts in.ndpd:

root@oxz_switch0:~# grep dhcpv6-task $(svcs -L dendrite) | looker
00:02:06.667Z DEBG dpd: wrote DUID to tempfile
    unit = dhcpv6-task
00:02:06.668Z INFO dpd: wrote stable DHCPv6 DUID
    MAC = 0xA84025053502
    path = /etc/dhcp/duid
    unit = dhcpv6-task
00:02:06.668Z INFO dpd: wrote DUID based on MAC to disk
    MAC = a8:40:25:05:35:02
    path = /etc/dhcp/duid
    unit = dhcpv6-task
00:02:06.668Z INFO dpd: updated in.ndpd configuration file
    path = /etc/inet/ndpd.conf
    unit = dhcpv6-task
00:02:06.670Z INFO dpd: restarted `in.ndpd`
    unit = dhcpv6-task

The DUID and ndpd.conf files are correct:

root@oxz_switch0:~# cat /etc/inet/ndpd.conf
ifdefault StatefulAddrConf false
if techport0 StatefulAddrConf true
if techport1 StatefulAddrConf true
root@oxz_switch0:~# xxd /etc/dhcp/duid
00000000: 0003 0001 a840 2505 3502                 .....@%.5.
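
The ten bytes in the `xxd` output match the DUID-LL layout from RFC 8415: a 2-byte DUID type (3), a 2-byte hardware type (1 for Ethernet), then the link-layer address. A minimal sketch of that encoding follows; the function and constant names are hypothetical and not taken from the dpd source.

```rust
/// DUID type code for DUID-LL (link-layer address), per RFC 8415.
const DUID_TYPE_LL: u16 = 3;
/// IANA hardware type for Ethernet.
const HW_TYPE_ETHERNET: u16 = 1;

/// Encode a DUID-LL for an Ethernet MAC address.
///
/// Layout: 2-byte DUID type and 2-byte hardware type, both big-endian,
/// followed by the 6-byte link-layer address.
pub fn duid_ll_from_mac(mac: [u8; 6]) -> [u8; 10] {
    let mut duid = [0u8; 10];
    duid[0..2].copy_from_slice(&DUID_TYPE_LL.to_be_bytes());
    duid[2..4].copy_from_slice(&HW_TYPE_ETHERNET.to_be_bytes());
    duid[4..10].copy_from_slice(&mac);
    duid
}
```

Passing the base MAC `a8:40:25:05:35:02` from the log produces the same byte sequence that `xxd` shows for `/etc/dhcp/duid`.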

in.ndpd was started at the time indicated in the log file:

root@oxz_switch0:~# cat $(svcs -L ndp)
[ Dec 28 00:01:24 Disabled. ]
[ Dec 28 00:01:29 Enabled. ]
[ Dec 28 00:01:31 Executing start method ("/lib/svc/method/svc-ndp"). ]
[ Dec 28 00:01:35 Method "start" exited with status 0. ]
[ Dec 28 00:02:06 Stopping because service restarting. ]
[ Dec 28 00:02:06 Executing stop method (:kill). ]
[ Dec 28 00:02:06 Executing start method ("/lib/svc/method/svc-ndp"). ]
[ Dec 28 00:02:18 Method "start" exited with status 0. ]

And last, DHCPv6 is actually operating on these interfaces now:

root@oxz_switch0:~# ipadm | grep techport
techport0/ll      addrconf ok           fe80::aa40:25ff:fe05:3502%techport0/10
techport0/v6      static   ok           fdb1:a840:2504:211::1/10
techport0/ll      addrconf ok           fdb1:a840:2504:211:aa40:25ff:fe05:3502/64
techport1/ll      addrconf ok           fe80::aa40:25ff:fe05:3502%techport1/10
techport1/?       dhcp     ok           fd75:23e9:4173::f/128
techport1/v6      static   ok           fdb2:a840:2504:211::1/10
techport1/ll      addrconf ok           fdb2:a840:2504:211:aa40:25ff:fe05:3502/64
root@oxz_switch0:~# ifconfig techport0 inet6 dhcp status
Interface  State         Sent  Recv  Declined  Flags
techport0  SELECTING       10     0         0  [V6]
root@oxz_switch0:~# ifconfig techport1 inet6 dhcp status
Interface  State         Sent  Recv  Declined  Flags
techport1  BOUND            4     2         0  [V6]
(Began, Expires, Renew) = (05/20/2026 22:13, 05/21/2026 00:13, 05/20/2026 23:13)
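
The link-local addresses in the `ipadm` output are consistent with the modified EUI-64 derivation from the base MAC: flip the universal/local bit of the first octet and insert `ff:fe` in the middle. The sketch below assumes that is how the interface identifier was formed; the function names are hypothetical.

```rust
use std::net::Ipv6Addr;

/// Derive the modified EUI-64 interface identifier from a 48-bit MAC:
/// flip the universal/local bit and insert `ff:fe` in the middle.
pub fn eui64_interface_id(mac: [u8; 6]) -> [u8; 8] {
    [mac[0] ^ 0x02, mac[1], mac[2], 0xff, 0xfe, mac[3], mac[4], mac[5]]
}

/// Build the `fe80::/64` link-local address for a MAC.
pub fn link_local_from_mac(mac: [u8; 6]) -> Ipv6Addr {
    let mut octets = [0u8; 16];
    // Link-local prefix fe80::/64; bytes 2..8 stay zero.
    octets[0] = 0xfe;
    octets[1] = 0x80;
    // Low 64 bits are the interface identifier.
    octets[8..].copy_from_slice(&eui64_interface_id(mac));
    Ipv6Addr::from(octets)
}
```

For `a8:40:25:05:35:02`, the first octet becomes `0xaa`, giving `fe80::aa40:25ff:fe05:3502`, which matches both techport link-local addresses above.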

The DHCPv6 server in the lab confirms it:

asilomar-arista-dmgmt#show dhcp server leases
fd75:23e9:4173::d
End: 2026/05/20 23:44:07 UTC
Last transaction: 2026/05/20 21:44:07 UTC
MAC address: a840.257d.0aea

fd75:23e9:4173::f
End: 2026/05/21 00:13:43 UTC
Last transaction: 2026/05/20 22:13:43 UTC
MAC address: a840.2505.3502

fd75:23e9:4173::ffff
End: 2026/05/21 00:14:30 UTC
Last transaction: 2026/05/20 22:14:30 UTC
MAC address: 1cfd.0878.c8de

asilomar-arista-dmgmt#show dhcp server ipv6
IPv6 DHCP server is active
DNS server(s):
DNS domain name:
Lease duration: 0 days 2 hours 0 minutes
Active leases: 3
IPv6 DHCP interface status:
   Interface     Status
---------------- -----------------
   Ethernet3     Active
   Ethernet33    Inactive (Not up)
   Ethernet34    Inactive (Not up)
   Ethernet47    Active
   Ethernet48    Active


Subnet: fd00:a:a:a::/127
Subnet name:
Range: fd00:a:a:a::1 to fd00:a:a:a::1 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::/127
Subnet name:
Range: fd75:23e9:4173::1 to fd75:23e9:4173::1 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::2/127
Subnet name:
Range: fd75:23e9:4173::3 to fd75:23e9:4173::3 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::4/127
Subnet name:
Range: fd75:23e9:4173::5 to fd75:23e9:4173::5 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::6/127
Subnet name:
Range: fd75:23e9:4173::7 to fd75:23e9:4173::7 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::8/127
Subnet name:
Range: fd75:23e9:4173::9 to fd75:23e9:4173::9 (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::a/127
Subnet name:
Range: fd75:23e9:4173::b to fd75:23e9:4173::b (0/1 addresses leased)
DNS server(s):
Direct: Inactive (No IPv6 addresses on interfaces match this subnet)
Relay: Active
Active leases: 0

Subnet: fd75:23e9:4173::c/127
Subnet name:
Range: fd75:23e9:4173::d to fd75:23e9:4173::d (1/1 addresses leased)
DNS server(s):
Direct: Active
Relay: Active
Active leases: 1

Subnet: fd75:23e9:4173::e/127
Subnet name:
Range: fd75:23e9:4173::f to fd75:23e9:4173::f (1/1 addresses leased)
DNS server(s):
Direct: Active
Relay: Active
Active leases: 1

Subnet: fd75:23e9:4173::fffe/127
Subnet name:
Range: fd75:23e9:4173::ffff to fd75:23e9:4173::ffff (1/1 addresses leased)
DNS server(s):
Direct: Active
Relay: Active
Active leases: 1

It appears that the server is configured to lease only one address to this rack, because techport0 is still actively attempting to select an address. I'm not sure how to confirm that on the switch. There are leasable ranges, but each has only one address in it. techport1 looks like it got its address first, and presumably the server is configured so that techport0 needs to be assigned out of the same range, which is now fully leased.

@rcgoodfellow @Nieuwejaar and maybe @citrus-it or @jgallagher, let me know if y'all have any more feedback on this.

@bnaecker
Contributor Author

Also, the last thing to confirm is that we're actually using the DUID-LL based on this stable MAC as the client identifier option in the Solicit message from the host:

^Croot@oxz_switch0:~# snoop -r -d techport0 -v dhcp6
Using device techport0 (promiscuous mode)
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 22:33:26.82491
ETHER:  Packet size = 122 bytes
ETHER:  Destination = 33:33:0:1:0:2, (multicast)
ETHER:  Source      = a8:40:25:5:35:2,
ETHER:  Ethertype = 86DD (IPv6)
ETHER:
IPv6:   ----- IPv6 Header -----
IPv6:
IPv6:   Version = 6
IPv6:   Traffic Class = 0
IPv6:   Flow label = 0x0
IPv6:   Payload length = 68
IPv6:   Next Header = 17 (UDP)
IPv6:   Hop Limit = 1
IPv6:   Source address = fe80::aa40:25ff:fe05:3502
IPv6:   Destination address = ff02::1:2
IPv6:
UDP:  ----- UDP Header -----
UDP:
UDP:  Source port = 546
UDP:  Destination port = 547 (DHCPv6S)
UDP:  Length = 68
UDP:  Checksum = 7535
UDP:
DHCPv6: ----- Dynamic Host Configuration Protocol Version 6 -----
DHCPv6:
DHCPv6: Message type (msg-type) = 1 (Solicit)
DHCPv6: Transaction ID = bf8004
DHCPv6:
DHCPv6: Option Code = 1 (Client Identifier)
DHCPv6:   DUID Type = 3 (Link-layer Address)
DHCPv6:   Hardware Type = 1 (Ethernet (10Mb))
DHCPv6:   Link Layer Address = a8:40:25:05:35:02
DHCPv6: Option Code = 3 (Identity Association for Non-temporary Addresses)
DHCPv6:   IAID = 80
DHCPv6:   T1 (renew) = 0 seconds
DHCPv6:   T2 (rebind) = 0 seconds
DHCPv6: Option Code = 6 (Option Request)
DHCPv6:   Requested Option Code = 7 (Preference)
DHCPv6:   Requested Option Code = 12 (Server Unicast)
DHCPv6:   Requested Option Code = 23 (DNS Recursive Name Server)
DHCPv6:   Requested Option Code = 24 (Domain Search List)
DHCPv6:   Requested Option Code = 27 (Network Information Service Servers)
DHCPv6:   Requested Option Code = 29 (Network Information Service Domain Name)
DHCPv6: Option Code = 14 (Rapid Commit)
DHCPv6: Option Code = 8 (Elapsed Time)
DHCPv6:   Elapsed Time = 655.35 seconds
DHCPv6:
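
The same check that `snoop` does on the Client Identifier option can be sketched in code: parse the DUID bytes and confirm that the type is 3 (DUID-LL) and the address is the stable base MAC. This is only an illustrative decoder under the RFC 8415 layout; the type and function names are hypothetical.

```rust
/// A parsed DUID-LL: hardware type plus link-layer address bytes.
#[derive(Debug, PartialEq, Eq)]
pub struct DuidLl<'a> {
    pub hardware_type: u16,
    pub link_layer_addr: &'a [u8],
}

/// Parse a DUID, returning `Some` only for type 3 (DUID-LL).
pub fn parse_duid_ll(duid: &[u8]) -> Option<DuidLl<'_>> {
    // Need at least the 2-byte type and 2-byte hardware type.
    if duid.len() < 4 {
        return None;
    }
    let duid_type = u16::from_be_bytes([duid[0], duid[1]]);
    if duid_type != 3 {
        return None;
    }
    Some(DuidLl {
        hardware_type: u16::from_be_bytes([duid[2], duid[3]]),
        // Everything after the 4-byte header is the link-layer address.
        link_layer_addr: &duid[4..],
    })
}
```

Feeding it the contents of `/etc/dhcp/duid` shown earlier yields hardware type 1 and the address `a8:40:25:05:35:02`, the same values that appear in the Solicit message.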

@bnaecker
Contributor Author

Alright, after some last-minute help from Ry, I've convinced myself this is in fact working as intended. We get DHCPv6 addresses on berlin's techport1 on each of the two switches (not techport0). Those are the ports connected to the Arista switch in the lab, running the DHCPv6 server.

- Package `ndpd.conf` in the switch zone with defaults that preclude
  DHCPv6 on any interface.
- After fetching the correct, stable MAC addresses from the switch SP,
  `dpd` now uses the base MAC to write out a DUID to a file where
  illumos's `dhcpagent` can pick it up and use it later in exchanges. It
  also modifies the `ndpd.conf` to allow DHCPv6 on the technician ports,
  and restarts `in.ndpd` so that the DHCP client daemon is managed
  normally.
- Some misc cleanup, logging improvements, `IdOrdMap` over `BTreeMap`
@bnaecker bnaecker force-pushed the ben/consistent-dhcpv6-duids branch from 697223f to 931a663 Compare May 21, 2026 04:06