Background

For a long while now I’ve been wanting to enable 802.3ad Link Aggregation (LACP) in conjunction with jumbo frames on my SmartOS servers. This is a setup I’d used on Solaris 10, OpenSolaris, and Linux fileservers for years. Unfortunately, due to SmartOS’ unique installation and boot paradigm (the OS is not installed locally; it is extracted into a ramdisk from the boot archive at boot), getting it set up was not so easy. In fact, back when I set up my first SmartOS server in 2013, it wasn’t possible.

That was rectified and documented (thanks @rmustacc!) a number of years ago, but my one subsequent attempt to use it still ended in failure, so I gave up. Fast forward to yesterday, when I was standing up my first new SmartOS server in a couple of years and decided to give aggr+jumbos another shot.

Trying it out

The documentation isn’t exactly clear on how to combine both options - aggregate interfaces are set up by hand in the SmartOS /usbkey/config config file, but MTUs are set using nictagadm(1M), which also appears to set up aggregate interfaces. After looking through the contents of nictagadm (it’s a shell script) and testing it out, it appears that it’s just a handy tool for setting the MTU on the correct entities in /usbkey/config.

With the config in what was hopefully the correct state, I rebooted and prayed. Unfortunately, the network (svc:/network/physical:default) failed to start. Thankfully, SmartOS’s custom SMF method for the network is very chatty:

# cat /var/svc/log/network-physical:default.log
[ Nov  8 20:45:25 Executing start method ("/lib/svc/method/net-physical"). ]
[ Nov  8 20:45:25 Timeout override by svc.startd.  Using infinite timeout. ]
+ smf_configure_ip
+ /sbin/zonename -t
+ [ global = global -o shared = exclusive ]
+ return 0
+ LD_LIBRARY_PATH=/lib
+ export LD_LIBRARY_PATH
+ ADMIN_DHCP_TIMEOUT=300
+ ActiveAggrLinks=''
+ typeset -A ActiveAggrLinks
+ smf_netstrategy
+ smf_is_nonglobalzone
+ [ global != global ]
+ return 1
+ /sbin/netstrategy
+ set -- ufs none none
+ [ 0 -eq 0 ]
+ [ ufs = nfs ]
+ _INIT_NET_STRATEGY=none
+ export _INIT_NET_STRATEGY
+ typeset -A plumbedifs
+ smf_is_globalzone
+ [ global = global ]
+ return 0
+ /usr/sbin/dladm init-phys
+ log_if_state before
+ echo '== debug start: before =='
== debug start: before ==
+ /usr/sbin/dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
bnx0         Ethernet             unknown    0      unknown   bnx0
bnx1         Ethernet             unknown    0      unknown   bnx1
+ /sbin/ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000 
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128 
+ echo '== debug end: before =='
== debug end: before ==
+ load_sdc_sysinfo
+ boot_file_config_enabled
+ load_sdc_config
+ load_sdc_bootparams
+ sed -e 's/,/ /g'
+ echo ''
+ create_aggrs
+ typeset links macs mode mtu
+ [[ -z aggr0 ]]
+ aggrs=( aggr0 )
+ eval links='${SYSINFO_Aggregation_aggr0_Interfaces}'
+ links=bnx0,bnx1
+ eval macs='${SYSINFO_Aggregation_aggr0_MACs}'
+ macs=78:2b:cb:1b:06:9d,78:2b:cb:1b:06:9e
+ eval mode='${SYSINFO_Aggregation_aggr0_LACP_mode}'
+ mode=passive
+ eval mtu='${CONFIG_aggr0_mtu}'
+ mtu=9000
+ [[ -z off ]]
+ echo 'Creating aggr: aggr0 (mode=passive, links=bnx0,bnx1)'
Creating aggr: aggr0 (mode=passive, links=bnx0,bnx1)
+ dladm create-aggr -l bnx0 -l bnx1 -L passive aggr0
+ [[ 0 -eq 0 ]]
+ add_active_aggr_links aggr0 78:2b:cb:1b:06:9d,78:2b:cb:1b:06:9e
+ typeset alink
+ ActiveAggrLinks[78:2b:cb:1b:06:9d]=aggr0
+ ActiveAggrLinks[78:2b:cb:1b:06:9e]=aggr0
+ [[ -n 9000 ]]
+ dladm set-linkprop -p mtu=9000 aggr0
dladm: warning: cannot set link property 'mtu' on 'aggr0': operation not supported
+ [[ 1 -ne 0 ]]
+ echo 'Failed to set mtu on aggr aggr0 to 9000'
Failed to set mtu on aggr aggr0 to 9000
+ exit 95
[ Nov  8 20:45:35 Method "start" exited with status 95. ]

What’s going on here?

dladm: warning: cannot set link property 'mtu' on 'aggr0': operation not supported

This was the very problem I’d had lo those many years ago, which, at the time, I’d attributed to a need to set the MTU on the individual interfaces before creating the aggregation, e.g.:

# dladm set-linkprop -p mtu=9000 bnx0
# dladm set-linkprop -p mtu=9000 bnx1
# dladm create-aggr -l bnx0 -l bnx1 -L passive aggr0
# dladm set-linkprop -p mtu=9000 aggr0

Well, my conclusion as to the source of the error was completely wrong. As it turns out, set-linkprop fails on the aggregation because the Broadcom NetXtreme II bnx driver doesn’t support runtime configuration of the physical MTU limit. The MTU can be set on IP interfaces, but not with dladm set-linkprop.

Configuring bnx

The MTU must be configured through the bnx driver’s config file, bnx.conf. The version that ships with the OS can be found in /kernel/drv/bnx.conf. Some sources online and the Solaris 11 docs claim you need to set the Jumbo=... option, but in the case of the driver version currently shipping in illumos, the option name is mtu.

But this is SmartOS, so I can’t just change bnx.conf and reboot like I would with traditional Solaris. Thankfully, I stumbled across a solution posted by @nshalman. This clever enhancement by @wesolows and @JohnSonnenschein allows arbitrary files to be added to the boot archive ramdisk via the bootloader. As @nshalman says, @wesolows’ blog post where it’s mentioned is pretty enlightening, as is the “module” syntax documentation.

My hosts boot from PXE, and use iPXE as their bootloader. Thankfully, iPXE’s module (imgfetch) command works in exactly the same way as the GRUB equivalent, and it pulls the file off my TFTP server just like the rest of the boot components. Over on my (Debian) boot server, I now have:

# cat /var/lib/tftpboot/smartos/bnx.conf
mtu=9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000,9000;
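If you’d rather not type out that line by hand, here is a minimal sketch of a helper that generates it. The function name bnx_mtu_line is made up for illustration, and I’m assuming the 16 comma-separated values are per-instance settings; the default count simply mirrors the file above.

```shell
# Hypothetical helper (not part of SmartOS or the bnx driver):
# print a bnx.conf "mtu" line with the same value repeated COUNT times.
# COUNT defaults to 16 only because that matches the file shown above.
bnx_mtu_line() {
    mtu=$1
    count=${2:-16}
    line="mtu=$mtu"
    i=1
    # Append ",<mtu>" until the line holds COUNT values.
    while [ "$i" -lt "$count" ]; do
        line="$line,$mtu"
        i=$((i + 1))
    done
    # bnx.conf properties are terminated with a semicolon.
    printf '%s;\n' "$line"
}
```

Running bnx_mtu_line 9000 > /var/lib/tftpboot/smartos/bnx.conf on the boot server would write the single line to the file.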

And my menu.ipxe entry for booting SmartOS has changed from:

kernel /smartos/${smartos-version}/platform/i86pc/kernel/amd64/unix -B smartos=true,console=${smartos-console},${smartos-console}-mode="115200,8,n,1,-" -v
initrd /smartos/${smartos-version}/platform/i86pc/amd64/boot_archive
boot

To:

kernel /smartos/${smartos-version}/platform/i86pc/kernel/amd64/unix -B smartos=true,console=${smartos-console},${smartos-console}-mode="115200,8,n,1,-" -v
module /smartos/${smartos-version}/platform/i86pc/amd64/boot_archive type=rootfs name=ramdisk
module /smartos/bnx.conf type=file name=kernel/drv/bnx.conf
boot

Regarding the name argument to the type=rootfs module, the documentation claims “This option is ignored for rootfs modules”. However, both @wesolows’ and @nshalman’s examples use it, and I didn’t test without it, so I left it alone.

Reboot, and you should find that bnx.conf has been copied to /system/boot/kernel/drv/bnx.conf. At first I was looking for it in /kernel/drv/bnx.conf, and thought it must not be working, but the documentation clears this point up.

I also thought it wasn’t working because even with bnx.conf in place and all network configuration disabled in /usbkey/config, dladm set-linkprop -p mtu=9000 bnx0 was still failing. It turns out that this is normal for the bnx driver: the MTU cannot be set via set-linkprop; it must be done through bnx.conf. Thankfully, once you have the aggregation configured correctly, the call to dladm set-linkprop -p mtu=9000 aggr0 in /lib/svc/method/net-physical does not fail.
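Since a failing set-linkprop on bnx0 turned out to be a poor indicator, a better check is to read back the MTU the link actually has. The following is a rough sketch, assuming dladm show-linkprop’s parseable output (-c with -o value); the function name expect_link_mtu is my own invention.

```shell
# Hypothetical helper: compare a link's current MTU with an expected value.
# Relies on dladm's parseable output mode (-c requires -o).
expect_link_mtu() {
    link=$1
    want=$2
    # Returns 2 if dladm itself fails (e.g. the link does not exist).
    have=$(dladm show-linkprop -c -o value -p mtu "$link") || return 2
    if [ "$have" = "$want" ]; then
        echo "$link: mtu $have (ok)"
    else
        echo "$link: mtu $have, expected $want"
        return 1
    fi
}
```

After a reboot, expect_link_mtu bnx0 9000 and expect_link_mtu aggr0 9000 should both report ok if bnx.conf was picked up and the aggregation came up as configured.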

Here is my working /usbkey/config:

#
# This file was auto-generated and must be source-able by bash.
#

aggr0_aggr=78:2b:cb:1b:06:9d,78:2b:cb:1b:6:9e
aggr0_lacp_mode=passive
aggr0_mtu=9000

admin_nic=aggr0
admin_mtu=9000
admin_ip=172.18.2.13
admin_netmask=255.255.255.0
admin_gateway=172.18.2.1

external_nic=aggr0
external_mtu=9000
external0_ip=128.118.250.13
external0_netmask=255.255.255.224
external0_gateway=128.118.250.1
external0_vlan_id=306

headnode_default_gateway=128.118.250.1

dns_resolvers=128.118.250.8,54.172.110.33
dns_domain=galaxyproject.org

ntp_hosts=0.smartos.pool.ntp.org
compute_node_ntp_hosts=dhcp

hostname=westmalle.galaxyproject.org
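Note that the config above uses the same MTU for the aggregation and for both nic tags. As a quick sanity check before rebooting, here is a small sketch that lists the distinct *_mtu values in a config file and fails if they differ. The function name check_mtus is hypothetical, and treating a mismatch as an error is my assumption based on this working config, not a documented requirement.

```shell
# Hypothetical sanity check: print the distinct values of all *_mtu keys
# in the given config file; succeed only if exactly one distinct value exists.
check_mtus() {
    awk -F= '
        /^[A-Za-z0-9_]+_mtu=/ { seen[$2] = 1; n++ }
        END {
            distinct = 0
            for (v in seen) { distinct++; print v }
            # Fail when no MTU keys were found or when they disagree.
            if (n > 0 && distinct == 1) exit 0
            exit 1
        }
    ' "$1"
}
```

Running check_mtus /usbkey/config against the file above should print 9000 and return success.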

Conclusion

The main problem here seemed to be (other than my confusion at various points) the bnx driver. Many people report having an overall better experience with Intel cards, which I do not doubt.

So, I’m happy to report that LACP + jumbo frames are working great (and probably have been for years), as long as you can thwack your network device’s driver into submission.

Thanks as always to the Joyeurs for their hard work and support.

Addendum 2017-11-09 11:51:00 -0400: I’ve disabled aggr0_lacp_mode and set my switch to use non-LACP trunking. Since I boot from PXE, and because LACP is not activated until the OS is booted, PXE was unable to contact the network to boot. Multilink trunking/failover still works and I wasn’t using dynamic LACP anyway, so this is not a huge issue.