File README of Package ib-bonding
Bonding support for operation over IPoIB
January 14 2007
Or Gerlitz <ogerlitz@voltaire.com>
This package contains patches that let the bonding driver support
non-ARPHRD_ETHER netdevices in its High-Availability (active-backup)
mode.
The motivation is to enable the bonding driver in its HA mode to work
with the IP over InfiniBand (IPoIB) driver. With these patches we were
able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) multicast
and ICMP traffic, with fail-over and fail-back working fine. The working
environment was the net-2.6.20 git tree, and later also RH4 and SLES10,
whose IB drivers are based on OFED 1.1.
Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver,
which is used by native IB ULPs whose addressing scheme is based on IP
(eg iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB
devices **enables** HA for these ULPs. This holds because when the ULP
is informed by the IB HW of the failure of the current IB connection, it
just needs to reconnect, and the bonding device will then issue the IB
ARP over the active IPoIB slave.
Please note that the XXX patch must be applied to the IPoIB driver for
it to work with this package.
Below, some detailed info is provided on the patches applied by this
package to the kernel bonding code.
These patches are not enough for configuring IPoIB bonding through the
tools (eg /sbin/ifenslave and /sbin/ifup) provided by packages such as
sysconfig and initscripts, specifically because these tools set the
bonding device to be UP before enslaving anything.
The next step we plan is to look at how to enhance the tools/packages
so that it would be possible to bond/enslave with the modified code. As
suggested by the bonding maintainer, this step can potentially involve
converting ifenslave to be a script based on the bonding sysfs
infrastructure rather than on the somewhat obsolete
Documentation/networking/ifenslave.c
For the ease of potential users, the package contains example bash
scripts based on the bonding sysfs support, which can be used to get
the modified bonding driver working with the changes.
detailed info on the patches
============================
The first patch (dev_setup.patch) changes some of the bond netdevice
attributes and functions to be those of the active slave when the
enslaved device is not of ARPHRD_ETHER type. Basically it overrides
the settings done by ether_setup(), which are netdevice **type**
dependent and hence might not be appropriate for devices of other
types. It also enforces mutual exclusion among bonding slaves of
dissimilar ether types.
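
As a rough illustration of the two behaviours just described, here is a
toy userspace model. It is not the code of dev_setup.patch: the struct
and function names (toy_dev, toy_bond, toy_bond_init, toy_enslave) are
hypothetical, and only the numeric values are meant to mirror the
kernel's ARPHRD_ETHER, ARPHRD_INFINIBAND, ETH_ALEN and INFINIBAND_ALEN.
The model assumes the bond adopts the attributes of its first slave.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values mirror the kernel headers; the TOY_ names are made up. */
enum { TOY_ARPHRD_ETHER = 1, TOY_ARPHRD_INFINIBAND = 32 };
enum { TOY_ETH_ALEN = 6, TOY_INFINIBAND_ALEN = 20 };

struct toy_dev {
    unsigned short type;      /* ARPHRD_* link type */
    unsigned char  addr_len;  /* hardware address length in bytes */
};

struct toy_bond {
    struct toy_dev dev;       /* the bond's own netdevice attributes */
    size_t         nslaves;   /* number of slaves accepted so far */
};

/* Defaults comparable to what ether_setup() leaves on a fresh bond. */
static void toy_bond_init(struct toy_bond *bond)
{
    bond->dev.type = TOY_ARPHRD_ETHER;
    bond->dev.addr_len = TOY_ETH_ALEN;
    bond->nslaves = 0;
}

/* Returns true if the slave was accepted, false if it was refused. */
static bool toy_enslave(struct toy_bond *bond, const struct toy_dev *slave)
{
    if (bond->nslaves == 0) {
        /* First slave: replace the type-dependent defaults.
         * For an Ethernet slave this changes nothing. */
        bond->dev.type = slave->type;
        bond->dev.addr_len = slave->addr_len;
    } else if (bond->dev.type != slave->type) {
        /* Slaves of a different link type are refused. */
        return false;
    }
    bond->nslaves++;
    return true;
}
```

In this model, enslaving an IPoIB device first turns the bond into an
ARPHRD_INFINIBAND device with a 20 byte address, after which an attempt
to add an Ethernet device is rejected.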
The IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made
of a 3 byte IB QP (Queue Pair) number and the 16 byte IB port GID
(Global ID) of the port this IPoIB device is bound to. The QP is a
resource created by the IB HW and the GID is an identifier burned into
the HCA (I have omitted here some details which are not important for
the bonding RFC).
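
To make the address layout concrete, the sketch below packs a QP number
and a GID into a hardware address buffer. It rests on assumptions not
stated in the text above: that the full address is 20 bytes (the
kernel's INFINIBAND_ALEN), with one leading flags/reserved byte ahead
of the QPN, and that the QPN is stored most significant byte first. The
names toy_ipoib_pack and toy_ipoib_qpn are hypothetical helpers, not
IPoIB driver functions.

```c
#include <stdint.h>
#include <string.h>

#define TOY_IPOIB_ALEN 20  /* same value as the kernel's INFINIBAND_ALEN */

/* Layout assumed here: [0] flags/reserved, [1..3] QPN, [4..19] GID. */
static void toy_ipoib_pack(uint8_t out[TOY_IPOIB_ALEN], uint8_t flags,
                           uint32_t qpn, const uint8_t gid[16])
{
    out[0] = flags;
    out[1] = (qpn >> 16) & 0xff;  /* QPN is 24 bits wide */
    out[2] = (qpn >> 8) & 0xff;
    out[3] = qpn & 0xff;
    memcpy(out + 4, gid, 16);
}

/* Recover the 24-bit QPN from a packed address. */
static uint32_t toy_ipoib_qpn(const uint8_t addr[TOY_IPOIB_ALEN])
{
    return ((uint32_t)addr[1] << 16) | ((uint32_t)addr[2] << 8) | addr[3];
}
```

Because both the QPN and the GID come from the IB hardware, two IPoIB
slaves of the same bond will in general carry different addresses.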
Basically the IPoIB spec and implementation do not allow for setting
the MAC address of an IPoIB device, and this work was done under this
assumption.
Hence, the second patch (set_mac_address.patch) allows for enslaving
netdevices which do not support the set_mac_address() function. In that
case the bond MAC address is that of the active slave, and remote peers
are notified of the MAC address (neighbour) change by a Gratuitous ARP
sent by bonding when fail-over occurs (this is already done by the
bonding code).
Normally, the bonding driver is UP before any enslavement takes place.
Once a netdevice is UP, the network stack acts to have it join some
multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup()
has set the bonding device type to be ARPHRD_ETHER and the address
length to be ETHER_ALEN, the net core code computes a wrong multicast
link address. This is because ip_eth_mc_map() is called, whereas for
multicast joins taking place **after** the enslavement a different
ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is
ARPHRD_INFINIBAND).
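
To show what the Ethernet mapping produces, here is a userspace
rendition of the standard IPv4-to-Ethernet multicast mapping (01:00:5e
followed by the low 23 bits of the group address), which is what
ip_eth_mc_map() implements. The function name toy_eth_mc_map is
hypothetical, and taking the group in host byte order is a choice made
for this sketch only.

```c
#include <stdint.h>

/* Map an IPv4 multicast group (host byte order) to a 6-byte
 * Ethernet multicast address: 01:00:5e + low 23 bits of the group. */
static void toy_eth_mc_map(uint32_t group, uint8_t buf[6])
{
    buf[0] = 0x01;
    buf[1] = 0x00;
    buf[2] = 0x5e;
    buf[3] = (group >> 16) & 0x7f;  /* top bit of this byte is dropped */
    buf[4] = (group >> 8) & 0xff;
    buf[5] = group & 0xff;
}
```

For the all-hosts group 224.0.0.1 this yields 01:00:5e:00:00:01, a 6
byte Ethernet address, which is not a valid link address for an IPoIB
slave. A join computed this way before enslavement therefore does not
match what the ARPHRD_INFINIBAND mapping would produce.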
The third patch (allow_not_up_enslave.patch) handles this problem by