Native Infiniband / RoCE / Omni-Path Support (RDMA)
RDMA support for Infiniband, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the Open Fabrics Enterprise Distribution ibverbs API (http://www.openfabrics.org).
Clients
To enable RDMA, the BeeGFS client kernel modules have to be compiled with Infiniband support. Client Infiniband support is enabled by setting the corresponding buildArgs option in the client autobuild file (/etc/beegfs/beegfs-client-autobuild.conf). This file also contains more details on the values that you need to set to enable Infiniband support.
Servers
Up to BeeGFS v7.0:
The BeeGFS OpenTk communication library (libbeegfs-opentk.so) for userspace services comes pre-built with and without native Infiniband support. You can use the following command to enable the shared library version with native Infiniband support:
$ beegfs-setup-rdma
(Note that the command above is automatically executed after the beegfs-opentk-lib package installation.)
BeeGFS v7.1 and newer:
Please install the libbeegfs-ib package. BeeGFS will then enable RDMA support automatically if the hardware and drivers are installed.
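Assuming the BeeGFS package repository has already been set up as for the other BeeGFS packages, the package can be installed with the distribution's package manager, for example:
$ yum install libbeegfs-ib
(On Debian/Ubuntu based systems, use apt-get instead of yum.)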
Verifying Infiniband Connectivity
At runtime, you can check whether your IB devices have been discovered by using the BeeGFS online configuration tool. The tool mode LISTNODES will show a list of all registered services and their configured network interfaces in order of preference. The word "RDMA" will be appended to interfaces that are enabled for the native Infiniband protocol. Use the following commands to list the services and their available network interfaces:
$ beegfs-ctl --listnodes --nodetype=storage --details
$ beegfs-ctl --listnodes --nodetype=meta --details
$ beegfs-ctl --listnodes --nodetype=client --details
To check whether the clients are connecting to the servers via RDMA or whether they are falling back to TCP because of configuration problems, use the following command to list established connections on a client:
$ beegfs-net
(Note that the command above reads information from /proc/fs/beegfs/<clientID>/X_nodes.)
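If the beegfs-net tool is not installed, the same information can be read directly from those procfs files, for example (the concrete file names, such as storage_nodes and meta_nodes, are given here only as examples and may differ between BeeGFS versions):
$ cat /proc/fs/beegfs/*/storage_nodes
$ cat /proc/fs/beegfs/*/meta_nodes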
In addition to the commands above, the log files also provide information on established connections and connection failures (if you are using at least logLevel=3). See /var/log/beegfs-X.log on clients and servers.
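For example, to raise the client log level, set the following option in /etc/beegfs/beegfs-client.conf (the same option is assumed to exist in the corresponding server configuration files) and restart the service or remount the client:
logLevel = 3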
General Infiniband Tuning Settings
Note: A typical source of trouble is having the ibacm service (/etc/init.d/ibacm) running on the machines. This service causes RDMA connection attempts to stall and should be disabled on all nodes.
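For example, on systemd-based systems the service can be stopped and disabled as follows (on older SysV init systems, use the /etc/init.d/ibacm script and chkconfig instead):
$ systemctl stop ibacm
$ systemctl disable ibacm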
Note: RHEL kernel 3.10.0-327 introduced a new security heuristic that resulted in an incompatibility with how BeeGFS services used to handle RDMA connections. This problem was addressed in BeeGFS 2015.03-r16. Please make sure that you use a compatible release.
- Client / server side RDMA connection parameters in /etc/beegfs/beegfs-client.conf are
- connRDMABufSize
- The maximum size of a buffer (in bytes) that will be sent over the network.
- Multiple buffers (defined by connRDMABufNum) are transferred in parallel per connection.
- connRDMABufSize needs to be an integer multiple of 4 KiB.
- The client must allocate this buffer as physically contiguous pages.
- Values that are too large (>64KiB) might cause page allocation failures.
- connRDMABufNum
- The number of available buffers that can be in flight for a single connection.
- connMaxInternodeNum
- The number of parallel connections that a client can establish to each of the servers.
- Connections are only established when needed and are also dropped when they are idle for a while.
- Note:
- RAM usage per connection is: connRDMABufSize x connRDMABufNum x 2
- Keep resulting maximum RAM usage on the server in mind when increasing these values:
connRDMABufSize x connRDMABufNum x 2 x connMaxInternodeNum x number_of_clients (see the example calculation after this list)
- connMaxInternodeNum is a general network tuning parameter in the (client) config file
- The maximum number of simultaneous connections to the same node.
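As an illustration of how these parameters interact, assume the following (purely exemplary) values in /etc/beegfs/beegfs-client.conf:
connRDMABufSize = 8192
connRDMABufNum = 70
connMaxInternodeNum = 12
With these numbers, a single RDMA connection consumes about 8192 x 70 x 2 ≈ 1.1 MiB of buffer memory. If 1000 clients each open 12 connections to a server, the server-side buffer memory adds up to roughly 8192 x 70 x 2 x 12 x 1000 ≈ 13 GiB.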
Intel/QLogic TrueScale Infiniband Tuning
- Adjust the RDMA buffer parameters in file /etc/beegfs/beegfs-client.conf.
- Increase connRDMABufSize to 65536 (64KiB)
- Reduce connRDMABufNum to 12
- Install extra package "libipathverbs"
- The ib_qib module needs to be tuned at least on the server side
- In /etc/modprobe.conf or /etc/modprobe.d/ib_qib.conf
- options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096
- Note: The optimal value of krcvqs=<value> depends on the number of CPU cores
- This value reserves the given number of receive queues for ibverbs.
- Please see Intel/QLogic OFED release notes for more details.
- On large clusters, you might need to adapt parameters on the servers to accept a higher number of incoming RDMA connections, e.g.:
- Driver options: lkey_table_size=18, max_qp_wrs=131072, max_qps=131072, qp_table_size=2048
- Map count: echo 1000000 > /proc/sys/vm/max_map_count (use sysctl to make this change persistent)
- File handles: ulimit -n 262144 (use /etc/security/limits.conf to make this change persistent)
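As a sketch of how to make the large-cluster settings above persistent (the file names below are examples and may differ on your distribution):
- In /etc/modprobe.d/ib_qib.conf, combine the driver options into a single line:
- options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096 lkey_table_size=18 max_qp_wrs=131072 max_qps=131072 qp_table_size=2048
- In /etc/sysctl.conf (or a file under /etc/sysctl.d/):
- vm.max_map_count = 1000000
- In /etc/security/limits.conf:
- * soft nofile 262144
- * hard nofile 262144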
Intel Omni-Path Architecture (OPA) Tuning
- Adjust the RDMA buffer parameters in file /etc/beegfs/beegfs-client.conf.
- Increase connRDMABufSize to 65536 (64KiB)
- Reduce connRDMABufNum to 12
- Intel Omni-Path provides a mode called "Accelerated RDMA" to improve performance of large transfers, which is off by default.
- See Intel Omni-Path Performance Tuning Guide chapter "Accelerated RDMA" for information on how to enable this mode.
Mellanox Infiniband Tuning
- On large clusters, you might need to set the log_mtts_per_seg and log_num_mtt mlx driver options to allow a higher number of RDMA connections.
- These options are typically set in /etc/modprobe.d/mlx4_core.conf
- For Mellanox DDR/QDR/FDR:
- The default connRDMABufSize/connRDMABufNum settings are good.
- For Mellanox EDR:
- Increase connRDMABufSize to 32768 (32KiB)
- Reduce connRDMABufNum to 22
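For illustration only, such an entry in /etc/modprobe.d/mlx4_core.conf could look like this (the values are examples and should be sized according to the installed RAM; see the Mellanox documentation for the exact sizing formula):
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1
(With 4 KiB pages, this combination allows registering about 2^24 x 2^1 x 4 KiB = 128 GiB of memory. The mlx4_core module has to be reloaded, or the node rebooted, for the change to take effect.)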
Additional Notes
In an RDMA-capable cluster, some BeeGFS communication (especially communication with the management service, which is not performance-critical) uses TCP/IP and UDP/IP transfer. On some systems, the default "connected" IP-over-IB mode of Infiniband and Omni-Path does not seem to work well and results in spurious problems. In this case, you should try to switch the IPoIB mode to "datagram" on all hosts:
$ echo datagram > /sys/class/net/ibX/mode
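The currently active IPoIB mode can be verified by reading the same file:
$ cat /sys/class/net/ibX/mode
(Note that the echo command above does not survive a reboot. To make the setting permanent, configure datagram mode through your distribution's network configuration, for example CONNECTED_MODE=no in RHEL-style ifcfg files; this is distribution-specific and mentioned here only as an example.)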