Ethernet & IP Network Tuning
Ethernet & TCP Settings
Network tuning applies to both clients and servers and is usually based on adjusting common TCP settings. The most important of these are TCP window scaling, buffer sizes, and timestamps. You can change these settings by writing to the files in /proc/sys/net/ipv4 or by using sysctl. Network hardware vendors typically publish recommended tuning settings on their websites.
Your distribution might also ship alternative TCP implementations (congestion control algorithms) that are better optimized for high-speed networks than the default one.
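Each sysctl key maps directly to a file under /proc/sys: net.ipv4.tcp_window_scaling, for example, corresponds to /proc/sys/net/ipv4/tcp_window_scaling. The following is a minimal, read-only Python sketch, assuming a Linux host. It prints the current values of the TCP settings mentioned above, together with the selected and available congestion control algorithms. The key list is an illustrative choice, and the helper names are hypothetical.

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_path(key: str, root: Path = PROC_SYS) -> Path:
    """Map a sysctl key such as 'net.ipv4.tcp_window_scaling' to its /proc/sys file.

    Keys whose components contain dots themselves (e.g. VLAN interface
    names) need special handling and are not covered by this sketch.
    """
    return root.joinpath(*key.split("."))


def read_sysctl(key: str, root: Path = PROC_SYS) -> str:
    """Read the current value of a sysctl key (read-only, no root needed)."""
    return sysctl_path(key, root).read_text().strip()


# TCP settings discussed above, plus the congestion control selection.
KEYS = [
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_rmem",
    "net.ipv4.tcp_wmem",
    "net.ipv4.tcp_congestion_control",
    "net.ipv4.tcp_available_congestion_control",
]

if __name__ == "__main__":
    for key in KEYS:
        try:
            print(f"{key} = {read_sysctl(key)}")
        except OSError as err:  # e.g. not on Linux, or key not present
            print(f"{key}: not readable ({err})")
```

To change a value, write to the same file as root, or use sysctl -w. Changes made either way are lost at reboot unless they are also added to /etc/sysctl.conf.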
For Ethernet, it is also very important to enable send and receive flow control on the network cards (e.g. with ethtool) and on the switch. You also need to disable broadcast or storm control settings on your Ethernet switch so that they don't interfere with the highly concurrent parallel file streams.
In some Ethernet networks, TSO (TCP segmentation offload) and GSO (generic segmentation offload) can also cause problems with parallel file streams, such as significantly decreased throughput. Both of them can be disabled with ethtool, but you should only disable them if you have verified that they are causing problems.
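The host-side ethtool invocations for both adjustments can be sketched in Python as follows. This is a hedged sketch. The interface name eth0 is a placeholder, the helper names are hypothetical, and commands are only printed unless dry_run=False is passed. ethtool -a and -A query and set the pause (flow control) parameters, and ethtool -k and -K query and set offload features. Settings made with ethtool do not persist across reboots, and the switch side has to be configured separately.

```python
import shlex
import subprocess


def show_cmds(iface: str) -> list[list[str]]:
    """Read-only queries: current pause parameters and offload features."""
    return [["ethtool", "-a", iface], ["ethtool", "-k", iface]]


def flow_control_cmd(iface: str) -> list[str]:
    """Enable receive and transmit pause frames (flow control) on the NIC."""
    return ["ethtool", "-A", iface, "rx", "on", "tx", "on"]


def disable_segmentation_offload_cmd(iface: str) -> list[str]:
    """Disable TSO and GSO; only apply after verifying they reduce throughput."""
    return ["ethtool", "-K", iface, "tso", "off", "gso", "off"]


def run(cmd: list[str], dry_run: bool = True) -> None:
    """Print the command line; execute it only when dry_run is False (needs root)."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    iface = "eth0"  # placeholder interface name
    for cmd in show_cmds(iface) + [flow_control_cmd(iface)]:
        run(cmd)  # dry run: only prints the command line
```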
Enabling jumbo frames is another configuration option to consider for Ethernet networks. Jumbo frames increase the amount of data carried by each Ethernet frame, which reduces per-frame overhead and helps BeeGFS achieve higher throughput. However, every element on the paths between BeeGFS services (i.e. switches, routers, network cards) must be configured to accept the larger frames.
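A little arithmetic shows the effect of a larger MTU. The Python sketch below assumes standard Ethernet without VLAN tags and IPv4/TCP headers without options (the TCP timestamp option, for example, adds 12 bytes per segment). The numbers are therefore approximations, not measurements.

```python
import math

# Per-frame cost on the wire in bytes (standard Ethernet, no VLAN tag):
# preamble + SFD (8) + MAC header (14) + FCS (4) + inter-frame gap (12).
ETH_WIRE_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 header (20) + TCP header (20), assuming no IP or TCP options.
IP_TCP_HEADERS = 20 + 20


def tcp_payload_per_frame(mtu: int) -> int:
    """TCP payload bytes that fit into one frame of the given MTU."""
    return mtu - IP_TCP_HEADERS


def wire_efficiency(mtu: int) -> float:
    """Fraction of bytes on the wire that carry TCP payload."""
    return tcp_payload_per_frame(mtu) / (mtu + ETH_WIRE_OVERHEAD)


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Number of full-size frames needed to move payload_bytes."""
    return math.ceil(payload_bytes / tcp_payload_per_frame(mtu))


if __name__ == "__main__":
    one_gib = 1 << 30
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: efficiency {wire_efficiency(mtu):.2%}, "
              f"{frames_needed(one_gib, mtu)} frames per GiB")
```

At MTU 9000, the sketch counts roughly six times fewer frames per GiB than at MTU 1500. Each frame saved also removes one unit of per-packet processing on the sending and receiving hosts.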
Neighbor Table Sizes for Address Resolution (ARP)
In networks with a large number of interfaces, the default neighbor table sizes for ARP (Address Resolution Protocol) lookups are typically too small, which causes a "Neighbour table overflow" error. This is especially relevant for the BeeGFS management service, which communicates with all other BeeGFS services. In such cases, it will shut down due to a critical communication error that prevents it from correctly monitoring the registered BeeGFS services.
To prevent a neighbor table overflow, raise the threshold value net.ipv4.neigh.default.gc_thresh1 in /etc/sysctl.conf (or set it with the sysctl tool). Its default value of 128 must be raised if more than 128 interfaces can be used by BeeGFS. This is the case, for example, with 129 nodes with one network interface (IP address) each, 65 nodes with two interfaces each, or 43 nodes with three interfaces each.
In other words, the gc_thresh1 value should be higher than the total number of IPs used by BeeGFS. For example, if you have 200 clients with 2 IP addresses each and 10 servers with 3 IP addresses each, gc_thresh1 should be at least 200 * 2 + 10 * 3 = 460. BeeGFS hosts might also communicate with other machines on the network or the Internet, so it is a good idea to round the value up, e.g. to 512 in this example.
The other gc_thresh values should be raised as well. You could double their defaults (512 for gc_thresh2 and 1024 for gc_thresh3), i.e. set gc_thresh2=1024 and gc_thresh3=2048.
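The following Python sketch turns a chosen gc_thresh1 into /etc/sysctl.conf lines. The scaling rule is an assumption: gc_thresh2 and gc_thresh3 are set to at least double their kernel defaults, and to 2x and 4x gc_thresh1 once gc_thresh1 exceeds 512, which keeps the three thresholds in ascending order. For gc_thresh1=512, the rule yields the 1024 and 2048 suggested above. The function name is hypothetical.

```python
def gc_thresh_sysctl_lines(thresh1: int) -> list[str]:
    """sysctl.conf lines for the three IPv4 neighbor table thresholds."""
    values = {
        "gc_thresh1": thresh1,
        # At least double the kernel defaults (512 and 1024); scale with
        # gc_thresh1 for larger values so that thresh1 < thresh2 < thresh3.
        "gc_thresh2": max(1024, 2 * thresh1),
        "gc_thresh3": max(2048, 4 * thresh1),
    }
    return [f"net.ipv4.neigh.default.{name} = {value}" for name, value in values.items()]


if __name__ == "__main__":
    print("\n".join(gc_thresh_sysctl_lines(512)))
```

You can append the printed lines to /etc/sysctl.conf and load them with sysctl -p. To apply a single value immediately without persisting it, use sysctl -w.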
Finally, as this problem affects any process that communicates with too many hosts, you might want to increase these threshold values on other machines as well, not only on the machine running beegfs-mgmtd.
Firewalls / UDP Datagram Traffic / Network Address Translation (NAT)
TCP connections are only established from clients to servers or between servers, but never from a server to a client.
UDP packets are exchanged between all servers in both directions and between servers and clients in both directions.
UDP datagrams must be able to travel from servers to clients without the client sending a datagram to the server first. Make sure this traffic is not blocked, for example because clients are behind NAT or because firewall rules drop such unsolicited incoming datagrams.
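One way to check this requirement is to send a single datagram from a server to a client that has not sent anything first. The minimal Python sketch below assumes you run the listener on the client on a port that your firewall and NAT rules treat like the BeeGFS client port. If you reuse the client's own port, stop the client service first, because only one process can bind it. The payload and function names are hypothetical and are not part of the BeeGFS protocol.

```python
import socket

PROBE = b"udp-reachability-probe"  # arbitrary payload, not a BeeGFS message


def open_listener(port: int, bind_addr: str = "0.0.0.0") -> socket.socket:
    """Client side: bind a UDP socket. Nothing is ever sent from this socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_once(sock: socket.socket, timeout: float = 10.0):
    """Return (payload, sender) if an unsolicited datagram arrives, else None."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None


def send_probe(host: str, port: int) -> None:
    """Server side: send one datagram to the client without any prior exchange."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(PROBE, (host, port))
```

To run the check, call open_listener and receive_once on the client, and call send_probe with the client's address from a server. If receive_once returns None, something on the path drops unsolicited server-to-client datagrams.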
TCP and UDP ports used by the services can be found in the corresponding config files (/etc/beegfs/beegfs-...conf) or by querying the management service, e.g. for management service ports:
$ beegfs-ctl --listnodes --nodetype=management --nicdetails
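You can also extract the configured ports from a service's config file programmatically. The hedged Python sketch below assumes the "key = value" format with # comments that the BeeGFS config files use. It collects every numeric setting whose name contains "Port", so it does not depend on exact key names. Key names such as connMgmtdPortTCP are assumed from BeeGFS 7 configuration files; check them against your own files. The function names are hypothetical.

```python
import re
from pathlib import Path

# "key = value" with optional surrounding whitespace; comments are stripped first.
LINE_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=\s*(\S*)")


def port_settings(conf_text: str) -> dict[str, int]:
    """Map each numeric setting whose name contains 'Port' to its value."""
    ports = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]  # drop comments
        match = LINE_RE.match(line)
        if match and "Port" in match.group(1) and match.group(2).isdigit():
            ports[match.group(1)] = int(match.group(2))
    return ports


def port_settings_from_file(path: str) -> dict[str, int]:
    """Read a config file and return its port settings."""
    return port_settings(Path(path).read_text())
```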
All BeeGFS services use fixed TCP/UDP ports. The only exceptions are the beegfs-ctl and beegfs-fsck tools. They also use UDP, but they have to use a random UDP port so that an arbitrary number of beegfs-ctl instances can run at the same time on the same machine.
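Plain sockets illustrate why a random port is needed. Binding to port 0 lets the kernel pick a free ephemeral port, so any number of instances can coexist. A second process that binds the same fixed port, by contrast, fails. The minimal Python sketch below runs on localhost and is not BeeGFS code. It assumes the sockets set neither SO_REUSEADDR nor SO_REUSEPORT, since either option would change the fixed-port behavior.

```python
import socket


def bind_udp(port: int) -> socket.socket:
    """Bind a UDP socket on localhost; port 0 asks the kernel for a free ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    # Two "tool instances" with random ports: both binds succeed, on different ports.
    a = bind_udp(0)
    b = bind_udp(0)
    print(a.getsockname()[1], b.getsockname()[1])
    # A second instance on the first one's fixed port fails (address in use).
    try:
        bind_udp(a.getsockname()[1])
    except OSError as err:
        print("second bind failed:", err)
    a.close()
    b.close()
```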
In general, it is not required that beegfs-ctl can run on the compute nodes, but it is helpful for users if this is possible, e.g. to be able to check statistics (beegfs-ctl --userstats) or to query quota information (beegfs-ctl --getquota).
These are the default UDP and TCP port numbers of the BeeGFS services:
- Management service (beegfs-mgmtd): 8008
- Metadata service (beegfs-meta): 8005
- Storage service (beegfs-storage): 8003
- Client service (beegfs-client): 8004
By default, the client also establishes TCP connections to a userspace helper service (beegfs-helperd) on TCP port 8006 of the same machine. The client uses this helper for DNS lookups and logging.
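On hosts that run firewalld, the default ports above translate into rules like the ones the following Python sketch prints. Which protocols each role accepts inbound is an assumption derived from the connection rules earlier in this section. Servers accept TCP and UDP. Clients only need to receive UDP, because TCP connections are never opened towards a client. The beegfs-helperd port 8006 is only used on the local machine, so it gets no rule. Adjust the ports if your config files deviate from the defaults, and consider restricting the rules to your cluster's source addresses. The role names and the function name are hypothetical.

```python
# Default ports from the list above, with the inbound protocols each role
# is assumed to need (derived from the TCP/UDP rules in this section).
DEFAULT_PORTS = {
    "mgmtd": (8008, ("tcp", "udp")),
    "meta": (8005, ("tcp", "udp")),
    "storage": (8003, ("tcp", "udp")),
    "client": (8004, ("udp",)),  # TCP is never opened towards a client
}


def firewalld_commands(roles: list[str]) -> list[str]:
    """firewall-cmd lines opening the default ports for the roles on this host."""
    cmds = []
    for role in roles:
        port, protocols = DEFAULT_PORTS[role]
        for proto in protocols:
            cmds.append(f"firewall-cmd --permanent --add-port={port}/{proto}")
    cmds.append("firewall-cmd --reload")  # activate the permanent configuration
    return cmds


if __name__ == "__main__":
    print("\n".join(firewalld_commands(["meta", "storage"])))
```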
In general, BeeGFS services of the same type do not all have to use the same TCP or UDP port. For example, some metadata services can use port 8005, while other metadata services that connect to the same management service use different ports, even though they are part of the same file system namespace. By default, however, all services of the same type use the same port.