Tips and Recommendations for Metadata Server Tuning


Table of Contents

  1. General Notes
  2. Hardware RAID
    1. Partition Alignment & RAID Settings of Local File System
    2. Metadata Server Throughput Tuning
  3. ZFS (Software RAID)
    1. Setting ZFS Module Parameters
    2. Creating ZFS Pools for Metadata Targets
    3. Configuring ZFS Pools for Metadata Targets
  4. System BIOS & Power Saving
  5. Concurrency Tuning
  6. Parallel Network Requests
 


General Notes


This page presents some tips and recommendations on how to improve the performance of BeeGFS metadata servers. As usual, the optimal settings depend on your particular hardware and usage scenario, so you should use these settings only as a starting point for your tuning efforts. Benchmarking tools such as mdtest can help you identify the best settings for your BeeGFS metadata servers.

Some of the settings suggested here are non-persistent and will be reverted after the next reboot. In order to keep them permanently, you could add the corresponding commands to /etc/rc.local, as seen in the example below, use /etc/sysctl.conf or create udev rules to reapply them automatically when the machine boots.
# virtual memory tuning
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 1 > /proc/sys/vm/zone_reclaim_mode

# transparent huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# per-device block layer settings (adjust the device list to your system)
devices=(sda sdb)
for dev in "${devices[@]}"
do
  echo deadline > /sys/block/${dev}/queue/scheduler
  echo 128 > /sys/block/${dev}/queue/nr_requests
  echo 128 > /sys/block/${dev}/queue/read_ahead_kb
  echo 256 > /sys/block/${dev}/queue/max_sectors_kb
done


Here are some general guidelines that should be considered.



Extended Attributes (EAs)

BeeGFS metadata is stored as extended attributes (EAs) on the underlying file system for optimal performance. One metadata file is created for each file that a user creates, so the metadata target file system has to handle a very large number of small files efficiently.
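If you want to verify that extended attributes are actually being used on a metadata target, you can inspect a metadata file with the getfattr tool from the attr package. The command below is only an illustration; the mount point /data/meta001 and the inner path are hypothetical and depend on your installation.
$ getfattr -d -m - /data/meta001/<path-to-metadata-file>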





Hardware RAID


Partition Alignment & RAID Settings of Local File System


To get the maximum performance out of your storage devices, it is important to set each partition offset according to the native alignment of the underlying device. Check the page Partition Alignment Guide for a walk-through of partition alignment and the creation of a RAID-optimized local file system.

A simple and commonly used way to avoid alignment problems altogether is to skip partitioning completely and instead create the file system directly on the device, as shown in the following sections.


Metadata Server Throughput Tuning


In general, the BeeGFS metadata service can use any standard Linux file systems. However, ext4 is the recommended file system for disk partitions of metadata targets, because it handles small file workloads (common on a BeeGFS metadata server) significantly faster than other local Linux file systems.

The default Linux kernel settings are rather optimized for single disk scenarios with low IO concurrency, so there are various settings that need to be tuned to get the maximum performance out of your metadata servers.


Formatting Options

When formatting the ext4 partition, it is important to include options that minimize access times for large directories (-Odir_index), to create large inodes that allow storing BeeGFS metadata as extended attributes directly inside the inodes for maximum performance (-I 512), to reserve a sufficient number of inodes (-i 2048), and to use a large journal (-J size=400):
$ mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/sdX


If you also use ext4 for your storage server targets, you may want to reserve less space for inodes and keep more space free for file contents by using -i 16384 or higher for those storage targets.

As metadata size increases with the number of targets per file, you should use -I 1024 if you are planning to stripe across more than 4 targets per file by default or if you are planning to use ACLs or store client-side extended attributes.

Because ext4 has a fixed number of available inodes, it is possible to run out of inodes even while free disk space is still available. Thus, it is important to choose the number of inodes carefully with respect to your needs if your metadata disk is small. You can check the number of available inodes by using df -ih after formatting. If you need to avoid such a limit, use a different file system (e.g. xfs) instead of ext4.
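For example, assuming the metadata target is mounted at the hypothetical mount point /data/meta001, the inode usage can be checked like this:
$ df -ih /data/meta001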

By default, ext4 does not allow user space processes to store extended attributes (EAs). If the beegfs-meta daemon is set to use EAs, the underlying file system has to be mounted using the option user_xattr. This option also may be stored permanently in the super-block:
$ tune2fs -o user_xattr /dev/sdX

(Note: Some Linux distributions only provide an old version of tune2fs that cannot handle the option '-o user_xattr' yet. A newer version might be available under the name tune4fs.)

In systems expected to have directories with a very large number of entries (over 10 million), the option large_dir must be specified in addition to dir_index. This option increases the capacity of the directory index and is available in Linux kernel 4.13 or newer. Nevertheless, having a large number of entries in a single directory is not good practice performance-wise. Therefore, end users should be encouraged to distribute their files across multiple subdirectories, even if the option large_dir is being used.
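As a sketch, assuming a sufficiently recent e2fsprogs, large_dir can be enabled together with the other recommended options at format time:
$ mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,large_dir,filetype /dev/sdX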


Mount Options

To avoid the overhead of updating last access timestamps, the metadata partition can be mounted with the noatime option. This has no influence on the last access timestamps that users see in a BeeGFS mount.

If your RAID controller has a battery-backup-unit (BBU) or similar technology to protect cache contents on power loss, adding the mount option nobarrier for XFS or ext4 can significantly increase throughput. Make sure to disable the individual internal caches of the attached disks in the controller settings, as these are not protected by the RAID controller battery.

The command below shows typical mount options for a BeeGFS metadata server with a RAID controller battery.
$ mount -onoatime,nodiratime,nobarrier /dev/sdX <mountpoint>
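To make these mount options persistent across reboots, they can also be placed in /etc/fstab. The entry below is only an example; the device and mount point are hypothetical:
/dev/sdX  /data/meta001  ext4  noatime,nodiratime,nobarrier  0 0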



IO Scheduler

The deadline scheduler typically yields best results for metadata access.
$ echo deadline > /sys/block/sdX/queue/scheduler


In order to avoid latencies, the number of schedulable requests should not be too high; the Linux default of 128 is a good value.
$ echo 128 > /sys/block/sdX/queue/nr_requests


When tuning your metadata servers, keep in mind that metadata performance is often not so much about throughput, but rather about latency and a certain amount of fairness: there are probably interactive users on your cluster who want to see the results of their ls and other commands within an acceptable time. This means, for instance, that you probably do not want to set a high value for /sys/block/sdX/queue/iosched/read_expire on the metadata servers, so that users are not kept waiting too long for their operations to complete.
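As an illustration, the read expiry time of the deadline scheduler (in milliseconds) could be lowered to reduce read latency; the value below is only an example starting point and should be validated with your own benchmarks:
$ echo 250 > /sys/block/sdX/queue/iosched/read_expire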


Virtual Memory Settings

Transparent huge pages can cause performance degradation under high load and even stability problems on various kinds of systems. For RHEL 5.x, 6.x and derivatives, it is highly recommended to disable transparent huge pages, unless huge pages are explicitly requested by an application through madvise:
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

For RHEL 7.x and other distributions, it is recommended to have transparent huge pages enabled:

$ echo always > /sys/kernel/mm/transparent_hugepage/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/defrag



ZFS (Software RAID)


Software RAID implementations demand more powerful machines than traditional systems with hardware RAID controllers, especially if features like data compression and checksums are enabled. Therefore, using ZFS as the underlying file system of metadata targets will require more CPU power and RAM than a more traditional BeeGFS installation. It also increases the importance of disabling features like CPU frequency scaling.

It is also recommended to be economical with the features enabled in ZFS: a feature like deduplication, for example, uses a lot of resources and can have a significant impact on performance, while not providing any benefit on a metadata target.

Another important factor that impacts performance in such systems is the version of ZFS packages used. At the time of writing, the latest version 0.7.1 still had some performance issues. In our internal tests, the highest throughput was observed with version 0.6.5.11.


Setting ZFS Module Parameters


After loading the ZFS module, please set the module parameters below, before creating any ZFS storage pool.

IO Scheduler

Set the IO scheduler used by ZFS. Both noop and deadline, which implement simple scheduling algorithms, are good options, since the metadata daemon is the only local user of the device.
$ echo deadline > /sys/module/zfs/parameters/zfs_vdev_scheduler


Data Aggregation Limit

ZFS is able to aggregate small IO operations that access neighboring or overlapping data into larger operations, in order to reduce the number of IO operations sent to disk. The option zfs_vdev_aggregation_limit sets the maximum amount of data that may be aggregated into a single operation before it is finally issued to the disk.
$ echo 262144 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
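To have both module parameters applied automatically whenever the zfs module is loaded, they can be placed in a modprobe configuration file, e.g. /etc/modprobe.d/zfs.conf (the file name is a common convention, not a requirement):
options zfs zfs_vdev_scheduler=deadline zfs_vdev_aggregation_limit=262144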



Creating ZFS Pools for Metadata Targets


Basic options like the pool type, cache devices and log devices must be defined when the pool is created, as seen in the example below. The mount point is optional, but it is good practice to define it with the option -m, in order to control where the storage target directory will be located.
$ zpool create -m /data/meta001 meta001 mirror sda sdb


Data Protection

RAID-Z pool types are not recommended for metadata targets, due to the performance impact of parity updates. The recommended type is mirror, unless data safety is considered more important than metadata performance. In this case, RAID-Z1 should be used.

Partition Alignment

The list of devices composing the storage pool should contain whole block devices, so that ZFS can create disk partitions with optimal alignment. Moreover, including partitions formatted with other file systems like XFS or ext4 in this list introduces an unnecessary extra software layer with a significant performance overhead.
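If ZFS does not detect the native sector size of your devices correctly, the alignment can also be set explicitly at pool creation time via the ashift property, e.g. ashift=12 for disks with 4 KiB sectors. The pool name, mount point and devices below follow the earlier example:
$ zpool create -o ashift=12 -m /data/meta001 meta001 mirror sda sdb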

Configuring ZFS Pools for Metadata Targets


File system properties, like data compression, may be defined at creation time with option -O, as shown in the example below, but they may also be defined (or redefined) later with command zfs set, as seen in the following sections.
$ zpool create -f -O compression=lz4 -O atime=off -O xattr=sa -m /data/meta001 meta001 mirror sda sdb


Data Compression

Data compression is a feature that should be enabled, because it reduces the amount of space used by BeeGFS metadata. The CPU overhead caused by the compression functions is compensated by the decrease in the amount of data involved in the IO operations. Please use the compression algorithm lz4, which is known to offer a good balance between compression ratio and performance.
$ zfs set compression=lz4 meta001


Extended Attributes

As explained earlier, BeeGFS metadata is stored as extended attributes of files and therefore this feature must be enabled, as follows. Please note that the option xattr should be set to sa and not to the default on. The default mechanism stores extended attributes as separate hidden files, whereas the sa mechanism inlines them with the actual files they belong to, making the management of extended attributes much more efficient.
$ zfs set xattr=sa meta001


Deduplication

Deduplication (dedup) is a space-saving feature that keeps only a single copy of identical data blocks stored in the ZFS file system. This feature has a significant performance impact and should be disabled if the system has plenty of storage space.
$ zfs set dedup=off meta001


Unnecessary Properties

The BeeGFS metadata service does not rely on last access timestamps, so this property may be disabled in ZFS pools, as follows.
$ zfs set atime=off meta001
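The effective values of the properties discussed above can be verified at any time with zfs get, e.g.:
$ zfs get compression,xattr,dedup,atime meta001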



System BIOS & Power Saving


To allow the Linux kernel to correctly detect the system properties and enable corresponding optimizations (e.g. for NUMA systems), it is very important to keep your system BIOS updated.

The dynamic CPU clock frequency scaling feature for power saving, which is typically enabled by default, has a high impact on latency. Thus, it is recommended to turn off dynamic CPU frequency scaling. Ideally, this is done in the machine BIOS, where you will often find a general setting like "Optimize for performance".

If frequency scaling is not disabled in the machine BIOS, recent Intel CPU generations require the parameter intel_pstate=disable to be added to the kernel boot command line, which is typically defined in the grub boot loader configuration file. After changing this setting, the machine needs to be rebooted.
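As a sketch for RHEL-like systems using GRUB 2 (file locations and commands differ between distributions), the parameter could be added as follows:
# in /etc/default/grub, append intel_pstate=disable to the existing kernel parameters
GRUB_CMDLINE_LINUX="... intel_pstate=disable"
# then regenerate the GRUB configuration and reboot
$ grub2-mkconfig -o /boot/grub2/grub.cfg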

If the Intel pstate driver is disabled or not applicable to a system, frequency scaling can be changed at runtime, e.g. via:
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null


You can check whether CPU frequency scaling is disabled by using the following command on an idle system. If it is disabled, the cpu MHz lines should show the full clock frequency for all CPU cores, and not a reduced value.
$ cat /proc/cpuinfo



Concurrency Tuning


Worker Threads

Storage servers, metadata servers and clients allow you to control the number of worker threads by setting the value of tuneNumWorkers (in /etc/beegfs/beegfs-X.conf). In general, a higher number of workers allows for more parallelism (e.g. a server will work on more client requests in parallel).
For smaller clusters in the range of 100-200 compute nodes, you should set at least tuneNumWorkers=64 in beegfs-meta.conf. For larger clusters, tuneNumWorkers=128 is more appropriate.
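For example, the following line in /etc/beegfs/beegfs-meta.conf sets the number of worker threads for the metadata service; the beegfs-meta service has to be restarted afterwards (e.g. via systemctl restart beegfs-meta on systemd-based systems) for the change to take effect:
tuneNumWorkers = 64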


Parallel Network Requests


Each metadata server establishes multiple connections to each of the other servers to enable more parallelism on the network level by having multiple requests in flight to the same server. The tuning option connMaxInternodeNum in /etc/beegfs/beegfs-meta.conf can be used to configure the number of simultaneous connections. The information provided in the client tuning guide also applies to metadata servers: Parallel Network Requests Tuning
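As an illustration, the value could be raised in /etc/beegfs/beegfs-meta.conf; the number below is only an example and the optimal value depends on your network and workload:
connMaxInternodeNum = 16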


Back to User Guide - Tuning and Advanced Configuration