BeeGFS as the Hadoop File System
Hadoop can be configured to use BeeGFS as its distributed file system, as a more convenient and faster alternative to HDFS. This page explains how to set up and test such a configuration.
Overview
There are two ways of integrating Hadoop and BeeGFS.
- One way is to use the BeeGFS Hadoop Connector, which ensures that the data needed by each Hadoop node is stored on that node. This is suitable for systems that lack low-latency, high-throughput networks and can therefore benefit from data locality.
- The other way is to configure Hadoop to access a BeeGFS mountpoint via POSIX, as if it were a local file system. This configuration is suitable for systems with networks like InfiniBand or Omni-Path, where data locality matters less than parallelism for achieving good performance. In this scenario, the cost of network communication is more than compensated by having multiple BeeGFS servers handle the I/O operations performed by the Hadoop applications.
The procedures to configure Hadoop in each scenario are presented below.
Using the BeeGFS Hadoop Connector
The following procedure has been tested for Hadoop 2.5.2, 2.6.2, and 2.7.2.
- Install BeeGFS version 2014.01-r10 or later on your system. Every Hadoop node must run the storage, metadata, and client services (beegfs-storage, beegfs-meta, beegfs-client, and beegfs-helperd). The management service (beegfs-mgmtd) can run on one of the Hadoop nodes or on any other host you consider adequate. In addition, the Hadoop nodes must have the packages beegfs-utils and beegfs-client-devel installed.
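As a quick sanity check before continuing, you can verify that all BeeGFS services are reachable from each node using the tools shipped with beegfs-utils (the output will vary with your setup):
$ beegfs-check-servers
$ beegfs-df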
- Download the BeeGFS Hadoop Connector.
- Extract the downloaded .tar.gz file and put the contained beegfs.jar file into the folder $HADOOP_HOME/share/hadoop/common/lib on every node of the Hadoop cluster.
$ tar xvzf beegfs-hadoop-connector-v1.0.tar.gz
$ mv beegfs.jar $HADOOP_HOME/share/hadoop/common/lib
- Create the symbolic links below on every node.
$ ln -s /opt/beegfs/lib/libjbeegfs.so $HADOOP_HOME/lib/native/libjbeegfs.so
$ ln -s /opt/beegfs/lib/jbeegfs.jar $HADOOP_HOME/share/hadoop/common/lib/jbeegfs.jar
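A quick way to confirm that the links resolve correctly on each node:
$ ls -l $HADOOP_HOME/lib/native/libjbeegfs.so $HADOOP_HOME/share/hadoop/common/lib/jbeegfs.jar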
- Choose a path under the BeeGFS mountpoint as the root path for Hadoop (e.g. /mnt/beegfs/hadoop) and create the directories user and tmp there. Feel free to create additional directories that you intend to use later.
$ mkdir -p /mnt/beegfs/hadoop/user
$ mkdir /mnt/beegfs/hadoop/tmp
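If multiple users will run Hadoop jobs, the tmp directory typically needs to be world-writable with the sticky bit set, like a regular /tmp (an assumption about a multi-user setup; adjust to your site's policy):
$ chmod 1777 /mnt/beegfs/hadoop/tmp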
- Now edit the file $HADOOP_HOME/etc/hadoop/core-site.xml on every Hadoop node, as follows:
- Set fs.default.name to viewfs:///
- Set fs.AbstractFileSystem.beegfs.impl to com.beegfs.BeeGFS
- For each directory you created above, add a mount table entry: set fs.viewfs.mounttable.default.link./(dirname) to beegfs:/(path to dir). For example, set fs.viewfs.mounttable.default.link./user to beegfs:/mnt/beegfs/hadoop/user.
- Add the node name suffix. BeeGFS addresses the nodes by their hostnames, while Hadoop uses their fully qualified domain names. If the command "hostname -d" returns nothing, set the property fs.beegfs.node-name-suffix to an empty value. Otherwise, if it returns the host domain (e.g. global.cluster), set the property to the domain preceded by a dot (e.g. .global.cluster); see the example below.
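For example, on a node whose fully qualified name is node01.global.cluster (a hypothetical hostname), the suffix can be determined like this:
$ hostname -d
global.cluster
In this case, fs.beegfs.node-name-suffix is set to .global.cluster, as in the example core-site.xml below.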
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop-tmp/</value>
</property>
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.AbstractFileSystem.beegfs.impl</name>
<value>com.beegfs.BeeGFS</value>
</property>
<property>
<name>fs.beegfs.node-name-suffix</name>
<value>.global.cluster</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./tmp</name>
<value>beegfs:/mnt/beegfs/hadoop/tmp</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./user</name>
<value>beegfs:/mnt/beegfs/hadoop/user</value>
</property>
</configuration>
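Remember that core-site.xml must be identical on every node of the cluster. One simple way to distribute it is an scp loop (node01 and node02 are placeholder hostnames; this assumes Hadoop is installed under the same path on all nodes):
$ for host in node01 node02; do scp $HADOOP_HOME/etc/hadoop/core-site.xml $host:$HADOOP_HOME/etc/hadoop/; done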
Using the BeeGFS POSIX Interface
The following procedure has been tested for Hadoop 2.7.2.
- Install BeeGFS version 2014.01-r10 or later on your system, as you see fit. The Hadoop nodes only need to run the BeeGFS client services (beegfs-client and beegfs-helperd), but the other services and packages (beegfs-storage, beegfs-meta, beegfs-mgmtd, beegfs-utils) may also be installed there.
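Before continuing, you can check on each Hadoop node that the BeeGFS client mount is active, e.g.:
$ mount | grep beegfs
$ beegfs-df
(beegfs-df requires the beegfs-utils package.)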
- Choose a path under the BeeGFS mountpoint as the root path for Hadoop (e.g. /mnt/beegfs/hadoop) and create the directories user and tmp there. Feel free to create additional directories that you intend to use later.
$ mkdir -p /mnt/beegfs/hadoop/user
$ mkdir /mnt/beegfs/hadoop/tmp
- Now edit the file $HADOOP_HOME/etc/hadoop/core-site.xml on every Hadoop node, as follows:
- Set fs.default.name to viewfs:///
- Set fs.AbstractFileSystem.file.impl to org.apache.hadoop.fs.local.LocalFs
- For each directory you created above, add a mount table entry: set fs.viewfs.mounttable.default.link./(dirname) to file:/(path to dir). For example, set fs.viewfs.mounttable.default.link./user to file:/mnt/beegfs/hadoop/user. Note that these URLs use the scheme file, instead of the scheme beegfs used in the previous procedure.
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop-tmp/</value>
</property>
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.AbstractFileSystem.file.impl</name>
<value>org.apache.hadoop.fs.local.LocalFs</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./tmp</name>
<value>file:/mnt/beegfs/hadoop/tmp</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./user</name>
<value>file:/mnt/beegfs/hadoop/user</value>
</property>
</configuration>
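Before starting Hadoop, it is worth confirming that the user running the Hadoop daemons can write to the linked directories through the mountpoint (a simple sanity check; the file name is arbitrary):
$ touch /mnt/beegfs/hadoop/user/.write-test && rm /mnt/beegfs/hadoop/user/.write-test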
Testing Hadoop and BeeGFS
After following one of the procedures above, you can test the system as follows.
- Start the Hadoop YARN services.
$ cd $HADOOP_HOME
$ ./sbin/start-yarn.sh
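You can confirm that the YARN daemons started with the jps tool from the JDK; the ResourceManager and NodeManager processes should appear in its output:
$ jps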
- Run the commands below to copy a local file to the Hadoop file system powered by BeeGFS.
$ echo 123 >> /tmp/test.txt
$ ./bin/hadoop fs -ls /
$ ./bin/hadoop fs -put /tmp/test.txt /tmp
$ ./bin/hadoop fs -ls /tmp
- Check that the file was created in BeeGFS.
$ ls /mnt/beegfs/hadoop/tmp
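Optionally, you can also inspect how BeeGFS stores the file (e.g. its stripe pattern) with beegfs-ctl from the beegfs-utils package:
$ beegfs-ctl --getentryinfo /mnt/beegfs/hadoop/tmp/test.txt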