
HPC @ Uni.lu

High Performance Computing in Luxembourg

Optimizing Performance on the Lustre Filesystem

Gaia uses Lustre for its $SCRATCH filesystem. Lustre has three major functional units:

  • The metadata servers (MDS), which hold the metadata targets (MDT). The MDTs store metadata such as filenames, directories, permissions and file properties.
  • The object storage servers (OSS), which hold the object storage targets (OST). The OSTs store the file data. The total capacity of the filesystem is the sum of the OST capacities.
  • The clients, which access and use the data.

The Gaia Lustre infrastructure is composed of 2 MDS servers and 6 OSS servers (2 MDTs and 24 OSTs). The following schema describes our infrastructure:

You can list the MDTs and OSTs with the command lfs df:

Terminal

   $ lfs df -h
   UUID                       bytes        Used   Available Use% Mounted on
   lustrefs-MDT0000_UUID        4.7T      873.6G        3.6T  19% /mnt/lustre[MDT:0]
   lustrefs-OST0000_UUID       14.3T       12.6T      997.0G  93% /mnt/lustre[OST:0]
   lustrefs-OST0001_UUID       14.3T       12.6T      993.9G  93% /mnt/lustre[OST:1]
   lustrefs-OST0002_UUID       14.3T       12.6T      994.3G  93% /mnt/lustre[OST:2]
   lustrefs-OST0003_UUID       14.3T       12.6T      997.0G  93% /mnt/lustre[OST:3]
   lustrefs-OST0004_UUID       14.3T       12.6T      997.7G  93% /mnt/lustre[OST:4]
   lustrefs-OST0005_UUID       14.3T       12.6T      996.2G  93% /mnt/lustre[OST:5]
   lustrefs-OST0006_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:6]
   lustrefs-OST0007_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:7]
   lustrefs-OST0008_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:8]
   lustrefs-OST0009_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:9]
   lustrefs-OST000a_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:10]
   lustrefs-OST000b_UUID       14.3T       12.3T        1.3T  90% /mnt/lustre[OST:11]
   lustrefs-OST000c_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:12]
   lustrefs-OST000d_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:13]
   lustrefs-OST000e_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:14]
   lustrefs-OST000f_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:15]
   lustrefs-OST0010_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:16]
   lustrefs-OST0011_UUID       29.1T       13.9T       15.2T  48% /mnt/lustre[OST:17]
   lustrefs-OST0012_UUID       21.8T      161.5G       21.6T   1% /mnt/lustre[OST:18]
   lustrefs-OST0013_UUID       21.8T      161.4G       21.6T   1% /mnt/lustre[OST:19]
   lustrefs-OST0014_UUID       21.8T      161.5G       21.6T   1% /mnt/lustre[OST:20]
   lustrefs-OST0015_UUID       21.8T      161.4G       21.6T   1% /mnt/lustre[OST:21]
   lustrefs-OST0016_UUID       21.8T      161.2G       21.6T   1% /mnt/lustre[OST:22]
   lustrefs-OST0017_UUID       21.8T      161.4G       21.6T   1% /mnt/lustre[OST:23]
   
   filesystem summary:       477.0T      233.7T      234.5T  50% /mnt/lustre

 

  • MDT 0 is stored on the MDS Nexsan enclosure
  • OSTs 0 to 5 are stored on the OSS1 Nexsan enclosure
  • OSTs 6 to 11 are stored on the OSS2 Nexsan enclosure
  • OSTs 12 to 17 are stored on the OSS3&4 Netapp enclosure
  • OSTs 18 to 23 are stored on the OSS5&6 Netapp enclosure
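
Since the listing above shows several OSTs above 90% usage, it can be handy to filter the output of lfs df. The following is a minimal sketch, not an official tool: ost_usage_over is a hypothetical helper name, and it assumes the column layout shown in the listing (usage percentage in the fifth column).

```shell
# Print the OSTs whose usage is at or above a threshold (in percent).
# Reads `lfs df` output on stdin, so it also works on saved output.
# ost_usage_over is a hypothetical helper name, not part of Lustre.
ost_usage_over() {
    awk -v limit="$1" '
        /OST/ {
            use = $5
            sub(/%/, "", use)          # "93%" -> "93"
            if (use + 0 >= limit + 0)
                print $1, $5
        }'
}

# Usage on a Lustre client (mount point taken from the listing above):
#   lfs df -h /mnt/lustre | ost_usage_over 90
```

Reading from standard input keeps the filter independent of the lfs command itself, so it can be tried on a copy of the listing first.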

 

File striping

Lustre is able to distribute the segments of a single file across several OSTs. This technique is called file striping.

File striping increases the throughput of operations by taking advantage of several OSSs and OSTs: one or more clients can read or write different parts of the same file in parallel. On the other hand, striping small files can decrease performance.

File striping also allows files larger than a single OST. Large files MUST be striped over several OSTs to avoid filling a single OST and degrading performance for all users.

We can tune file striping using 3 properties:

 

Property      | Effect                                                           | Default        | Accepted values             | Advised values
stripe_size   | Size of the file stripes in bytes                                | 1048576 (1m)   | > 0                         | > 0
stripe_count  | Number of OSTs to stripe across                                  | 2              | -1 (use all the OSTs), 1-24 | 1-6
stripe_offset | Index of the OST where the first stripe of files will be written | -1 (automatic) | -1, 0-23                    | -1
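
To make the interplay of stripe_size and stripe_count more concrete, here is a small sketch of the arithmetic, assuming the plain round-robin layout in which consecutive stripes go to consecutive objects of the file. stripe_object_for is a hypothetical helper; it only computes which of the file's objects (numbered from 0) holds a given byte offset, not the actual OST index, which Lustre chooses.

```shell
# stripe_object_for OFFSET STRIPE_SIZE STRIPE_COUNT
# Prints which of the file's objects (0 .. STRIPE_COUNT-1) holds the byte at
# OFFSET, assuming a plain round-robin layout. Hypothetical helper.
stripe_object_for() {
    echo $(( ($1 / $2) % $3 ))
}

# With the default settings (stripe_size 1048576, stripe_count 2):
stripe_object_for 0       1048576 2   # first MiB  -> object 0
stripe_object_for 1048576 1048576 2   # second MiB -> object 1
stripe_object_for 2097152 1048576 2   # third MiB  -> object 0 again
```

A file smaller than one stripe_size therefore lives on a single object whatever the stripe_count, which is one way to see why striping small files brings no throughput benefit.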

 

You can see the settings applied to a directory or file with this command:

 

Terminal

   $ lfs getstripe RHEL-6.6-20140926.0-Server-x86_64-dvd1.iso 
   RHEL-6.6-20140926.0-Server-x86_64-dvd1.iso
   lmm_stripe_count:   18
   lmm_stripe_size:    4194304
   lmm_layout_gen:     0
   lmm_stripe_offset:  14
           obdidx           objid           objid           group
               14        98340910      0x5dc902e                0
               13        98340418      0x5dc8e42                0
               15        98340350      0x5dc8dfe                0
                4       592874375     0x23568b87                0
               17        98340320      0x5dc8de0                0
   ...
   
   
   
   $ lfs getstripe -d .
   stripe_count:   2 stripe_size:    1048576 stripe_offset:  -1

 

How to set the file striping parameters

With the command lfs setstripe, you can tune the file striping parameters of a directory, or create a new file with specific striping parameters. Newly created files and directories inherit these parameters from their parent directory. However, the parameters cannot be changed on an existing file.

 

usage: setstripe -d <directory>   (to delete default striping)
usage: setstripe [--stripe-count|-c <stripe_count>]
                                 [--stripe-index|-i <start_ost_idx>]
                                 [--stripe-size|-S <stripe_size>]

 

Parameter     | Description
stripe_size   | Number of bytes on each OST. Can be specified with k, m or g (in KB, MB and GB respectively)
stripe_count  | Number of OSTs to stripe over
start_ost_idx | OST index of the first stripe (default is -1, which lets Lustre choose the optimal OST)
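
Because the parameters of an existing file cannot be changed, a common workaround is to copy the data into a new file that was created with the desired layout. The sketch below illustrates this under stated assumptions: restripe_copy is a hypothetical helper, the file is not in use by a running job, and there is enough free space for a temporary second copy. Timestamps are not preserved.

```shell
# restripe_copy FILE STRIPE_COUNT STRIPE_SIZE
# Hypothetical helper: rewrites FILE with a new layout by copying it into a
# freshly created, empty file that already carries the wanted striping.
restripe_copy() {
    src=$1; count=$2; size=$3
    tmp="$src.restripe.$$"
    # 1. create an empty file with the requested layout
    # 2. cp writes into that existing file, so the data uses the new layout
    # 3. replace the original; on any failure, remove the temporary file
    lfs setstripe --stripe-count "$count" --stripe-size "$size" "$tmp" &&
        cp -- "$src" "$tmp" &&
        mv -- "$tmp" "$src" ||
        { rm -f -- "$tmp"; return 1; }
}

# Usage (file name is a placeholder):
#   restripe_copy big_result.dat 6 4M
#   lfs getstripe big_result.dat
```

Checking the result with lfs getstripe afterwards confirms whether the new layout was applied.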

 

Examples

  • Set the striping parameters for a directory containing only small files (< 20MB)

Terminal

   $ cd $SCRATCH
   $ mkdir test_small_files
   $ lfs getstripe test_small_files
   test_small_files
   stripe_count:   2 stripe_size:    1048576 stripe_offset:  -1 pool:
   $ lfs setstripe --stripe-size 1M --stripe-count 1 test_small_files
   $ lfs getstripe test_small_files
   test_small_files
   stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1

  • Set the striping parameters for a directory containing only large files between 100MB and 1GB

Terminal

   $ mkdir test_large_files
   $ lfs setstripe --stripe-size 2M --stripe-count 2 test_large_files
   $ lfs getstripe test_large_files
   test_large_files
   stripe_count:   2 stripe_size:    2097152 stripe_offset:  -1

  • Set the striping parameters for a directory containing files larger than 1GB

Terminal

   $ mkdir test_larger_files
   $ lfs setstripe --stripe-size 4M --stripe-count 6 test_larger_files
   $ lfs getstripe test_larger_files
   test_larger_files
   stripe_count:   6 stripe_size:    4194304 stripe_offset:  -1

Note that these are simple examples. The optimal settings differ depending on the application (concurrent threads accessing the same file, size of each write operation, etc.).
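
The three examples above can be summarised as a small lookup, sketched here as a starting point rather than a rule. suggest_stripe is a hypothetical helper. File sizes between 20MB and 100MB are not covered by the examples, so keeping the directory default (2 stripes of 1M) for them is an assumption.

```shell
# suggest_stripe SIZE_IN_BYTES
# Prints "stripe_count stripe_size" following the three examples above.
# Hypothetical helper; the 20MB-100MB case is an assumption (default layout).
suggest_stripe() {
    size=$1
    mb=$((1024 * 1024))
    if   [ "$size" -lt $((20 * mb)) ];   then echo "1 1M"   # small files
    elif [ "$size" -lt $((100 * mb)) ];  then echo "2 1M"   # assumed: default
    elif [ "$size" -le $((1024 * mb)) ]; then echo "2 2M"   # 100MB to 1GB
    else                                      echo "6 4M"   # larger than 1GB
    fi
}

# Usage (directory name is a placeholder):
#   set -- $(suggest_stripe 5368709120)
#   lfs setstripe --stripe-count "$1" --stripe-size "$2" my_output_dir
```

The helper takes the expected size of a typical file in the directory; applications with many concurrent writers to one file may still benefit from different values.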

Best practices

For more details, you can read the following external resources: