SIONlib  1.6.1
Scalable I/O library for parallel access to task-local files
Collective write/read calls (sion_coll_fwrite, ...)

Introduction

Collective I/O calls use an I/O schema similar to MPI-I/O two-stage I/O: Tasks will be divided into collectors and sender. Each collector collects and writes data of its own tasks and a number of following tasks to the SION-file. Each sender will only send its own data to the corresponding collector task.

Advantages of this methods are the reduced number of writer tasks, which could be lead e.g. to a better efficiency of the I/O nodes and a more densely packing of the chunks in the SION file. The reason for this is that the alignment to file system blocks boundaries is only needed per collector.

The number of collectors depends on the chunksize requested by each task and can also be set by environment variables (see below). The main goal is to use enough senders per collector so that the collector can write a full file system block.

Usage

In order to use collective I/O in SIONlib two changes to normal SIONlib output are needed.

  • The write call sion_fwrite() has to be replaced by its collective counterpart sion_coll_fwrite()
  • Either of the following
    • The environment variable SION_COLLSIZE or SION_COLLNUM (see below) needs to be set, or
    • the file needs to be opend with ...,collective,collsize=NUM,... added to the file_mode, where NUM has the same meaning as the value of the environment variable SION_COLLSIZE.

If both, the environment variable SION_COLLSIZE and the open flag are set, the environment variable overwrites the value chosen in the flags.

Parameters

Environment

  • SION_COLLDEBUG
  • SION_COLLSIZE
    • -1: collector is computed by SIONlib and depends on chunksizes and file system blocksize
    • 0: collective is not used
    • > 0: number of tasks to be collected by one (master) task
  • SION_COLLNUM
    • > 0: roughly the number of collectors to use

Procedure to determine collectors and senders

  • If SION_COLLSIZE set: -> sion_filedesc->collsize
  • If not check if SION_COLLNUM set: ->
    numcoll = max(SION_COLLNUM, ntasks);
    sion_filedesc->collsize = sion_filedesc->ntasks / numcoll;
  • in calculate_startpointer: not less than one fsblksize per collector
    max_num_collectors = (int) (gsize / sion_filedesc->fsblksize);
  • sion_filedesc->collsize > 0: number of collectors or sender / collector given by user
    num_collectors = min(sion_filedesc->ntasks /
    sion_filedesc->collsize, max_num_collectors);
  • sion_filedesc->collsize < 0: SIONlib computes number of collectors
    num_collectors = min(max_num_collectors, ntasks); # given by data
    size and fsblksize
  • limit number of collectors according table:
    if ((sion_filedesc->ntasks >= 512) && (num_collectors > 32))
    num_collectors = 32;
sion_filedesc->ntasks >= num_collectors > -> num_collectors =
256 16 16
128 8 8
64 8 8
32 8 8
16 8 4

Collective Modes

Normal write mode (for comparison)

+-----+ +-----+ +-----+ +-----+
tasks | t_0 | | t_1 | | t_2 | | t_3 |
+-----+ +-----+ +-----+ +-----+
| | | |
V V V V
+---------------------+---------------------+---------------------+---------------------+
FS blocks | d_0 | | d_1 | | d_2 | | d_3 | |
+---------------------+---------------------+---------------------+---------------------+

Each chunk is extended to at least one file system block. This prevents contention between different tasks trying to write to the same file system block, but also is a potential wast of space if the data actually written is very small.

Standard collective write mode

+-----+ +-----+ +-----+ +-----+
tasks | t_0 |<------+ | t_1 | | t_2 | | t_3 |
+-----+ | +-----+ +-----+ +-----+
| | | | |
| | V V |
| +----------+----------<----------+ |
| |
| +---------------------<---------------------+
| |
V V
+---------------------+---------------------+---------------------+---------------------+
FS blocks | d_0 | d_1 | d_2 | | d_3 | | | |
+---------------------+---------------------+---------------------+---------------------+

The standard collective write mode tries to collect data from different tasks for a better usage of the available space and a reduction of the processes actually writing to the file system (see introduction).

Merge collective write mode

+-----+ +-----+ +-----+ +-----+
tasks | t_0 |<------+ | t_1 | | t_2 | | t_3 |
+-----+ | +-----+ +-----+ +-----+
| | | | |
| | V V |
| +----------+----------<----------+ |
| |
| +---------------------<---------------------+
| |
V V
+---------------------+---------------------+---------------------+---------------------+
FS blocks | ----- d_0 ----- | | d_3 | | | |
+---------------------+---------------------+---------------------+---------------------+

Merge collective write behaves like the standard collective write but represents all the collected data as it would belong to the collector and leaves the collected chunks empty.

It is enabled using "...,collectivemerge,..." or "...,cmerge,..." in file mode.

Sources