This chapter describes the variables that specify the environment under which your MPI programs will run. Environment variables have default values if not explicitly set. You can change some variables to achieve particular performance objectives; others are required values for standard-compliant programs.
Table 5-1 describes the MPI environment variables you can set for your programs. Unless otherwise specified, these variables are available for both Linux and IRIX systems.
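These variables are set in the shell environment before the job is launched with mpirun. The following sketch shows the general pattern for a Bourne-compatible shell; the program name, process count, and values are placeholders for illustration, not recommendations:

```shell
# Illustrative only: variable values and ./a.out are placeholders.
export MPI_BUFS_PER_PROC=32     # allocate more private message buffers per process
export MPI_DSM_VERBOSE=1        # have mpirun report NUMA process placement
# mpirun -np 8 ./a.out          # launch step (commented out; requires MPI installed)
echo "MPI_BUFS_PER_PROC=$MPI_BUFS_PER_PROC MPI_DSM_VERBOSE=$MPI_DSM_VERBOSE"
```

Variables exported this way are inherited by the processes that mpirun starts on each host.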
Table 5-1. MPI Environment Variables
Variable | Description | Default
---|---|---
MPI_ARRAY | Sets an alternative array name to be used for communicating with Array Services when a job is being launched. | Default name set in the arrayd.conf file.
MPI_BAR_COUNTER | Specifies the use of a simple counter barrier algorithm within the MPI_Barrier(3) and MPI_Win_fence(3) functions. | Not enabled if job contains more than 64 PEs.
MPI_BAR_DISSEM | Specifies the use of the alternate barrier algorithm, the dissemination/butterfly, within the MPI_Barrier(3) and MPI_Win_fence(3) functions. This alternate algorithm provides better performance on jobs with larger PE counts. The MPI_BAR_DISSEM option is recommended for jobs with PE counts of 64 or greater. | Disabled if job contains fewer than 64 PEs; otherwise, enabled.
MPI_BUFFER_MAX | Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. Currently, this mechanism is available only for communication between MPI processes on the same host. The sender data must reside in the symmetric data segment, the symmetric heap, or the global heap. The MPI data type on the send side must also be a contiguous type. | Not enabled.
| If the XPMEM driver is enabled (for single-host jobs, see MPI_XPMEM_ON; for multihost jobs, see MPI_USE_XPMEM), MPI allows single-copy transfers for basic predefined MPI data types from any sender data location, including the stack and private heap. The XPMEM driver also allows single-copy transfers across partitions. |
| If cross mapping of data segments is enabled at job startup, data in common blocks resides in the symmetric data segment. On systems running IRIX 6.5.2 or later, this feature is enabled by default. You can employ the symmetric heap by using the shmalloc (shpalloc) functions in LIBSMA. |
| Testing of this feature indicates that most MPI applications benefit more from buffering of medium-sized messages than from buffering of large messages, even though buffering of medium-sized messages requires an extra copy of data. However, highly synchronized applications that perform large message transfers can benefit from the single-copy pathway. |
MPI_BUFS_PER_HOST | Determines the number of shared message buffers (16 Kbytes each) that MPI is to allocate for each host. These buffers are used to send large messages. | 16 pages (each page is 16 Kbytes)
MPI_BUFS_PER_PROC | Determines the number of private message buffers (16 Kbytes each) that MPI is to allocate for each process. These buffers are used to send large messages. | 16 pages (each page is 16 Kbytes)
MPI_BYPASS_CRC | Adds a checksum to each large message sent via HIPPI bypass. If the checksum does not match the data received, the job is terminated. Use of this environment variable might degrade performance. | Not set
MPI_BYPASS_DEV_SELECTION | Specifies the algorithm MPI is to use for sending messages over multiple HIPPI adapters. Set this variable to one of the supported device-selection algorithm values. | 1
MPI_BYPASS_DEVS | Sets the order for opening HIPPI adapters. The list of devices does not need to be space-delimited (0123 is also valid). | 0 1 2 3
| An array node usually has at least one HIPPI adapter, the interface to the HIPPI network. The HIPPI bypass is a lower software layer that interfaces directly to this adapter. The bypass sends MPI control and data messages that are 16 Kbytes or smaller. |
| When you know that a system has multiple HIPPI adapters, you can use the MPI_BYPASS_DEVS variable to specify the adapter that a program opens first. You can use this variable to ensure that multiple MPI programs distribute their traffic across the available adapters. If you prefer not to use the HIPPI bypass, you can turn it off by setting the MPI_BYPASS_OFF variable. |
| When a HIPPI adapter reaches its maximum capacity of four MPI programs, it is not available to additional MPI programs. If all HIPPI adapters are busy, MPI sends internode messages by using TCP over the adapter instead of the bypass. |
MPI_BYPASS_SINGLE | By default, the HIPPI OS bypass multiboard feature allows MPI messages to be sent over multiple HIPPI connections when multiple connections are available. Setting this variable disables the multiboard feature, so that MPI operates as it did in previous releases, using a single HIPPI adapter connection, if available. | Not set
MPI_BYPASS_VERBOSE | Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the HIPPI OS bypass connections and the HIPPI adapters that are detected on each of the hosts. | Not set
MPI_CHECK_ARGS | Enables checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI, so this feature is useful for debugging purposes. Using argument checking adds several microseconds to latency. | Not enabled.
MPI_COMM_MAX | Sets the maximum number of communicators that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 256
MPI_DIR | Sets the working directory on a host. When an mpirun command is issued, the Array Services daemon on the local or distributed node responds by creating a user session and starting the required MPI processes. The user ID for the session is that of the user who invokes mpirun, so this user must be listed in the .rhosts file on the corresponding nodes. By default, the working directory for the session is the user's $HOME directory on each node. You can direct all nodes to a different directory (an NFS directory that is available to all nodes, for example) by setting the MPI_DIR variable to a different directory. | $HOME on the node. If using -np or -nt, the default is the current directory.
MPI_DPLACE_INTEROP_OFF | Disables an MPI/dplace interoperability feature available beginning with IRIX 6.5.13. By setting this variable, you can obtain the behavior of MPI with dplace on older releases of IRIX. | Not enabled.
MPI_DSM_CPULIST | Specifies a list of CPUs on which to run an MPI application. To ensure that processes are linked to CPUs, use this variable in conjunction with the MPI_DSM_MUSTRUN variable. | Not enabled.
MPI_DSM_MUSTRUN | Enforces memory locality for MPI processes. Use of this feature ensures that each MPI process obtains a CPU and physical memory on the node to which it was originally assigned. This variable improves program performance on IRIX systems running release 6.5.7 and earlier, when running a program on a quiet system. With later IRIX releases, under certain circumstances, you do not need to set this variable. Internally, this feature directs the library to use the process_cpulink(3) function instead of process_mldlink(3) to control memory placement. Do not use MPI_DSM_MUSTRUN when the job is submitted to Miser (see miser_submit(1)) because this might cause the program to hang. | Not enabled.
MPI_DSM_OFF | Turns off nonuniform memory access (NUMA) optimization in the MPI library. | Not enabled.
MPI_DSM_PLACEMENT | Specifies the default placement policy to be used for the stack and data segments of an MPI process. Set this variable to one of the supported placement policy values. | fixed
MPI_DSM_PPM | Sets the number of MPI processes per memory locality domain (mld). For Origin 2000 systems, values of 1 or 2 are allowed. For Origin 3000 systems, values of 1, 2, or 4 are allowed. | Origin 2000 systems, 2; Origin 3000 systems, 4.
MPI_DSM_TOPOLOGY (IRIX systems only) | Specifies the shape of the set of hardware nodes on which the PE memories are allocated. Set this variable to one of the supported topology values. | Not enabled.
MPI_DSM_VERBOSE | Instructs mpirun to print information about process placement for jobs running on NUMA systems. | Not enabled.
MPI_DSM_VERIFY | Instructs mpirun to run some diagnostic checks on proper memory placement of MPI data structures at job startup. If errors are found, a diagnostic message is printed to stderr. | Not enabled.
MPI_GM_DEVS | Sets the order for opening GM (Myrinet) adapters. The list of devices does not need to be space-delimited (0321 is valid). The syntax is the same as for the MPI_BYPASS_DEVS environment variable. In this release, a maximum of eight adapters is supported on a single host. | MPI uses all available GM (Myrinet) devices.
MPI_GM_VERBOSE | Allows some diagnostic information concerning messaging between processes using GM (Myrinet) to be displayed on stderr. | Not enabled.
MPI_GROUP_MAX | Sets the maximum number of groups that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 256
MPI_GSN_DEVS | Sets the order for opening GSN adapters. The list of devices does not need to be quoted or space-delimited (0123 is valid). | MPI uses all available GSN devices.
MPI_GSN_VERBOSE | Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the GSN (ST protocol) OS bypass connections and the GSN adapters that are detected on each of the hosts. | Not enabled.
MPI_MSG_RETRIES | Specifies the number of times the MPI library attempts to get a message header, if none are available. Each MPI message that is sent requires an initial message header. If one is not available after the specified number of attempts, the job aborts. | 500
| Note that this variable no longer applies to processes on the same host, or when using the GM (Myrinet) protocol. In these cases, message headers are allocated dynamically on an as-needed basis. |
MPI_MSGS_MAX | Controls the total number of message headers that can be allocated. This allocation applies to messages exchanged between processes on a single host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. Note that the initial allocation of memory for message headers is 128 Kbytes. | Allows up to 64 Mbytes to be allocated for message headers. If you set this variable, specify the maximum number of message headers.
MPI_MSGS_PER_HOST | Sets the number of message headers to allocate for MPI messages on each MPI host. Space for messages that are destined for a process on a different host is allocated as shared memory on the host on which the sending processes are located. MPI locks these pages in memory. Use this variable to allocate buffer space for interhost messages. | 1024
| The previous description does not apply to processes that use the GM (Myrinet) OS bypass protocol. In this case, message headers are allocated dynamically as needed. See the MPI_MSGS_MAX variable description. |
MPI_MSGS_PER_PROC | This variable is effectively obsolete. Message headers are now allocated on an as-needed basis for messaging either between processes on the same host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. You can use the new MPI_MSGS_MAX variable to control the total number of message headers that can be allocated. | 1024
MPI_OPENMP_INTEROP (IRIX systems only) | Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. | Not enabled
MPI_REQUEST_MAX | Sets the maximum number of simultaneous nonblocking sends and receives that can be active at one time. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 16384
MPI_SHARED_VERBOSE | Allows some diagnostic information concerning messaging within a host to be displayed on stderr. | Not enabled.
MPI_SLAVE_DEBUG_ATTACH | Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup, describing how to attach to it from another window using the dbx debugger on IRIX or the gdb debugger on Linux. You must attach the debugger to process N within ten seconds of the printing of the message. | Not enabled.
MPI_STATIC_NO_MAP (IRIX systems only) | Disables cross mapping of static memory between MPI processes. This variable can be set to reduce the significant MPI job startup and shutdown time that can be observed for jobs involving more than 512 processors on a single IRIX host. Note that setting this shell variable disables certain internal MPI optimizations and also restricts the use of MPI-2 one-sided functions. For more information, see the MPI_Win man page. | Not enabled.
MPI_STATS | Enables printing of MPI internal statistics. Each MPI process prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. To prefix the statistics messages with the MPI rank, use the -p option on the mpirun command. | Not enabled.
MPI_TYPE_DEPTH | Sets the maximum number of nesting levels for derived data types. (May be required by standard-compliant programs.) This variable limits the maximum depth of derived data types that an application can create. MPI logs error messages if the limit specified by MPI_TYPE_DEPTH is exceeded. | 8 levels
MPI_TYPE_MAX | Sets the maximum number of derived data types that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 1024
MPI_UNBUFFERED_STDIO | Normally, mpirun line-buffers output received from the MPI processes on both the stdout and stderr standard IO streams. This prevents lines of text from different processes from being merged into one line, and allows use of the mpirun -prefix option. | Not enabled.
| There is, of course, a limit to the amount of buffer space that mpirun has available (currently, about 8,100 characters can appear between new line characters per stream per process). If more characters are emitted before a new line character, the MPI program aborts with an error message. |
| Setting the MPI_UNBUFFERED_STDIO environment variable disables this buffering. This is useful, for example, when a program's rank 0 emits a series of periods over time to indicate progress of the program. With buffering, the entire line of periods is output only when the new line character is seen. Without buffering, each period is displayed as soon as mpirun receives it from the MPI program. (Note that the MPI program still needs to call fflush(3) or FLUSH(101) to flush the stdout buffer from the application code.) |
| Additionally, setting MPI_UNBUFFERED_STDIO allows an MPI program that emits very long output lines to execute correctly. |
| Note that if MPI_UNBUFFERED_STDIO is set, the mpirun -prefix option is ignored. |
MPI_USE_GSN (IRIX 6.5.12 systems or later) | Requires the MPI library to use the GSN (ST protocol) OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a GSN connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
| GSN imposes a limit of one MPI process using GSN per CPU on a system. For example, on a 128-CPU system, you can run multiple MPI jobs, as long as the total number of MPI processes using the GSN bypass does not exceed 128. |
| Once the maximum allowed number of MPI processes using GSN is reached, subsequent MPI jobs return an error to the user. |
| If a few CPUs are still available, but not enough to satisfy the entire MPI job, the error is still issued and the MPI job is terminated. |
MPI_USE_GM (IRIX systems only) | Requires the MPI library to use the GM (Myrinet) OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a GM connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
MPI_USE_HIPPI (IRIX systems only) | Requires the MPI library to use the HiPPI 800 OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a HiPPI connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
MPI_USE_TCP | Requires the MPI library to use the TCP/IP driver as the interconnect when running across multiple hosts or running with multiple binaries. | Not set
MPI_USE_XPMEM (IRIX 6.5.13 systems or later) | Requires the MPI library to use the XPMEM driver as the interconnect when running across multiple hosts or running with multiple binaries. This driver allows MPI processes running on one partition to communicate with MPI processes on a different partition via the NUMAlink network. The NUMAlink network is powered by block transfer engines (BTEs). BTE data transfers do not require processor resources. | Not set
| The XPMEM (cross partition) device driver is available only on Origin 3000 and Origin 300 systems running IRIX 6.5.13 or greater. |
| If the hosts specified on the mpirun command do not all reside in the same partitioned system, you can select one additional interconnect via the MPI_USE variables. MPI communication between partitions goes through the XPMEM driver, and communication between non-partitioned hosts goes through the second interconnect. |
MPI_XPMEM_ON | Enables the XPMEM single-copy enhancements for processes residing on the same host. | Not set
| The XPMEM enhancements allow single-copy transfers for basic predefined MPI data types from any sender data location, including the stack and private heap. Without enabling XPMEM, single-copy is allowed only from data residing in the symmetric data, symmetric heap, or global heap. |
| Both the MPI_XPMEM_ON and MPI_BUFFER_MAX variables must be set to enable these enhancements. Both are disabled by default. |
| If certain additional conditions are met, the block transfer engine (BTE) is invoked instead of bcopy, to provide increased bandwidth. |
MPI_XPMEM_THRESHOLD | Specifies a minimum message size, in bytes, for which single-copy messages between processes residing on the same host will be transferred via the BTE, instead of bcopy. Certain conditions must exist before the BTE transfer is invoked. | 8192
| The XPMEM enhancements allow single-copy transfers for basic MPI types from any sender data location, including the stack and private heap. Without enabling XPMEM, single-copy is allowed only from data residing in the symmetric data, symmetric heap, or global heap. |
| Both the MPI_XPMEM_THRESHOLD and MPI_BUFFER_MAX variables must be set to enable these enhancements. Both are disabled by default. |
| When the required conditions are met, the MPI_XPMEM_THRESHOLD environment variable specifies the minimum message size, in bytes, for which the message will be transferred via the BTE. If a value is not provided, a default of 8192 bytes is used. |
| The XPMEM single-copy enhancements require Origin 3000 and Origin 300 servers running IRIX release 6.5.15 or greater. |
MPI_XPMEM_VERBOSE | Setting this variable allows additional MPI diagnostic information to be printed in the standard output stream. This information contains details about the XPMEM connections. | Not enabled
PAGESIZE_DATA | Specifies the desired page size, in kilobytes, for program data areas. On Origin series systems, supported values include 16, 64, 256, 1024, and 4096. Specified values must be integers. | Not enabled
PAGESIZE_STACK | Specifies the desired page size, in kilobytes, for program stack areas. On Origin series systems, supported values include 16, 64, 256, 1024, and 4096. Specified values must be integers. | Not enabled
SMA_GLOBAL_ALLOC | Activates the LIBSMA-based global heap facility. This variable is used by 64-bit MPI applications for certain internal optimizations and as support for the MPI_Alloc_mem function. For additional details, see the intro_shmem(3) man page. | Not enabled
SMA_GLOBAL_HEAP_SIZE | For 64-bit applications, specifies the per-process size of the LIBSMA global heap, in bytes. | 33,554,432 bytes
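Several of the XPMEM entries above interact: the single-copy enhancements take effect only when MPI_BUFFER_MAX is also set. A sketch of enabling the XPMEM single-copy path on an IRIX host might look like the following; the numeric values are illustrative placeholders, not tuned recommendations:

```shell
# Illustrative only: thresholds are placeholders, and the mpirun line
# is commented out because it requires an MPI/XPMEM-capable system.
export MPI_XPMEM_ON=1             # enable XPMEM single-copy enhancements
export MPI_BUFFER_MAX=16384       # messages of 16 Kbytes or more become single-copy candidates
export MPI_XPMEM_THRESHOLD=65536  # use the BTE instead of bcopy above 64 Kbytes
# mpirun -np 16 ./a.out
echo "XPMEM=$MPI_XPMEM_ON BUFFER_MAX=$MPI_BUFFER_MAX THRESHOLD=$MPI_XPMEM_THRESHOLD"
```

Because MPI_BUFFER_MAX also gates the single-copy pathway described under MPI_BUFFER_MAX, choosing its value too small can bypass buffering for messages that would have benefited from it.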
An MPI implementation can copy data that is being sent to another process into an internal temporary buffer so that the MPI library can return from the MPI function, giving execution control back to the user. However, according to the MPI standard, you should not assume that there is any message buffering between processes because the MPI standard does not mandate a buffering strategy. Some implementations choose to buffer user data internally, while other implementations block in the MPI routine until the data can be sent. These different buffering strategies have performance and convenience implications.
Most MPI implementations do use buffering for performance reasons and some programs depend on it. Table 5-2 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial MPI_Send call, waiting for an MPI_Recv call to take the message. Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv calls get the messages. Nevertheless, program logic like this is not valid by the MPI standard.
The SGI implementation of MPI uses buffering under most circumstances. Short messages of 64 or fewer bytes are always buffered. On IRIX systems, longer messages are buffered unless the message to be sent resides in either a common block, the symmetric heap, or global shared heap and the sending and receiving processes reside on the same host. The MPI data type on the send side must also be a contiguous type. The message size must be greater than the size setting for MPI_BUFFER_MAX (see Table 5-1). If the XPMEM driver is enabled (for single host jobs, see MPI_XPMEM_ON and for multihost jobs, see MPI_USE_XPMEM), MPI allows single-copy transfers for basic MPI types from any sender data location, including the stack and private heap. The XPMEM driver also allows single-copy transfers across partitions. Under these circumstances, the receiver copies the data directly into its receive message area without buffering. Obviously, MPI applications with code segments equivalent to that shown in Table 5-2 will almost certainly deadlock if this bufferless pathway is available.
Note: This feature is not currently available on Linux systems.
Table 5-2. Outline of Improper Dependence on Buffering
Process 1 | Process 2 |
---|---|
MPI_Send(2,....) | MPI_Send(1,....) |
MPI_Recv(2,....) | MPI_Recv(1,....) |
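A buffering-independent version of the exchange in Table 5-2 has one side send first while the other receives first, or has both sides use the combined MPI_Sendrecv call, which the MPI standard guarantees will not deadlock. In the same shorthand as Table 5-2:

```
Process 1                  Process 2
MPI_Send(2,....)           MPI_Recv(1,....)
MPI_Recv(2,....)           MPI_Send(1,....)

or, on both processes:

MPI_Sendrecv(....)         MPI_Sendrecv(....)
```

Either ordering works correctly whether or not the implementation buffers messages, so the program remains valid if the bufferless single-copy pathway described above is in effect.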