This chapter describes the variables that specify the environment under which your MPI programs will run. Environment variables have default values if not explicitly set. You can change some variables to achieve particular performance objectives; others are required values for standard-compliant programs.
Table 5-1 describes the MPI environment variables you can set for your programs. Unless otherwise specified, these variables are available for both Linux and IRIX systems.
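These variables are set in the shell environment before the job is launched with mpirun. The following sketch shows the general pattern for a Bourne-compatible shell; the program name, process count, and values are placeholders for illustration, not recommendations:

```shell
# Illustrative only: variable values and ./a.out are placeholders.
export MPI_BUFS_PER_PROC=32     # allocate more private message buffers per process
export MPI_DSM_VERBOSE=1        # have mpirun report NUMA process placement
# mpirun -np 8 ./a.out          # launch step (commented out; requires MPI installed)
echo "MPI_BUFS_PER_PROC=$MPI_BUFS_PER_PROC MPI_DSM_VERBOSE=$MPI_DSM_VERBOSE"
```

Variables exported this way are inherited by the processes that mpirun starts on each host.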
Table 5-1. MPI Environment Variables
Variable | Description | Default
---|---|---
MPI_ARRAY | Sets an alternative array name to be used for communicating with Array Services when a job is being launched. | Default name set in the arrayd.conf file.
MPI_BAR_COUNTER | Specifies the use of a simple counter barrier algorithm within the MPI_Barrier(3) and MPI_Win_fence(3) functions. | Not enabled if job contains more than 64 PEs.
MPI_BAR_DISSEM | Specifies the use of the alternate barrier algorithm, the dissemination/butterfly, within the MPI_Barrier(3) and MPI_Win_fence(3) functions. This alternate algorithm provides better performance on jobs with larger PE counts. The MPI_BAR_DISSEM option is recommended for jobs with PE counts of 64 or greater. | Disabled if job contains fewer than 64 PEs; otherwise, enabled.
MPI_BUFFER_MAX | Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. Currently, this mechanism is available only for communication between MPI processes on the same host. The sender data must reside in the symmetric data segment, the symmetric heap, or the global heap. The MPI data type on the send side must also be a contiguous type. | Not enabled.
| If the XPMEM driver is enabled (for single-host jobs, see MPI_XPMEM_ON; for multihost jobs, see MPI_USE_XPMEM), MPI allows single-copy transfers for basic predefined MPI data types from any sender data location, including the stack and private heap. The XPMEM driver also allows single-copy transfers across partitions. |
| If cross mapping of data segments is enabled at job startup, data in common blocks resides in the symmetric data segment. On systems running IRIX 6.5.2 or later, this feature is enabled by default. You can employ the symmetric heap by using the shmalloc (shpalloc) functions in LIBSMA. |
| Testing of this feature indicates that most MPI applications benefit more from buffering of medium-sized messages than from buffering of large messages, even though buffering of medium-sized messages requires an extra copy of data. However, highly synchronized applications that perform large message transfers can benefit from the single-copy pathway. |
MPI_BUFS_PER_HOST | Determines the number of shared message buffers (16 Kbytes each) that MPI is to allocate for each host. These buffers are used to send large messages. | 16 pages (each page is 16 Kbytes)
MPI_BUFS_PER_PROC | Determines the number of private message buffers (16 Kbytes each) that MPI is to allocate for each process. These buffers are used to send large messages. | 16 pages (each page is 16 Kbytes)
MPI_BYPASS_CRC | Adds a checksum to each large message sent via HIPPI bypass. If the checksum does not match the data received, the job is terminated. Use of this environment variable might degrade performance. | Not set
MPI_BYPASS_DEV_SELECTION | Specifies the algorithm MPI is to use for sending messages over multiple HIPPI adapters. Set this variable to one of the supported device-selection algorithm values. | 1
MPI_BYPASS_DEVS | Sets the order for opening HIPPI adapters. The list of devices does not need to be space-delimited (0123 is also valid). | 0 1 2 3
| An array node usually has at least one HIPPI adapter, the interface to the HIPPI network. The HIPPI bypass is a lower software layer that interfaces directly to this adapter. The bypass sends MPI control and data messages that are 16 Kbytes or smaller. |
| When you know that a system has multiple HIPPI adapters, you can use the MPI_BYPASS_DEVS variable to specify the adapter that a program opens first. You can use this variable to ensure that multiple MPI programs distribute their traffic across the available adapters. If you prefer not to use the HIPPI bypass, you can turn it off by setting the MPI_BYPASS_OFF variable. |
| When a HIPPI adapter reaches its maximum capacity of four MPI programs, it is not available to additional MPI programs. If all HIPPI adapters are busy, MPI sends internode messages by using TCP over the adapter instead of the bypass. |
MPI_BYPASS_SINGLE | By default, the HIPPI OS bypass multiboard feature allows MPI messages to be sent over multiple HIPPI connections when multiple connections are available. Setting this variable disables the multiboard feature, so that MPI operates as it did in previous releases, using a single HIPPI adapter connection, if available. | Not set
MPI_BYPASS_VERBOSE | Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the HIPPI OS bypass connections and the HIPPI adapters that are detected on each of the hosts. | Not set
MPI_CHECK_ARGS | Enables checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI, so this feature is useful for debugging purposes. Using argument checking adds several microseconds to latency. | Not enabled.
MPI_COMM_MAX | Sets the maximum number of communicators that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 256
MPI_DIR | Sets the working directory on a host. When an mpirun command is issued, the Array Services daemon on the local or distributed node responds by creating a user session and starting the required MPI processes. The user ID for the session is that of the user who invokes mpirun, so this user must be listed in the .rhosts file on the corresponding nodes. By default, the working directory for the session is the user's $HOME directory on each node. You can direct all nodes to a different directory (an NFS directory that is available to all nodes, for example) by setting the MPI_DIR variable to a different directory. | $HOME on the node. If using -np or -nt, the default is the current directory.
MPI_DPLACE_INTEROP_OFF | Disables an MPI/dplace interoperability feature available beginning with IRIX 6.5.13. By setting this variable, you can obtain the behavior of MPI with dplace on older releases of IRIX. | Not enabled.
MPI_DSM_CPULIST | Specifies a list of CPUs on which to run an MPI application. To ensure that processes are linked to CPUs, use this variable in conjunction with the MPI_DSM_MUSTRUN variable. | Not enabled.
MPI_DSM_MUSTRUN | Enforces memory locality for MPI processes. Use of this feature ensures that each MPI process obtains a CPU and physical memory on the node to which it was originally assigned. This variable improves program performance on IRIX systems running release 6.5.7 and earlier, when running a program on a quiet system. With later IRIX releases, under certain circumstances, you do not need to set this variable. Internally, this feature directs the library to use the process_cpulink(3) function instead of process_mldlink(3) to control memory placement. Do not use MPI_DSM_MUSTRUN when the job is submitted to Miser (see miser_submit(1)) because this might cause the program to hang. | Not enabled.
MPI_DSM_OFF | Turns off nonuniform memory access (NUMA) optimization in the MPI library. | Not enabled.
MPI_DSM_PLACEMENT | Specifies the default placement policy to be used for the stack and data segments of an MPI process. Set this variable to one of the supported placement policy values. | fixed
MPI_DSM_PPM | Sets the number of MPI processes per memory locality domain (mld). For Origin 2000 systems, values of 1 or 2 are allowed. For Origin 3000 systems, values of 1, 2, or 4 are allowed. | Origin 2000 systems, 2; Origin 3000 systems, 4.
MPI_DSM_TOPOLOGY (IRIX systems only) | Specifies the shape of the set of hardware nodes on which the PE memories are allocated. Set this variable to one of the supported topology values. | Not enabled.
MPI_DSM_VERBOSE | Instructs mpirun to print information about process placement for jobs running on NUMA systems. | Not enabled.
MPI_DSM_VERIFY | Instructs mpirun to run some diagnostic checks on proper memory placement of MPI data structures at job startup. If errors are found, a diagnostic message is printed to stderr. | Not enabled.
MPI_GM_DEVS | Sets the order for opening GM (Myrinet) adapters. The list of devices does not need to be space-delimited (0321 is valid). The syntax is the same as for the MPI_BYPASS_DEVS environment variable. In this release, a maximum of eight adapters is supported on a single host. | MPI uses all available GM (Myrinet) devices.
MPI_GM_VERBOSE | Allows some diagnostic information concerning messaging between processes using GM (Myrinet) to be displayed on stderr. | Not enabled.
MPI_GROUP_MAX | Sets the maximum number of groups that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 256
MPI_GSN_DEVS | Sets the order for opening GSN adapters. The list of devices does not need to be quoted or space-delimited (0123 is valid). | MPI uses all available GSN devices.
MPI_GSN_VERBOSE | Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the GSN (ST protocol) OS bypass connections and the GSN adapters that are detected on each of the hosts. | Not enabled.
MPI_MSG_RETRIES | Specifies the number of times the MPI library attempts to get a message header, if none are available. Each MPI message that is sent requires an initial message header. If one is not available after the specified number of attempts, the job aborts. | 500
| Note that this variable no longer applies to processes on the same host, or when using the GM (Myrinet) protocol. In these cases, message headers are allocated dynamically on an as-needed basis. |
MPI_MSGS_MAX | Controls the total number of message headers that can be allocated. This allocation applies to messages exchanged between processes on a single host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. Note that the initial allocation of memory for message headers is 128 Kbytes. | Allows up to 64 Mbytes to be allocated for message headers. If you set this variable, specify the maximum number of message headers.
MPI_MSGS_PER_HOST | Sets the number of message headers to allocate for MPI messages on each MPI host. Space for messages that are destined for a process on a different host is allocated as shared memory on the host on which the sending processes are located. MPI locks these pages in memory. Use this variable to allocate buffer space for interhost messages. | 1024
| The previous description does not apply to processes that use the GM (Myrinet) OS bypass protocol. In this case, message headers are allocated dynamically as needed. See the MPI_MSGS_MAX variable description. |
MPI_MSGS_PER_PROC | This variable is effectively obsolete. Message headers are now allocated on an as-needed basis for messaging either between processes on the same host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. You can use the new MPI_MSGS_MAX variable to control the total number of message headers that can be allocated. | 1024
MPI_OPENMP_INTEROP (IRIX systems only) | Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. | Not enabled
MPI_REQUEST_MAX | Sets the maximum number of simultaneous nonblocking sends and receives that can be active at one time. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 16384
MPI_SHARED_VERBOSE | Allows some diagnostic information concerning messaging within a host to be displayed on stderr. | Not enabled.
MPI_SLAVE_DEBUG_ATTACH | Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup, describing how to attach to it from another window using the dbx debugger on IRIX or the gdb debugger on Linux. You must attach the debugger to process N within ten seconds of the printing of the message. | Not enabled.
MPI_STATIC_NO_MAP (IRIX systems only) | Disables cross mapping of static memory between MPI processes. This variable can be set to reduce the significant MPI job startup and shutdown time that can be observed for jobs involving more than 512 processors on a single IRIX host. Note that setting this shell variable disables certain internal MPI optimizations and also restricts the use of MPI-2 one-sided functions. For more information, see the MPI_Win man page. | Not enabled.
MPI_STATS | Enables printing of MPI internal statistics. Each MPI process prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. To prefix the statistics messages with the MPI rank, use the -p option on the mpirun command. | Not enabled.
MPI_TYPE_DEPTH | Sets the maximum number of nesting levels for derived data types. (May be required by standard-compliant programs.) This variable limits the maximum depth of derived data types that an application can create. MPI logs error messages if the limit specified by MPI_TYPE_DEPTH is exceeded. | 8 levels
MPI_TYPE_MAX | Sets the maximum number of derived data types that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.) | 1024
MPI_UNBUFFERED_STDIO | Normally, mpirun line-buffers output received from the MPI processes on both the stdout and stderr standard IO streams. This prevents lines of text from different processes from being merged into one line, and allows use of the mpirun -prefix option. | Not enabled.
| There is, of course, a limit to the amount of buffer space that mpirun has available (currently, about 8,100 characters can appear between new line characters per stream per process). If more characters are emitted before a new line character, the MPI program aborts with an error message. |
| Setting the MPI_UNBUFFERED_STDIO environment variable disables this buffering. This is useful, for example, when a program's rank 0 emits a series of periods over time to indicate progress of the program. With buffering, the entire line of periods is output only when the new line character is seen. Without buffering, each period is displayed as soon as mpirun receives it from the MPI program. (Note that the MPI program still needs to call fflush(3) or FLUSH(101) to flush the stdout buffer from the application code.) |
| Additionally, setting MPI_UNBUFFERED_STDIO allows an MPI program that emits very long output lines to execute correctly. |
| Note that if MPI_UNBUFFERED_STDIO is set, the mpirun -prefix option is ignored. |
MPI_USE_GSN (IRIX 6.5.12 systems or later) | Requires the MPI library to use the GSN (ST protocol) OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a GSN connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
| GSN imposes a limit of one MPI process using GSN per CPU on a system. For example, on a 128-CPU system, you can run multiple MPI jobs, as long as the total number of MPI processes using the GSN bypass does not exceed 128. |
| Once the maximum allowed number of MPI processes using GSN is reached, subsequent MPI jobs return an error to the user. |
| If a few CPUs are still available, but not enough to satisfy the entire MPI job, the error is still issued and the MPI job is terminated. |
MPI_USE_GM (IRIX systems only) | Requires the MPI library to use the GM (Myrinet) OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a GM connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
MPI_USE_HIPPI (IRIX systems only) | Requires the MPI library to use the HiPPI 800 OS bypass driver as the interconnect when running across multiple hosts or running with multiple binaries. If a HiPPI connection cannot be established among all hosts in the MPI job, the job is terminated. | Not set
MPI_USE_TCP | Requires the MPI library to use the TCP/IP driver as the interconnect when running across multiple hosts or running with multiple binaries. | Not set
MPI_USE_XPMEM (IRIX 6.5.13 systems or later) | Requires the MPI library to use the XPMEM driver as the interconnect when running across multiple hosts or running with multiple binaries. This driver allows MPI processes running on one partition to communicate with MPI processes on a different partition via the NUMAlink network. The NUMAlink network is powered by block transfer engines (BTEs). BTE data transfers do not require processor resources. | Not set
| The XPMEM (cross partition) device driver is available only on Origin 3000 and Origin 300 systems running IRIX 6.5.13 or greater. |
| If the hosts specified on the mpirun command do not all reside in the same partitioned system, you can select one additional interconnect via the MPI_USE variables. MPI communication between partitions goes through the XPMEM driver, and communication between non-partitioned hosts goes through the second interconnect. |
MPI_XPMEM_ON | Enables the XPMEM single-copy enhancements for processes residing on the same host. | Not set
| The XPMEM enhancements allow single-copy transfers for basic predefined MPI data types from any sender data location, including the stack and private heap. Without enabling XPMEM, single-copy is allowed only from data residing in the symmetric data, symmetric heap, or global heap. |
| Both the MPI_XPMEM_ON and MPI_BUFFER_MAX variables must be set to enable these enhancements. Both are disabled by default. |
| If certain additional conditions are met, the block transfer engine (BTE) is invoked instead of bcopy, to provide increased bandwidth. |
MPI_XPMEM_THRESHOLD | Specifies a minimum message size, in bytes, for which single-copy messages between processes residing on the same host will be transferred via the BTE, instead of bcopy. Certain conditions must exist before the BTE transfer is invoked. | 8192
| The XPMEM enhancements allow single-copy transfers for basic MPI types from any sender data location, including the stack and private heap. Without enabling XPMEM, single-copy is allowed only from data residing in the symmetric data, symmetric heap, or global heap. |
| Both the MPI_XPMEM_THRESHOLD and MPI_BUFFER_MAX variables must be set to enable these enhancements. Both are disabled by default. |
| When the required conditions are met, the MPI_XPMEM_THRESHOLD environment variable specifies the minimum message size, in bytes, for which the message will be transferred via the BTE. If a value is not provided, a default of 8192 bytes is used. |
| The XPMEM single-copy enhancements require Origin 3000 and Origin 300 servers running IRIX release 6.5.15 or greater. |
MPI_XPMEM_VERBOSE | Setting this variable allows additional MPI diagnostic information to be printed in the standard output stream. This information contains details about the XPMEM connections. | Not enabled
PAGESIZE_DATA | Specifies the desired page size, in kilobytes, for program data areas. On Origin series systems, supported values include 16, 64, 256, 1024, and 4096. Specified values must be integers. | Not enabled
PAGESIZE_STACK | Specifies the desired page size, in kilobytes, for program stack areas. On Origin series systems, supported values include 16, 64, 256, 1024, and 4096. Specified values must be integers. | Not enabled
SMA_GLOBAL_ALLOC | Activates the LIBSMA-based global heap facility. This variable is used by 64-bit MPI applications for certain internal optimizations and as support for the MPI_Alloc_mem function. For additional details, see the intro_shmem(3) man page. | Not enabled
SMA_GLOBAL_HEAP_SIZE | For 64-bit applications, specifies the per-process size of the LIBSMA global heap, in bytes. | 33,554,432 bytes
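Several of the XPMEM entries above interact: the single-copy enhancements take effect only when MPI_BUFFER_MAX is also set. A sketch of enabling the XPMEM single-copy path on an IRIX host might look like the following; the numeric values are illustrative placeholders, not tuned recommendations:

```shell
# Illustrative only: thresholds are placeholders, and the mpirun line
# is commented out because it requires an MPI/XPMEM-capable system.
export MPI_XPMEM_ON=1             # enable XPMEM single-copy enhancements
export MPI_BUFFER_MAX=16384       # messages of 16 Kbytes or more become single-copy candidates
export MPI_XPMEM_THRESHOLD=65536  # use the BTE instead of bcopy above 64 Kbytes
# mpirun -np 16 ./a.out
echo "XPMEM=$MPI_XPMEM_ON BUFFER_MAX=$MPI_BUFFER_MAX THRESHOLD=$MPI_XPMEM_THRESHOLD"
```

Because MPI_BUFFER_MAX also gates the single-copy pathway described under MPI_BUFFER_MAX, choosing its value too small can bypass buffering for messages that would have benefited from it.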
An MPI implementation can copy data that is being sent to another process into an internal temporary buffer so that the MPI library can return from the MPI function, giving execution control back to the user. However, according to the MPI standard, you should not assume that there is any message buffering between processes because the MPI standard does not mandate a buffering strategy. Some implementations choose to buffer user data internally, while other implementations block in the MPI routine until the data can be sent. These different buffering strategies have performance and convenience implications.
Most MPI implementations do use buffering for performance reasons and some programs depend on it. Table 5-2 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial MPI_Send call, waiting for an MPI_Recv call to take the message. Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv calls get the messages. Nevertheless, program logic like this is not valid by the MPI standard.
The SGI implementation of MPI uses buffering under most circumstances. Short messages of 64 or fewer bytes are always buffered. On IRIX systems, longer messages are buffered unless the message to be sent resides in either a common block, the symmetric heap, or global shared heap and the sending and receiving processes reside on the same host. The MPI data type on the send side must also be a contiguous type. The message size must be greater than the size setting for MPI_BUFFER_MAX (see Table 5-1). If the XPMEM driver is enabled (for single host jobs, see MPI_XPMEM_ON and for multihost jobs, see MPI_USE_XPMEM), MPI allows single-copy transfers for basic MPI types from any sender data location, including the stack and private heap. The XPMEM driver also allows single-copy transfers across partitions. Under these circumstances, the receiver copies the data directly into its receive message area without buffering. Obviously, MPI applications with code segments equivalent to that shown in Table 5-2 will almost certainly deadlock if this bufferless pathway is available.
Note: This feature is not currently available on Linux systems.
Table 5-2. Outline of Improper Dependence on Buffering
Process 1 | Process 2 |
---|---|
MPI_Send(2,....) | MPI_Send(1,....) |
MPI_Recv(2,....) | MPI_Recv(1,....) |
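A buffering-independent version of the exchange in Table 5-2 has one side send first while the other receives first, or has both sides use the combined MPI_Sendrecv call, which the MPI standard guarantees will not deadlock. In the same shorthand as Table 5-2:

```
Process 1                  Process 2
MPI_Send(2,....)           MPI_Recv(1,....)
MPI_Recv(2,....)           MPI_Send(1,....)

or, on both processes:

MPI_Sendrecv(....)         MPI_Sendrecv(....)
```

Either ordering works correctly whether or not the implementation buffers messages, so the program remains valid if the bufferless single-copy pathway described above is in effect.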