Reverse Engineering OpenSHMEM
libfabric
- Abstracts network communication for the application. Below is an example of a high-level SHMEM program that relies on it.
    #include <shmem.h>
    #include <stdio.h>

    int main() {
        shmem_init(); // Uses libfabric internally

        int my_pe = shmem_my_pe();
        int num_pes = shmem_n_pes();

        static int shared_var = 0; // Symmetric memory

        shmem_barrier_all(); // Ensure all PEs reach this point

        // PE 0 writes into PE 1's copy, so the put needs at least 2 PEs
        if (my_pe == 0 && num_pes > 1) {
            shared_var = 42;
            shmem_put(&shared_var, &shared_var, 1, 1); // Uses libfabric to send data
        }

        shmem_barrier_all(); // Wait for transfer

        printf("PE %d sees shared_var = %d\n", my_pe, shared_var);

        shmem_finalize(); // Uses libfabric internally
        return 0;
    }

Tracing the Implementation of shmem_put
I used Sandia’s source code (Sandia OpenSHMEM, SOS).
Definition for shmem_put
    src\data_c.c4:69:`#pragma weak shmem_put$1 = pshmem_put$1
    src\data_c.c4:70:#define shmem_put$1 pshmem_put$1')dnl
    src\data_c.c4:91:`#pragma weak shmem_put$1_nbi = pshmem_put$1_nbi
    src\data_c.c4:92:#define shmem_put$1_nbi pshmem_put$1_nbi')dnl
    src\data_c.c4:238:`#pragma weak shmem_put$1_signal = pshmem_put$1_signal
    src\data_c.c4:239:#define shmem_put$1_signal pshmem_put$1_signal')dnl
    src\data_c.c4:261:`#pragma weak shmem_put$1_signal_nbi = pshmem_put$1_signal_nbi
    src\data_c.c4:262:#define shmem_put$1_signal_nbi pshmem_put$1_signal_nbi')dnl
    src\data_f.c4:57:`SHMEM_DEF_FC_PUT(FC_FUNC_(SH_DOWNCASE(shmem_put$1), SH_UPCASE(SHMEM_PUT$1)), $2)')dnl
    src\data_f.c4:83:`SHMEM_DEF_FC_PUT_NBI(FC_FUNC_(SH_DOWNCASE(shmem_put$1_nbi), SH_UPCASE(SHMEM_PUT$1_NBI)), $2)')dnl

So I went and looked at data_c.c4.
- shmem_put is mapped via #pragma weak to pshmem_put (in the .c4 template, $1 is an m4 placeholder for the type suffix):

      #pragma weak shmem_put = pshmem_put
      #define shmem_put pshmem_put

  This means shmem_put is a weak alias for pshmem_put when profiling is enabled, so a profiling tool can supply its own shmem_put and still reach the real implementation through pshmem_put.

- The actual implementation of shmem_put is in this macro:

      #define SHMEM_DEF_PUT(STYPE, TYPE)                                   \
          void SHMEM_FUNCTION_ATTRIBUTES                                   \
          SHMEM_FUNC_PROTOTYPE(STYPE##_put, TYPE *target,                  \
                               const TYPE *source, size_t nelems, int pe)  \
              long completion = 0;                                         \
              SHMEM_ERR_CHECK_INITIALIZED();                               \
              SHMEM_ERR_CHECK_PE(pe);                                      \
              SHMEM_ERR_CHECK_CTX(ctx);                                    \
              SHMEM_ERR_CHECK_SYMMETRIC(target, sizeof(TYPE) * nelems);    \
              SHMEM_ERR_CHECK_NULL(source, nelems);                        \
              SHMEM_ERR_CHECK_OVERLAP(target, source, sizeof(TYPE) *       \
                                      nelems, sizeof(TYPE) * nelems, 0,    \
                                      (shmem_internal_my_pe == pe));       \
              shmem_internal_put_nb(ctx, target, source,                   \
                                    sizeof(TYPE) * nelems, pe,             \
                                    &completion);                          \
              shmem_internal_put_wait(ctx, &completion);                   \
          }

  This tells us:
  - Validation checks are performed (initialization, PE, context, symmetry, NULL source, and overlap checks).
  - The actual non-blocking put operation is issued by shmem_internal_put_nb.
  - The function then blocks until completion in shmem_internal_put_wait.
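The pattern in SHMEM_DEF_PUT (one macro that stamps out a typed function which validates its arguments, issues a non-blocking put, and then waits) can be sketched in plain C. This is only a toy model, not the SOS code: every toy_ name is hypothetical, the "remote" memory is a local buffer, and the wait step is a placeholder for real completion polling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for shmem_internal_put_nb: the "remote" side is
 * just a local buffer, and one outstanding operation is recorded. */
static void toy_put_nb(void *target, const void *source, size_t len,
                       long *completion)
{
    memcpy(target, source, len);
    (*completion)++;
}

/* Hypothetical stand-in for shmem_internal_put_wait: a real transport
 * would poll the network here until the outstanding operations drain. */
static void toy_put_wait(long *completion)
{
    *completion = 0;
}

/* One macro generates a typed, blocking put per element type, in the
 * same spirit as SHMEM_DEF_PUT(STYPE, TYPE). */
#define TOY_DEF_PUT(STYPE, TYPE)                                          \
    static void toy_##STYPE##_put(TYPE *target, const TYPE *source,      \
                                  size_t nelems)                          \
    {                                                                     \
        long completion = 0;                                              \
        assert(target != NULL && source != NULL); /* validation step */   \
        toy_put_nb(target, source, sizeof(TYPE) * nelems, &completion);   \
        toy_put_wait(&completion);                                        \
    }

TOY_DEF_PUT(int, int)
TOY_DEF_PUT(double, double)
```

Calling toy_int_put(dst, src, 3) copies three ints and returns only after the (simulated) wait, which is the blocking behavior the macro builds on top of the non-blocking primitive.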
If we look at shmem_comm.h we find the implementation for shmem_internal_put_nb.
    static inline
    void
    shmem_internal_put_nb(shmem_ctx_t ctx, void *target, const void *source, size_t len,
                          int pe, long *completion)
    {
        if (len == 0)
            return;

        if (shmem_shr_transport_use_write(ctx, target, source, len, pe)) {
            shmem_shr_transport_put(ctx, target, source, len, pe);
        } else {
            shmem_transport_put_nb((shmem_transport_ctx_t *)ctx, target, source, len, pe, completion);
        }
    }

Breakdown of shmem_internal_put_nb
- Check for Zero-Length Transfers
  - If len == 0, it returns immediately.
- Choose the Transport Mechanism
  - If shmem_shr_transport_use_write(...) is true, it calls:

        shmem_shr_transport_put(ctx, target, source, len, pe);

    This is an alternative transport layer (the shared-memory transport).
  - Otherwise, it calls:

        shmem_transport_put_nb((shmem_transport_ctx_t *)ctx, target, source, len, pe, completion);

    This is the main transport function, which is where the actual network communication happens.
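The control flow described above can be modeled without any SHMEM or libfabric dependency. The sketch below is an illustration only: the toy_ names are hypothetical, and the "same node" test is an assumed stand-in for whatever shmem_shr_transport_use_write actually checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which path a put took; hypothetical names, for illustration only. */
typedef enum { TOY_PATH_NONE, TOY_PATH_SHARED, TOY_PATH_NETWORK } toy_path_t;

/* Assumed stand-in for shmem_shr_transport_use_write(): here the only
 * criterion is that source and target PEs live on the same node. */
static bool toy_use_shared(int my_node, int target_node)
{
    return my_node == target_node;
}

/* Mirrors the branching of shmem_internal_put_nb without moving data. */
static toy_path_t toy_internal_put_nb(size_t len, int my_node, int target_node)
{
    assert(my_node >= 0 && target_node >= 0);

    if (len == 0)
        return TOY_PATH_NONE;      /* zero-length transfer: nothing to do */

    if (toy_use_shared(my_node, target_node))
        return TOY_PATH_SHARED;    /* on-node copy through shared memory */

    return TOY_PATH_NETWORK;       /* fall through to the network transport */
}
```

The order of the checks matters: the zero-length test comes first, so an empty put never touches either transport.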
shmem_transport_put_nb is defined in several places, and which definition is used depends on the network transport in use, e.g., Unified Communication X (UCX), Portals 4, or OpenFabrics Interfaces (OFI).
PS C:\Users\grant\Downloads\SOS-main> Get-ChildItem -Path . -Recurse -Include *.c,*.h | Select-String -Pattern 'shmem_transport_put_nb'
    ...SNIP...
    src\transport_none.h:122:shmem_transport_put_nb(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_none.h:145:shmem_transport_put_nbi(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_ofi.h:680:void shmem_transport_put_nb(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_ofi.h:884:void shmem_transport_put_nbi(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_portals4.h:574:shmem_transport_put_nbi(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len, int pe)
    src\transport_portals4.h:590:shmem_transport_put_nb(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_portals4.h:606: shmem_transport_put_nbi(ctx, target, source, len, pe);
    src\transport_portals4.h:1110: shmem_transport_put_nbi(ctx, target, source, len, pe);
    src\transport_ucx.h:254:shmem_transport_put_nb(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_ucx.h:286:shmem_transport_put_nbi(shmem_transport_ctx_t* ctx, void *target, const void *source, size_t len,
    src\transport_ucx.h:725: shmem_transport_put_nbi(ctx, target, source, len, pe);

We’re interested in the OpenFabrics Interfaces (OFI) implementation, so I went and looked at transport_ofi.h:
    static inline
    void shmem_transport_put_nb(shmem_transport_ctx_t* ctx, void *target, const void *source,
                                size_t len, int pe, long *completion)
    {
        int ret = 0;
        uint64_t dst = (uint64_t) pe;
        uint64_t polled = 0;
        uint64_t key;
        uint8_t *addr;

        shmem_internal_assert(completion != NULL);

        if (len <= shmem_transport_ofi_max_buffered_send) {

            shmem_transport_put_scalar(ctx, target, source, len, pe);

        } else if (len <= shmem_transport_ofi_bounce_buffer_size && ctx->bounce_buffers) {

            SHMEM_TRANSPORT_OFI_CTX_LOCK(ctx);
            SHMEM_TRANSPORT_OFI_CNTR_INC(&ctx->pending_put_cntr);
            shmem_transport_ofi_get_mr(target, pe, &addr, &key);

            shmem_transport_ofi_bounce_buffer_t *buff =
                create_bounce_buffer(ctx, source, len);
            polled = 0;

            const struct iovec msg_iov = { .iov_base = buff->data, .iov_len = len };
            const struct fi_rma_iov rma_iov = { .addr = (uint64_t) addr, .len = len, .key = key };
            const struct fi_msg_rma msg = {
                .msg_iov = &msg_iov,
                .desc = GET_MR_DESC_ADDR(shmem_transport_ofi_get_mr_desc_index(source)),
                .iov_count = 1,
                .addr = GET_DEST(dst),
                .rma_iov = &rma_iov,
                .rma_iov_count = 1,
                .context = buff,
                .data = 0
            };
            do {
                ret = fi_writemsg(ctx->ep, &msg, FI_COMPLETION | FI_DELIVERY_COMPLETE);
            } while (try_again(ctx, ret, &polled));
            SHMEM_TRANSPORT_OFI_CTX_UNLOCK(ctx);

        } else {
            shmem_transport_ofi_put_large(ctx, target, source, len, pe);
            (*completion)++;
        }
    }

Function Parameters
- shmem_transport_ctx_t* ctx → The SHMEM transport context.
- void *target → Remote memory location where data is to be written.
- const void *source → Local memory buffer containing the data to be written.
- size_t len → Length of the data transfer in bytes.
- int pe → The ID of the target processing element (PE).
- long *completion → A pointer to a counter tracking completion status.
Implementation Analysis
- Handles Non-Blocking Puts (NB)
  - This function issues a non-blocking put operation.
  - Completion is tracked through the completion counter; in the code above it is incremented only on the large-transfer path.
- Uses Multiple Strategies for Data Transfer
  - Small transfers (len <= shmem_transport_ofi_max_buffered_send)
    - Calls shmem_transport_put_scalar(), which likely performs an fi_inject_write().
  - Medium-sized transfers (len <= shmem_transport_ofi_bounce_buffer_size, and only when ctx->bounce_buffers is set)
    - Copies the source into a bounce buffer and sends it with fi_writemsg().
  - Large transfers
    - Calls shmem_transport_ofi_put_large(), which likely splits data into chunks.
- Uses the libfabric API
  - The function calls into libfabric directly (fi_writemsg()).
  - It uses Remote Memory Access (RMA) with the FI_COMPLETION and FI_DELIVERY_COMPLETE flags.
- Ensures Mutual Exclusion
  - SHMEM_TRANSPORT_OFI_CTX_LOCK(ctx) and SHMEM_TRANSPORT_OFI_CTX_UNLOCK(ctx) protect shared state.
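The three-way size dispatch can be isolated into a small decision function. The sketch below is a simplified model under stated assumptions: the toy_ names and the limits struct are hypothetical, and the threshold values used when calling it are made up, since the real ones come from the provider and the SOS configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three strategies seen in shmem_transport_put_nb (names hypothetical). */
typedef enum { TOY_INJECT, TOY_BOUNCE, TOY_LARGE } toy_strategy_t;

/* Hypothetical container for the limits the real code reads from globals
 * and from the context. */
typedef struct {
    size_t max_buffered_send;    /* upper bound for the inject path */
    size_t bounce_buffer_size;   /* upper bound for the bounce-buffer path */
    bool   have_bounce_buffers;  /* models ctx->bounce_buffers != NULL */
} toy_limits_t;

/* Same branch order as the original: inject, then bounce, then large. */
static toy_strategy_t toy_choose_strategy(const toy_limits_t *lim, size_t len)
{
    assert(lim != NULL);

    if (len <= lim->max_buffered_send)
        return TOY_INJECT;
    if (len <= lim->bounce_buffer_size && lim->have_bounce_buffers)
        return TOY_BOUNCE;
    return TOY_LARGE;
}
```

Note that a medium-sized message falls through to the large path when the context has no bounce buffers, which is easy to miss when reading the original branch condition.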
Related Functions
- shmem_transport_put_scalar(): Used for small transfers (len <= shmem_transport_ofi_max_buffered_send).
- shmem_transport_ofi_put_large(): Used for large transfers.
- shmem_transport_put_wait(): Ensures completion.
Where is it Used?
- shmem_internal_put_nb() in shmem_comm.h calls shmem_transport_put_nb() when shared memory writes are not possible:

      shmem_transport_put_nb((shmem_transport_ctx_t *)ctx, target, source, len, pe, completion);

- Other files (transport_none.h, transport_portals4.h, etc.) contain different transport implementations.
If we look at shmem_transport_put_scalar, we see:

    static inline
    void shmem_transport_put_scalar(shmem_transport_ctx_t* ctx, void *target, const void *source,
                                    size_t len, int pe)
    {
        int ret = 0;
        uint64_t dst = (uint64_t) pe;
        uint64_t polled = 0;
        uint64_t key;
        uint8_t *addr;

        shmem_transport_ofi_get_mr(target, pe, &addr, &key);

        shmem_internal_assert(len <= shmem_transport_ofi_max_buffered_send);

        SHMEM_TRANSPORT_OFI_CTX_LOCK(ctx);
        SHMEM_TRANSPORT_OFI_CNTR_INC(&ctx->pending_put_cntr);

        do {

            ret = fi_inject_write(ctx->ep, source, len, GET_DEST(dst), (uint64_t) addr, key);

        } while (try_again(ctx, ret, &polled));
        SHMEM_TRANSPORT_OFI_CTX_UNLOCK(ctx);
    }

Key Points
- Handles Small Transfers
  - This function is called when len <= shmem_transport_ofi_max_buffered_send.
  - Small messages are handled efficiently, without a bounce buffer on the SHMEM side.
- Uses fi_inject_write()
  - fi_inject_write() is a one-sided RMA write in libfabric that:
    - Lets the caller reuse the source buffer as soon as the call returns, and generates no completion queue entry on success.
    - Is optimized for small messages.
- Memory Registration
  - Calls shmem_transport_ofi_get_mr(target, pe, &addr, &key);
  - This function retrieves:
    - addr: the remote memory address.
    - key: the remote memory region key.
- Ensures Mutual Exclusion
  - Uses:

        SHMEM_TRANSPORT_OFI_CTX_LOCK(ctx);
        SHMEM_TRANSPORT_OFI_CTX_UNLOCK(ctx);

  - This ensures mutual exclusion when accessing shared structures.
- Retries on Resource Exhaustion
  - Uses:

        do {
            ret = fi_inject_write(...);
        } while (try_again(ctx, ret, &polled));

  - If fi_inject_write() fails due to a temporary resource issue (which libfabric reports as -FI_EAGAIN, e.g., when transmit resources are full), it retries.
- Incrementing the Counter
  - Uses:

        SHMEM_TRANSPORT_OFI_CNTR_INC(&ctx->pending_put_cntr);

  - This records the operation as pending in the transport layer, so its completion can be checked later.
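The count-then-retry shape of shmem_transport_put_scalar can be reproduced with a fake endpoint. This is a sketch under assumptions, not the SOS logic: the toy_ names are hypothetical, POSIX EAGAIN stands in for libfabric's FI_EAGAIN, and toy_try_again only counts retries, whereas the real try_again presumably also makes progress on the network before retrying.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy endpoint: reports "busy" a fixed number of times before accepting. */
typedef struct {
    int      busy_remaining;   /* calls that will still fail with -EAGAIN */
    int      writes_accepted;  /* successful injects so far */
    uint64_t pending_puts;     /* models ctx->pending_put_cntr */
} toy_ep_t;

/* Hypothetical stand-in for fi_inject_write(): 0 on success,
 * -EAGAIN while the endpoint is temporarily out of resources. */
static int toy_inject_write(toy_ep_t *ep)
{
    if (ep->busy_remaining > 0) {
        ep->busy_remaining--;
        return -EAGAIN;
    }
    ep->writes_accepted++;
    return 0;
}

/* Hypothetical stand-in for try_again(): ask for a retry only on
 * -EAGAIN and count how many retries were needed. */
static int toy_try_again(int ret, uint64_t *polled)
{
    if (ret == 0)
        return 0;
    assert(ret == -EAGAIN);  /* any other error is fatal in this sketch */
    (*polled)++;
    return 1;
}

/* Same shape as the original: count the operation, then retry until
 * the write is accepted. Returns the number of retries. */
static uint64_t toy_put_scalar(toy_ep_t *ep)
{
    int ret;
    uint64_t polled = 0;

    ep->pending_puts++;
    do {
        ret = toy_inject_write(ep);
    } while (toy_try_again(ret, &polled));

    return polled;
}
```

With an endpoint that is busy three times, toy_put_scalar retries three times, the write is accepted exactly once, and the pending counter is incremented once regardless of how many retries were needed.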
Where is it Used?
- shmem_transport_put_nb() (Non-Blocking Put)

      if (len <= shmem_transport_ofi_max_buffered_send) {
          shmem_transport_put_scalar(ctx, target, source, len, pe);
      }

- shmem_transport_put_nbi() (Non-Blocking Implicit Put)

      if (len <= shmem_transport_ofi_max_buffered_send) {
          shmem_transport_put_scalar(ctx, target, source, len, pe);
      }

- It is also indirectly used by shmem_internal_put_nb() and similar SHMEM API functions.