<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yao-tsvwg-cco-requirement-and-analysis-01"
     ipr="trust200902">
  <front>
    <title
    abbrev="Collective Communication Optimizations: Requirement and Analysis">Collective
    Communication Optimizations: Requirement and Analysis</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Weifeng Wang" initials="W." surname="Wang">
      <organization>New H3C Technologies Co., Ltd</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>wangweifeng@h3c.com</email>
      </address>
    </author>

    <author fullname="Dirk Kutscher" initials="D." surname="Kutscher">
      <organization>The Hong Kong University of Science and Technology
      (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="5" month="February" year="2024"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>collective communication</keyword>

    <keyword>RDMA</keyword>

    <abstract>
      <t>Generative AI applications depend on large-scale parallel computing
      clusters for model training and inference. Existing implementations of
      collective communication in parallel computing are built on top of
      RDMA, the most widely adopted transport technology for AI workloads.
      However, One-to-Many, Many-to-One, and Many-to-Many collective
      operations all depend on the point-to-point transport semantics of
      RDMA, which inevitably introduces extra bandwidth consumption and
      transmission overhead. Emerging approaches for collective communication
      optimization focus on network-assisted collective acceleration and can
      work compatibly with RDMA. This document analyzes different technical
      schemes for network-assisted collective acceleration based on RDMA and
      presents the gaps between this work and current IETF standards, notably
      iWARP. Requirements for designing new standards are proposed
      accordingly.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>With the development of distributed applications, especially High
      Performance Computing (HPC) and Artificial Intelligence (AI), the scale
      of parallel computing clusters is constantly expanding, and the
      pressure that collective communication places on the network is also
      increasing. Existing implementations of collective communication are
      based on RDMA (Remote Direct Memory Access). The most obvious problem
      is that the point-to-point transmission semantics of RDMA are not well
      aligned with the logical communication patterns defined in collective
      communication, which incurs extra bandwidth consumption, more memory
      copies at endpoints, and more data movement, thus lowering the overall
      parallel computing efficiency. Detailed use cases and problems are
      presented in
      <xref target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>.</t>

      <t>Emerging technical schemes for collective communication
      optimization focus on network-assisted collective acceleration, which
      can greatly alleviate network pressure, improve transmission
      efficiency, and shorten flow completion time (FCT). Some of these
      approaches can also work compatibly with RDMA, which opens new
      standardization design space for extended RDMA-based protocols for
      collective communication optimization. In the following sections, this
      document analyzes different technical schemes for network-assisted
      collective acceleration based on RDMA and presents the gaps between
      this work and current IETF standards. Requirements for designing new
      standards are proposed accordingly.</t>
    </section>

    <section anchor="definition-of-terms" title="Definition of Terms">
      <t>*Collective communication: A set of communication patterns that
      application processes follow to communicate with each other in a
      parallel computing cluster. These patterns include One-to-Many,
      Many-to-One, and Many-to-Many delivery modes.</t>

      <t>*Network-Assisted Collective Acceleration (NACA): Using network
      devices, such as switches, to offload and perform collective
      operations, so as to improve overall collective communication
      efficiency.</t>
    </section>

    <section title="Existing Work and Analysis">
      <t>NACA offloads collective operations to switches. For example,
      Allreduce is performed in the switch through aggregation, Broadcast is
      offloaded to the switch for data replication, and Scatter leverages
      the switch for data partitioning. Detailed collective operations are
      listed in <xref
      target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>. NACA can
      be built on RDMA so as to optimize collective communication. RDMA
      allows endpoints to directly read and write memory on other endpoints
      at high speed without requiring kernel processing or CPU resources.
      Zero-copy memory access and kernel bypass have gradually made RDMA the
      mainstream communication technology for HPC and AI applications in
      data centers. This draft mainly focuses on the analysis of two
      different communication modes for RDMA-based network-assisted
      collective acceleration.</t>
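
      <t>As an illustration of the offloaded aggregation step, the following
      Python sketch shows how one Allreduce round collapses the packets from
      N senders into a single downstream packet. The packet structures are
      invented for illustration; a real switch performs this in the data
      plane on RDMA packets.</t>

      <figure>
        <artwork>
```python
# Minimal sketch of in-network Allreduce aggregation (illustrative
# only; the packet structures are invented for this example).

def aggregate_round(packets):
    """Element-wise sum of the payloads from all senders in one round."""
    length = len(packets[0]["payload"])
    result = [0] * length
    for pkt in packets:
        for i, value in enumerate(pkt["payload"]):
            result[i] += value
    return {"round": packets[0]["round"], "payload": result}

round_pkts = [
    {"round": 0, "payload": [1, 2, 3]},   # from worker 1
    {"round": 0, "payload": [4, 5, 6]},   # from worker 2
    {"round": 0, "payload": [7, 8, 9]},   # from worker 3
]
agg = aggregate_round(round_pkts)
# The switch forwards only this one aggregated packet downstream;
# the three input packets are consumed.
```
</artwork>
      </figure>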

      <section title="Classification and Analysis of Two Modes for NACA">
        <t>When using network devices to offload collective operations, RDMA
        communication modes can be divided into a server-to-server mode and
        a server-to-switch mode.</t>

        <section title="NACA Based on Server-to-server RDMA Connection">
          <t>The server-to-server RDMA connection mode does not change the
          logic of existing applications, and the switches that participate
          in collective communication are not visible to applications. The
          destination of the RDMA connection is another server endpoint, but
          switches can participate in the collective operations during data
          transmission. In this communication mode, native transport
          protocols can produce false positives. For example, when the
          switch helps perform data aggregation during Allreduce, in each
          round only one aggregated packet is sent from the switch to the
          destination server; the packets from the multiple senders are
          consumed by the aggregation. At the destination side, this will be
          judged as packet loss and trigger packet retransmission <xref
          target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>. The
          reliability mechanisms of the native RDMA transport therefore need
          to be modified.</t>

          <figure align="center"
                  title="NACA Based on Server-to-server RDMA Connection">
            <artwork>+------------+           +----------------------+            +------------+
|            &lt;-----------------------------------------------&gt; receiver 1 |
|            |           |                      |            +------------+
|            |           |        switch        |
|   sender   |           | forwarding and NACA  |
|            |           |                      |            +------------+
|            &lt;-----------------------------------------------&gt; receiver 2 |
|            |           |   RDMA connection    |            +------------+
|            |           |                      |            
|            |           |                      |
|            |           |                      |            +------------+
|            &lt;-----------------------------------------------&gt; receiver 3 |
+------------+           +----------------------+            +------------+</artwork>
          </figure>
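
          <t>The false positive described above can be reproduced with a toy
          loss detector. This is a simplification of a go-back-N receiver;
          the identifiers and packet shapes are invented for
          illustration.</t>

          <figure>
            <artwork>
```python
# Toy loss detection at the destination: the receiver expects one
# packet per sender per round, but in-network aggregation delivers
# only a single aggregated packet, so the remaining packets are
# (wrongly) judged as lost.

def detect_loss(expected, arrived):
    """Return the (sender, psn) pairs the receiver believes were lost."""
    return [pkt for pkt in expected if pkt not in arrived]

senders = ["worker1", "worker2", "worker3"]
expected = [(s, 0) for s in senders]   # round 0: one packet per sender
arrived = [("worker1", 0)]             # the switch forwarded one aggregate
spurious = detect_loss(expected, arrived)
# Spurious retransmissions would be triggered for worker2 and worker3.
```
</artwork>
          </figure>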
        </section>

        <section title="NACA Based on Server-to-switch RDMA Connection">
          <t>The server-to-switch mode means that switches act as RDMA
          endpoints, and an RDMA connection is built between the sender,
          i.e., the server, and the destination, i.e., the switch. In this
          case, hop-by-hop RDMA connections are built for end-to-end data
          transmission. It is necessary to define which RDMA functions
          should be offloaded to switches, since offloading all RDMA
          functions to the network would place a heavy burden on switches,
          not only in memory and buffer space but also in protocol
          complexity.</t>

          <figure align="center"
                  title="NACA Based on Server-to-switch RDMA Connection">
            <artwork>                                               RDMA connection
                      +------------------------+             +------------+
                      |                        &lt;-------------&gt; receiver 1 |
                      |                        |             +------------+
               RDMA   |                        |
            Connection|                        |
+----------+          |         switch         |             +------------+
|  sender  &lt;----------&gt;  forwarding and NACA   &lt;-------------&gt; receiver 2 |
+----------+          |                        |             +------------+
                      |                        |
                      |                        |
                      |                        |             +------------+
                      |                        &lt;-------------&gt; receiver 3 |
                      +------------------------+             +------------+</artwork>
          </figure>
        </section>
      </section>

      <section title="Gap Analysis of Existing Solutions">
        <section title="Infiniband SHARP">
          <t>The Scalable Hierarchical Aggregation and Reduction Protocol
          (SHARP) <xref target="SHARP"/> breaks the end-to-end transport
          rule by implementing a Target Channel Adapter (TCA) in switches.
          The TCA supports both Reliable Connection (RC) transport, to
          enable reliable delivery of data through the aggregation tree, and
          Unreliable Datagram (UD) transport, to enable multicast
          distribution of the aggregation result. SHARP has been realized in
          commodity InfiniBand switches, but it is tied to the InfiniBand
          architecture. Currently it cannot interoperate with other network
          architectures, which limits its applicability.</t>

          <t>Figure 3 shows the SHARP protocol. It has two primary phases,
          Aggregation Request and Aggregation Response. The SHARP header is
          designed over the IBA header, followed by the aggregation
          operations and other data description information.</t>

          <figure align="center" title="SHARP Protocol">
            <artwork>+---------+---------+-------+------+-----------+--------+--------+----+
|   IBA   |  SHARP  | Tuple | User | Operation | Target | SHARP  | CRC|
| Header  |  Header | Header| Data |  Header   | Header | Payload|    |
+---------+---------+-------+------+-----------+--------+--------+----+

           Aggregation Request Pkt

+---------+---------+-------+------+--------+----+
|   IBA   |  SHARP  | Tuple | User | SHARP  | CRC|
| Header  |  Header | Header| Data | Payload|    |
+---------+---------+-------+------+--------+----+

           Aggregation Response Pkt</artwork>
          </figure>
        </section>

        <section title="RoCEv2 Solutions">
          <t>RDMA over Converged Ethernet version 2 (RoCEv2) is an RDMA
          scheme based on UDP over Ethernet. Its core design reuses
          InfiniBand's transport layer, where data is transmitted in
          sequence and retransmitted using go-back-N. Therefore, a lossless
          and ordered network is required to achieve ideal performance; such
          networks introduce Priority Flow Control (PFC) and IP-based
          Explicit Congestion Notification (ECN) to ensure lossless
          transmission. Technical schemes for NACA based on RoCEv2 have been
          analyzed in both academia and industry, but compared with
          InfiniBand SHARP, there are currently no commercial solutions,
          which means there is considerable standardization space in this
          area.</t>

          <t>Consider <xref target="NetReduce"/> and <xref
          target="Cepheus"/> as two examples of the server-to-server
          communication mode of RoCEv2-based NACA.</t>

          <t><xref target="NetReduce"/> is designed to offload Allreduce to
          switches. For ring-Allreduce, workers establish RDMA connections
          with their front and rear neighbors, using RDMA Write to send
          parameters and RDMA Read to receive aggregation results. The
          switch acts as a man-in-the-middle that receives data, aggregates
          it locally, and then returns the results to workers via RDMA Read.
          This approach has little impact on applications, and it improves
          performance since it reduces the number of aggregation rounds
          compared to the traditional ring-Allreduce method. However,
          mechanisms such as transport reliability and flow control are
          designed based on a server-to-server communication model, so they
          need to be redesigned or adapted accordingly.</t>

          <t>Figure 5 shows the three packet formats of the NetReduce
          protocol: the first packet, middle packets, and the last packet.
          The NetReduce header is built over RoCEv2. NetReduce provides a
          similar function to SHARP, but it is designed for the aggregation
          used in ring-Allreduce, so it contains ring information, message
          information, and rank information.</t>

          <figure align="center" title="NetReduce Illustration">
            <artwork>        +-----------------------------------------+
        |                 Switch                  |
        |            man-in-the-middle            |
        |                                         |
        |    +------------+     +------------+    |
        |    |            |     |            |    |
        +----+------------+-----+------------+----+
             |            |     |            |
             |            |     |            |
       +-----+-----+   +--+-----+--+   +-----+-----+
       |  worker 1 |   |  worker 2 |   |  worker 3 |
       +-----------+   +-----------+   +-----------+</artwork>
          </figure>

          <figure align="center" title="NetReduce Protocol">
            <artwork>                        First Pkt
+---------+---------+---------+---------------+----------+------+
|   UDP   | IB BTH  | IB RETH | NetReduce Hdr |  Payload | ICRC |
+---------+---------+---------+---------------+----------+------+

                       Middle Pkt
+---------+---------+----------+------+
|   UDP   | IB BTH  |  Payload | ICRC |
+---------+---------+----------+------+

                       Last Pkt
+---------+---------+---------+---------+------+
|   UDP   | IB BTH  | IB IMM  | Payload | ICRC |
+---------+---------+---------+---------+------+</artwork>
          </figure>
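
          <t>The NetReduce header information named above (ring, message,
          and rank information) can be sketched as a fixed-layout record.
          The field names, widths, and order below are assumptions for
          illustration only, not the published wire format.</t>

          <figure>
            <artwork>
```python
import struct

# Hypothetical encoding of a NetReduce-style header. The fields follow
# the draft's description (ring, message, and rank information), but
# the widths and order are assumptions, not the published format.
NETREDUCE_HDR = struct.Struct("!IIHH")   # ring_id, msg_id, rank, n_workers

def pack_hdr(ring_id, msg_id, rank, n_workers):
    return NETREDUCE_HDR.pack(ring_id, msg_id, rank, n_workers)

def unpack_hdr(raw):
    ring_id, msg_id, rank, n_workers = NETREDUCE_HDR.unpack(raw)
    return {"ring_id": ring_id, "msg_id": msg_id,
            "rank": rank, "n_workers": n_workers}

hdr = pack_hdr(ring_id=7, msg_id=42, rank=1, n_workers=3)
fields = unpack_hdr(hdr)   # round-trips the four fields
```
</artwork>
          </figure>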

          <t>The design objective of <xref target="Cepheus"/> is to offload
          Broadcast operations. Multiple receivers first send RDMA-related
          information, such as the Queue Pair (QP) number and destination
          address, to the sender host for registration, and a Multicast
          Forwarding Tree (MFT) is built on this information. Intermediate
          switches make decisions based on their downstream connectors. If a
          leaf switch is directly connected to the receiver host, it works
          as an RDMA bridge by modifying data packets. In this way,
          multicast is performed in the forward direction, and
          acknowledgement signals are aggregated in the reverse direction to
          realize reliability. This kind of implementation incurs fewer
          modifications to native RDMA and has better compatibility.</t>

          <t>Figure 6 shows the Cepheus Multicast Registration Protocol
          (MRP). Before the Broadcast operations start, the source of the
          multicast propagates the MRP into the entire fabric to install a
          multicast forwarding table in each switch and build an MFT. The
          receiver's RDMA information, i.e., the QP number and destination
          IP address, is predefined before the MFT is set up. During
          multicast there is no real RDMA communication; switches that are
          directly connected to receivers modify the packet header to make
          the logical RDMA connection complete.</t>

          <figure align="center"
                  title="Cepheus Multicast Registration Protocol">
            <artwork>             Cepheus MRP
+---------+----------+--------------+
|   UDP   | Metadata | Node Payload |
+---------+----------+--------------+

              Metadata
+---------------+---------------+
|  Total | Seq  |  Node Numbers |
+---------------+---------------+

            Node Payload
+----------+---------+---------+
| Node QPN | Node IP | Reserve |
+----------+---------+---------+</artwork>
          </figure>
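
          <t>The Node Payload of Figure 6 can likewise be sketched as a
          fixed-layout record carrying a receiver's QP number and IP
          address. The field widths below are assumptions for illustration,
          not the published Cepheus encoding.</t>

          <figure>
            <artwork>
```python
import socket
import struct

# Hypothetical packing of a Cepheus MRP node payload entry (Figure 6):
# a receiver's QP number, its IPv4 address, and a reserved field. The
# widths are assumptions for illustration.
NODE = struct.Struct("!I4sI")   # qpn, ipv4 address, reserved

def pack_node(qpn, ip):
    return NODE.pack(qpn, socket.inet_aton(ip), 0)

def unpack_node(raw):
    qpn, ip_raw, _reserved = NODE.unpack(raw)
    return {"qpn": qpn, "ip": socket.inet_ntoa(ip_raw)}

entry = pack_node(qpn=17, ip="10.0.0.5")
node = unpack_node(entry)   # recovers the QPN and IP for MFT setup
```
</artwork>
          </figure>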

          <t>In a RoCEv2 network there is also a server-to-switch mode, in
          which switches implement the RDMA protocol stack and workers
          establish RDMA connections with switches. This approach is similar
          to InfiniBand SHARP, but based on Ethernet. Due to capacity
          limitations, network devices do not need to support the complete
          RDMA transport protocol. As with SHARP, the shortcoming of this
          mode is that it requires network devices to support RDMA.</t>
        </section>

        <section title="iWARP">
          <t>iWARP <xref target="RFC5040"/> is another RDMA scheme for
          Ethernet, based on the TCP protocol. Like RoCEv2, iWARP uses
          InfiniBand Verbs to interact with applications. RDMAP (Remote
          Direct Memory Access Protocol) provides RDMA semantic support for
          upper-layer requests such as RDMA Send, RDMA Read, and RDMA Write.
          DDP (Direct Data Placement Protocol) implements the zero-copy
          function: a DDP packet contains information describing the target
          memory area, so hardware can move the data in a DDP packet
          directly to its destination in memory through DMA, based on the
          control information in the packet, without involving the CPU. MPA
          (Marker PDU Aligned Framing) is responsible for adding control
          information to the TCP stream at the sending end according to a
          defined algorithm, so that the receiving end can recognize the
          boundaries of DDP packets in the stream.</t>

          <figure align="center" title="iWARP Protocol">
            <artwork>+--------------+--------------+--------------+-----------------+------------+----------+
|  TCP Header  |  MPA Header  |  DDP Header  |   RDMA Header   |  Payload   |  MPA CRC |
+--------------+--------------+--------------+-----------------+------------+----------+</artwork>
          </figure>
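
          <t>The MPA framing described above can be illustrated with a toy
          marker-insertion routine. MPA (RFC 5044) places a marker every 512
          octets of the TCP stream so that the receiver can relocate DDP
          packet boundaries; this sketch uses a placeholder marker value
          rather than the real back-pointer format.</t>

          <figure>
            <artwork>
```python
# Toy MPA-style marker insertion: a marker is placed after every 512
# octets of payload so a receiver can resynchronize on DDP boundaries.
# Real MPA markers carry a pointer back to the preceding FPDU header;
# a zero placeholder is used here.
MARKER_INTERVAL = 512
MARKER = b"\x00\x00\x00\x00"   # placeholder marker value

def insert_markers(stream):
    out = bytearray()
    for offset in range(0, len(stream), MARKER_INTERVAL):
        out += stream[offset:offset + MARKER_INTERVAL]
        out += MARKER
    return bytes(out)

framed = insert_markers(b"A" * 1024)
# 1024 payload octets gain two 4-octet markers.
```
</artwork>
          </figure>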

          <t>Because TCP ensures ordered packet delivery and transmission
          reliability, iWARP can adapt to larger network scales than RoCEv2,
          but its performance is lower. Because of the high cost of
          offloading the complete TCP/IP stack to hardware and the
          resource-intensive maintenance of TCP protocol state, the use of
          iWARP is not as widespread as RoCEv2.</t>

          <t>In server-to-server NACA based on iWARP, any in-network change
          to the payload may be considered an interruption of the flow, and
          any packet loss must be retransmitted. The transport-layer
          mechanisms are too complex and difficult to modify.</t>

          <t>In server-to-switch NACA based on iWARP, due to resource
          limitations, network devices do not need to implement a complete
          protocol stack. It is necessary to clarify which parts of the
          existing protocols must be implemented. Meanwhile, if network
          devices maintain TCP connections, they need to manage their
          resources carefully.</t>
        </section>
      </section>
    </section>

    <section title="Requirements">
      <section title="NACA Function Design and Header Definition">
        <t>NACA offloads collective operations that involve low-precision
        computation and heavy I/O communication to network devices. Network
        devices then not only perform packet routing and forwarding, but
        also need to process collective messages. Therefore, NACA functions
        should be designed so that network devices can distinguish and
        process different traffic. Accordingly, an NACA header should be
        designed above the transport layer to provide the mapping between
        packets and collective messages. The following requirements are
        proposed to support collective communication optimization:</t>

        <t>R1: MUST define an NACA header to indicate which collective
        operations switches need to offload, together with relevant
        information, for example, message ID, sequence number, and job
        ID.</t>

        <t>R2: SHOULD support a fallback mechanism, in case network devices
        are not capable of processing complete collective operations.</t>
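
        <t>One possible shape for the header that R1 calls for is sketched
        below. The opcode values, field names, and widths are hypothetical;
        they only illustrate the kind of mapping information such a header
        would carry.</t>

        <figure>
          <artwork>
```python
import struct

# Hypothetical NACA header: an opcode naming the offloaded collective
# operation plus the identifiers listed in R1. All values and widths
# here are assumptions for illustration.
OPCODES = {"allreduce": 1, "broadcast": 2, "scatter": 3}
NACA_HDR = struct.Struct("!BIIH")   # opcode, job_id, msg_id, seq

def pack_naca(op, job_id, msg_id, seq):
    return NACA_HDR.pack(OPCODES[op], job_id, msg_id, seq)

def unpack_naca(raw):
    opcode, job_id, msg_id, seq = NACA_HDR.unpack(raw)
    names = {v: k for k, v in OPCODES.items()}
    return {"op": names[opcode], "job_id": job_id,
            "msg_id": msg_id, "seq": seq}

hdr = pack_naca("allreduce", job_id=9, msg_id=100, seq=5)
fields = unpack_naca(hdr)   # a switch would branch on fields["op"]
```
</artwork>
        </figure>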
      </section>

      <section title="Bridge RDMA Transport Semantics">
        <t>As explained in previous sections, the major gap between native
        RDMA and NACA lies in the transport semantics. Mechanisms for
        transport semantic bridging are needed in order to combine the
        high-performance transport capability of RDMA with NACA
        functionality. Besides, NACA may not need the full functionality of
        native RDMA, and it is not practical to implement full RDMA
        functionality in switches because of limited hardware resources. For
        example, most RDMA-based NACA solutions only call the RDMA Read,
        Write, Send, and Receive operations. Accordingly, the following
        requirements need to be met:</t>

        <t>R3: The transport layer MUST support RDMA functions.</t>

        <t>R4: SHOULD allow for the different RDMA communication modes for
        NACA described in Section 3.</t>

        <t>R5: In server-to-switch mode, SHOULD clarify which subset of RDMA
        functions the switch supports, in order to establish an RDMA
        connection with the server and complete NACA.</t>
      </section>

      <section title="RDMA Transport Related Issues">
        <t>As it has been analyzed in section 3 that IWARP solutions can not
        work well with NACA, because it builds RDMA functions on top of TCP
        which are too complex to implement in switches. The most promising
        solution is RDMA over UDP, for example, RoCEv2. However, native RoCEv2
        has several limitations and can not work very well with NACA in large
        scale clusters. These limitations are reflected in the mechanisms of
        reliability, flow control, and congestion control. For reliability,
        go-back-n packet retransmission is low efficient, and it may incur
        much buffer occupancy in NACA switches. Priority Flow Control(PFC)
        also has high requirement for buffer space, and for Many-to-one
        collective operations, PFC will take up even more buffer space. As for
        congestion control, there are lots of algorithms and not all of them
        work well with NACA. A common congestion control mechanism need to be
        designed. Thus, there are following requirements:</t>
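
        <t>The gap between go-back-N and selective retransmission can be
        shown with a toy comparison; real RoCEv2 recovery involves NAKs and
        timers, which are omitted here.</t>

        <figure>
          <artwork>
```python
# Toy retransmission-cost comparison after one loss in a window of
# in-flight packets: go-back-N resends everything from the first gap
# onward, while selective retransmission resends only the lost packet.

def go_back_n_resend(window, lost):
    first_gap = min(lost)
    return window[window.index(first_gap):]

def selective_resend(window, lost):
    return sorted(lost)

window = list(range(10))                 # PSNs 0..9 in flight
lost = [2]                               # one packet dropped
gbn = go_back_n_resend(window, lost)     # resends PSNs 2 through 9
sr = selective_resend(window, lost)      # resends PSN 2 only
```
</artwork>
        </figure>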

        <t>R6: NACA MUST be designed with reliability, and the reliability
        mechanism of RoCEv2 SHOULD be modified to be more efficient.</t>

        <t>R7: Flow control SHOULD be optimized in order to save more buffer
        and memory space for NACA functions.</t>

        <t>R8: The congestion control of NACA SHOULD work compatibly with
        other congestion control mechanisms applied for other network traffic
        that runs in the same fabric.</t>
      </section>

      <section title="Joint Design of NACA Task Assignment and Routing Policies">
        <t>Since AI model training tasks usually follow a predefined rule that
        task as well as the training group are settle, and once the training
        starts, there will be no more new comers to join the group. On basis
        of this, NACA task assignment usually follows a centralized pattern.
        For example, NACA support Allreduce by following Aggregation tree, and
        support broadcast by building a multicast forwarding tree. While some
        routing policies may follow distributed patterns. For example,
        Adaptive Routing(AR) selects the optimal path at each network node
        distributedly. These solutions may not co-exist with each other. In
        order to better balance traffic management and task assignment:</t>

        <t>R9: NACA task assignment SHOULD be co-designed with routing
        policies for joint optimization.</t>
      </section>

      <section title="Security and Traffic Isolation">
        <t>In multi-tenant deployments, a single switch may need to perform
        different NACA functions while also forwarding normal traffic.
        Since the NACA header contains collective operation metadata and
        payload parameters, severe security issues will arise if the switch
        logic designed for NACA is incorrectly applied to normal traffic.
        Hence, the security requirements are as follows:</t>

        <t>R10: Resources MUST be isolated on switches to ensure that
        different tasks do not interfere with each other and that NACA
        functions do not operate on normal traffic.</t>
      </section>

      <section title="Fault Tolerance">
        <t>Fault tolerance is required since a single network device may go
        out of service, due to either single-point failure or link
        breakdown. Therefore:</t>

        <t>R11: A mechanism for choosing an alternative node to implement
        NACA functions MUST be designed, to ensure system robustness and
        reliability.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Some security concerns have been described in <xref
      target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>.</t>
    </section>

    <section title="Operational Considerations">
      <t>Use cases like AI model training, distributed storage, and big
      data analysis usually need their infrastructure to be deployed in
      clusters operated by single entities, for example, limited domains
      <xref target="RFC8799"/>. In this case, not only the compute and
      network infrastructure but also the application may be owned by a
      single service provider. These use cases are typically
      performance-driven, which means they need the application and the
      infrastructure to be co-designed to reach optimal performance.
      However, applications need not be co-designed with the underlying
      network protocols case by case. As long as consensus can be reached
      across vendors on the definition and realization of the collective
      operations to be offloaded, such as unified primitives for
      implementing collective communication, applications can leverage a
      standardized northbound API to improve performance, even if the
      applications do not belong to the same service provider.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.5040"?>

      <?rfc include="reference.RFC.8799"?>

      <?rfc include="reference.I-D.yao-tsvwg-cco-problem-statement-and-usecases"?>
    </references>

    <references title="Informative References">
      <reference anchor="NetReduce">
        <front>
          <title>In-Network Aggregation with Transport Transparency for
          Distributed Training</title>

          <author fullname="Shuo Liu" initials="S." surname="Liu">
            <organization>Huawei</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3582016.3582037"/>
      </reference>

      <reference anchor="Cepheus">
        <front>
          <title>Cepheus: Accelerating Datacenter Applications with
          High-Performance RoCE-Capable Multicast</title>

          <author fullname="Wenxue Li" initials="W." surname="Li">
            <organization>HKUST</organization>
          </author>

          <author fullname="Junyi Zhang" initials="J." surname="Zhang">
            <organization>Huawei</organization>
          </author>

          <date year="2024"/>
        </front>
      </reference>

      <reference anchor="SHARP">
        <front>
          <title>Scalable Hierarchical Aggregation and Reduction Protocol
          (SHARP): A Hardware Architecture for Efficient Data
          Reduction</title>

          <author fullname="Richard L. Graham" initials="R. L."
                  surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2016"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>
    </references>
  </back>
</rfc>
