<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yao-tsvwg-cco-requirement-and-analysis-00"
     ipr="trust200902">
  <front>
    <title abbrev="Collective Communication Optimization">Collective
    Communication Optimization: Requirement and Analysis</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Dirk KUTSCHER" initials="D." surname="KUTSCHER">
      <organization>HKUST (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="23" month="October" year="2023"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>parallel computing;collective operation;in-network
    computing</keyword>

    <abstract>
      <t>As mentioned in [CCO PS &amp; USECASE], the most obvious reason
      why existing protocols cannot meet the high-performance
      requirements of collective communication is that these distributed
      applications are not co-designed with the underlying network
      protocols. There is a semantic gap between inter-process message
      transport and packet forwarding, which should be bridged by
      efficient mapping and optimization.</t>

      <t>This document further presents technical requirements on how
      collective communication optimization should be designed, and
      provides discussion and analysis of several pieces of related
      work.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Optimizing collective communication originates from parallel
      computing, which at the very beginning was limited to
      High-Performance Computing (HPC). However, with the development of
      distributed applications, especially distributed Artificial
      Intelligence (AI), collective communication has become a common
      inter-process communication pattern that can be applied in various
      scenarios, and the use cases are still evolving rapidly: from
      clusters in controlled environments, to campus networks, and even
      to the open Internet, as in Federated Learning.</t>

      <t>Collective communication represents the behavior of
      applications, which is currently not co-designed with network
      protocols. Specific requirements on protocol design should be
      raised to guide standardization. There is other work related to
      this topic, but its scope is either too broad or limited to single
      applications. This document focuses on optimizing collective
      communication from the network perspective. Although solutions may
      first be introduced in limited domains <xref target="RFC8799"/>,
      we also consider how to extend the work to Internet-scale
      applications.</t>
    </section>

    <section title="Existing Work and Analysis">
      <t>In this section, we describe work related to collective
      communication optimization. This work comes from IETF working
      groups, research groups, open source projects, and industry
      practice. We also provide analysis to help explain why collective
      communication optimization deserves further investigation and
      study in the IETF.</t>

      <section title="COIN">
        <t>According to the charter of the Computing in the Network
        (COIN) Research Group, its scope is to investigate how network
        data plane programmability can improve the Internet
        architecture. The use cases that COIN defines are relatively
        broad, including network function offloading, machine learning
        acceleration, in-network caching, in-network control, etc. What
        "computing" means here is very generic, and it is highly coupled
        to the programmability of the chips. Collective communication
        offloading is briefly mentioned as one of the use cases of COIN.
        Nevertheless, optimizing collective communication does not only
        leverage programmable chips to help calculate collective
        operations in network devices, but also includes potential
        standardization work such as communication protocol enhancement
        or application interface design.</t>
      </section>

      <section title="SHArP">
        <t>The Scalable Hierarchical Aggregation Protocol (SHArP) <xref
        target="SHArP"/> is a proprietary solution for optimizing
        collective communication, owned by NVIDIA. It has been realized
        in NVIDIA's commodity switches. SHArP breaks the end-to-end
        transport principle by implementing a Target Channel Adapter
        (TCA) in switches. The TCA supports both Reliable Connection
        (RC) transport, to enable reliable delivery of data through the
        aggregation tree, and Unreliable Datagram (UD) transport, to
        enable multicast distribution of the aggregation result. As
        stated, SHArP is based on the InfiniBand architecture. Currently
        it cannot interoperate with other network architectures, which
        limits its applicability.</t>
      </section>

      <section title="Relations with IP multicast">
        <t>As analyzed in [CCO PS &amp; USECASE], existing IP multicast
        protocols have not been put to full use by distributed
        application systems that require collective communication. Most
        collective operations still use unicast transmission, which
        incurs significant overhead in terms of network bandwidth and
        host CPU consumption. Existing IP multicast forwards packets
        based on the destination multicast IP address. It is not yet
        clear how group-based collective communications can leverage
        multicast to improve performance and save bandwidth.</t>
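
        <t>As a purely illustrative, non-normative sketch, the
        difference between today's unicast fan-out and a single IP
        multicast transmission for a one-to-group step can be expressed
        as follows; the group address, port, and socket setup are
        hypothetical placeholders:</t>

        <figure>
          <artwork><![CDATA[
/* Hypothetical illustration: sending one payload to N group
 * members as N unicast datagrams versus one multicast datagram.
 * The group address and port are placeholders, not a design. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Today's common pattern: one copy of the payload per member,
 * i.e., N datagrams leave the sender. */
void send_unicast_fanout(int sock, const char *buf, size_t len,
                         const struct sockaddr_in *members, int n)
{
    for (int i = 0; i < n; i++)
        sendto(sock, buf, len, 0,
               (const struct sockaddr *)&members[i],
               sizeof(members[i]));
}

/* With IP multicast: one datagram addressed to the group; the
 * network replicates it towards the receivers. */
void send_multicast(int sock, const char *buf, size_t len)
{
    struct sockaddr_in group;

    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_addr.s_addr = inet_addr("239.1.1.1"); /* placeholder */
    group.sin_port = htons(5000);                   /* placeholder */
    sendto(sock, buf, len, 0,
           (const struct sockaddr *)&group, sizeof(group));
}
]]></artwork>
        </figure>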
      </section>

      <section title="Unified Communication X">
        <t>Unified Communication X (UCX) defines a set of network APIs
        and their implementation for high-throughput computing. It is
        oriented towards programming models that need collective
        communication. The core objective of UCX is to provide a
        framework for implementing collective communication and to
        address cross-platform functionality and performance portability
        challenges: there are multiple efforts in the industry to
        optimize collective-based programming models, but they are
        either tied to a specific model or limited to specific hardware,
        which creates a lot of duplicate work and does not scale well.
        UCX aims to address these issues by unifying network APIs across
        different network architectures, such as InfiniBand, RoCE,
        TCP/IP, and Cray interconnects. This goal is challenging because
        it introduces a lot of abstraction, which may impact overall
        system performance.</t>

        <t>This document, in contrast, is more concerned with the
        optimization of collective communication itself, including
        protocols that realize message transport, multicast, and OAM.
        Cross-platform portability is not its major goal.</t>
      </section>

      <section title="Other Considerations">
        <t>It is necessary to discuss some other issues, such as where
        the optimization should be applied and who should operate it.
        This scope analysis helps map out the landscape of the new
        technique.</t>

        <section title="Applicable Domain">
          <t>Since whether the end-to-end transport mechanism should be
          broken to optimize collective communication is still under
          discussion, there are different technical roadmaps. On the one
          hand, SHArP is representative of breaking end-to-end
          transport, but it is based on InfiniBand. On the other hand,
          transport-transparent solutions such as <xref pageno="false"
          target="NetReduce"/> do not break the end-to-end rule and can
          be host friendly, giving them better scalability and
          interoperability for application in the Internet. However,
          further investigation is needed into which collective
          operations other than Allreduce could be accelerated.
          Considering that optimizing collective communication may
          introduce additional overhead for transport upgrades, and that
          there are security issues as side effects, as stated in [CCO
          PS &amp; USECASE], collective optimization may first be
          introduced into limited domains and then extended to the
          Internet.</t>
        </section>

        <section title="Operational considerations">
          <t>Use cases such as AI model training, distributed storage,
          and big data analysis usually require infrastructure to be
          deployed in clusters operated by single entities. In this
          case, not only the compute and network infrastructure but also
          the application may be owned by a single service provider.
          These use cases are typically performance driven, which means
          that the application and the infrastructure need to be
          co-designed to achieve optimization. However, applications do
          not have to be co-designed with the underlying network
          protocols case by case: as long as consensus can be reached
          across vendors on the definition and realization of the
          collective operations to be offloaded, such as unified
          primitives for implementing collective communication,
          applications can leverage a standardized northbound API to
          improve performance, even if the applications do not belong to
          the same service provider <xref target="NetRPC"/>.</t>
        </section>
      </section>
    </section>

    <section title="Requirements">
      <section title="Improve Tranport Efficiency">
        <t>Distributed parallel computing needs high-performance
        inter-process communication. Remote Memory Access (RMA) can
        reduce the number of memory copies, which improves the
        performance of distributed systems. A non-normative sketch of an
        RMA transfer is given after the requirements below.</t>

        <t>R1. The transport layer MUST support an RMA function.</t>

        <t>R2. Memory sharing is RECOMMENDED to support collective
        operations.</t>
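
        <t>The following is a minimal, non-normative sketch of an RMA
        transfer. MPI one-sided operations are used here only as a
        familiar example of the capability that R1 refers to; the buffer
        size, target rank, and payload are arbitrary assumptions:</t>

        <figure>
          <artwork><![CDATA[
/* Minimal RMA sketch using MPI one-sided operations: rank 0
 * writes directly into memory exposed by rank 1, with no matching
 * receive call and no intermediate copy on the target side. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double window_buf[16] = {0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes a region of memory for remote access. */
    MPI_Win_create(window_buf, sizeof(window_buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double payload = 3.14;
        /* Remote write into rank 1's window, no recv on rank 1. */
        MPI_Put(&payload, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
]]></artwork>
        </figure>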
      </section>

      <section title="Transport Suitability for Different Communication Modes">
        <t>The communication modes and traffic characteristics of
        different applications in parallel computing vary, so the
        required transport-layer capabilities differ as well. The
        transport layer needs to make adjustments to adapt to different
        communication modes. In particular, once collective operations
        offloading is supported, the communication modes include both
        end-to-end and end-to-network, which means supporting in-network
        processing/computing functions. A non-normative sketch
        contrasting blocking and non-blocking transfers is given after
        the requirements below.</t>

        <t>R3. The implementation of blocking or non-blocking
        communication in applications MUST be adjusted according to the
        different communication modes.</t>

        <t>R4. The transport layer MUST provide appropriate reliability
        mechanisms to adapt to different communication modes.</t>
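
        <t>As a non-normative illustration of the adjustment R3 refers
        to, the sketch below contrasts a blocking and a non-blocking
        transfer. MPI point-to-point calls are used only as a familiar
        example; the peer rank and tags are arbitrary assumptions:</t>

        <figure>
          <artwork><![CDATA[
/* Illustrative contrast between blocking and non-blocking sends.
 * The non-blocking form lets the application overlap computation
 * with communication while the transport completes the transfer. */
#include <mpi.h>

void exchange(double *buf, int count, int peer)
{
    MPI_Request req;

    /* Blocking: returns only when buf may safely be reused. */
    MPI_Send(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

    /* Non-blocking: returns immediately; completion is awaited
     * later, so unrelated computation can proceed in between. */
    MPI_Isend(buf, count, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD,
              &req);
    /* ... computation that does not touch buf ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
]]></artwork>
        </figure>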
      </section>

      <section title="End-network Collaborative Collective Communication">
        <t>Collective operations offloading utilizes network devices to
        perform computing tasks that have low computational precision
        requirements but are I/O intensive, thereby optimizing
        collective operations. In-network computing devices not only
        perform packet routing and forwarding, but also need to process
        transport-layer messages. Therefore, the transport layer needs
        to provide a mapping mechanism between packets and messages and
        to complete the transmission of transport-layer messages. A
        transport-layer protocol that supports collective communication
        optimization needs to meet the following requirements (an
        illustrative, non-normative message header sketch is given after
        the requirements):</t>

        <t>R5. The transport layer MUST carry messages that network
        devices can recognize, so that offloaded collective operations
        can be processed.</t>

        <t>R6. The transport layer MUST support a fallback mechanism, in
        case network device capabilities or resources are insufficient
        for collective operations offloading.</t>
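
        <t>The structure below is a purely hypothetical illustration,
        not a proposed wire format: the field names, widths, and
        semantics are assumptions, sketching the kind of information a
        message would need to carry so that a network device can
        recognize an offloaded collective operation (R5) and so that a
        fallback to end-to-end delivery remains possible (R6):</t>

        <figure>
          <artwork><![CDATA[
/* Hypothetical message header, for illustration only: field
 * names, widths, and semantics are assumptions, not a format
 * proposed by this document. */
#include <stdint.h>

struct cco_msg_hdr {
    uint16_t collective_op; /* e.g., ALLREDUCE, BROADCAST, ...  */
    uint16_t datatype;      /* element type of the payload      */
    uint32_t job_id;        /* identifies the job/communicator  */
    uint32_t msg_seq;       /* message sequence, loss recovery  */
    uint32_t element_count; /* number of elements in message    */
    uint8_t  flags;         /* e.g., "in-network processing
                             * requested", "fall back to
                             * end-to-end delivery" (R6)        */
    uint8_t  reserved[3];   /* padding / future use             */
};
]]></artwork>
        </figure>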
      </section>

      <section title="Management and Control">
        <t>Current hardware implementations of collective operations
        offloading vary, and there is no unified standard for the
        supported in-network computing primitives, which poses great
        challenges to the management and control plane. In addition, as
        the network scale continues to expand, it is necessary to rely
        on topology-aware algorithms to perform computing-related task
        allocation, path planning, etc., in order to achieve effective
        control.</t>

        <t>R7. The management and control plane MUST support a unified
        definition and management of the data types and data structures
        supported by network devices for collective offloading, as well
        as unified management of network device resources such as
        memory.</t>

        <t>R8. It is RECOMMENDED that topology awareness, task
        scheduling, and task allocation be achieved in collaboration
        between the end hosts and the network.</t>
      </section>

      <section title="Make Use of IP Multicast">
        <t>IP multicast protocols provide native one-to-group
        communication mechanisms, which are well suited to one-to-group
        and group-to-group collective operations. Even though designing
        IP multicast protocols is not within the scope of the transport
        and application areas, extending IP multicast protocols to
        support collective communication is necessary. Thus:</t>

        <t>R9. IP multicast protocols SHOULD be extended to support
        collective communication. However, whether to design new
        multicast algorithms and protocols dedicated to collective
        communication is out of scope.</t>
      </section>

      <section title="Self-healing">
        <t>Self-healing is required since there is a chance that an
        in-network device goes out of service, due to either a single
        point of failure or a link breakdown. Therefore:</t>

        <t>R10. A mechanism for choosing an alternative node to
        implement collective operations MUST be designed, to ensure
        system robustness and reliability.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Some security concerns have been described in [CCO PS &amp;
      USECASE].</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8799"?>
    </references>

    <references title="Informative References">
      <reference anchor="NetReduce">
        <front>
          <title>In-Network Aggregation with Transport Transparency for
          Distributed Training</title>

          <author fullname="Shuo Liu" surname="Liu">
            <organization>Huawei</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3582016.3582037"/>
      </reference>

      <reference anchor="SHArP">
        <front>
          <title>Scalable Hierarchical Aggregation Protocol (SHArP): A
          Hardware Architecture for Efficient Data Reduction</title>

          <author fullname="Richard L. Graham" surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>

      <reference anchor="NetRPC">
        <front>
          <title>NetRPC: Enabling In-Network Computation in Remote Procedure
          Calls</title>

          <author fullname="Bohan Zhao" surname="Zhao">
            <organization>Tsinghua University</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
