<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<!-- You want a table of contents -->
<!-- Use symbolic labels for references -->
<!-- This sorts the references -->
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yao-tsvwg-cco-requirement-and-analysis-00"
     ipr="trust200902">
  <front>
    <title abbrev="Collective Communication Optimization">Collective
    Communication Optimization: Requirement and Analysis</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Dirk KUTSCHER" initials="D." surname="KUTSCHER">
      <organization>HKUST (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="23" month="October" year="2023"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>parallel computing;collective operation;in-network
    computing</keyword>

    <abstract>
      <t>As mentioned in [CCO PS &amp; USECASE], the most obvious reason
      why existing protocols cannot meet the high-performance
      requirements of collective communication is that these distributed
      applications are not co-designed with the underlying network
      protocols. There is a semantic gap between inter-process message
      transport and packet forwarding, which should be bridged by
      efficient mapping and optimization.</t>

      <t>This document further presents technical requirements on how
      collective communication optimization should be designed, and
      provides discussion and analysis of several pieces of related
      work.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Optimizing collective communication originates from parallel
      computing, which at the very beginning was limited to
      High-Performance Computing (HPC). However, with the development of
      distributed applications, especially distributed Artificial
      Intelligence (AI), collective communication has become a common
      inter-process communication pattern that can be applied in various
      scenarios, and the use cases are still evolving rapidly: from
      clusters in controlled environments, to campus networks, and even
      to the open Internet, as in Federated Learning.</t>

      <t>Collective communication represents the behavior of
      applications, which is currently not co-designed with network
      protocols. Specific requirements on protocol design should be
      raised to guide standardization. There is other work related to
      this topic, but its scope is either too broad or limited to single
      applications. This document focuses on optimizing collective
      communication from the network perspective. Although solutions may
      first be introduced in limited domains <xref target="RFC8799"/>,
      we also consider how to extend the work to Internet-scale
      applications.</t>
    </section>

    <section title="Existing Work and Analysis">
      <t>In this section, we describe work related to collective
      communication optimization. This work comes from IETF working
      groups, research groups, open source projects, and industry
      practice. We also provide analysis to help explain why collective
      communication optimization deserves further investigation and
      study in the IETF.</t>

      <section title="COIN">
        <t>According to the charter of the Computing in the Network
        (COIN) Research Group, its scope is to investigate how network
        data plane programmability can improve the Internet
        architecture. The use cases that COIN defines are relatively
        broad, including network function offloading, machine learning
        acceleration, in-network caching, in-network control, etc. What
        "computing" means here is very generic, and it is highly coupled
        to the programmability of the chips. Collective communication
        offloading is briefly mentioned as one of the use cases of COIN.
        Nevertheless, optimizing collective communication does not only
        leverage programmable chips to help calculate collective
        operations in network devices, but also includes potential
        standardization work such as communication protocol enhancement
        or application interface design.</t>
      </section>

      <section title="SHArP">
        <t>The Scalable Hierarchical Aggregation Protocol (SHArP) <xref
        target="SHArP"/> is a proprietary solution for optimizing
        collective communication, owned by NVIDIA. It has been realized
        in NVIDIA's commodity switches. SHArP breaks the end-to-end
        transport principle by implementing a Target Channel Adapter
        (TCA) in switches. The TCA supports both Reliable Connection
        (RC) transport, to enable reliable delivery of data through the
        aggregation tree, and Unreliable Datagram (UD) transport, to
        enable multicast distribution of the aggregation result. As
        stated, SHArP is based on the InfiniBand architecture. Currently
        it cannot interoperate with other network architectures, which
        limits its applicability.</t>
      </section>

      <section title="Relations with IP multicast">
        <t>As analyzed in [CCO PS &amp; USECASE], existing IP multicast
        protocols have not been put to full use by distributed
        application systems that require collective communication. Most
        collective operations still use unicast transmission, which
        incurs significant overhead in terms of network bandwidth and
        host CPU consumption. Existing IP multicast forwards packets
        based on the destination multicast IP address. It is not yet
        clear how group-based collective communications can leverage
        multicast to improve performance and save bandwidth.</t>
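
        <t>As a purely illustrative, non-normative sketch, the
        difference between today's unicast fan-out and a single IP
        multicast transmission for a one-to-group step can be expressed
        as follows; the group address, port, and socket setup are
        hypothetical placeholders:</t>

        <figure>
          <artwork><![CDATA[
/* Hypothetical illustration: sending one payload to N group
 * members as N unicast datagrams versus one multicast datagram.
 * The group address and port are placeholders, not a design. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Today's common pattern: one copy of the payload per member,
 * i.e., N datagrams leave the sender. */
void send_unicast_fanout(int sock, const char *buf, size_t len,
                         const struct sockaddr_in *members, int n)
{
    for (int i = 0; i < n; i++)
        sendto(sock, buf, len, 0,
               (const struct sockaddr *)&members[i],
               sizeof(members[i]));
}

/* With IP multicast: one datagram addressed to the group; the
 * network replicates it towards the receivers. */
void send_multicast(int sock, const char *buf, size_t len)
{
    struct sockaddr_in group;

    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_addr.s_addr = inet_addr("239.1.1.1"); /* placeholder */
    group.sin_port = htons(5000);                   /* placeholder */
    sendto(sock, buf, len, 0,
           (const struct sockaddr *)&group, sizeof(group));
}
]]></artwork>
        </figure>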
      </section>

      <section title="Unified Communication X">
        <t>Unified Communication X (UCX) defines a set of network APIs
        and their implementation for high-throughput computing. It is
        oriented towards programming models that need collective
        communication. The core objective of UCX is to provide a
        framework for implementing collective communication and to
        address cross-platform functionality and performance portability
        challenges: there are multiple efforts in the industry to
        optimize collective-based programming models, but they are
        either tied to a specific model or limited to specific hardware,
        which creates a lot of duplicate work and does not scale well.
        UCX aims to address these issues by unifying network APIs across
        different network architectures, such as InfiniBand, RoCE,
        TCP/IP, and Cray interconnects. This goal is challenging because
        it introduces a lot of abstraction, which may impact overall
        system performance.</t>

        <t>This document, in contrast, is more concerned with the
        optimization of collective communication itself, including
        protocols that realize message transport, multicast, and OAM.
        Cross-platform portability is not its major goal.</t>
      </section>

      <section title="Other Considerations">
        <t>It is necessary to discuss some other issues, such as where
        the optimization should be applied and who should operate it.
        This scope analysis helps map out the landscape of the new
        technique.</t>

        <section title="Applicable Domain">
          <t>Since whether the end-to-end transport mechanism should be
          broken to optimize collective communication is still under
          discussion, there are different technical roadmaps. On the one
          hand, SHArP is representative of breaking end-to-end
          transport, but it is based on InfiniBand. On the other hand,
          transport-transparent solutions such as <xref pageno="false"
          target="NetReduce"/> do not break the end-to-end rule and can
          be host friendly, giving them better scalability and
          interoperability for application in the Internet. However,
          further investigation is needed into which collective
          operations other than Allreduce could be accelerated.
          Considering that optimizing collective communication may
          introduce additional overhead for transport upgrades, and that
          there are security issues as side effects, as stated in [CCO
          PS &amp; USECASE], collective optimization may first be
          introduced into limited domains and then extended to the
          Internet.</t>
        </section>

        <section title="Operational considerations">
          <t>Use cases such as AI model training, distributed storage,
          and big data analysis usually require infrastructure to be
          deployed in clusters operated by single entities. In this
          case, not only the compute and network infrastructure but also
          the application may be owned by a single service provider.
          These use cases are typically performance driven, which means
          that the application and the infrastructure need to be
          co-designed to achieve optimization. However, applications do
          not have to be co-designed with the underlying network
          protocols case by case: as long as consensus can be reached
          across vendors on the definition and realization of the
          collective operations to be offloaded, such as unified
          primitives for implementing collective communication,
          applications can leverage a standardized northbound API to
          improve performance, even if the applications do not belong to
          the same service provider <xref target="NetRPC"/>.</t>
        </section>
      </section>
    </section>

    <section title="Requirements">
      <section title="Improve Tranport Efficiency">
        <t>Distributed parallel computing needs high-performance
        inter-process communication. Remote Memory Access (RMA) can
        reduce the number of memory copies, which improves the
        performance of distributed systems. A non-normative sketch of an
        RMA transfer is given after the requirements below.</t>

        <t>R1. The transport layer MUST support an RMA function.</t>

        <t>R2. Memory sharing is RECOMMENDED to support collective
        operations.</t>
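
        <t>The following is a minimal, non-normative sketch of an RMA
        transfer. MPI one-sided operations are used here only as a
        familiar example of the capability that R1 refers to; the buffer
        size, target rank, and payload are arbitrary assumptions:</t>

        <figure>
          <artwork><![CDATA[
/* Minimal RMA sketch using MPI one-sided operations: rank 0
 * writes directly into memory exposed by rank 1, with no matching
 * receive call and no intermediate copy on the target side. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double window_buf[16] = {0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes a region of memory for remote access. */
    MPI_Win_create(window_buf, sizeof(window_buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double payload = 3.14;
        /* Remote write into rank 1's window, no recv on rank 1. */
        MPI_Put(&payload, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
]]></artwork>
        </figure>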
      </section>

      <section title="Transport Suitability for Different Communication Modes">
        <t>The communication modes and traffic characteristics of
        different applications in parallel computing vary, so the
        required transport-layer capabilities differ as well. The
        transport layer needs to make adjustments to adapt to different
        communication modes. In particular, once collective operations
        offloading is supported, the communication modes include both
        end-to-end and end-to-network, which means supporting in-network
        processing/computing functions. A non-normative sketch
        contrasting blocking and non-blocking transfers is given after
        the requirements below.</t>

        <t>R3. The implementation of blocking or non-blocking
        communication in applications MUST be adjusted according to the
        different communication modes.</t>

        <t>R4. The transport layer MUST provide appropriate reliability
        mechanisms to adapt to different communication modes.</t>
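
        <t>As a non-normative illustration of the adjustment R3 refers
        to, the sketch below contrasts a blocking and a non-blocking
        transfer. MPI point-to-point calls are used only as a familiar
        example; the peer rank and tags are arbitrary assumptions:</t>

        <figure>
          <artwork><![CDATA[
/* Illustrative contrast between blocking and non-blocking sends.
 * The non-blocking form lets the application overlap computation
 * with communication while the transport completes the transfer. */
#include <mpi.h>

void exchange(double *buf, int count, int peer)
{
    MPI_Request req;

    /* Blocking: returns only when buf may safely be reused. */
    MPI_Send(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

    /* Non-blocking: returns immediately; completion is awaited
     * later, so unrelated computation can proceed in between. */
    MPI_Isend(buf, count, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD,
              &req);
    /* ... computation that does not touch buf ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
]]></artwork>
        </figure>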
      </section>

      <section title="End-network Collaborative Collective Communication">
        <t>Collective operations offloading utilizes network devices to
        perform computing tasks that have low computational precision
        requirements but are I/O intensive, thereby optimizing
        collective operations. In-network computing devices not only
        perform packet routing and forwarding, but also need to process
        transport-layer messages. Therefore, the transport layer needs
        to provide a mapping mechanism between packets and messages and
        to complete the transmission of transport-layer messages. A
        transport-layer protocol that supports collective communication
        optimization needs to meet the following requirements (an
        illustrative, non-normative message header sketch is given after
        the requirements):</t>

        <t>R5. The transport layer MUST carry messages that network
        devices can recognize, so that offloaded collective operations
        can be processed.</t>

        <t>R6. The transport layer MUST support a fallback mechanism, in
        case network device capabilities or resources are insufficient
        for collective operations offloading.</t>
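
        <t>The structure below is a purely hypothetical illustration,
        not a proposed wire format: the field names, widths, and
        semantics are assumptions, sketching the kind of information a
        message would need to carry so that a network device can
        recognize an offloaded collective operation (R5) and so that a
        fallback to end-to-end delivery remains possible (R6):</t>

        <figure>
          <artwork><![CDATA[
/* Hypothetical message header, for illustration only: field
 * names, widths, and semantics are assumptions, not a format
 * proposed by this document. */
#include <stdint.h>

struct cco_msg_hdr {
    uint16_t collective_op; /* e.g., ALLREDUCE, BROADCAST, ...  */
    uint16_t datatype;      /* element type of the payload      */
    uint32_t job_id;        /* identifies the job/communicator  */
    uint32_t msg_seq;       /* message sequence, loss recovery  */
    uint32_t element_count; /* number of elements in message    */
    uint8_t  flags;         /* e.g., "in-network processing
                             * requested", "fall back to
                             * end-to-end delivery" (R6)        */
    uint8_t  reserved[3];   /* padding / future use             */
};
]]></artwork>
        </figure>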
      </section>

      <section title="Management and Control">
        <t>Current hardware implementations of collective operations
        offloading vary, and there is no unified standard for the
        supported in-network computing primitives, which poses great
        challenges to the management and control plane. In addition, as
        the network scale continues to expand, it is necessary to rely
        on topology-aware algorithms to perform computing-related task
        allocation, path planning, etc., in order to achieve effective
        control.</t>

        <t>R7. The management and control plane MUST support a unified
        definition and management of the data types and data structures
        supported by network devices for collective offloading, as well
        as unified management of network device resources such as
        memory.</t>

        <t>R8. It is RECOMMENDED that topology awareness, task
        scheduling, and task allocation be achieved in collaboration
        between the end hosts and the network.</t>
      </section>

      <section title="Make Use of IP Multicast">
        <t>IP multicast protocols provide native one-to-group
        communication mechanisms, which are well suited to one-to-group
        and group-to-group collective operations. Even though designing
        IP multicast protocols is not within the scope of the transport
        and application areas, extending IP multicast protocols to
        support collective communication is necessary. Thus:</t>

        <t>R9. IP multicast protocols SHOULD be extended to support
        collective communication. However, whether to design new
        multicast algorithms and protocols dedicated to collective
        communication is out of scope.</t>
      </section>

      <section title="Self-healing">
        <t>Self-healing is required since there is a chance that an
        in-network device goes out of service, due to either a single
        point of failure or a link breakdown. Therefore:</t>

        <t>R10. A mechanism for choosing an alternative node to
        implement collective operations MUST be designed, to ensure
        system robustness and reliability.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Some security concerns have been described in [CCO PS &amp;
      USECASE].</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8799"?>
    </references>

    <references title="Informative References">
      <reference anchor="NetReduce">
        <front>
          <title>In-Network Aggregation with Transport Transparency for
          Distributed Training</title>

          <author fullname="Shuo Liu" surname="Liu">
            <organization>Huawei</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3582016.3582037"/>
      </reference>

      <reference anchor="SHArP">
        <front>
          <title>Scalable Hierarchical Aggregation Protocol (SHArP): A
          Hardware Architecture for Efficient Data Reduction</title>

          <author fullname="Richard L. Graham" surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>

      <reference anchor="NetRPC">
        <front>
          <title>NetRPC: Enabling In-Network Computation in Remote Procedure
          Calls</title>

          <author fullname="Bohan Zhao" surname="Zhao">
            <organization>Tsinghua University</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
