<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-yao-tsvwg-cco-requirement-and-analysis-01"
     ipr="trust200902">
  <front>
    <title
    abbrev="Collective Communication Optimizations: Requirement and Analysis">Collective
    Communication Optimizations: Requirement and Analysis</title>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yizhou Li" initials="Y." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing, Jiangsu</city>

          <country>China</country>
        </postal>

        <email>liyizhou@huawei.com</email>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>hongyi.huang@huawei.com</email>
      </address>
    </author>

    <author fullname="Weifeng Wang" initials="W." surname="Wang">
      <organization>New H3C Technologies Co., Ltd</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <country>China</country>
        </postal>

        <email>wangweifeng@h3c.com</email>
      </address>
    </author>

    <author fullname="Dirk Kutscher" initials="D." surname="Kutscher">
      <organization>The Hong Kong University of Science and Technology
      (Guangzhou)</organization>

      <address>
        <postal>
          <street/>

          <city>Guangzhou</city>

          <country>China</country>
        </postal>

        <email>dku@hkust-gz.edu.cn</email>
      </address>
    </author>

    <date day="5" month="February" year="2024"/>

    <area>Transport</area>

    <workgroup>Transport Area Working Group</workgroup>

    <keyword>collective communication</keyword>

    <keyword>RDMA</keyword>

    <abstract>
      <t>Generative AI applications depend on large-scale parallel computing
      clusters for model training and inference. Existing implementations of
      collective communication in parallel computing are built on top of
      RDMA, the most widely adopted transport technology for AI workloads.
      However, One-to-Many, Many-to-One, and Many-to-Many collective
      operations all depend on the point-to-point transport semantics of
      RDMA, which inevitably introduces extra bandwidth consumption and
      transmission overhead. Emerging approaches for collective communication
      optimization focus on network-assisted collective acceleration and can
      work compatibly with RDMA. This document analyzes different technical
      schemes for network-assisted collective acceleration based on RDMA and
      presents the gaps between this work and current IETF standards, notably
      iWARP. Requirements for designing new standards are proposed
      accordingly.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>With the development of distributed applications, especially High
      Performance Computing (HPC) and Artificial Intelligence (AI), the scale
      of parallel computing clusters is constantly expanding, and the
      pressure that collective communication places on the network is also
      increasing. Existing implementations of collective communication are
      based on RDMA (Remote Direct Memory Access). The most obvious problem
      is that the point-to-point transmission semantics of RDMA are not well
      aligned with the logical communication patterns defined in collective
      communication, which incurs extra bandwidth consumption, more memory
      copies at endpoints, and more data movement, thus lowering the overall
      parallel computing efficiency. Detailed use cases and problems are
      presented in
      <xref target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>.</t>

      <t>Emerging technical schemes for collective communication
      optimization focus on network-assisted collective acceleration, which
      can greatly alleviate network pressure, improve transmission
      efficiency, and shorten flow completion time (FCT). Some of these
      approaches can also work compatibly with RDMA, which opens new
      standardization design space for extended RDMA-based protocols for
      collective communication optimization. In the following sections, this
      document analyzes different technical schemes for network-assisted
      collective acceleration based on RDMA and presents the gaps between
      this work and current IETF standards. Requirements for designing new
      standards are proposed accordingly.</t>
    </section>

    <section anchor="definition-of-terms" title="Definition of Terms">
      <t>*Collective communication: A set of communication patterns that
      application processes follow to communicate with each other in a
      parallel computing cluster. These patterns include One-to-Many,
      Many-to-One, and Many-to-Many delivery modes.</t>

      <t>*Network-Assisted Collective Acceleration (NACA): Using network
      devices, such as switches, to offload and perform collective
      operations, so as to improve overall collective communication
      efficiency.</t>
    </section>

    <section title="Existing Work and Analysis">
      <t>NACA offloads collective operations to switches. For example,
      Allreduce is performed in the switch through aggregation, Broadcast is
      offloaded to the switch for data replication, and Scatter leverages
      the switch for data partitioning. Detailed collective operations are
      listed in <xref
      target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>. NACA can
      be built on RDMA so as to optimize collective communication. RDMA
      allows endpoints to directly read and write memory on other endpoints
      at high speed without requiring kernel processing or CPU resources.
      Zero-copy memory access and kernel bypass have gradually made RDMA the
      mainstream communication technology for HPC and AI applications in
      data centers. This draft mainly focuses on the analysis of two
      different communication modes for RDMA-based network-assisted
      collective acceleration.</t>
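
      <t>As an illustration of the offloaded aggregation step, the following
      Python sketch shows how one Allreduce round collapses the packets from
      N senders into a single downstream packet. The packet structures are
      invented for illustration; a real switch performs this in the data
      plane on RDMA packets.</t>

      <figure>
        <artwork>
```python
# Minimal sketch of in-network Allreduce aggregation (illustrative
# only; the packet structures are invented for this example).

def aggregate_round(packets):
    """Element-wise sum of the payloads from all senders in one round."""
    length = len(packets[0]["payload"])
    result = [0] * length
    for pkt in packets:
        for i, value in enumerate(pkt["payload"]):
            result[i] += value
    return {"round": packets[0]["round"], "payload": result}

round_pkts = [
    {"round": 0, "payload": [1, 2, 3]},   # from worker 1
    {"round": 0, "payload": [4, 5, 6]},   # from worker 2
    {"round": 0, "payload": [7, 8, 9]},   # from worker 3
]
agg = aggregate_round(round_pkts)
# The switch forwards only this one aggregated packet downstream;
# the three input packets are consumed.
```
</artwork>
      </figure>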

      <section title="Classification and Analysis of Two Modes for NACA">
        <t>When using network devices to offload collective operations, RDMA
        communication modes can be divided into a server-to-server mode and
        a server-to-switch mode.</t>

        <section title="NACA Based on Server-to-server RDMA Connection">
          <t>The server-to-server RDMA connection mode does not change the
          logic of existing applications, and the switches that participate
          in collective communication are not visible to applications. The
          destination of the RDMA connection is another server endpoint, but
          switches can participate in the collective operations during data
          transmission. In this communication mode, native transport
          protocols can produce false positives. For example, when the
          switch helps perform data aggregation during Allreduce, in each
          round only one aggregated packet is sent from the switch to the
          destination server; the packets from the multiple senders are
          consumed by the aggregation. At the destination side, this will be
          judged as packet loss and trigger packet retransmission <xref
          target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>. The
          reliability mechanisms of the native RDMA transport therefore need
          to be modified.</t>

          <figure align="center"
                  title="NACA Based on Server-to-server RDMA Connection">
            <artwork>+------------+           +----------------------+            +------------+
|            &lt;-----------------------------------------------&gt; receiver 1 |
|            |           |                      |            +------------+
|            |           |        switch        |
|   sender   |           | forwarding and NACA  |
|            |           |                      |            +------------+
|            &lt;-----------------------------------------------&gt; receiver 2 |
|            |           |   RDMA connection    |            +------------+
|            |           |                      |            
|            |           |                      |
|            |           |                      |            +------------+
|            &lt;-----------------------------------------------&gt; receiver 3 |
+------------+           +----------------------+            +------------+</artwork>
          </figure>
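
          <t>The false positive described above can be reproduced with a toy
          loss detector. This is a simplification of a go-back-N receiver;
          the identifiers and packet shapes are invented for
          illustration.</t>

          <figure>
            <artwork>
```python
# Toy loss detection at the destination: the receiver expects one
# packet per sender per round, but in-network aggregation delivers
# only a single aggregated packet, so the remaining packets are
# (wrongly) judged as lost.

def detect_loss(expected, arrived):
    """Return the (sender, psn) pairs the receiver believes were lost."""
    return [pkt for pkt in expected if pkt not in arrived]

senders = ["worker1", "worker2", "worker3"]
expected = [(s, 0) for s in senders]   # round 0: one packet per sender
arrived = [("worker1", 0)]             # the switch forwarded one aggregate
spurious = detect_loss(expected, arrived)
# Spurious retransmissions would be triggered for worker2 and worker3.
```
</artwork>
          </figure>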
        </section>

        <section title="NACA Based on Server-to-switch RDMA Connection">
          <t>The server-to-switch mode means that switches act as RDMA
          endpoints, and an RDMA connection is built between the sender,
          i.e., the server, and the destination, i.e., the switch. In this
          case, hop-by-hop RDMA connections are built for end-to-end data
          transmission. It is necessary to define which RDMA functions
          should be offloaded to switches, since offloading all RDMA
          functions to the network would place a heavy burden on switches,
          not only in memory and buffer space but also in protocol
          complexity.</t>

          <figure align="center"
                  title="NACA Based on Server-to-switch RDMA Connection">
            <artwork>                                               RDMA connection
                      +------------------------+             +------------+
                      |                        &lt;-------------&gt; receiver 1 |
                      |                        |             +------------+
               RDMA   |                        |
            Connection|                        |
+----------+          |         switch         |             +------------+
|  sender  &lt;----------&gt;  forwarding and NACA   &lt;-------------&gt; receiver 2 |
+----------+          |                        |             +------------+
                      |                        |
                      |                        |
                      |                        |             +------------+
                      |                        &lt;-------------&gt; receiver 3 |
                      +------------------------+             +------------+</artwork>
          </figure>
        </section>
      </section>

      <section title="Gap Analysis of Existing Solutions">
        <section title="Infiniband SHARP">
          <t>The Scalable Hierarchical Aggregation and Reduction Protocol
          (SHARP) <xref target="SHARP"/> breaks the end-to-end transport
          rule by implementing a Target Channel Adapter (TCA) in switches.
          The TCA supports both Reliable Connection (RC) transport, to
          enable reliable delivery of data through the aggregation tree, and
          Unreliable Datagram (UD) transport, to enable multicast
          distribution of the aggregation result. SHARP has been realized in
          commodity InfiniBand switches, but it is tied to the InfiniBand
          architecture. Currently it cannot interoperate with other network
          architectures, which limits its applicability.</t>

          <t>Figure 3 shows the SHARP protocol. It has two primary phases,
          Aggregation Request and Aggregation Response. The SHARP header is
          designed over the IBA header, followed by the aggregation
          operations and other data description information.</t>

          <figure align="center" title="SHARP Protocol">
            <artwork>+---------+---------+-------+------+-----------+--------+--------+----+
|   IBA   |  SHARP  | Tuple | User | Operation | Target | SHARP  | CRC|
| Header  |  Header | Header| Data |  Header   | Header | Payload|    |
+---------+---------+-------+------+-----------+--------+--------+----+

           Aggregation Request Pkt

+---------+---------+-------+------+--------+----+
|   IBA   |  SHARP  | Tuple | User | SHARP  | CRC|
| Header  |  Header | Header| Data | Payload|    |
+---------+---------+-------+------+--------+----+

           Aggregation Response Pkt</artwork>
          </figure>
        </section>

        <section title="RoCEv2 Solutions">
          <t>RDMA over Converged Ethernet version 2 (RoCEv2) is an RDMA
          scheme based on UDP over Ethernet. Its core design reuses
          InfiniBand's transport layer, where data is transmitted in
          sequence and retransmitted using go-back-N. Therefore, a lossless
          and ordered network is required to achieve ideal performance; such
          networks introduce Priority Flow Control (PFC) and IP-based
          Explicit Congestion Notification (ECN) to ensure lossless
          transmission. Technical schemes for NACA based on RoCEv2 have been
          analyzed in both academia and industry, but compared with
          InfiniBand SHARP, there are currently no commercial solutions,
          which means there is considerable standardization space in this
          area.</t>

          <t>Consider <xref target="NetReduce"/> and <xref
          target="Cepheus"/> as two examples of the server-to-server
          communication mode of RoCEv2-based NACA.</t>

          <t><xref target="NetReduce"/> is designed to offload Allreduce to
          switches. For ring-Allreduce, workers establish RDMA connections
          with their front and rear neighbors, using RDMA Write to send
          parameters and RDMA Read to receive aggregation results. The
          switch acts as a man-in-the-middle that receives data, aggregates
          it locally, and then returns the results to workers via RDMA Read.
          This approach has little impact on applications, and it improves
          performance since it reduces the number of aggregation rounds
          compared to the traditional ring-Allreduce method. However,
          mechanisms such as transport reliability and flow control are
          designed based on a server-to-server communication model, so they
          need to be redesigned or adapted accordingly.</t>

          <t>Figure 5 shows the three packet formats of the NetReduce
          protocol: the first packet, middle packets, and the last packet.
          The NetReduce header is built over RoCEv2. NetReduce provides a
          similar function to SHARP, but it is designed for the aggregation
          used in ring-Allreduce, so it contains ring information, message
          information, and rank information.</t>

          <figure align="center" title="NetReduce Illustration">
            <artwork>        +-----------------------------------------+
        |                 Switch                  |
        |            man-in-the-middle            |
        |                                         |
        |    +------------+     +------------+    |
        |    |            |     |            |    |
        +----+------------+-----+------------+----+
             |            |     |            |
             |            |     |            |
       +-----+-----+   +--+-----+--+   +-----+-----+
       |  worker 1 |   |  worker 2 |   |  worker 3 |
       +-----------+   +-----------+   +-----------+</artwork>
          </figure>

          <figure align="center" title="NetReduce Protocol">
            <artwork>                        First Pkt
+---------+---------+---------+---------------+----------+------+
|   UDP   | IB BTH  | IB RETH | NetReduce Hdr |  Payload | ICRC |
+---------+---------+---------+---------------+----------+------+

                       Middle Pkt
+---------+---------+----------+------+
|   UDP   | IB BTH  |  Payload | ICRC |
+---------+---------+----------+------+

                       Last Pkt
+---------+---------+---------+---------+------+
|   UDP   | IB BTH  | IB IMM  | Payload | ICRC |
+---------+---------+---------+---------+------+</artwork>
          </figure>
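
          <t>The NetReduce header information named above (ring, message,
          and rank information) can be sketched as a fixed-layout record.
          The field names, widths, and order below are assumptions for
          illustration only, not the published wire format.</t>

          <figure>
            <artwork>
```python
import struct

# Hypothetical encoding of a NetReduce-style header. The fields follow
# the draft's description (ring, message, and rank information), but
# the widths and order are assumptions, not the published format.
NETREDUCE_HDR = struct.Struct("!IIHH")   # ring_id, msg_id, rank, n_workers

def pack_hdr(ring_id, msg_id, rank, n_workers):
    return NETREDUCE_HDR.pack(ring_id, msg_id, rank, n_workers)

def unpack_hdr(raw):
    ring_id, msg_id, rank, n_workers = NETREDUCE_HDR.unpack(raw)
    return {"ring_id": ring_id, "msg_id": msg_id,
            "rank": rank, "n_workers": n_workers}

hdr = pack_hdr(ring_id=7, msg_id=42, rank=1, n_workers=3)
fields = unpack_hdr(hdr)   # round-trips the four fields
```
</artwork>
          </figure>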

          <t>The design objective of <xref target="Cepheus"/> is to offload
          Broadcast operations. Multiple receivers first send RDMA-related
          information, such as the Queue Pair (QP) number and destination
          address, to the sender host for registration, and a Multicast
          Forwarding Tree (MFT) is built on this information. Intermediate
          switches make decisions based on their downstream connectors. If a
          leaf switch is directly connected to the receiver host, it works
          as an RDMA bridge by modifying data packets. In this way,
          multicast is performed in the forward direction, and
          acknowledgement signals are aggregated in the reverse direction to
          realize reliability. This kind of implementation incurs fewer
          modifications to native RDMA and has better compatibility.</t>

          <t>Figure 6 shows the Cepheus Multicast Registration Protocol
          (MRP). Before the Broadcast operations start, the source of the
          multicast propagates the MRP into the entire fabric to install a
          multicast forwarding table in each switch and build an MFT. The
          receiver's RDMA information, i.e., the QP number and destination
          IP address, is predefined before the MFT is set up. During
          multicast there is no real RDMA communication; switches that are
          directly connected to receivers modify the packet header to make
          the logical RDMA connection complete.</t>

          <figure align="center"
                  title="Cepheus Multicast Registration Protocol">
            <artwork>             Cepheus MRP
+---------+----------+--------------+
|   UDP   | Metadata | Node Payload |
+---------+----------+--------------+

              Metadata
+---------------+---------------+
|  Total | Seq  |  Node Numbers |
+---------------+---------------+

            Node Payload
+----------+---------+---------+
| Node QPN | Node IP | Reserve |
+----------+---------+---------+</artwork>
          </figure>
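
          <t>The Node Payload of Figure 6 can likewise be sketched as a
          fixed-layout record carrying a receiver's QP number and IP
          address. The field widths below are assumptions for illustration,
          not the published Cepheus encoding.</t>

          <figure>
            <artwork>
```python
import socket
import struct

# Hypothetical packing of a Cepheus MRP node payload entry (Figure 6):
# a receiver's QP number, its IPv4 address, and a reserved field. The
# widths are assumptions for illustration.
NODE = struct.Struct("!I4sI")   # qpn, ipv4 address, reserved

def pack_node(qpn, ip):
    return NODE.pack(qpn, socket.inet_aton(ip), 0)

def unpack_node(raw):
    qpn, ip_raw, _reserved = NODE.unpack(raw)
    return {"qpn": qpn, "ip": socket.inet_ntoa(ip_raw)}

entry = pack_node(qpn=17, ip="10.0.0.5")
node = unpack_node(entry)   # recovers the QPN and IP for MFT setup
```
</artwork>
          </figure>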

          <t>In a RoCEv2 network there is also a server-to-switch mode, in
          which switches implement the RDMA protocol stack and workers
          establish RDMA connections with switches. This approach is similar
          to InfiniBand SHARP, but based on Ethernet. Due to capacity
          limitations, network devices do not need to support the complete
          RDMA transport protocol. As with SHARP, the shortcoming of this
          mode is that it requires network devices to support RDMA.</t>
        </section>

        <section title="iWARP">
          <t>iWARP <xref target="RFC5040"/> is another RDMA scheme for
          Ethernet, based on the TCP protocol. Like RoCEv2, iWARP uses
          InfiniBand Verbs to interact with applications. RDMAP (Remote
          Direct Memory Access Protocol) provides RDMA semantic support for
          upper-layer requests such as RDMA Send, RDMA Read, and RDMA Write.
          DDP (Direct Data Placement Protocol) implements the zero-copy
          function: a DDP packet contains information describing the target
          memory area, so hardware can move the data in a DDP packet
          directly to its destination in memory through DMA, based on the
          control information in the packet, without involving the CPU. MPA
          (Marker PDU Aligned Framing) is responsible for adding control
          information to the TCP stream at the sending end according to a
          defined algorithm, so that the receiving end can recognize the
          boundaries of DDP packets in the stream.</t>

          <figure align="center" title="iWARP Protocol">
            <artwork>+--------------+--------------+--------------+-----------------+------------+----------+
|  TCP Header  |  MPA Header  |  DDP Header  |   RDMA Header   |  Payload   |  MPA CRC |
+--------------+--------------+--------------+-----------------+------------+----------+</artwork>
          </figure>
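
          <t>The MPA framing described above can be illustrated with a toy
          marker-insertion routine. MPA (RFC 5044) places a marker every 512
          octets of the TCP stream so that the receiver can relocate DDP
          packet boundaries; this sketch uses a placeholder marker value
          rather than the real back-pointer format.</t>

          <figure>
            <artwork>
```python
# Toy MPA-style marker insertion: a marker is placed after every 512
# octets of payload so a receiver can resynchronize on DDP boundaries.
# Real MPA markers carry a pointer back to the preceding FPDU header;
# a zero placeholder is used here.
MARKER_INTERVAL = 512
MARKER = b"\x00\x00\x00\x00"   # placeholder marker value

def insert_markers(stream):
    out = bytearray()
    for offset in range(0, len(stream), MARKER_INTERVAL):
        out += stream[offset:offset + MARKER_INTERVAL]
        out += MARKER
    return bytes(out)

framed = insert_markers(b"A" * 1024)
# 1024 payload octets gain two 4-octet markers.
```
</artwork>
          </figure>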

          <t>Because TCP ensures ordered packet delivery and transmission
          reliability, iWARP can adapt to larger network scales than RoCEv2,
          but its performance is lower. Because of the high cost of
          offloading the complete TCP/IP stack to hardware and the
          resource-intensive maintenance of TCP protocol state, the use of
          iWARP is not as widespread as RoCEv2.</t>

          <t>In server-to-server NACA based on iWARP, any in-network change
          to the payload may be considered an interruption of the flow, and
          any packet loss must be retransmitted. The transport-layer
          mechanisms are too complex and difficult to modify.</t>

          <t>In server-to-switch NACA based on iWARP, due to resource
          limitations, network devices do not need to implement a complete
          protocol stack. It is necessary to clarify which parts of the
          existing protocols must be implemented. Meanwhile, if network
          devices maintain TCP connections, they need to manage their
          resources carefully.</t>
        </section>
      </section>
    </section>

    <section title="Requirements">
      <section title="NACA Function Design and Header Definition">
        <t>NACA offloads collective operations that involve low-precision
        computation and heavy I/O communication to network devices. Network
        devices then not only perform packet routing and forwarding, but
        also need to process collective messages. Therefore, NACA functions
        should be designed so that network devices can distinguish and
        process different traffic. Accordingly, an NACA header should be
        designed above the transport layer to provide the mapping between
        packets and collective messages. The following requirements are
        proposed to support collective communication optimization:</t>

        <t>R1: MUST define an NACA header to indicate which collective
        operations switches need to offload, together with relevant
        information, for example, message ID, sequence number, and job
        ID.</t>

        <t>R2: SHOULD support a fallback mechanism, in case network devices
        are not capable of processing complete collective operations.</t>
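
        <t>One possible shape for the header that R1 calls for is sketched
        below. The opcode values, field names, and widths are hypothetical;
        they only illustrate the kind of mapping information such a header
        would carry.</t>

        <figure>
          <artwork>
```python
import struct

# Hypothetical NACA header: an opcode naming the offloaded collective
# operation plus the identifiers listed in R1. All values and widths
# here are assumptions for illustration.
OPCODES = {"allreduce": 1, "broadcast": 2, "scatter": 3}
NACA_HDR = struct.Struct("!BIIH")   # opcode, job_id, msg_id, seq

def pack_naca(op, job_id, msg_id, seq):
    return NACA_HDR.pack(OPCODES[op], job_id, msg_id, seq)

def unpack_naca(raw):
    opcode, job_id, msg_id, seq = NACA_HDR.unpack(raw)
    names = {v: k for k, v in OPCODES.items()}
    return {"op": names[opcode], "job_id": job_id,
            "msg_id": msg_id, "seq": seq}

hdr = pack_naca("allreduce", job_id=9, msg_id=100, seq=5)
fields = unpack_naca(hdr)   # a switch would branch on fields["op"]
```
</artwork>
        </figure>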
      </section>

      <section title="Bridge RDMA Transport Semantics">
        <t>As explained in previous sections, the major gap between native
        RDMA and NACA lies in the transport semantics. Mechanisms for
        transport semantic bridging are needed in order to combine the
        high-performance transport capability of RDMA with NACA
        functionality. Besides, NACA may not need the full functionality of
        native RDMA, and it is not practical to implement full RDMA
        functionality in switches because of limited hardware resources. For
        example, most RDMA-based NACA solutions only call the RDMA Read,
        Write, Send, and Receive operations. Accordingly, the following
        requirements need to be met:</t>

        <t>R3: The transport layer MUST support RDMA functions.</t>

        <t>R4: SHOULD allow for the different RDMA communication modes for
        NACA described in Section 3.</t>

        <t>R5: In server-to-switch mode, SHOULD clarify which subset of RDMA
        functions the switch supports, in order to establish an RDMA
        connection with the server and complete NACA.</t>
      </section>

      <section title="RDMA Transport Related Issues">
        <t>As it has been analyzed in section 3 that IWARP solutions can not
        work well with NACA, because it builds RDMA functions on top of TCP
        which are too complex to implement in switches. The most promising
        solution is RDMA over UDP, for example, RoCEv2. However, native RoCEv2
        has several limitations and can not work very well with NACA in large
        scale clusters. These limitations are reflected in the mechanisms of
        reliability, flow control, and congestion control. For reliability,
        go-back-n packet retransmission is low efficient, and it may incur
        much buffer occupancy in NACA switches. Priority Flow Control(PFC)
        also has high requirement for buffer space, and for Many-to-one
        collective operations, PFC will take up even more buffer space. As for
        congestion control, there are lots of algorithms and not all of them
        work well with NACA. A common congestion control mechanism need to be
        designed. Thus, there are following requirements:</t>
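
        <t>The gap between go-back-N and selective retransmission can be
        shown with a toy comparison; real RoCEv2 recovery involves NAKs and
        timers, which are omitted here.</t>

        <figure>
          <artwork>
```python
# Toy retransmission-cost comparison after one loss in a window of
# in-flight packets: go-back-N resends everything from the first gap
# onward, while selective retransmission resends only the lost packet.

def go_back_n_resend(window, lost):
    first_gap = min(lost)
    return window[window.index(first_gap):]

def selective_resend(window, lost):
    return sorted(lost)

window = list(range(10))                 # PSNs 0..9 in flight
lost = [2]                               # one packet dropped
gbn = go_back_n_resend(window, lost)     # resends PSNs 2 through 9
sr = selective_resend(window, lost)      # resends PSN 2 only
```
</artwork>
        </figure>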

        <t>R6: NACA MUST be designed with reliability, and the reliability
        mechanism of RoCEv2 SHOULD be modified to be more efficient.</t>

        <t>R7: Flow control SHOULD be optimized in order to save more buffer
        and memory space for NACA functions.</t>

        <t>R8: The congestion control of NACA SHOULD work compatibly with
        other congestion control mechanisms applied for other network traffic
        that runs in the same fabric.</t>
      </section>

      <section title="Joint Design of NACA Task Assignment and Routing Policies">
        <t>Since AI model training tasks usually follow a predefined rule that
        task as well as the training group are settle, and once the training
        starts, there will be no more new comers to join the group. On basis
        of this, NACA task assignment usually follows a centralized pattern.
        For example, NACA support Allreduce by following Aggregation tree, and
        support broadcast by building a multicast forwarding tree. While some
        routing policies may follow distributed patterns. For example,
        Adaptive Routing(AR) selects the optimal path at each network node
        distributedly. These solutions may not co-exist with each other. In
        order to better balance traffic management and task assignment:</t>

        <t>R9: NACA task assignment SHOULD be co-designed with routing
        policies for joint optimization.</t>
      </section>

      <section title="Security and Traffic Isolation">
        <t>In multi-tenant deployments, a single switch may need to perform
        different NACA functions while also forwarding normal traffic.
        Since the NACA header contains collective operation metadata and
        payload parameters, severe security issues will arise if the switch
        logic designed for NACA is incorrectly applied to normal traffic.
        Hence, the security requirements are as follows:</t>

        <t>R10: Resources MUST be isolated on switches to ensure that
        different tasks do not interfere with each other and that NACA
        functions do not operate on normal traffic.</t>
      </section>

      <section title="Fault Tolerance">
        <t>Fault tolerance is required since a single network device may go
        out of service, due to either single-point failure or link
        breakdown. Therefore:</t>

        <t>R11: A mechanism for choosing an alternative node to implement
        NACA functions MUST be designed, to ensure system robustness and
        reliability.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>Some security concerns have been described in <xref
      target="I-D.yao-tsvwg-cco-problem-statement-and-usecases"/>.</t>
    </section>

    <section title="Operational Considerations">
      <t>Use cases like AI model training, distributed storage, and big
      data analysis usually need their infrastructure to be deployed in
      clusters operated by single entities, for example, limited domains
      <xref target="RFC8799"/>. In this case, not only the compute and
      network infrastructure but also the application may be owned by a
      single service provider. These use cases are typically
      performance-driven, which means they need the application and the
      infrastructure to be co-designed to reach optimal performance.
      However, applications need not be co-designed with the underlying
      network protocols case by case. As long as consensus can be reached
      across vendors on the definition and realization of the collective
      operations to be offloaded, such as unified primitives for
      implementing collective communication, applications can leverage a
      standardized northbound API to improve performance, even if the
      applications do not belong to the same service provider.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.5040"?>

      <?rfc include="reference.RFC.8799"?>

      <?rfc include="reference.I-D.yao-tsvwg-cco-problem-statement-and-usecases"?>
    </references>

    <references title="Informative References">
      <reference anchor="NetReduce">
        <front>
          <title>In-Network Aggregation with Transport Transparency for
          Distributed Training</title>

          <author fullname="Shuo Liu" initials="S." surname="Liu">
            <organization>Huawei</organization>
          </author>

          <date year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3582016.3582037"/>
      </reference>

      <reference anchor="Cepheus">
        <front>
          <title>Cepheus: Accelerating Datacenter Applications with
          High-Performance RoCE-Capable Multicast</title>

          <author fullname="Wenxue Li" initials="W." surname="Li">
            <organization>HKUST</organization>
          </author>

          <author fullname="Junyi Zhang" initials="J." surname="Zhang">
            <organization>Huawei</organization>
          </author>

          <date year="2024"/>
        </front>
      </reference>

      <reference anchor="SHARP">
        <front>
          <title>Scalable Hierarchical Aggregation and Reduction Protocol
          (SHARP): A Hardware Architecture for Efficient Data
          Reduction</title>

          <author fullname="Richard L. Graham" initials="R. L."
                  surname="Graham">
            <organization>Mellanox Technologies, Inc.</organization>
          </author>

          <date year="2016"/>
        </front>

        <seriesInfo name="DOI" value="10.1109/COMHPC.2016.006"/>
      </reference>
    </references>
  </back>
</rfc>
