<?xml version="1.0" encoding="US-ASCII"?>

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<!-- used by XSLT processors -->
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt'?>
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->

<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std"
     xmlns:xi="http://www.w3.org/2001/XInclude"
     docName="draft-ietf-bess-evpn-fast-df-recovery-07"
     updates="8584"
     consensus="true"
     submissionType="IETF"
     ipr="trust200902">

 <!-- ***** FRONT MATTER ***** -->

 <front>
   <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->
   <title abbrev="Fast Recovery for EVPN DF-Election">Fast Recovery for EVPN Designated Forwarder Election</title>

   <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->
  <author fullname="Patrice Brissette" initials="P." surname="Brissette" role="editor">
     <organization>Cisco</organization>
     <address>
       <email>pbrisset@cisco.com</email>
     </address>
   </author>

   <author fullname="Ali Sajassi" initials="A." surname="Sajassi">
     <organization>Cisco</organization>
     <address>
       <email>sajassi@cisco.com</email>
     </address>
   </author>

  <author fullname="Luc Andre Burdet" initials="LA." surname="Burdet">
     <organization>Cisco</organization>
     <address>
       <email>lburdet@cisco.com</email>
     </address>
   </author>

  <author fullname="John Drake" initials="J." surname="Drake">
     <organization>Juniper</organization>
     <address>
       <email>jdrake@juniper.net</email>
     </address>
   </author>

  <author fullname="Jorge Rabadan" initials="J." surname="Rabadan">
     <organization>Nokia</organization>
     <address>
       <email>jorge.rabadan@nokia.com</email>
     </address>
   </author>

   <date year="2023" />

   <!-- Meta-data Declarations -->
   <area>General</area>
   <workgroup>BESS Working Group</workgroup>

   <!-- WG name at the upperleft corner of the doc,
        IETF is fine for individual submissions. 
        If this element is not present, the default is "Network Working Group",
        which is used by the RFC Editor as a nod to the history of the IETF. -->

   <keyword>EVPN</keyword>
   <keyword>Designated Forwarder</keyword>
   <keyword>Convergence</keyword>
   <keyword>Recovery</keyword>

   <abstract>
     <t>The Ethernet Virtual Private Network (EVPN) solution provides
     Designated Forwarder (DF) election procedures for multihomed Ethernet Segments. These
     procedures have been enhanced further by applying the Highest
     Random Weight (HRW) algorithm for DF election
     in order to avoid unnecessary DF status changes upon a failure.
     This document improves these procedures by providing a fast DF
     election upon recovery of the failed link or node associated
     with the multihomed Ethernet Segment. The solution is
     independent of the number of EVPN Instances (EVIs) associated with that Ethernet
     Segment, and it is performed via simple signaling between the
     recovered PE and each of the other PEs in the multihoming group.</t>
   </abstract>

   <note title="Requirements Language">
      <t> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref target="RFC2119">RFC 2119</xref>
      and <xref target="RFC8174">RFC 8174</xref>. 
      </t>
    </note>
   
 </front>

 <middle>
   <section anchor="intro" title="Introduction">
     <t>The Ethernet Virtual Private Network (EVPN) solution <xref target="RFC7432"/> is
     becoming pervasive in data center (DC) applications for Network
     Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
     in service provider (SP) applications for next generation virtual
     private LAN services.</t>
     
     <t><xref target="RFC7432"/> describes DF election procedures for
     multihomed Ethernet Segments. These procedures are enhanced further in
     <xref target="RFC8584"/> by applying the Highest Random Weight (HRW) algorithm for DF
     election in order to avoid unnecessary DF status changes upon a link
     or node failure associated with the multihomed Ethernet Segment.
     This document makes further improvements to the DF election procedures in
     <xref target="RFC8584"/> by providing an option for a fast DF election upon
     recovery of the failed link or node associated with the multihomed
     Ethernet Segment. This DF election is achieved independent of the number
     of EVIs associated with that Ethernet Segment, and it is performed via
     simple one-way signaling between the recovered PE and each of the other PEs
     in the multihoming group.</t>
     
     <section anchor="terminology" title="Terminology">
        <t>
         <dl>
           <dt>Designated Forwarder (DF):</dt><dd>A PE that is currently forwarding
           (encapsulating/decapsulating) traffic for a given VLAN in and out of
           a site.</dd>
         </dl>
	</t>
     </section>

   <section anchor="challenges" title="Challenges with Existing Solution">
        <t>In EVPN technology, multiple PE devices have the ability to encapsulate and
        decapsulate data belonging to the same VLAN. In certain situations, this
        may cause L2 duplicates and even loops if there is a momentary
        overlap of forwarding roles between two or more PE devices, leading
        to broadcast storms.</t>

        <t>EVPN <xref target="RFC7432"/> currently uses timer-based synchronization among PE
        devices in a redundancy group. This can result in duplicates (and even
        loops) because of multiple DFs if the timer is too short, or in
        blackholing if the timer is too long.</t>

        <t>Using split-horizon filtering (<relref target="RFC7432" section="8.3"/>)
        can prevent loops (but not duplicates). However, if there are overlapping DFs in two
        different sites at the same time for the same VLAN, the site
        identifier will be different upon re-entry of the packet and hence
        the split-horizon check will fail, leading to L2 loops.</t>

        <t>The updated DF procedures in <xref target="RFC8584"/> use the well-known
        Highest Random Weight&nbsp;(HRW) algorithm to avoid reshuffling of VLANs among
        PE devices in the redundancy group upon failure/&wj;recovery. This
        reduces the impact to VLANs not assigned to the failed/&wj;recovered ports
        and eliminates loops or duplicates at failure/&wj;recovery events.</t>

        <t>However, upon PE insertion or a port being newly added to a multihomed Ethernet Segment,
        HRW cannot help either, because the DF role must be transferred to the new port
        while the old DF is still active.</t>

        <figure anchor="topology" title="CE1 multihomed to PE1 and PE2.">
         <artwork><![CDATA[
                                  +---------+
               +-------------+    |         |
               |             |    |         |
             / |    PE1      |----|         |   +-------------+
            /  |             |    |  MPLS/  |   |             |---CE3
           /   +-------------+    |  VxLAN/ |   |     PE3     |
      CE1 -                       |  Cloud  |   |             |
           \   +-------------+    |         |---|             |
            \  |             |    |         |   +-------------+
             \ |     PE2     |----|         |
               |             |    |         |
               +-------------+    |         |
                                  +---------+
    ]]>
	</artwork></figure>

        <t>In <xref target="topology"/>, when PE2 is inserted or booted up, PE1 will transfer
        the DF role of some VLANs to PE2 to achieve load balancing. However,
        because there is no handshake mechanism between PE1 and PE2,
        duplication of DF roles for a given VLAN is possible. Duplication of
        DF roles may eventually lead to duplication of
        traffic as well as L2 loops.</t>

        <t>Current EVPN specifications <xref target="RFC7432"/> and <xref target="RFC8584"/>
        rely on a timer-based approach for transferring the DF role to the newly inserted device.
        This can cause the following issues:

        <ul>
            <li>Loops/Duplicates if the timer value is too short</li>
            <li>Prolonged Traffic Blackholing if the timer value is too long</li>
        </ul>
        </t>
   </section>

         
      <section anchor="advantages" title="Advantages to Proposed Solution">

        <t>There are multiple advantages to using the proposed clock-synchronization approach,
        namely:
        <ul>
          <li>Simple uni-directional signaling is all that is needed; no complicated handshake or
          state machine is required.</li>
          <li>Solution is backwards-compatible: PEs supporting only older
          <xref target="RFC7432"/> shall simply discard the unrecognized new "Service
          Carving Timestamp" BGP Extended Community.</li>
          <li>Many of the existing DF Election algorithms can be supported:
              <ul>
                <li><xref target="RFC7432"/> default ordered list ordinal algorithm (Modulo),</li>
                <li><xref target="RFC8584"/> highest-random weight, etc.</li>
            </ul>
          </li>
          <li>Solution is independent of any BGP propagation delay of Ethernet Segment route (Route
          Type 4)</li>
          <li>Solution is agnostic of the actual time synchronization mechanism used (e.g. NTP, PTP, etc.), while
          normalizing the exchange format in an NTP-based encoding.</li>
        </ul>
        </t>

      </section>

   </section>


   <section anchor="sync" title="DF Election Synchronization Solution">

      <t>The solution relies on the concept of common clock alignment between partner PEs participating
      in a common Ethernet Segment, i.e., PE1 and PE2 in <xref target="topology"/>. The main idea is to have all peering PEs of that
      Ethernet Segment perform DF election, and apply their resulting carving state,
      at the same pre-announced time.</t>
      
      <t>The DF Election procedure, as described in <xref target="RFC7432"/> and as optionally
      signalled in <xref target="RFC8584"/>, is applied.
      All PEs attached to a given Ethernet Segment are clock-synchronized
      using a networking protocol for clock synchronization (e.g., NTP or PTP).
      When a new PE is inserted or an existing PE recovers, that PE
      communicates to its peering partners the current time plus the time
      remaining on its peering timer. This constitutes an "end time" or "absolute time" as seen from
      the local PE. That absolute time is called the "Service Carving Time" (SCT).</t>

      <t>A new BGP Extended Community, the Service Carving Timestamp, is advertised along with the Ethernet Segment route (RT-4) to
      communicate the Service Carving Time to the other partners.</t>

      <t>Upon reception of that new BGP Extended Community, partner PEs can determine
      exactly the anticipated carving time. The notion of skew is introduced to
      eliminate any potential duplicate traffic or loops. The receiving partner PEs add a skew
      (default = -10ms) to the Service Carving Time to enforce this.
      The previously inserted PE(s) must carve first, followed shortly (skew) by
      the newly inserted PE.</t>
      
      <t>To summarize, all peering PEs carve almost simultaneously at the time
      announced by the newly added/recovered PE. The newly inserted PE initiates the SCT,
      and carves immediately on peering timer expiry.
      The previously inserted PE(s), on receiving an Ethernet Segment route (RT-4) with an SCT BGP extended community,
      carve shortly before the Service Carving Time.</t>
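
      <t>The timing relationship summarized above can be sketched as follows. This is
      only an illustrative sketch, not a normative procedure: the function names are
      hypothetical, times are expressed as integer milliseconds on a clock assumed to be
      shared by all PEs, and the skew uses the default of -10ms mentioned above.</t>
      <figure><artwork><![CDATA[
```python
DEFAULT_SKEW_MS = -10  # default skew from the text above (-10 ms)


def service_carving_time_ms(now_ms: int, peering_timer_remaining_ms: int) -> int:
    """SCT announced by the recovering PE: now plus remaining peering timer."""
    return now_ms + peering_timer_remaining_ms


def partner_carving_time_ms(sct_ms: int, skew_ms: int = DEFAULT_SKEW_MS) -> int:
    """Carving time applied by a partner PE that received the SCT."""
    return sct_ms + skew_ms


# Hypothetical values: PE2 advertises at t=100 s with a 3 s peering timer.
sct_ms = service_carving_time_ms(now_ms=100_000, peering_timer_remaining_ms=3_000)
pe1_carve_ms = partner_carving_time_ms(sct_ms)
print(sct_ms, pe1_carve_ms)
```
]]></artwork></figure>
      <t>With these assumed values, the recovering PE carves at its peering timer expiry
      (the SCT), and the partner PE carves 10 ms earlier.</t>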

      <section anchor="ntpencoding" title="BGP Encoding">
        <t>A new BGP extended community needs to be defined to communicate the
        Service Carving Timestamp for each Ethernet Segment.</t>

        <t>A new transitive extended community, where the Type field is 0x06 and
        the Sub-Type is 0x0F, is advertised along with the Ethernet
        Segment route. The expected Service Carving Time is encoded in this
        8-octet extended community as follows:
	
        <figure><artwork><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~  Timestamp Seconds            | Timestamp Fractional Seconds  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
            ]]>
        </artwork></figure>
        </t>

        <t> 
        The timestamp exchanged uses the NTP epoch of January 1, 1900 <xref target="RFC5905"/>.
        The 64-bit NTP timestamp consists of a 32-bit part for seconds and a 32-bit
        part for fractional seconds:
        <ul>
        <li>Timestamp Seconds: 32-bit NTP seconds are encoded in this field.</li>
        <li>Timestamp Fractional Seconds: the most significant 16 bits of the NTP fractional seconds
        are encoded in this field. The use of 16-bit fractional seconds yields an adequate precision
        of approximately 15 microseconds (2^-16 s).</li>
        </ul>
        </t>
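
        <t>As an illustration of the layout above, the following sketch packs and unpacks
        the 8-octet extended community. It is a sketch under stated assumptions rather than
        a reference implementation: the function names are hypothetical, the input is assumed
        to be a Unix time in seconds, the most significant 16 bits of the NTP fraction
        are kept, and decoding assumes NTP era 0.</t>
        <figure><artwork><![CDATA[
```python
import struct

NTP_UNIX_EPOCH_DELTA = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01
EC_TYPE_EVPN = 0x06
EC_SUBTYPE_SCT = 0x0F  # sub-type value requested by this document


def encode_sct(unix_time_s: float) -> bytes:
    """Pack Type, Sub-Type, 32-bit NTP seconds and 16-bit fraction."""
    ntp = unix_time_s + NTP_UNIX_EPOCH_DELTA
    seconds = int(ntp) & 0xFFFFFFFF               # low 32 bits of NTP seconds
    frac16 = int((ntp - int(ntp)) * 65536) & 0xFFFF  # units of 2^-16 s
    return struct.pack("!BBIH", EC_TYPE_EVPN, EC_SUBTYPE_SCT, seconds, frac16)


def decode_sct(ec: bytes) -> float:
    """Return the carving time as Unix seconds (assumes NTP era 0)."""
    ec_type, sub_type, seconds, frac16 = struct.unpack("!BBIH", ec)
    if (ec_type, sub_type) != (EC_TYPE_EVPN, EC_SUBTYPE_SCT):
        raise ValueError("not a Service Carving Timestamp extended community")
    return seconds + frac16 / 65536 - NTP_UNIX_EPOCH_DELTA


ec = encode_sct(1_700_000_000.5)  # hypothetical carving time
print(ec.hex(), decode_sct(ec))
```
]]></artwork></figure>
        <t>The fraction is truncated rather than rounded here, so a decoded value can be
        earlier than the original by less than 2^-16 s.</t>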


        <t>This document introduces a new flag called "T" (for Time
        Synchronization) to the bitmap field of the DF Election Extended
        Community defined in <xref target="RFC8584"/>. 
	
        <figure><artwork><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
| Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~     Bitmap    |            Reserved = 0                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ]]>
        </artwork></figure>
        </t>

        <t>
        <ul>
        <li>Bit 3: Time Synchronization (corresponds to Bit 27 of the DF Election Extended
         Community). When set
        to 1, it indicates the desire to use Time Synchronization capability
        with the rest of the PEs in the Ethernet Segment.</li>
        </ul>
        </t>
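
        <t>The bit position can be illustrated with a short sketch. It assumes that the
        bitmap is a 2-octet field whose Bit 0 is the most significant bit and which starts at
        bit 24 of the extended community; the helper names are hypothetical.</t>
        <figure><artwork><![CDATA[
```python
BITMAP_OFFSET_IN_EC = 24  # assumed: bitmap starts at bit 24 of the community
BITMAP_WIDTH = 16         # assumed: 2-octet bitmap, Bit 0 is the MSB
T_BIT = 3                 # Time Synchronization capability (this document)


def bitmap_mask(bit: int) -> int:
    """Mask for a capability bit, counting from the most significant bit."""
    return 1 << (BITMAP_WIDTH - 1 - bit)


def set_capability(bitmap: int, bit: int) -> int:
    return bitmap | bitmap_mask(bit)


def has_capability(bitmap: int, bit: int) -> bool:
    return bool(bitmap & bitmap_mask(bit))


bitmap = set_capability(0, T_BIT)
print(hex(bitmap), has_capability(bitmap, T_BIT), BITMAP_OFFSET_IN_EC + T_BIT)
```
]]></artwork></figure>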

        <t>
        This capability is used in conjunction with the agreed-upon DF Type (DF Election Type).
        For example, if all the PEs in the Ethernet Segment indicate having Time
        Synchronization capability and are requesting the DF type to be HRW, then
        the HRW algorithm is used in conjunction with this capability.</t>

      </section>

      <section anchor="fsm_8584" title="Updates to RFC8584">
        <t>This document introduces an additional delay to the events and
        transitions defined for the default DF election algorithm FSM in
        <relref target="RFC8584" section="2.1"/> without changing the FSM states or events themselves.</t>
 
        <t>The FSM of a peering PE in DF_DONE that receives an RCVD_ES event transitions to DF_CALC. Because
        of the SCT carried in the Ethernet Segment route update, the output of DF_CALC and the transition
        back into DF_DONE are delayed, as are the accompanying forwarding updates to DF/NDF state.</t>

        <t>The corresponding actions when transitions are performed or states are
        entered/exited are modified as follows:</t>
        <ol start="9">
        <li>DF_CALC on CALCULATED: Mark the election result for the VLAN or
        Bundle.
        <ol type="9.%d">
        <li>Where an SCT timestamp is present on the RCVD_ES event of Action 11,
        wait the remaining time until the SCT before continuing to 9.2.</li>
        <li>Assume a DF/NDF for the local PE for the VLAN or VLAN Bundle,
        and transition to DF_DONE.</li>
        </ol>
        </li>
 
        <li>DF_DONE on exiting the state: If a new DF election is triggered
        and the current DF is lost, then assume an NDF for the local PE
        for the VLAN or VLAN Bundle.</li>
 
        <li>DF_DONE on VLAN_CHANGE, RCVD_ES, or LOST_ES: Transition to
        DF_CALC.</li>
        </ol>
        </section>


      </section>


      <section anchor="example" title="Synchronization Scenarios">

        <t>Consider <xref target="topology"/> as an example, where initially PE2 had failed and
        PE1 had taken over. This example shows the problem with the DF&nbhy;Election mechanism in <xref target="RFC7432"/>.</t>

        <t>Based on <relref target="RFC7432" section="8.5"/>, using the default 3 second peering timer:
        <ol>
          <li>Initial state: PE1 is in steady-state, PE2 is recovering</li>
          <li>PE2 recovers at (absolute) time t=99</li>
          <li>PE2 advertises RT-4 (sent at t=100) to partner PE1</li>
          <li>PE2 starts a 3 second peering timer</li>
          <li>PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal BGP
          propagation delay</li>
          <li>PE2 carves at time t=103</li>
        </ol>
        </t>
            
        <t><xref target="RFC7432"/> aims to favour traffic blackholing over duplicate traffic.
        With the above procedure, traffic blackholing will occur as part of each PE recovery sequence,
        since PE1 has transitioned some VLANs to Non-Designated Forwarder (NDF) immediately upon
        reception.<br/>
        The peering timer value (default = 3 seconds) has a direct effect on the duration of the blackholing.
        A shorter (especially zero) peering timer may, however, result in duplicate traffic or traffic loops.</t>
	
        <t>Based on the Service Carving Time (SCT) approach:
        <ol>
          <li>Initial state: PE1 is in steady-state, PE2 is recovering</li>
          <li>PE2 recovers at (absolute) time t=99</li>
          <li>PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
          partner PE1</li>
          <li>PE2 starts 3 second peering timer</li>
          <li>PE1 starts service carving timer, with remaining time until t=103</li>
          <li>Both PE1 and PE2 carve at (absolute) time t=103</li>
        </ol>
        </t>

        <t>
        In fact, PE1 should carve slightly before PE2 (skew) to maintain the preference for minimal loss
        over duplicate traffic. The recovering PE2
        performs both the DF to NDF and the NDF to DF transitions for its VLANs at peering timer expiry.
        Since the goal is to prevent duplicates, the original PE1, which received the SCT, will apply:
        <ul>
          <li>DF to NDF transition at t=SCT minus skew, where both PEs are NDF for 'skew' amount of time</li>
          <li>NDF to DF transition at t=SCT</li>
        </ul>
        It is this split behaviour that ensures a clean transfer of the DF role with a contained amount of loss.
	</t>
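
        <t>The split behaviour on the PE that received the SCT can be sketched as below.
        This is a hedged illustration only: the function name, the per-VLAN dictionaries of
        elected DFs, and the example VLAN numbers are hypothetical, and times are integer
        milliseconds with the skew given as a positive magnitude.</t>
        <figure><artwork><![CDATA[
```python
def schedule_transitions(old_df, new_df, local_pe, sct_ms, skew_ms=10):
    """Return sorted (time_ms, vlan, action) tuples for the local PE.

    old_df / new_df map each VLAN to the PE elected DF before and
    after the election. DF->NDF is applied at SCT minus skew and
    NDF->DF at SCT, so roles never overlap during the handover.
    """
    events = []
    for vlan in sorted(old_df.keys() | new_df.keys()):
        was_df = old_df.get(vlan) == local_pe
        will_be_df = new_df.get(vlan) == local_pe
        if was_df and not will_be_df:
            events.append((sct_ms - skew_ms, vlan, "DF->NDF"))
        elif will_be_df and not was_df:
            events.append((sct_ms, vlan, "NDF->DF"))
    return sorted(events)


# Hypothetical election results around an SCT of t=103 s.
old_df = {10: "PE1", 11: "PE1", 12: "PE3"}
new_df = {10: "PE1", 11: "PE2", 12: "PE1"}
events = schedule_transitions(old_df, new_df, "PE1", sct_ms=103_000)
print(events)
```
]]></artwork></figure>
        <t>VLANs whose DF does not change (VLAN 10 in this example) produce no event,
        so only the affected VLANs see the brief NDF/NDF window.</t>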
	
        <t>Using the SCT approach, the negative effect of the peering timer is mitigated.
        Furthermore, the BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to PE1)
        becomes a non-issue.
        The SCT approach thus remedies the problem associated with the peering timer: the 3 second
        timer window is shortened to the order of milliseconds.</t>
	
	
        <section anchor="concurrent" title="Concurrent Recoveries">
        <t>In the event that two or more PEs in a peering Ethernet Segment group are recovering
        concurrently or at roughly the same time, each will advertise a Service Carving Timestamp.
        This SCT value corresponds to what each recovering PE considers the "end time" for DF
        Election. A similar situation arises with staggered recoveries, when a second PE recovers at roughly
        the first PE's advertised SCT expiry and advertises its own new SCT-2 outside of the initial SCT
        window.</t>
        
        <t>In the case of multiple outstanding DF elections, one requested by each of the recovering
        PEs, the SCTs must simply be time-ordered and all PEs execute only a single DF Election at
        the service carving time corresponding to the largest received timestamp value.
        The DF Election will involve all the active PEs in a single DF Election update.</t>

        <t>Example:
        <ol>
          <li>Initial state: PE1 is in steady-state, all services elected at PE1.</li>
          <li>PE2 recovers at time t=100, advertises RT-4 with target SCT value t=103 to partners
          (PE1)</li>
          <li>PE2 starts 3 second peering timer</li>
          <li>PE1 starts service carving timer, with remaining time until t=103</li>
          <li>PE3 recovers at time t=102, advertises RT-4 with target SCT value t=105 to partners
          (PE1, PE2)</li>
          <li>PE3 starts 3 second peering timer</li>

          <li>PE2 cancels peering timer, starts service carving timer with remaining time until
          t=105</li>
          <li>PE1 updates service carving timer, with remaining time until t=105</li>
          <li>PE1, PE2 and PE3 carve at (absolute) time t=105</li>
        </ol>
        </t>
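
        <t>The rule that a single election runs at the largest received timestamp can be
        sketched as follows. The class and method names are hypothetical, and the sketch
        only tracks the deadline; it does not model timers or the election itself.</t>
        <figure><artwork><![CDATA[
```python
class CarvingDeadline:
    """Tracks outstanding SCTs and exposes the single carving deadline."""

    def __init__(self):
        self._sct_by_pe = {}

    def on_sct(self, pe: str, sct_s: int) -> int:
        """Record the SCT advertised by a recovering PE, return the deadline."""
        self._sct_by_pe[pe] = sct_s
        return self.deadline()

    def deadline(self) -> int:
        """The election runs once, at the largest outstanding SCT."""
        return max(self._sct_by_pe.values())


# Mirrors the example above: PE2 announces t=103, then PE3 announces t=105.
timer = CarvingDeadline()
after_pe2 = timer.on_sct("PE2", 103)
after_pe3 = timer.on_sct("PE3", 105)
print(after_pe2, after_pe3)
```
]]></artwork></figure>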

        </section>
        </section>

        <section anchor="ntpcompat" title="Backwards Compatibility">
          <t>Per redundancy group, for the DF election procedures to be globally
          convergent and unanimous, it is necessary that all the participating
          PEs agree on the DF Election algorithm to be used. It is, however,
          possible that some PEs continue to use the existing modulo-based DF
          election and do not rely on the new SCT BGP extended community. PEs
          running a baseline DF election mechanism will simply discard
          the new SCT BGP extended community as unrecognized.</t>
	  
          <t>A PE can indicate its willingness to support clock-synchronized carving by
          signaling the new 'T' DF Election Capability as well as including the new
          Service Carving Time BGP extended community along with the
          Ethernet Segment Route (Type-4).
          In the case where one or more PEs attached to the Ethernet Segment do not signal T=1,
          all PEs in the Ethernet Segment SHALL revert to the <xref target="RFC7432"/> timer
          approach. This is especially important in the context of the VLAN shuffling with more than
          2 PEs.</t>
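
          <t>The fallback rule can be expressed as a short sketch. The function name and the
          mode labels are hypothetical; the input is assumed to map each PE attached to the
          Ethernet Segment to whether it signalled T=1.</t>
          <figure><artwork><![CDATA[
```python
from typing import Mapping


def carving_mode(t_flag_by_pe: Mapping[str, bool]) -> str:
    """Use SCT-based carving only if every attached PE signalled T=1."""
    if t_flag_by_pe and all(t_flag_by_pe.values()):
        return "sct"
    return "rfc7432-timer"


all_capable = carving_mode({"PE1": True, "PE2": True})
mixed = carving_mode({"PE1": True, "PE2": True, "PE3": False})
print(all_capable, mixed)
```
]]></artwork></figure>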

      </section>

      <section anchor="security" title="Security Considerations">
        <t>The mechanisms in this document use EVPN control plane as defined in
        <xref target="RFC7432"/>. Security considerations described in
        <xref target="RFC7432"/> are equally applicable. This document uses MPLS
        and IP-based tunnel technologies to support data plane transport.
        Security considerations described in <xref target="RFC7432"/> and in 
        <xref target="RFC8365"/> are equally applicable.</t>
      </section>

      <section anchor="IANA" title="IANA Considerations">

        <t>This document solicits the allocation of the following sub-type in the
        "EVPN Extended Community Sub-Types" registry set up by <xref target='RFC7153'/>:
        <figure><artwork><![CDATA[
      0x0F     Service Carving Timestamp    This document
        ]]></artwork></figure>
        </t>

        <t>This document solicits the allocation of the following values in the
        "DF Election Capabilities" registry set up by <xref target='RFC8584'/>:
        <figure><artwork><![CDATA[
      Bit         Name                             Reference
      ----        ----------------                 -------------
      3           Time Synchronization             This document
        ]]></artwork></figure>
        </t>
      </section>
    </middle>

 <!--  *****BACK MATTER ***** -->

<back>
    <!-- References split into informative and normative -->
    <references title="Normative References">
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7153.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7432.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8365.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8584.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.5905.xml"/>

    </references>

    <section anchor="contributors" title="Contributors">
    <t>In addition to the authors listed on the front page, the following co-authors
    have also contributed substantially to this document:</t>
  
    <t>Gaurav Badoni<br/>Cisco</t>
    <t>Email: gbadoni@cisco.com</t>

    <t>Dhananjaya Rao<br/>Cisco</t>
    <t>Email: dhrao@cisco.com</t>
    </section>

    <section anchor="acknowledgements" title="Acknowledgements">
        <t>The authors would like to acknowledge the helpful comments
        and contributions of Satya Mohanty and Bharath Vasudevan.
        Thanks also to Anoop Ghanwani for his thorough review with valuable comments and
        corrections.</t>
    </section>

</back>
</rfc>

