<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2328 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2328.xml">
<!ENTITY RFC2918 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2918.xml">
<!ENTITY RFC4760 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4760.xml">
<!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
<!ENTITY RFC4456 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4456.xml">
<!ENTITY RFC5492 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5492.xml">
<!ENTITY RFC4724 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4724.xml">
<!ENTITY RFC7313 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7313.xml">
<!ENTITY RFC9107 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.9107.xml">
<!ENTITY RFC7752 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7752.xml">
<!ENTITY I-D.ietf-idr-dynamic-cap SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-idr-dynamic-cap.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
     please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
     (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-ietf-idr-bgp-nh-cost-03"
     ipr="trust200902">
  <!-- category values: std, bcp, info, exp, and historic
     ipr values: full3667, noModification3667, noDerivatives3667
     you can add the attributes updates="NNNN" and obsoletes="NNNN" 
     they will automatically be output with "(if approved)" -->

  <!-- ***** FRONT MATTER ***** -->

  <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
         full title is longer than 39 characters -->

    <title abbrev="draft-ietf-idr-bgp-nh-cost">Carrying next-hop cost
    information in BGP</title>

    <!-- add 'role="editor"' below for the editors if appropriate -->

    <!-- Another author who claims to be an editor -->

    <author fullname="Ilya Varlashkin" initials="I.V." surname="Varlashkin">
      <organization>Google</organization>

      <address>
        <email>ilya@nobulus.com</email>
      </address>
    </author>

    <author fullname='Robert Raszuk' initials='R' surname='Raszuk'>
   <organization>NTT Network Innovations</organization>
   <address>
       <postal>
           <street>940 Stewart Dr</street>
           <city>Sunnyvale</city>
           <region>CA</region>
           <code>94085</code>
           <country>USA</country>
       </postal>
       <email>robert@raszuk.net</email>
   </address>
    </author>
    
    <author fullname="Keyur Patel" initials="K."
            surname="Patel">
      <organization>Arrcus, Inc</organization>
      <address>
        <postal>
          <street>2077 Gateway Pl</street>
          <city>San Jose, CA 95110</city>
          <country>USA</country>
          <code>95110</code>
        </postal>
        <phone></phone>
        <email>keyur@arrcus.com</email>
      </address>
    </author>

    <author fullname="Manish Bhardwaj" initials="M."
            surname="Bhardwaj">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street>170 W. Tasman Drive</street>
          <city>San Jose, CA 95124</city>
          <country>USA</country>
          <code>95134</code>
        </postal>
        <phone></phone>
        <email>manbhard@cisco.com</email>
      </address>
    </author>



    <author fullname="Serpil Bayraktar" initials="S."
            surname="Bayraktar">
      <organization>Cisco Systems</organization>
      <address>
        <postal>
          <street>170 W. Tasman Drive</street>
          <city>San Jose, CA 95124</city>
          <country>USA</country>
          <code>95134</code>
        </postal>
        <phone></phone>
        <email>serpil@cisco.com</email>
      </address>
    </author>

    <date month="Nov" year="2021" />

    <!-- Meta-data Declarations -->

    <area>General</area>

    <workgroup>Internet Engineering Task Force</workgroup>

    <!-- WG name at the upperleft corner of the doc,
         IETF is fine for individual submissions.  
	 If this element is not present, the default is "Network Working Group",
         which is used by the RFC Editor as a nod to the history of the IETF. -->

    <keyword>IDR</keyword>

    <keyword>BGP</keyword>

    <!-- Keywords will be incorporated into HTML output
         files in a meta tag but they have no effect on text or nroff
         output. If you submit your draft to the RFC Editor, the
         keywords will be used for the search engine. -->

    <abstract>
      <t>
	BGPLS provides a mechanism by which link-state and traffic engineering
	information can be collected from internal networks and shared with
	external network routers using BGP. BGPLS defines a new Address Family
	to exchange this information using BGP.
      </t>

      <t>
	BGP Optimal Route Reflection (BGP-ORR) [RFC9107] provides a mechanism for a
	centralized BGP route reflector to achieve the requirements of
	hot potato routing as described in Section 11 of [RFC4456].
	Optimal route reflection requires overriding the
	default IGP location of the route reflector, which
	is used to determine the cost to the next-hop contained in a path.
      </t>

      <t>This draft augments BGPLS and defines new extensions to exchange the
	cost to next-hops, for the purpose of calculating the best path from a
	peer's perspective rather than from the local BGP speaker's own
	perspective.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">

      <t>
	In certain situations, route-reflector clients may not receive the
	optimal path to some destinations. ADDPATH solves this problem by
	letting the route reflector advertise multiple paths for a given
	prefix. If the number of advertised paths is sufficiently large,
	route-reflector clients can choose the same route as they would in a
	full mesh. This approach, however, places an additional burden on the
	control plane. BGP Optimal Route Reflection (BGP-ORR)
	<xref target="RFC9107"></xref> uses a different approach: instead of
	calculating the best path from the local speaker's own perspective, the
	calculation uses the cost from the client to the next-hops. Although
	this eliminates the need to transmit redundant routing information
	between peers, there are scenarios where the cost to the next-hop
	cannot be obtained accurately using these methods. For example, if the
	next-hop itself has been learned via BGP, then a simple SPF run on the
	link-state database is not sufficient to obtain the cost. There are
	also scenarios where a route reflector can reach its clients but
	client-to-client connectivity may be down.
      </t>

      <t>
	BGPLS <xref target="RFC7752"></xref>
	provides a mechanism by which link-state and traffic engineering
	information can be collected from internal networks and shared with
	external network routers using BGP. BGPLS defines a new Address Family
	to exchange this information using BGP.
      </t>


      <t>
	To address such scenarios, this draft defines extensions to BGPLS to carry
	the cost of the next-hops. In particular, this draft defines a
	new Protocol ID to announce a router's
	IGP routes, and a Prefix Descriptor to carry the cost of the
	IGP routes used to resolve next-hops.
      </t>
    </section>

    <section title="NEXT-HOP INFORMATION BASE">
      <t>
	To facilitate the description of the proposed solution, we
	introduce a new table that holds all known next-hops and the costs to
	them from various routers on the network.
      </t>

      <t>
	The Next-Hop Information Base (NHIB) stores the cost to reach a next-hop
	from an arbitrary router on the network. This information is essential for
	choosing the best path from a peer's perspective rather than from the
	BGP speaker's own perspective. In canonical form, an NHIB entry is a
	triplet (router, next-hop, cost); however, this specification does not
	impose any restriction on how BGP implementations store that information
	internally. The cost in the NHIB does not have to be an IGP cost, but all
	costs in the NHIB MUST be comparable with each other.
      </t>
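
      <t>
	As an illustration only, the canonical (router, next-hop, cost) triplet
	can be sketched in Python as a small keyed table. The class and method
	names below are hypothetical and are not defined by this draft; the
	sample costs are the ones given for R4 in the "Trivial case" usage
	scenario.
      </t>

      <figure>
        <artwork><![CDATA[
```python
from typing import Dict, Optional, Tuple


class NHIB:
    """Toy Next-Hop Information Base (hypothetical helper, not from the draft)."""

    def __init__(self) -> None:
        # Maps (router, next_hop) to the cost reported for that pair.
        self._cost: Dict[Tuple[str, str], int] = {}

    def update(self, router: str, next_hop: str, cost: int) -> None:
        """Store or overwrite the cost from `router` to `next_hop`."""
        self._cost[(router, next_hop)] = cost

    def lookup(self, router: str, next_hop: str) -> Optional[int]:
        """Return the stored cost, or None when no entry exists."""
        return self._cost.get((router, next_hop))

    def withdraw_router(self, router: str) -> None:
        """Drop every entry learned for `router`."""
        for key in [k for k in self._cost if k[0] == router]:
            del self._cost[key]


nhib = NHIB()
nhib.update("R4", "R1", 8)  # cost from R4 to next-hop R1
nhib.update("R4", "R2", 1)  # cost from R4 to next-hop R2
```
        ]]></artwork>
      </figure>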

      <t>
	NHIB can be populated from various sources including static routing
	and dynamic routing. However, this document focuses on populating 
	NHIB using BGP. 
      </t>

      <t>
	An implementation of the BGP extension described in this draft
	MAY provide an operator-controlled configuration knob, significant only
	to an individual BGP speaker, that treats next-hop cost information
	received from one client as equivalent for two or more clients. For
	example, a route reflector could receive next-hop cost only from R1 but
	also use it when calculating the best path for R2, R3, and Rn, because
	it has been instructed to do so by locally significant configuration.
	Multiple sources can be used for redundancy.
      </t>
    </section>

    <section title="BGP Bestpath Selection Modification">
      <t>
	This section applies regardless of method used to populate NHIB.
      </t>

      <t>
	When a BGP speaker conforming to this specification selects routes to
	be advertised to a peer, it SHOULD use cost information from the NHIB
	rather than its own IGP cost to the next-hop after step (d) of
	Section 9.1.2.2 of <xref target="RFC4271"></xref>.
      </t>
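
      <t>
	The per-peer comparison can be sketched as follows. This is a minimal
	illustration under stated assumptions, not the normative procedure: the
	function name is hypothetical, the NHIB is modeled as a plain
	dictionary, ties are broken by next-hop name purely for determinism,
	and returning None when a cost is missing is a choice made for this
	sketch that the draft does not specify. The costs for R4 come from the
	"Trivial case" usage scenario; the costs for R3 are invented for the
	example.
      </t>

      <figure>
        <artwork><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

# (router, next_hop) -> cost, i.e. the NHIB triplets in dictionary form.
Nhib = Dict[Tuple[str, str], int]


def select_for_peer(peer: str, next_hops: List[str], nhib: Nhib) -> Optional[str]:
    """Pick the candidate next-hop with the lowest cost as seen from `peer`.

    `next_hops` are the candidates still tied after the earlier tie-breaking
    steps. Returns None if the list is empty or any candidate has no cost
    for this peer (sketch-only behaviour, not specified by the draft).
    """
    ranked = []
    for nh in next_hops:
        cost = nhib.get((peer, nh))
        if cost is None:
            return None
        ranked.append((cost, nh))
    if not ranked:
        return None
    # Lowest cost wins; the next-hop name breaks ties deterministically.
    return min(ranked)[1]


nhib: Nhib = {
    ("R4", "R1"): 8,  # from the "Trivial case" scenario
    ("R4", "R2"): 1,  # from the "Trivial case" scenario
    ("R3", "R1"): 1,  # assumed value for illustration
    ("R3", "R2"): 9,  # assumed value for illustration
}

choice_r4 = select_for_peer("R4", ["R1", "R2"], nhib)
choice_r3 = select_for_peer("R3", ["R1", "R2"], nhib)
```
        ]]></artwork>
      </figure>

      <t>
	With these inputs the sketch selects R2 for R4 and R1 for R3, which
	matches the outcome described in the "Trivial case" usage scenario.
      </t>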
    </section>

    <section title="BGPLS Extensions">
      <section title="RIB Metrics Prefix Descriptor">
	<t>
	This draft defines a new Prefix Descriptor, known as the Cost Prefix
	Descriptor, with a TLV code point value to be assigned by IANA. The
	Cost Prefix Descriptor has the following format:
	</t>

	<t>
	  <figure align="center">
            <artwork align="left"><![CDATA[

   +--------------+-----------------------+----------+-----------------+
   |   TLV Code   | Description           |  Length  | Value defined   |
   |    Point     |                       |          | in:             |
   +--------------+-----------------------+----------+-----------------+
   |     TBD      | Cost                  | 4 bytes  | Cost Value      |
   +--------------+-----------------------+----------+-----------------+

      Cost Value is a 4 byte Metric value computed by a Router's 
      local RIB.

            ]]></artwork>
	  </figure>
	</t>

	<t>
	  The Cost value is the cost associated with a prefix by
	  a router. The cost is typically computed by the routing protocol
	  that owns the route.
	</t>
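
	<t>
	  As a non-normative sketch, the descriptor could be encoded and
	  decoded in Python as shown below, assuming the usual BGPLS TLV
	  layout of a 2-octet type and a 2-octet length followed by the value
	  <xref target="RFC7752"/>. The type value 65000 is a placeholder
	  chosen for this example only, since the real code point is still to
	  be assigned by IANA, and the function names are hypothetical.
	</t>

	<figure>
	  <artwork><![CDATA[
```python
import struct
from typing import Tuple

# Placeholder only: the real TLV code point is TBD (to be assigned by IANA).
COST_TLV_TYPE = 65000


def encode_cost_tlv(cost: int, tlv_type: int = COST_TLV_TYPE) -> bytes:
    """Encode a Cost descriptor: 2-octet type, 2-octet length (4), 4-octet cost."""
    if not 0 <= cost <= 0xFFFFFFFF:
        raise ValueError("cost must fit in 4 bytes")
    return struct.pack("!HHI", tlv_type, 4, cost)


def decode_cost_tlv(data: bytes) -> Tuple[int, int]:
    """Decode a Cost descriptor and return (tlv_type, cost)."""
    if len(data) < 8:
        raise ValueError("truncated TLV")
    tlv_type, length = struct.unpack("!HH", data[:4])
    if length != 4:
        raise ValueError("Cost value must be exactly 4 bytes")
    (cost,) = struct.unpack("!I", data[4:8])
    return tlv_type, cost


wire = encode_cost_tlv(8)
```
	  ]]></artwork>
	</figure>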
	
    </section>

      <section title="RIB Protocol ID">
	<t>
	  This draft defines a new Protocol ID for the IPv4 and IPv6 Topology
	  Prefix NLRI, known as the RIB Protocol ID. The RIB Protocol ID has a
	  value to be assigned by IANA. The Prefix NLRI with the RIB Protocol ID
	  is used to announce all the local and IGP-computed
	  routes that are installed in the RIB, along with their Cost values.
	</t>
      </section>


      <section title="Information Exchange">
        <t>Typically, BGPLS sessions will be established between
          route reflectors and their internal peers (both clients and
          non-clients). As soon as the BGPLS session reaches the Established
	  state, clients MAY immediately send their route reflector all the
	  RIB routes used to resolve next-hops, together with the
	  associated cost information.
	  Implementations are advised to announce BGP
	  updates for this SAFI before any other SAFIs to facilitate faster
	  convergence of other SAFIs on route reflectors.
	  </t>

	<t>
	  Each internal neighbor of a route reflector announces
	  its IGP RIB prefix information and its RIB metrics to the route
	  reflector over a BGPLS session, using the new NLRI Protocol ID and the
	  RIB Metrics Prefix Descriptor. Each neighbor updates the route
	  reflector with its IGP prefix cost every time the cost to an IGP route
	  changes.
      </t>

	<t>
	  Upon receipt of a BGPLS route and its associated cost, a
	  Route Reflector stores the prefix, cost, and neighbor information
	  in its local NHIB. It then uses the received cost to
	  calculate the best path from the respective client's perspective, as
	  opposed to using its own IGP cost.
	</t>
      </section>

      <section title="Termination of the session carrying next-hop cost">
        <t>
	  When the BGPLS session carrying next-hop cost
	  terminates (for whatever reason), the BGP speaker
	  SHOULD invalidate all the next-hop cost information learned over it
	  (i.e., next-hop cost receives the same treatment as any other
	  BGP-learned information).
	</t>
      </section>

      <section title="Graceful Restart and Route-Refresh">
        <t>BGPLS sessions carrying next-hop cost can use the Graceful Restart
	  <xref target="RFC4724"/>, Route Refresh <xref target="RFC2918"/>, and
	  Enhanced Route Refresh <xref target="RFC7313"/>
          mechanisms in the same way as they are used for IPv4 and IPv6
        unicast.
	</t>
      </section>
    </section>

    <section title="Security considerations">
      <t>
	This document does not introduce new security considerations above
	and beyond those already specified in
	<xref target="RFC4271" />, <xref target="RFC9107" />, and
	<xref target="RFC7752" />.
	</t>
    </section>

    <section title="IANA Considerations">
      <t>
	This draft defines a new Protocol ID value for the RIB Protocol ID.
	This draft requests IANA to allocate a value for the RIB Protocol
	ID from the BGPLS Protocol ID registry.
      </t>
      
      <t>
	This draft defines a new RIB Metrics Prefix Descriptor
	value. This draft requests IANA to allocate a TLV code point value
	for the new descriptor from the Prefix Descriptor registry.
      </t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>
	The authors would like to acknowledge David Ward, Anton Elita,
	Nagendra Kumar and Burjiz Pithawala for their critical reviews
	and feedback.
      </t>
    </section>


  </middle>

  <!--  *****BACK MATTER ***** -->

  <back>
    <references title="Normative References">
      &RFC4760;
      &RFC4271;
      &RFC2119;
      &RFC2328;
      &RFC4456;
      &RFC4724;
      &RFC7313;
      &RFC9107;
      &RFC7752;

    </references>

    <references title="Informative References">
      &RFC2918;
    </references>

    <section title="USAGE SCENARIOS">
      <section title="Trivial case">
        <figure>
          <artwork><![CDATA[
     --+---NetA---+--
       |          |
      r1          r2
       |          |
       R1--RR-----R2
       | \        |
       |  +------R4
       R3        
        ]]></artwork>
        </figure>

        <t>In this scenario r1 and r2, along with NetA, are part of AS1, and
        R1-R4, along with RR, are in AS2.</t>

        <t>If RR implements non-optimized route reflection, then it will
        choose the path to NetA via R1 and advertise it to both R3 and R4. This
        choice is good from R3's perspective, but it results in suboptimal
        traffic flow from R4 to NetA.</t>

        <t>Using the proposed BGPLS extensions, the route reflector will learn that the cost from R4 to
        R1 is 8, whereas to R2 it is only 1. RR will announce NetA to R4 with
        the next-hop set to R2, while its announcement to R3 will still have R1 as
        the next-hop. Both R3 and R4 will now send traffic to NetA via the closest
        exit, achieving the same behaviour as if a full iBGP mesh had been
        configured.</t>
      </section>

      <section title="Non-IGP based cost">
        <t>When it's desirable to direct traffic over an exit other than the
        one with the smallest IGP cost, BGPLS extensions can be used to convey a cost that
        is not based on the IGP. For example, a network operator may arrange exit
        points in order of administrative preference and configure routers to
        send this preference instead of the IGP cost. The route reflector will then
        calculate the best path based on administrative preference rather than IGP
        metrics.</t>

        <t>Network operators should exercise care to ensure that all routers
        up to and including the exit point do not divert packets onto a different
        path, otherwise routing loops may occur. One way to achieve this is to
        have consistent administrative preference among all routers. Another
        option is to use a tunneling mechanism (e.g. MPLS-TE tunnel) between
        source and the exit point, provided that the router serving as exit
        point will send packets out of the network rather than diverting them
        to another exit point.</t>
      </section>

      <section title="Multiple route-reflectors">
        <t>This example demonstrates that BGPLS extensions are necessary only
        between routers that already exchange other AFI/SAFI.</t>

        <figure>
          <artwork><![CDATA[
                           |
R1----R3---------R5----R7--+
      |           |        |
     RR1          |       NetA
      |          RR2       |
      |           |        |
R2----R4---------R6----R8--+
                           |
       ]]></artwork>
        </figure>

        <t>In the above network the routers R1-R4 are clients of RR1, and
        R5-R8 are clients of RR2. RR1 and RR2 also peer with each other and
        use ADDPATH.</t>

        <t>RR2 learns about NetA from R7 and R8. Since it sends not just the
        best path but all paths to RR1, there is no need for RR2 to learn
        cost information from R1 and R2 towards R7 and R8. On the other hand,
        RR1 does exchange cost information using BGPLS with R1 and R2 so that each of
        them can receive the routes that are best from its own perspective.</t>

        <t>In addition to ADDPATH, a mechanism could be devised that would
        allow RR2 to learn how many alternative routes it needs to send to
        RR1. For example, if NetA would also be connected to R9 (not shown)
        but all clients of RR1 prefer R7 as exit point and R9 as next-best,
        then there is no need for RR2 to send NetA routes with next-hop R8 to
        RR1.</t>

        <t>Discussion: the authors would like to solicit feedback on whether there
        is sufficient interest in such a mechanism.</t>
      </section>

      <section title="Inter-AS MPLS VPN">
        <t>The previous example could be transposed to an Inter-AS MPLS VPN Option C
        scenario. In this case route reflectors RR1 and RR2 can be in
        different autonomous systems. Essentially, the behaviour of the routers
        remains as already described.</t>
      </section>

      <section title="Corner case">
        <figure>
          <artwork><![CDATA[
   --+---NetA--+--
     |         |
RR---R1        R2
       \      /
        R3---R4       ]]></artwork>
        </figure>

        <t>In the above network the cost from R3 to R1 is 10; all other costs are
        1. If RR advertises NetA to R3 based on cost information received from
        R3, but uses its own cost when advertising NetA to R4, a loop will
        form. This is why the section "BGP Bestpath
        Selection Modification" requires RR to have next-hop cost
        information for every next-hop and every peer.</t>

        <t>Note that the problem is the same as if RR did not use the extensions
        described in this document and R3 peered directly with R1 and R2,
        while R4 peered only with RR.</t>
      </section>
    </section>
  </back>
</rfc>
