
I’ve recently been looking at BGP designs using route reflectors (RR). As a best practice for RR designs, the logical iBGP sessions should follow the physical topology. But what could happen if you don’t follow this practice?

In the following example, I will allow my RRs to behave badly, and NOT follow the physical topology to see what might happen.

Proposed BGP and RR Design

Initially, AS 200 has a full mesh design of iBGP speakers. (I am ignoring how AS 100 is inter-connected.) Routers A and B from AS 100 both send prefix 10.26.6.0/24 to their neighbors. IP address 10.26.6.1 is currently reachable from PE-M1 & PE-M2. The dashed lines show the logical iBGP sessions. The thick solid black lines show the physical connectivity in the network.

The basic BGP configuration is straightforward: every router in AS 200 has an iBGP session to every other BGP speaker in the domain. The two edge routers, CE-A1 and CE-A2, also have eBGP sessions to edge routers A and B in AS 100.

The following loopback addressing is in place:

  • PE-T1    10.216.248.1/32
  • PE-T2    10.216.248.2/32
  • PE-M1    10.216.248.3/32
  • PE-M2    10.216.248.4/32
  • CE-A1    10.216.248.33/32
  • CE-A2    10.216.248.34/32

All the routers in AS 200 are peering on Loopback0, for example:

!
PE-M2#sh run | beg router bgp
router bgp 200
no synchronization
bgp log-neighbor-changes
neighbor 10.216.248.1 remote-as 200
neighbor 10.216.248.1 update-source Loopback0
neighbor 10.216.248.2 remote-as 200
neighbor 10.216.248.2 update-source Loopback0
neighbor 10.216.248.3 remote-as 200
neighbor 10.216.248.3 update-source Loopback0
neighbor 10.216.248.33 remote-as 200
neighbor 10.216.248.33 update-source Loopback0
neighbor 10.216.248.34 remote-as 200
neighbor 10.216.248.34 update-source Loopback0
no auto-summary
!
. . .
PE-M2#

Here is the BGP configuration of one of the edge routers:

CE-A1#sh run | beg router bgp
router bgp 200
no synchronization
bgp log-neighbor-changes
neighbor 10.26.6.6 remote-as 100
neighbor 10.216.248.1 remote-as 200
neighbor 10.216.248.1 update-source Loopback0
neighbor 10.216.248.1 next-hop-self
neighbor 10.216.248.2 remote-as 200
neighbor 10.216.248.2 update-source Loopback0
neighbor 10.216.248.2 next-hop-self
neighbor 10.216.248.3 remote-as 200
neighbor 10.216.248.3 update-source Loopback0
neighbor 10.216.248.3 next-hop-self
neighbor 10.216.248.4 remote-as 200
neighbor 10.216.248.4 update-source Loopback0
neighbor 10.216.248.4 next-hop-self
neighbor 10.216.248.34 remote-as 200
neighbor 10.216.248.34 update-source Loopback0
neighbor 10.216.248.34 next-hop-self
network 10.216.0.0 mask 255.255.0.0
no auto-summary
!
. . .
CE-A1#

Initially, all devices in AS 200 have two BGP entries to reach 10.26.6.1, for example:

PE-M2#sh ip bgp
BGP table version is 3, local router ID is 10.216.248.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* i10.26.6.0/24 10.216.248.33 0 100 0 100 i
*>i 10.216.248.34 0 100 0 100 i
*>i10.216.0.0/16 10.216.248.34 0 100 0 i
* i 10.216.248.33 0 100 0 i
PE-M2#
PE-M2#ping 10.26.6.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.26.6.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms
PE-M2#

Migrating to a BGP Route Reflector Configuration

To test an RR design that does not follow the physical topology, the following physical and logical topology will be implemented:

Proposed BGP and RR Design

The dashed lines from PE-T1 and PE-T2 show the logical iBGP sessions to the RR clients. There is also an iBGP session between PE-T1 and PE-T2. The thick solid black lines again show the physical connectivity in the network.  (This is NOT a recommended design, but is used here for illustration.)

The following new RR configurations are applied:

!PE-T1
no router bgp 200
router bgp 200
neighbor 10.216.248.2 remote-as 200
neighbor 10.216.248.2 update-source lo 0
neighbor 10.216.248.2 route-reflector-client
neighbor 10.216.248.4 remote-as 200
neighbor 10.216.248.4 update-source lo 0
neighbor 10.216.248.33 remote-as 200
neighbor 10.216.248.33 update-source lo 0
neighbor 10.216.248.33 route-reflector-client
!PE-T2
no router bgp 200
router bgp 200
neighbor 10.216.248.1 remote-as 200
neighbor 10.216.248.1 update-source lo 0
neighbor 10.216.248.3 remote-as 200
neighbor 10.216.248.3 update-source lo 0
neighbor 10.216.248.3 route-reflector-client
neighbor 10.216.248.34 remote-as 200
neighbor 10.216.248.34 update-source lo 0
neighbor 10.216.248.34 route-reflector-client
!CE-A1
no router bgp 200
router bgp 200
neighbor 10.216.248.1 remote-as 200
neighbor 10.216.248.1 update-source lo 0
neighbor 10.216.248.1 next-hop-self
neighbor 10.26.6.6 remote-as 100
!CE-A2
no router bgp 200
router bgp 200
neighbor 10.216.248.2 remote-as 200
neighbor 10.216.248.2 update-source lo 0
neighbor 10.216.248.2 next-hop-self
neighbor 10.26.6.10 remote-as 100
!PE-M1
no router bgp 200
router bgp 200
neighbor 10.216.248.2 remote-as 200
neighbor 10.216.248.2 update-source lo 0
!PE-M2
no router bgp 200
router bgp 200
neighbor 10.216.248.1 remote-as 200
neighbor 10.216.248.1 update-source lo 0
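
The route-reflector-client statements above control which iBGP peers a reflector re-advertises a path to. The short Python sketch below models that decision for PE-T2, using its peer list from the configuration above. It is a simplification and not a description of the full IOS behavior: it assumes the standard route reflection rules (a path learned from a client is reflected to every other iBGP peer, a path learned from a non-client is reflected only to clients) and leaves out ORIGINATOR_ID and CLUSTER_LIST handling. The names `Peer` and `reflect_to` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    is_client: bool  # True when the neighbor has route-reflector-client set


def reflect_to(peers, learned_from):
    """Return the iBGP peers a reflector re-advertises a best path to.

    Simplified rules (assumption: loop-prevention attributes are ignored):
    - learned from a client     -> every other iBGP peer
    - learned from a non-client -> clients only
    """
    source = next(p for p in peers if p.name == learned_from)
    if source.is_client:
        targets = [p for p in peers if p.name != learned_from]
    else:
        targets = [p for p in peers if p.is_client]
    return sorted(p.name for p in targets)


# PE-T2's iBGP peers, read from its neighbor statements.
PE_T2_PEERS = [
    Peer("PE-T1", is_client=False),
    Peer("PE-M1", is_client=True),
    Peer("CE-A2", is_client=True),
]

print(reflect_to(PE_T2_PEERS, "CE-A2"))  # path learned from a client
print(reflect_to(PE_T2_PEERS, "PE-T1"))  # path learned from a non-client
```

Because a reflector passes on only its best path, each router behind it hears a single path for a prefix, however many exits the AS has.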

Verifying the RR Configuration

As expected, the PE-T1 and PE-T2 routers now have only three iBGP sessions each, for example:

PE-T1#sh ip bgp sum
BGP router identifier 10.216.248.1, local AS number 200
BGP table version is 2, main routing table version 2
1 network entries using 121 bytes of memory
2 path entries using 104 bytes of memory
2/1 BGP path/bestpath attribute entries using 152 bytes of memory
1 BGP rrinfo entries using 24 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 425 total bytes of memory
BGP activity 1/0 prefixes, 2/0 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/Pfx
10.216.248.2 4 200 6 7 2 0 0 00:04:43 1
10.216.248.4 4 200 5 6 2 0 0 00:02:28 0
10.216.248.33 4 200 5 6 2 0 0 00:03:08 1
PE-T1#

The PE-M1 and PE-M2 routers have only one iBGP session each, for example:

PE-M2#sh ip bgp sum
BGP router identifier 10.216.248.4, local AS number 200
BGP table version is 2, main routing table version 2
1 network entries using 121 bytes of memory
1 path entries using 52 bytes of memory
2/1 BGP path/bestpath attribute entries using 152 bytes of memory
1 BGP rrinfo entries using 24 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 373 total bytes of memory
BGP activity 1/0 prefixes, 1/0 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.216.248.1 4 200 5 4 2 0 0 00:02:03 1
PE-M2#
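
The session counts in these summaries can be checked with a little arithmetic. The sketch below assumes the six iBGP speakers listed in the loopback table and reads the route reflector sessions from the neighbor statements shown earlier. The names `full_mesh_sessions`, `RR_SESSIONS`, and `sessions_per_router` are made up for this example.

```python
from collections import Counter


def full_mesh_sessions(n):
    """Number of iBGP sessions needed to fully mesh n speakers."""
    return n * (n - 1) // 2


# iBGP sessions in the route reflector design, one tuple per session,
# taken from the neighbor statements in the new configurations.
RR_SESSIONS = [
    ("PE-T1", "PE-T2"),
    ("PE-T1", "PE-M2"),
    ("PE-T1", "CE-A1"),
    ("PE-T2", "PE-M1"),
    ("PE-T2", "CE-A2"),
]


def sessions_per_router(sessions):
    """Count how many iBGP sessions terminate on each router."""
    counts = Counter()
    for a, b in sessions:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


print(full_mesh_sessions(6))             # sessions in the original full mesh
print(len(RR_SESSIONS))                  # sessions in the RR design
print(sessions_per_router(RR_SESSIONS))  # per-router view
```

With six speakers the full mesh needs 15 sessions, five per router, while the reflector design needs five in total: three on each reflector and one on every other router.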

As expected, the RR clients now have one BGP entry towards 10.26.6.1, for example:


PE-M1#sh ip bgp
BGP table version is 3, local router ID is 10.216.248.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i10.26.6.0/24 10.216.248.34 0 100 0 100 i
PE-M1#
PE-M2#sh ip bgp
BGP table version is 2, local router ID is 10.216.248.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i10.26.6.0/24 10.216.248.33 0 100 0 100 i
PE-M2#

Testing Connectivity

So what happens now when PE-M1 or PE-M2 attempts to reach 10.26.6.1?

PE-M1#trace 10.26.6.1

Type escape sequence to abort.
Tracing the route to 10.26.6.1

  1 10.216.250.5 0 msec 0 msec 0 msec
  2 10.216.250.6 0 msec 0 msec 0 msec
  3 10.216.250.5 0 msec 0 msec 0 msec
  4 10.216.250.6 0 msec 0 msec 0 msec
  5 10.216.250.5 0 msec 0 msec 0 msec
  6 10.216.250.6 0 msec 0 msec 0 msec
  7 10.216.250.5 0 msec 0 msec 0 msec
  8 10.216.250.6 0 msec 0 msec 0 msec
  9 10.216.250.5 0 msec 0 msec 0 msec
 10 ...

Identifying the Issue

You may have spotted the issue in the earlier show ip bgp output: PE-M1 and PE-M2 each select a different BGP next hop for 10.26.6.0/24. If not, the routing tables of PE-M1 and PE-M2 help illustrate the problem:

PE-M1#sh ip ro
. . .
Gateway of last resort is not set

10.0.0.0/8 is variably subnetted, 13 subnets, 3 masks
B 10.26.6.0/24 [200/0] via 10.216.248.34, 00:02:55
D 10.216.248.1/32
[90/128512] via 10.216.250.13, 00:04:56, TenGigabitEthernet2/0/0
D 10.216.248.2/32
[90/128768] via 10.216.250.5, 00:04:56, TenGigabitEthernet3/0/0
C 10.216.248.3/32 is directly connected, Loopback0
D 10.216.248.4/32
[90/128512] via 10.216.250.5, 00:04:56, TenGigabitEthernet3/0/0
D 10.216.248.33/32
[90/131072] via 10.216.250.13, 00:04:57, TenGigabitEthernet2/0/0
D 10.216.248.34/32
[90/131328] via 10.216.250.5, 00:04:57, TenGigabitEthernet3/0/0
C 10.216.250.4/30 is directly connected, TenGigabitEthernet3/0/0
L 10.216.250.6/32 is directly connected, TenGigabitEthernet3/0/0
C 10.216.250.12/30 is directly connected, TenGigabitEthernet2/0/0
L 10.216.250.14/32 is directly connected, TenGigabitEthernet2/0/0
D 10.216.250.128/30
[90/3072] via 10.216.250.13, 00:04:57, TenGigabitEthernet2/0/0
D 10.216.250.132/30
[90/3328] via 10.216.250.5, 00:04:57, TenGigabitEthernet3/0/0
PE-M1#


PE-M2#sh ip ro
. . .
Gateway of last resort is not set

10.0.0.0/8 is variably subnetted, 13 subnets, 3 masks
B 10.26.6.0/24 [200/0] via 10.216.248.33, 00:03:03
D 10.216.248.1/32
[90/128768] via 10.216.250.6, 00:04:37, TenGigabitEthernet3/0/0
D 10.216.248.2/32
[90/128512] via 10.216.250.13, 00:04:37, TenGigabitEthernet2/0/0
D 10.216.248.3/32
[90/128512] via 10.216.250.6, 00:04:37, TenGigabitEthernet3/0/0
C 10.216.248.4/32 is directly connected, Loopback0
D 10.216.248.33/32
[90/131328] via 10.216.250.6, 00:04:38, TenGigabitEthernet3/0/0
D 10.216.248.34/32
[90/131072] via 10.216.250.13, 00:04:38, TenGigabitEthernet2/0/0
C 10.216.250.4/30 is directly connected, TenGigabitEthernet3/0/0
L 10.216.250.5/32 is directly connected, TenGigabitEthernet3/0/0
C 10.216.250.12/30 is directly connected, TenGigabitEthernet2/0/0
L 10.216.250.14/32 is directly connected, TenGigabitEthernet2/0/0
D 10.216.250.128/30
[90/3328] via 10.216.250.6, 00:04:38, TenGigabitEthernet3/0/0
D 10.216.250.132/30
[90/3072] via 10.216.250.13, 00:04:38, TenGigabitEthernet2/0/0
PE-M2#

The network has a routing loop. When PE-M1 forwards traffic to the BGP-learned prefix 10.26.6.0/24, it recursively looks up the IGP route to the BGP next hop, CE-A2 (10.216.248.34). The IGP next hop toward 10.216.248.34 is PE-M2 at 10.216.250.5, so PE-M1 forwards the traffic to PE-M2.

When PE-M2 forwards traffic to 10.26.6.0/24, it looks up the IGP route to its own BGP next hop, CE-A1 (10.216.248.33). The IGP next hop toward 10.216.248.33 is PE-M1 at 10.216.250.6, so PE-M2 forwards the traffic straight back to PE-M1.

Net result: PE-M1 and PE-M2 have formed a routing loop, and will continue to loop the traffic for 10.26.6.0/24.
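
The recursive lookup each router performs can be reproduced in a few lines of Python. This is only a sketch: the two tables contain just the relevant entries copied from the show ip route output above, the interface-to-router mapping comes from the local (L) routes in those same tables, and the function names `lookup` and `resolve` are hypothetical.

```python
import ipaddress


def lookup(table, destination):
    """Longest-prefix match; returns (prefix, protocol, next_hop) or None."""
    dest = ipaddress.ip_address(destination)
    matches = [e for e in table if dest in ipaddress.ip_network(e[0])]
    if not matches:
        return None
    return max(matches, key=lambda e: ipaddress.ip_network(e[0]).prefixlen)


def resolve(table, destination, max_depth=4):
    """Follow BGP next hops until an IGP route supplies the forwarding hop."""
    target = destination
    for _ in range(max_depth):
        entry = lookup(table, target)
        if entry is None:
            return None
        _prefix, protocol, next_hop = entry
        if protocol != "bgp":
            return next_hop
        target = next_hop  # recurse on the BGP next hop
    return None


# Relevant entries only, copied from the routing tables above.
PE_M1 = [
    ("10.26.6.0/24", "bgp", "10.216.248.34"),
    ("10.216.248.33/32", "eigrp", "10.216.250.13"),
    ("10.216.248.34/32", "eigrp", "10.216.250.5"),
]
PE_M2 = [
    ("10.26.6.0/24", "bgp", "10.216.248.33"),
    ("10.216.248.33/32", "eigrp", "10.216.250.6"),
    ("10.216.248.34/32", "eigrp", "10.216.250.13"),
]

# Interface owners, from the local (L) routes in each table.
OWNER = {"10.216.250.5": "PE-M2", "10.216.250.6": "PE-M1"}

print("PE-M1 forwards to", OWNER[resolve(PE_M1, "10.26.6.1")])
print("PE-M2 forwards to", OWNER[resolve(PE_M2, "10.26.6.1")])
```

Each router resolves the destination to the other one, which is the same ping-pong pattern the traceroute shows.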

Summary

In BGP designs with route reflectors, the logical iBGP sessions really should follow the physical topology.  This practice helps prevent routing loops.

To resolve the routing loop in this example, CE-A1 and PE-M1 should be RR clients of only PE-T1, and CE-A2 and PE-M2 should be RR clients of only PE-T2. With this updated design, the logical and the physical topology would match, and the routing loop would be avoided.
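
As a rough check of that reasoning, the sketch below walks a packet hop by hop under both designs. The `ORIGINAL` map reflects the forwarding decisions seen in the lab output. The `CORRECTED` map is an assumption, not something captured from the lab: it supposes that each client learns the exit behind its own reflector, and that PE-T1 and PE-T2 forward directly to CE-A1 and CE-A2. The `trace` function is a hypothetical helper.

```python
def trace(start, next_router, max_hops=10):
    """Walk router to router; report where the packet ends up and why."""
    path = [start]
    seen = {start}
    current = start
    while len(path) <= max_hops:
        nxt = next_router.get(current)
        if nxt is None:
            # No further iBGP-internal hop: the router hands off over eBGP.
            return path, "delivered"
        path.append(nxt)
        if nxt in seen:
            return path, "loop"
        seen.add(nxt)
        current = nxt
    return path, "hop limit"


# Forwarding decisions observed in the lab (router -> next router).
ORIGINAL = {"PE-M1": "PE-M2", "PE-M2": "PE-M1"}

# ASSUMED decisions once the iBGP sessions follow the physical topology.
CORRECTED = {
    "PE-M1": "PE-T1",
    "PE-T1": "CE-A1",
    "PE-M2": "PE-T2",
    "PE-T2": "CE-A2",
}

print(trace("PE-M1", ORIGINAL))
print(trace("PE-M1", CORRECTED))
print(trace("PE-M2", CORRECTED))
```

Under these assumptions the original design loops between PE-M1 and PE-M2, while the corrected design carries traffic from each PE-M router through its own reflector to the nearest edge router.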

— cwr

Carole Warner Reece

Architect

A senior network consultant with more than fifteen years of industry experience, Carole is one of our most highly experienced network professionals. Her current focus is on the data center and on network infrastructure.
