This is just where I keep my network notes. Maybe you keep yours in a notebook, or a text file. Mine are just on this public webpage.

Some of this is copied from other sites (I try to cite those where possible) and other stuff is from lab reproductions.

My test setup is a CML cluster.

Contact Me

Email: ariadne@haske.org

This document is built on lab work and Interconnections; see [1].

Terms

  • Bridge: A device that participates in the spanning tree algorithm.

  • Root Bridge: The bridge that wins the STP election.

  • Bridge ID: Three adjacent fields: Bridge Priority, Extension ID (the VLAN), and MAC Address.

  • BPDU: Bridge Protocol Data Unit. The frame used in 802.1D STP.

  • STP: Spanning Tree Protocol. Frequently cited as 802.1D.

  • 802.1D: An IEEE standard. The oldest Ethernet STP.

  • Root ID: The bridge ID of the bridge that has won, and keeps winning, the root election.

  • Designated ports: AKA DP. Sends BPDUs downstream.

  • Root Port: AKA RP, AKA upstream. Receives BPDUs from the upstream switch. Each bridge can have only one RP. The RP is picked by the port selection algo.

  • TCN: Topology Change Notification. Sent by the bridge that sees an STP change, upstream via its RP. This is its own message type.

  • TCA Bit: Topology Change Acknowledgment. Set by the upstream bridge to let the reporting bridge know it got the TCN and is relaying it upstream. This is inside a config BPDU.

  • TC Bit: Topology Change. The root bridge sets the TC bit to tell other bridges to shorten their MAC address table aging time to the forward delay. This is inside a config BPDU.

How STP makes a loop free topology.

STP elects root ports (RPs) and designated ports (DPs). It also moves the remaining STP ports into blocking.

  • A bridge can only have one RP.
  • All ports on the root are DPs.
  • Ports on the root bridge never enter blocking.
  • Blocked ports must keep receiving BPDUs to stay blocked (the election must continue, forever)
  • If two would-be DPs send and receive each other's BPDUs:
    • There is a loop.
    • The port that has the inferior BPDU will block.
  1. All bridges turn on and send BPDUs on all STP ports, listing themselves as root.
  2. STP ports (bridges) compare BPDUs.
  3. Bridge with lowest Bridge ID is root (lowest priority; if priority is default, lowest MAC, usually the oldest switch).
  4. All ports on root bridge are DP, and BPDU cost field is set to zero.
  5. Root sends BPDUs.
  6. DPs send configuration BPDUs.
  7. RPs receive configuration BPDUs.
  8. Root bridge sends BPDU, cost is 0, with port identifiers set.
  9. A non-root bridge can only have one RP.
  10. Non-root bridge gets BPDUs. It uses the port selection algo to pick one RP.
  11. Non-root bridge starts STP elections on all other ports by sending BPDUs. It takes the cost inside the received BPDU and adds its port cost.
  12. If a DP gets a BPDU, STP blocks the port if the received BPDU is better.

Port Selection Algo

  • All choices are made based on the received BPDU.
  • Modifications are made on the upstream switch.
  1. Lowest cost to root.
  2. Lowest system priority of advertising switch.
  3. Lowest MAC of advertising switch.
  4. Lowest port identifier of advertising switch (port priority + port number, two bytes).
Spanning Tree Protocol
    Protocol Identifier: Spanning Tree Protocol (0x0000)
    Protocol Version Identifier: Spanning Tree (0)
    BPDU Type: Configuration (0x00)
    BPDU flags: 0x01, Topology Change
    Root Identifier: 32768 / 1 / 52:54:00:10:43:6f
    Root Path Cost: 0
    Bridge Identifier: 32768 / 1 / 52:54:00:10:43:6f
    Port identifier: 0x0002     < ------------------------- first byte is "port priority" the default on Cisco is 128, or 0x80
    Message Age: 0
    Max Age: 20
    Hello Time: 2
    Forward Delay: 15
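
The tie-break order above maps onto plain tuple comparison. This is a minimal Python sketch, assuming every BPDU names the same root. `ReceivedBpdu`, `pick_root_port`, the `Eth…` port names, and the cost of 100 are made up for illustration.

```python
from typing import NamedTuple


class ReceivedBpdu(NamedTuple):
    # Field order is the tie-break order, so min() on the tuples does the election.
    root_path_cost: int   # 1. advertised cost plus the receiving port's cost
    sender_priority: int  # 2. bridge priority of the advertising switch
    sender_mac: str       # 3. MAC of the advertising switch (lowercase, colon-separated)
    sender_port_id: int   # 4. port priority + port number, as sent in the BPDU
    local_port: str       # hypothetical name of the port that heard the BPDU


def pick_root_port(bpdus):
    """Return the local port that received the best (lowest) BPDU."""
    return min(bpdus).local_port


SW1 = (32768, "52:54:00:4b:99:08")

# Default port priority (0x80) everywhere: lowest sender port number wins.
default = [
    ReceivedBpdu(100, *SW1, 0x8001, "Eth1"),
    ReceivedBpdu(100, *SW1, 0x8002, "Eth2"),
    ReceivedBpdu(100, *SW1, 0x8003, "Eth3"),
]

# Upstream switch sets port priority 0 on its port 3: that link now wins.
lowered = default[:2] + [ReceivedBpdu(100, *SW1, 0x0003, "Eth3")]

print(pick_root_port(default))
print(pick_root_port(lowered))
```

The second case is why the notes say modifications are made on the upstream switch: the value being compared is the sender's port identifier, not the receiver's.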

Timers

  • Hello Time is the interval between BPDUs, usually 2 seconds.

  • Forward Delay is typically 15 seconds. It is the time spent in each of listening and learning: listening -> learning -> forwarding.

Device Priority.

4 bits, so 16 values in steps of 4096, from 0 to 61440.

switch(config)# spanning-tree vlan 60 priority ?
% Bridge Priority must be in increments of 4096.
% Allowed values are: 
  0     4096  8192  12288 16384 20480 24576 28672
  32768 36864 40960 45056 49152 53248 57344 61440
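
A rough Python sketch of how the 4-bit priority and the 12-bit extension ID (the VLAN) share the first two bytes of the bridge ID. The function names are mine, not anything from IOS.

```python
def priority_field(priority: int, vlan_id: int) -> int:
    """Combine the configured priority and VLAN into the 16-bit on-the-wire value."""
    if priority % 4096 or not 0 <= priority <= 61440:
        raise ValueError("priority must be a multiple of 4096 between 0 and 61440")
    if not 0 <= vlan_id <= 4095:
        raise ValueError("VLAN ID must fit in 12 bits")
    return priority | vlan_id


def split_priority_field(field: int):
    """Split the 16-bit value back into (priority, VLAN)."""
    return field & 0xF000, field & 0x0FFF


print(hex(priority_field(32768, 1)))
print(split_priority_field(8212))
```

This is why `show spanning-tree` reports odd-looking priorities such as 8212: it is 8192 plus VLAN 20.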

Root bridges election in Spanning Tree.

Two bridges send each other BPDUs and compare bridge IDs to see who will keep sending BPDUs as root.

The bridge with the lower ID (priority + MAC address) wins. The non-root bridge copies the winner's bridge ID into the root ID field of its own BPDUs, and sends those downstream.

The default priority is 32768, or 0x8000 on the wire. Because the 802.1D committee exists, the value sent is this plus the VLAN ID, so VLAN 1 at default priority shows up as 32769 (0x8001).

Always configure a root bridge; otherwise the device with the lowest MAC address, probably the oldest one, wins the root bridge election.
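
The election itself is just "lowest bridge ID wins". A small Python sketch, with hypothetical helper names, using the SW1 and SW2 values from the Port elections diagrams further down:

```python
def bridge_id(priority, vlan, mac):
    """Comparable bridge ID: (priority + extension ID, MAC as an integer)."""
    return (priority + vlan, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """bridges maps a name to (priority, vlan, mac); the lowest bridge ID wins."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


defaults = {
    "SW1": (32768, 1, "52:54:00:4b:99:08"),
    "SW2": (32768, 1, "52:54:00:e8:3a:ff"),
}
print(elect_root(defaults))    # priorities tie, so the lower MAC decides

configured = dict(defaults, SW2=(0, 1, "52:54:00:e8:3a:ff"))
print(elect_root(configured))  # priority 0 beats any MAC
```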

Path Cost

Each bridge adds to the cost in the root bridge's BPDU as it passes the BPDU along. The root bridge advertises itself as 0 cost.

Cost is the value of the link, towards the root bridge.

 ┌───────┐                                                                    
 │  SW1  │                                                                    
 └───┬───┘                                                                    
     │                                                                        
     │                                                                        
     │  Cost in BPDU from SW1 is 0                                                     
     │                                                                        
Eth0 │ ◄──── Interface is Assigned a cost of 100 by SW2 based on link Speed
 ┌───┴───┐                                                                    
 │  SW2  │                                                                    
 └───┬───┘                                                                    
Eth1 │                                                                        
     │                                                                        
     │   Cost in BPDU on-the-wire is now 100, SW2 Eth0 Cost                   
     │                                                                        
Eth0 │                                                                        
 ┌───┴───┐                                                                    
 │  SW3  │                                                                    
 └───────┘                                                                    
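
The diagram is a running sum. A tiny Python sketch, assuming the bottom switch's root port also costs 100 (the diagram doesn't say); `advertised_costs` is a made-up name:

```python
from itertools import accumulate


def advertised_costs(root_port_costs):
    """Cost each bridge puts in its BPDUs, starting with the root (always 0).

    root_port_costs lists the root port cost of each bridge in order,
    moving away from the root.
    """
    return list(accumulate(root_port_costs, initial=0))


# The root advertises 0; the next bridge adds its Eth0 cost of 100; the
# last bridge adds its own root port cost (assumed 100 here).
print(advertised_costs([100, 100]))
```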

Portfast

For end Hosts

  • Does not protect against BPDUs

Loop Prevention

Best practice is to set the root to 0 and the secondary to 4096.

STP Loop Guard

A unidirectional failure on a root or alternate port can cause a spanning tree loop: the port stops receiving BPDUs, so the switch unblocks it, while the failed link still forwards frames in the other direction. To prevent this, turn on STP loop guard. If a port stops getting BPDUs, it enters the loop-inconsistent state, which blocks the port instead of letting it move to forwarding.

This is done per interface, and is pretty tedious.

switch(config)# interface Ethernet 1/1
switch(config-if)# spanning-tree guard loop

Port Types

  • Designated ports: send BPDUs downstream.

  • Root Ports are the best port towards the root bridge: the lowest total cost, then the lowest advertised bridge ID, then the lowest advertised port ID (port priority + port number).

Root Path Cost

Root Path Cost: The interface's cost plus the advertised cost to the root. The root sends a cost of 0.

STP Path Calculations

spanning-tree pathcost method long

Speed      Short-Mode Cost  Long-Mode Cost
10 Mbps    100              2000000
100 Mbps   19               200000
1 Gbps     4                20000
10 Gbps    2                2000
20 Gbps    1                1000
40 Gbps    1                500
100 Gbps   1                200
1 Tbps     1                20
10 Tbps    1                2
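
The long-mode column follows a simple rule: 20,000,000 divided by the link speed in Mbps. The short-mode column is a lookup table with no formula behind it. A quick Python check of the long-mode values (`long_path_cost` is my own name for it):

```python
def long_path_cost(speed_mbps: int) -> int:
    """Long-mode STP path cost for a link speed given in Mbps."""
    return max(1, 20_000_000 // speed_mbps)


for label, mbps in [("10 Mbps", 10), ("1 Gbps", 1_000), ("100 Gbps", 100_000), ("10 Tbps", 10_000_000)]:
    print(label, long_path_cost(mbps))
```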

802.1D - Spanning Tree

The 802.1D committee wanted two intermediate states, one with and one without learning station addresses (Interconnections [1], page 67). This is why the state machine is more complicated.

┌─────────────┐
│     off     │
└──────┬──────┘
       │
       │  Turn on interface
       │
┌──────▼──────┐
│  Listening  │ Receive + Send BPDUs
└──────┬──────┘
       │
       │  forward delay (default 15s)
       │
┌──────▼──────┐
│  Learning   │ Receive + Send BPDUs + Program CAM
└──────┬──────┘
       │
       │  forward delay (default 15s)
       │
┌──────▼──────┐
│  Forwarding │ Receive + Send BPDUs + Program CAM + Forward Frames
└─────────────┘
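
The arithmetic behind the diagram, as a Python sketch. The `stale_bpdu` case is not from my lab notes. It assumes standard 802.1D behaviour, where a port that stops hearing BPDUs first waits for the stored BPDU to expire (max age) before it starts listening.

```python
def seconds_until_forwarding(forward_delay=15, max_age=20, stale_bpdu=False):
    """Time for a port to go from listening to forwarding.

    stale_bpdu=True adds the wait for the last stored BPDU to age out
    (assumption: indirect failure, standard 802.1D timers).
    """
    wait = max_age if stale_bpdu else 0
    return wait + 2 * forward_delay  # listening + learning


print(seconds_until_forwarding())
print(seconds_until_forwarding(stale_bpdu=True))
```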

BPDU Frame Format

This is a RSTP BPDU.

Spanning Tree Protocol

    Protocol Identifier: Spanning Tree Protocol (0x0000)
    Protocol Version Identifier: Rapid Spanning Tree (2)
    BPDU Type: Rapid/Multiple Spanning Tree (0x02)
    BPDU flags: 0x3c, Forwarding, Learning, Port Role: Designated
    
    0... .... = Topology Change Acknowledgment: No
    .0.. .... = Agreement: No
    ..1. .... = Forwarding: Yes
    ...1 .... = Learning: Yes
    .... 11.. = Port Role: Designated (3)
    .... ..0. = Proposal: No
    .... ...0 = Topology Change: No
    
    Root Identifier: 32768 / 1 / aa:bb:cc:00:07:00
    Root Path Cost: 100
    
    Bridge Identifier: 32768 / 1 / aa:bb:cc:00:0a:00
    Port identifier: 0x8003
    Message Age: 1
    Max Age: 20
    Hello Time: 2
    Forward Delay: 15
    Version 1 Length: 0
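
The flags byte is easy to pick apart by hand. A small Python sketch that decodes it the same way the capture above does (`decode_rstp_flags` is a made-up helper):

```python
PORT_ROLES = {0: "Unknown", 1: "Alternate/Backup", 2: "Root", 3: "Designated"}


def decode_rstp_flags(flags: int) -> dict:
    """Decode the one-byte RSTP flags field, most significant bit first."""
    return {
        "tc_ack": bool(flags & 0x80),
        "agreement": bool(flags & 0x40),
        "forwarding": bool(flags & 0x20),
        "learning": bool(flags & 0x10),
        "port_role": PORT_ROLES[(flags >> 2) & 0x03],
        "proposal": bool(flags & 0x02),
        "topology_change": bool(flags & 0x01),
    }


print(decode_rstp_flags(0x3C))
```

In an 802.1D config BPDU only the top bit (TCA) and bottom bit (TC) are used, so 0x01 decodes as a plain Topology Change.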

This is what the BPDU looks like on-the-wire

┌───────────────────────────────┬───────────────┬───────────────┐
│                               │               │               │
│          Protocol ID          │    Version    │   BPDU Type   │
│                               │               │               │
│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8│
└───────────────────────────────┴───────────────┴───────────────┘
             2 bytes                  1 byte         1 byte

┌───────────────┬───────────────────────────────────────────────►
│               │
│     Flag      │                    Root ID
│               │
│1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
└───────────────┴───────────────────────────────────────────────►
    1 byte                            8 bytes

◄───────────────────────────────────────────────────────────────►

                           Root ID

 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
◄───────────────────────────────────────────────────────────────►
                          8 bytes

◄───────────────┬───────────────────────────────────────────────►
                │
    Root ID     │              Root Path Cost
                │
 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
◄───────────────┴───────────────────────────────────────────────►
    8 bytes                       4 bytes

◄───────────────┬───────────────────────────────────────────────►
 Root Path Cost │
                │                Bridge ID
                │
 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
◄───────────────┴───────────────────────────────────────────────►
  4 bytes                         8 bytes

◄───────────────────────────────────────────────────────────────►

                           Bridge ID

 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
◄───────────────────────────────────────────────────────────────►
                          8 bytes

◄───────────────┬───────────────────────────────┬───────────────►
                │                               │ Message age
   Bridge ID    │           Port ID             │  (in 1/256s of a second)
                │                               │
 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8
◄───────────────┴───────────────────────────────┴───────────────►
    8 bytes                2 Bytes                   2 Bytes

◄───────────────┬───────────────────────────────┬───────────────►
                │           Max Age             │ Hello Time
   Message Age  │        (in 1/256ths)          │  (in 1/256ths of a second)
                │                               │
 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8
◄───────────────┴───────────────────────────────┴───────────────►
    2 Bytes                 2 Bytes                   2 Bytes

◄───────────────┬───────────────────────────────┬───────────────┐
                │  Forward Delay                │   Version 1   │
   Hello Time   │    (in 1/256ths of a second)  │    Length     │
                │                               │               │
 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│1 2 3 4 5 6 7 8│
◄───────────────┴───────────────────────────────┴───────────────┘
    2 Bytes                 2 Bytes                   1 Byte

┌───────────────────────────────┐
│                               │
│      Version 3 Length         │
│                               │
│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│
└───────────────────────────────┘
           2 Bytes

Port elections

Bridge Priority, Vlan, Bridge MAC, Port Priority, Port Number

Default settings

Who is the root?

Both bridges initially send BPDUs that list themselves as root.

+--------+                                                                                       +-------+                                                                 
|        |                                                                                       |       |                                                                 
|      1 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8001 ------- 32768 / 1 / 52:54:00:e8:3a:ff / 8001 --+ 1     |                                                                 
|  SW1 2 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8002 ------- 32768 / 1 / 52:54:00:e8:3a:ff / 8002 --+ 2 SW2 |                                                                
|      3 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8003 ------- 32768 / 1 / 52:54:00:e8:3a:ff / 8003 --+ 3     |                                                                 
|        |                                                                                       |       |                                                                 
+--------+                                                                                       +-------+

SW1 wins with 4b. SW1 has the lower MAC address.

32768 / 1 / 52:54:00:4b:99:08 / 8001 < 32768 / 1 / 52:54:00:e8:3a:ff

Setting Bridge priority to zero

Who is the root?

Both bridges initially send BPDUs that list themselves as root.

+--------+                                                                                       +-------+                                                                 
|        |                                                                                       |       |                                                                 
|      1 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8001 ----------- 0 / 1 / 52:54:00:e8:3a:ff / 8001 --+ 1     |                                                                 
|  SW1 2 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8002 ----------- 0 / 1 / 52:54:00:e8:3a:ff / 8002 --+ 2 SW2 |                                                                
|      3 +-- 32768 / 1 / 52:54:00:4b:99:08 / 8003 ----------- 0 / 1 / 52:54:00:e8:3a:ff / 8003 --+ 3     |                                                                 
|        |                                                                                       |       |                                                                 
+--------+                                                                                       +-------+

SW2 wins with 0. SW2 has the lower bridge priority.

32768 / 1 / 52:54:00:4b:99:08 / 8001 > 0 / 1 / 52:54:00:e8:3a:ff

Port Blocking, Port Default

Which ports block?

+-----------+                                                                                       +---------------+                                                                  
|           |                                                                                       |               |                                                                  
|      DP 1 |-- 32768 / 1 / 52:54:00:4b:99:08 / 8001 -----------------------------------------------| 1 RP          |                                                                  
|  SW1 DP 2 |-- 32768 / 1 / 52:54:00:4b:99:08 / 8002 -----------------------------------------------| 2 BLK  SW2    |                                                                  
|      DP 3 |-- 32768 / 1 / 52:54:00:4b:99:08 / 8003 -----------------------------------------------| 3 BLK         |                                                                  
|           |                                                                                       |               |                                                                  
+-----------+                                                                                       +---------------+                                                                  
  • All ports on root bridge are DP.
  • SW2 gets three BPDUs. The best BPDU is on port 1: it carries the lowest sender port ID (8001).
  • SW2 sets the other two ports to BLK.

Port Blocking, Port Priority

Which ports block?

+-----------+                                                                                       +---------------+                                                                  
|           |                                                                                       |               |                                                                  
|      DP 1 |-- 32768 / 1 / 52:54:00:4b:99:08 / 8001 -----------------------------------------------| 1 BLK         |                                                                  
|  SW1 DP 2 |-- 32768 / 1 / 52:54:00:4b:99:08 / 8002 -----------------------------------------------| 2 BLK  SW2    |                                                                  
|      DP 3 |-- 32768 / 1 / 52:54:00:4b:99:08 / 0003 -----------------------------------------------| 3 RP          |                                                                  
|           |                                                                                       |               |                                                                  
+-----------+                                                                                       +---------------+                                                                  
  • All ports on root bridge are DP.
  • SW2 gets three BPDUs. The best BPDU is on port 3: it carries the lowest port priority (0x00).
  • SW2 sets the other two ports to BLK.

Topology Change Notifications (TCNs)

  • A TCN is a kind of BPDU message.
  • There is no root ID or bridge ID.
  • The TCN is sent out the RP.
Spanning Tree Protocol
    Protocol Identifier: Spanning Tree Protocol (0x0000)
    Protocol Version Identifier: Spanning Tree (0)
    BPDU Type: Topology Change Notification (0x80)
  1. Bridge sees change in STP topology, sends TCN to upstream bridge.
  2. Upstream sees TCN, sends a regular BPDU back with the TCA bit set.
  3. Upstream bridge sends TCN upstream, this continues until TCN reaches the root.
  4. Root bridge sees the TCN, sends BPDUs with the TC bit set.
  5. All bridges see TC, and set their MAC address table aging time to the forward delay (15 seconds by default).
  6. Root bridge stops setting TC after max age + forward delay (35 seconds by default).

The Cisco default is to keep a MAC address in CAM for 300 seconds (5 minutes).

Receiving a BPDU with the TC bit set shortens this aging time to the forward delay, usually 15 seconds. This means any server that is not actively sending will age out of the table, and traffic sent to it will be flooded onto that VLAN.

switch# show mac address-table aging-time 
Global Aging Time:  300
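
To make the flooding effect concrete, here is a toy Python model of the table. The host names and idle times are invented; the only real numbers are the 300 second default and the 15 second forward delay.

```python
def surviving_entries(idle_seconds: dict, aging_time: float) -> set:
    """MACs still in the table: those heard from more recently than aging_time."""
    return {mac for mac, idle in idle_seconds.items() if idle < aging_time}


# Seconds since each (hypothetical) host last sent a frame.
idle = {"chatty-client": 5, "quiet-server": 120}

print(surviving_entries(idle, 300))  # normal aging: both stay
print(surviving_entries(idle, 15))   # after a TC: the quiet server is gone
```

Frames for any host that dropped out of the table are flooded until that host talks again.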

Finding TCNs

switch# show spanning-tree vlan 20 detail | s Spanning
 VLAN0020 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 32768, sysid 20, address aabb.cc00.0100
  Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
  Current root has priority 8212, address aabb.cc00.0200
  Root port is 7 (Ethernet1/2), cost of root path is 200
  Topology change flag not set, detected flag not set
  Number of topology changes 8 last change occurred 01:07:20 ago   < ----
          from Ethernet1/2                                         < ----
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15 
  Timers: hello 0, topology change 0, notification 0, aging 300

On the device

switch# show spanning-tree vlan 20 detail | i VLAN|transitions 
 VLAN0020 is executing the rstp compatible Spanning Tree protocol
 Port 2 (Ethernet0/1) of VLAN0020 is designated forwarding 
   Number of transitions to forwarding state: 2
 Port 4 (Ethernet0/3) of VLAN0020 is alternate blocking 
   Number of transitions to forwarding state: 1
 Port 7 (Ethernet1/2) of VLAN0020 is root forwarding 
   Number of transitions to forwarding state: 2
 Port 8 (Ethernet1/3) of VLAN0020 is alternate blocking 
   Number of transitions to forwarding state: 0
 Port 12 (Ethernet2/3) of VLAN0020 is designated forwarding 
   Number of transitions to forwarding state: 2

In the logs

switch# show logging | i %LINK
*Jul  8 04:22:24.660: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 04:22:24.702: %LINK-3-UPDOWN: Interface Ethernet0/1, changed state to up
*Jul  8 04:22:24.715: %LINK-3-UPDOWN: Interface Ethernet0/2, changed state to up
*Jul  8 04:22:24.740: %LINK-3-UPDOWN: Interface Ethernet0/3, changed state to up
*Jul  8 04:22:24.769: %LINK-3-UPDOWN: Interface Ethernet1/0, changed state to up
*Jul  8 04:22:24.794: %LINK-3-UPDOWN: Interface Ethernet1/1, changed state to up
*Jul  8 04:22:24.819: %LINK-3-UPDOWN: Interface Ethernet1/2, changed state to up
*Jul  8 04:22:24.858: %LINK-3-UPDOWN: Interface Ethernet1/3, changed state to up
*Jul  8 04:22:24.888: %LINK-3-UPDOWN: Interface Ethernet2/0, changed state to up
*Jul  8 04:22:24.903: %LINK-3-UPDOWN: Interface Ethernet2/1, changed state to up
*Jul  8 04:22:24.927: %LINK-3-UPDOWN: Interface Ethernet2/2, changed state to up
*Jul  8 04:22:24.942: %LINK-3-UPDOWN: Interface Ethernet2/3, changed state to up
*Jul  8 04:22:24.965: %LINK-3-UPDOWN: Interface Ethernet3/0, changed state to up
*Jul  8 04:22:24.989: %LINK-3-UPDOWN: Interface Ethernet3/1, changed state to up
*Jul  8 04:22:25.013: %LINK-3-UPDOWN: Interface Ethernet3/2, changed state to up
*Jul  8 04:22:25.033: %LINK-3-UPDOWN: Interface Ethernet3/3, changed state to up
*Jul  8 04:22:26.685: %LINK-5-CHANGED: Interface Vlan1, changed state to administratively down
*Jul  8 04:24:58.575: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 04:25:06.138: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 04:26:59.260: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 04:27:11.982: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 04:28:43.205: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 04:31:09.988: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 04:33:53.881: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 04:34:02.140: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 05:00:52.111: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 05:00:59.749: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 05:03:48.728: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 05:03:54.050: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 05:07:04.113: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 05:07:06.713: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 05:07:31.603: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 05:07:36.280: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up
*Jul  8 05:11:32.247: %LINK-3-UPDOWN: Interface Vlan10, changed state to up
*Jul  8 06:35:29.308: %LINK-5-CHANGED: Interface Ethernet0/0, changed state to administratively down
*Jul  8 06:35:43.756: %LINK-3-UPDOWN: Interface Ethernet0/0, changed state to up

References

  1. R. Perlman, Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2nd ed. Boston, MA: Addison-Wesley, 1999.

  2. Layer 2 Configuration Guide, Cisco IOS-XE 17.16.X

802.1Q Frame Format

32 bits added to an Ethernet frame to multiplex VLANs.

                                   ┌────── Priority Code Point(PCP)
                                   │         Used for LAN CoS
                                   │
                                   │   ┌── Drop Eligible Indicator (DEI)
                                   │   │
                                   ▼   ▼
┌───────────────────────────────┬─────┬─┬───────────────────────┐
│    Tag Protocol Identifier    │     │ │                       │
│     (TPID) Set to 0x8100      │ PCP │ │       VLAN ID         │
│                               │     │ │                       │
│1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8│1 2 3│4│5 6 7 8 1 2 3 4 5 6 7 8│
└───────────────────────────────┴─────┴─┴───────────────────────┘
            16 bits                3   1        12 bits

VLAN ID     Purpose
0           reserved for 802.1p
1           default VLAN
2-1001      normal network operations
1002-1005   reserved
1006-4094   extended VLAN range
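
The tag is two 16-bit words, so it packs and unpacks cleanly. A Python sketch with made-up helper names:

```python
import struct

TPID_8021Q = 0x8100


def build_tag(vlan_id: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID, then PCP (3 bits), DEI (1), VLAN ID (12)."""
    if not 0 <= vlan_id <= 4095 or not 0 <= pcp <= 7 or dei not in (0, 1):
        raise ValueError("field out of range")
    tci = (pcp << 13) | (dei << 12) | vlan_id
    return struct.pack(">HH", TPID_8021Q, tci)


def parse_tag(tag: bytes) -> dict:
    """Split a 4-byte 802.1Q tag back into its fields."""
    tpid, tci = struct.unpack(">HH", tag)
    return {"tpid": tpid, "pcp": tci >> 13, "dei": (tci >> 12) & 1, "vlan_id": tci & 0x0FFF}


tag = build_tag(20, pcp=5)
print(tag.hex(), parse_tag(tag))
```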

BPDU Guard

Detects a BPDU on a port and err-disables the port.

  • Only works if the attached device sends a BPDU. Cannot prevent a switch from being attached to a port. 802.1X helps with that.

The global command only affects ports that already have portfast turned on.

switch(config)# spanning-tree portfast bpduguard default

... should be set so access ports go errdisable when a rogue switch is connected and require an operator to correct.

Seeing err-disabled status

switch# show int status

Port      Name               Status       Vlan       Duplex  Speed Type 
[output omitted]
Et2/3                        err-disabled 1            auto   auto unknown
Et3/0                        connected    trunk        auto   auto unknown
Et3/1                        connected    1            auto   auto unknown

Turning on automated recovery

switch(config)# errdisable recovery cause bpduguard

Verify

switch# show errdisable recovery 
ErrDisable Reason            Timer Status
-----------------            --------------
arp-inspection               Disabled
bpduguard                    Enabled

[output omitted]
          
unicast-flood                Disabled
vmps                         Disabled
psp                          Disabled
dual-active-recovery         Disabled
evc-lite input mapping fa    Disabled
Recovery command: "clear     Disabled

Timer interval: 300 seconds

Interfaces that will be enabled at the next timeout:

Interface       Errdisable reason       Time left(sec)
---------       -----------------       --------------
Et2/3                  bpduguard          296

SPAN

Local

monitor session 1 source interface GigabitEthernet1/0/1 both
monitor session 1 destination interface GigabitEthernet1/0/2

RSPAN

  • VLAN Encapsulated.
  • Does not support layer 2 protocols. (CDP, BPDUs)
  • If the source is a trunk port, you can use the filter keyword to select specific vlans.

Source Switch

vlan 3000
 remote-span
monitor session 1 source interface GigabitEthernet1/0/1 both
monitor session 1 destination remote vlan 3000

Destination switch

vlan 3000
 remote-span
monitor session 1 source remote vlan 3000
monitor session 1 destination interface GigabitEthernet1/0/2

ERSPAN

GRE Encapsulated.

These will encapsulate BPDUs and other Layer 2 protocols.

These need ip routing turned on.

These do not support QoS.

Source switch

monitor session 1 type erspan-source
 !
 ! Could also put a vlan here
 !
 source interface Gi2
 destination
  erspan-id 100
  ip address 10.0.12.2
  origin ip address 10.0.12.1
 no shutdown

Destination switch

monitor session 1 type erspan-destination
 destination interface Gi2
 source
  erspan-id 100
  !
  ! An outside address on this box, not a loopback.
  ! this is the de-encapsulation interface.
  !
  ip address 10.0.12.2
 no shutdown

References

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9300/software/release/17-12/configuration_guide/nmgmt/b_1712_nmgmt_9300_cg/configuring_span_and_rspan.html

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9300/software/release/17-12/configuration_guide/nmgmt/b_1712_nmgmt_9300_cg/configuring_erspan.html

https://www.cisco.com/c/en/us/td/docs/iosxr/cisco8000/traffic-mirroring/b-traffic-mirroring-configuration-guide-cisco8k/erspan-overview/restrictions-for-erspan.html

OSPF is IP protocol 89.

Terms

  • IFF: If and only if
  • LSA: Link State Advertisement
  • LSDB: Link-state Database
  • OSPF Process ID: Just where the databases live. Not transmitted. Allows multiple OSPF processes.
  • DR: Designated Router. The network vertex for a broadcast or NBMA network. Used to simplify the number of FULL adjacencies.
  • Advertising Router: The router that created the LSA. The value in this field is the RID.
  • RID: Router ID. A unique 32-bit number to identify the router in a graph. Doesn't have to be an IP on the box, but is usually a loopback.
  • The Update Rule: A router can modify an LSA iff its RID is in the "Advertising Router" field.
  • LS Sequence: Higher sequence numbers are newer LSAs. The first sequence number in any LSA is 0x80000001.
  • LS Checksum: Used to ensure the LSA was transmitted without corruption. Everything is checked except LS Age.
  • LS Age: LSAs time out after an hour (3600 seconds) and are refreshed every 30 minutes. LS Age also increments as the LSA is flooded from router to router.
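
The sequence number is a signed 32-bit value, which is why the first one looks so large in hex. A Python sketch of the "higher is newer" rule. It only looks at the sequence number; the real comparison also falls back to checksum and age when sequence numbers tie. The function names are mine.

```python
INITIAL_SEQ = 0x80000001  # first sequence number of any LSA


def as_signed(seq: int) -> int:
    """Interpret a 32-bit LS sequence number as a signed integer."""
    return seq - (1 << 32) if seq & 0x80000000 else seq


def newer_by_sequence(seq_a: int, seq_b: int) -> str:
    """Return 'a', 'b', or 'same' depending on which sequence number is newer."""
    a, b = as_signed(seq_a), as_signed(seq_b)
    if a == b:
        return "same"
    return "a" if a > b else "b"


print(as_signed(INITIAL_SEQ))
print(newer_by_sequence(0x80000002, 0x80000001))
print(newer_by_sequence(0x80000001, 0x00000001))
```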

Packet Types

Type  Name                              Purpose
1     Hello                             OSPF puts the IDs of the neighbors it hears into its hello messages.
2     Database Description (DBD/DDP)    Used to sync a new neighbor rapidly. Describes the LSDB in bulk. Contains lots of LSA headers, not the full LSAs.
3     Link-State Request (LSR)          The router wants a specific LSA.
4     Link-State Update (LSU)           The neighbor sends a specific LSA.
5     Link-State Acknowledgment (LSAck) To confirm a device got the intended LSAs, it sends the headers of those LSAs back to the sender.

These can be thought of as the five steps.

  1. We say hello, using each other's names, to confirm we can both hear one another.
  2. We share state (like the weather).
  3. I ask how something went.
  4. You tell me how it went.
  5. To make sure I really got it, I'll repeat it word-for-word.

Hello Packets

These things must match for an adjacency to form

  • Subnet
  • Subnet mask
  • Interface MTU (carried in the DBD packets, not the hello, so a mismatch shows up after 2-Way)
  • Area
  • Area flags (NSSA, Stub)
  • Is DR/BDR enabled
  • Authentication
  • Hello time
  • Dead time

These must not match

  • Router ID

Check with debug ip ospf event
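
The two lists above boil down to a field-by-field comparison. A Python sketch that reports why two routers would refuse to become neighbors. The dictionary keys are my own labels for the items in the lists, and the sample values are made up apart from the mask and timers, which match the capture below.

```python
MUST_MATCH = ("subnet", "mask", "mtu", "area", "area_flags",
              "auth", "hello_interval", "dead_interval")


def adjacency_problems(local: dict, remote: dict) -> list:
    """List every parameter that would stop the adjacency from forming."""
    problems = [key for key in MUST_MATCH if local[key] != remote[key]]
    if local["router_id"] == remote["router_id"]:
        problems.append("router_id (duplicate)")
    return problems


r1 = {"router_id": "1.1.1.1", "subnet": "10.0.0.0", "mask": "255.255.255.0",
      "mtu": 1500, "area": 0, "area_flags": "normal", "auth": None,
      "hello_interval": 10, "dead_interval": 40}
r2 = dict(r1, router_id="2.2.2.2")
r3 = dict(r1, router_id="3.3.3.3", hello_interval=5, dead_interval=20)

print(adjacency_problems(r1, r2))
print(adjacency_problems(r1, r3))
```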

On a broadcast network, one multicast hello packet acknowledges multiple neighbors.

Ethernet II, Src: aa:bb:cc:00:4b:00 (aa:bb:cc:00:4b:00), Dst: IPv4mcast_05 (01:00:5e:00:00:05)
Internet Protocol Version 4, Src: 10.0.0.6, Dst: 224.0.0.5
Open Shortest Path First
  OSPF Header
  OSPF Hello Packet
      Network Mask: 255.255.255.0
      Hello Interval [sec]: 10
      Options: 0x12, (L) LLS Data block, (E) External Routing
      Router Priority: 1
      Router Dead Interval [sec]: 40
      Designated Router: 10.0.0.2
      Backup Designated Router: 10.0.0.1
      Active Neighbor: 1.1.1.1
      Active Neighbor: 2.2.2.2
      Active Neighbor: 3.3.3.3
      Active Neighbor: 4.4.4.4
      Active Neighbor: 5.5.5.5

OSPF Adjacency State Machine

State     Description
Down      OSPF is running, no hello packets received yet.
Attempt   NBMA mode only. The router has sent hellos to a configured neighbor but has heard nothing back.
Init      The router sees hello packets, but its own router ID is not in them yet.
2-Way     The router sees its own router ID in the hello packet.
ExStart   Routers decide who leads the LSDB exchange (master/slave).
Exchange  Routers describe their LSDBs to each other with DBD packets.
Loading   DBDs have been exchanged, router is requesting specific LSAs.
Full      LSDBs for this area are identical on both sides.

DR and BDR

OSPF uses explicit acknowledgments (re-sending the LSAs), so as neighbors and adjacencies grow, the amount of OSPF traffic on a network increases.

A network with six OSPF routers forming a full mesh requires 15 adjacencies, n(n-1)/2.

To mitigate the scaling problem, on broadcast segments OSPF elects a DR, and BDR, to maintain the LSDB.
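
To put numbers on the scaling problem, here is a minimal sketch that counts adjacencies with and without a DR/BDR. The function names are made up for this example, and each neighbor pair is counted once.

```python
def full_mesh_adjacencies(n: int) -> int:
    """Adjacencies when every router pairs with every other router."""
    return n * (n - 1) // 2


def dr_bdr_adjacencies(n: int) -> int:
    """Full adjacencies on a segment that elected a DR and a BDR.

    Each of the n - 2 DROTHERs pairs with the DR and with the BDR,
    plus one adjacency between the DR and the BDR themselves.
    """
    if n < 2:
        return 0
    return 2 * (n - 2) + 1


for routers in (3, 6, 20):
    print(routers, full_mesh_adjacencies(routers), dr_bdr_adjacencies(routers))
```

The gap grows quickly: the full mesh grows with the square of the router count, the DR/BDR design grows linearly.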

The RFC models the segment itself as a "network vertex" in the graph. The DR is the router that speaks for that vertex.

  • All routers listen for hello on 224.0.0.5
  • DR floods LSAs to the routers with 224.0.0.5
  • DROTHER talks to the DR/BDR on 224.0.0.6

In the diagram (from the RFC), everything connects to N2, so problem solved.

                                    **FROM**
                +---+      +---+
                |RT3|      |RT4|              |RT3|RT4|RT5|RT6|N2 |
                +---+      +---+        *  ------------------------
                  |    N2    |          *  RT3|   |   |   |   | X |
            +----------------------+    T  RT4|   |   |   |   | X |
                  |          |          O  RT5|   |   |   |   | X |
                +---+      +---+        *  RT6|   |   |   |   | X |
                |RT5|      |RT6|        *   N2| X | X | X | X |   |
                +---+      +---+

                          Broadcast or NBMA networks

See OSPF LSAs to see what the actual contents of the LSAs are.

The DR

Forms full adjacencies.

R1# show ip ospf neighbor 

Neighbor ID     Pri   State           Dead Time   Address         Interface
2.2.2.2          50   FULL/BDR        00:00:31    10.0.0.2        Ethernet0/0
3.3.3.3           1   FULL/DROTHER    00:00:37    10.0.0.3        Ethernet0/0
4.4.4.4           1   FULL/DROTHER    00:00:34    10.0.0.4        Ethernet0/0
5.5.5.5           1   FULL/DROTHER    00:00:32    10.0.0.5        Ethernet0/0
6.6.6.6           1   FULL/DROTHER    00:00:31    10.0.0.6        Ethernet0/0
  • The election is not preemptive, so the first router online on the segment usually ends up as the DR.

DROther

  • Only forms full adjacencies with the DR, and BDR.
  • When it sends LSAs, sends them to the DR/BDR via 224.0.0.6.

Network LSAs

These are sent by the DR to describe the routers on this segment.

See OSPF LSAs for the actual contents of the LSA.

Identical Databases

Each router computes its own shortest-path tree (SPT) with Dijkstra's algorithm.

LSAs are flooded throughout an area, so all routers in the same area should have the same LSAs and the same database.
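
A rough sketch of why identical databases matter: every router runs the same Dijkstra calculation over the same graph, only with itself as the root. The `lsdb` dictionary and its costs below are invented for illustration; a real LSDB is built from router and network LSAs.

```python
import heapq


def shortest_path_costs(graph, root):
    """Return the lowest total cost from root to every reachable node."""
    costs = {root: 0}
    queue = [(0, root)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, link_cost in graph.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs


# Hypothetical four-router area: {router: {neighbor: outgoing cost}}
lsdb = {
    "R1": {"R2": 10, "R3": 1},
    "R2": {"R1": 10, "R3": 1, "R4": 10},
    "R3": {"R1": 1, "R2": 1},
    "R4": {"R2": 10},
}

print(shortest_path_costs(lsdb, "R1"))
print(shortest_path_costs(lsdb, "R4"))
```

Both calls read the same `lsdb`. If R4 held a different copy, the two trees could disagree about the path and loop traffic.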

R1# show ip ospf database database-summary  | s Area 0
Area 0 database summary
  LSA Type      Count    Delete   Maxage
  Router        5        0        0       
  Network       5        0        0       
  Summary Net   8        0        0       
  Summary ASBR  2        0        0       
  Type-7 Ext    0        0        0       
    Prefixes redistributed in Type-7  0
  Opaque Link   0        0        0       
  Opaque Area   0        0        0       
  Subtotal      20       0        0
R2# show ip ospf database database-summary | s Area 0
Area 0 database summary
  LSA Type      Count    Delete   Maxage
  Router        5        0        0       
  Network       5        0        0       
  Summary Net   8        0        0       
  Summary ASBR  2        0        0       
  Type-7 Ext    0        0        0       
    Prefixes redistributed in Type-7  0
  Opaque Link   0        0        0       
  Opaque Area   0        0        0       
  Subtotal      20       0        0

Can also check with checksums

show ip ospf | i Checksum

LSAs

The Router ID is what is used to build the SPT. It's very important it's both

  • Correct
  • Easy to identify the router
  +-------------------------+ Three fields to differentiate LSAs
  |         LS Age          |     - LS Type
  +-------------------------+     - Link State ID
  |  Options      LS Type   |     - Advertising Router
  +-------------------------+
  |     Link State ID       |  < -- Unique number from the Advertising Router for Each LSA
  +-------------------------+
  |   Advertising Router    |  < -- Router ID
  +-------------------------+
  |    LS Sequence Number   |  < -- How old the LSA is. LSAs with higher numbers are updates to older LSAs
  +-------------------------+
  |      LS Checksum        |
  +-------------------------+
  |        Length           |
  +-------------------------+
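
As an illustration of the header above, this sketch unpacks the 20-byte LSA header and compares two instances of the same LSA. The class and helper names are made up, and the sample bytes are built by hand rather than captured from a router.

```python
import ipaddress
import struct
from typing import NamedTuple

# LS age (2 bytes), Options (1), LS type (1), Link State ID (4),
# Advertising Router (4), LS sequence number (4, signed), checksum (2), length (2)
LSA_HEADER = struct.Struct("!HBB4s4siHH")


class LsaHeader(NamedTuple):
    age: int
    options: int
    ls_type: int
    link_state_id: str
    advertising_router: str
    sequence: int
    checksum: int
    length: int

    def key(self):
        """The three fields that differentiate one LSA from another."""
        return (self.ls_type, self.link_state_id, self.advertising_router)


def parse_lsa_header(data: bytes) -> LsaHeader:
    age, options, ls_type, lsid, adv, seq, checksum, length = LSA_HEADER.unpack(data[:20])
    return LsaHeader(age, options, ls_type,
                     str(ipaddress.IPv4Address(lsid)),
                     str(ipaddress.IPv4Address(adv)),
                     seq, checksum, length)


def to_signed(value: int) -> int:
    """Convert a hex sequence number such as 0x8000007B to signed 32-bit."""
    return value - (1 << 32) if value & 0x80000000 else value


def sample_header(sequence_hex: int) -> bytes:
    """Hand-built router LSA header for 1.1.1.1 (illustrative values only)."""
    router = ipaddress.IPv4Address("1.1.1.1").packed
    return LSA_HEADER.pack(32, 0x20, 1, router, router,
                           to_signed(sequence_hex), 0x1A77, 36)


old = parse_lsa_header(sample_header(0x8000007B))
new = parse_lsa_header(sample_header(0x8000007C))
print(old.key() == new.key(), new.sequence > old.sequence)
```

The sequence number is treated as a signed integer, which is why 0x8000007C counts as newer than 0x8000007B.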

OSPF Hierarchy

OSPF has four levels of routing hierarchy

  • O - Intra-area (same area)
  • O IA - Inter-area (same OSPF domain)
  • E1 - External type 1 (to an attached but non-OSPF domain; the metric adds the internal cost to reach the ASBR)
  • E2 - External type 2 (to the Internet; the metric is the external cost only)

The E bit in the external LSA is what makes E1 and E2 routes. When the bit is set the route is E2, which is less preferred than E1.
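
A small sketch of how the two external metric types differ when a router works out the cost of an external route. The function name and the numbers are assumptions for illustration.

```python
def external_route_cost(metric_type: int, external_metric: int, cost_to_asbr: int) -> int:
    """Cost a router assigns to an external route.

    Type 1 adds the internal cost to reach the ASBR.
    Type 2 uses the external metric alone.
    """
    if metric_type == 1:
        return external_metric + cost_to_asbr
    if metric_type == 2:
        return external_metric
    raise ValueError("metric_type must be 1 or 2")


# Same external metric seen from a router that is 30 away from the ASBR.
print(external_route_cost(1, 20, 30))
print(external_route_cost(2, 20, 30))
```

With type 2 every router in the domain sees the same cost for the prefix, no matter how far it is from the ASBR.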

Code     Number  RFC Name         Purpose                      Description
O        1       Router-LSA       interfaces on a router       Flooded, single area, never crosses an area boundary.
O        2       Network-LSA      routers on a network         Flooded, single area, only sent by the DR.
O IA     3       Summary-LSA      networks in other areas      ABRs send these to describe routes to networks.
E1, E2   4       Summary-LSA      next-hop to an ASBR          ABRs send these to describe routes to AS boundary routers.
E1, E2   5       AS-external-LSA  routes to E1 or E2 networks  ASBRs send these to describe routes outside the AS.
N1, N2   7       Type-7 LSA       NSSA summaries               NSSA ASBRs send these to describe routes outside the AS.

Type 5 LSAs

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |     Options   |      5        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Network Mask                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |E|     0       |                  metric                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Forwarding address                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      External Route Tag                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |E|    TOS      |                TOS  metric                    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Forwarding address                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Default Route

OSPF has two ways of originating a default route.

default-information originate: advertise a default route only if one is already present in the routing table.

default-information originate always: advertise a default route whether or not one is present.

Cost

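By default OSPF uses a reference bandwidth of 100 Mbps, so all links at or above 100 Mbps get the same cost. Raise the reference bandwidth (the value is in Mbps) to tell faster links apart.

auto-cost reference-bandwidth 40000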
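
A minimal sketch of the cost arithmetic, assuming the usual rule of reference bandwidth divided by interface bandwidth, rounded down, with a floor of 1 and a 16-bit ceiling. The function name is invented for this example.

```python
def ospf_cost(interface_mbps: float, reference_mbps: float = 100) -> int:
    """Interface cost = reference bandwidth / interface bandwidth, clamped to 1..65535."""
    cost = int(reference_mbps / interface_mbps)
    return max(1, min(cost, 65535))


# Default reference bandwidth: 100 Mbps and 1 Gbps are indistinguishable.
print(ospf_cost(10), ospf_cost(100), ospf_cost(1000))

# With a 40000 Mbps reference the faster links separate again.
print(ospf_cost(100, 40000), ospf_cost(1000, 40000), ospf_cost(40000, 40000))
```

Whatever value is chosen, it has to be the same on every router, otherwise the routers disagree about which path is cheapest.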

Network Types

OSPF Representation of routers and networks

CLI                                  Network Type               LSA Type 1 or 2  Use-case
ip ospf network broadcast            Broadcast                  2 - DR Election  Ethernet, Token Ring, FDDI
ip ospf network non-broadcast        NBMA [1]                   2 - DR Election  X.25, frame-relay, ATM. Requires a full-mesh.
ip ospf network point-to-point       point-to-point             1 - No DR        Serial links, Unnumbered, TDM, HDLC, PPP (Full Adjacency)
ip ospf network point-to-multipoint  Hub and spoke on Ethernet  1 - No DR        Hub and Spoke Topologies, like DMVPN or Frame Relay

[1] RFC compliant (??) implementation. For actual NBMA networks use ip ospf network point-to-multipoint.

[2] The DR (which should be the hub or bad things happen) needs to have static neighbor statements.

RFC 2328                     OSPF Version 2                   April 1998

                                                  **FROM**

                                           *      |RT1|RT2|
                +---+Ia    +---+           *   ------------
                |RT1|------|RT2|           T   RT1|   | X |
                +---+    Ib+---+           O   RT2| X |   |
                                           *    Ia|   | X |
                                           *    Ib| X |   |

                     Physical point-to-point networks


                                                  **FROM**
                      +---+                *
                      |RT7|                *      |RT7| N3|
                      +---+                T   ------------
                        |                  O   RT7|   |   |
            +----------------------+       *    N3| X |   |
                       N3                  *

                              Stub networks

                                                  **FROM**
                +---+      +---+
                |RT3|      |RT4|              |RT3|RT4|RT5|RT6|N2 |
                +---+      +---+        *  ------------------------
                  |    N2    |          *  RT3|   |   |   |   | X |
            +----------------------+    T  RT4|   |   |   |   | X |
                  |          |          O  RT5|   |   |   |   | X |
                +---+      +---+        *  RT6|   |   |   |   | X |
                |RT5|      |RT6|        *   N2| X | X | X | X |   |
                +---+      +---+

                          Broadcast or NBMA networks

Area summary

These will show up as an O IA route in OSPF, and as a route to Null0 on the ABR.

  • Requires at least one route inside the range to be present in the RIB.

v4 example.

router ospf 1
 router-id 2.2.2.2
 area 1 range 10.0.0.0 255.255.224.0

v6 example.

router ospfv3 1
 !
 address-family ipv6 unicast
  area 1 range 2001:DB8::/56
 exit-address-family
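
To check which prefixes a range statement would cover, a quick sketch with Python's `ipaddress` module. The helper name and the test prefixes are examples only; the two ranges are the ones from the configs above.

```python
import ipaddress


def covered_by(summary, prefix: str) -> bool:
    """True if prefix falls inside the summary range (same address family)."""
    net = ipaddress.ip_network(prefix)
    return net.version == summary.version and net.subnet_of(summary)


v4_range = ipaddress.ip_network("10.0.0.0/255.255.224.0")  # same as 10.0.0.0/19
v6_range = ipaddress.ip_network("2001:db8::/56")

for prefix in ("10.0.10.0/24", "10.0.20.0/24", "10.0.32.0/24"):
    print(prefix, covered_by(v4_range, prefix))

for prefix in ("2001:db8:0:10::/64", "2001:db8:0:20::/64", "2001:db8:1::/64"):
    print(prefix, covered_by(v6_range, prefix))
```

Prefixes that fall outside the range are still advertised individually as their own summary LSAs.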

Route-Filtering

You can use the same command to exclude these routes from the backbone, via the not-advertise keyword.

Using range

With not-advertise, the area range command acts as a route filter.

v4 example.

router ospf 1
 router-id 2.2.2.2
 area 1 range 10.0.0.0 255.255.224.0 not-advertise

v6 example.

router ospfv3 1
 !
 address-family ipv6 unicast
  area 1 range 2001:DB8::/56 not-advertise
 exit-address-family

Using filter-lists

These are a bit harder to use. The in and out keywords mean into and out of the named area.

For this topology

             Area 0                               Area 1               
                                                               
                                 |           10.0.10.0/24            
                                 |         2001:db8:0:10::/64        
                                 |                            +----+ 
                              +----+       +------------------+ R3 | 
+----+                        |    +-------+                  +----+ 
| R1 +------------------------+ R2 |                        
+----+                        |    +------+     
             10.0.0.0/24      +----+      |                   +----+ 
           2001:db8:0:0::/64     |        +-------------------+ R4 | 
                                 |           10.0.20.0/24     +----+ 
                                 |         2001:db8:0:20::/64        

v4

ip prefix-list PREFIX_LIST_LOOPBACK_v4 seq 10 deny 1.1.1.1/32
ip prefix-list PREFIX_LIST_LOOPBACK_v4 seq 20 deny 2.2.2.2/32
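ip prefix-list PREFIX_LIST_LOOPBACK_v4 seq 30 deny 3.3.3.3/32
! Permit everything else. Without this the implicit deny filters every prefix.
ip prefix-list PREFIX_LIST_LOOPBACK_v4 seq 40 permit 0.0.0.0/0 le 32
!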
router ospf 1
 area 0 filter-list prefix PREFIX_LIST_LOOPBACK_v4 in
 area 1 filter-list prefix PREFIX_LIST_LOOPBACK_v4 in

v6

!
ipv6 prefix-list PREFIX_LIST_v6 seq 10 deny FD::1/128
ipv6 prefix-list PREFIX_LIST_v6 seq 20 deny FD::3/128
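ipv6 prefix-list PREFIX_LIST_v6 seq 30 deny FD::4/128
! Permit everything else. Without this the implicit deny filters every prefix.
ipv6 prefix-list PREFIX_LIST_v6 seq 40 permit ::/0 le 128
!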
router ospfv3 1
 !
 address-family ipv6 unicast
  area 0 filter-list prefix PREFIX_LIST_v6 in
  area 1 filter-list prefix PREFIX_LIST_v6 in
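
Prefix lists are evaluated first match wins, with an implicit deny at the end. A rough sketch of that logic for IPv4 follows. The tuple layout and function name are assumptions for this example, and only the `le` option is modelled.

```python
import ipaddress


def prefix_list_permits(entries, prefix: str) -> bool:
    """Evaluate entries in order; the first matching entry decides.

    entries: list of (action, network, le) tuples, le may be None.
    Without le the prefix length must match exactly.
    """
    candidate = ipaddress.ip_network(prefix)
    for action, network, le in entries:
        net = ipaddress.ip_network(network)
        max_len = le if le is not None else net.prefixlen
        if candidate.subnet_of(net) and candidate.prefixlen <= max_len:
            return action == "permit"
    return False  # implicit deny


deny_only = [
    ("deny", "1.1.1.1/32", None),
    ("deny", "2.2.2.2/32", None),
    ("deny", "3.3.3.3/32", None),
]
with_permit = deny_only + [("permit", "0.0.0.0/0", 32)]

print(prefix_list_permits(deny_only, "10.0.10.0/24"))
print(prefix_list_permits(with_permit, "10.0.10.0/24"))
print(prefix_list_permits(with_permit, "1.1.1.1/32"))
```

A list made only of deny entries blocks everything, which is why a trailing permit entry is usually needed when the goal is to filter a few loopbacks.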

The Problem

A customer with L3VPN service via OSPF-BGP-VPNv4 decides to connect two sites together via an OSPF backdoor, a direct connection they manage themselves.

When they turn on their private OSPF peering, all the traffic between these two sites now prefers the new link over the L3VPN cloud.

The Solution: Sham Links

Sham links are needed because the routes provided by an L3VPN are O IA. When the OSPF backdoor link comes up it will be preferred for two reasons:

  • OSPF has a lower AD than BGP.
  • O routes are preferred over O IA

A sham link makes two PE routers at different sites in the same customer VRF form an intra-area connection.

From OSPF Sham-Link Support for MPLS VPN - Cisco.

Before you create a sham-link between PE routers in an MPLS VPN, you must:

  • Configure a new interface with a /32 address on the remote PE so that OSPF packets can be sent over the VPN backbone to the remote end of the sham-link. The /32 address must meet the following criteria:
    • Belong to a VRF
    • Not be advertised by OSPF
    • Be advertised by BGP
    • You can use the /32 address for other sham-links

References

https://datatracker.ietf.org/doc/html/rfc2328

Type 1 and Type 2 describe what's inside an area.

Type 1 - Here are my links.

Type 2 - Here is my attached network.

Type 1 - Router

DR

R1# show ip ospf database router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

  LS age: 32
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 1.1.1.1
  Advertising Router: 1.1.1.1
  LS Seq Number: 8000007B
  Checksum: 0x1A77
  Length: 36
  Number of Links: 1

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 10.0.0.1
     (Link Data) Router Interface address: 10.0.0.1
      Number of MTID metrics: 0
       TOS 0 Metrics: 10

DROther

R4#show ip ospf database router 4.4.4.4

            OSPF Router with ID (4.4.4.4) (Process ID 1)

                Router Link States (Area 0)

  LS age: 135
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 4.4.4.4
  Advertising Router: 4.4.4.4
  LS Seq Number: 8000007C
  Checksum: 0x5D18
  Length: 36
  Number of Links: 1

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 10.0.0.1
     (Link Data) Router Interface address: 10.0.0.4
      Number of MTID metrics: 0
       TOS 0 Metrics: 10

DR describing the network

Type 2 - Network

R4# show ip ospf database network 

            OSPF Router with ID (4.4.4.4) (Process ID 1)

                Net Link States (Area 0)

  LS age: 183
  Options: (No TOS-capability, DC)
  LS Type: Network Links
  Link State ID: 10.0.0.1 (address of Designated Router)
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000002
  Checksum: 0x4481
  Length: 48
  Network Mask: /24
        Attached Router: 1.1.1.1
        Attached Router: 2.2.2.2
        Attached Router: 3.3.3.3
        Attached Router: 4.4.4.4
        Attached Router: 5.5.5.5
        Attached Router: 6.6.6.6

Theory

  • BGP works on the premise that if a router sees its own AS number in the AS path, it must be a loop.
  • The default keepalive timer is 60 seconds and the hold time is 180 seconds. This means worst-case is 3 minutes to fail-over.
  • BGP aggregate-address only works if at least one more-specific prefix inside the aggregate range is already in the BGP table.
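
The loop check in the first bullet can be sketched in a few lines. The function names and AS 65001 are made up for the example; the path is the one used in the AS path example in these notes.

```python
def accepts_update(local_as: int, as_path: list[int]) -> bool:
    """Reject a route whose AS path already contains our own AS number."""
    return local_as not in as_path


def advertise_to_ebgp_peer(local_as: int, as_path: list[int]) -> list[int]:
    """Prepend our AS on the left before sending to an external peer."""
    return [local_as] + as_path


path = [7018, 701, 15]
print(accepts_update(65001, path))  # AS 65001 is not in the path
print(accepts_update(701, path))    # AS 701 would be looping its own route
print(advertise_to_ebgp_peer(65001, path))
```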

Working with BGP

  • Only consider traffic in one direction at a time
  • Accepting a route will affect outgoing traffic
  • Advertising a route will affect incoming traffic
  • Filter out everything except the routes needed
  • BGP DOES NOT LOAD BALANCE (by default it installs a single best path)

On Cisco IOS, bgp soft-reconfig-backup tells the router "if you must, save an entire table"; otherwise rely on route refresh (RFC 2918), which requests the updates dynamically.

Soft reconfig is ancient, pre-RFC.

Soft Reconfig via Route Refresh (trusting the other device)!

clear ip bgp <neighbor_ip> soft in [1]

[1] https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/iproute_bgp/configuration/xe-16/irg-xe-16-book/bgp-4-soft-configuration.html

Example of a BGP AS Path

These read left to right like a book. The leftmost AS is the most recent one, so this prefix was most recently advertised by AS 7018 and originated in AS 15.

7018 701 15 i
            ^ this means IGP, and AS 15 has an IGP route for it like OSPF or EIGRP

BGP Best Path Selection

- Higher Weight                                       
- Higher Local Preference                            
- Locally Originated                                 (Network or Aggregate Command)
- Shortest AS-PATH
- Lowest Origin Type                                 (IGP > EGP > Incomplete)
- Lowest MED                                         (Neighbor ASes must be the same)
- Prefer eBGP > Confederated eBGP > iBGP
- Prefer path with lowest IGP metric to next hop
- Determine if multipath is enabled
  - Prefer external path which is oldest
  - Prefer path from router with lower ID
  - Prefer path with shorter cluster length
  - Prefer path from lowest neighbor address

Cisco - Select BGP Best Path Algorithm
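
The first steps of the list above can be sketched as a sort key. This is a simplification and not the real implementation: it compares MED across all paths, ignores confederations, path age and cluster length, and every name in it is made up for the example.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path_len: int = 0
    origin: str = "igp"
    med: int = 0
    is_ebgp: bool = False
    igp_metric: int = 0
    router_id: str = "0.0.0.0"


def preference_key(p: Path):
    """Smaller tuples win; 'higher is better' fields are negated."""
    return (
        -p.weight,
        -p.local_pref,
        not p.locally_originated,
        p.as_path_len,
        ORIGIN_RANK[p.origin],
        p.med,
        not p.is_ebgp,
        p.igp_metric,
        tuple(int(octet) for octet in p.router_id.split(".")),
    )


def best_path(paths):
    return min(paths, key=preference_key)


short = Path("short", as_path_len=1)
preferred = Path("preferred", local_pref=200, as_path_len=5)
print(best_path([short, preferred]).name)
```

Local preference is checked before AS path length, so the longer path wins here.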

BGP Path Attributes

RFC 4271 - BGP-4

  • Well-known mandatory
  • Well-known discretionary
  • Optional transitive
  • Optional nontransitive

Path Attribute    Category
Origin            Mandatory
AS_PATH           Mandatory
NEXT_HOP          Mandatory
LOCAL_PREF        Discretionary
ATOMIC_AGGREGATE  Discretionary
AGGREGATOR        Optional Transitive
COMMUNITY         Optional Transitive
MULTI_EXIT_DISC   Optional Non-Transitive
ORIGINATOR_ID     Optional Non-Transitive
CLUSTER_LIST      Optional Non-Transitive

Origin

IGP > EGP > Incomplete

  • IGP means it came from an IGP. This is the highest preference.
  • Incomplete means it's likely a redistributed route

Next Hop

  1. eBGP, routers in different AS, destination outside AS. The Next hop will be the advertising router.
  2. iBGP, routers in same AS, destination inside AS. The Next hop will be the advertising router.
  3. iBGP, routers in same AS, destination outside AS. The Next hop is the external peer who advertised the address.

When the third option happens, routers inside the AS may have no route to that external next hop. Two fixes:

  • Advertise into the IGP the external links to the BGP peers.
  • Tell the AS border router to change the next hop to its own IP address. [next-hop-self]

LOCAL_PREF

  • Controls outgoing traffic.
  • Only shared between iBGP peers, used to determine the exit. Higher is better.

MULTI_EXIT_DISC

  • Controls incoming traffic.
  • Lower is better

ATOMIC_AGGREGATE

BGP can aggregate smaller prefixes into larger ones even if a smaller prefix comes from a different AS.

A router in AS 105 gets these prefixes from its peers.

192.168.0.0/24 (123 204)
192.168.1.0/24 (123 205)

If the administrator chooses, they can aggregate this, but lose path information.

192.168.0.0/23 (105) ATOMIC_AGGREGATE. 

Downstream peers can not remove this tag

AGGREGATOR

AS and Router ID of the BGP router that did the atomic aggregation.

COMMUNITY

Usually used to tag routes from a specific customer.

Tag           Purpose
INTERNET      Default community.
NO_EXPORT     Do not share with other ASes
NO_ADVERTISE  Do not share with other routers
LOCAL_AS      Do not share outside the local AS (or confederation sub-AS)

ORIGINATOR_ID

For route reflectors. The route reflector puts the originating router's Router_ID here. If a router sees its own Router_ID in this attribute, it knows a loop has occurred.

CLUSTER_LIST

  • For route reflectors
  • The sequence of Cluster_IDs (by default the reflector's Router_ID) through which the route has passed. If a route reflector sees its own Cluster_ID, a loop has occurred.

WEIGHT

  • Cisco specific & this router only
  • Routes learned are 0
  • Locally generated routes are 32768

Route Reflectors

An RR will not change the existing attributes of a route, such as NEXT_HOP. It only adds ORIGINATOR_ID and CLUSTER_LIST.

  • If a route is learned from a non-client iBGP peer, reflect to clients
  • If a route is learned from a client, reflect to everyone
  • If a route is learned from an eBGP peer, reflect to everyone

Only the route reflector is aware of the reflecting. The clients are dumb

If you configure route reflectors as a cluster you must manually configure the cluster_ID

On older IOS releases BGP will summarize to classful boundaries by default.

Use no auto-summary.

Using redistribute under BGP will make the resulting route show up with an origin code of incomplete.

Sending a default route

neighbor A.B.C.D default-originate

To get iBGP routers to update the next-hop to be themselves when advertising to other iBGP routers use

neighbor A.B.C.D next-hop-self

This makes it so other iBGP routers don't need reachability information for the physical link to the next AS.

BGP Finite State Machine

  • Idle - check the config
  • Connect - TCP is probably broken
  • Active - Listening for TCP
  • OpenSent
  • OpenConfirm
  • Established

Fixing next-hop issues

Just because the route shows up in show ip bgp doesn't mean it will install. BGP needs to be able to reach the next-hop.

  1. Add the transit routes to the IGP.
  2. Use next-hop-self in BGP.
  3. Use a route-map to set the next hops.

Route Reflection

Terms

  • Cluster List - Cluster IDs (by default the Router ID) of the route reflectors the route has passed through. Used to prevent loops between RRs.
  • Originator - Router ID of the route reflector peer that originated the route. Used to prevent loops between clients.

Three rules for route reflectors

  • If the route is received from a non-client peer, reflect to clients only.
  • If the route is received from a client peer, reflect to non-client peers, and client peers.
  • If the route is received from an EBGP peer, reflect to all client and non-client peers.

Notes

  • Route reflectors can be clients of each other. This causes extra overhead.
  • If multiple route reflectors serve the same cluster they should have the same Cluster_ID.

BGP Route Reflectors Loop Prevention

  • If a BGP router receives a route from an iBGP neighbor and detects its own Router-ID in the Originator-ID attribute of the incoming update, it will reject the update.
  • If a BGP router configured as a route reflector receives a route from an iBGP neighbor and detects its own Cluster-ID in the Cluster-list attribute of the incoming update, it will reject the update.
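
The three reflection rules and the two loop checks can be written down as a short sketch. The function names and the string labels for peer types are invented for this example.

```python
def reflect_targets(learned_from: str) -> set[str]:
    """Which iBGP peer classes a route reflector passes a route on to."""
    if learned_from == "non-client":
        return {"client"}
    if learned_from in ("client", "ebgp"):
        return {"client", "non-client"}
    raise ValueError(f"unknown peer type: {learned_from}")


def accepts_ibgp_update(my_router_id: str, my_cluster_id: str,
                        originator_id: str, cluster_list: list[str],
                        is_route_reflector: bool) -> bool:
    """Apply the ORIGINATOR_ID and CLUSTER_LIST loop checks."""
    if originator_id == my_router_id:
        return False  # our own route came back to us
    if is_route_reflector and my_cluster_id in cluster_list:
        return False  # the route already passed through our cluster
    return True


print(reflect_targets("non-client"))
print(accepts_ibgp_update("1.1.1.1", "1.1.1.1", "4.4.4.4", ["2.2.2.2"], True))
print(accepts_ibgp_update("1.1.1.1", "1.1.1.1", "4.4.4.4", ["2.2.2.2", "1.1.1.1"], True))
```

A plain client skips the cluster check, which is why clients cannot spot loops between clusters.
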
Confederations

NEXT_HOP is preserved throughout the confederation.

MED is preserved for routes advertised into the confederation

LOCAL_PREF is preserved throughout the confederation

AS_PATH for private ASes is used within the confederation

Force interior confederation MEDs to be considered:

bgp bestpath med confed

Route Reflectors are generally preferred.

If you want to add two BGP speakers to the same route reflector cluster, specify the cluster ID.

  • Clients cannot detect inter-cluster loops. They don't have the attributes in the BGP table.

BGP redistribution into anything

EIGRP Terminology

  • Successor route: The current best path, with the smallest metric. The "successful" route.

  • Successor: The first next-hop router for the successor route.

  • Feasible distance (FD): Lowest metric to reach a subnet. The sum of the RD + local cost.

  • Reported distance (RD): The metric inside a route update from another router. The sending router included its FD, which becomes our RD.

  • Feasibility condition: A path only counts as a loop-free backup if its RD is strictly less than the current FD.

  • Feasible successor: A route that satisfies the feasibility condition and is maintained as a backup route.

  • Split Horizon: Never advertise a network, out the same interface it was learned on.

  • Poison Reverse: If you must advertise a network out the same interface it was received on, advertise the delay as infinity.

Example.

R2 sends an update

  • 10.0.0.0/24 - RD is 2000

R3 Sends an update

  • 10.0.0.0/24 - RD is 2050

R1 calculates total path metric.

  • R2 is 2000 + 1000 = 3000.
  • R3 is 2050 + 50 = 2100. < - Successor route.

R1 sees that R2's reported distance (2000) is less than the current feasible distance (2100), so it installs the route via R2 as the feasible successor.

+--------+            1000             +--------+    10.0.0.0/24      
|   R1   +-----------------------------+   R2   +---------------------                           
+-----+--+                             +-+------+      2000                                                                               
      |            +--------+            |                            
      +------------+   R3   +------------+                                                                                                        
         50        +--------+      50

Example with the EIGRP topology table

R1# show ip eigrp topology 10.0.0.0/24
EIGRP-IPv4 Topology Entry for AS(1)/ID(1.1.1.1) for 10.0.0.0/24
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2100
P 10.0.0.0/24, 1 successors, FD is 2100                <--- Feasible Distance
        via 10.0.13.3 (2100/2050), GigabitEthernet0/3  <--- Successor Route
        via 10.0.12.2 (3000/2000), GigabitEthernet0/2  <--- Feasible Successor
                       |     |
                       |     +-- Reported Distance 
                       +-------- Path Metric
        
                                                             (RD 2000 < FD 2100)
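
The same example as a sketch in code, using the numbers from the diagram. The function name and the dictionary layout are assumptions for illustration.

```python
def classify_paths(neighbors):
    """neighbors maps a name to (reported_distance, cost_of_link_to_neighbor).

    Returns (successor, feasible_distance, feasible_successors).
    """
    totals = {name: rd + cost for name, (rd, cost) in neighbors.items()}
    successor = min(totals, key=totals.get)
    fd = totals[successor]
    feasible = [
        name
        for name, (rd, _cost) in neighbors.items()
        if name != successor and rd < fd  # feasibility condition
    ]
    return successor, fd, feasible


r1_view = {"R2": (2000, 1000), "R3": (2050, 50)}
print(classify_paths(r1_view))
```

If R2's reported distance were 2100 or more, it would fail the feasibility condition and R1 would have no backup ready.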

Metric calculation

metric = ([K1 * bandwidth + (K2 * bandwidth) / (256 - load) + K3 * delay] * [K5 / (reliability + K4)]) * 256

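By default K1 = 1 and K3 = 1, and K2, K4 and K5 are 0. With K5 = 0 the reliability term is ignored, so the metric reduces to (bandwidth + delay) * 256.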
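
A sketch of the classic metric with the default K values. Two things here are not in the notes above and are assumptions: bandwidth is scaled as 10^7 divided by the slowest link in kbps, and delay is the total delay in tens of microseconds. Integer rounding may differ slightly from a real router when K2 or K5 are non-zero.

```python
def eigrp_classic_metric(min_bandwidth_kbps: int, total_delay_usec: int,
                         load: int = 1, reliability: int = 255,
                         k1: int = 1, k2: int = 0, k3: int = 1,
                         k4: int = 0, k5: int = 0) -> int:
    """Classic (non-wide) EIGRP composite metric."""
    bandwidth = 10**7 // min_bandwidth_kbps   # scaled inverse bandwidth
    delay = total_delay_usec // 10            # tens of microseconds
    metric = k1 * bandwidth + (k2 * bandwidth) // (256 - load) + k3 * delay
    if k5 != 0:
        # The reliability term only applies when K5 is non-zero.
        metric = metric * k5 // (reliability + k4)
    return metric * 256


# A T1 (1544 kbps, 20000 usec) and a gigabit link (1000000 kbps, 10 usec).
print(eigrp_classic_metric(1544, 20000))
print(eigrp_classic_metric(1_000_000, 10))
```

Above 1 Gbps the scaled bandwidth gets very small and delay is left to tell links apart, which is the gap wide metrics address.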

Wide metrics allow for faster links.

Unequal Cost Multi Path

EIGRP can load balance over the successor and feasible successor routes with a variance command.
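
A sketch of which paths the variance command would bring in, assuming the usual rule that a feasible successor is used when its metric is less than variance times the FD. The function name and data layout are made up; the numbers come from the example above.

```python
def ucmp_paths(paths, variance: int = 1):
    """paths maps a next hop to (path_metric, reported_distance)."""
    fd = min(metric for metric, _rd in paths.values())
    chosen = []
    for name, (metric, rd) in paths.items():
        is_successor = metric == fd
        is_feasible = rd < fd
        if is_successor or (is_feasible and metric < variance * fd):
            chosen.append(name)
    return sorted(chosen)


topology = {"R3": (2100, 2050), "R2": (3000, 2000)}
print(ucmp_paths(topology, variance=1))
print(ucmp_paths(topology, variance=2))
```

Paths that fail the feasibility condition are never used, however large the variance.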

Timers

  • Hello packets are sent every 5 seconds, or every 60 seconds on T1 links.
    • The hold time is 3x the hello interval.

Initial Bringup

  • Send Hello packets, to 224.0.0.10
    • Doesn't require multicast to be on
    • Unicast Init from neighbor, set Seq, Set Ack to 0
      • Neighbor Sends back Ack as prior sequence number.
      • Update Messages

Stuck in Active

  • The router is too busy to answer the query (generally due to high CPU utilization).
  • The router has memory problems and cannot allocate the memory to process the query or build the reply packet.
  • The circuit between the two routers is not good; there are not enough packets that get through to keep the neighbor relationship up, but some queries or replies are lost between the routers.
  • unidirectional links (a link on which traffic can only flow in one direction because of a failure)

Update Message

  • AS number
  • Prefixes
  • End-of-table Flag

Prefixes

  • Type (internal, etc)
  • Reliability
  • Load
  • MTU
  • Hop Count
  • Delay
  • Bandwidth
  • Flags
    • Source Withdrawn
    • Candidate Default
    • Route is Active
    • Route is Replicated
  • Next-hop
  • Prefix Length

Network

  • The CLI parser is converting the IP into binary, then comparing it to the wild mask.
  • The CLI parser will only save the matched bits of the IP.
  • The CLI parser will not save the zeroth network, anything starting with 0.
  • The CLI parser will only save the matched bits of an IP if it finds bits that are "on"
  • Using the "all" mask of 255.255.255.255 creates this statement 'network 0.0.0.0' and matches everything.
  • Using the "unique-ip" mask of 0.0.0.0 means "match this single address"
  • The wildcard mask only accepts contiguous numbers "Discontiguous mask is not supported."

192.0.2.5 127.255.255.255 - becomes 128.0.0.0, the rest of the bits get dropped.
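
The masking in that example can be reproduced with a few lines. This is only a sketch of the bit arithmetic, not of the CLI parser itself, and the function name is made up.

```python
import ipaddress


def stored_network(address: str, wildcard: str) -> str:
    """Keep only the bits of address that the wildcard mask says must match."""
    addr = int(ipaddress.IPv4Address(address))
    wild = int(ipaddress.IPv4Address(wildcard))
    return str(ipaddress.IPv4Address(addr & ~wild & 0xFFFFFFFF))


print(stored_network("192.0.2.5", "127.255.255.255"))
print(stored_network("192.0.2.5", "0.0.0.0"))
print(stored_network("192.0.2.5", "255.255.255.255"))
```

Only the top bit of 192 survives the first mask, which is where 128.0.0.0 comes from.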

References

https://www.cisco.com/c/en/us/support/docs/ip/enhanced-interior-gateway-routing-protocol-eigrp/16406-eigrp-toc.html

VRRP

HSRP

GLBP

Terms

  • GLBP - Gateway Load Balancing Protocol.
  • AVG - Active Virtual Gateway. The AVG responds to ARP requests for the same virtual IP with different MAC addresses, to load balance for GLBP.
  • AVF - Active Virtual Forwarder. A router in a GLBP group that is forwarding packets. All AVFs have their own mac, and are responsible for forwarding traffic destined towards that MAC.
  • Cisco proprietary

  • 224.0.0.102

  • UDP 3222

  • AVG is highest priority

  • Max of 4 active AVFs

  • Two states: Active, Listen

  • MD5 is supported

References

Cisco - Configuring GLBP

I learned this protocol using IOS-XR.

Async, no echo - Please respond to this packet with the control plane of the far device.

BFD Async without Echo

          Peer-A to Peer-B, lets agree to use BFD.
          
          Peer-A, I see your control packets.
          
          Peer-B, I also see your control packets.
          
          
          L3 SRC A
          L3 DST B
          
         +------------------------------->
+-------+                                 +-------+
|Peer-A |                                 |Peer-B |
+-------+                                 +-------+
         <-------------------------------+

Async, with echo - Just loop the BFD packets back onto the link, please.

BFD Async with Echo

The packets never leave the data plane, and never touch the control plane of Peer-A or Peer-B.


           L3 SRC A
           L3 DST A

!
! Peer A tests its return path
!
+-------+                                   +-------+
|       | +-------------------------------+ |       |
|Peer-A |                                 | |Peer-B |
|       | <-------------------------------+ |       |
+-------+                                   +-------+


           L3 SRC B
           L3 DST B
!
! Peer B also tests its return path
!
+-------+                                   +-------+
|       | +-------------------------------+ |       |
|Peer-A | |                                 |Peer-B |
|       | +-------------------------------> |       |
+-------+                                   +-------+

Ports

BFD is UDP, to an application on the network device

BFD Control is sent as SRC UDP 49152 or higher (the ephemeral range) --> Destination 3784

BFD Echo is sent as SRC UDP 3785 --> Destination 3785

BFD State Machine

Courtesy of the RFC

RFC 5880           Bidirectional Forwarding Detection          June 2010

(removed) 

The following diagram provides an overview of the state machine.
Transitions involving AdminDown state are deleted for clarity (but
are fully specified in sections 6.8.6 and 6.8.16).  The notation on
each arc represents the state of the remote system (as received in
the State field in the BFD Control packet) or indicates the
expiration of the Detection Timer.

                             +--+
                             |  | UP, ADMIN DOWN, TIMER
                             |  V
                     DOWN  +------+  INIT
              +------------|      |------------+
              |            | DOWN |            |
              |  +-------->|      |<--------+  |
              |  |         +------+         |  |
              |  |                          |  |
              |  |               ADMIN DOWN,|  |
              |  |ADMIN DOWN,          DOWN,|  |
              |  |TIMER                TIMER|  |
              V  |                          |  V
            +------+                      +------+
       +----|      |                      |      |----+
   DOWN|    | INIT |--------------------->|  UP  |    |INIT, UP
       +--->|      | INIT, UP             |      |<---+
            +------+                      +------+
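
The arcs in the diagram can be written down as a small lookup table. This is a sketch transcribed from the diagram only, so AdminDown as a local state is left out just like the diagram does. The names `TRANSITIONS` and `next_state` are made up for this note.

```python
# Key: (local state, event). The event is the remote state received in a
# BFD Control packet, or "TIMER" when the Detection Timer expires.
TRANSITIONS = {
    ("DOWN", "DOWN"): "INIT",
    ("DOWN", "INIT"): "UP",
    ("INIT", "INIT"): "UP",
    ("INIT", "UP"): "UP",
    ("INIT", "ADMIN_DOWN"): "DOWN",
    ("INIT", "TIMER"): "DOWN",
    ("UP", "ADMIN_DOWN"): "DOWN",
    ("UP", "DOWN"): "DOWN",
    ("UP", "TIMER"): "DOWN",
}

def next_state(local: str, event: str) -> str:
    """Anything not listed is a self-loop in the diagram, so stay put."""
    return TRANSITIONS.get((local, event), local)

# Peer-A's view of a session coming up, then timing out.
state = "DOWN"
for event in ["DOWN", "UP", "UP", "TIMER"]:
    state = next_state(state, event)
    print(event, "->", state)
```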
  • Async - If the other side doesn't receive the packets, the session is declared down.

  • BOB - BFD over Bundle

  • BLB - BFD over Logical Bundle - (VLANS, Sub-interfaces). This requires multipath to be enabled. Multipath doesn't inject BFD packets into the HP queue.

IOS-XR Commands

multipath include location 0/1/CPU0
bundle coexistence bob-blb logical
show tech-support routing bfd file

IOS-XR Examples

Take the session down if latency grows to 150ms for a single echo packet.
bfd fast detect 
bfd multiplier 50
bfd echo latency detect
Take the session down if latency grows to 300ms for a single echo packet.
bfd fast detect 
bfd multiplier 50
bfd echo latency detect percentage 200
Take the session down if the latency grows to 150ms for 3 consecutive echo packets.
bfd fast detect
bfd multiplier 50
bfd echo latency detect percentage 100 count 3

Disable echo mode

bfd 
interface g0/0/0/0
 echo disable

Protecting the BFD data-plane packets from QoS

192.168.100.1 <-> 192.168.100.2

!
! Config for 192.168.100.1
!
ipv4 access-list BFD-TRAFFIC
 5 permit udp host 192.168.100.1 any range 3784 3785
 10 permit udp host 192.168.100.2 any range 3784 3785
!
class-map match-any BFD-CLASS
 match access-group ipv4 BFD-TRAFFIC
!
policy-map OUT
class BFD-CLASS
 priority level 1
 police rate 10 kbps
!
interface TenGig <>
 service-policy output OUT
 bfd address-family ipv4 multiplier 3
 bfd address-family ipv4 destination 192.168.100.2
 bfd address-family ipv4 fast-detect
 bfd address-family ipv4 minimum-interval 100
!

Enabling BFD on RSVP (IOS)

A Config
ip rsvp signalling hello bfd
!
! this is very dangerous because CPU load will affect processing of BFD control packets
!
int f0/0.45
 ip rsvp signalling hello bfd
 bfd interval 50 min_rx 50 multiplier 3
Verification

show ip rsvp hello bfd nbr

Mutual Route-Redistribution

  • Tag EIGRP as 100
  • Tag OSPF as 1
  • Route maps should take the form DENY -> PERMIT.
  • Routes are tagged when they are advertised.

Route tags appear on-the-wire and can be read by other routers. ospf.lsa.asext.extrttag == 100

In this example, EIGRP becomes a Type-5 OSPF update, with a route-tag of 100. If we look for these tags, we can exclude them when redistributing back the other way.

route-map ospf-into-eigrp deny 10
 description previously tagged EIGRP traffic
 match tag 100
!
route-map ospf-into-eigrp permit 20
 match source-protocol ospf 1 ospfv3 1
 set tag 1
!
route-map eigrp-into-ospf deny 10
 description previously tagged OSPF traffic
 match tag 1
!
route-map eigrp-into-ospf permit 20
 match source-protocol eigrp 100
 set tag 100
!
router eigrp 100
 redistribute ospf 1 metric 1000000 100 255 1 1500 route-map ospf-into-eigrp
!
router ospf 1
 redistribute eigrp 100 subnets route-map eigrp-into-ospf
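
A rough Python model of what the two route-maps do to a route's tag. It only models the DENY -> PERMIT tag logic. The `match source-protocol` lines are not modeled, and `Route`, `redistribute`, `eigrp_into_ospf` and `ospf_into_eigrp` are names made up for this sketch.

```python
from typing import NamedTuple, Optional

class Route(NamedTuple):
    prefix: str
    tag: int = 0

def redistribute(route: Route, deny_tag: int, set_tag: int) -> Optional[Route]:
    """Sequence 10 denies routes carrying deny_tag, sequence 20 permits and re-tags."""
    if route.tag == deny_tag:
        return None  # came from the other protocol earlier, do not send it back
    return route._replace(tag=set_tag)

def eigrp_into_ospf(route: Route) -> Optional[Route]:
    return redistribute(route, deny_tag=1, set_tag=100)

def ospf_into_eigrp(route: Route) -> Optional[Route]:
    return redistribute(route, deny_tag=100, set_tag=1)

native_eigrp = Route("10.1.1.0/24")          # hypothetical prefix
in_ospf = eigrp_into_ospf(native_eigrp)      # now carries tag 100
print(in_ospf)
print(ospf_into_eigrp(in_ospf))              # a second border router drops it
```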

A very basic setup, that assumes a working underlay. I implemented this on my home lab of c7200s in GNS3 running 15.2(4)S7. My underlay was IS-IS to router loopbacks.

Site 1 EIDs - 192.168.100.0/24
Site 2 EIDs - 192.168.101.0/24

xTR for Site 1 - Lo0 18.18.18.18
xTR for Site 2 - Lo0 19.19.19.19

Site 1 - xTR

config
R18# show run | s lisp
router lisp
 database-mapping 192.168.100.0/24 18.18.18.18 priority 1 weight 50
 ipv4 itr map-resolver 16.16.16.16
 ipv4 itr
 ipv4 etr map-server 16.16.16.16 key cisco
 ipv4 etr
 exit
verify
R18# show ip lisp map-cache 
LISP IPv4 Mapping Cache for EID-table default (IID 0), 2 entries

0.0.0.0/0, uptime: 00:19:42, expires: never, via static send map-request
  Negative cache entry, action: send-map-request
192.168.101.0/24, uptime: 00:10:08, expires: 23:49:44, via map-reply, complete
  Locator      Uptime    State      Pri/Wgt
  19.19.19.19  00:10:08  up           1/50

Site 2 - xTR

config
R19# show run | s lisp
router lisp
 database-mapping 192.168.101.0/24 19.19.19.19 priority 1 weight 50
 ipv4 itr map-resolver 16.16.16.16
 ipv4 itr
 ipv4 etr map-server 16.16.16.16 key cisco
 ipv4 etr
 exit
verify
R19#show ip lisp map-cache 
LISP IPv4 Mapping Cache for EID-table default (IID 0), 2 entries

0.0.0.0/0, uptime: 00:11:50, expires: never, via static send map-request
  Negative cache entry, action: send-map-request
192.168.100.0/24, uptime: 00:11:29, expires: 23:48:23, via map-reply, complete
  Locator      Uptime    State      Pri/Wgt
  18.18.18.18  00:11:29  up           1/50
MS/MR
config
R16# show run | s lisp
router lisp
 site 1
  authentication-key cisco
  eid-prefix 192.168.100.0/24
  exit
 !
 site 2
  authentication-key cisco
  eid-prefix 192.168.101.0/24
  exit
 !
 ipv4 map-server
 ipv4 map-resolver
 exit
verify
R16# show lisp site name 1
Site name: 1
Allowed configured locators: any
Allowed EID-prefixes:
  EID-prefix: 192.168.100.0/24 
    First registered:     00:25:12
    Routing table tag:    0
    Origin:               Configuration
    Merge active:         No
    Proxy reply:          No
    TTL:                  1d00h
    State:                complete
    Registration errors:  
      Authentication failures:   0
      Allowed locators mismatch: 0
    ETR 10.0.0.23, last registered 00:00:28, no proxy-reply, no map-notify
                   TTL 1d00h, no merge, nonce 0x3E715231-0x150380FC
                   state complete
      Locator      Local  State      Pri/Wgt
      18.18.18.18  yes    up           1/50 

R16# show lisp site name 2
Site name: 2
Allowed configured locators: any
Allowed EID-prefixes:
  EID-prefix: 192.168.101.0/24 
    First registered:     00:25:24
    Routing table tag:    0
    Origin:               Configuration
    Merge active:         No
    Proxy reply:          No
    TTL:                  1d00h
    State:                complete
    Registration errors:  
      Authentication failures:   0
      Allowed locators mismatch: 0
    ETR 10.0.0.26, last registered 00:00:37, no proxy-reply, no map-notify
                   TTL 1d00h, no merge, nonce 0x2F281A3C-0x0760FD58
                   state complete
      Locator      Local  State      Pri/Wgt
      19.19.19.19  yes    up           1/50 

A Packet (an ICMP Request)

Capture is here

Frame 4156: 134 bytes on wire (1072 bits), 134 bytes captured (1072 bits) on interface -, id 0
Ethernet II, Src: ca:17:30:54:00:08 (ca:17:30:54:00:08), Dst: ca:1a:39:b0:00:08 (ca:1a:39:b0:00:08)
Internet Protocol Version 4, Src: 10.0.0.24, Dst: 19.19.19.19
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 120
    Identification: 0x0096 (150)
    010. .... = Flags: 0x2, Don't fragment
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 63
    Protocol: UDP (17)
    Header Checksum: 0x0aa2 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 10.0.0.24
    Destination Address: 19.19.19.19
User Datagram Protocol, Src Port: 1024, Dst Port: 4341
    Source Port: 1024
    Destination Port: 4341
    Length: 100
    Checksum: 0x0000 [zero-value ignored]
    [Stream index: 2]
    [Timestamps]
    UDP payload (92 bytes)
Locator/ID Separation Protocol (Data)
    Flags: 0xc0
    Nonce: 939002 (0x0e53fa)
    0000 0000 0000 0000 0000 0000 0000 0001 = Locator-Status-Bits: 0x00000001
Internet Protocol Version 4, Src: 192.168.100.100, Dst: 192.168.101.100
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 84
    Identification: 0xc736 (50998)
    010. .... = Flags: 0x2, Don't fragment
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 63
    Protocol: ICMP (1)
    Header Checksum: 0x2959 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 192.168.100.100
    Destination Address: 192.168.101.100
Internet Control Message Protocol
    Type: 8 (Echo (ping) request)
    Code: 0
    Checksum: 0xc078 [correct]
    [Checksum Status: Good]
    Identifier (BE): 82 (0x0052)
    Identifier (LE): 20992 (0x5200)
    Sequence Number (BE): 1 (0x0001)
    Sequence Number (LE): 256 (0x0100)
    [Response frame: 4157]
    Timestamp from icmp data: Jul 20, 2023 18:00:03.000000000 Eastern Daylight Time
    [Timestamp from icmp data (relative): 0.551525000 seconds]
    Data (48 bytes)

0000  53 4e 08 00 00 00 00 00 10 11 12 13 14 15 16 17   SN..............
0010  18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27   ........ !"#$%&'
0020  28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37   ()*+,-./01234567
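
The lengths in this capture show the encapsulation overhead. A quick arithmetic check in Python, using only the numbers from the decode above. The 8-byte LISP data header (flags, nonce, locator-status-bits) is the only value not printed directly in the capture.

```python
# Values read from the decode above.
FRAME_BYTES = 134
OUTER_IP_TOTAL_LENGTH = 120
UDP_LENGTH = 100
INNER_IP_TOTAL_LENGTH = 84

# Header sizes.
ETHERNET_HEADER = 14
IPV4_HEADER = 20   # header length is reported as 20 bytes, so no options
UDP_HEADER = 8
LISP_HEADER = 8    # flags (1) + nonce (3) + locator-status-bits (4)

overhead = OUTER_IP_TOTAL_LENGTH - INNER_IP_TOTAL_LENGTH
print("LISP adds", overhead, "bytes in front of the original packet")

# Every layer should add up.
assert overhead == IPV4_HEADER + UDP_HEADER + LISP_HEADER
assert UDP_LENGTH == UDP_HEADER + LISP_HEADER + INNER_IP_TOTAL_LENGTH
assert FRAME_BYTES == ETHERNET_HEADER + OUTER_IP_TOTAL_LENGTH
```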

Sources

LISP Fundamentals and Troubleshooting Basics - Cisco

Terms

  • Multicast: A one-to-many service using UDP packets destined to a group IP address. Hosts subscribe to the group, routers replicate for the group.
  • IGMP: Internet Group Management Protocol. A host uses IGMP to request a multicast stream. Switches see it (for snooping), and the LHR uses this to build the MDT.
  • PIM: Protocol Independent Multicast. Multicast capable routers communicate with each other via PIM.
  • IIL: Incoming Interface List, part of the MDT.
  • OIL: Outgoing Interface List, part of the MDT.
  • MDT: Multicast Distribution Tree. The full set of links participating in multicast, via PIM, IGMP, including IILs, and OILs.
  • RP: Rendezvous Point. A router designated as the root of a shared tree.
  • (*,G): Star comma Gee. AKA, a shared tree. These require a RP. Called Star comma Gee, because typing "show ip mroute" ... this is what shows up.
  • (S,G): Ess comma Gee. AKA a source tree. These do not require a RP.
  • Source Tree: AKA, SPT, or shortest path tree. The SPT is the best tree.
  • RPT: Rendezvous Point Tree, this is a *,G that points towards the RP.
  • ASM: Any Source Multicast. The host only knows the group it wants to receive (239.10.10.10).
  • SSM: Source Specific multicast. The host already knows the source, and group address (10.0.0.1, 232.10.10.10).
  • Upstream: Towards the source.
  • Downstream: Towards group members.
  • FHR: First hop router. This router receives the multicast stream directly from the source.
  • LHR: Last Hop router receives IGMP messages from receivers, which are translated into PIM join messages.
  • MRIB: The multicast routing table. Shows RPTs, SPTs, RPFIs, OILs, and IILs.
  • MFIB: The forwarding table. This is used for programming the hardware.
  • RIB: Routing Information Base
  • DF: Designated Forwarder. Used in BIDIR-PIM.

Harder Terms

RPF - Reverse Path Forwarding

PIM is protocol independent in the sense that it doesn't care which unicast routing protocol filled the routing table. It only needs a route back to the source. If a stream turns on, it must have a source, so it takes the form (10.0.0.1, 239.1.1.1), a (S,G).

If we do show ip route 10.0.0.1, we'll see the interface the router would use to send traffic towards that source address. This is the "upstream" interface.

As multicast traffic flows from 10.0.0.1, it should flow into the upstream interface, and out of any downstream interfaces (the OIL).

Tracing the traffic back to the source this way is called "reverse path forwarding" and the interface along this path is the RPF interface.

The PIM neighbor on the RPF is called the RPF neighbor.

Any multicast traffic from a given source that is not received on the RPF interface is discarded. This prevents loops.
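
The RPF check above boils down to a unicast lookup on the source, then a comparison with the arrival interface. A minimal sketch, assuming a made-up two-entry routing table. `RIB`, `rpf_interface` and `accept` are names invented for this note, not anything from IOS.

```python
import ipaddress

# Hypothetical unicast RIB: prefix -> interface used to reach that prefix.
RIB = {
    "10.0.0.0/24": "Gi0/1",
    "0.0.0.0/0": "Gi0/0",
}

def rpf_interface(source, rib=RIB):
    """Longest-prefix match on the source, like 'show ip route <source>'."""
    src = ipaddress.ip_address(source)
    best = None
    for prefix, interface in rib.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return best[1] if best else None

def accept(source, arrival_interface, rib=RIB):
    """Multicast from 'source' is only accepted on the RPF interface."""
    return rpf_interface(source, rib) == arrival_interface

print(accept("10.0.0.1", "Gi0/1"))  # arrived on the upstream interface
print(accept("10.0.0.1", "Gi0/2"))  # arrived somewhere else, discard
```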

Shared Trees

(*,G) entries in the mroute table require fewer resources, since multiple sources can use the same tree.

(*,G) entries in the mroute table represent a security risk, because any source can send to this shared tree.

Theory (in v4)

Multicast is always TO a group, a destination, or a set of destinations.

Multicast comes from an older time. Unlike Unicast addresses, you can tell via bits if a v4 address is multicast.

A multicast address always starts with 1110.

Address Scopes   Description
224.0.0.0/4      Multicast Supernet
224.0.0.0/24     Local Control (TTL=1)
224.0.1.0/24     Internetwork Control (examples are NTP, Cisco RP-Announce, Cisco RP-Discovery)
232.0.0.0/8      Source-Specific Multicast (SSM). Via an extension, PIM can build (S,G) MDTs.
233.0.0.0/8      GLOP! Companies with a 16-bit ASN can have globally static multicast. 233.X.Y.0/24
239.0.0.0/8      Organization-Local Scope. Exactly like RFC1918, but for multicast.
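
A small Python sketch of the same scopes, handy for checking an address by hand. `SCOPES`, `classify` and `glop_prefix` are names made up for this note. The GLOP helper assumes the usual mapping where the 16-bit ASN fills the middle two octets, and 64512 is just a private ASN used as an example.

```python
import ipaddress

# Most specific scopes from the table above.
SCOPES = [
    ("224.0.0.0/24", "Local Control"),
    ("224.0.1.0/24", "Internetwork Control"),
    ("232.0.0.0/8", "Source-Specific Multicast"),
    ("233.0.0.0/8", "GLOP"),
    ("239.0.0.0/8", "Organization-Local Scope"),
]

def classify(address: str) -> str:
    ip = ipaddress.IPv4Address(address)
    # Multicast means the first four bits are 1110.
    if int(ip) >> 28 != 0b1110:
        return "not multicast"
    for prefix, name in SCOPES:
        if ip in ipaddress.ip_network(prefix):
            return name
    return "other multicast"

def glop_prefix(asn: int) -> str:
    """16-bit ASN -> 233.X.Y.0/24, X is the high byte, Y the low byte."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("GLOP only works for 16-bit ASNs")
    return f"233.{asn >> 8}.{asn & 0xFF}.0/24"

for addr in ["224.0.0.5", "224.0.1.39", "232.10.10.10", "239.10.10.10", "10.0.0.1"]:
    print(addr, "->", classify(addr))
print(glop_prefix(64512))
```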

Common L3 Addresses

Same Broadcast Domain

Protocol      Multicast Address
all-hosts     224.0.0.1
all-routers   224.0.0.2
OSPF-hello    224.0.0.5
OSPF-DR       224.0.0.6
RIPv2         224.0.0.9
EIGRP         224.0.0.10
PIM           224.0.0.13
mDNS          224.0.0.251

Can be forwarded

Protocol             Multicast Address   Notes
ntp                  224.0.1.1
cisco-rp-announce    224.0.1.39          Candidate RPs announce every 60s. Highest IP wins.
cisco-rp-discovery   224.0.1.40          Mapping agent floods RP-to-group mappings.

IANA Assignments

PIM forms adjacencies in only one direction

The multicast source is the root of the tree. Packets flow downstream from the source. Control plane traffic like PIM joins flows upstream, towards the RP or towards the source.

PIM

PIM Mode           Full Name           How it works
PIM-DM             Dense Mode          No RP. Floods everywhere, routers send prune messages to un-join. Assumes everyone wants the traffic.
PIM-SM             Sparse Mode         Complex. Requires a RP, RP Discovery, and phases. Uses register messages, and both tree types.
PIM Sparse-Dense   Sparse-Dense Mode   Runs sparse for groups with a known RP, dense for groups without. Legacy transitional mode.
Bidir-PIM          Bidirectional       Shared tree only, traffic flows both toward and away from RP. No SPT switchover. Good for many-to-many applications.
PIM-SSM            Source Specific     No RP. Receiver specifies both source and group (S,G).

PIM Message Types

Type  Message Type                 Destination                        Purpose
0     Hello                        224.0.0.13 (all PIM routers)       Establish adjacency, negotiate parameters.
1     Register                     RP address (unicast)               First-hop router notifies RP of new source, encapsulates multicast data until SPT is built.
2     Register stop                First-hop router (unicast)         RP tells first-hop router to stop sending Register messages.
3     Join/prune                   224.0.0.13 (all PIM routers)       Join or prune a multicast tree, either (*,G) toward RP or (S,G) toward source.
4     Bootstrap                    224.0.0.13 (all PIM routers)       BSR floods RP-set information throughout the domain so all routers know candidate RPs.
5     Assert                       224.0.0.13 (all PIM routers)       Elect a single forwarder on a multi-access segment when duplicate traffic is detected.
8     Candidate RP advertisement   Bootstrap router (BSR) (unicast)   Candidate RPs advertise themselves to the BSR.
9     State refresh                224.0.0.13 (all PIM routers)       PIM-DM only. Prevents prune state from timing out and triggering a re-flood.
10    DF election                  224.0.0.13 (all PIM routers)       Bidir-PIM only. Elects a Designated Forwarder per link to forward traffic toward the RP.

Auto RP

Cisco devices can announce their willingness to be an RP, via cisco-rp-announce

A different service, a mapping agent, will read these messages, pick a winner, then advertise that out via cisco-rp-discovery

  • 5.5.5.5, Candidate RP.
  • 4.4.4.4, mapping agent.
R4# show ip pim autorp 
AutoRP Information: 
  AutoRP is enabled.
  RP Discovery packet MTU is 1500.
  224.0.1.40 is joined on Loopback0.
  AutoRP groups over sparse mode interface is enabled

PIM AutoRP Statistics: Sent/Received
  RP Announce: 0/16, RP Discovery: 64/42

These packets are infrequent. Candidate RPs only announce every 60 seconds, so a debug can take a while to show anything.

R4#debug ip pim auto-rp 
PIM Auto-RP debugging is on
R4#
!
! Sent to cisco-rp-discovery
!
*Apr 25 19:57:08.940: Auto-RP(0): Build RP-Discovery packet
*Apr 25 19:57:08.941: Auto-RP(0):  Build mapping (224.0.0.0/4, RP:5.5.5.5), PIMv2 v1,
*Apr 25 19:57:08.942: Auto-RP(0): Send RP-discovery packet of length 48 on GigabitEthernet0/3 (1 RP entries)
*Apr 25 19:57:08.943: Auto-RP(0): Send RP-discovery packet of length 48 on GigabitEthernet0/4 (1 RP entries)
*Apr 25 19:57:08.945: Auto-RP(0): Send RP-discovery packet of length 48 on GigabitEthernet0/0 (1 RP entries)
*Apr 25 19:57:08.948: Auto-RP(0): Send RP-discovery packet of length 48 on Loopback0(*) (1 RP entries)
*Apr 25 19:57:12.008: Auto-RP(0): Received RP-discovery packet of length 48, from 10.0.45.5, ignored
!
! Received by cisco-rp-announce
!
*Apr 25 19:58:30.159: Auto-RP(0): Received RP-announce packet of length 48, from 5.5.5.5, RP_cnt 1, ht 181
*Apr 25 19:58:30.159: (0): pim_add_prm:: 224.0.0.0/240.0.0.0, rp=5.5.5.5, repl = 0, ver =3, is_neg =0, bidir = 0, crp = 0
*Apr 25 19:58:30.160: Auto-RP(0): Update
*Apr 25 19:58:30.160:  prm_rp->bidir_mode = 0 vs bidir = 0 (224.0.0.0/4, RP:5.5.5.5), PIMv2 v1
R4# undebug all
All possible debugging has been turned off

Dense

Based on RFC 3973 Protocol Independent Multicast Dense Mode (PIM-DM)

  • Push Model
    • Good for when every subnet probably wants this traffic
  • No PIM DR
    • All FHR forward multicast traffic
      • Multicast traffic is flooded out every interface that isn't the RPF.
  • Eventually builds a SPT after prunes
  • IGMP joins turn into graft messages
  • Prunes last 3 minutes
    • Flood and Prune
    • Routers with no Receivers or duplicate S,G traffic prune.
    • 224.0.0.13 to find neighbors
    • Receivers prune back
    • Router attached to LAN listens for multicast control plane.
      • Receives source traffic
        • Insert (*,G) and (S,G) into mrib
        • Incoming traffic is attached to IIL
        • OIL is all other interfaces
        • Flood to OIL
        • PIM dense always uses SPT.
  • Prune occurs
    • Traffic flows stop, but (S,G) remains in table
    • Multicast fails RPF
    • No downstream neighbor or receiver
    • Downstream sent prune
    • LAN Prune override exception
  • After pruning
    • Flood again, prune back, flood again, prune back

PIM Sparse

Based on RFC4601 - Protocol Independent Multicast Sparse Mode (PIM-SM)

  • Explicit joins everywhere. No flooding.
  • LHR, sends a PIM-Join towards the RP, building a (*,G).
  • Phased
      1. The RPT tree
      • Receivers send their (*,G) joins towards the RP.
      • FHR encapsulates the multicast traffic directly towards the RP.
      • PIM-Register
      • RP de-encapsulates the traffic, sending it down the RPT.
      2. Register Stop
      • The RP sends a (S,G) join towards the source.
      • When multicast packets start showing up, without encapsulation, the RP sends a Register-Stop.
      3. SPT tree
      • LHR requests a (S,G) entry towards its upstream, until it's joined to the (S,G) tree.
      • When the LHR starts getting two copies of the traffic, it sends a (S,G,rpt) prune message, towards the RP. (A prune specific to the RPT)
  • If two LHRs exist, and duplicate traffic is detected, a PIM Assert election happens.
    • These Asserts repeat every 3 minutes.
    • RPTbit, 0 is preferred and means "has (S,G) tree"
      • Metric Preference (Administrative Distance)
        • Metric
          • IP address of subnet interface.
  • Specify the tunnel, for the pim-register messages on Cisco via ip pim register-source loopback 0
  • The tunnel interface encapsulates the entire multicast packet, which adds 28 bytes of overhead. Packets close to the MTU will be silently dropped on IOS-XE.
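
The Assert tie-breakers listed above (RPT bit, then metric preference, then metric, then IP address) can be sketched as a sort key. The list only gives the order. The direction of each comparison (lower wins for the first three, highest IP wins last) is filled in from the PIM-SM spec, and `AssertMsg`, `assert_key` and `assert_winner` are names made up for this note.

```python
import ipaddress
from typing import NamedTuple

class AssertMsg(NamedTuple):
    rpt_bit: int            # 0 means "has (S,G) tree"
    metric_preference: int  # administrative distance
    metric: int
    address: str            # IP of the interface on the shared subnet

def assert_key(msg: AssertMsg):
    # min() picks the lowest tuple, so negate the IP to make the highest IP win.
    return (
        msg.rpt_bit,
        msg.metric_preference,
        msg.metric,
        -int(ipaddress.IPv4Address(msg.address)),
    )

def assert_winner(candidates):
    return min(candidates, key=assert_key)

# Hypothetical routers on one LAN.
candidates = [
    AssertMsg(1, 110, 20, "10.0.0.3"),  # only has the (*,G), loses first
    AssertMsg(0, 110, 20, "10.0.0.1"),
    AssertMsg(0, 110, 20, "10.0.0.2"),  # ties on everything, higher IP
]
print(assert_winner(candidates).address)
```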

PIM-SM-register-register-stop-prune.pcap

A DR is elected by highest priority, then highest IP in the subnet.

  • DR sends the PIM join upstream.
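
The DR election rule is short enough to write as a one-liner. A sketch with made-up neighbors, where `elect_dr` is just a name for this note.

```python
import ipaddress

def elect_dr(neighbors):
    """neighbors: list of (ip_address, dr_priority) tuples on one subnet."""
    # Highest priority first, highest IP breaks the tie.
    return max(
        neighbors,
        key=lambda n: (n[1], int(ipaddress.IPv4Address(n[0]))),
    )[0]

# Hypothetical PIM neighbors on the same LAN.
print(elect_dr([("10.0.45.4", 1), ("10.0.45.5", 1)]))   # same priority
print(elect_dr([("10.0.45.4", 10), ("10.0.45.5", 1)]))  # priority beats IP
```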

The RP always gets the stream, even if it has no receivers to forward it to.

BIDIR-PIM

Based on RFC 5015 - Bidirectional Protocol Independent Multicast (BIDIR-PIM)

  • Superset of PIM-SM
  • No (S,G) entries
  • Traffic can flow up and down the same tree.
  • Still needs RPs
    • RP must be dedicated to BIDIR-PIM.
  • Each bidirectional link has a DF election.
    • Ingress packets on any PIM interface can be forwarded downstream onto DF links.
      • No DF links, no forwarding.
    • Ingress packets to a DF can be forwarded upstream via the RPF towards the RPA.

MSDP

  • RPs in different multicast domains peer with each other.
  • RP sends a SA (source active) message.
  • Still needs PIM running for the S,G.
  • TCP port 639.
  • Has keepalives.

show ip msdp peer

show ip msdp sa-cache

Shared-Tree (*,G)

  • Shared trees are essential for multiple senders to the same group

  • A single tree is built for each group, regardless of source

    • 3 sources, 1 tree
  • Selects a router as the root of the tree

  • If a receiver is on the same subnet as the sending host, it will need to revert to PIM Dense for that segment

  • This isn't always better. Shared trees will typically take suboptimal paths through a network

  • Source trees are better distributed, hence they are more robust

  • RP Selection is a hassle

Source Based Multicast (S,G)

  • PIM dense uses a separate tree for each multicast source and destination group.
  • Groups do not share trees.
    • 3 Sources 3 trees.

Commands

show pim rpf hash

show pim range-list

show pim topology

show mrib route

show ip mroute

What interface should I receive this host traffic from?

show ip rpf 10.0.0.

show ip mfib

See if multicast even works

show ip pim stats

See if PIM adjacency traffic even arrives.

show ip pim interface detail

See results of DF election

show ip pim interface df

FLAGS
 A - Accepting. This interface is accepting data
 F - Forwarding. Where to send multicast traffic

Nexus 7K

show forwarding multicast route group <>

L2 Addresses

MAC addresses are 48 bits.

The first 25 bits are always the same.

0000 0001 . 0000 0000 . 0101 1110 . 0??? ????
       01 :        00 :        5E :
        ^                           ^
        |                           └─  Multicast requires this bit be 0.
        |
        └─ Individual/Group. Multicast requires this bit be 1.

So the first three bytes are 01:00:5E

The last 23 bits come from the IP address.

A Multicast IP

Mapping 232.10.10.10 → 01:00:5E:0A:0A:0A

Copy the low order 23 bits directly from the v4 address.


  232.10.10.10/8
  (in binary)
  1110 1000 . 0000 1010 . 0000 1010 . 0000 1010
               \______________________________/
               Remember these 23 bits.
   

Building the L2 Address.

Ethernet Multicast MAC Address

          1 :         0 :        5E :        0A :       0A  :        0A
  0000 0001 . 0000 0000 . 0101 1110 . 0000 1010 . 0000 1010 . 0000 1010
  \__________________________________/|\______________________________/
        Assigned first 25 bits        |   Same bits as above.
        (always 01:00:5E)             |  (24 bits → 23 bits, 1 bit dropped)
                                      |
                                      |
                                      └─  Multicast requires this bit be 0
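
The same mapping as a short Python function: keep the low 23 bits of the group address and drop them into the fixed 01:00:5E prefix. `multicast_mac` is a name made up for this note.

```python
import ipaddress

def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC."""
    ip = ipaddress.IPv4Address(group)
    if not ip.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    low23 = int(ip) & 0x7FFFFF         # the 23 bits that get copied
    mac = 0x01005E000000 | low23       # fixed first 25 bits + copied bits
    # Print the 48 bits one byte at a time, most significant byte first.
    return ":".join(f"{(mac >> shift) & 0xFF:02X}" for shift in range(40, -1, -8))

print(multicast_mac("232.10.10.10"))
print(multicast_mac("239.138.10.10"))  # different group, same MAC
```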

Quirks and Tech Debt.

Because we copied only 23 bits, vs 28 bits, we have 5 bits of overlap.

v4 is 32 bits, minus those four bits that can never change 1110 to get 28 bits.

All these IPs share the same multicast L2 address.

All 32 IPv4 addresses mapping to 01:00:5E:0A:0A:0A
══════════════════════════════════════════════════════════════════════════════
Address           Octet 1    Octet 2    Octet 3    Octet 4
──────────────────────────────────────────────────────────────────────────────
224.10.10.10      1110 0000  0000 1010  0000 1010  0000 1010
224.138.10.10     1110 0000  1000 1010  0000 1010  0000 1010
225.10.10.10      1110 0001  0000 1010  0000 1010  0000 1010
225.138.10.10     1110 0001  1000 1010  0000 1010  0000 1010
226.10.10.10      1110 0010  0000 1010  0000 1010  0000 1010
226.138.10.10     1110 0010  1000 1010  0000 1010  0000 1010
227.10.10.10      1110 0011  0000 1010  0000 1010  0000 1010
227.138.10.10     1110 0011  1000 1010  0000 1010  0000 1010
228.10.10.10      1110 0100  0000 1010  0000 1010  0000 1010
228.138.10.10     1110 0100  1000 1010  0000 1010  0000 1010
229.10.10.10      1110 0101  0000 1010  0000 1010  0000 1010
229.138.10.10     1110 0101  1000 1010  0000 1010  0000 1010
230.10.10.10      1110 0110  0000 1010  0000 1010  0000 1010
230.138.10.10     1110 0110  1000 1010  0000 1010  0000 1010
231.10.10.10      1110 0111  0000 1010  0000 1010  0000 1010
231.138.10.10     1110 0111  1000 1010  0000 1010  0000 1010
232.10.10.10      1110 1000  0000 1010  0000 1010  0000 1010  <--- This is our SSM address.
232.138.10.10     1110 1000  1000 1010  0000 1010  0000 1010
233.10.10.10      1110 1001  0000 1010  0000 1010  0000 1010  <--- An address in the GLOP block.
233.138.10.10     1110 1001  1000 1010  0000 1010  0000 1010
234.10.10.10      1110 1010  0000 1010  0000 1010  0000 1010
234.138.10.10     1110 1010  1000 1010  0000 1010  0000 1010
235.10.10.10      1110 1011  0000 1010  0000 1010  0000 1010
235.138.10.10     1110 1011  1000 1010  0000 1010  0000 1010
236.10.10.10      1110 1100  0000 1010  0000 1010  0000 1010
236.138.10.10     1110 1100  1000 1010  0000 1010  0000 1010
237.10.10.10      1110 1101  0000 1010  0000 1010  0000 1010
237.138.10.10     1110 1101  1000 1010  0000 1010  0000 1010
238.10.10.10      1110 1110  0000 1010  0000 1010  0000 1010
238.138.10.10     1110 1110  1000 1010  0000 1010  0000 1010
239.10.10.10      1110 1111  0000 1010  0000 1010  0000 1010
239.138.10.10     1110 1111  1000 1010  0000 1010  0000 1010  <--- An Organizational scope address.
══════════════════════════════════════════════════════════════════════════════
                       ^^^^  ^
                       ||||  | 
                       └└└└──└─ I incremented these five bits to show the pattern.
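
The table above can be regenerated for any group by walking the five bits that never make it into the MAC. A sketch, where `aliases` is a name made up for this note.

```python
import ipaddress

def aliases(group: str):
    """All 32 IPv4 groups that share the same multicast MAC as 'group'."""
    low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
    result = []
    for five_bits in range(32):
        # 1110 | five free bits | the 23 bits that are copied into the MAC
        value = (0b1110 << 28) | (five_bits << 23) | low23
        result.append(str(ipaddress.IPv4Address(value)))
    return result

shared = aliases("232.10.10.10")
print(len(shared))
print(shared[0], "...", shared[-1])
```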

Lab Stuff.

BPF - Capture all PIM, but not PIM hello messages.

ip proto 103 and not ether[34] == 0x20
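
Where the 34 and the 0x20 in that filter come from, as a quick calculation. This assumes an untagged Ethernet frame, an IPv4 header with no options, and PIM version 2. `pim_first_byte` is a name made up for this note.

```python
ETHERNET_HEADER = 14  # assumes no 802.1Q tag
IPV4_HEADER = 20      # assumes no IP options

def pim_first_byte(version: int, msg_type: int) -> int:
    """The first PIM byte is 4 bits of version followed by 4 bits of type."""
    return (version << 4) | msg_type

offset = ETHERNET_HEADER + IPV4_HEADER
hello = pim_first_byte(version=2, msg_type=0)

print(f"ether[{offset}] == {hello:#04x}")
```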

Sending Multicast

iperf --client 239.10.10.10 --udp --time 3600 --interval 1 --bandwidth 1pps --ttl 15 --len 1000

Receiving Multicast

iperf --server --udp --bind 239.10.10.10 --interval 1

The C9000-L series does not support Catalyst Center, and has lower StackWise speeds.

Two Tier Collapsed Core

cisco-campus-two-tier-collapsed-core-cisco

  • The core and distribution switches are the same
  • The center is running StackWise Virtual

Three Tier

cisco-campus-three-tier-with-network-services-layer

Layer 2 Access with traditional multilayer

  • Layer 2 is a single wiring closet, or access uplink pair.
  • FHRP is used, but limits bandwidth to one uplink, vs both.

The Campus Network

  • Campus networks are always oversubscribed.
  • Over-subscription ratios between 4:1 and 20:1 are common.
  • Networks with over-subscription that results in queuing should implement QoS for voice traffic.
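
The ratio is just total access bandwidth over total uplink bandwidth. A tiny sketch with made-up numbers (a 48-port 1G switch with two 10G uplinks). `oversubscription` is a name invented for this note.

```python
def oversubscription(access_ports, access_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth (N means N:1)."""
    return (access_ports * access_gbps) / (uplinks * uplink_gbps)

# Hypothetical access switch: 48 x 1G ports, 2 x 10G uplinks.
both_uplinks = oversubscription(48, 1, 2, 10)
# Same switch when only one uplink is forwarding (FHRP or STP blocking).
one_uplink = oversubscription(48, 1, 1, 10)

print(f"{both_uplinks}:1 with both uplinks, {one_uplink}:1 with one")
```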

Access Layer

  • 9200 (160Gbps stack-wise ring)
  • 9300 (480Gbps stack-wise ring)
  • 9400 (modular chassis)

Considerations

  • mGig, so access speeds can scale
  • UPOE+, 90W with perpetual power (survives reboots)

Distribution Layer

  • 9400 (modular chassis)
  • 9500
  • 9600 (modular chassis)

Considerations

  • Service heavy (FHRPs, Routing, SVIs)
  • Typical L2 boundary
  • Used to interconnect all the access layer switches in a building
  • Used to interconnect Access layer switches, once they can't form a full-mesh
  • Also contains the failure domain of the access layer.
  • Simplified Distribution, using stackwise virtual to remove FHRP.

Core Layer

  • 9500
  • 9600 (modular chassis)

Considerations

  • No services
  • Layer 3 only
  • Always on
  • Ideally, a minimum of 100G to conserve ports.

cisco-campus-lan-core

Traditional Design

cisco-campus-looped-access

  • Needs STP to block ports

Traditional Design - Loop Free

cisco-campus-loop-free-access

Other Designs

SD-Access

  • Cisco Catalyst Center
  • Cisco Identity Services Engine

cisco-campus-sd-access-design

Open Standards Based Overlay

  • MP-BGP
  • VXLAN

cisco-campus-bgp-evpn-vxlan

Campus LAN Best Practices - Security

  • DHCP Snooping, to prevent users from hooking up a DHCP server from home by accident.

  • Dynamic ARP inspection, to prevent an ARP spoofing attack, where the attacker sends ARP replies claiming IPs in the subnet.

  • BPDU Guard, to prevent home switches.

  • 802.1x, port authentication

  • Cisco Umbrella, Cisco's DNS offering.

Campus LAN Best Practices - High Availability

  • SSO: Stateful Switch Over, used to sync RPs in modular switches.

  • NSF: Non-Stop Forwarding allows graceful restarting of a L3 protocol. Allows the data-plane to continue forwarding while the new RP takes over.

  • MLS: Multi-layer Switch.

  • StackWise: Older tech, to combine switches together. Up to 8 switches can be stacked. They operate as one switch.

  • StackWise Virtual: Two MLS devices, are combined to become one logical device.

  • StackWise Virtual Link: The control/data path between the two switches. Should be two links minimum.

  • GIR: Graceful Insertion or Removal. Influencing paths by changing route-metrics or adjusting FHRP priorities.

Etherchannel

  • Use a dynamic protocol, to check on link health

References

https://www.cisco.com/c/en/us/td/docs/solutions/CVD/Campus/cisco-campus-lan-wlan-design-guide.html

  • Cisco Catalyst Center: Formerly Cisco DNA Center. Speaks NETCONF, SNMP, SSH southbound, REST/HTTPS northbound.

  • Campus Fabric: Equipment managed without Catalyst Center, can be CLI or NETCONF/RESTCONF.

  • ISE: Identity Services Engine. Cisco's modern AAA server.

  • SD-Access: Campus Fabric managed with Cisco Catalyst Center and Cisco ISE.

  • SGT: Scalable Group Tags, formerly called Security Group Tags. These are managed by ISE.

  • SGT Policy: Instead of identifying traffic based on IP or MAC, traffic can be identified by SGT.

  • Overlay: LISP, VXLAN and CTS (Cisco TrustSec), which carries SGTs inside of VXLAN-GPO.

  • VXLAN-GPO: Cisco extended the VXLAN header to include SGTs (Now called Scalable Group Tags)

  • Underlay: Usually IS-IS, since it's IPv4 and IPv6 agnostic. Even the underlay can be automatically deployed.

  • Control Plane Node: Contains the LISP MS/MR databases Endpoint-to-location, or EID-to-RLOC. Each node contains the full database.

  • Fabric Border Node: Connects other L3 networks to SDA fabric.

  • Fabric WLC: Connects APs and the WLC to the SDA fabric.

  • Fabric Intermediate Node: Only does underlay services, like IS-IS or IP transport.

  • Fabric Edge Node: Connects campus host devices to the SDA fabric, usually an access layer or distribution layer device. Is a LISP xTR, with an anycast gateway, with overlay host protocols, (like DHCP).

Fabric Edge Onboarding

  • (Method 1) Open Auth or MAB, user connects to a port -> host pool.
  • (Method 2) 802.1x authenticates the device -> host pool.
  • Host pool has a SGT, SVI and VRF instance.
  • SVI is the anycast gateway (same IP address and MAC for that SVI & VRF) on all edge nodes.
  • Host address is now an EID (MAC, /32 IPv4, /128 IPv6), that can be registered with the control plane node.
  • Control plane signaling is LISP, dataplane is managed via VXLAN-GPO.
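
The EID registration step above can be pictured as a simple lookup table. This is only a rough Python sketch: the MapServer class, the instance ID 4099, and the addresses are all made up for illustration, and it models the EID-to-RLOC idea only, not the LISP wire protocol.

```python
class MapServer:
    """Toy EID-to-RLOC database, keyed by (instance ID, EID)."""

    def __init__(self):
        self.db = {}

    def register(self, instance_id, eid, rloc):
        # An edge node registers a host EID it has just onboarded.
        # A later registration for the same EID replaces the old RLOC.
        self.db[(instance_id, eid)] = rloc

    def resolve(self, instance_id, eid):
        # Another edge asks where an EID lives; None means "not registered".
        return self.db.get((instance_id, eid))


ms = MapServer()
ms.register(4099, "10.10.10.102/32", "192.168.255.1")
ms.register(4099, "10.10.10.102/32", "192.168.255.2")  # host moved to another edge
print(ms.resolve(4099, "10.10.10.102/32"))
```

Keying on the instance ID as well as the EID is what lets the same host address exist in two VRFs without colliding.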

Fabric Border Nodes Types

  • Internal Border: WLC, Firewall, Data center

  • Default Border: Internet.

  • Internal + Default: Both.

Wireless

If the WLC can participate in the fabric, it's a fabric-aware WLC. It performs PxTR (proxy LISP encap/de-encap) for hosts connected to fabric APs, and registers their EIDs with the control plane nodes.

Control plane traffic is CAPWAP inside of VXLAN-GPO. Dataplane traffic can just ride VXLAN-GPO.

LISP

  • The LISP instance ID is the VRF.

Cisco Catalyst Center

  • NCP: Network Control Platform. This module is connected via API to the GUI, and is what talks to the network gear via NETCONF, SNMP, or SSH. Does all the underlay automation.

  • NDP: Network Data Platform. Data collection and analytics. Netflow, Syslog, ERSPAN, etc.

  • ISE: Is required. 802.1x, Mac Authentication Bypass (MAB), or Web Authentication (WebAuth). Can talk to AWS or Active Directory. ISE is tightly integrated via API calls to CatC.

Terms

  • DIA: Direct Internet Access. What we usually have as residential customers. No real guarantee of service, but tends to be fast.
  • SLA: Service Level Agreement. Business Internet, especially when used to connect sites together, tends to have an SLA.
  • MPLS: A kind of VPN service provided by an ISP, to connect business sites together. Comes with a SLA. More expensive than DIA.
  • BFD: Bidirectional Forwarding Detection

Devices

  • Manager: AKA vManage, AKA, the NMS. What a human interacts with, the GUI
  • Validator: AKA vBond. Initial authentication and provisioning (Cisco calls this orchestration). Responsible for NAT traversal.
  • Controller: AKA vSmart. Holds the current state of the network (routes and data policy), maintains active connections to the edges, and programs them.
  • WAN Edge: AKA vEdge. What gets programmed. Provides the data plane between sites, via circuits like DIA or MPLS.
  • vEDGE: Old hardware-based Viptela gear, pre-Cisco acquisition. Unfavored.

Marketing Terms

  • Cisco SD-WAN Cloud OnRamp: AKA, CoR. Edges can perform analytics to SaaS or IaaS offerings to select the best path, via jitter.

Validator

Should be given an FQDN, so WAN edges have no problems finding it when they first connect over a DIA circuit.

FQDNs also mean we aren't putting a static IP into a config.

Initial authentication is done with PKI, and RSA encryption.

Cannot be placed behind NAT, unless the NAT device does a 1:1 static translation.

This device does the load balancing if multiple controllers are being used.

The Validator has a permanent dTLS tunnel to all the controllers.

Controllers

  • Keeps all the routes between sites, which are managed via OMP (Overlay Management Protocol; like BGP, but proprietary)
  • Logical tunnel topologies (such as hub and spoke, regional, and partial mesh)
  • Service Chaining
  • Traffic Engineering
  • Segmentation per VPN

WAN Edge

  • Dataplane for a site
  • Has OMP, BGP, OSPF, EIGRP, ACLs, ARP, HA, and QoS.
  • Connects via dTLS to the controllers.
  • Connects via IPsec data-plane tunnels to other edges.

SD-WAN Policy

Policies are further classified as

  • Local Policy: Programmed on the edges. ACLs, QoS, routing, and AAA.
  • Centralized Policy: Route policy, before being sent to the edges, (Topology, VPN Membership, Application Aware Routing)

Application Aware Routing

  • If two edges build a data-plane (IPsec) tunnel to each other, BFD is run inside the tunnel.
  • For AAR, or CoR, the edge will send HTTP probes and measure the jitter and/or loss.
  • The score for an app is the vQoE (Viptela Quality of Experience) from 0 to 10, 10 being best.

VPNs

VPN0: Underlay Signaling, transport WAN. Typically public addresses or SRC-NAT Public addresses.

VPN512: OOB Management

VPNn: Any number from 1 to 65527. Not 0. Not 512. Used for service-side (also known as LAN-side) traffic.
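
A small Python sketch of the VPN numbering rules above. The function name vpn_role is made up, and the 1 to 65527 range is simply the one from these notes; the exact upper bound may differ by platform or release.

```python
def vpn_role(vpn_id):
    """Classify an SD-WAN VPN number using the ranges from these notes."""
    if vpn_id == 0:
        return "transport"    # VPN0: underlay signaling, WAN transport
    if vpn_id == 512:
        return "management"   # VPN512: out-of-band management
    if 1 <= vpn_id <= 65527:
        return "service"      # service-side (LAN-side) traffic
    raise ValueError(f"VPN {vpn_id} is outside the ranges in these notes")


for vpn in (0, 10, 512):
    print(vpn, vpn_role(vpn))
```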

sd-wan commands

show sdwan control local-properties

DTLS Tunnels to SDWAN Manager and SDWAN Controllers

show sdwan control connections

show sdwan control connection-history

OMP

show sdwan omp peers

show sdwan omp routes

show sdwan omp tlocs

show sdwan omp services

show sdwan omp multicast-routes

Validator Only

show orchestrator connections

Initial Bringup

Pasting in the bootstrap

tclsh
puts [open "bootflash:name-of-bootstrap-file.cfg" w+] {
<list of certs goes here>
<must be done via an actual terminal>
<like SecureCRT>
<with character and line send delay>
}

Copy via HTTP using Python

  1. Get the current IP

python -c "import socket; s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); s.connect(('8.8.8.8', 80)); print(s.getsockname()[0]); s.close()"

  2. Start the server with above IP

python -m http.server 8000 --bind 10.0.0.1

  3. Copy into cisco box

copy http://10.0.0.1:8000/.cfg bootflash:/.cfg
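
The IP lookup and the web server can also be rolled into one Python helper. A minimal sketch using only the standard library; the names local_ip and make_server are my own, and the fallback to 127.0.0.1 when there is no route is an assumption, not something the one-liner above does.

```python
import functools
import http.server
import socket


def local_ip(probe=("8.8.8.8", 80)):
    """Return the source IP the OS would pick to reach `probe`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(probe)  # UDP connect only selects a route, nothing is sent
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # assumption: fall back to loopback if there is no route
    finally:
        s.close()


def make_server(directory, port=8000, bind=None):
    """Build (but do not start) an HTTP server that serves `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((bind or local_ip(), port), handler)
```

To serve the bootstrap file, something like make_server("/path/to/configs").serve_forever() should do, where the path is a placeholder for wherever the .cfg file lives.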

controller-mode enable

Terms

  • RADIUS - Remote Authentication Dial-In User Service. Created to provide AAA for ISP users, or Dial-In for businesses.
  • TACACS - Terminal Access Controller Access-Control System. An AAA protocol to provide support for authenticate once, authorize many.
  • TACACS+ - Same as above, basically an upgraded version, not backward compatible.
  • EAP - Extensible Authentication Protocol, 802.1x, used for LAN Auth, only works with RADIUS.

TACACS+ Flows

Authentication Flow

tacacs-plus-authentication-flows

Authorization and Accounting Flow

tacacs-plus-auth-and-accounting-flows

Log Message Severity Levels

Keyword        Severity  Description                    Mnemonic
Emergency      0         System unusable                Even
Alert          1         Immediate action required      A
Critical       2         Critical Event (Highest of 3)  Computer
Error          3         Error Event (Middle of 3)      Expert
Warning        4         Warning Event (Lowest of 3)    Will
Notification   5         Normal, More Important         Not
Informational  6         Normal, Less Important         Ignore
Debug          7         Requested by User Debug        Debugs

Mnemonic courtesy of Romelchand
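
Lower numbers are more severe, and setting a logging level on IOS means "this level and everything more severe". A small Python sketch of that comparison; the is_logged helper is a made-up name, not a Cisco API.

```python
# Keyword -> severity number, copied from the table above.
SEVERITY = {
    "emergency": 0,
    "alert": 1,
    "critical": 2,
    "error": 3,
    "warning": 4,
    "notification": 5,
    "informational": 6,
    "debug": 7,
}


def is_logged(message_level, configured_level):
    """True if a message at message_level passes a filter set to configured_level."""
    return SEVERITY[message_level.lower()] <= SEVERITY[configured_level.lower()]


print(is_logged("Error", "Warning"))          # error (3) is more severe than warning (4)
print(is_logged("Informational", "Warning"))  # informational (6) is filtered out
```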

NTP

Server Only - Based on Internal Clock

ntp master <stratum>

Client/Server - Based on other NTP clocks and stratum

ntp server <address|hostname>

An Example Config

I found a list of time servers here.

ntp server pool.ntp.org
ntp server time.nist.gov
ntp server time.cloudflare.com
ntp source <loopback-should-go-here>
!
! NTP Master 7 ... if internet connectivity is lost, and external NTP fails, this box can still serve NTP.
!
ntp master 7

A caution: Using pool.ntp.org

Consider if the NTP Pool is appropriate for your use. If business, organization or human life depends on having correct time or can be harmed by it being wrong, you shouldn't "just get it off the Internet". The NTP Pool is generally very high quality, but it is a service run by volunteers in their spare time. Please talk to your equipment and service vendors about getting local and reliable service setup for you. See also our terms of service. We recommend time servers from Meinberg, but you can also find time servers from End Run, Spectracom and many others.

  • Stop on first match.
  • If the end of the list is reached with no match, the packet is denied (implicit deny).

An ACL to just count traffic should always end with

permit ip any any

Block a specific host

The final "permit any" is necessary because the default action at the end is "deny any"

access-list 1 deny host 10.0.0.1
access-list 1 permit any

Allow a host range

This allows packets from 192.168.10.0/24 to travel to 192.168.200.0/24

access-list 101 permit ip 192.168.10.0 0.0.0.255 192.168.200.0 0.0.0.255
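
First-match, implicit deny, and wildcard masks can be sketched in a few lines of Python. This is only an illustration of the matching logic for source addresses; the function names and the tuple layout are my own, and real ACLs match on much more than this.

```python
import ipaddress


def wildcard_match(addr, base, wildcard):
    """Compare addr to base, ignoring every bit that is set in the wildcard mask."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (a & care) == (b & care)


def evaluate(acl, src):
    """Return the action of the first matching entry, or the implicit deny."""
    for action, base, wildcard in acl:
        if wildcard_match(src, base, wildcard):
            return action  # stop on first match
    return "deny"          # end of list, no match


# access-list 1 from above: "host" is wildcard 0.0.0.0, "any" is 0.0.0.0 255.255.255.255
acl_1 = [
    ("deny", "10.0.0.1", "0.0.0.0"),
    ("permit", "0.0.0.0", "255.255.255.255"),
]
print(evaluate(acl_1, "10.0.0.1"), evaluate(acl_1, "10.0.0.2"))
```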

Deny access except from specific hosts

Usually required for features like CoPP

access-list 10 permit 10.0.0.1
access-list 10 permit 10.0.0.2
access-list 10 permit 10.0.0.3

References

https://www.cisco.com/c/en/us/support/docs/ip/access-lists/26448-ACLsamples.html

CoPP Configuration.

This was performed on a C8000v, running 17.13.1a

  1. A simple ACL that matches based on ICMP.
ip access-list extended ACL_ICMP_UNKNOWN
 permit icmp any any
  2. Make a class-map to use the ACL.
class-map CLASS_MAP_ICMP_UNKNOWN
 match access-group name ACL_ICMP_UNKNOWN
  3. Make a policy map that uses the above class-map.
policy-map POLICY_MAP_COPP
 class CLASS_MAP_ICMP_UNKNOWN
  police cir 10000 conform-action transmit exceed-action drop
 class class-default
  4. Apply it to the control plane.
control-plane
 service-policy input POLICY_MAP_COPP
  5. Validate
router# show policy-map control-plane input 
 Control Plane 

  Service-policy input: POLICY_MAP_COPP

    Class-map: CLASS_MAP_RFC1918 (match-all)  
      0 packets, 0 bytes
      5 minute offered rate 0000 bps
      Match: access-group name ACL_RFC1918

    Class-map: CLASS_MAP_ICMP_UNKNOWN (match-all)  
      0 packets, 0 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: access-group name ACL_ICMP_UNKNOWN
      police:
          cir 1000000 bps, bc 31250 bytes
        conformed 0 packets, 0 bytes; actions:
          transmit 
        exceeded 0 packets, 0 bytes; actions:
          drop 
        conformed 0000 bps, exceeded 0000 bps

    Class-map: class-default (match-any)  
      0 packets, 0 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any
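
How the policer decides between conform and exceed can be sketched as a token bucket. This is a simplified single-rate, two-color model in Python, using the cir (1000000 bps) and bc (31250 bytes) values from the show output above; the Policer class is my own illustration and the real platform implementation will differ in detail.

```python
class Policer:
    """Toy single-rate policer: a bucket of bc bytes refilled at cir bits/sec."""

    def __init__(self, cir_bps, bc_bytes):
        self.rate = cir_bps / 8.0        # refill rate in bytes per second
        self.bc = float(bc_bytes)        # bucket depth
        self.tokens = float(bc_bytes)    # bucket starts full
        self.last = 0.0                  # time of the previous packet

    def offer(self, now, size):
        # Refill for the time that has passed, never above the bucket depth.
        self.tokens = min(self.bc, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return "transmit"            # conform-action
        return "drop"                    # exceed-action


p = Policer(cir_bps=1_000_000, bc_bytes=31250)
burst = [p.offer(0.0, 100) for _ in range(400)]  # 400 packets of 100 bytes at once
print(burst.count("transmit"), burst.count("drop"))
```

With a full bucket, only the first bc bytes of an instantaneous burst conform; after that, packets conform only as fast as the bucket refills.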

Test Setup

This uses python3, scapy, and sendpfast, to send icmp packets with random sources.

  1. Install tcpreplay, which sendpfast depends on.
sudo apt install tcpreplay
  2. Start a python virtual environment.
python3 -m venv venv
source venv/bin/activate
  3. Install scapy inside it.
pip install scapy
  4. Modify dst and iface, then paste in the following python script.

cat > flood.py << 'EOF'
from scapy.all import *
import random

def random_public_ip():
    # Keep drawing until the address is outside the RFC 1918 private ranges.
    while True:
        ip = f"{random.randint(1,223)}.{random.randint(0,255)}.{random.randint(0,255)}.{random.randint(1,254)}"
        if not (ip.startswith("10.") or
                ip.startswith("192.168.") or
                (ip.startswith("172.") and 16 <= int(ip.split(".")[1]) <= 31)):
            return ip

# 1000 ICMP echo requests, each with a random source address.
pkts = [Ether()/IP(src=random_public_ip(), dst="192.168.52.198")/ICMP() for _ in range(1000)]
# Replay the list 100 times at 10,000 packets per second.
sendpfast(pkts, pps=10000, loop=100, iface="ens18")
EOF
  5. In a different terminal run something like this to see the packets leaving the interface.
sudo tcpdump -i ens18 icmp -n
  6. Run the script with sudo, since it requires raw sockets.
sudo venv/bin/python3 flood.py
  • SA - Source Address

  • DA - Destination Address

                      INSIDE NETWORK                                   OUTSIDE NETWORK

           ┌────────────────────────────────────┐         ┌──────────────────────────────────────┐
           │                                    │         │                                      │
           │       ┌────────────┬─────────────┐ │         │       ┌─────────────┬──────────────┐ │
           │ ────► │    SA      │     DA      │ │ ──────► │ ────► │    SA       │     DA       │ │
  ┌──────┐ │       │Inside Local│Outside Local│ │         │       │Inside Global│Outside Global│ │ ┌───────┐
  │Inside│ │       └────────────┴─────────────┘ │  ┌───┐  │       └─────────────┴──────────────┘ │ │Outside│
  │ Host │ │                                    │  │NAT│  │                                      │ │ Host  │
  └──────┘ │ ┌────────────┬─────────────┐       │  └───┘  │ ┌─────────────┬──────────────┐       │ └───────┘
           │ │    SA      │     DA      │       │         │ │    SA       │     DA       │       │
           │ │Inside Local│Outside Local│ ◄──── │ ◄────── │ │Inside Global│Outside Global│ ◄──── │
           │ └────────────┴─────────────┘       │         │ └─────────────┴──────────────┘       │
           │                                    │         │                                      │
           └────────────────────────────────────┘         └──────────────────────────────────────┘

Based on a diagram here.

NAT Overload - Port Address Translation or PAT

This is Source NAT. [1]

Packets to R3 will appear to be from 10.0.0.2

          192.168.1.0/24             10.0.0.0/24        
┌────┐.1                 .2┌────┐.2             .3┌────┐
│ R1 │─────────────────────│ R2 │─────────────────│ R3 │
└────┘E0/0             E0/0└────┘E0/1         E0/1└────┘
                           ▲    ▲                       
                           │    │                       
           Inside ─────────┘    └─────── Outside        

R1

interface Ethernet0/0
 ip address 192.168.1.1 255.255.255.0

ip route 0.0.0.0 0.0.0.0 192.168.1.2

R2

interface Ethernet0/0
 ip address 192.168.1.2 255.255.255.0
 ip nat inside

interface Ethernet0/1
 ip address 10.0.0.2 255.255.255.0
 ip nat outside

ip nat inside source list 1 interface Ethernet0/1 overload

ip access-list standard 1
 10 permit 192.168.1.0 0.0.0.255

R3

interface Ethernet0/1
 ip address 10.0.0.3 255.255.255.0

ip route 0.0.0.0 0.0.0.0 10.0.0.2

R2 Debugs during NAT

Performed with the above configs via CML IOL routers version 17.12.1.

R2# debug ip nat 1
IP NAT debugging is on for access list 1

*Sep 16 21:32:21.386: NAT: Entry assigned id 4
*Sep 16 21:32:21.386: NAT*: ICMP id=5->1024
*Sep 16 21:32:21.386: NAT*: s=192.168.1.1->10.0.0.2, d=10.0.0.3 [17]
*Sep 16 21:32:21.387: NAT*: ICMP id=1024->5
*Sep 16 21:32:21.387: NAT*: s=10.0.0.3, d=10.0.0.2->192.168.1.1 [17]

R2# show ip nat translations
Pro Inside global      Inside local       Outside local      Outside global
icmp 10.0.0.2:1024     192.168.1.1:5      10.0.0.3:5         10.0.0.3:1024
[1] Source NAT, because the source address needs to be changed to access outside hosts. As packets move through the router, they will create entries for return packets.
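
The way an outbound packet creates the entry that the return packet needs can be sketched in Python. This is a toy model only: the Pat class is made up, and starting at port 1024 and counting up is chosen to line up with the debug output above, not a statement of how IOS really allocates ports.

```python
class Pat:
    """Toy NAT overload table: many inside hosts share one outside address."""

    def __init__(self, outside_ip, first_port=1024):
        self.outside_ip = outside_ip
        self.next_port = first_port
        self.out = {}    # (inside local ip, port or ICMP id) -> (inside global ip, port)
        self.back = {}   # reverse mapping, used for return traffic

    def translate_out(self, src_ip, src_id):
        key = (src_ip, src_id)
        if key not in self.out:
            # First packet of a flow: allocate a port and remember both directions.
            mapped = (self.outside_ip, self.next_port)
            self.next_port += 1
            self.out[key] = mapped
            self.back[mapped] = key
        return self.out[key]

    def translate_in(self, dst_ip, dst_id):
        # Return traffic only passes if an outbound packet created an entry.
        return self.back.get((dst_ip, dst_id))


nat = Pat("10.0.0.2")
print(nat.translate_out("192.168.1.1", 5))   # like "ICMP id=5->1024" in the debug
print(nat.translate_in("10.0.0.2", 1024))    # like "ICMP id=1024->5" on the way back
```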

Captured on-wire.

packet #1 - who has 10.0.0.10? Tell 10.0.0.20
packet #2 - 10.0.0.10 is at ce:b1:5f:58:1d:8a

ARP Request

> Ethernet II

    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
    Source: 1a:20:4e:9e:fb:9c (1a:20:4e:9e:fb:9c)
    Type: ARP (0x0806)

> Address Resolution Protocol (request)

    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: 1a:20:4e:9e:fb:9c (1a:20:4e:9e:fb:9c)
    Sender IP address: 10.0.0.20
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 10.0.0.10

ARP Reply

> Ethernet II

    Destination: 1a:20:4e:9e:fb:9c (1a:20:4e:9e:fb:9c)
    Source: ce:b1:5f:58:1d:8a (ce:b1:5f:58:1d:8a)
    Type: ARP (0x0806)
    Padding: <lots of zeros>

> Address Resolution Protocol (reply)

    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    Sender MAC address: ce:b1:5f:58:1d:8a (ce:b1:5f:58:1d:8a)
    Sender IP address: 10.0.0.10
    Target MAC address: 1a:20:4e:9e:fb:9c (1a:20:4e:9e:fb:9c)
    Target IP address: 10.0.0.20
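
The fields in the ARP request above map directly onto a 28-byte payload. A Python sketch that packs the same request with the standard struct module; the helper names are my own, and it builds only the ARP payload, not the Ethernet header or padding.

```python
import struct


def mac_bytes(mac):
    return bytes(int(part, 16) for part in mac.split(":"))


def ip_bytes(ip):
    return bytes(int(part) for part in ip.split("."))


def arp_request(sender_mac, sender_ip, target_ip):
    # Hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware size 6, protocol size 4, opcode 1 (request).
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    return (
        header
        + mac_bytes(sender_mac) + ip_bytes(sender_ip)
        + bytes(6) + ip_bytes(target_ip)   # target MAC is all zeros in a request
    )


payload = arp_request("1a:20:4e:9e:fb:9c", "10.0.0.20", "10.0.0.10")
print(len(payload), payload.hex())
```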
  • ARP Spoofing: happens when an attacker uses a known MAC address on the network, usually the network router for the subnet.

  • ARP Poisoning: happens when ARP tables on devices (routers, switches, hosts) contain false mappings.

Successful ARP attacks lead to traffic hijacking, traffic denial, or man-in-the-middle attacks.

Dynamic ARP Inspection

Minimum config

ip dhcp snooping vlan 10
ip arp inspection vlan 10
ip arp inspection validate src-mac dst-mac ip 
!
! Ports
!
interface GigabitEthernet0/1
 description towards DHCP server
 ip arp inspection trust
 ip dhcp snooping trust

Validation

access-1# show ip dhcp snooping binding 
MacAddress          IpAddress        Lease(sec)  Type           VLAN  Interface
------------------  ---------------  ----------  -------------  ----  --------------------
52:54:00:0D:65:73   10.10.10.102     80574       dhcp-snooping   10    GigabitEthernet0/0
Total number of bindings: 1

access-1# show ip arp inspection 

Source Mac Validation      : Enabled
Destination Mac Validation : Enabled
IP Address Validation      : Enabled

 Vlan     Configuration    Operation   ACL Match          Static ACL
 ----     -------------    ---------   ---------          ----------
   10     Enabled          Active                         

 Vlan     ACL Logging      DHCP Logging      Probe Logging
 ----     -----------      ------------      -------------
   10     Deny             Deny              Off          

 Vlan      Forwarded        Dropped     DHCP Drops      ACL Drops
 ----      ---------        -------     ----------      ---------
   10            134              0              0              0

 Vlan   DHCP Permits    ACL Permits  Probe Permits   Source MAC Failures
 ----   ------------    -----------  -------------   -------------------
   10             48              0              0                     0

 Vlan   Dest MAC Failures   IP Validation Failures   Invalid Protocol Data
 ----   -----------------   ----------------------   ---------------------
          
 Vlan   Dest MAC Failures   IP Validation Failures   Invalid Protocol Data
 ----   -----------------   ----------------------   ---------------------
   10                   0                        0                       0
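
The general idea behind DAI can be sketched in Python: ARP packets on untrusted ports are checked against the DHCP snooping bindings, and trusted ports skip the check. This is a simplified model under that assumption; dai_permits is a made-up name, the binding is the one from the show output above, and the extra validate checks (src-mac, dst-mac, ip) are not modeled.

```python
# (MAC, IP, VLAN) tuples learned by DHCP snooping; taken from the output above.
BINDINGS = {("52:54:00:0d:65:73", "10.10.10.102", 10)}

# Ports configured with "ip arp inspection trust".
TRUSTED_PORTS = {"GigabitEthernet0/1"}


def dai_permits(sender_mac, sender_ip, vlan, port):
    """Return True if an ARP packet would be forwarded in this toy model."""
    if port in TRUSTED_PORTS:
        return True  # trusted ports are not inspected
    return (sender_mac.lower(), sender_ip, vlan) in BINDINGS


print(dai_permits("52:54:00:0D:65:73", "10.10.10.102", 10, "GigabitEthernet0/0"))
print(dai_permits("de:ad:be:ef:00:01", "10.10.10.102", 10, "GigabitEthernet0/0"))
```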

Reference

Cisco - Dynamic ARP Inspection

Practical Networking - Gratuitous ARP

Assured Forwarding PHB Group

Four AF classes, each should get its own resources.

Class    Low drop               Medium drop            High drop
Class 1  AF11 (DSCP 10) 001010  AF12 (DSCP 12) 001100  AF13 (DSCP 14) 001110
Class 2  AF21 (DSCP 18) 010010  AF22 (DSCP 20) 010100  AF23 (DSCP 22) 010110
Class 3  AF31 (DSCP 26) 011010  AF32 (DSCP 28) 011100  AF33 (DSCP 30) 011110
Class 4  AF41 (DSCP 34) 100010  AF42 (DSCP 36) 100100  AF43 (DSCP 38) 100110
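
The AF values follow a pattern: in AFxy, x is the class (top three bits) and y is the drop precedence (next two bits), so the DSCP value is 8x + 2y. A short Python sketch that regenerates the values above; the function name af_dscp is my own.

```python
def af_dscp(af_class, drop):
    """DSCP value for AF<class><drop>: class in bits 5-3, drop precedence in bits 2-1."""
    if not (1 <= af_class <= 4 and 1 <= drop <= 3):
        raise ValueError("AF classes are 1-4, drop precedences are 1-3")
    return 8 * af_class + 2 * drop


for c in range(1, 5):
    print("  ".join(
        f"AF{c}{d} (DSCP {af_dscp(c, d)}) {af_dscp(c, d):06b}" for d in range(1, 4)
    ))
```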

Terms

  • 1 second, is 1000 ms.
  • 1 millisecond: Network latency is measured in ms, or 1 thousandth of a second 0.001.
  • 1 microsecond: 1 μs (a millionth) of a second. 0.000 001. 1000 μs is 1 ms.
  • 1 nanosecond: 1 ns (a billionth) of a second. 0.000 000 001. 1000 ns is 1 μs.
  • NTP: An older time standard. Can sync time to within 1 to 10 ms.
  • PTP: Modern time standard. Can sync time to within 1 to 10 ns.
  • PTPv1: Defined in IEEE 1588-2002.
  • PTPv2: Defined in IEEE 1588-2008, not backwards compatible.
  • PTPv2.1: Defined in IEEE 1588-2019, is backward compatible.
  • 1588 Clock: A clock in the PTP time domain. Clocks have ports.
  • Terminating Clock: A clock with one port.
  • Ordinary Clock: A clock in a terminating device.
  • Boundary Clock: A clock in a transmitting device, like an Ethernet switch. Connects PTP domains.
  • Transparent Clock: A boundary clock that can correct for delay and modifies the PTP event message.
  • Grandmaster: All clocks sync to this one clock.
  • Master: All clocks in a subdomain sync to the master. The master syncs to the grandmaster.

Time Terms

  • Epoch: The start of time.
  • Offset: The estimated time difference between a slave clock and its master clock.
  • Delay: The estimated time between a master clock sending time, and a slave clock receiving it.

Uses

  • Robotics, synchronizing movements.
  • Mobile Phone networks, telemetry, billing, logging
  • Financial Networks, trade settling fairness.
  • Power Networks, to sync to the 60 Hz grid.
  • Science networks, seismic data

Process

After PTP has time from something like a GPS device, it can pass that time along, so long as the devices in the path can mark and read the timestamps.

PTP Delay and Offset Calculations
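
A worked Python sketch of the standard delay request-response math, assuming the path delay is the same in both directions. t1 is when the master sends Sync (master clock), t2 is when the slave receives it (slave clock), t3 is when the slave sends the delay request (slave clock), and t4 is when the master receives it (master clock). The function name and the sample numbers are made up.

```python
def ptp_delay_and_offset(t1, t2, t3, t4):
    """Return (mean path delay, slave offset from master), assuming a symmetric path."""
    master_to_slave = t2 - t1   # delay + offset
    slave_to_master = t4 - t3   # delay - offset
    delay = (master_to_slave + slave_to_master) / 2
    offset = (master_to_slave - slave_to_master) / 2
    return delay, offset


# Made-up example in nanoseconds: true delay is 50, slave clock runs 20 ahead.
t1 = 1000              # master sends Sync
t2 = 1000 + 50 + 20    # slave receives it, stamped by the fast slave clock
t3 = 1100              # slave sends the delay request (slave clock)
t4 = (1100 - 20) + 50  # master receives it (master clock)
print(ptp_delay_and_offset(t1, t2, t3, t4))
```

The slave then subtracts the offset from its own clock. Any asymmetry between the two directions shows up directly as offset error, which is why devices in the path need to timestamp accurately.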

General Messages

  • Announce: Used to determine which Grand Master is selected Best Master

  • Follow_Up: Used to convey a captured timestamp of a transmitted SYNC message

  • Delay_Response: Used to measure delay between IEEE 1588 devices

  • Pdelay_Response_Follow_Up: Used between IEEE 1588 devices to measure the delay on an incoming link

  • Management: Used between management devices and clocks

  • Signaling: Used by clocks to deliver how messages are sent

Event Messages

  • Sync: Used to convey time

  • Delay_Request: Used to measure delay from downstream devices

  • Pdelay_Request: Used to initiate and measure delay

  • Pdelay_Response: Used to respond and measure delay

SyncE synchronizes clock frequency over an Ethernet port. It does not synchronize time-of-day, that's done by PTP, IEEE 1588.

Setting an oscillator to a frequency is syntonization.

References

ITU-T Rec. G.8261 - Architecture and the wander performance of SyncE networks

ITU-T Rec. G.8262 - Synchronous Ethernet clocks for SyncE

ITU-T Rec. G.8264 - Ethernet Synchronization Messaging Channel (ESMC)

Config Options

ITU-T G.813 Option 1 clock (QL-SEC)

EEC-option 1

ITU-T G.812 type IV clock (QL-ST3)

EEC-option 2

Terms

Synchronous Ethernet and IEEE 1588 in Telecoms

  • Time Interval: Distance between two events, measured in seconds, milliseconds, microseconds, nanoseconds, or picoseconds.

  • Frequency: Rate of a repetitive event. Measured in cycles per second. A device that produces frequency is an oscillator.

  • T0: System Clock (line interface output)

  • T1: Timing Reference signal derived from STM-N (STS-N/SyncE) input.

  • T2: Timing Reference signal derived from 2048/1544 kbit input [input from PDH]

  • T3: Timing reference signal derived from 2048 or 2048 1544 with SSM.

  • T4: Clock-interface output.

  • OSC: Internal ST3 oscillator

  • SSM: Synchronization Status Message

  • ESMC: Ethernet Synchronization Message Channel

  • MTIE: Maximum time interval error is a measure of the worst case phase variation of a signal with respect to a perfect signal over a given period of time.

  • TDEV: Time deviation is a statistical analysis of the phase stability of a signal over a given period of time.

Netflow

  • v5 - IPv4 flows only
  • v9 - template based
  • IPFIX - standards based, built on v9

Flexible Netflow

Netflow needs four things to work:

  • Records
  • Exporters
  • Monitors
  • Interfaces

IOS-XE

flow record FLOW_RECORD_IPV4
 match ipv4 protocol
 match ipv4 source address
 match ipv4 destination address
 match transport source-port
 match transport destination-port
 match interface input
 collect interface output
 collect counter bytes long
 collect counter packets long
 collect timestamp sys-uptime first
 collect timestamp sys-uptime last
!
flow exporter FLOW_EXPORTER
 !
 ! IPFix is standards based netflow.
 !
 export-protocol ipfix
 destination 10.0.52.100
 source GigabitEthernet2
 transport udp 2055
 template data timeout 60
!
flow monitor FLOW_MONITOR_IPV4
 exporter FLOW_EXPORTER
 cache timeout active 60
 record FLOW_RECORD_IPV4
!
interface GigabitEthernet1
 ip flow monitor FLOW_MONITOR_IPV4 input
 ip flow monitor FLOW_MONITOR_IPV4 output
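
What the record does can be sketched in Python: the match fields form the key that identifies a flow, and the collect fields are counters kept per key. A toy model only; the packet dictionaries and field names are made up, and timestamps and output interface are left out to keep it short.

```python
from collections import defaultdict

# Key fields, mirroring the "match" lines in FLOW_RECORD_IPV4.
KEY_FIELDS = ("protocol", "src", "dst", "sport", "dport", "input_if")


def build_cache(packets):
    """Group packets into flows by key and keep byte/packet counters per flow."""
    cache = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for pkt in packets:
        key = tuple(pkt[field] for field in KEY_FIELDS)
        cache[key]["bytes"] += pkt["length"]   # like "collect counter bytes long"
        cache[key]["packets"] += 1             # like "collect counter packets long"
    return cache


packets = [
    {"protocol": 17, "src": "10.0.10.101", "dst": "10.0.20.101",
     "sport": 48640, "dport": 5000, "input_if": "Gi4", "length": 1028},
    {"protocol": 17, "src": "10.0.10.101", "dst": "10.0.20.101",
     "sport": 48640, "dport": 5000, "input_if": "Gi4", "length": 1028},
    {"protocol": 89, "src": "10.0.12.2", "dst": "224.0.0.5",
     "sport": 0, "dport": 0, "input_if": "Gi1", "length": 100},
]
cache = build_cache(packets)
for key, counters in cache.items():
    print(key, counters)
```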

IOS-XR

flow exporter-map EXPORTER_MAP_1
  version v9
    options interface-table
    template data timeout 600
  !
  dscp 48
  transport udp 2055
  source Loopback1
  destination <IP 1>
!
flow monitor-map MONITOR_MAP_INTERNET
  record ipv4
  exporter EXPORTER_MAP_1
  cache timeout active 60
  cache timeout inactive 5
!
sampler-map SAMPLER_MAP_INTERNET
  random 1 out-of 500
!
interface ten 1/1
  flow ipv4 monitor MONITOR_MAP_INTERNET sampler SAMPLER_MAP_INTERNET ingress
  flow ipv4 monitor MONITOR_MAP_INTERNET sampler SAMPLER_MAP_INTERNET egress

Lab validations

R1# show flow monitor FLOW_MONITOR_IPV4 statistics 
  Cache type:                               Normal (Platform cache)
  Cache size:                               200000
  Current entries:                               4
  High Watermark:                                4

  Flows added:                                   8
  Flows aged:                                    4
    - Active timeout      (    60 secs)          4


R1# show flow monitor FLOW_MONITOR_IPV4 cache sort highest counter bytes long top 10 format table
Processed 3 flows
Aggregated to 3 flows
Showing the top 3 flows

IPV4 SRC ADDR    IPV4 DST ADDR    TRNS SRC PORT  TRNS DST PORT  INTF INPUT            IP PROT  intf output                     bytes long             pkts long    time first     time last
===============  ===============  =============  =============  ====================  =======  ====================  ====================  ====================  ============  ============
10.0.10.101      10.0.20.101              48640           5000  Gi4                        17  Gi1                                 334100                   325  20:37:12.210  20:37:44.424
10.0.12.2        224.0.0.5                    0              0  Gi1                        89  Null                                   600                     6  20:36:54.026  20:37:41.568
10.0.12.1        224.0.0.5                    0              0  Null                       89  Gi1                                    600                     6  20:36:52.808  20:37:38.836

Commands

show chassis detail

show chassis rmi

Lightweight Modes

Client-Serving AP Modes

  • Local: This is the default mode. A local mode AP tunnels all client traffic, for all WLANs, in CAPWAP, to the controller. In this mode, the AP’s radios are operational only when the AP is connected to its controller. Local mode APs do not support mesh operation. All AP models support Local mode.

  • FlexConnect: In this mode, client traffic can either be tunneled in CAPWAP to the controller, or egress at the AP’s LAN port, depending on the WLAN configuration. FlexConnect mode APs do not support mesh operation. All models support FlexConnect mode.

  • Bridge and Flex+Bridge: These modes are used in mesh deployments, where wireless rather than wired backhaul is used for CAPWAP connectivity. Not all AP models support these modes; see the relevant mesh documentation for information about support for mesh operation.

Network Management AP Modes

  • Monitor: In this mode, the AP radios are dedicated to monitoring the Wi-Fi channel for RRM and rogue detection. All AP models support this mode.

  • Rogue Detector: In this mode, the AP radios are disabled; the AP monitors the LAN to detect on-wire rogue activity. This mode is not supported on Cisco Wave 2 or 802.11ax APs and is deprecated.

  • Sniffer: In this mode, the AP radio operates in promiscuous mode and captures all Wi-Fi traffic on a channel. These packets are tunneled in CAPWAP to the controller, which forwards them to a machine running OmniPeek or Wireshark for storage and analysis.

  • SE-Connect: In this mode, the AP provides a dedicated connection to CleanAir for spectrum analysis by software such as Spectrum Expert or Chanalyzer. SE-Connect mode is supported only on SE models with CleanAir.

Cisco Wireless Controller Configuration Guide, Release 8.10

Basic Ansible

This was done on a home lab running Debian 11. tesseract is my control-node.

  1. Add Ansible to Sources list
  2. Update the OS Sources
  3. Install Ansible
  4. Create SSH keys
  5. Tell Ansible to use ssh-agent so you don't have to retype passwords
  6. Use Ansible to copy the control node SSH key to the ansible hosts
  7. Use an Ansible playbook to ping the devices
  8. Use an Ansible playbook to upgrade the devices
Add Ansible to Sources list
$ echo "deb http://ppa.launchpad.net/ansible/ansible/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/ansible.list
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 93C4A3FD7BB9C367
$ sudo apt update
Install Ansible
$ sudo apt install ansible
Define hosts, Create Host file

Do not put special characters (like -) into the group names. Hosts should be FQDNs.

ariadne@tesseract:~/ansible$ cat /etc/ansible/hosts 
[proxmox]
<hosts redacted>

[docker]
<hosts redacted>

[k8s]
<hosts redacted>

[linux]
<hosts redacted>
Define Defaults, Modify ansible.cfg
ariadne@tesseract:/etc/ansible$ cat ansible.cfg 
# [output omitted]

[defaults]
host_key_checking = False
remote_user = ariadne
Create a public SSH key to allow passwordless access

I'm using an internal linux host called tesseract. It doesn't use a password, it's a home lab.

ariadne@tesseract:~$ ssh-keygen -t rsa -b 4096 -C "ariadne@tesseract.haske.org"
Write a playbook to copy the SSH keys
ariadne@tesseract:~/ansible$ cat copy_ssh_keys.yml 
---
- name: Copy SSH key to hosts
  hosts: all
  become: yes

  tasks:
  - name: Set authorized key taken from file
    authorized_key:
      user: ariadne
      state: present
      key: "{{ lookup('file', '/home/ariadne/.ssh/id_rsa.pub') }}"
Run it
ariadne@tesseract:~/ansible$ ansible-playbook -k copy_ssh_keys.yml 
SSH password: 

PLAY [Copy SSH key to hosts] ***********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************************************************************************************
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]

TASK [Set authorized key taken from file] **********************************************************************************************************************************************************************************************************************
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
ok: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]
changed: [hosts-redacted]

PLAY RECAP *****************************************************************************************************************************************************************************************************************************************************
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
hosts.redacted    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0     
Write a Playbook to Upgrade Everything
ariadne@tesseract:~/ansible$ cat upgrade-everything.yml 
---
- name: Update and upgrade apt packages
  hosts: all
  become: true
  tasks:
    - name: Update apt cache and upgrade all packages
      apt:
        upgrade: yes
        update_cache: yes
        cache_valid_time: 86400 #One day
Sources

https://docs.ansible.com/ansible/latest/installation_guide/installation_distros.html#installing-ansible-on-debian

https://docs.ansible.com/ansible/latest/inventory_guide/connection_details.html

Have a valid user with AAA new-model turned on

conf t
aaa new-model
aaa authentication login default local
aaa authorization exec default local
username admin privilege 15 secret cisco123

Restconf

  1. RESTCONF uses HTTP or HTTPS, so turn on the webserver
conf t
ip http secure-server
  2. Turn on RESTCONF
conf t
restconf
  3. Validate

RESTCONF relies on DMI and nginx

restconf-router# show platform software yang-management process
confd            : Running    
nesd             : Running    
syncfd           : Running    
ncsshd           : Running    
dmiauthd         : Running    
nginx            : Running    
ndbmand          : Running    
pubd             : Running  

Get an IP Address

This is done from the linux commandline via curl

--insecure is added because Cisco generates its own self-signed certificates.

ariadne@tesseract:~$ curl --insecure --user admin:cisco123 \
   -H "Accept: application/yang-data+json" \
   https://192.168.52.199/restconf/data/Cisco-IOS-XE-native:native/interface/Loopback=0

{
  "Cisco-IOS-XE-native:Loopback": {
    "name": 0,
    "ip": {
      "address": {
        "primary": {
          "address": "1.1.1.1",
          "mask": "255.255.255.255"
        }
      }
    }
  }
}

Set an IP Address

Also done from the linux commandline via curl, just with a PATCH message.

ariadne@tesseract:~$ curl --insecure --user admin:cisco123 \
   -X PATCH \
   -H "Accept: application/yang-data+json" \
   -H "Content-Type: application/yang-data+json" \
   https://192.168.52.199/restconf/data/Cisco-IOS-XE-native:native/interface/Loopback=0 \
   -d '{
     "Cisco-IOS-XE-native:Loopback": {
       "name": 0,
       "ip": {
         "address": {
           "primary": {
             "address": "2.2.2.2",
             "mask": "255.255.255.255"
           }
         }
       }
     }
   }'
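
The same GET and PATCH can be built from Python. This is only a sketch using the standard library, assuming the same lab address and credentials as the curl examples. The helper names build_request and send are mine, not part of any Cisco tooling. Certificate checks are switched off to mirror --insecure, which is only reasonable in a lab.

```python
import base64
import json
import ssl
import urllib.request

# Values below mirror the curl examples; adjust for your own lab.
HOST = "192.168.52.199"
USER = "admin"
PASSWORD = "cisco123"
MEDIA_TYPE = "application/yang-data+json"


def build_request(method, resource, body=None):
    """Build (but do not send) a RESTCONF request for the given resource."""
    url = f"https://{HOST}/restconf/data/{resource}"
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    headers = {"Accept": MEDIA_TYPE, "Authorization": f"Basic {token}"}
    data = None
    if body is not None:
        headers["Content-Type"] = MEDIA_TYPE
        data = json.dumps(body).encode()
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request):
    """Send the request without certificate checks, like curl --insecure."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.status, response.read().decode()


LOOPBACK0 = "Cisco-IOS-XE-native:native/interface/Loopback=0"
get_req = build_request("GET", LOOPBACK0)
patch_req = build_request("PATCH", LOOPBACK0, {
    "Cisco-IOS-XE-native:Loopback": {
        "name": 0,
        "ip": {"address": {"primary": {"address": "2.2.2.2",
                                        "mask": "255.255.255.255"}}},
    }
})
# send(get_req) and send(patch_req) need a reachable router, so they are not called here.
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before anything touches the router.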

Use NETCONF-YANG

  1. Ensure a Valid user with AAA new-model is turned on, and available (see above)

  2. Turn on NETCONF-YANG

conf t
netconf-yang
  3. Validate
restconf-router#show netconf-yang status 
netconf-yang: enabled
netconf-yang ssh port: 830
netconf-yang candidate-datastore: disabled

I performed this lab inside a Python virtual environment on Linux.

  1. Create a python virtual environment
python3 -m venv ~/netconf-lab
  2. Activate it
source ~/netconf-lab/bin/activate
  3. Install ncclient
pip install ncclient
  4. Enter the python shell
python
  5. Connect to device:
>>> from ncclient import manager
>>> conn = manager.connect(
    host="192.168.52.199",
    port=830,
    username="admin",
    password="cisco123",
    hostkey_verify=False,
    device_params={"name": "iosxe"}
)
  6. Paste in a payload, follow the XML
>>> payload = """
<config>
  <native xmlns="http://cisco.com/ns/yang/Cisco-IOS-XE-native">
    <interface>
      <Loopback>
        <name>5</name>
        <ip>
          <address>
            <primary>
              <address>5.5.5.5</address>
              <mask>255.255.255.255</mask>
            </primary>
          </address>
        </ip>
      </Loopback>
    </interface>
  </native>
</config>
"""
>>> conn.edit_config(target="running", config=payload)
<?xml version="1.0" encoding="UTF-8"?>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="urn:uuid:5edcd8ca-3e51-4581-8bce-87f7eb939735" xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0"><ok/></rpc-reply>
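
Pasting XML by hand gets old quickly. Here is a small sketch that builds the same payload from parameters with the standard library. The function name loopback_payload is my own, and the element layout simply copies the payload above, so treat it as an assumption about what the device accepts.

```python
import xml.etree.ElementTree as ET

NATIVE_NS = "http://cisco.com/ns/yang/Cisco-IOS-XE-native"


def loopback_payload(number, address, mask="255.255.255.255"):
    """Return an edit-config payload string for one Loopback interface."""
    config = ET.Element("config")
    # Writing xmlns as a plain attribute keeps the output shaped like the hand-written XML.
    native = ET.SubElement(config, "native", xmlns=NATIVE_NS)
    interface = ET.SubElement(native, "interface")
    loopback = ET.SubElement(interface, "Loopback")
    ET.SubElement(loopback, "name").text = str(number)
    ip = ET.SubElement(loopback, "ip")
    primary = ET.SubElement(ET.SubElement(ip, "address"), "primary")
    ET.SubElement(primary, "address").text = address
    ET.SubElement(primary, "mask").text = mask
    return ET.tostring(config, encoding="unicode")


payload = loopback_payload(5, "5.5.5.5")
# With an open ncclient session: conn.edit_config(target="running", config=payload)
```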

Reference

Programmability Configuration Guide, Cisco IOS XE 17.17.x

Terms

Term          Definition
------------  ---------------------------------------------------------------------------
MR-APS        Multi-Router APS (inter-chassis APS).
APS           Automatic Protection Switching for POS
UNI           User Network Interface
NNI           Network Node Interface
Interworking  Getting L2 information from Ethernet to work over SONET or frame relay.
STE           Section terminating equipment
LTE           Line terminating equipment
PTE           Path terminating equipment
POH           Path overhead - This layer represents end-to-end status.
LOH           Line overhead - Typically major nodes in SONET like ADMs
SOH           Section overhead - Optical regenerators
SPE           Synchronous payload envelope
BIP           Bit Interleaved Parity
FEBE          Far End Block Error

Sonet

Path Payloads must match. Check Scrambling.

Network elements are expected to terminate and understand their layer, and layer overhead

If a SONET receiver at the Line level counts a BIP error, it reports it back to the sender. The sender increments its line FEBE counter.

It's been a while, the below might be wrong.

+-------------------------------------------------- PATH -------------------------------------------------+
|                                                                                                         |
|                                                                                                         |
|   +--------------- LINE --------------------+            +------------------ LINE-------------------+   |
|   |                                         |            |                                          |   |
v   v                                         v            v                                          v   v

+---+      +------------+       +-----+       +------------+      +-----+       +------------+        +---+
|CPE|------|Terminal    |-------|Regen|-------|Add/Drop    |------|Regen|-------|Terminal    |--------|CPE|
+---+ DS-n | Multiplexer| OC-N  +-----+ OC-N  | Multiplexer| OC-N +-----+ OC-N  | Multiplexer|  DS-n  +---+
           +------------+                     +------------+                    +------------+

    ^      ^            ^       ^     ^       ^            ^      ^     ^       ^            ^        ^
    |      |            |       |     |       |            |      |     |       |            |        |
    +------+            +-------+     +-------+            +------+     +-------+            +--------+
    SECTION              SECTION       SECTION             SECTION       SECTION              SECTION

C2 Byte

C2 Defines the SONET payload

An old note, probably from a standard document.

The SONET standard defines the C2 byte as the path signal label. The purpose of this byte 
is to communicate the payload type that the SONET Framing OverHead (FOH) encapsulates. 
The C2 byte functions similar to Ethertype and Logical Link Control (LLC)/Subnetwork 
Access Protocol (SNAP) header fields on an Ethernet network. The C2 byte allows a single
interface to transport multiple payload types simultaneously.

This table lists common values for the C2 byte:

Hex Value  SONET Payload Contents
---------  ---------------------------------------------------------------
00         Unequipped.
01         Equipped - non-specific payload.
02         Virtual Tributaries (VTs) inside (default).
03         VTs in locked mode (no longer supported).
04         Asynchronous DS3 mapping.
12         Asynchronous DS-4NA mapping.
13         Asynchronous Transfer Mode (ATM) cell mapping.
14         Distributed Queue Dual Bus (DQDB) cell mapping.
15         Asynchronous Fiber Distributed Data Interface (FDDI) mapping.
16         IP inside Point-to-Point Protocol (PPP) with scrambling.
CF         IP inside PPP without scrambling.
E1-FC      Payload Defect Indicator (PDI).
FE         Test signal mapping (see ITU Rec. G.707).
FF         Alarm Indication Signal (AIS).

An Example:

Framing: SONET
SPE Scrambling: Enabled
C2 State: Stable   C2_rx = 0xCF (207)   C2_tx = 0x16 (22) / Scrambling Derived
S1S0(tx): 0x0  S1S0(rx): 0x2 / Framing Derived
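
To read an example like this quickly, the C2 values can be looked up against the table. A rough sketch: the dictionary only holds some of the values listed above, and the function names are made up for illustration.

```python
# A subset of the C2 table above, keyed by the byte value.
C2_PAYLOADS = {
    0x00: "Unequipped",
    0x01: "Equipped - non-specific payload",
    0x02: "Virtual Tributaries (VTs) inside",
    0x04: "Asynchronous DS3 mapping",
    0x13: "ATM cell mapping",
    0x16: "IP inside PPP with scrambling",
    0xCF: "IP inside PPP without scrambling",
    0xFE: "Test signal mapping",
    0xFF: "Alarm Indication Signal (AIS)",
}


def describe_c2(value):
    """Return the payload description for a C2 byte."""
    if 0xE1 <= value <= 0xFC:
        return "Payload Defect Indicator (PDI)"
    return C2_PAYLOADS.get(value, "not in this table")


def c2_mismatch(c2_rx, c2_tx):
    """True when the two ends disagree about the payload type."""
    return c2_rx != c2_tx


# Values from the example: C2_rx = 0xCF, C2_tx = 0x16
rx, tx = 0xCF, 0x16
summary = (describe_c2(rx), describe_c2(tx), c2_mismatch(rx, tx))
```

In the example the far end sends PPP without scrambling while this end sends PPP with scrambling, which is the kind of path payload mismatch worth checking first.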

Monitoring at each Network Element is usually helpful

POS - Spawned interface from SONET controller.

controller SONET0/2/0/0

clock source internal

Sonet YELLOW is RDI (Remote Defect indication)

Packet Over Sonet

Document: Troubleshooting Bit Error on SONET Links
URL: http://www.cisco.com/en/US/tech/tk482/tk607/technologies_tech_note09186a0080094a79.shtml
Section: When Do Particular BIP Errors Occur?

In addition, you must understand that BIP errors have different error detection resolutions, which are explained here:

B1: B1 can detect up to eight parity errors per frame. This level of resolution is not acceptable at OC-192 rates. Even-numbered errors can elude the parity check on links with high error rates.

B2: B2 can detect a far higher number of errors per frame. The exact number increases as the number of STS-1s (or STM-1s) increases in the SONET frame. For example, an OC-192/STM-64 produces a 192 x 8 = 1536 bit-wide BIP field. In other words, B2 can count up to 1536 bit errors per frame. There is considerably less chance of an even-numbered error that eludes the B2 parity calculation. B2 offers superior resolution when compared to B1 or B3. Therefore, a SONET interface can report B2 errors only for a particular monitored segment.

B3: B3 can detect up to eight parity errors in the entire SPE. This number produces acceptable resolution for a channelized interface because, (for example) each STS-1 in an STS-3 has a path overhead and B3 byte. However, this number produces poor resolution over concatenated payloads in which a single set of path overhead must cover a relatively large payload frame. 
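
The "up to eight errors" limit and the even-numbered blind spot are easier to see in code. This is a simplified sketch of BIP-8 as even parity per bit position, which works out to XOR-ing every byte together. It ignores scrambling and which frame actually carries the parity byte, and the helper names are mine.

```python
from functools import reduce


def bip8(data: bytes) -> int:
    """Even parity per bit position over all bytes (XOR of every byte)."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


def parity_errors(sent_bip: int, received: bytes) -> int:
    """Count the bit positions whose parity no longer matches (0 to 8)."""
    return bin(sent_bip ^ bip8(received)).count("1")


frame = bytes(range(64))
sent = bip8(frame)

one_error = bytearray(frame)
one_error[10] ^= 0x01           # flip bit 0 of one byte

two_errors = bytearray(one_error)
two_errors[20] ^= 0x01          # flip bit 0 of a second byte

single = parity_errors(sent, bytes(one_error))    # the error is seen
double = parity_errors(sent, bytes(two_errors))   # the two flips cancel out
```

One flipped bit shows up as one parity error. Two flipped bits in the same bit position cancel and go unnoticed, which is the weakness the text describes for B1 and B3.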
Packet over SONET commands
Displays information about the automatic protection switching feature

show aps

Displays information about the hardware

show controller sonet slot/port-adapter/port

Displays information about the interface

show controllers pos

G709

G709 (ITU-T G.709) is an optical transport specification that includes FEC (Forward Error Correction). It uses Reed-Solomon coding to produce redundant information that can be used to rebuild the frame.

  • OTU - Optical channel Transport Unit

  • ODU - Optical channel Data Unit

  • OPU - Optical channel Payload Unit

SRP - Spatial Reuse protocol

This is used for fiber rings. It's where the destination node pulls the info from the ring so it doesn't loop endlessly.

Likely taken from a standards document someplace

Spatial Reuse Protocol (SRP) is a media-independent MAC layer protocol that operates over two counterrotating
fiber-optic rings. The dual rings provide survivability of data in case of a failed node or a break in 
connecting cables by rerouting the data path over the alternate ring. SRP provides a more efficient use of 
bandwidth by having packets traverse only the part of the ring necessary to get to the destination node. Once
the packet has reached the destination node, it is removed from the ring, allowing other parts of the ring
to reuse the bandwidth. Data packets travel on one ring, while associated control packets travel in the opposite
direction on the alternate ring, ensuring that the data takes the shortest path to its destination.

RPR - Resilient Packet Ring

802.17

  • Steering - Nodes are told the affected node is down and don't include it.
  • Wrapping - The node closest to the break routes the traffic back in the other direction on the ring.

Side A Always connects to Side B.

Example of a working connection.

Node2# show controller srp 4/0
SRP4/0 - Side A (Outer RX, Inner TX)
SECTION
  LOF = 0          LOS    = 0                            BIP(B1) = 3
LINE
  AIS = 0          RDI    = 0          FEBE = 36599      BIP(B2) = 46
PATH
  AIS = 0          RDI    = 0          FEBE = 4440       BIP(B3) = 26
  LOP = 0          NEWPTR = 0          PSE  = 0          NSE     = 0

Active Defects: None
Active Alarms:  None
Alarm reporting enabled for: SLOS SLOF PLOP 

Framing           : SONET
Rx SONET/SDH bytes: (K1/K2) = 0/0        S1S0 = 0  C2 = 0x16
Tx SONET/SDH bytes: (K1/K2) = 0/0        S1S0 = 0  C2 = 0x16  J0 = 0x1 
Clock source      : Internal
Framer loopback   : None
Path trace buffer : Stable 
  Remote hostname : Node1
  Remote interface: SRP4/0
  Remote IP addr  : <removed>
  Remote side id  : B
BER thresholds:           SF = 10e-3  SD = 10e-6
IPS BER thresholds(B3):   SF = 10e-3  SD = 10e-6
TCA thresholds:           B1 = 10e-6  B2 = 10e-6  B3 = 10e-6

SRP4/0 - Side B (Inner RX, Outer TX)
SECTION
LOF = 0          LOS    = 0                            BIP(B1) = 65535
LINE
AIS = 0          RDI    = 0          FEBE = 65535      BIP(B2) = 65535
PATH
AIS = 0          RDI    = 0          FEBE = 65535      BIP(B3) = 65535
LOP = 0          NEWPTR = 3          PSE  = 0          NSE     = 0
Active Defects: None
Active Alarms:  None
Alarm reporting enabled for: SLOS SLOF PLOP 
Framing           : SONET
Rx SONET/SDH bytes: (K1/K2) = 0/0        S1S0 = 0  C2 = 0x16
Tx SONET/SDH bytes: (K1/K2) = 0/0        S1S0 = 0  C2 = 0x16  J0 = 0x1 
Clock source      : Internal
Framer loopback   : None
Path trace buffer : Stable 
Remote hostname : Node3
Remote interface: SRP4/0
Remote IP addr  : <removed>
Remote side id  : A
BER thresholds:           SF = 10e-3  SD = 10e-6
IPS BER thresholds(B3):   SF = 10e-3  SD = 10e-6
TCA thresholds:           B1 = 10e-6  B2 = 10e-6  B3 = 10e-6
References

SONET Primer

T1 Framing

D4 Frame is 24 timeslots + framing bit.

100011011100

Ethernet II --  14 octets.
MPLS        --   4 octets.
CESoPSN     --   4 octets.
TDM Payload -- 192 octets.

Each Ethernet II frame takes up 1712 bits on the wire.
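
The 1712 figure is just the four sizes above added up and converted to bits. A quick check, assuming nothing else is counted (no FCS, preamble, or inter-frame gap):

```python
# Octet counts taken from the list above.
HEADER_OCTETS = {"Ethernet II": 14, "MPLS": 4, "CESoPSN": 4}
TDM_PAYLOAD_OCTETS = 192

total_octets = sum(HEADER_OCTETS.values()) + TDM_PAYLOAD_OCTETS
total_bits = total_octets * 8
overhead_octets = total_octets - TDM_PAYLOAD_OCTETS

print(total_octets, total_bits, overhead_octets)  # 214 1712 22
```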

T1 Channel Associated Signaling (CAS) [Used for voice]
    Every 6th frame has the lowest order bit of each channel stolen for signaling information.
    Super Framing (12 frames) does this in frames 6 (A bit) and 12 (B bit), so the pattern repeats as 6 (A bit), 12 (B bit), 18 (A bit), 24 (B bit)
    Extended Super Framing (24 frames) does this in frames 6, 12, 18, and 24, which makes four bits. A, B, C, D

On RX

  • 175 contiguous pulse positions with no pulses of either positive or negative polarity.

On TX

  • Sends a yellow alarm (Far End Alarm)
  • Next device downstream gets a blue alarm

This device marks the link as T1 LOS (Loss of Signal).

T1 Clocking Types

Command                    Description
-------------------------  -----------------------------------------------------------------------------
clock source line          derive reference from external device.
clock source internal      use local PLL for reference.
network-clock-participate  join the TDM backplane of the router.
network-clock-select       Tells the TDM backplane to use certain T1 as a reference clock, and share it.

network-clock-select requires a T1 line to be in clock source line mode.

network-clock-participate is required for network-clock-select

Mainboard voice DSPs MUST use the backplane clock. They can't opt out.

All network-clock-participate devices share the same clocking-domain.

T1 Clocking Information

T1 reads from RX and TX buffers at the clock rate. Slips are reported when data is read at the wrong clock. Sometimes it might sample the same bit twice, sometimes it might miss bits completely.

UDP Packet Format

User Datagram Protocol - RFC 768

UDP does try to detect corrupted packets by including a checksum. The text below is from the RFC:

Checksum is the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the UDP header, and the data, padded with zero octets at the end (if necessary) to make a multiple of two octets.

...

If the computed checksum is zero, it is transmitted as all ones (the equivalent in one's complement arithmetic). An all zero transmitted checksum value means that the transmitter generated no checksum (for debugging or for higher level protocols that don't care).

 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
┌────────────────────────────────┬───────────────────────────────┐
│          Source Port           │       Destination Port        │
├────────────────────────────────┼───────────────────────────────┤
│            Length              │           Checksum            │
├────────────────────────────────┴───────────────────────────────┘
│          Data Octets
└────────────────────────────────►
TFTP Read Request
Frame 115: 69 bytes on wire (552 bits), 69 bytes captured (552 bits) on interface -, id 0
    Internet Protocol Version 4, Src: 10.0.10.22, Dst: 10.0.10.33
    User Datagram Protocol, Src Port: 52775, Dst Port: 69
        Source Port: 52775
        Destination Port: 69
        Length: 31
        Checksum: 0x4aed [correct]
        [Checksum Status: Good]
        [Stream index: 0]
        [Timestamps]
        UDP payload (23 bytes)
    Trivial File Transfer Protocol
        Opcode: Read Request (1)
        Source File: startup-config
        Type: octet
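
The checksum in the read request above can be reproduced from the RFC 768 description. A sketch, assuming the 23-byte TFTP payload is just the opcode, the filename, and the mode with no options. The function names are mine.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum, padding with a zero octet if needed."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    """Checksum over the pseudo header, the UDP header, and the data."""
    length = 8 + len(payload)  # UDP header is 8 octets
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))  # zero, protocol 17, UDP length
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is sent as all ones


# TFTP read request: opcode 1, filename, mode, each string NUL-terminated.
rrq = b"\x00\x01" + b"startup-config\x00" + b"octet\x00"
print(hex(udp_checksum("10.0.10.22", "10.0.10.33", 52775, 69, rrq)))  # 0x4aed
```

The result matches the 0x4aed that the capture marks as correct, which also confirms the Length of 31 (8 header octets plus 23 payload octets).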
TFTP Data Packet
Frame 116: 562 bytes on wire (4496 bits), 562 bytes captured (4496 bits) on interface
    Internet Protocol Version 4, Src: 10.0.10.33, Dst: 10.0.10.22
    User Datagram Protocol, Src Port: 52590, Dst Port: 52775
        Source Port: 52590
        Destination Port: 52775
        Length: 524
        Checksum: 0xde83 [correct]
        [Checksum Status: Good]
        [Stream index: 1]
        [Timestamps]
        UDP payload (516 bytes)
    Trivial File Transfer Protocol
        Opcode: Data Packet (3)
        [Destination File: startup-config]
        [Read Request in frame 115]
        Block: 1
        [Full Block Number: 1]
    Data (512 bytes)
    
    0000  0a 21 0a 21 20 4c 61 73 74 20 63 6f 6e 66 69 67   .!.! Last config
    0010  75 72 61 74 69 6f 6e 20 63 68 61 6e 67 65 20 61   uration change a
    0020  74 20 30 35 3a 31 31 3a 31 35 20 55 54 43 20 53   t 05:11:15 UTC S
    0030  61 74 20 4a 75 6c 20 38 20 32 30 32 33 0a 21 0a   at Jul 8 2023.!.
    0040  76 65 72 73 69 6f 6e 20 31 35 2e 32 0a 73 65 72   version 15.2.ser
    0050  76 69 63 65 20 74 69 6d 65 73 74 61 6d 70 73 20   vice timestamps 
    0060  64 65 62 75 67 20 64 61 74 65 74 69 6d 65 20 6d   debug datetime m
    0070  73 65 63 0a 73 65 72 76 69 63 65 20 74 69 6d 65   sec.service time
    0080  73 74 61 6d 70 73 20 6c 6f 67 20 64 61 74 65 74   stamps log datet
    0090  69 6d 65 20 6d 73 65 63 0a 6e 6f 20 73 65 72 76   ime msec.no serv
    00a0  69 63 65 20 70 61 73 73 77 6f 72 64 2d 65 6e 63   ice password-enc
    00b0  72 79 70 74 69 6f 6e 0a 73 65 72 76 69 63 65 20   ryption.service 
    00c0  63 6f 6d 70 72 65 73 73 2d 63 6f 6e 66 69 67 0a   compress-config.
    00d0  21 0a 68 6f 73 74 6e 61 6d 65 20 53 57 33 0a 21   !.hostname SW3.!
    00e0  0a 62 6f 6f 74 2d 73 74 61 72 74 2d 6d 61 72 6b   .boot-start-mark
    00f0  65 72 0a 62 6f 6f 74 2d 65 6e 64 2d 6d 61 72 6b   er.boot-end-mark
    0100  65 72 0a 21 0a 21 0a 6c 6f 67 67 69 6e 67 20 64   er.!.!.logging d
    0110  69 73 63 72 69 6d 69 6e 61 74 6f 72 20 45 58 43   iscriminator EXC
    0120  45 53 53 20 73 65 76 65 72 69 74 79 20 64 72 6f   ESS severity dro
    0130  70 73 20 36 20 6d 73 67 2d 62 6f 64 79 20 64 72   ps 6 msg-body dr
    0140  6f 70 73 20 45 58 43 45 53 53 43 4f 4c 4c 20 0a   ops EXCESSCOLL .
    0150  6c 6f 67 67 69 6e 67 20 62 75 66 66 65 72 65 64   logging buffered
    0160  20 35 30 30 30 30 0a 6c 6f 67 67 69 6e 67 20 63    50000.logging c
    0170  6f 6e 73 6f 6c 65 20 64 69 73 63 72 69 6d 69 6e   onsole discrimin
    0180  61 74 6f 72 20 45 58 43 45 53 53 0a 21 0a 6e 6f   ator EXCESS.!.no
    0190  20 61 61 61 20 6e 65 77 2d 6d 6f 64 65 6c 0a 21    aaa new-model.!
    01a0  0a 21 0a 21 0a 21 0a 21 0a 6e 6f 20 69 70 20 69   .!.!.!.!.no ip i
    01b0  63 6d 70 20 72 61 74 65 2d 6c 69 6d 69 74 20 75   cmp rate-limit u
    01c0  6e 72 65 61 63 68 61 62 6c 65 0a 21 0a 21 0a 21   nreachable.!.!.!
    01d0  0a 6e 6f 20 69 70 20 64 6f 6d 61 69 6e 2d 6c 6f   .no ip domain-lo
    01e0  6f 6b 75 70 0a 69 70 20 63 65 66 0a 6e 6f 20 69   okup.ip cef.no i
    01f0  70 76 36 20 63 65 66 0a 21 0a 21 0a 21 0a 73 70   pv6 cef.!.!.!.sp

Alpine Hosts

hostname pc-20
ip link set dev eth0 up
ip address add 10.0.20.20/24 dev eth0
ip route add default via 10.0.20.1

iperf

Server iperf --port 2000 --server

Client iperf --port 2000 --client 10.0.0.1 --num 10k --reverse --udp

CML On Proxmox

... seems to work fine!

If you have enterprise CML, there is a front network and a back network.

The back network uses IPv6 link-local addresses, which do not play well with Proxmox port channels and VLAN tags.

It seems much safer to have a dedicated port for the back network.

Subnet with fingers

I just memorize these sequences. Ungainly, but it works.

Decimal masks - 128, 192, 224, 240, 248, 252, 254, 255

Wildcard masks - 127, 63, 31, 15, 7, 3, 1, 0
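
If memory fails, both sequences can be rebuilt. Each decimal mask octet is one more leading 1 bit than the last, and the wildcard is 255 minus the mask. A small sketch (mask_octet is a made-up helper name):

```python
def mask_octet(bits: int) -> int:
    """Value of one octet with the given number of leading 1 bits (1 to 8)."""
    return (0xFF << (8 - bits)) & 0xFF


decimal_masks = [mask_octet(bits) for bits in range(1, 9)]
wildcard_masks = [255 - octet for octet in decimal_masks]

print(decimal_masks)   # [128, 192, 224, 240, 248, 252, 254, 255]
print(wildcard_masks)  # [127, 63, 31, 15, 7, 3, 1, 0]
```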

RFC 791 - Classful Networking

In early Internet addressing (1980s), the IP address itself indicated the subnet mask through its high-order bits. There were only three network sizes.

/8 - Address starts with 0-127 - 128 networks

/16 - Address starts with 128-191 - 16,384 networks

/24 - Address starts with 192-223 - 2,097,152 networks

In the long ago, the hope was to use the first few bits of an address to tell the subnet mask. Even though we never do this in the modern era a few parts of classful networking are still here.

  • /24 is a very popular prefix
  • /16 is a very popular prefix
  • All multicast addresses start with 1110
Internet Protocol
Specification

  Addressing

    To provide for flexibility in assigning address to networks and
    allow for the  large number of small to intermediate sized networks
    the interpretation of the address field is coded to specify a small
    number of networks with a large number of host, a moderate number of
    networks with a moderate number of hosts, and a large number of
    networks with a small number of hosts.  In addition there is an
    escape code for extended addressing mode.

    Address Formats:

      High Order Bits   Format                           Class
      ---------------   -------------------------------  -----
            0            7 bits of net, 24 bits of host    a
            10          14 bits of net, 16 bits of host    b
            110         21 bits of net,  8 bits of host    c
            111         escape to extended addressing mode
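
The high-order-bit test from the RFC table can be written out directly. A sketch using the standard ipaddress module. The function name classful and the "extended" label for everything starting with 111 are my own choices.

```python
import ipaddress


def classful(address: str):
    """Return (class, implied prefix length) from the high-order bits."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet & 0b10000000 == 0b00000000:
        return ("a", 8)    # 0...
    if first_octet & 0b11000000 == 0b10000000:
        return ("b", 16)   # 10..
    if first_octet & 0b11100000 == 0b11000000:
        return ("c", 24)   # 110.
    return ("extended", None)  # 111. escape to extended addressing mode


for example in ("10.0.10.22", "172.16.0.1", "192.168.52.199", "224.0.0.5"):
    print(example, classful(example))
```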

RFC1918 Dungeons

These are the most famous IPv4 networks.

RFC 1918        Address Allocation for Private Internets   February 1996

3. Private Address Space

   The Internet Assigned Numbers Authority (IANA) has reserved the
   following three blocks of the IP address space for private internets:

     10.0.0.0        -   10.255.255.255  (10/8 prefix)
     172.16.0.0      -   172.31.255.255  (172.16/12 prefix)
     192.168.0.0     -   192.168.255.255 (192.168/16 prefix)

   We will refer to the first block as "24-bit block", the second as
   "20-bit block", and to the third as "16-bit" block. Note that (in
   pre-CIDR notation) the first block is nothing but a single class A
   network number, while the second block is a set of 16 contiguous
   class B network numbers, and third block is a set of 256 contiguous
   class C network numbers.
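
Checking whether an address falls in one of these three blocks is a one-liner with the standard ipaddress module. A sketch (is_rfc1918 is a made-up name). The host bit counts also show where the RFC gets the "24-bit", "20-bit", and "16-bit" block names.

```python
import ipaddress

RFC1918_BLOCKS = [
    ipaddress.ip_network(prefix)
    for prefix in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address: str) -> bool:
    """True when the address is inside one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


# Host bits per block: 32 minus the prefix length.
host_bits = [32 - block.prefixlen for block in RFC1918_BLOCKS]
print(host_bits)  # [24, 20, 16]
```

Note that 172.32.0.0 is outside the second block, which is an easy one to get wrong by eye.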

IP Protocol Numbers

When IP encapsulates another protocol, it sets the Protocol field to a number that identifies the next layer.

IP Protocol Number  Description
------------------  -----------
1                   ICMP
2                   IGMP
6                   TCP
17                  UDP
46                  RSVP
47                  GRE
50                  ESP (IPSec)
51                  AH (IPSec)
88                  EIGRP
89                  OSPF
103                 PIM
112                 VRRP
115                 L2TP

TFTP (69), SNMP (161), and SNMP traps (162) are UDP port numbers, not IP protocol numbers.

Protocol Numbers - IANA

Cisco Administrative Distance

Protocol              Administrative Distance
--------------------  -----------------------
Connected             0
Static                1
EIGRP Summary         5
eBGP                  20
EIGRP Internal        90
OSPF                  110
IS-IS                 115
RIP                   120
ODR                   160
EIGRP External        170
iBGP                  200
Unknown/Infinite [1]  255

Troubleshooting TechNotes - What is Administrative Distance? - Cisco

[1] Can be used to do route-filtering.

IO Pathways

Device controller tells the CPU it's done (put data into a buffer) by sending an interrupt.

IO goes from controller - local buffer - CPU

Interrupts

Hardware interrupts

  • A buffer has been filled

Traps or exceptions are software generated interrupts

  • User requests
  • Errors

Most operating systems are interrupt driven.

Storage Structures

Main Memory (DRAM)

  • Random Access
  • Lost with power outage (volatile)

Secondary Storage

  • Larger
  • Not lost with power outage (non-volatile)

Caching

Copying data from slower storage into faster storage, for example from secondary storage into main memory

  • Faster

Storage Hierarchy: Registers > cache > main memory (DRAM) > solid-state disks > spinning disks > optical disks > magnetic tapes.

Direct Memory Access (DMA)

Some amount of DRAM is owned directly by an IO controller, which uses that DRAM as its buffer so the CPU doesn't have to copy each byte. When done, the IO controller sends an interrupt.

Processing

  • Asymmetric - each processor does a specific task.
  • Symmetric - each processor performs all tasks.

Multithreading

While one thread is asking for memory, execute the other thread. Go back and forth.

Dual Mode

User mode and Kernel mode, with a mode bit. Kernel mode is also called privileged.

System Calls

System calls are how user mode apps interact with the kernel. APIs wrap the system calls, so apps can reach kernel facilities without making system calls directly (which may not be allowed)

  • Win32 for Windows
  • POSIX API (Unix, Linux, Mac OS X)
  • Java API for Java Virtual Machine (JVM)

Load Averages

Windows will show a percentage of CPU. Linux systems instead show the average number of processes running on or waiting to access the CPU. It can get to double digits.
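
A double-digit load is easier to judge once it is divided by the number of CPUs. A small sketch, assuming a Unix-like host where os.getloadavg() is available. The load_per_cpu helper is hypothetical.

```python
import os


def load_per_cpu(load=None, cpus=None):
    """Divide the 1, 5, and 15 minute load averages by the CPU count."""
    load = os.getloadavg() if load is None else load
    cpus = (os.cpu_count() or 1) if cpus is None else cpus
    return tuple(round(value / cpus, 2) for value in load)


# A load of 12.0 looks alarming, but on 16 CPUs it is under one task per CPU.
example = load_per_cpu(load=(12.0, 8.0, 4.0), cpus=16)
print(example)  # (0.75, 0.5, 0.25)
```

Calling load_per_cpu() with no arguments reads the live values from the host instead.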

Threading

A single-thread process has a program counter that says "go here to read the next instruction please"

Memory Management

Copying from storage into DRAM, then into cache. The CPU normally fetches the instructions it executes through the L1 cache.

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

Source: Stack Overflow

Debugging

Kernighan's Law

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it? -- Brian Kernighan, 1974

Write easy to understand code, planning on future debugging.

Communications Models

Message Passing (modern)

  • Puts messages into a shared queue, gives each one a number, and tells the other app "Go read this message"

Shared Memory (ancient)

  • Applications can just overwrite each other's data.
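
A toy version of the message passing model: here two threads in one process stand in for two applications, which is a simplification on my part. The sender numbers each message and the receiver only ever sees what arrives on the queue.

```python
import queue
import threading

mailbox = queue.Queue()   # the shared queue both sides agree on
received = []
STOP = None               # sentinel body meaning "no more messages"


def receiver():
    """Read numbered messages until the sentinel arrives."""
    while True:
        number, body = mailbox.get()
        if body is STOP:
            break
        received.append((number, body))


worker = threading.Thread(target=receiver)
worker.start()

for number, body in enumerate(["hello", "world"], start=1):
    mailbox.put((number, body))   # "go read this message"
mailbox.put((0, STOP))
worker.join()

print(received)  # [(1, 'hello'), (2, 'world')]
```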

Scheduling

  • FCFS - First come First Served. Not really used anymore
  • SJF - Shortest Job first, kind-of how QoS works.
  • Priority - Give processes an integer, rank them.
  • RR - Round Robin. Each process runs for a time quantum, called q, usually 10-100 milliseconds
  • CFS - Completely Fair Scheduler
    • Involved, emulates time-slices
    • N tasks, each task gets 1/N time.
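
Round Robin is the easiest of these to simulate. A minimal sketch with made-up burst times and a quantum of 10. It ignores arrival times and context switch cost.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Return the (process, run_time) slices in the order they execute."""
    ready = deque(burst_times.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)
        timeline.append((name, run))
        if remaining > run:
            # Not finished: go to the back of the queue with what is left.
            ready.append((name, remaining - run))
    return timeline


timeline = round_robin({"A": 25, "B": 10, "C": 15}, quantum=10)
print(timeline)
# [('A', 10), ('B', 10), ('C', 10), ('A', 10), ('C', 5), ('A', 5)]
```

No process waits for more than the other processes' quanta before it runs again, which is the point of RR compared with FCFS.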

Multilevel Queue - Done in Linux

  • Foreground

    • Gets 80% as RR
  • Background

    • FCFS

Process Environment

  • Argument vector - the command line arguments used to invoke the running program
  • Environment vector - the list of "NAME=VALUE" pairs
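
Both vectors are visible from inside a running program. A small Python sketch, where the function name is mine and the example values are invented:

```python
import os
import sys


def describe_process_environment(argv=None, environ=None):
    """Summarize the argument vector and the environment vector."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    return {
        "program": argv[0],
        "arguments": list(argv[1:]),
        "environment": [f"{name}={value}" for name, value in sorted(environ.items())],
    }


# Invented values, standing in for a real invocation.
example = describe_process_environment(
    argv=["iperf", "--port", "2000", "--server"],
    environ={"HOME": "/home/ariadne", "LANG": "C"},
)
print(example["arguments"])    # ['--port', '2000', '--server']
print(example["environment"])  # ['HOME=/home/ariadne', 'LANG=C']
```

Calling it with no arguments reports on the current process instead.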

Static and Dynamic Linking

  • Static - the library functions are embedded in the executable.
  • Dynamic - the library functions are at a place in memory, and shared.

Terms

  • STDM - Synchronous Time-Division Multiplexing
  • DS0 - Digital Signal, Level 0. One timeslot. A timeslot carries 8 bits. Frame rate is 8000 Hz. 8 * 8000 = 64 kbps.
  • B8ZS - Binary Eight Zero Substitution. A special way to encode 0000 0000 for DS0 lines.
  • T1 Frame - T-Carrier, Level 1. Aggregates 24 DS0 frames, or 192 bits. The T1 gets an extra bit, for framing so 193. 193 * 8000 is 1.544 Mbps.
  • Super Frame - 12 T1 frames.
  • Framing Search - Each T1 frame uses the extra bit to encode part of the superframe bit pattern 0101 1101 0001 or (5, 13, 1).
  • APS - Automatic Protection Switching. The device engaging in APS sends the data on both links, the working link and the protected link. The receiving device decides which to use.
  • DS1 - Digital Signal, Level 1.
  • T1 - T-Carrier, Level 1, Carries 24 DS0 frames, or 192 bits. The T1 gets an extra bit, for framing so 193. 193 * 8000 is 1.544 Mbps.
  • ACR - Access Circuit Redundancy
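
The DS0 and T1 rates in the terms above are plain multiplication, worked through here as a check:

```python
BITS_PER_TIMESLOT = 8
FRAMES_PER_SECOND = 8000
TIMESLOTS_PER_T1_FRAME = 24
FRAMING_BITS_PER_FRAME = 1

ds0_bps = BITS_PER_TIMESLOT * FRAMES_PER_SECOND
t1_frame_bits = TIMESLOTS_PER_T1_FRAME * BITS_PER_TIMESLOT + FRAMING_BITS_PER_FRAME
t1_bps = t1_frame_bits * FRAMES_PER_SECOND
framing_bps = FRAMING_BITS_PER_FRAME * FRAMES_PER_SECOND

print(ds0_bps)        # 64000   (64 kbps)
print(t1_frame_bits)  # 193
print(t1_bps)         # 1544000 (1.544 Mbps)
print(framing_bps)    # 8000    (the framing bit costs 8 kbps)
```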

The common STDM system in the US is T-Carrier.

Cisco CEM Terms

  • ACR - Adaptive Clock Recovery. A technique to recover the clock based on the fill level of the jitter buffer.

Access Circuit Redundancy

References

T-Carrier and SONET

All you Wanted to Know about T1 But Were afraid to Ask

OCx CEM Interface Module Config Guide IOS-XE 17 ASR 900 Series

Rocky Linux, Certbot, Let's Encrypt, DNS and Snap

This setup means a device can have a valid SSL certificate and still be inaccessible from the Internet, so https://host.example.com works internally without SSL warnings.

Let's Encrypt is a Certificate Authority provided by the non-profit Internet Security Research Group as a free service.

This is a partial set of instructions to get valid SSL certificates via Let's Encrypt via certbot. It doesn't include autorenew. I did this on Rocky Linux but other instructions exist for other platforms.

These instructions follow RFC 8555#section-8.4 -> DNS Challenge.

I'm using cloudflare with a domain I own, but there is a good sized list of supported DNS plugins.

Instructions

  1. Remove the older certbot

    sudo dnf remove certbot

  2. Update the package list

    sudo dnf update

  3. Install the EPEL repository

    sudo dnf install epel-release

  4. Install snapd, via the EPEL repository

    sudo dnf install snapd

  5. Enable the snap socket

    sudo systemctl enable --now snapd.socket

  6. Enable Classic Snap

    sudo ln -s /var/lib/snapd/snap /snap

  7. Install Classic Certbot, via Snap

    sudo snap install --classic certbot

  8. Link it like a regular binary.

    sudo ln -s /snap/bin/certbot /usr/bin/certbot

  9. Tell Certbot it can have root

    sudo snap set certbot trust-plugin-with-root=ok

  10. Obtain the cloudflare plugin

    sudo snap install certbot-dns-cloudflare

  11. Re-establish connection to box, to refresh binary paths

    <exit>

    <reconnect>

  12. Get an API token from cloudflare.

    • Limit permissions to Zone - DNS - Edit
    • Limit the Zone to Include - Specific Zone - <domain>
  13. Create a cloudflare.key file with the API token

    dns_cloudflare_api_token = <token here>

  14. Set the permissions on the key to be restrictive

    sudo chmod 600 cloudflare.key

  15. Get the certificates

    sudo certbot certonly \
      --dns-cloudflare \
      --dns-cloudflare-credentials /opt/certbot/cloudflare.key \
      -d host.example.com
    
  16. Move cloudflare.key into the new /etc/letsencrypt/ directory.

    sudo mv /opt/certbot/cloudflare.key /etc/letsencrypt/cloudflare.key

  17. Check work

    ls -la /etc/letsencrypt/

References

EFF - Install Certbot via Snap

Snapcraft - Installing Snap on Rocky Linux

Read The Docs - Certbot - DNS Plugins

#
# This is the config for portainer, and the reverse proxy, traefik
#

#
# This is a VM that hosts portainer. These are services started by docker compose.
#
# sudo docker compose up -d
# sudo docker compose down
#
# the network user-bridge needs to be specified in advance
#
# My wiki host is wiki.<mydomain>.org
# My wiki backup host is wiki-backup.<mydomain>.org
#
# The A and AAAA records point to the IP of the VM.
#
#
# My external DNS is handled by cloudflare. I'm using dns-challenge for getting LetsEncrypt SSL certs.
#
#


ariadne@docker-host:~/docker/portainer-traefik$ cat docker-compose.yml 
version: '3.1'
services:
  portainer:
    container_name: portainer
    image: portainer/portainer-ce:latest
    command: -H unix:///var/run/docker.sock
    restart: always
      #    ports:
      #- 8000:8000
      #- 9443:9443
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    networks:
      - user-bridge
    labels:
      - "traefik.enable=true"
      # using-the-fqdn
      - "traefik.http.routers.using-the-fqdn.rule=Host(`<docker-host>.<redacted>.org`)"
      - "traefik.http.routers.using-the-fqdn.entrypoints=websecure"
      - "traefik.http.routers.using-the-fqdn.service=using-the-fqdn"
      - "traefik.http.routers.using-the-fqdn.tls.certresolver=letsencrypt"
      - "traefik.http.services.using-the-fqdn.loadbalancer.server.port=9000"
  traefik:
    image: "traefik:v2.10"
    container_name: traefik
    restart: always
    command:
      # - "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      # create entry point "web"
      - "--entrypoints.web.address=:80"
      # create entry point "websecure"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entryPoint.to=websecure"
      - "--entrypoints.web.http.redirections.entryPoint.scheme=https"
      # create cert resolver "letsencrypt"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.resolvers=1.1.1.1:53,8.8.8.8:53"
      # - "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory" # Staging CA Server
      - "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-v02.api.letsencrypt.org/directory" # Production CA Server
      - "--certificatesresolvers.letsencrypt.acme.email=<redacted>"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    networks:
      - user-bridge
    environment:
      - "CF_DNS_API_TOKEN=<redacted>"
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    labels:
      # create router "http-catchall"
      - "traefik.http.routers.http-catchall.rule=hostregexp(`{host:.+}`)"
      - "traefik.http.routers.http-catchall.entrypoints=web"
      # create middleware "redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.permanent=true"
volumes:
  portainer_data:

networks:
  user-bridge:
    external: true


#
# This is the config for the db, wiki, and duplicati backup services
#
ariadne@grove:~/docker/home-wiki$ cat docker-compose.yml 
version: "3.1"

services:
  db:
    image: postgres:15-alpine
    restart: "no"
    environment:
      POSTGRES_DB: wiki
      POSTGRES_PASSWORD: <redacted>
      POSTGRES_USER: wikijs
    logging:
      driver: "none"
    volumes:
      - /mnt/wiki-drive:/var/lib/postgresql/data
    networks:
      - user-bridge

  wiki:
    image: ghcr.io/requarks/wiki:2
    restart: always
    environment:
      DB_TYPE: postgres
      DB_HOST: db
      DB_PORT: 5432
      DB_USER: wikijs
      DB_PASS: wikijsrocks    # must match POSTGRES_PASSWORD on the db service
      DB_NAME: wiki
    ports:
      - "3000:3000"
    networks:
      - user-bridge
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.wiki.rule=Host(`wiki.<redacted>.org`)"
      - "traefik.http.routers.wiki.entrypoints=web,websecure"
      - "traefik.http.routers.wiki.tls.certresolver=letsencrypt"
      - "traefik.http.services.wiki.loadbalancer.server.port=3000"

  duplicati:
    image: duplicati/duplicati:latest
    restart: always
    ports:
      - "8200:8200"
    command: "/usr/bin/duplicati-server --webservice-port=8200 --webservice-interface=any --webservice-allowed-hostnames=*"
    volumes:
      - /mnt/wiki-drive:/wiki-drive:rw        # What we want to back up 
      - /opt/duplicati/data:/data:rw          # Config Storage on the host
    networks:
      - user-bridge
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.duplicati.rule=Host(`wiki-backup.<redacted>.org`)"
      - "traefik.http.routers.duplicati.entrypoints=web,websecure"
      - "traefik.http.routers.duplicati.tls.certresolver=letsencrypt"
      - "traefik.http.services.duplicati.loadbalancer.server.port=8200"

networks:
  user-bridge:
    external: true

Windows 10 P2V - Physical to Virtual

My Setup

I am adding a compute node to an existing proxmox cluster.

I bought a used i7 Windows 10 machine with a 512 GB NVMe drive. On the outside are two COA stickers, one for Windows 10 Pro, and another for Office Home & Student 2019.

The current OS boots and the copy of Office works.

Goal: I want to keep this install of Windows 10 working, and copy the OS into Proxmox. I want to virtualize this OS.

This will give me a working licensed copy of Office.

Theory

I just need to get the "data" onto the VM.

  1. The physical machine and the VM need to be able to ping each other.
  2. Installing the drivers ahead of time should make the OS bootable.
  3. Copying the data should preserve the OS and applications.
  4. Copying the partitions should make recovery easier.
  5. Rebuilding the boot information should make the OS bootable.

A lot of this is to enable a clean "recovery" of the OS once it's copied over. My copy of Windows 10 relies on:

  • FAT32
  • NTFS - This filesystem should really only be checked using Microsoft's own tools.
  • BCD - Boot Configuration Data
  • GPT
  • EFI
  • MSR

Data Loss

These tools can cause data loss.

A typo can destroy a filesystem.

Before doing this, practice both making and restoring bare metal recovery (BMR) images. I used Clonezilla.

BMR is usually device-to-image, or image-to-device.

Here are the docs for using Clonezilla.

My Windows 10 BMR is 11GB stored as bzip2.

If Possible Just Clone the Disk

I wanted to go from a larger drive (512GB) to a smaller drive (64GB). That meant instead of copying the devices, I needed to copy the partitions, after resizing them.

Drive-to-drive cloning would be much easier.

Download ISOs

Most of the time was spent inside of recovery OSes, working with unmounted filesystems.

SystemRescue - Linux recovery media with NTFS support.

Windows 10 Installation Media - This is also the recovery disk. It can be made on the host being virtualized. This is needed to fix BCD (Boot Configuration Data) and EFI problems.

Clonezilla - A bare metal recovery tool.

Preparing Windows 10 to be virtualized

My Windows 10 machine had some extras on it I didn't want to virtualize.

  1. Create a restore image with Clonezilla

    This is the failsafe image, before touching anything. I saved mine to a samba share, but it can be saved anywhere it will fit that isn't on the device.

  2. Turn off the hibernation file

    Via the command prompt as an administrator:

    powercfg -h off

  3. Clean up the hard disk

    Into the search box type:

    Disk Cleanup

  4. Set the virtual memory pagefile to 1024MB

    A file of this size is needed for coredumps, errors, and logging.

    Follow these instructions.

  5. (Optional) Run WinDirStat to look for odd or large files

    Delete or Uninstall them.

    Windows Directory Statistics - WinDirStat

  6. Run chkdsk on C:

    Via the command prompt as an administrator:

    chkdsk C: /R

    /R - "Locates bad sectors and recovers readable information (implies /F, when /scan not specified)"

    Reboot

  7. (Optional) - Create another restore point with Clonezilla

    This is the cleaned image, to save all the clean up work.

  8. Boot GParted

    This is where it gets dangerous. GParted can be used to resize offline NTFS partitions.

  9. Resize the "Basic data partition"

    My data partition was 410GiB. I resized it down to 48GiB. The data on the partition is 25GiB.

  10. Move the "Recovery" partition

    I used the GUI to slide it over.

  11. Save your work with GParted

    Click the green checkmark. This writes the changes to disk.

  12. Boot into Windows 10

    Check to make sure the OS is still sane. Does the Internet work?

  13. Run chkdsk again on C:

    This is done to make sure the filesystem is OK.

    Via the command prompt as an administrator:

    chkdsk C: /R

    /R - "Locates bad sectors and recovers readable information (implies /F, when /scan not specified)"

    Reboot

  14. (Optional) - Create another restore point with Clonezilla

    This is the prepared image.

  15. Boot into SystemRescue

Creating the Virtual Machine

I used PVE - Proxmox Virtual Environment as my hypervisor. Any hypervisor should work.

I used the Proxmox GUI to assign the VM a hard disk of 64GB.

I booted the VM with SystemRescue, and made sure it could get a working IP address.
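
Before copying anything, it helps to confirm that the resized partitions will actually fit on the smaller virtual disk. A minimal sketch of the arithmetic, assuming a standard GPT with 128 entries of 128 bytes each, which keeps a backup table at the end of the disk. The function name partitions_fit is hypothetical.

```shell
# Succeeds if a partition ending at sector $1 fits on a GPT disk
# of $3 bytes that uses $2-byte logical sectors.
partitions_fit() {
  last_end=$1; sector_size=$2; disk_bytes=$3
  total_sectors=$(( disk_bytes / sector_size ))
  # Backup GPT: 16384 bytes of partition entries plus one header sector.
  gpt_tail=$(( (16384 + sector_size - 1) / sector_size + 1 ))
  # Sectors are numbered from 0, so the last usable one is total - tail - 1.
  last_usable=$(( total_sectors - gpt_tail - 1 ))
  [ "$last_end" -le "$last_usable" ]
}
```

On the VM, `blockdev --getsize64 /dev/sda` gives the disk size in bytes and `blockdev --getss /dev/sda` gives the sector size. On the source, `parted /dev/nvme0n1 unit s print` shows the end sector of the last partition.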

Preparing the Hard Drive on the Virtual Machine

There are four partitions on my Windows 10 machine. I want to copy them over the network using netcat.

  1. Both - Boot SystemRescue

  2. Both - Open GParted

  3. Destination - Using GParted, recreate the partition structure on the new hard disk

    I used a mix of fdisk and the GUI for this.

    • Created a GPT Partition Table
    • Copied the partitions including the start and stop sectors, exactly.
    • Copied the flags

    I started with four partitions on both and ended with four partitions. They all fit on this smaller disk.

  4. Destination - Turn off the firewall

    systemctl stop iptables

  5. Destination - Get the IP Address

    ip a

  6. Destination - Start a netcat listener

    This needs to be done for each partition, one at a time.

    nc -l -p 19000 | bzip2 -d | dd of=/dev/sda1

  7. Source - Redirect dd into bzip into netcat, throw traffic at the Destination

    This needs to be done for each partition, one at a time.

    dd bs=16M if=/dev/nvme0n1p1 | bzip2 -c | nc <ip_address> 19000
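
Before trusting the copy, it may be worth checking that each partition arrived intact. A minimal sketch, assuming sha256sum is available in SystemRescue. The function names copy_compressed and same_content are mine. copy_compressed runs the same dd and bzip2 pipeline without the network hop, so it can be practiced on a scratch file first.

```shell
# Copy $1 to $2 through the same dd | bzip2 pipeline, minus netcat.
copy_compressed() {
  src="$1"; dst="$2"
  dd if="$src" bs=16M 2>/dev/null | bzip2 -c | bzip2 -d | dd of="$dst" bs=16M 2>/dev/null
}

# Succeeds when both files (or block devices) hash to the same value.
same_content() {
  [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}
```

On the real machines, `sha256sum < /dev/nvme0n1p1` on the Source and `sha256sum < /dev/sda1` on the Destination should print the same hash, provided the two partitions are exactly the same size.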

Windows 10 Recovery

I went from an NVMe drive to an IDE drive. I still needed to recover the boot data.

  1. Destination - Load the ISO for the Windows Recovery Environment.

    Click Repair your computer

    Click Troubleshoot

    Click Command Prompt

I followed this guide to repair the boot info.

  1. Look at the new VM disk

    diskpart

    This leads to the DISKPART> prompt.

  2. Verify the disk is GPT.

    list disk

    Under the "Gpt" column there should be an asterisk (*).

  3. Select Disk 0

    This is the only hard disk in this VM.

    sel disk 0

  4. List the partitions and Volumes

    This is the Windows equivalent of fdisk.

    list partition

    list volume

    This is my lab system.

    DISKPART> list partition 
    
       Partition ###   Type            Size        Offset
       -------------   --------------  ----------  -------
       Partition 1     System          100 MB      1024 KB
       Partition 2     Reserved        16 MB       101 MB
       Partition 3     Primary         46 GB       117 MB
    
    DISKPART> list volume 
    
       Volume ###  Ltr     Label       Fs      Type        Size        Status      Info 
       ----------  ---     ----------  -----   ----------  -------     ----------  -------
       Volume 0    D       ESD-ISO     UDF     CD-ROM      4667 MB     Healthy  
       Volume 1    C                   NTFS    Partition     46 GB     Healthy  
       Volume 2                        FAT32   Partition    100 MB     Healthy     Hidden
    

    These are the three required partitions.

    • NTFS - The data partition, apps and the OS
    • EFI - Extensible Firmware Interface. Where the modern boot system lives. Usually 100MB, FAT32
    • MSR - Microsoft Reserved Partition. Usually 16MB, with no filesystem of its own. Reserved by Windows to help manage the partitions

At this point, I could just follow along with the Windows OS Hub article, to restore the BCD bootloader configuration.

References

Windows OS Hub - How to Repair EFI/GPT Bootloader on Windows 10 or 11

Microsoft - Disk cleanup in Windows

Ten Forums - How to Manage Virtual Memory Pagefile in Windows 10

Microsoft - BCD Boot Command Line Options

Windows OS Hub - How to Repair Deleted EFI Partition in Windows 7