One of the networks I'm doing O&M on has a pair of Catalyst 6500s as the core. It doesn't happen all the time, but around once or twice a week SNMP traps for TCAM over-utilization will be screaming at us all day. Nobody seems to care since it hasn't had an effect on services (that attitude is prevalent around here), but I want to solve it for my own peace of mind.
The utilization is 99%. Any suggestions on where I should start looking? I'm going to hit the Google machine here in a minute but I figured I would ping you guys too.
random guess: ipv6? (MLD?)
What's the MAC table size? VRFs? Routing table size? Do you have multicast routing? Although something like IPv6 MLD indeed sounds more likely.
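Something along these lines should answer those (a generic sketch; exact syntax shifts a bit between IOS trains):
! current MAC entries learned per module
show mac address-table count
! route counts per protocol in the global table
show ip route summary
! any VRFs defined
show ip vrf
! multicast routing state, if it's enabled at all
show ip mroute count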
try this
--snip--
event manager applet cpu_stats
 event snmp oid "1.3.6.1.4.1.9.9.109.1.1.1.1.3.1" get-type exact entry-op gt entry-val "70" exit-op lt exit-val "50" poll-interval 5
 action 1.01 syslog msg "------HIGH CPU DETECTED----, CPU:$_snmp_oid_val %"
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append disk0:cpu_stats"
 action 1.04 cli command "show proc cpu sort | append disk0:cpu_stats"
 action 1.05 cli command "show proc cpu | exc 0.00% | append disk0:cpu_stats"
 action 1.06 cli command "show proc cpu history | append disk0:cpu_stats"
 action 1.07 cli command "show logging | append disk0:cpu_stats"
 action 1.08 cli command "show spanning-tree detail | in ieee|occurr|from|is exec | append disk0:cpu_stats"
 action 1.09 cli command "debug netdr cap rx"
 action 1.10 cli command "show netdr cap | append disk0:cpu_stats"
 action 1.11 cli command "undebug all"
!
** The EEM script fires when CPU goes above 70% and won't re-fire until CPU drops back under 50%.
You can tweak entry-val and exit-val to your taste.
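To confirm the applet registered and to read back what it captured, something like this should do it (filename matches the applet above):
show event manager policy registered
more disk0:cpu_stats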
Quote from: ristau5741 on October 02, 2015, 07:15:50 AM
try this
--snip--
[EEM cpu_stats applet snipped; see the post above]
crap, this is for CPU utilization, not TCAM utilization....
You could probably use the Cisco SNMP Object Navigator to find the TCAM OID (1.3.6.1.4.1.9.9.97.1.9.1.1.1?), and then update the CLI commands to reflect what you want to see.
p.s. probably should open a TAC case for this, with a 'show tech' output while the issue is happening.
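A rough sketch of what that could look like, keying on the syslog message instead of an OID (untested; the pattern, applet name, and file name are guesses to be matched against whatever message accompanies the traps):
event manager applet tcam_stats
 event syslog pattern "TCAM"
 action 1.01 cli command "enable"
 action 1.02 cli command "show clock | append disk0:tcam_stats"
 action 1.03 cli command "show platform hardware capacity forwarding | append disk0:tcam_stats"
 action 1.04 cli command "show mls cef summary | append disk0:tcam_stats"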
Sorry for the late response.
This hasn't happened again since the 2nd, and we are short-handed, so I have been slammed with the normal netops activities. It's hard sometimes being a pillar of networking might.
We aren't running any IPv6, so that is out.. and no multicast that I am aware of. When it happens again I will do as ristau said and open a TAC case so someone smarter than me can look at it :D
Quote from: ristau5741 on October 02, 2015, 07:15:50 AM
try this
--snip--
[EEM cpu_stats applet snipped; see ristau5741's post above]
This is some network ninja stuff right here.. I'm going to play with this later.
Just because you're not running IPv6 doesn't mean a PC or NIC isn't. Google "IPv6 MLD flooding" for a nasty example involving buggy Intel NIC drivers that I have personally seen in the wild.
Quote from: wintermute000 on October 08, 2015, 02:01:18 AM
Just because you're not running IPv6 doesn't mean a PC or NIC isn't. Google "IPv6 MLD flooding" for a nasty example involving buggy Intel NIC drivers that I have personally seen in the wild.
:challenge-accepted:
Quote from: Reggle on October 02, 2015, 07:03:16 AM
What's the MAC table size? VRFs? Routing table size? Do you have multicast routing? Although something like IPv6 MLD indeed sounds more likely.
Do you mean the max size? Or how many MAC addresses and routes are currently in the tables?
Quote from: config t on October 08, 2015, 02:46:00 AM
Do you mean the max size? Or how many MAC addresses and routes are currently in the tables?
He means this (see the output below, which shows both the totals and current usage). Those entries take up TCAM space. If, for example, you take full BGP routes on a 6500/7600, you can run into TCAM issues if your supervisor can't handle that many routes, or if it can but hasn't been adjusted to accept a larger amount. I'll try to find the command to check; can't remember it right off.
Ok, I remembered it better than I thought I would once I was on the CLI.
6509#show platform hardware capacity forwarding
L2 Forwarding Resources
  MAC Table usage:   Module  Collisions    Total    Used   %Used
                          1           0    98304    1071      1%
                          2           0    98304    1053      1%
                          3           0    98304    1053      1%
                          5           0    65536    1058      2%
  VPN CAM usage:                           Total    Used   %Used
                                             512       0      0%
L3 Forwarding Resources
  FIB TCAM usage:                           Total    Used   %Used
    72 bits (IPv4, MPLS, EoM)              196608    6189      3%
    144 bits (IP mcast, IPv6)               32768      65      1%
  detail:    Protocol        Used   %Used
             IPv4            6187      3%
             MPLS               1      1%
             EoM                1      1%
             IPv6               1      1%
             IPv4 mcast        61      1%
             IPv6 mcast         3      1%
  Adjacency usage:                         Total    Used   %Used
                                         1048576     978      1%
  Forwarding engine load:
                 Module     pps   peak-pps                      peak-time
                      1    2482     420560   06:55:17 EDT Sat Jul 18 2015
                      2    6532     333333   16:54:00 EDT Tue Sep  8 2015
                      3     542      37735   12:41:00 EDT Tue Sep 22 2015
                      5    4350     188235   01:30:54 EDT Tue Jun 16 2015

6509#show mod
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1     8 CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  2     8 CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  3     8 CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  5     2 Supervisor Engine 720 (Active)         WS-SUP720-3B       blahblahblah
  9    48 SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-45AF   blahblahblah
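And on the "hasn't been adjusted" part: on a Sup720 you can check the FIB TCAM partition and, with a reload, resize it. Roughly like this (the 192 is just an illustrative value, in thousands of IPv4 routes; check your platform's limits before touching it):
show mls cef maximum-routes
conf t
 mls cef maximum-routes ip 192
end
The new maximum only takes effect after a reload.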
Quote from: routerdork on October 08, 2015, 08:36:41 AM
Ok, I remembered it better than I thought I would once I was on the CLI.
[show platform hardware capacity forwarding / show mod output snipped; see the post above]
Sweet. Those are good commands to know.. Google wasn't being helpful when I tried to find them earlier today.
On the subject of IPv6 MLD flooding.. I researched it as wintermute suggested and reading this article got me thinking..
http://packetpushers.net/good-nics-bad-things-blast-ipv6-multicast-listener-discovery-queries/
If MLD is the cause, I think the topology of the network that experienced the high utilization may have mitigated the problem to the point where it didn't cause an outage. It is set up to tunnel through the normal production network over line encryptors (TACLANEs), which means the layer 2 domains are broken up and isolated to single buildings, with the exception of about 10 switches with direct connections to the VSS pair. The fact that I have only seen it on rare occasions could be because any new machines on the network that contributed would eventually have had IPv6 turned off through a Group Policy update.
Does that make any sense? I won't be able to prove it until it happens again but I feel like I'm on to something.
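One way to prove it when it recurs might be to SPAN a suspect access port to a capture box and watch for the MLD query storm. A rough sketch, with placeholder interfaces:
conf t
 monitor session 1 source interface GigabitEthernet1/1
 monitor session 1 destination interface GigabitEthernet1/2
end
Then on the capture box, a Wireshark display filter of icmpv6.type == 130 should surface the MLD general queries if a host is blasting them.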
Quote from: config t on October 08, 2015, 11:49:43 AM
Sweet. Those are good commands to know.. Google wasn't being helpful when I tried to find them earlier today.
I tried to keep a blog going for obscure things like this that I wanted to remember, but I found I don't take the time to blog much, so it's never been put back online after my last upgrade.
Quote from: config t on October 08, 2015, 11:49:43 AM
On the subject of IPv6 MLD flooding.. I researched it as wintermute suggested and reading this article got me thinking..
http://packetpushers.net/good-nics-bad-things-blast-ipv6-multicast-listener-discovery-queries/
Funny, I was called out to a business a month or two ago for that exact same issue. Their wireless scanners in production were getting kicked off the network, and when I arrived I saw their access point LEDs going crazy. I suspected a broadcast storm, but it was all multicast, around 100 Mbps, coming from one PC with one of those Intel I217-LM NICs. Found it by the MAC address; that was an interesting problem :)
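For anyone who has to chase a similar offender, the trail is roughly: grab the source MAC from a capture, then find the edge port it lives on (MAC below is a placeholder, and older trains spell it show mac-address-table):
show mac address-table address 0050.56ab.cdef
From the port, CDP/LLDP neighbors or patch records get you to the actual machine.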
Quote from: routerdork on October 08, 2015, 02:24:09 PM
I tried to keep a blog going for obscure things like this that I wanted to remember, but I found I don't take the time to blog much, so it's never been put back online after my last upgrade.
We ought to have a sticky in the R&S forum for t-shooting commands.