Author Topic: What Causes a Switch to Crash?  (Read 336 times)

deanwebb (OP)

  • Permit any any all log
  • Administrator
  • Volume Licensing
  • *****
  • Join Date: Jan 2015
  • Posts: 7858
  • Country: us
  • Rep: 19
  • *I* am the one who NACs.
    • View Profile
  • Certifications: FSCA: ForeScout Certified Administrator, CCNP Security, Tufin CSE, TippingPoint ASE
What Causes a Switch to Crash?
« on: April 28, 2018, 11:49:12 AM »
"You turned on (X) and that brought down the switch!"

I've heard that more than once, in reference to various things, including but not limited to Netflow, SNMP, SSH connections, SPAN ports, and port-based ACLs. So, I'm wondering how much of that is for-reals and how much of that is a switch guy that was dealing with a flaky switch and just wanted to blame someone else for his switch going down.

Basically, what brings down switches and why?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!

wintermute000

  • Senior Engineer
  • ****
  • Join Date: Jan 2015
  • Posts: 2475
  • Rep: 27
    • View Profile
  • Certifications: Alphabets
Re: What Causes a Switch to Crash?
« Reply #1 on: April 29, 2018, 07:43:06 PM »
1.) broadcast storms
2.) features that smash the CPU like ACL logging, too-verbose-debugs etc.
3.) bugs - I have literally seen 'debug ntp' crash a 6500. Another classic is when the 10Gb modules were new, some versions of 3750X code would make the thing run at 80-100% CPU, redlining and dropping packets like flies, just for having the module physically in there. Legit bugs do happen.

ristau5741

  • Administrator
  • OC-1920
  • *****
  • Join Date: Jan 2015
  • Posts: 12256
  • Country: us
  • Rep: 19
    • View Profile
  • Certifications: Instanity
Re: What Causes a Switch to Crash?
« Reply #2 on: April 30, 2018, 06:24:25 AM »
Typically, when an event happens on a device that the programmed code does not know how to handle, an exception is generated, and as a life preservation task the device will reboot itself to reset everything back to "normal".
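To make that concrete, here is a toy sketch of the "unhandled event, exception, reload" pattern. It is not how any real switch OS is written; the `Device` class, its event types, and the `crashinfo` list are all hypothetical names made up for illustration.

```python
import traceback

class Device:
    """Toy model of a box that reloads itself on an unhandled exception."""

    def __init__(self):
        self.reload_count = 0
        self.crashinfo = []   # hypothetical stand-in for a saved crash file
        self.boot()

    def boot(self):
        # Reset runtime state back to a known-good starting point.
        self.mac_table = {}

    def handle(self, event):
        # Only two event types are "known"; anything else raises.
        if event["type"] == "learn":
            self.mac_table[event["mac"]] = event["port"]
        elif event["type"] == "age_out":
            self.mac_table.pop(event["mac"], None)
        else:
            raise ValueError(f"no handler for event type {event['type']!r}")

    def run(self, events):
        for event in events:
            try:
                self.handle(event)
            except Exception:
                # Record what happened, then "reload" to clear the bad state.
                self.crashinfo.append(traceback.format_exc())
                self.reload_count += 1
                self.boot()

dev = Device()
dev.run([
    {"type": "learn", "mac": "aa:bb:cc:00:00:01", "port": 1},
    {"type": "mystery"},
    {"type": "learn", "mac": "aa:bb:cc:00:00:02", "port": 2},
])
print(dev.reload_count, dev.mac_table)
```

Note that the reload also wipes the entry learned before the bad event, which is roughly why a single crash can ripple out into a wider outage.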
:professorcat:

“You can destroy your now by worrying about tomorrow.”
-Janis Joplin

deanwebb (OP)

Re: What Causes a Switch to Crash?
« Reply #3 on: April 30, 2018, 07:14:22 AM »
Oh yes, the debug sessions that engineers forgot to turn off... Good call on that one.

mlan

  • Access Port
  • *
  • Join Date: Dec 2015
  • Posts: 71
  • Country: us
  • Rep: 1
    • View Profile
Re: What Causes a Switch to Crash?
« Reply #4 on: May 01, 2018, 03:22:20 PM »
The best crash I have ever experienced was a memory bit flip that forced a reload of a 6500 supervisor in a VSS pair.  The ensuing network destruction that resulted from that crash was a sight to behold.  Root cause was possibly a solar flare?  Haha...

wintermute000

Re: What Causes a Switch to Crash?
« Reply #5 on: May 01, 2018, 06:39:08 PM »
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

ristau5741

Re: What Causes a Switch to Crash?
« Reply #6 on: May 02, 2018, 06:45:51 AM »
Quote
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

if they can't pinpoint the cause, that's the scapegoat.

dlots

  • Senior Engineer
  • ****
  • Join Date: Jan 2015
  • Posts: 3994
  • Rep: 13
    • View Profile
  • Certifications: Lots
Re: What Causes a Switch to Crash?
« Reply #7 on: May 02, 2018, 09:07:07 AM »
Updating IOSs

The Cisco "test crash" command (will pretty much always crash your Cisco device)

Control Plane Policing

deanwebb (OP)

Re: What Causes a Switch to Crash?
« Reply #8 on: May 02, 2018, 10:07:52 AM »

Quote
Control Plane Policing


This is recommended for security on Cisco features that can't be switched off... what's some more detail / war story about how this brings down a switch?

dlots

Re: What Causes a Switch to Crash?
« Reply #9 on: May 02, 2018, 12:25:52 PM »
I honestly don't remember if it was a switch or a router, but we put CoPP on, did a write mem, waited a while, and the device went down. It came back up, we waited a few minutes, then it went back down (repeat).

mlan

Re: What Causes a Switch to Crash?
« Reply #10 on: May 04, 2018, 05:11:30 PM »
Quote
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:

Code:
Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

Error: Primary data cache, fields: data,
Actual physical addr 0x00000000,
virtual address is imprecise.

 Imprecise Data Parity Error

 Imprecise Data Parity Error

 08:58:20 PDT Wed Jul 13 2011: Interrupt exception, CPU signal 20, PC = 0x40FEA860



--------------------------------------------------------------------
   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------


-Traceback= 417BEE50
$0 : 00000000, AT : 42640000, v0 : 52D11A90, v1 : 45BF04F8
a0 : 52D11AC4, a1 : 52D44E3C, a2 : 40FEA848, a3 : 52D44E3C
t0 : 408B5698, t1 : 3400FF01, t2 : 3400F100, t3 : FFFF00FF
t4 : 417B13A8, t5 : 0000FFFF, t6 : 00000004, t7 : 0000030D
s0 : 52D44E3C, s1 : 00000002, s2 : 40FEA848, s3 : 52D44E3C
s4 : 43ECEF90, s5 : 00000004, s6 : 00000000, s7 : EFFFFFFA
t8 : 55BB5088, t9 : 00000000, k0 : 55B8DC94, k1 : 408EAE50
gp : 42647238, sp : 52D44D90, s8 : 9FBF04BE, ra : 40FEA860
EPC  : 417BEE50, ErrorEPC : 40FEA860, SREG     : 3400FF05
MDLO : 3B13B68E, MDHI     : 00000719, BadVaddr : 00000000
DATA_START : 0x42322420
Cause 00000000 (Code 0x0): Interrupt exception


The SP crash forced the RP to reload and then all hell broke loose....   more info
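
For anyone sifting through a pile of crashfiles, a rough sketch of pulling the parity-related bits out of one. The patterns are taken only from the excerpt above (including the register names exactly as printed there); other platforms and releases may word these lines differently, and `summarize_crashinfo` is a made-up helper name, not a Cisco tool.

```python
import re

# Patterns based solely on the excerpt above; treat them as assumptions.
CACHE_ERR = re.compile(r"cache error detected", re.IGNORECASE)
PARITY_ERR = re.compile(r"parity error", re.IGNORECASE)
REGISTER = re.compile(r"^\s*(CP[O0]_\w+)\s+\(reg \d+/\d+\):\s+(0x[0-9A-Fa-f]+)")

def summarize_crashinfo(text):
    """Return a small dict describing cache/parity indicators in the text."""
    summary = {"cache_error": False, "parity_lines": 0, "registers": {}}
    for line in text.splitlines():
        if CACHE_ERR.search(line):
            summary["cache_error"] = True
        if PARITY_ERR.search(line):
            summary["parity_lines"] += 1
        m = REGISTER.match(line)
        if m:
            # Store the register value as an integer for easy comparison.
            summary["registers"][m.group(1)] = int(m.group(2), 16)
    return summary

sample = """Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

 Imprecise Data Parity Error

 Imprecise Data Parity Error
"""
result = summarize_crashinfo(sample)
print(result)
```

This only flags that the crashfile mentions a cache or parity error; it says nothing about the root cause.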

deanwebb (OP)

Re: What Causes a Switch to Crash?
« Reply #11 on: May 10, 2018, 09:16:52 AM »
Wow, a whole article on how to blame sunspots for your crash. Niiiiiiiiiiiiiiiiiiice. Putting that in my bag of tricks...

SimonV

  • advipservicesk9
  • ****
  • Join Date: Jan 2015
  • Posts: 1045
  • Country: be
  • Rep: 10
    • View Profile
    • Blog
  • Certifications: N+ GFL, CCNP, CCNA Wireless, JNCIS-SEC/ENT
Re: What Causes a Switch to Crash?
« Reply #12 on: May 14, 2018, 06:46:18 AM »
I found this in the comments of one of the whitepapers about it:

Quote
When given the transient soft parity error explanation for a device or component failure, the following link may help you rule out Solar Flares as a possibility: http://www.tesis.lebedev.ru/en/sun_flares.html?m=9&d=11&y=2013

Replace date in the URL or click the calendar on the page.
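
A small sketch of the "replace date in the URL" step, assuming the page still accepts the same `m`, `d`, and `y` query parameters seen in the quoted link. The function name `flare_page_url` is made up for illustration; the second call uses the date from the crashfile timestamp earlier in the thread.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://www.tesis.lebedev.ru/en/sun_flares.html"

def flare_page_url(day):
    # m, d, y are the query parameters visible in the quoted link.
    return BASE_URL + "?" + urlencode({"m": day.month, "d": day.day, "y": day.year})

print(flare_page_url(date(2013, 9, 11)))  # the date in the quoted link
print(flare_page_url(date(2011, 7, 13)))  # the date in the SP crashfile
```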

deanwebb (OP)

Re: What Causes a Switch to Crash?
« Reply #13 on: May 14, 2018, 02:43:06 PM »
That is so awesome. I think this is my favorite whitepaper now.

Dieselboy

  • Administrator
  • advipservicesk9
  • *****
  • Join Date: Jan 2015
  • Posts: 1165
  • Country: au
  • Rep: 5
    • View Profile
  • Certifications: CCNP, CCNA-Voice
Re: What Causes a Switch to Crash?
« Reply #14 on: May 21, 2018, 08:32:29 PM »
Quote
I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:
[...]
The SP crash forced the RP to reload and then all hell broke loose....   more info

How is it possible when Cisco equipment uses ECC memory?
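
One possible reading, offered as an assumption rather than a confirmed root cause: the crashfile reports the error in the "Primary data cache" as a "Data Parity Error", which suggests the flipped bit was in a parity-protected CPU cache rather than in ECC main memory. Parity can tell that a bit flipped but not which one, so the only safe reaction is to halt, whereas an ECC code can locate and fix a single flipped bit. The sketch below contrasts the two using even parity and a textbook Hamming(7,4) code; real hardware uses wider codes, and all function names here are illustrative.

```python
def parity_bit(bits):
    # Even parity: the extra bit makes the total number of 1s even.
    return sum(bits) % 2

def parity_ok(bits, stored_parity):
    return parity_bit(bits) == stored_parity

def hamming74_encode(d):
    # d is four data bits; codeword positions 1..7 are p1 p2 d1 p3 d2 d3 d4.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    c = list(c)
    # Each syndrome bit re-checks one parity group; together they spell the
    # 1-based position of a single flipped bit (0 means no error seen).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], pos

data = [1, 0, 1, 1]

# Parity: the flip is detected, but there is no way to know which bit it was.
stored = parity_bit(data)
flipped = [1, 0, 0, 1]
print("parity still ok?", parity_ok(flipped, stored))

# Hamming(7,4): the same kind of single-bit flip is located and repaired.
codeword = hamming74_encode(data)
corrupted = list(codeword)
corrupted[4] ^= 1
recovered, position = hamming74_correct(corrupted)
print("recovered", recovered, "after fixing position", position)
```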