Lost access to management resources yesterday morning. Couldn't get into anything: jumpbox, nope; console into the jumpbox, nope; TACACS, nope. Trolling around the data center I found one of the 6500 VSS switches down. That shouldn't be an issue, and yet... The Sup on the active VSS 6500 switch had failed and the system failed over (the switch crashed due to memory allocation failures). I brought up the second switch, which restored some services, but users were still complaining about intermittent connectivity issues. This VSS pair is not the core, but a distro block with lots of stuff connected. We were scratching our heads; this shouldn't happen.

We thought we had narrowed it down to some legacy services, rebooted the switch connected to those, and they came up, but users were still complaining about intermittent connectivity to other areas of the network. So we focused on the VSS switch that had problems. We determined that all the affected services were connected via port-channels; services that weren't on port-channels were not affected. So we focused on the port-channels: all interfaces up/up, everything looked connected, but reports of connectivity loss kept coming in. Someone on the team had the idea to shut each member of the port-channel in turn and see if connectivity was restored. On the 4th try on a 6-member port-channel, a shut/no shut restored all services.

Our guess is that it was either some sort of hashing issue on the switch or some sort of UDLD issue (UDLD is not enabled on the switches). The UDLD explanation is more logical, but a shut/no shut of an interface should not have fixed a physical issue with a fiber cable. Just plain weird. Took a lengthy outage for it, 4+ hours.
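For anyone wanting to repeat the member-by-member test described above, here is a rough Python sketch of the loop. Everything in it is an assumption for illustration: the interface names are made up, and `set_admin_state` and `services_ok` are hypothetical callbacks standing in for whatever you use to shut/no shut a port and to check reachability. It is not tied to any real switch API.

```python
from typing import Callable, Iterable, Optional


def find_suspect_member(
    members: Iterable[str],
    set_admin_state: Callable[[str, bool], None],
    services_ok: Callable[[], bool],
) -> Optional[str]:
    """Shut each member in turn; return the first one whose removal restores service."""
    for member in members:
        set_admin_state(member, False)  # equivalent of "shutdown"
        restored = services_ok()        # probe while the member is out of the bundle
        set_admin_state(member, True)   # equivalent of "no shutdown"
        if restored:
            return member
    return None


# Toy stand-in for a 6-member bundle (hypothetical names): Te1/4 silently
# drops traffic whenever it is enabled, while still looking up/up.
enabled = {f"Te1/{n}": True for n in range(1, 7)}
BAD = "Te1/4"


def set_admin_state(member: str, up: bool) -> None:
    enabled[member] = up


def services_ok() -> bool:
    return not enabled[BAD]


print(find_suspect_member(list(enabled), set_admin_state, services_ok))
```

In the toy run the loop stops at the fourth member, the same point where services came back in the outage above. On a live network the probe would need to wait long enough for traffic to rehash onto the remaining links before deciding.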
Are you running LACP or mode on?
:itcrowd:
Sorry, couldn't resist.
I don't trust vss
Quote from: ristau5741 on January 08, 2016, 11:15:42 AM
Quote from: SimonV on January 08, 2016, 08:34:55 AM
Are you running LACP or mode on?
LACP, mode active
Good, mode on was the first thing I thought of when I read port-channel issues
If this was related to UDLD you would see it down/down (not-connect) on one side, and down/down (errdisable) on the other. So most likely not a UDLD issue. How are you doing load balancing on the port-channel? Anything fancy like source IP? Hashing would affect LACP balancing ... if that were indeed the case.
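To illustrate why the load-balancing method matters here, a small Python sketch. It uses a simplified XOR-and-modulo hash on source and destination IP, which is an assumption for illustration and not the actual 6500 hash; the addresses and the failing member index are made up. The point is only that each flow always lands on the same member, so one member that blackholes traffic breaks a fixed subset of conversations while the rest keep working.

```python
import ipaddress
from collections import Counter


def member_for_flow(src: str, dst: str, n_members: int) -> int:
    """Pick a member index for a flow (simplified XOR-and-modulo hash, not Cisco's)."""
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % n_members


N_MEMBERS = 6   # 6-member port-channel, as in the original post
BAD_MEMBER = 3  # hypothetical index of the member that is losing traffic

# Hypothetical clients all talking to one server across the bundle.
flows = [(f"10.1.1.{host}", "10.2.2.10") for host in range(1, 61)]

per_member = Counter(member_for_flow(s, d, N_MEMBERS) for s, d in flows)
broken = [(s, d) for s, d in flows if member_for_flow(s, d, N_MEMBERS) == BAD_MEMBER]

print("flows per member:", dict(sorted(per_member.items())))
print(f"{len(broken)} of {len(flows)} flows hash to the bad member")
```

With a hash like this, the same client either works every time or fails every time to a given destination, which from the helpdesk side tends to look like "intermittent" connectivity across the user base.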
LACP should ensure there are no mismatches though, including by bringing down member links
SHOULD lol... bearing in mind that real-time L2 control traffic like LACP is supposed to be the hardest thing to get right in clustering/multi-chassis software (e.g. they still haven't gotten most of this stuff working in an OpenFlow central control plane)
This kinda thing is why I refused to VSS our cores together
Near as I could figure, the 10G module in one of the VSS switches was stuck in a reboot cycle: it would power up, come online for about 2 seconds, then crash and power off again. It did this every 3-4 minutes until we pulled it out of the chassis, and that's when things got back to normal. So the port-channels would come up on the modules in both switches, traffic would start to traverse the port-channel on the rebooting module, and then get lost when the module rebooted. I don't know what the load balancing across the port-channels was at the time, but it looks like it was a fairly heavily used link.