Networking-Forums.com

Professional Discussions => Routing and Switching => Topic started by: deanwebb on May 27, 2016, 10:46:24 AM

Title: Packet Drop Issue
Post by: deanwebb on May 27, 2016, 10:46:24 AM
This is an update of the Wireless issue I ranted on about... well, we got the replacement and it had the same issue.

In the traces, I noted that fragmented UDP traffic from the site goes just fine... but arrives at cross-WAN destinations without the final packet that makes everything make sense. Unfragmented UDP traffic, no problem. TCP traffic... can't say I've seen any fragmented TCP stuff. But fragmented SNMP traffic NEVER works and fragmented RADIUS traffic works about half the time, with more failures during busier times of the day.

We think the RADIUS and SNMP issues are connected, since they have the same issue: terminating fragment drops. We can test for success by generating bulk SNMP get-requests and seeing if that get-response packet shows up or gets dropped. Right now, it's getting dropped at some point between the far site's Riverbed (checked egress on WAN interface, looked good there) and the NIC on the destination servers (RADIUS server and Cisco Prime server).

Anyone here ever see a thing like that, where those final fragments get dropped and the destination responds with an ICMP message that the TTL on fragment reassembly expired?

We're going to be doing captures on the WAN routers to see if the WAN router at the far location is sending those
Title: Re: Packet Drop Issue
Post by: icecream-guy on May 27, 2016, 11:15:43 AM
Don't Riverbeds run in pairs? One for the far end, one for the local end, to do the proprietary magik on the WAN between the two.
Title: Re: Packet Drop Issue
Post by: deanwebb on May 27, 2016, 01:36:40 PM
We do have one in the main office where the RADIUS server is, but... all the UDP traffic bypasses it. It goes straight through the IPS to the L3 core switch.

And, yes, I checked the IPS... no packets blocked to the RADIUS server.
Title: Re: Packet Drop Issue
Post by: wintermute000 on May 29, 2016, 07:55:29 AM
have defo seen fragments dropped before, nice detective work - but - every time I've seen a fragment getting dropped scenario, it's been... wait for it... a firewall LOL (or at least a ZBFW feature, seen that before definitely). Is there anything at all doing reassembly before it hits the host? Pretty sure an IPS would have to reassemble UDP fragments to do its job....

as you say can't be riverbeds as they pass through UDP (well by default anyway and you've confirmed as well). One good thing though is that doing a non-intrusive, rolling packet capture on a riverbed is a GUI enabled piece of pie so you can easily see if it's happening before or after WAN transit - from your wording it's unclear which end you've checked, is it OK at the point it enters the DC side riverbed? don't forget you can capture on either side of the riverbed too (if memory serves me correctly) if you really want to be paranoid. But you want to at least start by chopping the problem in half (i.e. is the fragment getting dropped in the WAN or your DC).

Do you see the same UDP fragmentation behaviour (and presumably NO drops) with other controllers at other sites? whilst you're packet capturing, you might as well get a 'reference' capture?

just because we're down in the weeds, you are talking about fragmentation of the UDP payload right i.e. UDP reassembled segments forming a 'UDP packet'? You aren't talking about IP fragmentation issues, PMTUD works, 1500 nice and clean all the way through etc.?
Title: Re: Packet Drop Issue
Post by: deanwebb on May 29, 2016, 04:20:32 PM
No firewall in the path, there is an IPS in the path, does not show it is blocking any traffic, but it is inline.

The Riverbed in the remote site had traces running on inbound/outbound on both the LAN and WAN interfaces. Zero packet loss all the way out to the WAN router, this much we know. On the DC side, the Riverbeds aren't yet inline... so we have to get the trace set up tomorrow on our core switch, since there aren't any spare interfaces on the routers. I figure if the traffic is intact on the way in, then we have something set up in both datacenters we tested that affects *only* traffic from this one site. If not, then it's something in front of the switch, again affecting traffic only from this site.

Same RADIUS controllers that we've tested all handle traffic from other WLCs. I've also checked with a WLC-RADIUS server combo to a third datacenter. Same fragmentation of SNMP and RADIUS, no drops. I took a few of these reference captures specifically to keep my sanity. 1500 MTU on all devices on our LAN. This is reassembly of UDP packets.
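For anyone following along, here is a rough sketch of how a large UDP datagram ends up as several IP fragments on a 1500 MTU path. The 4000-byte payload is an assumed size picked only because it yields three fragments; it is not taken from the captures. The point to notice is that only the first fragment carries the UDP header (and so the port numbers).

```python
def fragment_udp_datagram(udp_payload_len, mtu=1500, ip_header_len=20, udp_header_len=8):
    """Split one UDP datagram into IPv4 fragments.

    Returns a list of tuples:
    (offset_bytes, data_len, more_fragments_flag, carries_udp_header)
    """
    total = udp_header_len + udp_payload_len      # bytes the IP layer has to carry
    max_data = (mtu - ip_header_len) // 8 * 8     # fragment data must be a multiple of 8 bytes
    fragments = []
    offset = 0
    while offset < total:
        data_len = min(max_data, total - offset)
        more = offset + data_len < total          # MF flag is clear only on the last fragment
        fragments.append((offset, data_len, more, offset == 0))
        offset += data_len
    return fragments


# Assumed example: a 4000-byte SNMP or RADIUS payload (hypothetical size).
frags = fragment_udp_datagram(4000)
for offset, data_len, more, has_udp in frags:
    print(f"offset={offset:5d} len={data_len:5d} MF={int(more)} udp_header={has_udp}")

# A payload that fits in one packet is not fragmented at all.
single = fragment_udp_datagram(1472)
```

The receiver needs every fragment, including the last one with MF=0, before it can hand the datagram to the application.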

I'm skeptical that it's a device in the datacenters, since it's only this one site having the issue. What I do suspect is either the WAN router at the remote site, or one of the devices in the MPLS network close to the remote side end of things.

I hope to know more on the capture we get tomorrow. I'll need to see if we can get our WAN provider to also get captures from inside its network.
Title: Re: Packet Drop Issue
Post by: Dieselboy on May 29, 2016, 10:32:13 PM
Network diagram? Is the traffic being encapsulated over a VPN tunnel?
Title: Re: Packet Drop Issue
Post by: deanwebb on May 30, 2016, 08:33:16 AM
No VPN.

It's WiSM inside 6800 -> Riverbed -> WAN router -> MPLS -> WAN router -> IPS -> Core switches -> Vblock -> RADIUS server
Title: Re: Packet Drop Issue
Post by: wintermute000 on May 30, 2016, 06:28:18 PM
can't fault your logic (i.e. one site affected only so 99% not likely to be hub site, packet captures at your spoke WAN CE are fine, so issue is 99% likely to be in the WAN)
Title: Re: Packet Drop Issue
Post by: Dieselboy on May 30, 2016, 09:33:42 PM
Not a load balancing issue with the provider MPLS like what I have found recently on my provider internet service?
What is the mac address of the source of traffic which the provider sees?
Title: Re: Packet Drop Issue
Post by: deanwebb on May 30, 2016, 10:08:48 PM
@Wintermute: thanks for confirming my suspicions.

@Dieselboy: We checked that right after you posted that thread.
Title: Re: Packet Drop Issue
Post by: NetworkGroover on June 01, 2016, 10:41:14 AM
I'll admit that I was too lazy to read the entire thread - but I saw "fragmented" and thought, "Why?"

Why do you have fragmented traffic?  That's never desirable I thought?

EDIT - Ah, nevermind.  I see.
Title: Re: Packet Drop Issue
Post by: deanwebb on June 01, 2016, 12:58:42 PM
OK, found a zillion packet drops on the egress interface on the remote site's WAN router. That is bad because it's not good.

The drops spike with user activity. All queues are affected. About 20% of the traffic there is from the guest wireless SSID... guess what we just throttled down...

So, I can see why that drops RADIUS traffic as the user count increases for the day, but I still don't see how the SNMP bulk traffic gets whacked while unfragmented SNMP traffic always gets through... Doesn't help that I'm also sick today and had to deal with a partner site whose ISP started blocking port 22 outbound...
Title: Re: Packet Drop Issue
Post by: icecream-guy on June 01, 2016, 01:46:33 PM
what sick? of the rain?  you waterlogged?  I'd blame the port 22 thing on all the flooding there in Texas.
Title: Re: Packet Drop Issue
Post by: Dieselboy on June 01, 2016, 09:17:33 PM
Unfragmented SNMP traffic always gets through and fragmented SNMP does not because if one packet is lost out of the fragmented SNMP then all packets would be discarded, right? If one SNMP request is fragged into 3 packets and one of those 3 is lost, then the entire request is lost and would need to be resent.
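To put a number on that reasoning, a minimal sketch assuming every packet is dropped independently with the same probability. The 10% figure is hypothetical, not a measurement from this network.

```python
def datagram_loss_probability(per_packet_loss, fragment_count):
    """Chance that at least one of the fragments is lost, which kills the whole datagram."""
    return 1.0 - (1.0 - per_packet_loss) ** fragment_count


# Hypothetical 10% per-packet loss on a congested link.
for n in (1, 2, 3, 5):
    print(n, "fragment(s):", round(datagram_loss_probability(0.10, n), 4))
```

Under that assumption a 3-fragment request fails far more often than a single packet, but it still gets through most of the time, so uniform random loss on its own would not explain fragmented SNMP never working.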
Title: Re: Packet Drop Issue
Post by: deanwebb on June 01, 2016, 10:16:26 PM
Nope. One of the fragments may get dropped, but the other 1 or 2 go on through, hence the ICMP type 11 messages that the device didn't get all the fragments it was expecting.

And when we see the resend, all 3 for the request leave, but fewer than 3 arrive at the destination.
Title: Re: Packet Drop Issue
Post by: wintermute000 on June 02, 2016, 06:42:47 AM
if you're getting random output drops then it's strange that you've only noticed this on fragmented UDP traffic.

You're probably sick of captures by now LOL but a good test is iperf, as there are sequence numbers in there even in UDP mode IIRC.
Title: Re: Packet Drop Issue
Post by: deanwebb on June 02, 2016, 09:08:13 AM
No, I always love a good capture... and I've got sequence numbers in Wireshark.

Our WAN provider's engineers found zillions of drops on the outbound interface and throttled back the guest wireless traffic... but today is a holiday in the remote site, with only 80 people on the wireless there, and we're still seeing the goofy RADIUS drops... I'm thinking the solution for this is not in throttling traffic although, clearly, it was needed since packets from all queues were getting drops because of about 20% of the total traffic being guest wireless stuff.
Title: Re: Packet Drop Issue
Post by: deanwebb on June 10, 2016, 05:50:24 AM
RESOLVED:

QoS policy. It actually had the RADIUS and SNMP traffic marked for a priority queue, but fragments are IPv4... those go into the bulk traffic and get dropped first. Policy was amended to include classification by endpoints and then a few more tweaks, and now it's working.

PHEW.
Title: Re: Packet Drop Issue
Post by: Dieselboy on June 11, 2016, 05:02:29 AM
Good stuff.. :)
Title: Re: Packet Drop Issue
Post by: wintermute000 on June 11, 2016, 05:59:53 AM
Can I clarify: your QoS policy was relying on NBAR, presumably, instead of L3/L4 to classify RADIUS and SNMP, hence could not ID the fragments?

And the fix was to switch to classifying via traditional L3/L4? As I'd imagine even a fragmented UDP packet would surely have the port info intact?

I'm very curious if more application-smart devices *cough* Palo Alto *cough* have the same issue. I'll hassle the firewall guys next week if above is indeed the case.
Title: Re: Packet Drop Issue
Post by: deanwebb on June 11, 2016, 11:36:38 AM
UDP fragments had some port info in them, but it was still default Cisco QoS logic to put fragments into the highest category. Fix was to specifically deny fragments from that top queue, which was congested beyond belief.

Just encountered another site where all TCP traffic worked fine, but SNMP and RADIUS were broken... hmmm...
Title: Re: Packet Drop Issue
Post by: wintermute000 on June 11, 2016, 08:28:42 PM
So what exactly is the classify map matching on, can you post the configuration please
I.e. What criteria was configured for snmp radius that was failing to catch fragments
Title: Re: Packet Drop Issue
Post by: deanwebb on June 12, 2016, 09:50:58 AM
We first had no match, that gave random results.

Then we matched on destination IP addresses of the RADIUS servers. That gave predictable, but still erroneous results. The problem was that the first packet would match as SNMP or RADIUS traffic, but the fragment would still shoot into the EF queue and get whacked.

Once the line to exclude fragments from the top queue was entered, the issue resolved. We then removed the match on IP address and the issue stayed resolved.

Extended IP access list EF_Video_Voice
5 deny ip any any fragments


That's pretty much all we needed, that line at the top of the ACL to keep fragments out. Then the default Cisco behavior when treating fragments didn't screw up our traffic.
Title: Re: Packet Drop Issue
Post by: wintermute000 on June 12, 2016, 11:18:38 PM
I'm sorry but this makes even less sense now - I had assumed you were using 'opposite' terminology and calling the best efforts queue the 'top' queue but you mean EF i.e. the best queue which I presume is given priority.

If the fragments are being (accidentally) prioritized, why are they being dropped?

Secondly, what is the match criteria that is incorrectly matching UDP fragments into the EF class and queue? i.e. what is the class-map that classifies EF? e.g. I've had issues before with NBARv1 mis-matching.
For example, if you are using the ACL EF_Video_Voice to match the class-map, then why is it even matching ANY RADIUS/SNMP fragments in the first place?
I am aware voice traffic is usually small UDP packets, which is why I am suspicious it's an issue with NBAR and the use of "match protocol voice" or similar commands.


Are you able to post the complete QoS configurations?
Title: Re: Packet Drop Issue
Post by: deanwebb on June 13, 2016, 08:14:57 AM
Can't post the QoS policy, since it's on gear owned and operated by our WAN vendor. I only have that snippet sent by the vendor.

The internet pipe is very small - 200Mb/sec for about 3000 people. The voice/video queue was set at 1Mb/sec, as well. There was massive congestion in all queues, and there were drops in all queues.
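As a rough illustration of why a 1Mb/sec priority queue is so unforgiving: my understanding is that a priority class is policed to its configured rate when the link is congested, so anything over that rate is simply dropped. The sketch below is a generic token bucket, not the vendor's actual policer, and the 3000-byte burst and the traffic pattern are assumed values.

```python
def police(packets, rate_bps, burst_bytes):
    """Single-rate token-bucket policer.

    packets: list of (arrival_time_us, size_bytes), sorted by arrival time.
    Returns a list of booleans, True where the packet is forwarded.
    """
    SCALE = 1_000_000                      # tokens are kept in bit-microseconds for exact integer math
    capacity = burst_bytes * 8 * SCALE
    tokens = capacity                      # bucket starts full
    last_us = 0
    verdicts = []
    for arrival_us, size_bytes in packets:
        # Refill for the time elapsed since the previous packet, capped at the burst size.
        tokens = min(capacity, tokens + (arrival_us - last_us) * rate_bps)
        last_us = arrival_us
        cost = size_bytes * 8 * SCALE
        if cost <= tokens:
            tokens -= cost
            verdicts.append(True)
        else:
            verdicts.append(False)         # out of tokens: the packet is dropped
    return verdicts


# Assumed load: one 1500-byte packet every millisecond for 100 ms (12 Mb/sec offered)
# into a 1 Mb/sec class with a hypothetical 3000-byte burst allowance.
offered = [(i * 1000, 1500) for i in range(100)]
result = police(offered, rate_bps=1_000_000, burst_bytes=3000)
print("forwarded", sum(result), "of", len(result))
```

Any large fragment that lands in a class like that competes for a very small token budget, which would fit with first packets getting through in their own queue while the trailing fragments died.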

The SNMP and RADIUS were in a queue that had less priority than voice/video, but which also had more capacity allotted to it to handle the traffic. First packets or unfragmented packets got through every time.

The fragments, however, according to the WAN vendor engineer, were hit by a default behavior in Cisco QoS logic. IE, there was no policy to send them into our highest-priority-queue-of-doom, they just wound up in there because of that default behavior. Blocking fragments from that queue pushed them into the next-highest-priority queue, which was where the rest of the SNMP and RADIUS traffic was successfully getting through.

That's the best I can do, I'm afraid. But it was no policy that moved the fragments, it was default Cisco QoS logic. If the voice/video queue had more space in it and the WAN link not been running at 90%+ congestion, we wouldn't have seen the issue.
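My best guess at the "default behavior" the vendor meant is the way IOS ACLs treat non-initial fragments: they carry no L4 header, so a permit entry that matches on ports treats them as a match on L3 information alone, while a deny entry with ports skips them. The sketch below models that rule set as I understand it. The voice port range 16384-32767 is a hypothetical stand-in, since the real contents of EF_Video_Voice beyond line 5 were never posted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Packet:
    protocol: str                        # "udp", "tcp", ...
    dst_port: Optional[int]              # None on non-initial fragments (no L4 header present)
    non_initial_fragment: bool = False


@dataclass
class Ace:
    action: str                          # "permit" or "deny"
    protocol: str                        # "ip" matches any protocol
    dst_ports: Optional[Tuple[int, int]] = None   # inclusive range; None means no L4 match
    fragments: bool = False              # the "fragments" keyword


def evaluate(acl, pkt):
    """Return "permit" or "deny" for pkt; in a class-map, permit means 'belongs to this class'."""
    for ace in acl:
        if ace.protocol not in ("ip", pkt.protocol):
            continue
        if ace.fragments:
            # Entries with the keyword only ever match non-initial fragments.
            if pkt.non_initial_fragment:
                return ace.action
            continue
        if ace.dst_ports is None:
            return ace.action            # L3-only entry matches fragments and whole packets alike
        if pkt.non_initial_fragment:
            # No ports to inspect: a permit matches on L3 alone, a deny falls through.
            if ace.action == "permit":
                return "permit"
            continue
        low, high = ace.dst_ports
        if low <= pkt.dst_port <= high:
            return ace.action
    return "deny"                        # implicit deny at the end of every ACL


# Hypothetical voice/video ACL before and after the vendor's fix.
before = [Ace("permit", "udp", (16384, 32767))]
after = [Ace("deny", "ip", fragments=True)] + before

radius_first = Packet("udp", 1812)                      # first fragment, has the UDP header
radius_tail = Packet("udp", None, non_initial_fragment=True)
voice = Packet("udp", 20000)

print("before:", evaluate(before, radius_first), evaluate(before, radius_tail))
print("after: ", evaluate(after, radius_first), evaluate(after, radius_tail), evaluate(after, voice))
```

If that is what was going on, the first RADIUS fragment never matched the voice/video class but every trailing UDP fragment did, and the single "deny ip any any fragments" line at the top is enough to keep them out.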
Title: Re: Packet Drop Issue
Post by: LynK on June 27, 2016, 09:12:07 AM
silly question. Have you double checked MTU on all of your devices? Have you double checked speed/duplex on your links?
Title: Re: Packet Drop Issue
Post by: deanwebb on June 28, 2016, 08:06:05 AM
Yes and yes.

The site wanted to see if the fix allowed it to use just the datacenter and not the local RADIUS server. The answer is yes, until the entire office shows up to work a few hours after the early birds roll in. Then it's a hellride until people start to go home for the day. That WAN link is still too small, so even though the fragmented RADIUS traffic is no longer getting sent to die in the voice-video queue, it gets massacred in the general queue alongside other traffic when there's just too much stuff on the line.