QoS for a Cisco 3850 Switch

Author
Peter Welcher
Architect, Operations Technical Advisor

I’ve been writing up an internal NetCraftsmen QoS template for a Cisco 3850 switch. This blog relates some lab experiences with the 3850. I hope it provides some useful information for those grappling with 3850 QoS. The second half of the blog is some observations and personal philosophy about QoS. I have quite a few more opinions than appear below, but hey, I’m trying to write a relatively short blog for once!

With new Cisco devices, when one wants a QoS template, about all one can do is RTFM (Read the Fine Manual), then Google search to see if anything useful is available yet (rare!). And usually one has to then re-interpret Cisco Medianet and prior templates for the queuing model the chips support. The final step is testing for CLI variations or obvious issues. I’d love to take a precision traffic generator and test thoroughly, but few if any consulting customers are interested in that level of effort. And consultants have an aversion to unbillable work (gasp!).

If you’re wondering “why templates”, well, it takes time and thought to translate the various Cisco documents into commands one can put into devices. Furthermore, the various switch models each have some quirks. So we’ve built up a set of documents that capture what we’ve learned over time.

Lab Experiences

I like to do VLAN-based QoS. Trust the voice VLAN, any video-conferencing (IPVC) VLANs, and do full Classification and Marking on data VLANs. (And server VLANs, and border devices, and then there’s also wireless …).

The 3850 switch QoS looks somewhat more router-like than that of most switches, with some queue settings that do not start with “mls qos”. That’s good!

The 3850 manual says that VLAN-based QoS is available but requires a per-port command. From other small Cisco switches, one expects to use “mls qos vlan-based” on the physical ports. I found that command online in a couple of blogs. I did not find it in the 3.x code documentation. Nor in the CLI. That’s not so good.

So I did a test along with a co-worker (thanks, Steve F!). We set up two PCs on two ports in a VLAN, the source PC and the destination (JPERF “server”) PC. I set up IPERF/JPERF to generate traffic, and created a QoS policy to set DSCP to EF. I applied the service-policy command to apply that policy to the VLAN, inbound. No show command that I tried was really at all helpful in seeing hits against the rule. When we applied an outbound policy (match DSCP, random bandwidth percent setting), we did get some rate information (based on a “match EF” class).

So we turned on WireShark on the destination PC. Result: DSCP EF confirmed.

We then applied a similar simple policy to the source port, one that set DSCP to CS3. The output showed the DSCP marking CS3.
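When reading a capture like this, it helps to know the numbers behind the names. The sketch below is only an illustration, assuming the standard code points (EF = 46, CS3 = 24) and the standard layout where DSCP occupies the upper six bits of the IP DS (formerly ToS) byte. The function names are mine, not from any tool used in the test.

```python
# Standard DSCP code points for the two markings used in this test.
DSCP = {"CS3": 24, "EF": 46}

def dscp_to_ds_byte(dscp, ecn=0):
    """Build the full DS byte: DSCP in the top 6 bits, ECN in the low 2 bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in the range 0-3")
    return (dscp << 2) | ecn

def ds_byte_to_dscp(ds_byte):
    """Recover the DSCP value from a DS byte seen in a capture."""
    if not 0 <= ds_byte <= 255:
        raise ValueError("DS byte must be in the range 0-255")
    return ds_byte >> 2

for name, value in DSCP.items():
    print(f"{name}: DSCP {value}, DS byte 0x{dscp_to_ds_byte(value):02x}")
# CS3: DSCP 24, DS byte 0x60
# EF: DSCP 46, DS byte 0xb8
```

So a packet whose DS byte reads 0xb8 is marked EF, and 0x60 is CS3.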

Tentative Conclusion: VLAN-based QoS appears to be on by default in the 3850, and if you apply a service-policy to a port, it over-rides the VLAN-based policy.

Yes, that wasn’t the most conclusive test. And I really like avoiding surprises by using well-documented features. But per-port per-VLAN QoS as an alternative is a bit ugly, if you don’t need that degree of complexity.

By the way, along the way the 3850 show information about rates wasn’t matching up with the IPERF/JPERF info. I ended up thinking JPERF 2.x says MegaBytes when it means MegaBits. And also not really believing the 3850 counters. WireShark showed results matching what I thought was going on. Tentative conclusion: the 3850 QoS show commands may have some cosmetic bugs in them, not be counting header bits, or something is off.
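The bytes-versus-bits suspicion is easy to sanity-check with arithmetic. Here is a minimal sketch, assuming you have an independent rate measurement in megabits per second (for example, one derived from a capture). The function name and the sample numbers are hypothetical, and this does not settle what JPERF actually reports.

```python
def closer_unit(reported, measured_mbits):
    """Guess whether a reported rate is in megabits or megabytes per second.

    reported       -- the number shown by the tool, unit uncertain
    measured_mbits -- an independent measurement, in megabits per second
    """
    error_if_bits = abs(reported - measured_mbits)
    error_if_bytes = abs(reported * 8 - measured_mbits)  # 1 byte = 8 bits
    return "megabits" if error_if_bits <= error_if_bytes else "megabytes"

# A tool shows "10" and the capture suggests roughly 10 Mbit/s:
print(closer_unit(10, 10.2))   # megabits
# The same "10" against roughly 80 Mbit/s on the wire:
print(closer_unit(10, 79.0))   # megabytes
```

A factor of eight is large enough that even rough counters should make the answer obvious. Smaller gaps are more likely to come from header overhead or counter quirks.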

Also observed: changing the policy on the fly (without removing it from the interface) left the ASIC chip (or something) very confused. Removing the service-policy and re-applying it apparently fixed the problem. Unfortunately, being able to modify a policy on the fly is pretty useful. Having to remove the service-policy from interfaces, edit the policy-map, then re-apply it is a bit more work, even using cut and paste with NotePad++.

Background and QoS Philosophy

For those who might be interested, my philosophy of QoS for Cisco is roughly:

(a) QoS is far too complicated (therefore great for consultants?).

(b) It takes a lot of attention to detail to deploy it right. It’s easy to miss ports, blades, or other aspects when you have 100-900 switches.

(c) If you add the innate complexity (different on each Cisco platform, etc.) to extra complexity you generate, complexity-squared = un-deployable.

(d) Always stick to commands and approaches that work widely, e.g. use ACLs to Classify and Mark, since that’s supported on almost anything but the very limited low-end L2 switches. Having widely differing approaches that leverage every feature of every device is a recipe for an unmaintainable QoS deployment.

(e) If you can use one ACL across the board, it greatly simplifies life. Having to tune or create an ACL per-site or per-subnet involves thought, hence will be time-consuming and error-prone. On the other hand, too long an ACL exhausts TCAM resources and won’t work either. So far, I’ve been able to keep ACLs short.

For example, for VoIP, you don’t have to match on all possible source, destination pairs. That would be a long ACL! If the ACL will be applied to a voice VLAN, you can assume traffic marked EF is probably coming from a phone. Checking the source address isn’t going to add any value, since a rogue PC doing something nasty would also have to be in that subnet anyway. Does checking the destination have value? I don’t see the point. I’d just trust the voice VLAN, since most people aren’t going to know how to get their PC or an app to transmit traffic with the right VLAN tag and DSCP EF. I worry more about desktop admins, Microsoft GPOs, or application programmer mis-conceptions — but most of that impacts the data VLAN, where I do ACL-based Classification and Marking.

(f) AutoQoS is (was?) a great idea, but what it does on each platform in a particular IOS release is not documented thoroughly. And life’s too short to find out using show commands. Some of the newer AutoQoS trust features sound great, but most of my customers still have 5-10 year old switches that don’t do the new stuff. If it doesn’t work across the board, I don’t do it. See (d) above.

(g) See (a) above, and Keep It Simple!

So over the years I’ve eased up and tried to simplify, simplify, simplify. The end result must be something that can actually be deployed within a lifetime, understood, and supported, yet gets the job done well. The mission is to get QoS in place to protect fragile traffic. That is not achieved if design and deployment get vastly bogged down in additional unnecessary details. So do what’s necessary or useful, but no more.

Some things that I do to facilitate that:

(a) Don’t police EF traffic on ports unless you have serious trust issues or are a provider. Really, why jump through hoops for something that probably will never happen? Deal with it if it does happen. In general, security pros are paid to be paranoid. QoS doesn’t have to be (and can’t be) that strict.

(b) Do everything by VLAN, not per-port. There are a lot fewer VLANs than active ports in switches. Life is too short to pay detailed attention to ports. And I’d rather not have to think about hardware models and linecards. Let alone which specific sub-models of 2960-S switches do or do not support certain QoS features.

(c) Use good naming conventions with version stamps in description lines for e.g. policy-maps and ACLs, to facilitate tracking versions and doing maintenance — and automation.

(d) Deploy QoS classes and policies for the future, using ACLs that don’t match anything. That way, standing up a new QoS class just means creating and pushing out an ACL. (This assumes a traffic mix resembling what Cisco Medianet tested.)

(e) Trust where you can, e.g. Voice VLAN. A recent insight: create a “trusted VLAN” for devices like Tandberg/Polycom IPVC devices. (I refuse to call Tandberg units “TelePresence” since then I have to refer to the big room units as “fully Immersive TelePresence”, or “big room units”, both of which just make my brain itch.) I used to think in terms of per-purpose VLANs, but heck, if there are a few properly configured Tandbergs and a few properly configured Video Surveillance cameras (with central storage), then put them in the same VLAN as long as I can trust them.
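The version stamps suggested in (c) above are what make automation practical. As a rough sketch only: assume a made-up convention where each description line contains a token like “v1.2”. The convention, the regular expression, and the function names are all hypothetical, not the actual NetCraftsmen template format.

```python
import re

# Hypothetical convention: a "v<major>.<minor>" token somewhere in the description.
VERSION_RE = re.compile(r"\bv(\d+)\.(\d+)\b")

def parse_version(description):
    """Return (major, minor) from a description line, or None if unstamped."""
    match = VERSION_RE.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def is_outdated(description, current):
    """True if the description is unstamped or older than the current version."""
    found = parse_version(description)
    return found is None or found < current

# Example description strings, invented for illustration.
descriptions = [
    "Access-layer ingress marking v1.2",
    "Access-layer ingress marking v1.3",
    "Legacy policy, no stamp",
]
for text in descriptions:
    print(text, "->", "outdated" if is_outdated(text, (1, 3)) else "current")
```

A script along these lines could scan saved configurations and list the devices still carrying an older policy-map or ACL.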

Twitter: @pjwelcher



4 responses to “QoS for a Cisco 3850 Switch”

  1. Hi, thanks for the information. Could you please throw some light on the queue-buffer ratio command on the 3850? Thanks.

  2. This looks somewhat like some of the queue tuning in the 3750 and fellow travelers. I’d say leave it alone unless you are dropping e.g. video, which would suggest you need more buffering. The concern would be short-changing another class on buffers. I’d stick with the defaults where possible and when there’s a clear problem, either do lab work to find a minimal change that works, or work with TAC on it. I like working with TAC when there’s something obscure. It’s also feedback to Cisco that they need to make either the feature simpler or the documentation more explicit.

  3. Dr Pete,

    How did you translate the cos-dscp mappings? Or did you just leave the default table map?

  4. I usually go with the defaults for Cisco switches. They usually have COS times 8 for DSCP, except COS 5 maps to EF = decimal 46.
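The rule of thumb in that reply can be written out as a small function. This is only a sketch of the rule as stated (CoS times 8, except CoS 5 maps to 46), not a dump of any particular switch’s table map. The function name is mine.

```python
def default_cos_to_dscp(cos):
    """Map a CoS value (0-7) to DSCP using the rule of thumb above."""
    if not 0 <= cos <= 7:
        raise ValueError("CoS must be in the range 0-7")
    # CoS 5 (voice) is the exception: it maps to EF (46) rather than 40.
    return 46 if cos == 5 else cos * 8

for cos in range(8):
    print(f"CoS {cos} -> DSCP {default_cos_to_dscp(cos)}")
```

Checking a platform’s actual defaults against this is a quick way to spot the models that do something different.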
