Let’s just agree on this up front: my home network is overly complicated. There’s a nice, boring LAN with WiFi APs and TVs and Playstations and so forth. There are family members trying to conduct business and school and social lives, and they mostly depend on things working.

And then… let’s call it The Dark Side of the Network. Where things get complicated. Where servers live and talk OSPF to multiple L3 switches, so everything keeps working even when I reboot a switch or move cables. Where service IP addresses are announced from multiple servers via BGP, and fail over when servers go down. Where there are redundant WAN links and redundant routers, and things get messy. To put a more acceptable spin on it, let’s call it “lab space.”

I’ve never really been happy with the lab side of my network gear. Mostly, this is because I’m not willing to pay for the “right” equipment, and I end up making do with whatever I can find used on eBay for a good price when I’m ready to upgrade things.

Generation Zero

Long ago in the mists of time, I had a couple unmanaged switches, a cheap WiFi AP, and an old Linux box that played router and firewall. The network config on the Linux box was applied by a simple shell script. Simple, but that meant that incremental changes (like adding a new firewall rule in the middle of the existing config) could only really be done by resetting the whole config and running the script over again. Performance was mostly fine, because Internet speeds from that era were slow; I think I went from 256 kbps to a couple Mbps with this sort of router.

Generation One

Eventually, the tweak-shell-scripts-and-hope-it-works thing got old, and I discovered Vyatta, which was sort of an open-source clone of Juniper’s router OS. At the time, I was kind of concerned about Vyatta’s long-term viability as an open-source project, but I figured that it was better than what I had already.

Vyatta worked well enough for the time, but in 2012 they sold to Brocade, who promptly started focusing on Vyatta’s commercial versions to the detriment of the open source version. By 2013, the open source version was dead, and I was looking for a new router.

A group forked the last open-source version of Vyatta into VyOS, but the first few years of VyOS were slow and didn’t really fix any of the things about Vyatta that bugged me. Eventually, I decided it was time for a wholesale change.

Generation Two

So, if I wasn’t happy with an open-source Juniper clone, maybe I’d be happier with an actual router from Juniper? In 2014, I bought a used Juniper SRX240 off of eBay and spent a few days fighting to configure it the way I wanted. At the time, Junos on SRXes still had a bunch of gotchas; IIRC I had a hard time getting it to act as a DHCP server reliably, and couldn’t get it to tunnel IPv6 to tunnelbroker.com at all. But at least those were new problems, compared to Vyatta.

After a few months, I broke down and bought a second cheap SRX240; they were nearing their end-of-life and weren’t all that pricey. SRXes supported “chassis cluster” mode, where two SRXes could share a config and be managed together, and then fail over on reboot, etc.

Unfortunately, turning on chassis cluster mode required renaming most of the interfaces in the config. That wasn’t too bad…

And then I discovered that in chassis cluster mode on SRX240s, Juniper didn’t support being a DHCP server. Huh. Also, IPv6 tunnelling was right out.

Everything else worked, though, and it made router software updates a snap. Things failed over within a second or so and everything just kept working.

Somewhere along the way here, I replaced my “main” home switch with a Juniper EX4200. And then added another one in the garage. And then added a second one in the garage because I’d filled up the first one. And then added a 4th one to the basement, so I could have another couple 10GbE ports. The EX4200s were way too loud, but they were proper L3 switches, with PoE and OSPF and BGP and pretty much anything else that I’d want.

Except for reasonable multicast support, as it turns out. Or, perhaps more accurately, they had reasonable multicast support with IGMP and PIM and so forth, but the world had moved on. I had tons of consumer devices that used multicast for discovery (mDNS, etc), and none of them used IGMP, and that confused the heck out of the switches. This led to fun debugging problems, like “why does the list of available Chromecasts on my phone vary depending on which WiFi AP I’ve roamed to?”

Generation Three

Remember I said that the Juniper SRX240s were cheap because they were nearing their EOL date? Yeah, that passed and they stopped getting updates. And a few security bugs popped up. Strictly speaking, only SRX240s with 1 GB of RAM were EOL; in theory I could open mine up and add more memory, and then find a way to hack their model number to trick the installer into working, but that didn’t seem like a great use of time.

So I bought a used SRX650 off of eBay. The 650 was the 240’s big brother; it ran the same basic software but had a much larger CPU and had multiple slots that could take 16-port GigE cards or even 2 port 10GbE cards. I only bought one this time; they were too expensive to get a pair, and the 10GbE cards were vanishingly rare in the used market anyway. This meant that I couldn’t use chassis cluster mode. Fortunately, Junos had moved forward enough so that DHCP and IPv6 tunneling worked right, finally. The router was fast, it was less complicated than 2x SRX240s had been, and everything was good.

On the server side, I picked up an amazingly cheap L2 40GbE switch, and a not-quite-as-cheap 10GbE Juniper switch, and upgraded servers mostly to 10GbE or 40GbE. With one router, one 40GbE switch, and one 10GbE switch, I didn’t have much redundancy, but things worked well enough. Software updates led to 10-30 minute outages, though.

By this point, my home Internet link was around 150-300 Mbps. The SRX was good for something like 800 Mbps - 3 Gbps, depending on what exactly you were asking for, so it was overkill for Internet use, but it wasn’t too hard to bring it to its knees when servers with 10 Gbps NICs hammered things. That could mostly be mitigated by having the L3 10 Gbps switch take over routing between 10 Gbps machines.

Generation Four

The next generation didn’t actually involve any router changes. I started getting tired of the noise from the Juniper EX4200 switches in my basement wiring closet. The weird mDNS problems kept annoying the rest of the household. I also had a spate of Sonos-induced network problems that I’ll probably write up some day; Sonos + Juniper was really a match made in hell for many reasons.

So I decided to simplify the “house” side of the house. I bought a couple Ubiquiti Unifi switches and APs, and moved all of the non-server stuff onto them. Not being “enterprise” switches, they didn’t act affronted by “broken” mDNS multicast traffic. They drew way less power, so my wiring closet cooled down enough that I could close the door without overheating. And they were nearly silent anyway, which was nice. They were just managed L2 switches, not L3 switches, but that was fine. Switch software updates still caused outages, but that was manageable by plugging APs into multiple switches and using STP to give things a bit of redundancy.

Generation Five

That all worked fine until Covid hit. Then a few things happened all more or less at the same time.

  1. My Internet speed climbed to 1 Gbps.
  2. The SRX650 went EOL and stopped getting software updates.
  3. I suddenly had 4 people in the house that needed working Internet all day long for school and work.

I started getting weird bug reports from family members. Things like “iPad crashes 2x/day when connected to WiFi, turning WiFi off makes it work fine for weeks”. Plus stuttering video, poor WiFi coverage in places where people wanted to work, and so on. Lots of weird random problems.

Thanks to the pandemic, I had free time on my hands, so I tried pretty much everything.

  1. I finally set up Prometheus for monitoring things. General maxim: you can’t fix what you can’t see. It showed… weirdness, especially for WiFi devices. Weird ping latency sometimes, things that should be online falling offline for 15 minutes at a stretch. It was just generally ugly.
  2. I upgraded the Unifi Ethernet switches a bit, explicitly having redundancy at every point in the network. Lots of 10 Gbps fiber between the garage and the wiring closet. It looked nicer, but didn’t actually seem to help anything. It let me catch up on software updates that I’d been putting off for fear of breaking late-night Zoom calls, though, so that was still a win.
  3. I tried changing most of the meaningful settings on the Unifi WiFi APs, but WiFi kept being bad. I changed channels, turned off optional features, enabled and disabled assorted roaming options. All to no avail.
  4. I ended up biting the bullet and replacing the Unifi APs with pricy enterprise-grade Ruckus APs. They were mostly amazing, right up to the point where they started crashing several times per day. The WiFi experience was better, slightly, until the APs all went down and rebooted at the same time. WTF?
  5. I finally decided that there had to be something “toxic” on my network, probably something embedded or IoT-ish. I tried disconnecting my fridge from WiFi, and then waited a day to see if the APs still crashed. Didn’t help.
  6. Finally, I remembered the one thing on the home network with a known history of breaking things: my Sonos speakers. They were death to Juniper EX4200s, especially when more than one ended up connected with wired Ethernet (instead of their own semi-WiFi bridging protocol). So I hunted and discovered that yes, two were plugged in. I unplugged one of them, and everything got better. No more Zoom dropouts. No more AP crashes. No more crashing or hanging iPads and iPhones. See Twitter for the whole story.

At this point, things were mostly working again, but the Juniper SRX650 was getting long in the tooth and had to go. And the remaining Juniper switches in the garage, plus the 40GbE switch, had also passed their use-by date.

Plus, after debugging network crashes caused by rogue loudspeakers, I figured I was invincible.

So, I replaced the pile of switches with two 100GbE switches: one Arista 7060CX-32S and one Edge-Core 7326-56X. Then I replaced the SRX650 with a PC running VyOS.

The plan was to run SONiC on both switches and keep pretty much all of the “server” stuff completely disjoint from the “house” stuff.

Moving from the 2U SRX650 to a grossly overpowered 1U PC with multiple 100GbE and 25GbE ports was pretty easy. Latency generally dropped across the board. I traded some weird Juniper bugs for some weird VyOS bugs, but it was mostly fine.

The switches, on the other hand, went horribly.

SONiC (screw it, I’m going to just write ‘Sonic’ from now on) sounds really nice on paper. It’s an open source switch OS, mostly from Microsoft, based on Linux and FRR and all of the usual tools, sitting on top of all of the usual big switch chips. It’s designed to run entire big hyperscaling datacenters. So that’s great, right?

Yeah… so the thing is, if you’re building 50x giant DCs, then you get to put a lot of effort into your designs. You don’t need your network to be flexible. You don’t even want your network to be flexible, except in a very few specific ways that are needed for your services. Everything else should be as boring as possible. Since you’re going to deploy hundreds of identical switches, there’s no real downside to maintaining your own switch OS build, with your defaults burned into the image.

If you’re trying to connect 2 switches to a bunch of legacy stuff, though, absolutely none of that helps you at all. Things that Sonic lacked when I started:

  • Spanning tree
  • OSPF
  • Flexible per-peer BGP configs
  • Working SNMP
  • The ability to maintain login accounts across software updates
  • Any standard way to break 100GbE interfaces into 4x25GbE or 4x10GbE.

That’s the short list. The long list is more like “everything but BGP-routed interfaces and bridged+routed VLANs.” Now, since this whole mess is built on Linux, a lot of this is pretty easy to work around. You want OSPF? Fine, edit /etc/frr/daemons and manually start the OSPF daemon inside of the BGP docker container. And restart it manually at every boot, because there’s really no good way to restart it automatically at all.
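
As a rough illustration of the “edit /etc/frr/daemons” half of that workaround, here’s a minimal Python sketch that flips the ospfd line in the text of a daemons file. It assumes the file uses the usual `name=yes` / `name=no` lines; the helper name `enable_daemon` is made up for this example, and it deliberately doesn’t touch the container or start anything.

```python
import re

def enable_daemon(daemons_text: str, daemon: str = "ospfd") -> str:
    """Return daemons_text with `<daemon>=no` switched to `<daemon>=yes`.

    Hypothetical helper: assumes the daemons file uses `name=yes|no` lines.
    Comments and every other line are left untouched.
    """
    # [ \t]* (not \s*) so trailing blanks are eaten but newlines are preserved.
    pattern = re.compile(rf"^{re.escape(daemon)}=no[ \t]*$", re.MULTILINE)
    return pattern.sub(f"{daemon}=yes", daemons_text)

# Example input standing in for the real file contents.
sample = "bgpd=yes\nospfd=no\nospf6d=no\n"
updated = enable_daemon(sample)
```

Only the exact `ospfd=no` line changes; `ospf6d=no` is left alone because the pattern is anchored to the full daemon name.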

The Edge-Core switch mostly worked, except it’d sometimes utterly fail to bring up new interfaces unless I power-cycled it. So I’d plug in a new 100GbE cable and… no link. So I’d change cables, fiddle with settings on both ends. Nothing. I’d reboot, and bam, link went up.

The Arista switch was even worse; after a couple months, Sonic was too big to be able to hold 2 versions on the Arista’s flash. It mostly didn’t have the weird interface issues of the Edge-Core, but it wasn’t upgradable without heroic measures, so I finally reflashed it back to Arista’s EOS software. About 30 minutes later my quality of life was measurably better. Everything… just… worked. OSPF? Breaking interfaces out into 4x10G? SNMP? Multiple ssh users? How about VRRP? It was so boring that I considered replacing the Edge-Core with another Arista switch, only to discover that they’d jumped 6x in price. Ugh.

I tried upgrading to the latest Sonic release on the Edge-Core; not helpful. Then I remembered that the vendor actually maintains their own Sonic distribution, and I decided to take a look at it. It added OSPF, spanning tree, a handful of other features, and some bug fixes. They’d cleaned up Sonic’s port breakout setup, so supposedly that worked, even. One quick reboot, and… it wasn’t perfect, but it was better.

Mostly. I still find situations every month or so where the Sonic switch just… stops routing. I should be able to talk A (Linux) -> B (Sonic) -> C (Linux). All three run OSPF. All three can see their peer correctly. All of them have correct routes in their routing tables. A+B can ping each other, and B+C can ping each other. A cannot ping C.
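
That symptom is easy to state mechanically, which makes it a reasonable thing to check for automatically. Here’s a minimal sketch, assuming ping results have already been collected into a dict of (source, destination) pairs; the function name and the data shape are assumptions for illustration, not the output of any particular tool.

```python
from itertools import permutations

def broken_transit(reach: dict, nodes: list) -> list:
    """Find (a, b, c) where a<->b and b<->c work but a cannot reach c.

    `reach` maps (src, dst) to True/False ping results; missing pairs count
    as unreachable. Hypothetical data shape, for illustration only.
    """
    def ok(x, y):
        return reach.get((x, y), False)

    suspects = []
    for a, b, c in permutations(nodes, 3):
        both_hops_ok = ok(a, b) and ok(b, a) and ok(b, c) and ok(c, b)
        if both_hops_ok and not ok(a, c):
            suspects.append((a, b, c))
    return suspects

# The situation described above: both hops answer, end-to-end does not.
pings = {
    ("A", "B"): True, ("B", "A"): True,
    ("B", "C"): True, ("C", "B"): True,
    ("A", "C"): False, ("C", "A"): False,
}
result = broken_transit(pings, ["A", "B", "C"])
```

Each tuple in `result` names B as the middle box that both ends can reach, which is where to start looking. It only describes reachability; it says nothing about why forwarding stopped.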

I cleared up one case of this last month by turning down OSPF on box D, which had no routes that overlapped anything of A or B. The Arista switch was fine. But not the Sonic switch.

I had another case where boxes E, F, G, and H all had terrible 99p DNS latency. Rebooting backup router “I” suddenly fixed that, even though there was no traffic passing through “I” at all. Generally, rebooting the Sonic switch will fix this sort of problem, for a while.

Also, while the Sonic switch does STP, it doesn’t seem to actually work right for me. Edge-Core’s support people have been helpful, and part of the problem was on my end, but we also uncovered bugs that will need to wait for the next release. This makes interconnecting the ‘house’ and ‘server’ networks tricky.

At this point, I’d really like to get rid of the Edge-Core switch, but cheap 100GbE switches just don’t exist on the market. Even for ‘cheap car’ values of cheap.

Generation Six?

At this point, the Arista switch is working perfectly, while the open-source Sonic switch mostly passes traffic, but is fragile.

Similarly, my VyOS router is acting up. It’s also unhappy with STP in some cases, and in any case only does ancient slow STP, not modern RSTP or any of its descendants. While my main ISP is reliable, I’ve added a 5G backup, and VyOS’s WAN failover support is terrible if you have anything but a dead-simple network. What should be a simple, single rule that switches the default route between interface A and interface B has ballooned into a 28 rule monstrosity that excludes intra-network traffic and then special-cases handling for each inbound interface.
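
For contrast, the decision that failover is supposed to make is tiny. This is a sketch of the intended logic only, not VyOS configuration; the interface names and the health-check input are placeholders invented for the example.

```python
def pick_default_route(health: dict, preference: list):
    """Return the first interface in `preference` whose health check passes.

    `health` maps interface name -> bool (e.g. the result of a ping probe).
    Returns None if every uplink is down. Names are placeholders.
    """
    for iface in preference:
        if health.get(iface, False):
            return iface
    return None

uplinks = ["wan_primary", "wan_5g"]  # hypothetical names, primary first
normal = pick_default_route({"wan_primary": True, "wan_5g": True}, uplinks)
failed_over = pick_default_route({"wan_primary": False, "wan_5g": True}, uplinks)
```

Everything else in the 28 rules exists to keep intra-network traffic and per-interface special cases out of the way of this one choice.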

My last 2 VyOS upgrades have each gone terribly. The first upgrade left me unable to log into the router, even on the console. Apparently the password hash algorithm changed or something? So I rolled that back. The second upgrade was even more fun–it decided to renumber some of my Ethernet interfaces for me. VyOS’s config explicitly includes each Ethernet interface’s MAC address, so it can maintain consistent names across reboots and hardware changes (Linux is weird this way). The VyOS updater decided to rewrite that bit of config for me and broke it.

Earlier this year, I picked up what is apparently the only cheap piece of network equipment that remains on eBay, a Juniper NFX250. The NFX series is weird. They’re mostly just a PC under the hood, running Linux, with a VM running Junos (which is FreeBSD-based). They mostly behave like SRXes. There are a few weird bits, though:

  • They have 14 Ethernet ports on the front panel (not counting the management port), all of which are connected to an internal managed switch. That switch then has 2x 10GbE ports that are connected to the PC inside. Then those ports are dumped into Open vSwitch (plus DPDK), which has multiple logical Ethernet ports mapped onto it. You can easily map individual ports to VLANs, set up trunking, and so forth, but you can’t directly route between individual Ethernet ports like you could on an SRX. Instead, you need to assign ports to VLANs, and then make sure that the VLAN is mapped through to a pseudo interface and route that. It works, but it’s weird.
  • They can’t consistently do jumbo frames. The switch can handle 9k frames fine, but OVS and the pseudo-Ethernet bits inside can only handle 2k, and silently drop anything above that. It’s documented, sort of.
  • The documentation for NFXs is really bad by Juniper’s standards. The NFX250 was the first member of the family, and they completely replaced the OS at one point. So a lot of docs discuss the old way of doing things. You need to search for NFX250 ‘NG’, which is the new generation of software. (Supposedly the NFX150 and NFX350 didn’t go through this step, but they’re newer and rare). Lots of “normal” Juniper things don’t work quite the way you’d expect. Like software upgrades: instead of request system software add, it’s request vmhost software add.
  • The NFX series is really designed to run networking VMs at the edge (“Network Function Virtualization”), presumably so you can run company-specific services in VMs on the NFX250 in branch offices, etc. So there’s a whole virtual machine management layer that you don’t see on SRXes, etc, and the Junos interface itself is inside of a VM. In some ways this is kind of nice–I’m probably going to add a VM to mine to handle mDNS proxying at some point, which Junos can’t do–but in other ways it just makes the whole thing weird.
  • There isn’t enough RAM. The NFX250 usually comes with 16 GB of RAM, and since it’s a Xeon-D15xx under the hood it’ll happily take up to 64 GB of DDR4 DIMMs, but the Junos VM is hard-wired to only use 2 GB. Which means it can’t take full BGP routes or anything. Which I wasn’t planning to do, but still…
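
Because oversized frames are dropped silently (the jumbo frame item above), the practical way to find the real limit on a path is to probe for it. Here’s a minimal sketch of a binary search over sizes, with the probe itself left as a callable you’d supply (for example, a wrapper around ping with don’t-fragment set). The 2000-byte limit in the example is a stand-in for the “2k” mentioned above, not an exact figure.

```python
from typing import Callable

def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes probe is monotonic (everything up to the limit passes, everything
    above it fails) and that probe(low) passes. `probe` is caller-supplied.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low

# Simulated path that silently drops anything above 2000 bytes (stand-in value).
def fake_path(size: int) -> bool:
    return size <= 2000

found = largest_passing_size(fake_path, 1500, 9000)
```

With a real probe, each call costs a timeout when the packet is dropped, so the dozen-or-so probes a binary search needs between 1500 and 9000 is a lot friendlier than stepping through sizes one at a time.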

Now that I’ve made it through the NFX-specific weirdness, I’m kind of liking the little box. It could use more 10GbE ports (it only has 2, I’d like 4), but since it’s priced on eBay about the same as a single 100GbE NIC I’m not going to complain too much. The NFX350 has 8x 10GbE ports, but they’re newer, probably aren’t selling very fast, and haven’t come off of lease yet anywhere so there aren’t any on eBay. I’m not going to pay new car prices for a new NFX350.

For now, I’m debating replacing my overpowered VyOS box with a pair of NFX250s, just because they’re less work to maintain, and then moving the bulk of the internal routing onto the Arista, leaving the Sonic switch as a backup for now.

I’ll be writing more about the NFX250s; there are a surprising number of little problems that I ran into with them that aren’t well documented. I suspect that a couple ‘guide to NFX250’ articles could be useful for others.

Conclusions

At this point, I think I can draw a couple pretty simple conclusions about commercial vs open-source routing and switching platforms (at least for my uses):

  1. Commercial offerings go EOL and lose software support way before the hardware becomes obsolete for home/lab uses.
  2. Open Source options are nice in principle, but don’t generally offer very good support for even slightly unusual uses, and have vastly inferior documentation and support for edge features.
  3. Open Source/Commercial hybrid platforms like Vyatta are inherently difficult, because they really want to hold features back for their commercial offering, or give users some reason to pay beyond hypothetical support offerings. Building an open community around this sort of structure is hard. VyOS is going down this route as well; hopefully they won’t sell and vanish like Vyatta did.
  4. PC hardware is actually surprisingly good at a lot of network work these days, if everything is set up right. You can’t throw four 2-port 100GbE cards into a PC and expect 800 Gbps of throughput, but a pair of 100GbE ports can forward a lot of large-packet bytes. And there’s a bunch of work going on that will make small packet, high PPS workloads better. Unfortunately, it seems to be difficult to retrofit XDP and friends into existing stacks like VyOS.
  5. Monitoring and testing networks is still a pain in the neck. I still don’t understand what the Sonic switch does when it stops forwarding. I suspect that it’s failing to push a kernel route through to the Broadcom switch chip’s FIB, but debugging that is awkward at best.
  6. Linux is still lacking a lot of surprisingly basic stuff, like easy RSTP support. I can’t even figure out how to debug STP the “modern” way, with the bridge command from iproute2 instead of the older brctl command.
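
To put numbers on the large-packet versus small-packet distinction in point 4: the packet rate needed to fill an Ethernet link follows from the frame size plus 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter, and inter-frame gap). A quick sketch of that arithmetic, using standard Ethernet framing and nothing specific to my hardware:

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame + 12 inter-frame gap

def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second needed to fill a link with frames of a given size.

    frame_bytes is the Ethernet frame including header and FCS (64 minimum,
    1518 for a standard untagged frame carrying a 1500-byte payload).
    """
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire

small = line_rate_pps(100e9, 64)    # minimum-size frames
large = line_rate_pps(100e9, 1518)  # full-size standard frames
print(f"64-byte frames:   {small / 1e6:.1f} Mpps")
print(f"1518-byte frames: {large / 1e6:.1f} Mpps")
```

A single 100GbE port needs roughly 8 million packets per second to run at line rate with full-size frames, but nearly 149 million with minimum-size ones. That gap of about 18x is why a PC can look great on bulk transfers and still struggle on high-PPS workloads.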