December 2022 – nuxx.net

A couple of years after it happened I’m sharing this story about the intersection of an OS bug, a network hardware quirk, and a global pandemic. A chain of semi-esoteric things aligned and only caused noticeable problems in a very specific — dare I snarkily say unprecidented — situation.

I found this issue both fascinating and maddening and I hope you will as well. This does not contain code-level details of the bug (I don’t have them), but I’m sharing it both to document this problem and share a little story of what goes on behind the scenes in supporting a big enterprise IT environment.

In March 2020, when Stay Home, Stay Safe began in Michigan, most of my fellow employees began working from home (WFH). For years we’d been building IT systems to allow most people to work from anywhere, and the shift to WFH was going great. Between VPN connections back to the corporate network, lots of things in the cloud, and Azure Active Directory (AAD) to handle Single Sign-On (SSO) for almost every company application, all most folks needed for remote work was their standard laptop and an internet connection. The computing experience of working from home was effectively the same experience as working in the office.

Sure, we had some bumps with home internet connections not being robust enough, but we helped people through those. Mostly we’d find that what someone thought was a good home internet connection — because their phone or video streaming worked fine — wasn’t great for things like moving big files around or video calls. [1] Generally those having network performance issues had fine service from their ISP, but their home routers were old and not up to task. We would recommend upgrading the router, they’d go buy something new, and all would be good.

In early autumn we began receiving reports of users getting the notorious You Can’t Get There From Here (YCGTFH) message when trying to access anything that used AAD for SSO. This generic message is displayed when AAD authentication fails and access is denied. Because so many things were behind AAD SSO this interrupted a lot of work. These computers either were no longer joined to AAD or had an expired token, and it seemed tied to internet access.

Digging into it, there would be errors shown by dsregcmd /status (the AAD CLI utility) and Test-DeviceRegConnectivity.ps1. This script checks internet access as SYSTEM to the AAD public endpoints would fail on all three connection tests, implying the lack of connectivity to Microsoft endpoints was keeping AAD registration (and token refresh) from working properly. But the user still could browse the web and hit those URLs and we hadn’t changed anything internet access-wise since the WFH began.

We found that manually setting a proxy server for the whole of the system (via netsh winhttp set proxy) would allow AAD registration (dsregcmd /join) to succeed and SSO would then work. We also found that restarting the WinHTTP Web Proxy Auto-Discovery Service (WinHttpAutoProxySvc), which handles WPAD, would sometimes fix the problem, but only sometimes.

Even more confusingly sometimes a reboot would fix it. Or, sometimes if a user drove into the office and used the network there, it would work. But not always. [2]

Simply, we had some computers whose SYSTEM account couldn’t access the internet for so long that an AAD token had expired, this broke SSO and users were being told You Can’t Get There From Here.

Typical for a lot of large organizations we have authenticating proxy servers sitting between the client network and the public internet. All requests bound for the internet need to go through them, and these proxies are located by a Proxy Auto-Config (PAC) file that is found either by a direct setting (AutoConfigUrl) or Web Proxy Auto-Discovery (WPAD), via DNS, both of which send the same file. We directly set the PAC file URL on a per-user basis and leave WPAD at its default of enabled. Thus for the end user and things running under their account, WPAD is used, falling back to the PAC file setting if that fails. For the SYSTEM account a direct PAC file setting is not used, relying solely on WPAD to find the path to the internet. [3]

Looking at a network capture when AAD registration would fail instead of the normal chain of events requests we’d see no DNS requests for WPAD, and no PAC file download. Instead we saw the client attempting to resolve the AAD endpoints via DNS, and then would attempt to reach out directly to them, which would be blocked by the company firewall. The proxy was not being used; WPAD wasn’t working. This was weird because every piece worked when tested independently (DNS resolution for WPAD hostnames, invoke-webrequest http://x.x.x.x/wpad.dat, specifying the WPAD PAC file in AutoConfigUrl), but as a whole it just didn’t work.

This went on for quite a while, supported by a Premier case with Microsoft. We could see that WPAD was frequently failing, but struggled with getting a consistent reproduction and going down dead-ends. We bandaged the problem with manual, direct proxy settings and AAD registration. This was mostly fine short-term, but caused overhead for our support folks and was a ticking bomb.

Then one day, thanks to a fortuitous conversation with a very smart lead Microsoft engineer while working another issue I found out about a bug with Microsoft’s WPAD implementation that was just discovered and was being patched in the next round of patches. The description exactly explained our problem and I was elated.

It turned out that if Windows 10 received a blank DHCP option 252, the WinHttpAutoProxySvc service would not query DNS for WPAD, and — the broken part — it would never do so again until the service was restarted. Directly configured PAC files would be used, but WPAD was broken. Here in our environment the SYSTEM account would not have internet access and this meant AAD registration, Test-DeviceRegConnectivity.ps1, the Microsoft Store, and all such internet-needing SYSTEM-level things didn’t work.

Apparently some home router vendors — most notably the hugely-popular ASUS [4] — would send option 252 but leave it blank because they found doing so reduced name resolution requests from clients. This is seen in Wireshark as:
Option: (252) Private/Proxy autodiscovery Length: 1 Private/Proxy autodiscovery: \n

Windows, which has WPAD enabled by default, will try a number of name resolution queries (DNS for wpad/ wpad.local.tld.com/wpad.local.com, NetBIOS, LLMNR, etc) to locate a PAC file server if it does not see a DHCP option 252. Because WinHttpAutoProxySvc looks to DHCP then only tries DNS if that fails, by setting this option but leaving it blank the name resolution steps would not occur. I can only guess as to why these vendors find it desirable, but perhaps they like reducing the load on the built-in DNS forwarders, or they saw it as a security benefit or… who knows. Either way, the result of this blank option and the Windows bug was that WPAD — via DHCP and DNS — didn’t work.

So what is option 252? In WPAD there are multiple discovery mechanisms for finding the PAC file server. Beyond DNS there is also a Dynamic Host Configuration Protocol (DHCP) method where, along with the typical network address settings, the client receives the URL for downloading the PAC file. This is done via option 252, but isn’t widely supported, it’s normally not used, and we don’t use it it either.

While WPAD is core OS function, it unfortunately never left draft RFC status. Implementations have no formal standard to target; it’s just a guideline. Additionally, because it never left draft status, DHCP option 252 also remains unallocated and without a standard, simply part of the Reserved (Private Use) range. So, setting it but leaving it blank is not unacceptable, OS’ should be able to accommodate.

In a network capture I was then able to clearly see this happen, and it was simple to replicate on a test network. And then it finally all came together…

COVID-19 WFH resulted in a bunch of people upgrading their home networks, with lots of them buying new home routers, including the very-popular ASUS brand. If someone booted up their Windows 10 computer and the blank option 252 sending DHCP server was the first thing it saw, WPAD would break, with all the downstream consequences, including our AAD connectivity issues. And if they’d have AAD connectivity issues for long enough — until a token expired — they would start getting You Can’t Get There From Here messages.

If they went in the office or somewhere else which doesn’t set 252 and fully rebooted (which restarts the WinHttpAutoProxySvc service), WPAD would work and AAD registration would work. But if the service never restarted — if they never actually rebooted — WPAD was stuck not doing anything. [2]

Testing a pre-release version of the patch showed it fixed the problem, and then a few weeks later KB4601382 was released, with the detail “Improves the ability of the WinHTTP Web Proxy Auto-Discovery Service to ignore invalid Web Proxy Auto-Discovery Protocol (WPAD) URLs that the Dynamic Host Configuration Protocol (DHCP) server returns.”. We deployed this patch and the reports of AAD registration / You Can’t Get There From Here issues collapsed.

That was it, it was fixed. A popular home router vendor did something weird (but not against standard), the OS implemented something poorly, and people were working for so long in one of those environments that a credential expired, couldn’t be renewed, and they lost access to SSO.

[1] Mobile apps and a lot of modern websites are fairly asynchronous, sending requests in the background while still working nicely, because they are built to be tolerant of the blips that happen while on wireless networks. Video streaming specifically caches (or buffers) the video locally so that hiccups in the network connection don’t make the video pause and stutter. More real-time-ish things like Remote Desktop or video calls or copying files via SMB are considerably more sensitive to poor network connections.

[2] Retrospectively I suspect confusion around what it means to shut down or restart a computer led to many of the reports of reboots/shutdowns/driving into the office fixing the problem or not and the difficulties in getting a reliable reproduction. Different sleep modes, some of which result in the BIOS displaying the POST when waking from sleep even if the OS doesn’t restart, leads some to believe the operating system was restarted when it may not have been. Or other folks believe that closing the lid is “shutting down”.

It was also unfortunately common for users to have a flexible version of “home”. Sometimes home meant where they’d been working for the last six months and had the problem, sometimes it meant a vacation home with a different ISP they’d gone to the day before but failed to mention, sometimes it meant the other side of the world. Teasing this information out was difficult, as many used “home” to mean anywhere but their normally assigned desk. “I’ve only been at home” frequently meant “I continue to be not at the office”.

[3] We use WPAD because Windows 10 requires internet access for a lot of OS level things that run as SYSTEM, including the Microsoft Store and AAD device registration. If proxy servers are used, the SYSTEM account needs to find the proxy servers. Early on in our Windows 10 deployment Microsoft told us either direct internet access or WPAD was required. WPAD via DHCP doesn’t work with most VPN clients, because they configure their virtual adapters directly and not via DHCP. Thus, to have similar connectivity when in the office or remote, DNS for WPAD is the the choice.

[4] I’ve been told multiple brands do this, but ASUS only vendor where I personally observed it.

Comments closed

Every since getting one of the v4 models back in late 2018, the Salsa Warbird has been one of my absolute favorite bikes. I’ve ridden on all surfaces, and it’s been great, with only one recurring problem: front shifting.

These days bikes ridden anywhere but pavement are almost exclusively rear-shifting-only (1x) — without a front derailleur — and for good reason. Unfortunately, when I bought my Warbird there weren’t many (any?) 1x groups for drop bar use; Shimano hadn’t even announced their excellent gravel-focused GRX group. So, my bike came with the best bang-for-the-buck drivetrain at the time, Shimano 105 R7000. This remains an excellent groupset, and I put it on my new road bike, but a couple things about it were a bit lacking for off-pavement use: lack of a clutch derailleur to keep the chain taut when things got rough, and the front derailleur.

Early on I switched the rear derailleur to a clutched Ultegra RX (RD-RX800-GS), which was basically a preview of GRX, and worked wonderfully. It dropped right in place on the bike and Just Worked to solve chain slap. But, I still had one more problem… The front derailleur.

Salsa Warbird v4 Front Derailleur Cable Route

Located where the rear wheel flings debris, front derailleurs are quite prone to getting gunked up while riding, and (unfortunately) the Warbird exacerbates this by having the shift cable exit the frame at the bottom bracket, behind the seat tube, facing upward. This location collects whatever flings off the tire and is a very hard place to seal a moving cable. Coupled with our gritty, chloride-treated dirt roads, the result has been shift cables and housing corroding and binding far before any other drivetrain parts wear.

This corrosion leads to extra drag on the shift cable needed to shift, up until things bind firmly, after which the sacrificial Main Lever Support L (part number Y63X80010) in the shifter to breaks (exploded view). The overall design of this sacrificial part is great; a removable and cheap plastic bushing on the shift/brake lever blade that presses against the metal piece that handles the shifting. If the cable becomes too tight this $3 piece fails before anything else in the shifter, and replacing it takes seconds. But, the cable and housing also need to be fixed, and COVID-era parts shortages (these supports were unobtainium for about 12 months — I was using 3D printed ones for a while) and it all quickly becomes an irritation.

I’d lived with front shifting quirks for a couple years as most of the time it was fine, but losing the front derailleur was always in the back of my mind. Mountain bikes have this way for years, with a rear cassette that provides all the range needed, and this has become common on new gravel bikes.

After front shifting failed early on in De Ronde van Grampian and I ended up stuck in the big ring, I realized two things: I should get rid of the front derailleur, and I don’t need quite all the range that a 50-34/11-34 2×11 drivetrain offers. I use most of the range, but could lose a little on the top and bottom end of the range and be fine.

The first change was to remove the front derailleur, which was made easy by the front shift housing being split with a barrel adjuster below the handlebar. I was able to pull the cable and remove the housing all the way back to just under the the bar, leaving 2″ stub of housing unobtrusively peeking out of the bar tape. Next time I rewrap the bars I’ll remove this, but for now it’s fine.

Then to finish the conversion I fitted a Wolf Tooth 42t chainring specifically designed with an offset designed for 1x. None of the single-ring chainring bolts that I had around the house fit right, so I grabbed some of Wolftooth’s matching chainring bolts and it was all set. The ring fits the crank perfectly, worked great with the original chain length, and the I could shift across the whole of the cassette without any rear adjustment.

I did two rides on local rolling dirt roads and except for some longer steep climbs things were good. I was feeling good for both rides, so I didn’t mind having to grind out the longer climbs in sub-optimal cadence, but I realized I might need some more range on longer days. I’d also been concerned the left brifter would be flappy and noisy without a shift cable connected, but it’s just the same as before.

Next up was getting more gear range…

Gear Ratios for 105 R7000 50/34 11-34, 42×11-34, and 42×11-40.

The original Shimano 105 R7000 drivetrain, with 50/34 chainrings and an 11-34 cassette, gave a range from 455% to 100%, plenty for all my riding, but with a whole bunch of overlap. While I’d use the lowest ratio gears somewhat frequently, I’d almost never use the highest gear because at typical cadence I’d be moving well above 33 MPH. At those speeds I’m probably coasting down a hill and no longer pedaling.

After changing to a single 42t chainring the available ratios were 382% to 124%. At the high end this is nearly identical to everything but the rarely-used smallest cog, but at the low end it was like I lost the easiest two gears. On a couple test local couple-hour rides I didn’t absolutely need them, but it did require a bit more effort on some of the harder climbs.

With 11-speed mountain bike drivetrains on the decline, I came across Shimano XT 11-40 cassettes on sale and decided to try a bigger cassette on the rear of the bike. I was still on the original cassette and it was nearly worn out, so a change was due. This required a 1.85mm spacer to make the MTB cassette fit on a road freehub, which moved the cogs outboard, but a few limit adjustments and all was good. Now with a range of 382% to 105% it was almost identical to what I’d use on the original 2x setup.

Ultegra RX derailleur and 11-40 Cassette

The larger cassette fit pretty well, but caused some small quirks in spacing with the upper jockey wheel: the B-tension screw had to be cranked in almost all the way to have the gap set properly on biggest cog. It was quite a fiddly adjustment, as a quarter turn too far one way would have the jockey wheel rubbing on the cassette and a quarter turn the other would make shifting slow. This much B-tension also unwrapped the chain from the cassette a fair bit, making me concerned that with less chain on the cassette while pedaling the chain and cassette would wear faster than normal.

At this point I did a test ride and everything worked great when riding. Between the new chain (old was at 0.5%), cassette, and chainring the drivetrain was silent and I didn’t have a bit of trouble, but I still wanted it to be better.

A few days earlier I’d ordered a Wolftooth RoadLink DM, a nifty part that replaces a link in rear derailleurs like mine, repositioning it to handle larger cassettes. This is advertised to improve chainwrap and shifting performance, and immediately after installation the benefits were apparent. On the biggest cog chainwrap is much better, the derailleur no longer looks over-extended, the B-screw doesn’t need to be turned all the way in, and shifting feels more spot on.

Ultegra RX w/ RoadLink DM and 11-40 Cassette

While I could have fitted a Shimano GRX RD-RX812 derailleur instead of using the RoadLink DM on the Ultegra RX, this would have cost $90 more than the RoadLink DM, and I’m not sure it’d be any better.

All done, total cost for this was $265.03 after tax, and with a new cassette, chain, and chainring, I’ve now basically got a new drivetrain.

Wolf Tooth Chainring: $84.95
Wolf Tooth Chainring Bolts: $26.45
Shimano XT Cassette: $95.39
Shimano SLX/105 Chain: $26.49
Wolf Tooth RoadLink DM: $31.75

I’m quite happy with how this all turned out. A problem is solved, worn out things were upgraded, functional parts were kept, and it works great. My Warbird is ready for another year of gravel fun.

Comments closed

Month: December 2022

How ASUS and a Microsoft Bug Almost Broke Remote Work

1x’ing my Salsa Warbird for 2023