COVID-19 is a delaying game

I've heard a lot of people get fatalistic about taking precautions, talking about how they're inevitably going to get sick with COVID-19 even if they wash their hands, maintain distance, and even partially self-isolate (e.g. work from home). This can be due to living with others, having a face-to-face customer service job, or just eating food that has passed through the hands of others.

And it's true. They're most likely going to come down with COVID-19 at some point, regardless of what they do.

The key—the thing people aren't talking about—is when. And that makes all the difference.

An experiment in repopping popcorn

We use an air popper to make popcorn at home, and there are always a few unpopped kernels at the bottom. Far fewer than with microwave popcorn, and not enough to worry about waste-wise, but a few. I became curious about whether these were just unpopped, or actually unpoppable.

Verdict in my N=1 experiment: Yes, almost all of them can be repopped! The easiest thing is to just toss 'em back in the popper for next time. Throw 'em back, they're not big enough yet. ;-)

Load balancing: Beyond healthchecks

I became interested in finding The Perfect Load Balancer when we had a series of incidents at work involving a service talking to a database that was behaving erratically. While our first focus was on making the database more stable, it was clear to me that the impact on the service could have been vastly reduced if we had been able to load-balance requests more effectively across the database's several read endpoints.

The more I looked into the state of the art, the more surprised I was to discover that this is far from being a solved problem. There are plenty of load balancers, but many use algorithms that only work for one or two failure modes—and in these incidents, we had seen a variety of failure modes.

This post describes what I learned about the current state of load balancing for high availability, my understanding of the problematic dynamics of the most common tools, and where I think we should go from here.

(Disclaimer: This is based primarily on thought experiments and casual observations, and I have not had much luck in finding relevant academic literature. A later simulation run and a dark-launch in production had very favorable results, but due to external circumstances it never saw full production usage. So consider this only 75% reality-tested.)

TL;DR

Points I'd like you to take away from this:

  • Server health can only be understood in the context of the cluster's health
  • Load balancers that use active healthchecks to kick out servers may unnecessarily lose traffic when healthchecks fail to be representative of real traffic health
  • Passive monitoring of actual traffic allows latency and failure rate metrics to participate in equitable load distribution
  • If small differences in server health produce large differences in load balancing, the system may oscillate wildly and unpredictably
  • Randomness can inhibit mobbing and other unwanted correlated behaviors (see the sketch after this list)
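
To make the last two points concrete, here's a minimal sketch (my own illustration, not code from the post; the server names and numbers are invented) of passively measured health scores driving a weighted random choice. A degraded server receives proportionally less traffic instead of being kicked out entirely, and because the choice is random, clients don't all mob whichever server currently looks best.

    import random

    # Hypothetical health scores in [0, 1], e.g. recent success rates observed
    # from real traffic rather than from active healthchecks.
    health = {"db-replica-1": 0.98, "db-replica-2": 0.92, "db-replica-3": 0.40}

    def pick_server(health):
        # Weighted random pick: probability of selection is proportional to health,
        # with a small floor so a struggling server still gets a trickle of traffic.
        servers = list(health)
        weights = [max(health[s], 0.01) for s in servers]
        return random.choices(servers, weights=weights, k=1)[0]

    print(pick_server(health))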

Adaptive load balancing

At work, I've recently run up against the classic challenge faced by anyone running a high-availability service: Load balancing in the face of failures. I'm not sure the right solution has been written in software yet, but after a good deal of hammock time and chatting with coworkers, I think I've put together an algorithm that might work.

Let's say you have a goodly sized collection of API servers, each talking to a handful of backend servers and load-balancing between them. The API servers receive high request rates that necessitate calls to the backend, and they must be kept highly available even if backend servers unexpectedly go down or intermediary network conditions degrade. Backpressure is not an option; you can't just send HTTP 429 Too Many Requests. Taking load off of a backend server that is suffering is good, but that puts more pressure on the others. How do you know what failure rate means you should be shedding load? How do you integrate both latency/timeout issues and explicit errors?

Generally: How do you maximize successful responses to your callers while protecting yourself from cascading failures? How can a load-balancer understand the cluster-level health of the backend?

The short version: Track an exponentially decaying health measure for each backend server based on error rates, distribute requests proportionally to health, and skip over servers that have reached an adaptive concurrency limit based on latency measures.
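
As a rough illustration of that short version (a sketch under my own assumptions, not the post's actual code; the class names and constants are invented), the pieces fit together something like this: a decaying health score per backend fed by successes and errors, a concurrency limit per backend adjusted from observed latency, and a weighted random choice among the backends that aren't saturated.

    import random

    DECAY = 0.9  # how quickly old observations fade; an invented constant

    class Backend:
        def __init__(self, name):
            self.name = name
            self.health = 1.0   # exponentially decaying success measure
            self.in_flight = 0  # requests currently outstanding
            self.limit = 10     # concurrency limit, adapted from latency

        def record_outcome(self, success):
            # Exponentially decaying average of outcomes: 1 for success, 0 for error.
            self.health = DECAY * self.health + (1 - DECAY) * (1.0 if success else 0.0)

        def record_latency(self, latency, target=0.100):
            # Crude stand-in for an adaptive concurrency limit: shrink it when
            # observed latency exceeds the target, grow it slowly otherwise.
            self.limit = max(1, self.limit - 1) if latency > target else self.limit + 1

    def choose(backends):
        # Skip backends that have reached their concurrency limit, then pick among
        # the rest with probability proportional to health.
        available = [b for b in backends if b.in_flight < b.limit]
        if not available:
            return None  # everything is saturated; the caller must queue, retry, or fail
        weights = [max(b.health, 0.01) for b in available]
        return random.choices(available, weights=weights, k=1)[0]

Note that this sketch already uses a single decaying average and a single weighted random pick, which is closer to the evolution described in the update below than to the bucketed version with a fallback cascade that the original post works through.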

Update 2019-07-30: While I no longer think this precise approach is what I want, the general outlines are still good. You can read my conclusions about traffic-informed load balancing. The experimental code that I'm still working on is an evolution of the algorithm outlined here, but it replaces the buckets with a single exponentially decaying average and discards the entire fallback cascade in favor of a single weighted random selection.

My own Creepy Facebook Surveillance Moment

I've heard any number of stories from people about creepy things Facebook or other ad systems have done. "I was talking about X with a friend, and that evening an ad for X popped up on a web page!" The insidious thing is that it *could* have just been coincidence. You can't prove anything.

Well, this week it happened to me, and I don't even use Facebook. I can't prove anything. But it's deeply disturbing. TL;DR: A blank Facebook account I opened 8.5 years ago and never used received a recommendation, out of the blue, to check out a small store I had only just learned existed and started patronizing.
