A Bi Disaster

Lies, damned lies, and performance statistics

I've been spending some time trying to squeeze a little more performance out of my blog, because:

  1. Users love speed more than you might imagine
  2. I can

Write-ups about how to speed Bearblog sites up are definitely in the works. But I wanted to publish a bit of a rant about performance measurement first of all.

You simply have to measure if you want to make websites faster. You just have to. But measuring that performance is really hard.

What you'll know if you've spent enough time optimizing frontend performance is that all performance metrics need to be taken with a grain of salt. There are so many machines with so many qualities of web connection with so many screen sizes in so many places in the world that any performance figure requires interpretation.

I'm running tests using DebugBear, which simulates a mobile client in us-east. Would we get the same result if we ran the same test in us-west? What if we ran it for users in Africa on Opera Mini[1]? All of these questions make it difficult to determine whether the numbers we're seeing really reflect everyone's real-world experience.

So the solution must be to use aggregate data instead of single runs, right? Aggregate data of all of your users turns out to be the hardest of all to analyze. Was the set of users who visited your site on Tuesday equivalent to the set of users who visited your site on Wednesday? Who knows?

Ok, so instead of comparing days, what if we do an A/B test? The same problem still applies. It's difficult to really segment your users in such a way that both groups have the same performance characteristics. Once I ran an A/B test where both arms used the exact same code. I still got a statistically significant difference.
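To get a feel for how a same-code A/B test can still show a gap, here is a minimal sketch. It is not a significance test and not the experiment described above: the page-load numbers are invented (drawn from a long-tailed distribution with assumed parameters), and `split_gap` is a hypothetical helper. It just splits one population in half at random, over and over, and reports how far apart the two group averages land.

```python
import random
import statistics


def split_gap(samples, rng):
    """Randomly split samples in half; return the gap between the two means."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    group_a, group_b = shuffled[:half], shuffled[half:]
    return abs(statistics.mean(group_a) - statistics.mean(group_b))


rng = random.Random(7)  # fixed seed so the run is repeatable
# One invented population of 2,000 page loads (ms), all served the same code.
population = [rng.lognormvariate(6.2, 1.0) for _ in range(2_000)]

# Repeat the random "A/B" assignment 200 times and record each gap.
gaps = [split_gap(population, rng) for _ in range(200)]
print(f"typical gap: {statistics.median(gaps):.0f} ms")
print(f"largest gap: {max(gaps):.0f} ms")
```

Every gap printed here comes from random assignment alone, since both groups are drawn from the same list.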

If you're working on enterprise software where users are accessing your site from work, you might notice that Saturday is the slowest day of all. Fun[2] fact: India has a 6-day workweek, and users there tend to be on slower connections and slower devices than users in the West. So on Saturday, you often notice that performance metrics just seem to degrade.

There's the question of what code you're comparing. Generally speaking, your team is not going to be happy holding back releases for the 2 weeks it takes to run a performance experiment. So you're most likely going to be comparing code at different releases. This might not be the hugest factor if you're on a team with a few engineers with a few changes per release. But if you're working on a team that makes hundreds of changes a day, it's really difficult to know how much impact these code changes have.

And then there's the question of how you actually summarize the data you have. Performance data has a very long tail. Most users will have page loads in, say, 500ms, but then you have users at, say, 5 seconds. And somehow, you always end up with some users whose page loads are on the order of 2 minutes. I typically exclude those users. Even so, the long tail means the average is going to be highly skewed by outliers.
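A minimal sketch of that skew, using invented numbers that mirror the 500ms / 5 second / 2 minute pattern above. The sample list and the 60-second cutoff are assumptions for illustration, not real measurements, and `summarize` is a hypothetical helper.

```python
import statistics

# Invented page-load times in milliseconds: most users are fast, a few are
# slow, and a couple are absurdly slow. These are illustrative, not measured.
loads_ms = [500] * 90 + [5_000] * 8 + [120_000] * 2


def summarize(samples, cutoff_ms=None):
    """Return count, mean and median, optionally dropping samples above cutoff_ms."""
    if cutoff_ms is not None:
        samples = [s for s in samples if s <= cutoff_ms]
    return {
        "count": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
    }


everything = summarize(loads_ms)
trimmed = summarize(loads_ms, cutoff_ms=60_000)  # hypothetical cutoff

print(everything)  # mean is 3250 ms even though 90% of users saw 500 ms
print(trimmed)     # dropping the two 2-minute loads pulls the mean way down
```

The median stays at 500ms in both cases, which is exactly why it says nothing about the slow users.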

So then maybe we should use percentiles. We generally want to improve the experience of people who have a slow connection, so it makes sense to focus on higher percentiles. The general rule of thumb is to use something like the 95th percentile.

I'm not a statistics expert, but my experience is that the 95th percentile tends to have very wild variations. And besides that, what does the 95th percentile even tell us? Exactly what the 95th user out of a hundred is seeing. But what about the other 99%?
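Here is a rough simulation of that wobble, under stated assumptions: every "day" draws 200 page loads from the same long-tailed distribution (a log-normal with made-up parameters), so nothing about the site changes between days. `simulate_day` and `p95` are hypothetical helpers, not part of any tool mentioned here.

```python
import random
import statistics


def simulate_day(rng, visits=200):
    """Draw one day's page loads (ms) from a long-tailed distribution."""
    # lognormvariate(6.2, 1.0) has a median near 500 ms; parameters are assumed.
    return [rng.lognormvariate(6.2, 1.0) for _ in range(visits)]


def p95(samples):
    """95th percentile: the 95th of the 99 cut points that split data into 100."""
    return statistics.quantiles(samples, n=100)[94]


rng = random.Random(42)  # fixed seed so the run is repeatable
days = [simulate_day(rng) for _ in range(30)]

medians = [statistics.median(day) for day in days]
p95s = [p95(day) for day in days]

# How much each summary moves from day to day with identical "code".
median_spread = statistics.stdev(medians)
p95_spread = statistics.stdev(p95s)
print(f"median: day-to-day stdev {median_spread:.0f} ms")
print(f"p95:    day-to-day stdev {p95_spread:.0f} ms")
```

With a tail like this, the 95th percentile rests on a handful of slow samples per day, so it should move around far more than the median does.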

Should we also compare a different percentile? Perhaps. But it's difficult enough to interpret one metric, much less two.

But then if the average is skewed by people at the higher percentiles, and we care about the users at higher percentiles, maybe the average actually is ok to use after all.

I'll tell you, I've tried all of those things and I honestly could not tell you which one is the best. They all tell you different things, and how do you choose which one is telling you what you want to know?

And even setting aside the question of how we measure performance, there's a question of what we're actually measuring in the first place. None of your users are sitting there with a stopwatch waiting for something to appear on the screen and basing their perceptions on that.

Oh, and management understands literally none of this. I can't count the number of times I've been asked, "We have a goal of improving performance by 500ms this quarter. Can you do that?" Sorry, but it just doesn't work that way.

Improve performance by 500ms by what measure? I might be able to run an experiment that shows a 500ms improvement on a given set of days. But that number is going to change if you re-run the experiment on a different set of days.

And I can't just say, "Oh, this change is definitely going to make a 500ms difference, guaranteed." If I could, why would we even be having this discussion at all? I'd just push changes to prod and call it a day.

So what does all of this mean for me? I'm just running performance metrics in us-east on a simulated mobile device on Chrome and calling it a day. Yep, they don't tell me as much as I'd like them to.

  1. Opera Mini turns out to be more popular than you'd imagine at first, particularly in Africa.

  2. I guess it's a not-so-fun fact if you do live in India.

#performance #programming #rant #web apps