
The 10x Engineer Is a Myth: It's More Like 100x


People love to argue about the so-called “10x engineer”, which in the techie world refers to someone who writes software and produces 10x as much value as the average engineer.

I want to correct something. The 10x engineer is a misnomer: it should really be “the 100x engineer”. Not only is the 100x engineer quite real, but most people wildly underestimate the value provided by the very best talent. One reason for this is that the human brain is very bad at understanding probabilities, distributions, compounding returns, and exponential growth.

You’ve likely heard of the Pareto principle, a well-established rule about the distribution of returns. It holds in software development too, but it masks the reality of the situation, because the distribution is often much closer to 99:1. That is to say, 99% of the returns are generated by 1% of the work or effort.

We can do a tiny bit of math to see how this works with regard to the 10x engineer. If you are a so-called 10x engineer, then for you to generate roughly 80% of the returns in a company, you would need two 1x teammates generating the other 20%: your 10 output parts against their 2 is a split of about 83:17, or roughly 80:20. Let’s summarize this in tabular form:

| Engineer | Input Parts | Multiplier | Output Parts |
|----------|-------------|------------|--------------|
| 10x      | 1           | 10         | 10           |
| 1x       | 2           | 1          | 2            |
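To make the arithmetic behind these tables explicit, here’s a minimal Python sketch of the same calculation (the function name and numbers are mine, chosen to mirror the example above):

```python
# A minimal sketch of the "parts" arithmetic used in the tables that follow:
# each cohort is (multiplier, engineer_count), and output = multiplier * count.

def output_shares(cohorts):
    """Return (output_parts, share_of_output) for each cohort."""
    outputs = [mult * count for mult, count in cohorts]
    total = sum(outputs)
    return [(out, out / total) for out in outputs]

# One 10x engineer alongside two 1x teammates (the example above).
for out, share in output_shares([(10, 1), (1, 2)]):
    print(f"output parts: {out:3d}  share: {share:.2f}")
# output parts:  10  share: 0.83
# output parts:   2  share: 0.17
```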

In practice, however, you’re very unlikely to have a team of 3 people with this kind of distribution, because it’s rare to have a ratio of one 10x engineer for every three engineers. In reality, you might (if you’re very lucky) have one 10x engineer for every 10 or 100 engineers. If we assume one 10x engineer plus ten average engineers, our distribution looks as follows:

| Engineer | Engineer Count | Output Parts | Share of Output |
|----------|----------------|--------------|-----------------|
| 10x      | 1              | 10           | 0.5             |
| 1x       | 10             | 10           | 0.5             |

As we can see, this still isn’t right, and given the rarity of these high-value engineers, it’s much more likely you’ll have far fewer than 1 in 10 of them. So what about one 10x engineer plus 99 average engineers (a team of 100)? Now it looks like this:

| Engineer | Engineer Count | Output Parts | Share of Output |
|----------|----------------|--------------|-----------------|
| 10x      | 1              | 10           | 0.1             |
| 1x       | 99             | 99           | 0.9             |

This doesn’t make sense either, because it’s the opposite of what we expect: most of the returns should come from our high performer. The reason for this is simple: our 10x engineer is actually a 100x engineer, and they’re much rarer than people realize. If we adjust the engineer multiplier from the previous example from 10x to 100x, we get the following:

| Engineer Multiplier | Engineer Count | Output Parts | Share of Output |
|---------------------|----------------|--------------|-----------------|
| 100x                | 1              | 100          | 0.5             |
| 1x                  | 99             | 99           | 0.5             |

Hmm, this is still not the distribution we expect. We can lower the number of 1x engineers to, say, 30, and we’ll get a more realistic result:

| Engineer Multiplier | Engineer Count | Output Parts | Share of Output |
|---------------------|----------------|--------------|-----------------|
| 100x                | 1              | 100          | 0.76            |
| 1x                  | 30             | 30           | 0.23            |

This is still only a ratio of roughly 77:23, which doesn’t even match the Pareto principle’s 80:20 yet. If we drop the number of 1x engineers to just ten, we get the following:

| Engineer Multiplier | Engineer Count | Output Parts | Share of Output |
|---------------------|----------------|--------------|-----------------|
| 100x                | 1              | 100          | 0.90            |
| 1x                  | 10             | 10           | 0.09            |

So now, with a total of 11 engineers, one of whom is a 100x engineer, we get a 90:10 distribution.

Now as I already mentioned, in real life the distribution tends to be closer to 99:1, which is really quite counter-intuitive. In order to get 99:1 with a 100x engineer, we only need one other engineer:

| Engineer Multiplier | Engineer Count | Output Parts | Share of Output |
|---------------------|----------------|--------------|-----------------|
| 100x                | 1              | 100          | 0.99            |
| 1x                  | 1              | 1            | 0.01            |
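As a quick aside, you can invert the same arithmetic to ask how many 1x engineers a single 100x engineer can sit alongside while still holding a given share of the output. This is just algebra on the numbers above, not anything from the original analysis:

```python
# With one 100x engineer (100 output parts) and n 1x engineers,
# the 100x engineer's share is 100 / (100 + n), so n = 100 * (1 - share) / share.

for target_share in (0.80, 0.90, 0.99):
    n = 100 * (1 - target_share) / target_share
    print(f"{target_share:.0%} share -> about {n:.0f} 1x engineer(s)")
# 80% share -> about 25 1x engineer(s)
# 90% share -> about 11 1x engineer(s)
# 99% share -> about 1 1x engineer(s)
```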

This doesn’t seem right at all, and why is that? The answer is pretty simple: imagine that at the organizational level you have an output distribution that matches the Pareto principle. In other words, your Fortune 500 company has about 80% of the returns generated by 20% of the staff.

Now, in real life, that other 80% of the staff does practically nothing, and some of them provide negative value that actually decreases the overall output. To understand this, we need to break our engineer multiplier out into a few more personas (and we’ll call it an “Employee Multiplier”):

| Employee Multiplier | Employee Count | Output Parts | Share of Output |
|---------------------|----------------|--------------|-----------------|
| 100x                | 1              | 100          | 0.83            |
| 1x                  | 10             | 10           | 0.08            |
| 0.5x                | 40             | 20           | 0.16            |
| 0.1x                | 1000           | 100          | 0.83            |
| -1x                 | 10             | -10          | 0               |
| -100x               | 1              | -100         | 0               |
| Total               |                | 120          |                 |

Notice that I’ve added some people who are net negatives, including one person who made a big mistake and cost the company a lot. Their shares are counted as 0 because they don’t contribute to positive returns.

Something weird also happened here: we’ve been calculating the share of output per group rather than per person. At the group level we can sort of see it matching the Pareto principle, but we should really scale this on a per-person basis, as follows:

| Employee Multiplier | Employee Count | Output Parts | Output Per Person | Personal Share of Output | Relative Total Share |
|---------------------|----------------|--------------|-------------------|--------------------------|----------------------|
| 100x                | 1              | 100          | 100               | 0.83                     | 0.98                 |
| 1x                  | 10             | 10           | 1                 | 0.008                    | 0.009                |
| 0.5x                | 40             | 20           | 0.5               | 0.004                    | 0.005                |
| 0.1x                | 1000           | 100          | 0.1               | 0.0008                   | 0.00095              |
| -1x                 | 10             | -10          | 0                 | 0                        | 0                    |
| -100x               | 1              | -100         | 0                 | 0                        | 0                    |
| Total               | 1062           | 120          |                   | 0.8428                   |                      |

Whoa! What happened? The 100x employee in a company of 1062 is producing 98% of the output when adjusted on a per-person basis, even though they account for only 83% of the total output.

To be clear, I made these numbers up; this data is not real. In practice, however, this is the kind of distribution you will find.

Here we’re assuming that roughly 1 in every 1000 employees is the one producing the majority of the value. Only a few people create value when adjusted on a per-person basis; the vast majority of the returns come from the top 0.1% of everyone, once you account for the fact that everyone in the middle produces very little value and that there are always detractors too.
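If you’d like to reproduce the per-person table above, here’s a rough Python sketch using the same made-up persona numbers, with negative contributors clipped to zero as described earlier (the figures come out essentially the same; small differences in the last decimal places are just rounding of intermediate values):

```python
# Rough reproduction of the per-person table, using the made-up personas above.
personas = [  # (multiplier, employee_count)
    (100, 1), (1, 10), (0.5, 40), (0.1, 1000), (-1, 10), (-100, 1),
]

outputs = [m * c for m, c in personas]           # output parts per group
total_output = sum(outputs)                      # 120 (modulo float rounding)
per_person = [max(m, 0) for m, _ in personas]    # negatives counted as 0
personal_share = [p / total_output for p in per_person]
share_sum = sum(personal_share)                  # roughly 0.84
relative_share = [s / share_sum for s in personal_share]

for (m, c), out, p, s, r in zip(personas, outputs, per_person, personal_share, relative_share):
    print(f"{m:>6}x  count={c:<5} output={out:<8g} per-person={p:<6g} "
          f"personal={s:.4f} relative={r:.5f}")
print(f"total employees={sum(c for _, c in personas)}, total output={total_output:g}")
```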

Show Me the Data #

Up till now, this has been nothing but conjecture. Lucky for us, we have an opportunity to look at publicly available data for software projects to demonstrate the 100x engineer (and you’ll see why the 10x engineer doesn’t make sense).

I’m going to examine the number of commits and the number of lines of code changed per account for public GitHub projects. I’ve sourced a list of repositories from this list of top GitHub repos, and I’ll select 200 of the repos from that list.

Before I go any further, I want to address a few of the criticisms preemptively:

  • “number of commits is a terrible metric”: Yes, I agree. But it’s a totally reasonable proxy for productivity at the statistical level.
  • “number of lines of code changed is a terrible metric”: Yes, see above.
  • “some accounts could be bots or just people generating fake activity”: Yes, but it won’t change the results. If you want to go through all the data and remove all the bots and whatnot, you’ll get the same results. If you think I’m wrong, please disprove me, and I’ll happily publish a post here explaining how I’m dumb.
  • “lots of people create value without writing code”: Yes, but it’s not statistically relevant. We’re looking at a snapshot of data, and when you drill up or down, you’ll continue to see the same distribution at the different levels. This is one neat property of the Pareto principle, as the Wikipedia article explains.

So while yes, I agree that this is an imperfect analysis, it’s good enough for this blog post.
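For anyone curious how such per-account counts can be gathered, here’s a rough sketch that pulls commit and line-change totals per author from a local clone using plain git log. It illustrates the general approach rather than the exact pipeline used for this post, and the repository path in the example is hypothetical:

```python
# Rough sketch: per-author commit and line-change counts for one cloned repo,
# parsed from `git log --numstat`. Illustrative only, not this post's pipeline.
import subprocess
from collections import defaultdict

def per_author_stats(repo_path):
    """Return {author_email: {"commits": int, "lines_changed": int}}."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@@%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = defaultdict(lambda: {"commits": 0, "lines_changed": 0})
    author = None
    for line in log.splitlines():
        if line.startswith("@@"):
            author = line[2:]
            stats[author]["commits"] += 1
        elif line.strip() and author is not None:
            added, deleted, _path = line.split("\t", 2)
            if added.isdigit() and deleted.isdigit():  # binary files show "-"
                stats[author]["lines_changed"] += int(added) + int(deleted)
    return stats

# Example usage (the path is hypothetical):
# stats = per_author_stats("/tmp/some-top-repo")
# top = sorted(stats.items(), key=lambda kv: kv[1]["commits"], reverse=True)[:5]
```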

With that out of the way, let’s look at the distribution of commits:

[Figure: Distribution of number of commits per GitHub account across top projects]

Wow! What are we looking at here? Above is a histogram of commits, in which we can see that the vast majority of contributors have made fewer than 1000 commits, while a small (but statistically significant) number of people have made well over 1000.

This makes for a weird-looking histogram, because much of the data to the right (the so-called “fat tail”) is almost impossible to visualize due to the scales involved and the number of pixels available.

One common workaround is to plot the histogram on a logarithmic scale, but that produces a distorted view of the data: you can’t simply ignore the parts you dislike.
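For concreteness, here’s a minimal matplotlib sketch of the two views side by side. The commit counts are simulated from a heavy-tailed distribution purely to show the shape; they are not the real dataset:

```python
# Minimal sketch: commits-per-account histogram, linear vs. log y-axis.
# The data is simulated from a heavy-tailed (Pareto-like) distribution;
# it only illustrates the shape and is NOT the real GitHub dataset.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
commits = np.ceil(rng.pareto(a=1.2, size=10_652) + 1).astype(int)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_lin.hist(commits, bins=100)
ax_lin.set_title("Linear y-axis: the fat tail is invisible")
ax_log.hist(commits, bins=100)
ax_log.set_yscale("log")
ax_log.set_title("Log y-axis: readable, but visually distorted")
for ax in (ax_lin, ax_log):
    ax.set_xlabel("commits per account")
    ax.set_ylabel("number of accounts")
fig.tight_layout()
plt.show()
```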

In practice, people tend to throw away the data they dislike until their analysis starts to reflect their biases. Some people call this “data science”, but a better term would be “data alchemy” or maybe “data religion”.

Even worse is when people shape the data until it matches their preferred distribution, such as the normal distribution. Sadly, this practice is quite common: while many people claim they’re “data-driven”, they really mean they’ve cherry-picked their data until it matches their hypotheses, biases, or preconceived notions.

Okay, now what if we look at number of lines changed? The result is quite similar:

[Figure: Distribution of number of lines changed per GitHub account across top projects]

Same thing here, except you’ll notice the scale for lines_changed is in units of 1e7 (10,000,000), i.e., tens of millions of lines changed.

So the real question is, does this data match my claims above? To answer that, let’s look at the data differently, in tabular form:

| Percentile | Value | Count At or Below | Count Above | Total   | Total Share |
|------------|-------|-------------------|-------------|---------|-------------|
| 20%        | 1     | 3332              | 7320        | 747,956 | 99.6%       |
| 50%        | 3     | 5707              | 4945        | 742,368 | 98.8%       |
| 80%        | 22    | 8550              | 2102        | 717,413 | 95.5%       |
| 99%        | 1267  | 10545             | 107         | 362,867 | 48.3%       |

In our dataset, we have a total of 10,652 samples: contributions from 10,652 separate GitHub accounts across 200 distinct projects. The “Value” column is the number of commits needed to be in that percentile group, “Count At or Below” and “Count Above” are the number of accounts at or below and above that value, “Total” is the sum of all commits made by the accounts above that cutoff, and “Total Share” is the percentage of all commits that this upper portion represents.
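Here’s a sketch of how a table like this can be derived from an array of per-account commit counts. The column logic follows the description above, though the exact computation used for this post may differ:

```python
# Sketch: build the percentile table from per-account commit counts.
# `commits` is assumed to be a 1-D array with one commit count per account.
import numpy as np

def percentile_table(commits, percentiles=(20, 50, 80, 99)):
    commits = np.asarray(commits)
    grand_total = commits.sum()
    rows = []
    for p in percentiles:
        value = np.percentile(commits, p)            # commits needed for this percentile
        at_or_below = int((commits <= value).sum())
        above = commits.size - at_or_below
        total_above = int(commits[commits > value].sum())
        rows.append((p, value, at_or_below, above, total_above, total_above / grand_total))
    return rows

# for p, value, below, above, total, share in percentile_table(commits):
#     print(f"{p}%  value={value:.0f}  at/below={below}  above={above}  "
#           f"total={total:,}  share={share:.1%}")
```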

We can see in the table above that accounts in the bottom 20% (3332 of them) made just one commit each, and the bottom 50% (5707 accounts) made at most three. The top 20% (2102 accounts) have at least 22 commits each, and the top 1% (just 107 accounts) have at least 1267 commits each.

The top 20% (i.e., everyone at or above the 80th percentile) account for 95.5% of all commits, and the top 1% account for nearly 50%. The top 1% cohort is creating more than 1000x as many commits per account as the bottom 20%.

Now what about number of lines of code changed? How does that data look? Let’s see:

| Percentile | Value  | Count At or Below | Count Above | Total       | Total Share |
|------------|--------|-------------------|-------------|-------------|-------------|
| 20%        | 6      | 2265              | 8387        | 475,340,231 | 99.9%       |
| 50%        | 85     | 5330              | 5322        | 475,243,385 | 99.9%       |
| 80%        | 2168   | 8522              | 2130        | 473,527,673 | 99.6%       |
| 99%        | 854361 | 10545             | 107         | 327,472,221 | 68.9%       |

This is even more wild; the numbers are extreme. At the low end, the bottom 20% of people contributed six or fewer lines of code changed each. At the top end, just 1% of people contributed 68.9% of all lines of code changed, and the top 20% contributed 99.6% combined.

All Hail the 100x Engineer #

Should we all worship these 100x engineers? No, probably not, because this isn’t unique to computer people. This distribution is widespread, with the 1% always contributing enormously outsized returns.

What can you do with this information? Not a whole lot. It’s impossible to hire only 100x engineers because you will never know who the 100x engineer is until you examine the data in hindsight. You cannot foretell the future by examining the past. If you could, all economists would be billionaires.

The Code #

If you’d like to pick apart my analysis, the code is available here on GitHub.