jacob@stephens.page
Notes · Measurement

Byte Counts Lie: Measuring What I Actually Wrote

June 13, 2026 · ~640 words · Git authorship measurement

I built stephens.page/stack to show the distribution of languages and tools across every repository I work in. The first version summed GitHub's per-language byte counts and PHP towered over everything. The trouble is I never wrote a large share of that PHP. The page was measuring repository size and calling it me.

The offender is one repository: the reservations platform I have worked on at Educational Travel Adventures since February 2022, a codebase that has been in development since the 1990s. GitHub's Linguist sees about 14.5 million bytes of PHP in it. Crediting all of that to me would be a false claim on a page whose whole point is to be honest. So the metric had to change from "how many bytes of each language sit in my repos" to "how much of each language did I actually write."

The two attempts that didn't work

The obvious first move is GitHub's contributor statistics API, which reports additions and deletions per author. For a repository this large it returned all-zero additions - GitHub silently declines to compute the stat past a size threshold. Dead end.
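A minimal sketch of how the all-zero case can be detected, so a caller knows to ignore the stat rather than trust it. It assumes the JSON from GitHub's contributor statistics endpoint has already been fetched and parsed into a list of dicts, each with a total commit count and a weeks list whose entries carry a (additions), d (deletions), and c (commits). The function name is hypothetical. GitHub's REST documentation notes that this endpoint reports zero additions and deletions for repositories with 10,000 or more commits, which may be the threshold hit here.

```python
def additions_look_suppressed(contributors):
    """Return True when commits exist but every weekly additions value is zero.

    `contributors` is the parsed JSON list from the contributor statistics
    endpoint (assumed shape: [{"total": int, "weeks": [{"a": int, ...}]}]).
    """
    total_commits = sum(c.get("total", 0) for c in contributors)
    total_added = sum(
        week.get("a", 0)
        for c in contributors
        for week in c.get("weeks", [])
    )
    # Real history with commits but no additions at all is not plausible,
    # so treat it as "stat not computed" rather than as a measurement.
    return total_commits > 0 and total_added == 0
```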

The second attempt was churn: git log --numstat, summing the lines each author added over all history. That standard "contribution" measure was badly polluted. Individual commits added between two and four and a half million lines at a stroke, none of them hand-written. Worse, the history begins on 2023-09-08, the day the project migrated from Bitbucket to GitHub, importing roughly 28 years of inherited code across dozens of same-day commits under my account. Churn credited me with 56% of the PHP. Still a lie, just a smaller one.
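For reference, the churn measure can be sketched as a small parser. This assumes the log was produced with git log --numstat --format=@%ae, so each commit starts with a line of "@" plus the author email, followed by tab-separated added, deleted, and path columns. The "@" prefix and the function name are illustrative choices, not the generator's actual code.

```python
from collections import Counter


def churn_by_author(log_text):
    """Sum added lines per author email from numstat output.

    Expects text from: git log --numstat --format=@%ae
    """
    added = Counter()
    author = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Start of a new commit: remember who authored it.
            author = line[1:].strip()
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue
        # Binary files report "-" instead of a line count; skip them.
        if parts[0].isdigit():
            added[author] += int(parts[0])
    return added
```

This counts every line ever added, including lines later deleted and whole imported trees, which is exactly why the number overstates authorship.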

Blame the surviving lines, not the churn

The honest unit is current-tree ownership: of the lines that exist today, who wrote each one. That is what git blame answers, and it ignores churn that was later deleted. But two things stood in the way.
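A minimal sketch of counting surviving lines per author, assuming the input is the text of git blame --line-porcelain for one file. In that format every line of the file gets its own header block, including an author-mail field, and the file's content lines are prefixed with a tab. The function name is hypothetical.

```python
from collections import Counter


def blame_line_counts(porcelain_text):
    """Count current lines per author email.

    Expects text from: git blame --line-porcelain <file>
    """
    counts = Counter()
    for line in porcelain_text.split("\n"):
        # Content lines start with a tab, so file text that happens to
        # contain "author-mail" cannot be mistaken for a header line.
        if line.startswith("author-mail "):
            email = line[len("author-mail "):].strip().strip("<>")
            counts[email.lower()] += 1
    return counts
```

Summing these counters over every tracked file gives current-tree ownership by author.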

First, because the history starts at the 2023 migration, every inherited line blames to the import commit - me - rather than to whoever wrote it years earlier. The fix was to recover the project's pre-GitHub history, which lived in three Bitbucket repositories going back to a first commit in 2014. For any current line still blamed to a migration-day commit, I re-attributed it by my real share of authorship in the pre-2023 Bitbucket snapshot, measured, again, by blame.

Second, the migration reshuffled files into new subdirectories, so matching a file across the 2023 seam by path mostly fails. Rather than chase renames, I attribute inherited lines by my per-language share of the old snapshot as a whole, which survives the reorganization.
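The re-attribution across the seam reduces to simple arithmetic per language. A sketch, assuming three inputs that are already computed: lines blamed to my post-migration commits, lines blamed to migration-day commits, and my blame share of the old Bitbucket snapshot. The function name, argument names, and all numbers in the usage are illustrative, not figures from the real repository.

```python
def reattribute(direct_mine, migration_day, old_share):
    """Estimate my lines per language across the migration seam.

    direct_mine:   {language: lines blamed to my post-migration commits}
    migration_day: {language: lines blamed to migration-day commits}
    old_share:     {language: my fraction (0..1) of the old snapshot}
    """
    languages = set(direct_mine) | set(migration_day)
    return {
        lang: direct_mine.get(lang, 0)
        # A language with no measured share defaults to 0.0, so
        # unattributable inherited lines are never credited to me.
        + round(migration_day.get(lang, 0) * old_share.get(lang, 0.0))
        for lang in languages
    }


# Illustrative numbers only.
estimate = reattribute(
    direct_mine={"PHP": 1000},
    migration_day={"PHP": 10000, "JavaScript": 5000},
    old_share={"PHP": 0.25},
)
```

Because the share is per language rather than per file, the estimate does not depend on matching paths across the reorganization.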

A few details that mattered:

  1. Exclude the bulk dumps. Any commit adding more than 20,000 lines is an import, a vendored dependency, a committed cache, or a backup - never hand-authored. Those lines count toward the codebase total but never toward me.
  2. Exclude vendored paths. vendor/, node_modules/, bundled libraries, and minified files are not my authorship in any repo.
  3. Resolve identity carefully. My commits span several names and emails, and AI agents I built committed under my own email. I match my human identities and subtract the agent ones.
  4. Bias conservative. Where the migration flattened a file's history so blame can't recover the real author, I count those lines as not mine. A portfolio number should under-claim, never over-claim.
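The first three rules can be sketched as small predicates. The 20,000-line threshold and the vendor/ and node_modules/ paths come from the list above. The regular expression, the example email, the agent name, and the idea of telling agents apart by author name are all assumptions for illustration, since the agents shared my email.

```python
import re

BULK_THRESHOLD = 20_000  # lines added in one commit; from the rule above

# Assumed pattern: vendored directories anywhere in the path, plus
# minified assets. A real list would also name bundled libraries.
VENDORED = re.compile(r"(^|/)(vendor|node_modules)/|\.min\.(js|css)$")

MY_EMAILS = {"me@example.com"}    # hypothetical identities
AGENT_NAMES = {"example-agent"}   # hypothetical agent author names


def is_bulk_commit(lines_added):
    """A commit this large is an import or dump, never hand-authored."""
    return lines_added > BULK_THRESHOLD


def is_vendored(path):
    """True for paths that should never count as anyone's authorship."""
    return VENDORED.search(path) is not None


def is_me(name, email):
    """Match my human identities and exclude agents sharing my email."""
    return email.lower() in MY_EMAILS and name.lower() not in AGENT_NAMES
```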

The result, and the surprise

Measured this way, I authored about 12.9% of the inherited platform's current lines - roughly 263,000 lines that are genuinely mine - and its PHP specifically drops from the churn estimate of 56% to 27%. Each repository's language bytes on the page are now scaled by the share I wrote; repositories I created count fully, and only the inherited one is discounted.
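The scaling step can be sketched as follows, assuming each repository record carries its Linguist byte counts and an optional per-language authored share. A missing share means 1.0, matching "repositories I created count fully". The record layout and the second repository's numbers are illustrative. The 14.5 million bytes and the 27% PHP share are the figures quoted in this post.

```python
def weighted_language_bytes(repos):
    """Sum language bytes across repos, scaled by the share I authored."""
    totals = {}
    for repo in repos:
        shares = repo.get("authored_share", {})
        for lang, size in repo["bytes"].items():
            # Repos without a measured share are treated as fully mine.
            share = shares.get(lang, 1.0)
            totals[lang] = totals.get(lang, 0) + round(size * share)
    return totals


repos = [
    # The inherited platform: PHP discounted to the blame-based share.
    {"bytes": {"PHP": 14_500_000}, "authored_share": {"PHP": 0.27}},
    # A hypothetical repository I created: counted in full.
    {"bytes": {"PHP": 100_000, "Python": 50_000}},
]
totals = weighted_language_bytes(repos)
```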

The surprise was where the weighting bit. It barely moved PHP's share of my overall volume. What it cut was JavaScript, from about 29% of the total to 5%. Almost all of that codebase's JavaScript was inherited or vendored, and the byte count had been quietly crediting me with it. The distortion wasn't where I expected.

The principle generalizes past one page: a byte count measures the size of a repository, not your contribution to it. The two match only for code you started yourself. Everywhere else, the honest measure is blame over the complete history - and that history may not live where the code does now.

Drafted from the generator and my working notes with help from Claude, which also did much of the implementation. The method, the decisions, and the judgment calls are mine; the prose is collaborative.