Reproducible builds for CPython source tarballs

Published 2023-10-10 by Seth Larson

This critical role would not be possible without funding from the OpenSSF Alpha-Omega Project. Massive thank-you to Alpha-Omega for investing in the security of the Python ecosystem!

It was a short week for me with a day of PTO and the Python Software Foundation observing Indigenous Peoples' Day in the USA.

More focused time on CPython's release process! diffoscope is a tool that performs a deep diff of files or directories and summarizes the differences. Using it, we can see how the official CPython 3.12.0 release tarballs differ from the ones I built using GitHub Actions:

│ ├── file list
│ │ @@ -1,4764 +1,4764 @@
│ │ -drwxr-xr-x   0                             0 2023-10-02 11:48:14.000000 Python-3.12.0/
│ │ +drwx------   0 thomas (1000) thomas (1000) 0 2023-10-02 12:03:24.000000 Python-3.12.0/

The biggest difference was in the archive metadata. You can see the name of Thomas Wouters, the release manager for 3.12, in the python.org tarball for Python 3.12.0!
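The ownership fields shown in the file list can also be read with Python's standard `tarfile` module. The following is a minimal sketch, not part of the release tooling: `owner_fields` and `demo_archive` are hypothetical helper names, and the demo archive is built in memory only to have something to inspect.

```python
import io
import tarfile


def owner_fields(tar_bytes: bytes) -> list[tuple[str, str, int, str, int, int]]:
    """Return (name, uname, uid, gname, gid, mode) for each member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        return [
            (m.name, m.uname, m.uid, m.gname, m.gid, m.mode)
            for m in archive.getmembers()
        ]


def demo_archive(uname: str, uid: int) -> bytes:
    """Build a tiny in-memory archive with one directory owned by the given user."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo("Python-3.12.0")
        info.type = tarfile.DIRTYPE
        info.mode = 0o700
        info.uname = info.gname = uname
        info.uid = info.gid = uid
        archive.addfile(info)
    return buffer.getvalue()
```

Calling `owner_fields()` on the bytes of a real tarball would list the same user, group, and mode columns that diffoscope prints, which makes it easy to spot a leaked username before publishing.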

Tar files save metadata about the user and group just like a filesystem, and the default behavior is to save the calling user's information. I consulted the guide on reproducible-builds.org for archives and the GNU tar documentation, which has a section on reproducibility. From reading these documents I collected the individual options needed for a reproducible archive.

Combining everything together you get something like this:

$ tar cf Python-3.12.0.tgz \
    --sort=name \
    --mtime= --clamp-mtime \
    --owner=0 --group=0 --numeric-owner \
    --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
    --mode=go+u,go-w \
    --use-compress-program "gzip --no-name -9" \

You can see the complete pull request to python/release-tools which has all the options together.
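For readers without GNU tar, the same ideas (sorted entries, clamped mtimes, zeroed ownership, normalized modes, and a gzip header without name or timestamp) can be approximated with Python's standard library. This is a sketch under stated assumptions and not the code from the pull request: `normalize_mode`, `reproducible_tgz`, and the `max_mtime` parameter are hypothetical names, the sort order is a plain string sort that may differ from `--sort=name`, and the output is not expected to be byte-identical to what GNU tar produces.

```python
import gzip
import io
import os
import tarfile


def normalize_mode(mode: int) -> int:
    """Approximate --mode=go+u,go-w: copy user bits to group/other, drop their write bit."""
    user_bits = (mode >> 6) & 0o7
    mode |= (user_bits << 3) | user_bits
    return mode & ~0o022


def reproducible_tgz(source_dir: str, arcname: str, max_mtime: int) -> bytes:
    """Archive source_dir under arcname with metadata scrubbed for reproducibility."""

    def scrub(info: tarfile.TarInfo) -> tarfile.TarInfo:
        info.uid = info.gid = 0                          # like --owner=0 --group=0
        info.uname = info.gname = ""                     # like --numeric-owner
        info.mtime = min(int(info.mtime), max_mtime)     # like --clamp-mtime
        info.mode = normalize_mode(info.mode)
        return info

    # Collect every path up front so the order does not depend on the filesystem.
    paths = [source_dir]
    for root, dirs, files in os.walk(source_dir):
        paths.extend(os.path.join(root, name) for name in dirs + files)

    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
        for path in sorted(paths):
            rel = os.path.relpath(path, source_dir)
            name = arcname if rel == "." else f"{arcname}/{rel}"
            archive.add(path, arcname=name, recursive=False, filter=scrub)

    # mtime=0 keeps the build time out of the gzip header; no file name is stored.
    return gzip.compress(tar_buffer.getvalue(), compresslevel=9, mtime=0)
```

Building the archive in memory keeps the sketch short; for a tree the size of CPython, streaming to a file would be the more practical choice.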

After adding these options I was able to reproduce a source build on GitHub Actions byte-for-byte with the same build run locally. Using reprotest, I was also able to verify that this process works across many different filesystem and user scenarios.
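A byte-for-byte comparison of two builds comes down to comparing digests of the resulting files. The sketch below assumes two tarballs already exist on disk; `sha256_file` and `builds_match` are hypothetical helper names, and this does not replace what reprotest does, which is to vary the build environment between runs.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first_path: str, second_path: str) -> bool:
    """True when both build outputs have identical contents."""
    return sha256_file(first_path) == sha256_file(second_path)
```

When the digests differ, a tool such as diffoscope is the natural next step to find out which bytes changed and why.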

Another improvement I added to the Windows installer builds ensures that the git tag wasn't erroneously or maliciously rewritten upstream before the build begins. This change adds another input to the Windows build process, the upstream git commit SHA, and then checks that known-good value against the resolved git tag after the CPython source code is downloaded. This confirms that the correct git commit is being used and that malicious code can't be injected into the Windows installers unnoticed.
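The check described here can be sketched in a few lines: resolve the tag to a commit in the downloaded checkout, then compare it with the SHA supplied as a build input. This is an illustration only and not the actual Windows build configuration. `resolve_tag_commit` and `ensure_expected_commit` are hypothetical names, and the sketch assumes `git` is on the PATH and that the repository uses 40-character SHA-1 object names.

```python
import subprocess


def resolve_tag_commit(repo_dir: str, tag: str) -> str:
    """Ask git which commit the tag points at (peeling annotated tags)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify", f"refs/tags/{tag}^{{commit}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def ensure_expected_commit(resolved_sha: str, expected_sha: str) -> None:
    """Raise if the resolved commit differs from the known-good SHA."""
    resolved = resolved_sha.strip().lower()
    expected = expected_sha.strip().lower()
    if len(expected) != 40:
        # Abbreviated SHAs are rejected so a prefix can never stand in for a commit.
        raise ValueError("expected a full 40-character commit SHA")
    if resolved != expected:
        raise RuntimeError(f"tag resolved to {resolved}, expected {expected}")
```

A build script would call `ensure_expected_commit(resolve_tag_commit(checkout, tag), expected)` right after fetching the source and stop the build if it raises.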

Other items

That's all for this week! 👋 If you're interested in more you can read next week's report or last week's report.


This work is licensed under CC BY-SA 4.0