This critical role would not be possible without funding from the OpenSSF Alpha-Omega Project. Massive thank-you to Alpha-Omega for investing in the security of the Python ecosystem!
It was a short week for me with a day of PTO and the Python Software Foundation observing Indigenous Peoples' Day in the USA.
More focused time on CPython's release process! Using diffoscope, a tool that performs a deep diff of files or directories and then summarizes the differences, we can see what differs between the official CPython 3.12.0 release tarballs and the ones I built using GitHub Actions:
│ ├── file list
│ │ @@ -1,4764 +1,4764 @@
│ │ -drwxr-xr-x 0 0 2023-10-02 11:48:14.000000 Python-3.12.0/
...
│ │ +drwx------ 0 thomas (1000) thomas (1000) 0 2023-10-02 12:03:24.000000 Python-3.12.0/
The biggest difference was in the archive metadata. You can see the name of Thomas Wouters, the release manager for 3.12, in the python.org tarball for Python 3.12.0!
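To see where that name lives, here is a minimal sketch using Python's standard-library tarfile module. It builds a tiny in-memory archive whose single member records an owner, then reads the ownership fields back. The member name, the numeric IDs, and the helper names build_sample_tar and list_owners are made up for illustration and are not taken from the real release tarball.

```python
import io
import tarfile


def build_sample_tar():
    """Create a tiny in-memory tar whose member records a user and group."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("Python-3.12.0/README.rst")  # illustrative name
        info.size = len(data)
        info.uid = info.gid = 1000          # illustrative numeric IDs
        info.uname = info.gname = "thomas"  # names are stored as plain text
        info.mode = 0o600
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_owners(tar_bytes):
    """Return (name, uid, uname, gid, gname) for every member of the archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return [(m.name, m.uid, m.uname, m.gid, m.gname) for m in tar.getmembers()]


if __name__ == "__main__":
    for row in list_owners(build_sample_tar()):
        print(row)
```

Anything written into those header fields travels with the archive, which is why a user name from the build machine can show up in a published tarball.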
Tar files save metadata about the user and group just like a filesystem, and the default behavior is to save the calling user's information. I consulted the guide on reproducible-builds.org for archives and the documentation for GNU tar, which has a section on reproducibility. From reading these documents I found the following individual options:
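- Use the timestamp of the release's git commit (git log v3.12.0 -1 --pretty=%ct) as the maximum mtime metadata value for tar. Those options manifest as --mtime and --clamp-mtime.
- Sort the archive entries by name so their order doesn't depend on the filesystem. Use --sort=name for this behavior.
- Set the owner to 0 for both user and group instead of inheriting from the current user of tar. Uses options --owner=0 --group=0 --numeric-owner.
- Pass the --no-name option to the gzip compression subroutine to avoid embedding the name into the gzip stream.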
Combining everything together you get something like this:
$ tar cf Python-3.12.0.tgz \
    --sort=name \
    --mtime= --clamp-mtime \
    --owner=0 --group=0 --numeric-owner \
    --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
    --mode=go+u,go-w \
    --use-compress-program "gzip --no-name -9" \
    ...
You can see the complete pull request to python/release-tools which has all the options together.
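For readers who want to experiment without GNU tar, the same ideas can be sketched with Python's standard library. This is not the code from the release-tools pull request, and its output will not match GNU tar's bytes. It is only a rough illustration of the normalizations: sorted entries, clamped timestamps, zeroed ownership, normalized modes, and a gzip stream without a name or timestamp. The function names normalize and build_reproducible_tgz and their parameters are hypothetical.

```python
import gzip
import io
import os
import tarfile


def normalize(info, max_mtime):
    """Strip machine-specific metadata from one archive member."""
    # Like --owner=0 --group=0 --numeric-owner: numeric IDs only, no names.
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    # Like --mtime=... --clamp-mtime: cap timestamps, never raise them.
    # int() also avoids fractional mtimes, which would add extra PAX headers.
    info.mtime = min(int(info.mtime), max_mtime)
    # Like --mode=go+u,go-w: group/other gain the user's bits, minus write.
    user = (info.mode >> 6) & 0o7
    group = (((info.mode >> 3) & 0o7) | user) & 0o5
    other = ((info.mode & 0o7) | user) & 0o5
    info.mode = (info.mode & 0o7000) | (user << 6) | (group << 3) | other
    return info


def build_reproducible_tgz(src_dir, arc_root, max_mtime):
    """Return .tgz bytes for src_dir that do not depend on who built it or when."""
    rel_names = []
    for dirpath, dirnames, filenames in os.walk(src_dir):
        for name in dirnames + filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), src_dir)
            rel_names.append(rel.replace(os.sep, "/"))

    def clean(info):
        return normalize(info, max_mtime)

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.add(src_dir, arcname=arc_root, recursive=False, filter=clean)
        # Like --sort=name: a fixed order regardless of directory listing order.
        for rel in sorted(rel_names):
            tar.add(os.path.join(src_dir, rel), arcname=f"{arc_root}/{rel}",
                    recursive=False, filter=clean)
    # Like gzip --no-name -9: no file name and a zero timestamp in the header.
    return gzip.compress(tar_buf.getvalue(), compresslevel=9, mtime=0)
```

Calling build_reproducible_tgz twice on the same tree with the same max_mtime should return identical bytes under the same Python version, even if file timestamps on disk changed in between.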
After adding these options I was able to reproduce a source build on GitHub Actions byte-for-byte with the same build locally, and by using reprotest I was able to verify that this process worked for many different filesystem and user scenarios.
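A byte-for-byte claim like this can be checked by comparing digests of the two artifacts. The sketch below is one simple way to do that with hashlib. It is not how reprotest works internally, and the helper names sha256_of and is_byte_identical are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path_a, path_b):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, is_byte_identical could be called with the path of a locally built tarball and the path of the one downloaded from CI. Both paths are whatever your own workflow produces.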
Another improvement I added to the Windows installer builds ensures that the git tag wasn't erroneously or maliciously rewritten upstream at the beginning of the Windows installers build. This change adds another input to the Windows build process, the upstream git commit SHA, and then checks that known-good value against the resolved git tag after the CPython source code is downloaded. This means that the correct git commit is being used and that malicious code can't be injected into the Windows installers through a rewritten tag without being noticed.
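The check itself is small. Here is a minimal sketch of the idea in Python, assuming a local clone with git available on the PATH. The actual Windows build pipeline may implement this differently, and the function names resolve_tag and check_commit are hypothetical. The sketch also assumes full 40-character SHA-1 commit IDs.

```python
import subprocess


def resolve_tag(repo_dir, tag):
    """Return the commit SHA that the tag currently points at in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify",
         f"refs/tags/{tag}^{{commit}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def check_commit(expected_sha, resolved_sha):
    """Raise unless the resolved commit equals the known-good full SHA."""
    expected = expected_sha.strip().lower()
    resolved = resolved_sha.strip().lower()
    # Require a full SHA so an abbreviated prefix can't match the wrong commit.
    if len(expected) != 40 or any(c not in "0123456789abcdef" for c in expected):
        raise ValueError("expected a full 40-character hexadecimal commit SHA")
    if expected != resolved:
        raise RuntimeError(
            f"tag resolves to {resolved}, expected {expected}; refusing to build"
        )
```

A build step would call check_commit with the SHA supplied as a build input and the value returned by resolve_tag, and stop on any exception.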