This critical role would not be possible without funding from the OpenSSF Alpha-Omega Project.
Massive thank-you to Alpha-Omega for investing in the security of the Python ecosystem!
This is the second weekly report for the Security Developer-in-Residence role. If you missed the first one, you can read it here.
Python is known for its ability to be a "glue" language thanks to its C API and access to libraries written in C, C++, Go, Fortran, and Rust. This feature is likely one of the reasons for Python's massive popularity and makes Python libraries like numpy, pandas, and pydantic super-fast thanks to some of their code being written in a more CPU-performant language.
This super-power of Python has implications for supply chain security. Let's go through why together:
The Python Package Index (PyPI) only hosts Python distributions, usually one of two types: source distributions and wheel distributions.
Python distributions which want to leverage compiled libraries either need to have users install those compiled libraries themselves or ship a pre-compiled library with the distribution. Having to install compiled libraries with your system package manager and then compile each package from source is frustrating to users.
A common workflow for binary wheels is to build wheels for their many different OSes and architectures (manylinux, musllinux, macOS, Windows, etc) and then run auditwheel repair (or delocate for macOS and delvewheel for Windows) so that the libraries that need to get bundled are bundled into the wheel automatically.
This combination means that building binary wheels for all the different OSes and architectures has never been easier! But that also means there are a lot of bundled libraries out there.
The kicker is that those bundled libraries can have vulnerabilities too!
An example of this happening is when pdftopng shipped vulnerable versions of several libraries bundled in its wheel. There's also this related issue for the PyPI Advisory database about vulnerabilities in shared libraries.
These bundled libraries don't show up in your pip freeze output, so it's tougher for you and your audit tooling to know what libraries and versions are in use.
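To see what a wheel actually bundles, you can open it as a zip archive and look for shared libraries. This is a minimal sketch, assuming the repair tools place bundled libraries in a `<name>.libs/` directory (auditwheel, delvewheel) or a `.dylibs/` directory (delocate). The function name and the file names in the demo are hypothetical.

```python
import io
import re
import zipfile

# Assumption: bundled libraries live in "<name>.libs/" or ".dylibs/" directories.
_BUNDLE_DIR = re.compile(r"(^|/)([^/]+\.libs|\.dylibs)/")
# Matches names ending in .so, .so.<version>, .dylib, or .dll.
_SHARED_LIB = re.compile(r"\.(so(\.[0-9A-Za-z_]+)*|dylib|dll)$")


def bundled_libraries(wheel):
    """Return sorted archive paths of shared libraries bundled inside a wheel."""
    with zipfile.ZipFile(wheel) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if _BUNDLE_DIR.search(name) and _SHARED_LIB.search(name)
        )


# Demo with an in-memory stand-in for a wheel (all file names are made up).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("example_pkg/__init__.py", "")
    zf.writestr("example_pkg/_ext.cpython-311-x86_64-linux-gnu.so", b"")
    zf.writestr("example_pkg.libs/libexample-0a1b2c3d.so.1.2.3", b"")

print(bundled_libraries(buf))
```

The extension module `_ext...so` is skipped on purpose: it belongs to the package itself, whereas the file under `example_pkg.libs/` is the kind of bundled dependency that pip freeze never reports.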
Software Bill of Materials (SBOM) to the rescue! With an SBOM, you can programmatically know what is included in the distribution you've downloaded, including the non-Python components. If auditing tools have access to these SBOMs and a source of vulnerabilities for the relevant components, you can check that distributions aren't vulnerable, including their sub-components!
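As a rough sketch of what such a check could look like, the snippet below walks the components of a CycloneDX-style JSON document and looks each one up in an advisory table. The component names, versions, advisory ID, and the `vulnerable_components` helper are all made up for illustration, and a real tool would match on more than an exact name and version pair.

```python
import json

# A trimmed CycloneDX-style document; names and versions are hypothetical.
SBOM_JSON = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "example-pkg", "version": "2.0.0"},
    {"type": "library", "name": "libexample", "version": "1.2.3"}
  ]
}
"""

# Hypothetical advisory source keyed by (name, version).
KNOWN_VULNERABLE = {("libexample", "1.2.3"): ["EXAMPLE-2023-0001"]}


def vulnerable_components(sbom, advisories):
    """Return (name, version, advisory) for every component with an advisory."""
    findings = []
    for component in sbom.get("components", []):
        key = (component["name"], component["version"])
        for advisory in advisories.get(key, []):
            findings.append((component["name"], component["version"], advisory))
    return findings


sbom = json.loads(SBOM_JSON)
print(vulnerable_components(sbom, KNOWN_VULNERABLE))
```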
I wanted to figure out which bundled libraries are common amongst the top 500 Python packages that provide binary wheels, so I created a project to gather the dataset, along with some pre-computed data so you can play around yourself without needing to download 75GB to your machine. Here are the results
after normalizing the names of the
It's not only binary libraries, either. Packages like
pip bundle tons of Python libraries with their
source code and
find vulnerabilities for those bundled projects, too.
But wait, it can't be that simple, can it? Let's talk about downstream re-packagers.
Using the same dataset, let's pick on a library that sorts near the top alphabetically: aerospike. This package bundles libssl.so.1.0.2k with its manylinux2014_x86_64 wheels. For anyone who's not aware, libssl.so is one of the shared libraries for OpenSSL, so seeing that version should be raising some alarm bells.
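Pulling a name and version out of a bundled file name is mechanical. Here is a minimal sketch, assuming Linux-style `lib<name>.so.<version>` names and assuming that auditwheel may insert a `-<8 hex characters>` hash before `.so` when it copies a library into the wheel. The helper name is hypothetical.

```python
import re

# Assumption: names look like "libfoo.so.1.2.3" or "libfoo-0a1b2c3d.so.1.2.3".
_LIB = re.compile(
    r"^(?P<name>lib[^.]+?)(?:-[0-9a-f]{8})?\.so(?:\.(?P<version>.+))?$"
)


def split_library_filename(filename):
    """Return (name, version) for a shared library file name, or None."""
    match = _LIB.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


print(split_library_filename("libssl.so.1.0.2k"))
print(split_library_filename("libssl-3e69114b.so.1.0.2k"))
```

Both calls yield the same name and version, which is exactly the information a naive scanner would compare against a list of vulnerable OpenSSL releases.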
But fear not: if we examine how that wheel was built, we can see that they're using cibuildwheel, which in turn uses the official manylinux2014_x86_64 container image, which is based on CentOS 7. CentOS 7 uses OpenSSL 1.0.2k as a base but patches known vulnerabilities in its libraries while crucially maintaining the existing library version number and API backwards-compatibility, so in theory the bundled library is free of those vulnerabilities thanks to CentOS package maintainers.
Why is this a problem for auditing wheels for vulnerabilities? Because if a machine sees "OpenSSL 1.0.2k" it won't know whether that's a genuine OpenSSL 1.0.2k build (vulnerable) or one that's patched by a repackager.
Package URLs provide a way to identify the differences between the source project and a repackaged build of the same project. For example, here are two different package URLs for OpenSSL:
Using the package URL we can help disambiguate and provide more context to vulnerability detection tools.
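To make the difference concrete, here is a simplified package URL parser (no validation, standard library only). The two example URLs are illustrative and not taken from this post: the `generic` and `rpm` types are real package URL types, but the release suffix and `distro` qualifier are assumptions chosen for the demo.

```python
from urllib.parse import parse_qsl, unquote


def parse_purl(purl):
    """Split a package URL into its parts (simplified sketch, no validation)."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError("not a package URL: %r" % purl)
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    path, _, version = rest.partition("@")
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError("package URL needs a type and a name: %r" % purl)
    return {
        "type": parts[0],
        "namespace": "/".join(parts[1:-1]) or None,
        "name": unquote(parts[-1]),
        "version": unquote(version) or None,
        "qualifiers": dict(parse_qsl(query)),
        "subpath": subpath or None,
    }


# Illustrative identifiers for an upstream build and a repackaged build.
upstream = parse_purl("pkg:generic/openssl@1.0.2k")
repackaged = parse_purl("pkg:rpm/centos/openssl@1.0.2k-19.el7?distro=centos-7")

print(upstream["type"], upstream["version"])
print(repackaged["type"], repackaged["namespace"], repackaged["version"])
```

Both identifiers share the name `openssl`, but the type, namespace, and version string tell a scanner whether to consult upstream advisories or the repackager's own.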
That's not the end of the problems. There are still many tough ones to figure out:
I'm just starting to work with others interested in adding SBOMs to Python distributions, so there will be more updates on this in the future.
Now that my PR for gathering metrics on trusted publishers (called OIDC publishers internally) has been deployed, I can start to share some numbers with you all.
The metrics that have been added distinguish between projects which have only configured Trusted Publishers and ones that have successfully published a release with a Trusted Publisher. There are also separate metrics for projects which have been marked "critical" due to downloads.
|Configured Trusted Publishers||Published with Trusted Publishers|
From the PyPI 2FA Dashboard, there are at the time of writing 4,641 critical projects and 465,860 total projects on PyPI, meaning ~2.8% of critical projects have configured Trusted Publishers. Trusted Publishers were introduced back in late April, around 78 days ago, meaning that on average ~20 projects and ~2 critical projects adopt Trusted Publishers per day.
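The arithmetic behind those averages can be sketched as follows. Only the totals, the ~2.8% share, the 78 days, and the ~20 per day rate come from the text; the counts computed here are back-of-envelope values implied by those rounded figures, not official numbers.

```python
CRITICAL_PROJECTS = 4641   # from the 2FA Dashboard figure quoted above
DAYS_SINCE_LAUNCH = 78     # late April to the time of writing
CRITICAL_SHARE = 0.028     # the "~2.8%" quoted above

# Implied by the rounded share, so treat it as approximate.
implied_critical = round(CRITICAL_SHARE * CRITICAL_PROJECTS)
critical_per_day = implied_critical / DAYS_SINCE_LAUNCH

# "~20 projects per day" over the same window implies roughly this many projects.
implied_projects = 20 * DAYS_SINCE_LAUNCH

print(implied_critical, round(critical_per_day, 1), implied_projects)
```

The critical-project rate comes out a little under two per day, which is where the "~2" rounds from.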
I'll be publishing some dashboards to track these metrics over time publicly soon.
In the list of supported secret types, PyPI API tokens don't currently support "Push Protection". This means that GitHub won't outright reject a commit that contains a PyPI API token; instead, the commit will always go through and then the secret will be revoked shortly after. Having the push rejected is a better user experience than having it go through: you don't have to generate a new API token, and the token has zero exposure time. So I wanted to see how difficult it'd be to get Push Protection enabled for PyPI tokens.
I emailed GitHub's Secret Scanning team and after a few hours I received this response:
We will need to analyze the performance of your tokens to ensure they meet the threshold for Push Protection (they will need to have <1% false positives). We will reach out once we have done the review and let you know next steps.
I'm hopeful that given PyPI API tokens regex pattern of
that the false positive rate would be low (thanks Macaroon prefixes!) but we shall see.
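To illustrate why a Macaroon prefix keeps false positives low, here is a sketch of a token matcher. It assumes a PyPI token is `pypi-` followed by a base64url-encoded Macaroon whose first bytes are the version, a location field tag, and the 8-byte location `pypi.org`, and that at least 70 more base64url characters follow. Both are assumptions for this example and may differ from the pattern GitHub actually uses.

```python
import base64
import re

# Assumption: version 2 Macaroon, location field (0x01), length 8, "pypi.org".
MACAROON_HEADER = b"\x02\x01\x08pypi.org"
PREFIX = "pypi-" + base64.urlsafe_b64encode(MACAROON_HEADER).decode().rstrip("=")

# Assumption: at least 70 URL-safe base64 characters follow the fixed prefix.
TOKEN_RE = re.compile(re.escape(PREFIX) + r"[A-Za-z0-9_-]{70,}")


def find_tokens(text):
    """Return every substring of text that looks like a PyPI API token."""
    return TOKEN_RE.findall(text)


# A fake token built only to exercise the pattern; it is not a real credential.
fake_token = PREFIX + "A" * 70
print(find_tokens("password = '" + fake_token + "'") == [fake_token])
print(find_tokens("pypi-not-a-token"))
```

Because the fixed prefix is about twenty characters long before the random part even starts, ordinary strings are very unlikely to match by accident.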
If this topic interests you, there's an open ticket for implementing secret scanning report APIs for GitLab.