This critical role would not be possible without funding from the OpenSSF Alpha-Omega Project. Massive thank-you to Alpha-Omega for investing in the security of the Python ecosystem!
Last week I published a graphic showing the use of memory safe and unsafe systems programming languages in Python packages, which garnered some interest from the community about how I created such a graphic.
The graphic used file extension information, which isn't a perfect method for detecting other programming languages, but it's likely good enough for spotting trends and identifying projects.
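As a rough sketch of the approach (the extension sets here are my own approximation, not necessarily the exact ones behind the graphic), the detection can be as simple as bucketing file paths by extension:

# Rough sketch: bucket a file path by extension. The extension sets
# are my own approximation, not necessarily the ones used for the graphic.
MEMORY_UNSAFE = {"asm", "c", "cc", "cpp", "cxx", "h", "hpp", "f", "f77", "f90", "f95", "for"}
MEMORY_SAFE = {"rs", "go"}

def classify(path: str) -> str | None:
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext in MEMORY_UNSAFE:
        return "memory-unsafe"
    if ext in MEMORY_SAFE:
        return "memory-safe"
    return None  # not a systems-language source file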
What is interesting about this graphic is that it requires access to the files within Python distributions on PyPI, like wheels and source distributions. That's difficult to get without actually downloading each artifact. So how can I query this information for every package since 2005?
I used the same dataset previously to detect vulnerable WebP binaries bundled in Python packages. Let's explore how to use this dataset to answer other questions!
None of this article would be possible without the work of Tom Forbes, who created and continually updates this dataset. Thanks Tom for all your work and for helping me get started.
I'm also doing work on a few different projects regarding Python packaging metadata, namely PEP 639 and "Tracking bundled projects in Python distributions". Having this dataset available gives me a bunch of contextual information for those projects, as well as the ability to track adoption of new packaging metadata.
There was also some emphasis on memory-safe programming languages in the recent US Government RFI, and I was the author of the section regarding memory safety. I wanted to explore the Python package ecosystem's current usage of memory-safe languages like Rust and Go compared to C, C++, and Fortran. From the above graphic it seems there's some interest in using memory-safe languages, which is nice to see.
The need to query this dataset for multiple projects meant it made sense to create a small utility that can be reused, including by others (yay open source!). I created a small Gist that includes this utility. It's not optimized (actually quite slow if you don't use threads when downloading files).
⚠️ WARNING: PLEASE READ! ⚠️
A word of warning before we start blindly downloading all the things, these datasets are all very large, like 30+ GB just for the high-level metadata in Parquet files. Make sure you have enough storage space before copying and pasting any commands you see in this blog post. I don't want to hear that anyone's filled up their hard-drive without knowing. You have been warned! 🐉
With that out of the way, let's get started ourselves! The entire dataset is available under the pypi-data GitHub organization with varying levels of detail all the way from high-level metadata and filenames to actual file contents.
There are many datasets available on py-code.org/datasets. The ClickHouse dataset isn't completely up-to-date, but it's an easy place to play around with the data and get started. We want the complete, up-to-date dataset though, so we need to download things locally; specifically, we want the "Metadata on every file uploaded to PyPI" dataset.
To download the dataset there's a series of curl commands:
$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt")
Two curls in one (at least there's no ... | sudo sh involved...). Let's examine the innermost curl first and use a local copy instead of fetching from the network:
$ curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt" > dataset.txt
$ cat dataset.txt
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-0.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-1.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-10.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-11.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-12.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-13.parquet
https://github.com/pypi-data/data/releases/download/2023-11-12-03-06/index-14.parquet
...
It's a list of URLs that all look legit. Let's download those (this will take some time):
$ curl -L --remote-name-all $(cat dataset.txt)
$ ls
index-0.parquet index-12.parquet index-1.parquet index-4.parquet index-7.parquet
index-10.parquet index-13.parquet index-2.parquet index-5.parquet index-8.parquet
index-11.parquet index-14.parquet index-3.parquet index-6.parquet index-9.parquet
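If the sequential download is too slow, a minimal threaded variant using only the Python standard library might look like this sketch (it assumes the dataset.txt file from above is in the current directory):

# Minimal threaded downloader for the dataset URLs (standard library only).
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

urls = Path("dataset.txt").read_text().split()

def download(url: str) -> str:
    filename = url.rsplit("/", 1)[-1]  # e.g. 'index-0.parquet'
    urlretrieve(url, filename)
    return filename

with ThreadPoolExecutor(max_workers=8) as pool:
    for filename in pool.map(download, urls):
        print("Downloaded", filename)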
In order to take full advantage of this dataset we can query the top-level Parquet metadata and subsequently download the underlying individual files only when necessary. I've created a small helper, as I mentioned earlier (the pycodeorg module below), to assist with these examples.
The dataset uses Parquet, a columnar storage format that can be queried using DuckDB. This is the first project I've used DuckDB for, and from first impressions it seems like a lovely piece of software. Before we start creating our query I like to see what the dataset's fields and types are, so let's run a DESCRIBE:
DESCRIBE SELECT * FROM '*.parquet';
┌─────────────────┬─────────────┬─────────┐
│ column_name │ column_type │ null │
│ varchar │ varchar │ varchar │
├─────────────────┼─────────────┼─────────┤
│ project_name │ VARCHAR │ YES │
│ project_version │ VARCHAR │ YES │
│ project_release │ VARCHAR │ YES │
│ uploaded_on │ TIMESTAMP │ YES │
│ path │ VARCHAR │ YES │
│ archive_path │ VARCHAR │ YES │
│ size │ UBIGINT │ YES │
│ hash │ BLOB │ YES │
│ skip_reason │ VARCHAR │ YES │
│ lines │ UBIGINT │ YES │
│ repository │ UINTEGER │ YES │
├─────────────────┴─────────────┴─────────┤
│ 11 rows 6 columns │
└─────────────────────────────────────────┘
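Since we'll be running several queries like this, the query half of such a helper is essentially a thin wrapper around the duckdb Python package. Here's a minimal sketch, assuming the index-*.parquet files sit in the current working directory (it mirrors how pycodeorg.query() is used later, not the exact implementation in the Gist):

# Sketch of a query helper: run SQL against the local index-*.parquet files.
# Mirrors how pycodeorg.query() is used below; not the exact Gist implementation.
import duckdb

def query(sql: str) -> list[tuple]:
    conn = duckdb.connect()  # in-memory database; Parquet files are read per-query
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# The DESCRIBE above could be run the same way:
# query("DESCRIBE SELECT * FROM '*.parquet'")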
Now that we know the form of the dataset we can make our first query. Let's create a query for projects per file extension and split that by month. That query would look something like this:
SELECT
    -- We're bucketing our data by month and extension --
    datetrunc('month', uploaded_on) AS month,
    regexp_extract(path, '\.([a-z0-9]+)$', 1) AS ext,
    -- DuckDB has native list/object manipulation, pretty cool! --
    LIST(DISTINCT project_name) AS projects
FROM '*.parquet'
WHERE (
    -- Our regex for matching files for languages we care about --
    regexp_matches(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    -- Filter out test files and whole virtual environments --
    -- embedded in Python distributions. --
    AND NOT regexp_matches(path, '(^|/)test(|s|ing)')
    AND NOT contains(path, '/site-packages/')
)
GROUP BY month, ext
ORDER BY month DESC;
With this query and some data massaging we can create this graphic and see how Rust is driving the majority of memory-safe programming language use in binary Python distributions:
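The "data massaging" isn't shown in the post, but a rough sketch of it, using the query() helper sketched above and a hypothetical MONTHLY_EXT_QUERY variable holding the SQL, might look like:

# Sketch: collapse (month, ext, projects) rows into distinct-project counts
# per language per month. The extension -> language grouping is my own.
from collections import defaultdict

EXT_TO_LANG = {
    "rs": "Rust", "go": "Go",
    "c": "C", "h": "C",
    "cc": "C++", "cpp": "C++", "cxx": "C++", "hpp": "C++",
    "asm": "Assembly",
    # Fortran extensions (f, f77, f90, for, ...) would be added similarly.
}

projects_by_lang = defaultdict(set)  # (month, language) -> project names
for month, ext, projects in query(MONTHLY_EXT_QUERY):  # hypothetical names
    lang = EXT_TO_LANG.get(ext.lower())
    if lang is not None:
        # Use sets so a project with both .cc and .cpp files isn't counted twice.
        projects_by_lang[(month, lang)].update(projects)

counts = {key: len(names) for key, names in projects_by_lang.items()}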
Previously it was very difficult to learn about the adoption of new packaging metadata standards and fields due to the prohibitively large bandwidth, storage, and CPU costs that came with downloading an entire swath of PyPI and unpacking the contents only to examine a small METADATA or WHEEL file. However, with this dataset we can write a simple query and fetch only the files we need to answer those questions:
SELECT repository, project_name, path
FROM '*.parquet'
WHERE (
    -- We only want distributions uploaded in --
    -- October 2023 for a recent snapshot. --
    datetrunc('month', uploaded_on) = DATE '2023-10-01'
    -- We want .dist-info/WHEEL files from wheels --
    AND regexp_matches(path, '\.dist-info/WHEEL$')
    -- And files shouldn't be skipped, since we can't call --
    -- get_data() on those, like if they're empty or binaries. --
    -- Pretty unlikely! --
    AND skip_reason == ''
);
We can then substitute this query in for the QUERY variable below:
import re
import pycodeorg

QUERY = ...

# Find all 'WHEEL' metadata files in wheels:
for repo, project, path in pycodeorg.query(QUERY):
    # Fetch the file data from the dataset
    data = pycodeorg.get_data(repo, project, path)

    # Then parse the 'Generator' field and aggregate
    if match := re.search(rb"\nGenerator:\s*([\w]+)", data):
        builder = match.group(1).decode()
        ...
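The elided aggregation at the end could be as simple as a Counter over the builder names; a minimal sketch (not the exact code behind the results mentioned below):

# Sketch: tally the 'Generator' values across all matched WHEEL files.
from collections import Counter

builder_counts = Counter()
for repo, project, path in pycodeorg.query(QUERY):
    data = pycodeorg.get_data(repo, project, path)
    if match := re.search(rb"\nGenerator:\s*([\w]+)", data):
        builder_counts[match.group(1).decode()] += 1

print(builder_counts.most_common(10))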
This query allows me to provide this data, which to my knowledge isn't available elsewhere yet. From it we can answer questions like which wheel builder is most common (bdist_wheel by a wide margin, then poetry and hatch) and which packaging metadata fields are in use. I'm excited to see what other insights folks are able to gather from this dataset!
That's all for this week! 👋 If you're interested in more you can read next week's report or last week's report.