Welcome to the first installment of "Why URLs are Hard": a series of stories that I've accumulated from reading a lot about URLs.
We take URLs for granted and mostly think of them as very simple things because
of how often we interact with clean and simple URLs like
Little do you know there are decades of ancient dark magic that occurred before
we ended up with URLs we know and love today.
This story is about finding a mysterious API in Python's
and discovering a now almost entirely unused URL feature. Come along with me! :)
Within the documentation it's mentioned that URLs are parsed according to RFC 3986 which is a set of rules that describe how to segment a URL into different components. Let's take a quick look at that standard to see what parts of a URL we see.
There's a cute little ASCII diagram showing off all the parts of a URL:
foo://example.com:8042/over/there?name=ferret#nose \_/ \______________/\_________/ \_________/ \__/ | | | | | scheme authority path query fragment
... and then the
authority section is further decomposed into:
authority = [ userinfo "@" ] host [ ":" port ]
One of the best parts of reading RFCs is thinking about how much effort people put into the adorable ASCII art :)
Okay, now that we know what to expect let's try out
urlparse with the URL from the RFC:
>>> from urllib.parse import urlparse >>> url = ( ... "foo://user:firstname.lastname@example.org:8042" ... "/over/there?name=ferret#nose" ) >>> parts = urlparse(url) >>> parts ParseResult( scheme='foo', netloc='user:email@example.com:8042', path='/over/there', params='', query='name=ferret', fragment='nose' ) >>> parts.hostname 'example.com' >>> parts.port 8042 >>> parts.username 'user' >>> parts.password 'pass'
Okay so looks like we have this as a mapping from
ParseResult to RFC 3986:
Notice the ??? in the list? I was confused too. No matter what I put into my URL I couldn't get
anything to show up in
The documentation for
ParseResult.params is "Parameters for last path element"
and then isn't mentioned much anywhere else. Googling around is tough too because "
is Requests way of adding to the query string for the requested URL so most results are
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference- handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment.
= have special meaning within the
let's throw those into
urlparse and see what happens:
>>> urlparse("http://example.com/a;z=y;x/b;c;d=e") ParseResult( scheme='http', netloc='example.com', path='/a;z=y;x/b', params='c;d=e', query='', fragment='' )
Huh, I didn't expect it to pull the values actually outside of the
And it looks like it only pulled the params from the last segment,
/a;z=y;x/ is untouched.
Wonder how many bugs are lurking out there because of this quirk. :)
So if you're relying on URL parsing and directly inspecting the
path component make sure
you check your implementation and amend it to add
params is non-empty.
Either that or use a URL parser that doesn't have this quirk like
I especially recommend using another library if you're making security decisions based on the URL.
A write-up from 2011 details a security issue related to path parameters
which an application using
ParseResult.path alone would likely also be vulnerable to.
Hope you learned something and stay safe!
Wow, you made it to the end!
If you're like me, you don't believe social media should be the way to get updates on the cool stuff your friends are up to. Instead, you should either follow my blog with the RSS reader of your choice or via my email newsletter for guaranteed article publication notifications.
If you really enjoyed a piece I would be grateful if you shared with a friend. If you have follow-up thoughts you can send them via email.
Thanks for reading!