URL filtering vulnerabilities in lxml
The lxml toolkit is a library for working with XML and HTML from Python. It includes a utility called Cleaner that supports filtering HTML for dangerous and unwanted elements and attributes, although since early 2022 it has been marked as not suitable for security purposes. Nevertheless, it is still used that way by many projects.
A coworker and I were recently exploring its capabilities. At one point he made a simple mistake that is extremely common in Python, and stumbled onto what I recognized as a vulnerability. Exploring the code more, I found another vulnerability, this one not dependent upon misconfiguration.
(As of this update, a fix is merged into lxml but not yet released.)
What Cleaner does
To provide background, let's first take a look at a typical use case. In this code block, we instruct Cleaner not to strip frames but to only allow embeds with hosts on a given allowlist, in this case a couple of popular video sites:
from lxml.html.clean import Cleaner
cl = Cleaner(frames=False, host_whitelist=['youtube.com', 'vimeo.com'])
print(cl.clean_html('Embedded video: <iframe src="https://youtube.com/video"></iframe>'))
# ↪ <p>Embedded video: <iframe src="https://youtube.com/video"></iframe></p>
In that input, the iframe's src was acceptable, so it was left alone. However, if I set the iframe src to https://evil.com/, the output is <p>Embedded video: </p> -- the frame element is gone entirely.
For simplicity, in this article I'll be giving host_whitelist values and iframe src URLs and indicating whether the URL is accepted or rejected. And if you're following along with the source code, the method of interest is allow_embedded_url in lxml version 4.9.1.
Type confusion + slash confusion
When my coworker was playing around with Cleaner on the Python REPL, I mentioned the risk of parser mismatch and he started feeding in a couple of examples from Claroty and Snyk's joint report on URL parsers. In the course of playing with this, he ended up with the combination of host_whitelist=('youtube.com') and the malformed URL https:///youtube.com/video (note the extra slash before the hostname). This URL was allowed. But it's even worse: this configuration also allows the URL https:///evil.com/. (And yes, browsers will accept those.) What's going on here?
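Before digging into why, here's a minimal reproduction in the style of the earlier example (my own transcript; note that the single host is wrapped in parentheses but has no trailing comma):

from lxml.html.clean import Cleaner
# Looks like a 1-tuple, but the parentheses are redundant: this is just a string
cl = Cleaner(frames=False, host_whitelist=('youtube.com'))
print(cl.clean_html('Embedded video: <iframe src="https:///evil.com/"></iframe>'))
# ↪ the iframe survives: <p>Embedded video: <iframe src="https:///evil.com/"></iframe></p>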
Why this happens
The first problem here partly lies with Python's tuple syntax, which has a rather unfortunate requirement for how to write single-item tuples:
- ('foo', 'bar') evaluates to a 2-tuple
- ('foo') evaluates to a string
- ('foo',) evaluates to a 1-tuple
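You can see the difference at the REPL (my own illustration):

>>> type(('youtube.com', 'vimeo.com'))
<class 'tuple'>
>>> type(('youtube.com'))
<class 'str'>
>>> type(('youtube.com',))
<class 'tuple'>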
If you start with host_whitelist=('youtube.com', 'vimeo.com') and strip it down to host_whitelist=('youtube.com'), you've effectively written host_whitelist='youtube.com'. The allowlist is now a string, not a collection. This mistake is extremely common in Python code.
A second problem is that lxml does not have type checking or coercion for host_whitelist. Some other configuration properties do accept either a string or collection of strings, but not this one.
Third is that Python's urlsplit, rather than rejecting the malformed URL, just tries its best to return something:
>>> from urllib.parse import urlsplit
>>> urlsplit('https:///evil.com/more/path')
SplitResult(scheme='https', netloc='', path='/evil.com/more/path', query='', fragment='')
netloc is its name for the URL authority section (userinfo, hostname, port) and here is returned as an empty string. RFC 3986 makes it clear that when there is a //, the authority section is required, and within that the host is a required, non-empty component. So this should have thrown an exception or returned None rather than a SplitResult.
And finally, the most popular browsers use a different URL parser, conforming to WHATWG's alternative view of how URLs should work. They'll treat https://////youtube.com and https:\\youtube.com as https://youtube.com. So this triple-slash URL is treated as a double-slash URL, but with a "validation error" -- that is then ignored.
Putting this all together, when lxml checks netloc in self.host_whitelist in the above example, what it's actually executing is '' in 'youtube.com', which is true -- the empty string is indeed a substring of 'youtube.com', or of any string. (The in operator is overloaded for a variety of types.)
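The two checks behave very differently (again, my own REPL illustration):

>>> '' in 'youtube.com'
True
>>> '' in ('youtube.com',)
False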
Implications
First off, this is of course low severity; it relies on a type of misconfiguration which, while easy to make, is still a misconfiguration. However, in such a situation, an attacker can simply inject an additional slash in any http/https URL they want to use and it will be accepted; the triple-slash URL will then be interpreted by the browser as having a non-empty hostname.
But in fact, Cleaner's own defaults use an empty tuple, so anyone starting from that is at risk. And using a tuple here is actually pretty common. In a few minutes I was able to find a number of projects using an empty tuple for their whitelist (1 2 3 4 5) and another using an 8-tuple. None of these is currently vulnerable, but an incautious edit to use a single host would make them vulnerable.
Additionally, the configuration is not always local to the call to Cleaner. I found several projects (1, 2) using a configurable host_whitelist, which is theoretically fine except someone might write the setting as "host1, host2, host3" instead of ["host1", "host2", "host3"] and then become vulnerable. This method of providing a list of hosts in a comma-delimited string is common in software, and I was even able to find a project using it with Cleaner.
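If the allowlist does come from a comma-delimited setting, the safe pattern is to split it into a real list before it ever reaches Cleaner. A sketch, with EMBED_HOSTS as a hypothetical setting name:

import os
from lxml.html.clean import Cleaner

# EMBED_HOSTS is a hypothetical setting, e.g. "www.youtube.com, vimeo.com"
raw = os.environ.get('EMBED_HOSTS', '')
# An empty setting yields an empty list, never [''] or a bare string
hosts = [h.strip() for h in raw.split(',') if h.strip()]
cl = Cleaner(frames=False, host_whitelist=hosts)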
Remediation
Since host_whitelist is never supposed to be a string in the first place, the easiest thing to do is to validate in Cleaner's constructor that it is always a list.
It's tempting to instead wrap it in a list if it's a string, but this is still dangerous. Remember, the configured value might be ""; turning it into [""] would allow URLs with an empty hostname. Best to just deny invalid configs in the first place.
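Here's a sketch of the kind of guard I mean, written as a wrapper rather than as lxml's actual constructor code:

from lxml.html.clean import Cleaner

def make_cleaner(host_whitelist, **kwargs):
    # Reject a bare string outright instead of silently treating it as a
    # sequence of characters (which is what makes the substring check possible).
    if isinstance(host_whitelist, str):
        raise TypeError('host_whitelist must be a collection of hostnames, not a string')
    return Cleaner(host_whitelist=list(host_whitelist), **kwargs)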
(In Python, you can also use a linter to check for misspelled 1-tuples. This is not robust against all typos, though; ("youtube.com" "vimeo.com") would also be a single string. However, linters are still a great idea in general.)
That covers the type confusion. What about the other two problems I identified, relating to the actual parser mismatch?
The two obvious approaches I can see are reserialization and strict parsing. But reserialization doesn't help here because urlsplit not only incorrectly allows an empty authority on parse, but also recomposes it with an empty authority:
>>> from urllib.parse import urlsplit, urlunsplit
>>> parts = urlsplit('https:///evil.com/path')
>>> parts.netloc == ''
True
>>> urlunsplit(parts)
'https:///evil.com/path'
Browsers are then happy to "fix up" https:///... URLs in the spirit of "garbage in, something nice out". Both deviate from the spec, and in different ways, but in a combination that leads to a vulnerability.
The remaining option is strictness; urlsplit must raise an exception when encountering invalid URLs. However, it is unlikely the Python maintainers will do this; they've been unwilling to reject malformed URLs in the past, and urlsplit only raises exceptions in the case of certain malformed IP addresses. So there's one obvious answer left: lxml should switch to using an RFC 3986 compliant URL parser, and drop any iframes and other embedded objects with invalid URLs.
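To illustrate that "malformed IP addresses" case: an unbalanced IPv6 bracket is one of the few things urlsplit will reject (my own REPL session).

>>> from urllib.parse import urlsplit
>>> urlsplit('https://[::1')
Traceback (most recent call last):
  ...
ValueError: Invalid IPv6 URL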
I made a minimal pull request to fix this vulnerability by ensuring that host_whitelist is never a string: https://github.com/lxml/lxml/pull/349
Incorrect authority parsing
Given a host_whitelist=['youtube.com'], lxml allows the URL https://youtube.com:@evil.com/, and the browser would load a page from evil.com.
Why this happens
In order to determine the scheme and hostname of the URL, allow_embedded_url performs the following:
scheme, netloc, path, query, fragment = urlsplit(url)
netloc = netloc.lower().split(':', 1)[0]
That is, it asks for the authority component and then manually parses it. But the authority has up to four pieces and three or four different places where colons can be found, and is not totally trivial to parse. Let's take as an example the ugly but valid URL https://example.com:some:stuff@[2620:0:861:ed1a::1]:123/path. Here are the parts:
- Scheme: https
- Authority: example.com:some:stuff@[2620:0:861:ed1a::1]:123
  - Userinfo: example.com:some:stuff. The structure within this isn't really part of the URL specification, and userinfo in URLs is deprecated in browser-land, but nevertheless would currently be interpreted by a browser like so:
    - User: example.com (split at first colon)
    - Password: some:stuff
  - Host: [2620:0:861:ed1a::1] -- An IPv6 address with its required delimiters. Always contains colons. Let's not even get into zone-IDs.
  - Port: 123
- Path: /path
Needless to say, many pieces of code that try to parse an authority component come to different conclusions about the boundaries, since they use handwritten parsers that often don't comply with the spec. In the case of lxml, the first colon is used as a delimiter. Not only would this break for any IPv6 address literal, it also introduces a vulnerability by getting confused by colons in the userinfo component.
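Concretely, here's what that manual split does with the malicious URL from above, compared with the hostname urlsplit itself reports (my own REPL session):

>>> from urllib.parse import urlsplit
>>> parts = urlsplit('https://youtube.com:@evil.com/')
>>> parts.netloc.lower().split(':', 1)[0]   # what gets compared against the allowlist
'youtube.com'
>>> parts.hostname                          # the host a browser would actually contact
'evil.com'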
This is, as usual when it comes to URL-related vulnerabilities, an example of parser mismatch.
Implications
Because of the usefulness in phishing attacks and other naughtiness, browser vendors have been trying for years to get people to stop using userinfo in web URLs. They already require people to click through a warning popup asking them if they really meant to log into evil.com with username youtube.com. Chromium has switched to discarding the userinfo entirely on top-level navigation and since version 59 blocks iframe URLs that contain it. Firefox 91 warns at the top level, but still permits it in the iframe. (I'm guessing they'll make this change as well at some point.)
However, the opposite seems to hold for ajax requests: Firefox blocks a userinfo-bearing XHR, but Chromium allows it (though it does at least strip out the userinfo).
So in either of these popular browsers there's the possibility (for now) of tricking the page into loading a resource from the wrong origin.
Remediation
Fixing this might be as simple as switching to asking for the hostname property of the SplitResult that urlsplit returns. It's already available, and appears to have been available as far back as Python 2.7. And that's what my pull request for fixing this one uses.
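In other words, the allow check boils down to something like this (a sketch of the idea, not the actual patch):

from urllib.parse import urlsplit

def embedded_host_allowed(url, host_whitelist):
    # Sketch only: let urlsplit determine the hostname instead of splitting
    # netloc at the first colon; a missing or empty host is simply rejected.
    parts = urlsplit(url)
    if parts.scheme not in ('http', 'https'):
        return False
    return parts.hostname is not None and parts.hostname in host_whitelist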
Timeline
- 2022-08-24: Discovered tuple pitfall
- 2022-08-25: Discovered userinfo/triple-slash vulnerability
- 2022-08-25: Asked about disclosure process on lxml mailing list
- 2022-08-26: Disclosed vulnerabilities to Stefan Behnel by email, offering to make public PRs
- 2022-08-28: Received confirmation that I can just make public PRs
- 2022-08-29: Posted PRs for review
- 2022-10-16: Published blog post
- 2024-01-04: Fix merged into lxml