URL filtering vulnerabilities in lxml

The lxml toolkit is a library for working with XML and HTML from Python. It includes a utility called Cleaner that supports filtering HTML for dangerous and unwanted elements and attributes, although since early 2022 it has been marked as not suitable for security purposes. Nevertheless, it is still used that way by many projects.

A coworker and I were recently exploring its capabilities. At one point he made a simple mistake that is extremely common in Python, and stumbled onto what I recognized as a vulnerability. Exploring the code more, I found another vulnerability, this one not dependent upon misconfiguration.

(As of this update, a fix is merged into lxml but not yet released.)

What Cleaner does

To provide background, let's first take a look at a typical use case. In this code block, we instruct Cleaner not to strip frames but to only allow embeds with hosts on a given allowlist, in this case a couple of popular video sites:

from lxml.html.clean import Cleaner
cl = Cleaner(frames=False, host_whitelist=['www.youtube.com', 'vimeo.com'])
print(cl.clean_html('Embedded video: <iframe src="https://www.youtube.com/video"></iframe>'))

# ↪ <p>Embedded video: <iframe src="https://www.youtube.com/video"></iframe></p>

In that input, the iframe's src was acceptable, so it was left alone. However, if I set the iframe src to https://evil.com/, the output is <p>Embedded video: </p> -- the frame element is gone entirely.
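
For completeness, here's that rejection spelled out in the same style, reusing the cl object from above:

print(cl.clean_html('Embedded video: <iframe src="https://evil.com/"></iframe>'))

# ↪ <p>Embedded video: </p>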

For simplicity, in this article I'll be giving host_whitelist values and iframe src URLs and indicating whether the URL is accepted or rejected. And if you're following along with the source code, the method of interest is allow_embedded_url in lxml version 4.9.1.

Type confusion + slash confusion

When my coworker was playing around with Cleaner on the Python REPL, I mentioned the risk of parser mismatch and he started feeding in a couple of examples from Claroty and Snyk's joint report on URL parsers. In the course of playing with this, he ended up with the combination of host_whitelist=('youtube.com') and the malformed URL https:///youtube.com/video (note the extra slash before the hostname). This URL was allowed. But it's even worse: This configuration also allows the URL https:///evil.com/. (And yes, browsers will accept those.) What's going on here?
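
Here's a minimal reproduction of that state, mirroring the earlier example (the output is shown in the same form as before; the important point is that the iframe survives):

from lxml.html.clean import Cleaner

# As explained in the next section, ('youtube.com') is actually a string, not a tuple.
cl = Cleaner(frames=False, host_whitelist=('youtube.com'))
print(cl.clean_html('Embedded video: <iframe src="https:///evil.com/"></iframe>'))

# ↪ <p>Embedded video: <iframe src="https:///evil.com/"></iframe></p>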

Why this happens

The first problem here partly lies with Python's tuple syntax, which has a rather unfortunate requirement for how to write single-item tuples:

  • ('foo', 'bar') evaluates to a 2-tuple
  • ('foo') evaluates to a string
  • ('foo',) evaluates to a 1-tuple

If you start with host_whitelist=('youtube.com', 'vimeo.com') and strip it down to host_whitelist=('youtube.com'), you've effectively written host_whitelist='youtube.com'. The allowlist is now a string, not a collection. This mistake is extremely common in Python code.
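
A quick REPL check makes the difference visible:

>>> type(('youtube.com', 'vimeo.com'))
<class 'tuple'>
>>> type(('youtube.com'))
<class 'str'>
>>> type(('youtube.com',))
<class 'tuple'>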

A second problem is that lxml does not have type checking or coercion for host_whitelist. Some other configuration properties do accept either a string or collection of strings, but not this one.

A third problem is that Python's urlsplit, rather than rejecting the malformed URL, just tries its best to return something:

>>> from urllib.parse import urlsplit
>>> urlsplit('https:///evil.com/more/path')
SplitResult(scheme='https', netloc='', path='/evil.com/more/path', query='', fragment='')

netloc is its name for the URL authority section (userinfo, hostname, port) and here is returned as an empty string. RFC 3986 makes it clear that when there is a //, the authority section is required, and within that the host is a required, non-empty component. So this should have thrown an exception or returned None rather than a SplitResult.

And finally, the most popular browsers use a different URL parser, conforming to WHATWG's alternative view of how URLs should work. They'll treat https://////youtube.com and https:\\youtube.com as https://youtube.com. So this triple-slash URL is treated as a double-slash URL, but with a "validation error" -- that is then ignored.

Putting this all together, when lxml checks netloc in self.host_whitelist in the above example, what it's actually executing is '' in 'youtube.com', which is true -- the empty string is indeed a substring of 'youtube.com', or of any string. (The in operator is overloaded for a variety of types.)
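
Reduced to the REPL, the accidental substring test looks like this:

>>> '' in 'youtube.com'     # substring test against a string
True
>>> '' in ('youtube.com',)  # membership test against a tuple, as intended
False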

Implications

First off, this is of course low severity; it relies on a type of misconfiguration which, while easy to make, is still a misconfiguration. However, in such a situation, an attacker can simply inject an additional slash in any http/https URL they want to use and it will be accepted; the triple-slash URL will then be interpreted by the browser as having a non-empty hostname.

But in fact, Cleaner's own defaults use an empty tuple, so anyone starting from that is at risk. And using a tuple here is actually pretty common. In a few minutes I was able to find a number of projects using an empty tuple for their whitelist (1 2 3 4 5) and another using an 8-tuple. None of these is currently vulnerable, but an incautious edit to use a single host would make them vulnerable.

Additionally, the configuration is not always local to the call to Cleaner. I found several projects (1, 2) using a configurable host_whitelist, which is theoretically fine except someone might write the setting as "host1, host2, host3" instead of ["host1", "host2", "host3"] and then become vulnerable. This method of providing a list of hosts in a comma-delimited string is common in software, and I was even able to find a project using it with Cleaner.
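
If a project does want to accept a comma-delimited setting, the safe approach is to split it into a real list before handing it to Cleaner. A minimal sketch, using a hypothetical setting value:

embed_hosts_setting = "www.youtube.com, vimeo.com"  # hypothetical config string

# Convert the comma-delimited string into a proper list, dropping empty entries.
host_whitelist = [h.strip() for h in embed_hosts_setting.split(',') if h.strip()]
# host_whitelist is now ['www.youtube.com', 'vimeo.com']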

Remediation

Since host_whitelist is never supposed to be a string in the first place, the easiest thing to do is to have Cleaner's constructor validate that it is always a list.

It's tempting to instead wrap it in a list if it's a string, but this is still dangerous. Remember, the configured value might be ""; turning it into [""] would allow URLs with an empty hostname. Best to just deny invalid configs in the first place.
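
A minimal sketch of that kind of guard (my own illustration, not lxml's actual code; the function name is made up):

def _check_host_whitelist(host_whitelist):
    # Reject strings outright instead of wrapping them in a list; a stray ""
    # would otherwise become [""] and allow URLs with an empty hostname.
    if isinstance(host_whitelist, (str, bytes)):
        raise TypeError("host_whitelist must be a collection of hostnames, not a string")
    return list(host_whitelist)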

(In Python, you can also use a linter to check for misspelled 1-tuples. This is not robust against all typos, though; ("youtube.com" "vimeo.com") would also be a single string, thanks to implicit string concatenation. However, linters are still a great idea in general.)

That covers the type confusion. What about the other two problems I identified, relating to the actual parser mismatch?

The two obvious approaches I can see are reserialization and strict parsing. But reserialization doesn't help here because urlsplit incorrectly allows an empty authority on parse, and urlunsplit then faithfully recomposes the URL with that empty authority:

>>> from urllib.parse import urlsplit, urlunsplit
>>> parts = urlsplit('https:///evil.com/path')
>>> parts.netloc == ''
True
>>> urlunsplit(parts)
'https:///evil.com/path'

Browsers are then happy to "fix up" https:///... URLs in the spirit of "garbage in, something nice out". Both deviate from the spec, and in different ways, but in a combination that leads to a vulnerability.

The remaining option is strictness; urlsplit must raise an exception when encountering invalid URLs. However, it is unlikely the Python maintainers will do this; they've been unwilling to reject malformed URLs in the past, and urlsplit only raises exceptions in the case of certain malformed IP addresses. So there's one obvious answer left: lxml should switch to using an RFC 3986-compliant URL parser, and drop any iframes and other embedded objects with invalid URLs.
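
In the meantime, a caller can approximate strictness on top of urlsplit by refusing http/https URLs whose authority comes back empty. A rough sketch (my own illustration, not the lxml patch):

from urllib.parse import urlsplit

def split_strict(url):
    parts = urlsplit(url)
    # An http(s) URL must have a non-empty host; treat anything else as invalid.
    if parts.scheme in ('http', 'https') and not parts.hostname:
        raise ValueError("URL has no hostname: %r" % url)
    return parts

split_strict('https:///evil.com/')  # raises ValueError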

I made a minimal pull request to fix this vulnerability by ensuring that host_whitelist is never a string: https://github.com/lxml/lxml/pull/349

Incorrect authority parsing

Given a host_whitelist=['youtube.com'], lxml allows the URL https://youtube.com:@evil.com/, and the browser would load a page from evil.com.

Why this happens

In order to determine the scheme and hostname of the URL, allow_embedded_url performs the following:

scheme, netloc, path, query, fragment = urlsplit(url)
netloc = netloc.lower().split(':', 1)[0]

That is, it asks for the authority component and then manually parses it. But the authority has up to four pieces and three or four different places where colons can be found, and is not totally trivial to parse. Let's take as an example the ugly but valid URL https://example.com:some:stuff@[2620:0:861:ed1a::1]:123/path. Here are the parts:

  • Scheme: https
  • Authority: example.com:some:stuff@[2620:0:861:ed1a::1]:123
    • Userinfo: example.com:some:stuff. The structure within this isn't really part of the URL specification, and userinfo in URLs is deprecated in browser-land, but nevertheless would currently be interpreted by a browser like so:
      • User: example.com (split at first colon)
      • Password: some:stuff
    • Host: [2620:0:861:ed1a::1] -- An IPv6 address with its required delimiters. Always contains colons. Let's not even get into zone-IDs.
    • Port: 123
  • Path: /path
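
For what it's worth, Python's urlsplit accessor properties do decompose this particular example correctly (a quick REPL check):

>>> from urllib.parse import urlsplit
>>> parts = urlsplit('https://example.com:some:stuff@[2620:0:861:ed1a::1]:123/path')
>>> parts.username, parts.password
('example.com', 'some:stuff')
>>> parts.hostname, parts.port
('2620:0:861:ed1a::1', 123)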

Needless to say, many pieces of code that try to parse an authority component come to different conclusions about the boundaries, since they use handwritten parsers that often don't comply with the spec. In the case of lxml, the first colon is used as a delimiter. Not only would this break for any IPv6 address literal, it also introduces a vulnerability by getting confused by colons in the userinfo component.
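
Here is that confusion in miniature, applying those two lines to the malicious URL from above:

>>> from urllib.parse import urlsplit
>>> scheme, netloc, path, query, fragment = urlsplit('https://youtube.com:@evil.com/')
>>> netloc
'youtube.com:@evil.com'
>>> netloc.lower().split(':', 1)[0]  # what lxml compares against the whitelist
'youtube.com'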

This is, as usual when it comes to URL-related vulnerabilities, an example of parser mismatch.

Implications

Because of its usefulness in phishing attacks and other naughtiness, browser vendors have been trying for years to get people to stop using userinfo in web URLs. They already require people to click through a warning popup asking them if they really meant to log into evil.com with username youtube.com. Chromium has switched to discarding the userinfo entirely on top-level navigation and since version 59 blocks iframe URLs that contain it. Firefox 91 warns at the top level, but still permits it in the iframe. (I'm guessing they'll make this change as well at some point.)

However, the opposite seems to hold for ajax requests: Firefox blocks a userinfo-bearing XHR, but Chromium allows it (though it does at least strip out the userinfo).

So in either of these popular browsers there's the possibility (for now) of tricking the page into loading a resource from the wrong origin.

Remediation

Fixing this might be as simple as switching to asking for the hostname property of the SplitResult that urlsplit returns. It's already available, and appears to have been available as far back as Python 2.7. And that's what my pull request for fixing this one uses.
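
Applied to the malicious URL from earlier, the difference is clear:

>>> from urllib.parse import urlsplit
>>> urlsplit('https://youtube.com:@evil.com/').hostname
'evil.com'
>>> urlsplit('https://youtube.com:@evil.com/').username
'youtube.com'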

Timeline

  • 2022-08-24: Discovered tuple pitfall
  • 2022-08-25: Discovered userinfo/triple-slash vulnerability
  • 2022-08-25: Asked about disclosure process on lxml mailing list
  • 2022-08-26: Disclosed vulnerabilities to Stefan Behnel by email, offering to make public PRs
  • 2022-08-28: Received confirmation that I can just make public PRs
  • 2022-08-29: Posted PRs for review
  • 2022-10-16: Published blog post
  • 2024-01-04: Fix merged into lxml
