URL filtering vulnerabilities in lxml

The lxml toolkit is a library for working with XML and HTML from Python. It includes a utility called `Cleaner` that supports [filtering HTML](https://lxml.de/lxmlhtml.html#cleaning-up-html) for dangerous and unwanted elements and attributes, although since early 2022 it has been [marked as not suitable for security purposes](https://bugs.launchpad.net/lxml/+bug/1958539). Nevertheless, it is still in use -- and while a coworker and I were playing around with it on the Python REPL, we found a couple of URL-filtering vulnerabilities in it.

(Timeline:

- 2022-08-26: Disclosed vulnerabilities to Stefan Behnel by email, offering to make public PRs
- 2022-08-28: Received confirmation that I can just make public PRs
- 2022-08-29: Posted PRs for review
- 2022-10-16: Published blog post
- 2024-01-04: Fix merged into lxml but not yet released.)

<!--more-->

## What Cleaner does

To provide background, let's first take a look at a typical use case: stripping dangerous markup from untrusted HTML while still allowing embedded iframes from a couple of popular video sites:

```python
from lxml.html.clean import Cleaner

cl = Cleaner(frames=False, host_whitelist=['youtube.com', 'vimeo.com'])
print(cl.clean_html('Embedded video: <iframe src="https://youtube.com/video"></iframe>'))
# ↪ <p>Embedded video: <iframe src="https://youtube.com/video"></iframe></p>
```

In that input, the iframe's host is on the allowlist, so the element survives. If the src had instead pointed at `https://evil.com`, the iframe would have been dropped.

In the source code, the method of interest is [`allow_embedded_url` in lxml version 4.9.1](https://github.com/lxml/lxml/blob/lxml-4.9.1/src/lxml/html/clean.py#L482).
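Paraphrasing, the method splits the URL with Python's `urlsplit`, takes the `netloc` (the authority component), chops it at the first colon to strip any port, and checks the result against `host_whitelist` with the `in` operator. A stdlib-only sketch of that logic -- my own simplification, not lxml's exact code:

```python
from urllib.parse import urlsplit

def allow_embedded_url(url, host_whitelist):
    """Simplified sketch of lxml 4.9.1's whitelist check (not the exact code)."""
    # Take the authority, lowercase it, and keep everything before the first
    # colon -- intended to strip off a port number.
    netloc = urlsplit(url).netloc.lower().split(':', 1)[0]
    # `in` is overloaded: membership test on a list, substring test on a string.
    return netloc in host_whitelist

print(allow_embedded_url('https://youtube.com/video', ['youtube.com']))  # ↪ True
print(allow_embedded_url('https://evil.com/video', ['youtube.com']))     # ↪ False
```

Both of the vulnerabilities below live in that one line of parsing.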

## Type confusion + slash confusion

When my coworker was playing around with Cleaner on the Python REPL, I mentioned the risk of parser mismatch, and he started feeding in a couple of examples from [Claroty and Snyk's joint report on URL parsers](https://security.claroty.com/URLparserconfusion). One combination got through. If you start with `host_whitelist=('youtube.com')` -- note the parentheses -- then the malformed URL `https:///youtube.com/video` (note the extra slash before the hostname) is allowed. But it's even worse: this configuration *also* allows the URL `https:///evil.com/more/path`, and a browser handed that URL will load a page from `evil.com`.

The check that's actually executing here is `'' in 'youtube.com'`: `urlsplit` parses a triple-slash URL as having an empty authority, and the empty string is a substring of `'youtube.com'` -- or of any string. (The `in` operator is overloaded for a variety of types: a membership test on a collection, but a substring test on a string.)

### Why this happens

The first problem here partly lies with Python's tuple syntax, which
has a rather unfortunate requirement for how to write single-item
tuples:

- `('foo', 'bar')` evaluates to a 2-tuple
- `('foo',)` evaluates to a 1-tuple
- `('foo')` evaluates to just the string `'foo'`
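The failure mode is easy to demonstrate on its own:

```python
# Parentheses don't make a tuple; the trailing comma does.
assert type(('foo', 'bar')) is tuple
assert type(('foo',)) is tuple
assert type(('foo')) is str

# And `in` is overloaded: membership test on a tuple, substring test on a str.
assert 'youtube.com' in ('youtube.com',)
assert 'tube.com' in 'youtube.com'
assert '' in 'youtube.com'  # the empty string is a substring of every string
```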
With parentheses but no trailing comma, the allowlist is not a collection of strings at all; it's a string.

The second problem is that lxml accepts the misconfiguration: `host_whitelist` is documented as taking a list, but nothing rejects other types, and the overloaded `in` operator keeps the string version from crashing.

Third is that Python's `urlsplit` accepts the malformed URL instead of rejecting it: `urlsplit('https:///evil.com/more/path')` returns `SplitResult(scheme='https', netloc='', path='/evil.com/more/path', query='', fragment='')`, using an empty string for the authority. RFC 3986 makes it clear that the host is [a required, non-empty component](https://datatracker.ietf.org/doc/html/rfc3986#section-3.2) of an http(s) URL, so this arguably should have thrown an exception. However, it is unlikely the Python maintainers will change it; they've been unwilling to reject malformed URLs, and the lenient parsing has been in place as far back as Python 2.7.

And finally, the most popular browsers use a different URL parser,
conforming to WHATWG's alternative view of how URLs should
work. They'll treat `https://////youtube.com` just like
`https://youtube.com`, discarding any number of extra slashes -- so the
triple-slash URL that `urlsplit` sees as having an empty host gets
"fixed" by the browser into a load from `evil.com`.

This is, as usual when it comes to URL-related vulnerabilities, an
example of parser mismatch: two parsers come to different conclusions
about the same URL, in a combination that leads to a vulnerability.

### Implications

First off, this is of course low severity; it relies on a type of misconfiguration which, while easy to make, is still a misconfiguration rather than a flaw in the documented usage. Unfortunately, that mistake is actually pretty common. A GitHub search for `host_whitelist` turns up [plenty of uses](https://github.com/alexvnukoff/project/search?q=host_whitelist), including strings where a collection of strings was intended ([one example](https://github.com/saganshul/owtf-tornado-react-demo/blob/37ca02eda0d1bdf0633dc6d410a3225049189b25/framework/interface/html/filter/sanitiser.py#L29) and [another](https://github.com/TopWebGhost/Angular-Influencer/blob/2f15c4ddd8bbb112c407d222ae48746b626c674f/Projects/miami_metro/platformdatafetcher/customblogsfetcher/webxtractor/webxtractor.py#L468)). Neither of these is *currently* vulnerable, but an incautious edit to allow just a single host would make them so. It doesn't help that some other configuration properties in the wild do accept [either a string or a collection of strings](https://github.com/openplans/community-almanac/search?q=ALLOWED_IFRAME_HOSTS), training people to expect the shortcut to work.

Additionally, the configuration is not robust against other typos: `("youtube.com" "vimeo.com")` evaluates to the single string `'youtube.comvimeo.com'`, and `["host1", "host2" "host3"]` evaluates to `["host1", "host2host3"]`, thanks to implicit string concatenation.
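Implicit concatenation makes these typos silent:

```python
# Adjacent string literals are concatenated at compile time, so a missing
# comma quietly merges two hostnames into one bogus entry.
assert ("youtube.com" "vimeo.com") == 'youtube.comvimeo.com'
assert ["host1", "host2" "host3"] == ['host1', 'host2host3']
```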

### Remediation

One tempting fix is to have lxml repair the configuration: since `host_whitelist` is documented as taking a list, wrap the value in a list if it's a string. This is still dangerous. Remember, the configured value might be `""`; turning it into `[""]` would allow URLs with an empty hostname -- which is exactly what the triple-slash URLs parse to. Best to just deny invalid configs in the first place.

(In Python, you can also use a linter to check for misspelled
1-tuples. That won't catch every variant of this mistake, but it is
still a great idea in general.)
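A minimal sketch of the deny-invalid-configs approach -- `validated_host_whitelist` is a hypothetical helper, not part of lxml's API:

```python
def validated_host_whitelist(value):
    """Return value as a list of hostnames, rejecting strings and empty entries."""
    if isinstance(value, str):
        # A bare string would turn the later `in` check into a substring test.
        raise TypeError("host_whitelist must be a collection of hostnames, not a string")
    hosts = list(value)
    if not all(isinstance(host, str) and host for host in hosts):
        # An empty entry would match the empty netloc of a triple-slash URL.
        raise ValueError("host_whitelist entries must be non-empty strings")
    return hosts

print(validated_host_whitelist(['youtube.com', 'vimeo.com']))
# ↪ ['youtube.com', 'vimeo.com']
```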

## Userinfo confusion

That covers the type confusion and the slash confusion, but `allow_embedded_url` has another problem. In order to determine the scheme and hostname of the URL, it calls `urlsplit`, takes the `netloc` of the result -- the whole authority component -- and then manually splits it at the first colon to strip off any port. But an authority can contain more than a hostname and a port. Here, as an example, is the ugly but valid URL `https://example.com:some:stuff@[2620:0:861:ed1a::1]:123/path`, broken down per RFC 3986:

- Scheme: `https`
- Authority: `example.com:some:stuff@[2620:0:861:ed1a::1]:123`
    - User: `example.com` (split at the first colon in the userinfo)
    - Password: `some:stuff`
    - Host: `[2620:0:861:ed1a::1]` -- an IPv6 address, and let's not even get
      into zone-IDs.
    - Port: `123`
- Path: `/path`

Needless to say, many pieces of code that try to parse an authority component come to different conclusions about the boundaries, since they use handwritten parsers that often don't comply with the spec -- and when two such parsers disagree, pages become vulnerable.

lxml's first-colon split is one of those handwritten parsers. With `host_whitelist=['youtube.com']`, the URL `https://youtube.com:@evil.com/video` passes the check, because everything before the first colon is `youtube.com`. A browser, though, reads that URL as pointing at `evil.com` -- with username `youtube.com` -- so the iframe survives the cleaning and the browser would load a page from `evil.com`.
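Python's `urlsplit` does know how to carve up the authority correctly; the pieces are exposed as properties on the `SplitResult`:

```python
from urllib.parse import urlsplit

parts = urlsplit('https://example.com:some:stuff@[2620:0:861:ed1a::1]:123/path')
print(parts.username)  # ↪ example.com
print(parts.password)  # ↪ some:stuff
print(parts.hostname)  # ↪ 2620:0:861:ed1a::1
print(parts.port)      # ↪ 123

# Naively splitting the raw netloc at the first colon instead mistakes
# the username for the host:
print(parts.netloc.split(':', 1)[0])  # ↪ example.com
```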

### Implications

Happily, browser vendors have been souring on userinfo in web URLs, which blunts this attack. Chrome strips the userinfo entirely on top-level navigation and since version 59 blocks iframe URLs that contain it. Firefox 91 warns at the top level -- a popup asking people if they really meant to log in to `evil.com` -- but still permits it in an iframe. (I'm guessing they'll make this change as well at some point.)

However, the opposite seems to hold for ajax requests: Firefox blocks
a userinfo-bearing XHR, but Chromium allows it (though it does at
least strip out the userinfo). So there's still the possibility (for
now) of tricking the page into loading a resource from an attacker's
host.

### Remediation

This one has an easy fix: instead of taking the `netloc` and manually parsing it, ask for the `hostname` property of the `SplitResult` that `urlsplit` returns -- Python already knows how to split the authority section into userinfo, hostname, and port. That, plus rejecting invalid `host_whitelist` configurations, is what my minimal pull requests do.

What about the other problems, the ones relating to the actual parser mismatch? It is unlikely that Python's `urlsplit` will start rejecting malformed URLs, and the most popular browsers will keep following the WHATWG spec. So there's one obvious answer left: lxml should switch to using an RFC 3986 compliant URL parser -- strict parsers are already available for Python -- and drop any iframes and other embedded objects with invalid URLs.
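The difference between the two approaches, on the malicious URL from above:

```python
from urllib.parse import urlsplit

parts = urlsplit('https://youtube.com:@evil.com/video')
print(parts.netloc.split(':', 1)[0])  # ↪ youtube.com  (first-colon split passes the whitelist)
print(parts.hostname)                 # ↪ evil.com     (the host a browser will actually contact)
```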


