Patch #42737
openReplacing html-pipeline with Loofah for HTML Filtering
0%
Description
The latest version of html-pipeline
is 3.2.3. Redmine still uses version 2.14.3, which was released in October 2022.
In version 3, html-pipeline replaces Nokogiri with Selma (a Ruby binding for Cloudflare’s lol-html
) as its internal HTML parser. As a result, upgrading to version 3 would require rewrites of existing filters, due to major API differences.
ActionPack already incorporates loofah
, which provides a filtering feature similar to that of html-pipeline
. Adapting filters from html-pipeline 2
to Loofah’s scrubber
feature is expected to be relatively straightforward, since both libraries are based on Nokogiri and share similar APIs. Migrating to loofah
instead of upgrading html-pipeline helps avoid the complexity of having two different HTML parsers (Nokogiri and Selma) in the same application.
For HTML sanitization, we will continue to use the sanitize
gem instead of Rails' built-in rails-html-sanitizer
, as it offers greater flexibility and aligns with the defaults previously used by html-pipeline 2
.
Additionally, since html-pipeline 2 uses Nokogiri’s HTML4 parser internally, and the Rails team recommends HTML5 parsing , this update also includes a transition to an HTML5-compatible parser.
The patch can be applied to r23780
Files
Related issues
Updated by Marius BĂLTEANU 18 days ago
Thanks Takashi Kato for posting this patch.
When I did the update to CommonMark 2.3.0 (#40197), I tried also to update the html_pipeline
to version 3 without success. I was able to reimplement in Selma some of the filters, but I got stuck on FixupAutoLinksFilter
because I wasn't able to find a way to navigate to previous element. The Selma parser seems to be faster, but also less flexible. Also, as you already mentioned, we would ended up with two parses.
I'm in favour of this change because we cannot stay with this old version 2 for too long and I don't see too many disadvantages if we drop it, but I'm not sure if we should deliver it in version "6.1.0" or in a future Redmine 7.0.0.
Holger Just, Jens Krämer being a feature that you worked on, what do you think about this?
Updated by Marius BĂLTEANU 18 days ago
- Assignee set to Marius BĂLTEANU
- Target version set to Candidate for next major release
Updated by Marius BĂLTEANU 18 days ago
- Related to Patch #40198: Update html-pipeline to 3.0 added
Updated by Jens Krämer 17 days ago
Generally seems to be a good direction to go, especially in terms of reducing the number of external dependencies. I would keep the file / class names (Filter vs Scrubber) as-is since there is not really a need to change them.
This chunk in the sanitization_filter_test looks like it downgrades the expected level of escaping in an HTML attribute, but I am not sure if it actually could be a problem:
def test_should_filter_tags
@@ -137,7 +135,7 @@ if Object.const_defined?(:Commonmarker)
],
[
'Lo<!-- comment -->rem</b> <a href=pants title="foo>ipsum <a href="http://foo.com/"><strong>dolor</a></strong> sit<br/>amet <script>alert("hello world");',
- 'Lorem <a href="pants" title="foo>ipsum <a href="><strong>dolor</strong></a> sit<br>amet '
+ 'Lorem <a href="pants" title="foo>ipsum <a href="><strong>dolor</strong></a> sit<br>amet '
],
Updated by Takashi Kato 16 days ago
The difference in attribute value escaping behavior is due to the change of Nokogiri's parser from HTML4 to HTML5.
This was done intentionally by the Nokogiri developers to comply with HTML standards.
However, the standard was changed last week (as Nokogiri's maintainer informed us).
Because of this, Nokogiri's behavior may change in the near future. We will have to wait and see.