doc: Warn that ts_headline() output is not HTML-safe.

Add a documentation warning to ts_headline() pointing out that, when
working with untrusted input documents, the output is not guaranteed
to be safe for direct inclusion in web pages. This is because, while
it does remove some XML tags from the input, it doesn't remove all
HTML markup, and so the result may be unsafe (e.g., it might permit
XSS attacks).

To guard against that, all HTML markup should be removed from the
input, making it plain text, or the output should be passed through an
HTML sanitizer.
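
The two mitigations above could look roughly like the following in
application code. This is a minimal Python sketch, not part of this
commit: the function names and the `[[HL]]` / `[[/HL]]` marker strings are
hypothetical, and it assumes the caller passes those markers to
ts_headline() as StartSel/StopSel and that they never occur in the
document itself. Escaping is used here as a strict stand-in for a full
HTML sanitizer.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(document: str) -> str:
    """Mitigation 1: reduce the input to plain text before it is stored.

    Character references are decoded, so the result can still contain
    '<' or '>' and must be escaped when it is rendered.
    """
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return "".join(parser.parts)


def render_headline(headline: str,
                    start: str = "[[HL]]",
                    stop: str = "[[/HL]]") -> str:
    """Mitigation 2: escape the whole headline, then restore only the
    (hypothetical) highlight markers as <b> ... </b>."""
    escaped = html.escape(headline)
    return escaped.replace(start, "<b>").replace(stop, "</b>")
```

Escaping every character and re-adding only the highlight tags is stricter
than an allow-list sanitizer; applications that need to keep some markup
in the headline would need a real sanitizer instead.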

In addition, document precisely what the default text search parser
recognises as valid XML tags, since that's what determines which XML
tags ts_headline() will remove.
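
As an illustration only, the tag-name rule that the documentation change
describes can be approximated with a regular expression. This Python
sketch is an assumption-laden reading of that prose, not the parser's
actual code: it treats "letters" as ASCII letters, the helper name is
hypothetical, and it checks the tag name alone, not attributes, comments,
or XML declarations.

```python
import re

# First character: ASCII letter, underscore, or colon.
# Remaining characters: letters, digits, hyphens, underscores,
# periods, and colons (ASCII letters assumed).
_TAG_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")


def is_supported_tag_name(name: str) -> bool:
    """Return True if `name` fits the documented tag-name rule."""
    return _TAG_NAME.fullmatch(name) is not None
```

Under this reading, names such as `b` or `svg:rect` qualify, while a name
starting with a digit or a hyphen does not.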

Reported-by: Richard Neill <richard.neill@telos.digital>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 13
Dean Rasheed 2025-05-01 11:03:43 +01:00
parent 06c4f3ae80
commit d73d4cfdfc

@@ -1342,7 +1342,7 @@ ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type
document, to distinguish them from other excerpted words. The
default values are <quote><literal>&lt;b&gt;</literal></quote> and
<quote><literal>&lt;/b&gt;</literal></quote>, which can be suitable
-for HTML output.
+for HTML output (but see the warning below).
</para>
</listitem>
<listitem>
@@ -1354,6 +1354,21 @@ ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type
</listitem>
</itemizedlist>
+<warning>
+<title>Warning: Cross-site scripting (XSS) safety</title>
+<para>
+The output from <function>ts_headline</function> is not guaranteed to
+be safe for direct inclusion in web pages. When
+<literal>HighlightAll</literal> is <literal>false</literal> (the
+default), some simple XML tags are removed from the document, but this
+is not guaranteed to remove all HTML markup. Therefore, this does not
+provide an effective defense against attacks such as cross-site
+scripting (XSS) attacks, when working with untrusted input. To guard
+against such attacks, all HTML markup should be removed from the input
+document, or an HTML sanitizer should be used on the output.
+</para>
+</warning>
These option names are recognized case-insensitively.
You must double-quote string values if they contain spaces or commas.
</para>
@@ -2225,6 +2240,18 @@ LIMIT 10;
Specifically, the only non-alphanumeric characters supported for
email user names are period, dash, and underscore.
</para>
+<para>
+<literal>tag</literal> does not support all valid tag names as defined by
+<ulink url="https://www.w3.org/TR/xml/">W3C Recommendation, XML</ulink>.
+Specifically, the only tag names supported are those starting with an
+ASCII letter, underscore, or colon, and containing only letters, digits,
+hyphens, underscores, periods, and colons. <literal>tag</literal> also
+includes XML comments starting with <literal>&lt;!--</literal> and ending
+with <literal>--&gt;</literal>, and XML declarations (but note that this
+includes anything starting with <literal>&lt;?x</literal> and ending with
+<literal>&gt;</literal>).
+</para>
</note>
<para>