File 0718-re-Document-ascii-character-class-deficiency.patch of Package erlang

From 092584bb2e0635bfbf73931c61c95d0b178caeb4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?John=20H=C3=B6gberg?= <john@erlang.org>
Date: Thu, 25 Feb 2021 13:01:08 +0100
Subject: [PATCH] re: Document [:ascii:] character class deficiency
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The regex library we use can work in either locale-specific mode or
Unicode mode. The locale-specific mode uses a pregenerated table to
tell which characters are printable, numeric, and so on.

For historical reasons, OTP has always used Latin-1 for this table,
so characters like `ö` are considered to be letters. This is fine,
but the library has two quirks that don't play well with each
other:

* The locale-specific table is always consulted for code points
  below 256 regardless of whether we're in unicode mode or not,
  and the `ucp` option only affects code points that aren't
  defined in this table (zeroed).
* The character class `[:ascii:]` matches characters that are
  defined in the above table.

This is fine when the regex library is built with its default ASCII
table: `[:ascii:]` only matches ASCII characters (by definition)
and the library documentation states that `ucp` is required to
match characters beyond that with `\w` and friends.

Unfortunately, we build the library with the Latin-1 table so
`[:ascii:]` matches Latin-1 characters instead, and we can't change
the table since we've documented that `\w` etc work fine with
Latin-1 characters, only requiring `ucp` for characters beyond
that.

At this point you might be thinking that this is a bug in how the
regex library handles `[:ascii:]`. Well, yes, POSIX says it should
match all code points in the 0-127 range, but that's misleading since
it's only true for strict supersets of ASCII: should `[:ascii:]`
match 0x5C if the table is Shift-JIS? It would be just as wrong as
matching `ö`. :-(

Why not try to do the right thing and mark ASCII-compatibility for
each code point, since (for instance) 0x41 is `A` both in ASCII and
Shift-JIS? There's no way to ask a locale whether a code point
refers to the same character in ASCII, so the users would need to
manually go through the tables after generating them. Happy fun
times.

I've settled for documenting this mess since we can't fix this
on our end without breaking people's code, and there's not much
point in reporting this upstream since any fix there would either be
misleading or far too much work for the user, and PCRE-8.x is nearing
the very end of its life.
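
For illustration only, here is a minimal sketch of the workaround the
documentation now recommends: matching the 0-127 range explicitly
instead of relying on `[:ascii:]`. The module name `ascii_check` and
the function `is_ascii/1` are hypothetical and not part of this patch;
only `re:run/3` and its `{capture, none}` option come from stdlib.

```erlang
-module(ascii_check).
-export([is_ascii/1]).

%% Hypothetical helper: returns true when every character in Subject
%% falls in the 0-127 range. It uses an explicit range rather than
%% [:ascii:], which is described above as matching Latin-1 characters.
-spec is_ascii(iodata()) -> boolean().
is_ascii(Subject) ->
    %% The Erlang string "\\A[\\0-\\x7f]*\\z" is the regex \A[\0-\x7f]*\z,
    %% anchored at both ends so the whole subject must be in range.
    case re:run(Subject, "\\A[\\0-\\x7f]*\\z", [{capture, none}]) of
        match -> true;
        nomatch -> false
    end.
```

Under these assumptions, `ascii_check:is_ascii("Abc 123")` should
return `true`, while `ascii_check:is_ascii([16#F6])` (Latin-1 `ö`)
should return `false`.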
---
 lib/stdlib/doc/src/re.xml | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml
index 11041e63b2..f0e35ef62e 100644
--- a/lib/stdlib/doc/src/re.xml
+++ b/lib/stdlib/doc/src/re.xml
@@ -2053,6 +2053,11 @@ foo\Kbar</code>
   <tag>xdigit</tag>   <item>hexadecimal digits</item>
 </taglist>
 
+<p>There is another character class, <c>ascii</c>, that erroneously matches
+Latin-1 characters instead of the 0-127 range specified by POSIX. This
+cannot be fixed without altering the behaviour of other classes, so we
+recommend matching the range with <c>[\\0-\x7f]</c> instead.</p>
+
 <p>The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
 space (32). Notice that this list includes the VT character (code 11). This
 makes "space" different to \s, which does not include VT (for Perl
-- 
2.26.2