File 1150-re-Document-ascii-character-class-deficiency.patch of Package erlang
From 092584bb2e0635bfbf73931c61c95d0b178caeb4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?John=20H=C3=B6gberg?= <john@erlang.org>
Date: Thu, 25 Feb 2021 13:01:08 +0100
Subject: [PATCH] re: Document [:ascii:] character class deficiency
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The regex library we use can work either in locale-specific mode, or
unicode mode. The locale-specific mode uses a pregenerated table to
tell which characters are printable, numeric, and so on.

For historical reasons, OTP has always used Latin-1 for this table,
so characters like `ö` are considered to be letters. This is fine, but
the library has two quirks that don't play well with each other:

* The locale-specific table is always consulted for code points below
  256 regardless of whether we're in unicode mode or not, and the
  `ucp` option only affects code points that aren't defined in this
  table (zeroed).
* The character class `[:ascii:]` matches characters that are defined
  in the above table.

This is fine when the regex library is built with its default ASCII
table: `[:ascii:]` only matches ASCII characters (by definition) and
the library documentation states that `ucp` is required to match
characters beyond that with `\w` and friends.

Unfortunately, we build the library with the Latin-1 table so
`[:ascii:]` matches Latin-1 characters instead, and we can't change
the table since we've documented that `\w` etc work fine with Latin-1
characters, only requiring `ucp` for characters beyond that.

At this point you might be thinking that this is a bug in how the
regex library handles `[:ascii:]`. Well, yes, POSIX says it should
match all code points between 0-127, but that's misleading since it's
only true for strict supersets of ASCII: should `[:ascii:]` match 0x5C
if the table is Shift-JIS? It would be just as wrong as matching `ö`.
:-(

Why not try to do the right thing and mark ASCII-compatibility for
each code point, since (for instance) 0x41 is `A` both in ASCII and
Shift-JIS? There's no way to ask a locale whether a code point refers
to the same character in ASCII, so the users would need to manually go
through the tables after generating them. Happy fun times.

I've settled for documenting this mess since we can't fix this on our
end without breaking people's code, and there's not much point in
reporting this upstream since it'll either be misleading or far too
much work for the user, and PCRE-8.x is nearing the very end of its
life.
---
 lib/stdlib/doc/src/re.xml | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml
index 11041e63b2..f0e35ef62e 100644
--- a/lib/stdlib/doc/src/re.xml
+++ b/lib/stdlib/doc/src/re.xml
@@ -2460,10 +2460,9 @@ foo\Kbar</code>
 
     <p>The following are the supported class names:</p>
 
-    <taglist>
+    <taglist>
       <tag>alnum</tag><item>Letters and digits</item>
       <tag>alpha</tag><item>Letters</item>
-      <tag>ascii</tag><item>Character codes 0-127</item>
       <tag>blank</tag><item>Space or tab only</item>
       <tag>cntrl</tag><item>Control characters</item>
       <tag>digit</tag><item>Decimal digits (same as \d)</item>
@@ -2478,6 +2477,11 @@ foo\Kbar</code>
       <tag>xdigit</tag><item>Hexadecimal digits</item>
     </taglist>
 
+    <p>There is another character class, <c>ascii</c>, that erroneously matches
+      Latin-1 characters instead of the 0-127 range specified by POSIX. This
+      cannot be fixed without altering the behaviour of other classes, so we
+      recommend matching the range with <c>[\\0-\x7f]</c> instead.</p>
+
     <p>The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT character (code 11).
       This makes "space" different to \s, which does not include VT (for Perl
-- 
2.26.2