Regex patterns or partial patterns representing low surrogate codes wrongly match the latter half of complete surrogate pairs. At least the following JRE versions are affected.
- Oracle HotSpot 1.8.0_72
- OpenJDK 1.8.0_66
Description
The following example patterns, written as Java string literals, cause the problem. Each represents the low surrogate code U+DC00 or the range of low surrogate codes (U+DC00-U+DFFF).
"\\udc00"
"\\x{dc00}"
"[\\udc00-\\udfff]"
"[\\x{dc00}-\\x{dfff}]"
"[\\p{blk=Low Surrogates}]"
I actually used the last pattern to detect isolated surrogate codes, that is, low surrogate codes that are not preceded by high surrogate codes (U+D800-U+DBFF). Isolated surrogate codes are “ill-formed” in the UTF-16 encoding, but Java strings may still contain them, as in ">>\udc00<<".
The pattern matches isolated surrogates as expected, but it also matches the latter half of complete surrogate pairs such as "\ud800\udc00", which represents a single code point U+010000.
Pattern regex = Pattern.compile("[\\p{blk=Low Surrogates}]");
Matcher matcher = regex.matcher("\ud800\udc00"); // U+010000
System.out.println(matcher.find());  // => true
System.out.println(matcher.start()); // => 1
System.out.println(matcher.end());   // => 2
This behavior seems to violate the requirement “RL1.7 Supplementary Code Points” in Unicode Technical Standard #18, which says:
where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
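To make "a single code point" concrete, here is a minimal sketch that uses only standard String and Character methods, with no regex involved. The class name SurrogatePairDemo and the helper codePoints are made up for illustration.

```java
class SurrogatePairDemo {
    // Counts Unicode code points in s, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // A high surrogate followed by a low surrogate.
        String pair = "\ud800\udc00";

        System.out.println(pair.length());                            // 2 code units
        System.out.println(codePoints(pair));                         // 1 code point
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 10000
        System.out.println(
            Character.isSurrogatePair(pair.charAt(0), pair.charAt(1))); // true
    }
}
```

The string holds two char values but only one code point, which is the unit RL1.7 expects a matcher to work on.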
Workaround
When raw surrogate characters are passed to Pattern#compile, the resulting regex patterns match only isolated surrogates and do not match parts of complete surrogate pairs.
Pattern p1 = Pattern.compile("\udc00");
System.out.println(p1.matcher("\ud800\udc00").find()); // => false
System.out.println(p1.matcher(">>\udc00<<").find());   // => true

Pattern p2 = Pattern.compile("[\udc00-\udfff]");
System.out.println(p2.matcher("\ud800\udc00").find()); // => false
System.out.println(p2.matcher(">>\udc00<<").find());   // => true
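If the goal is only to detect isolated low surrogates, another option is to skip the regex engine and scan the string by hand. The following is a minimal sketch of that idea and is not part of the bug report; the class name IsolatedLowSurrogates and the method indexOfIsolatedLow are hypothetical.

```java
class IsolatedLowSurrogates {
    // Returns the index of the first low surrogate that is not immediately
    // preceded by a high surrogate, or -1 if the string contains none.
    static int indexOfIsolatedLow(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            boolean low = Character.isLowSurrogate(s.charAt(i));
            boolean paired = i > 0 && Character.isHighSurrogate(s.charAt(i - 1));
            if (low && !paired) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfIsolatedLow("\ud800\udc00")); // -1 (complete pair)
        System.out.println(indexOfIsolatedLow(">>\udc00<<"));   // 2  (isolated)
    }
}
```

Because it relies only on Character.isLowSurrogate and Character.isHighSurrogate, the result should not depend on how a particular JRE's Pattern treats surrogate escapes.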
Status
I reported the issue through Java Bug Report on 2016-02-07. I hope it will be processed.
The issue has been registered as JDK-8149446. *1