Problematic behavior of low surrogate codes in Java regex patterns

Regex patterns or partial patterns representing low surrogate codes wrongly match the latter half of complete surrogate pairs. At least the following JRE versions are affected.

  • Oracle HotSpot 1.8.0_72
  • OpenJDK 1.8.0_66

Description

The following example patterns, written as Java string literals, cause the problem. Each represents either the single low surrogate code U+DC00 or the whole range of low surrogate codes (U+DC00-U+DFFF).

  • "\\udc00"
  • "\\x{dc00}"
  • "[\\udc00-\\udfff]"
  • "[\\x{dc00}-\\x{dfff}]"
  • "[\\p{blk=Low Surrogates}]"

I actually used the last pattern to detect isolated surrogate codes, that is, low surrogate codes which are not preceded by high surrogate codes (U+D800-U+DBFF). Isolated surrogate codes are “ill-formed” in the UTF-16 encoding, but Java strings may still contain them, as in ">>\udc00<<".
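The definition of an isolated low surrogate above can also be checked without any regex. The following is only a minimal sketch of that check; the class and method names are made up for illustration, and it relies on nothing beyond Character.isLowSurrogate and Character.isHighSurrogate.

```java
class IsolatedSurrogateScan {
    /**
     * Returns the index of the first low surrogate that is not
     * immediately preceded by a high surrogate, or -1 if there is none.
     * (Hypothetical helper, not part of any library.)
     */
    static int indexOfIsolatedLowSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            boolean paired = i > 0 && Character.isHighSurrogate(s.charAt(i - 1));
            if (Character.isLowSurrogate(s.charAt(i)) && !paired) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Isolated low surrogate at index 2.
        System.out.println(indexOfIsolatedLowSurrogate(">>\udc00<<"));   // => 2
        // Complete surrogate pair: nothing isolated.
        System.out.println(indexOfIsolatedLowSurrogate("\ud800\udc00")); // => -1
    }
}
```

This covers only isolated low surrogates, which is the case discussed here; an unpaired high surrogate would need a separate check.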

The pattern matches isolated surrogates as expected, but it also matches the latter half of complete surrogate pairs such as "\ud800\udc00", which represents a single codepoint U+010000.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("[\\p{blk=Low Surrogates}]");
Matcher matcher = regex.matcher("\ud800\udc00");  // U+010000
System.out.println(matcher.find());   // => true
System.out.println(matcher.start());  // => 1
System.out.println(matcher.end());    // => 2

This behavior seems to violate requirement “RL1.7 Supplementary Code Points” of Unicode Technical Standard #18, which says:

where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
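To make “a single code point” concrete, here is a small sketch using the standard code point methods of String. The class and method names are illustrative only. It shows that the pair "\ud800\udc00" is two chars but one code point, while an isolated low surrogate stays a code point of its own.

```java
class SurrogatePairAsCodePoint {
    /** Returns the code points of s; surrogate pairs are combined. */
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String pair = "\ud800\udc00";
        System.out.println(pair.length());                            // => 2
        System.out.println(pair.codePointCount(0, pair.length()));    // => 1
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // => 10000

        // An isolated low surrogate is reported as its own code point.
        int[] cps = codePointsOf(">>\udc00<<");
        System.out.println(cps.length);                               // => 5
        System.out.println(Integer.toHexString(cps[2]));              // => dc00
    }
}
```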

Workaround

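When raw surrogate characters are passed to Pattern#compile (that is, the Java compiler resolves the "\udc00" escape, so the regex engine receives the character itself rather than an escape sequence), the resulting patterns match only isolated surrogates and do not match parts of complete surrogate pairs.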

Pattern p1 = Pattern.compile("\udc00");
System.out.println(p1.matcher("\ud800\udc00").find());
System.out.println(p1.matcher(">>\udc00<<").find());
// => false true

Pattern p2 = Pattern.compile("[\udc00-\udfff]");
System.out.println(p2.matcher("\ud800\udc00").find());
System.out.println(p2.matcher(">>\udc00<<").find());
// => false true
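The two spellings differ only in what string actually reaches Pattern#compile. The sketch below makes that visible; the class and helper names are hypothetical. With a single backslash the Java compiler turns the escape into one char, while with a doubled backslash the regex engine receives the six characters of the escape sequence and interprets them itself.

```java
class PatternSourceForms {
    /** True if the pattern source itself contains a low surrogate char. */
    static boolean containsRawLowSurrogate(String patternSource) {
        for (int i = 0; i < patternSource.length(); i++) {
            if (Character.isLowSurrogate(patternSource.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String raw = "\udc00";      // escape resolved by the Java compiler
        String escaped = "\\udc00"; // escape left for the regex engine

        System.out.println(raw.length());                     // => 1
        System.out.println(escaped.length());                 // => 6
        System.out.println(containsRawLowSurrogate(raw));     // => true
        System.out.println(containsRawLowSurrogate(escaped)); // => false
    }
}
```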

Status

I reported the issue through Java Bug Report on 2016-02-07. I hope it will be processed.

The issue has been registered as JDK-8149446. *1

*1:Edited on Feb. 10 at 08:30 AM JST.