Ysavourel: 1 revision imported

2016-06-04T23:19:59Z

Jhargraveiii at 18:04, 15 July 2015

2015-07-15T18:04:58Z

New page

The SRX 2.0 standard is based on the [http://www.gala-global.org/oscarStandards/srx/srx20.html#Intro_RegExp ICU regular expression notation].

Many Java applications implement [[SRX]] with Java's own regular expressions, because ICU4J (ICU for Java) does not support ICU regular expressions.

As of version 1.7, Java supports most of the Unicode-enabled features described in ICU. For example, when the UNICODE_CHARACTER_CLASS flag is set, "<code>\w</code>" in Java means "<code>[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]</code>", as in ICU. Some ICU features can be replaced by an equivalent Java expression, but others simply cannot be implemented in Java.

The following table shows the ICU and Java differences (assuming the UNICODE_CHARACTER_CLASS flag is set). The yellow entries denote a case where the ICU expression needs to be mapped to a Java equivalent (sometimes a complex one), and the red entries indicate the cases where the ICU expression cannot be mapped in Java.

{{NoteBox|Starting in M28, '''the Okapi implementation of SRX no longer uses the ICU Regex option by default.''' Java patterns are used, and Unicode processing is enabled via the UNICODE_CHARACTER_CLASS flag. (You can test this, for example, in [[Ratel]].)}}
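The effect of the flag can be checked with a few lines of plain Java. The following is a minimal sketch, not the Okapi code: the class name <code>UnicodeClassDemo</code> and the helper <code>compileRule</code> are hypothetical names chosen for this illustration, and only the standard <code>java.util.regex</code> API is used.

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {

    // Hypothetical helper: compile a rule expression with Unicode-aware
    // predefined classes (\w, \d, \s, \b), as the note above describes.
    static Pattern compileRule(String regex) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9";   // the French word "ete" with acute accents
        String digit = "\u0663";         // ARABIC-INDIC DIGIT THREE

        // Without the flag, \w and \d only cover ASCII.
        System.out.println(Pattern.compile("\\w+").matcher(word).matches());  // false
        System.out.println(Pattern.compile("\\d").matcher(digit).matches());  // false

        // With the flag, both follow the Unicode definitions.
        System.out.println(compileRule("\\w+").matcher(word).matches());      // true
        System.out.println(compileRule("\\d").matcher(digit).matches());      // true

        // The embedded flag (?U) enables the same behavior inside the expression.
        System.out.println(Pattern.compile("(?U)\\w+").matcher(word).matches()); // true
    }
}
```

The embedded <code>(?U)</code> form may be convenient when an expression is handed to code that does not let the caller pass compile flags.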

 

{| border="1" cellpadding="5" cellspacing="0"
|+
|- valign="top"
| '''ICU Meta Character''' || '''Java Equivalent''' || '''ICU Description'''
|- valign="top"
| \a || same || Match a BELL, \u0007
|- valign="top"
| \A || same || Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
|- valign="top"
| \b, outside of a set || same || Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. This assumes the UREGEX_UWORD option is not set (the default).
|- valign="top"
| \b, within a set || \b is invalid when within a set. Use \u0008 instead. || Match a BACKSPACE, \u0008.
|- valign="top"
| \B || same || Match if the current position is not a word boundary. This assumes the UREGEX_UWORD option is not set (the default).
|- valign="top"
| \cX || same || Match a control-X character.
|- valign="top"
| \d || same || Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
|- valign="top"
| \D || same || Match any character that is not a decimal digit.
|- valign="top"
| \e || same || Match an ESCAPE, \u001B.
|- valign="top"
| \E || same || Terminates a \Q ... \E quoted sequence.
|- valign="top"
| \f || same || Match a FORM FEED, \u000C.
|- valign="top"
| \G || same || Match if the current position is at the end of the previous match.
|- valign="top"
| \n || same || Match a LINE FEED, \u000A.
|- valign="top"
| style="background-color:red;color:white;"|\N{UNICODE CHARACTER NAME} || Does not exist || Match the named character.
|- valign="top"
| \p{UNICODE PROPERTY NAME} || same || Match any character with the specified Unicode Property.
|- valign="top"
| \P{UNICODE PROPERTY NAME} || same || Match any character not having the specified Unicode Property.
|- valign="top"
| \Q || same || Quotes all following characters until \E.
|- valign="top"
| \r || same || Match a CARRIAGE RETURN, \u000D.
|- valign="top"
| \s || same || Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
|- valign="top"
| \S || same || Match a non-white space character.
|- valign="top"
| \t || same || Match a HORIZONTAL TABULATION, \u0009.
|- valign="top"
| \uhhhh || same || Match the character with the hex value hhhh.
|- valign="top"
| style="background-color:red;color:white;"|\Uhhhhhhhh || Does not exist || Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
|- valign="top"
| \w || same || Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
|- valign="top"
| \W || same || Match a non-word character.
|- valign="top"
| \x{hhhh} || same || Match the character with the hex value hhhh.
|- valign="top"
| \xhh || same || Match the character with the two-digit hex value hh.
|- valign="top"
| style="background-color:yellow;color:black;"|\X || Can approximate with complex regex (see extended grapheme support at bottom of page): [http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions Unicode Java Regex Equivalents] || Match a [http://www.unicode.org/unicode/reports/tr29/ Grapheme Cluster].
|- valign="top"
| \Z || same || Match if the current position is at the end of input, but before the final line terminator, if one exists.
|- valign="top"
| \z || same || Match if the current position is at the end of input.
|- valign="top"
| \0nnn || same || Match the character with octal value nnn.
|- valign="top"
| \n (where n is a number) || same || Back reference. Match whatever the nth capturing group matched. n must be ≥ 1 and ≤ the total number of capture groups in the pattern.
|- valign="top"
| [pattern] || same || Match any one character from the set. See [http://icu.sourceforge.net/userguide/unicodeSet.html UnicodeSet] for a full description of what may appear in the pattern.
|- valign="top"
| . || same || Match any character.
|- valign="top"
| ^ || same || Match at the beginning of a line.
|- valign="top"
| $ || same || Match at the end of a line.
|- valign="top"
| \ || same || <nowiki>Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /</nowiki>
|}
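
Individual rows of the table can be verified directly against <code>java.util.regex</code>. The sketch below is only an illustration under stated assumptions: the class name <code>IcuToJavaCheck</code> and the helper <code>compiles</code> are hypothetical, and the checks cover just two rows (the ICU-only <code>\Uhhhhhhhh</code> escape and a BACKSPACE inside a set).

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class IcuToJavaCheck {

    // The table assumes this flag is set.
    static final int FLAGS = Pattern.UNICODE_CHARACTER_CLASS;

    // Hypothetical helper: report whether Java accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex, FLAGS);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ICU-only escape: Java is expected to reject it with a syntax error.
        System.out.println(compiles("\\U0001F600"));

        // BACKSPACE inside a set, written as \u0008 as the table recommends.
        Pattern backspace = Pattern.compile("[\\u0008]", FLAGS);
        System.out.println(backspace.matcher("\b").matches());
    }
}
```

A check like <code>compiles</code> could be run over the rules of an SRX file to flag expressions that rely on ICU-only syntax before they reach the segmenter.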

[[Category:Segmentation]] [[Category:SRX]]
