POSIX Regular Expressions - detailed manual

pupilzeng · posted on 2003-11-20 20:49:35

NAME
regex - POSIX 1003.2 regular expressions

DESCRIPTION
Regular expressions (``RE''s), as defined in POSIX 1003.2,
come in two forms: modern REs  (roughly those  of  egrep;
1003.2  calls  these  ``extended''  REs)  and obsolete REs
(roughly those of ed(1); 1003.2 ``basic'' REs).  Obsolete
REs  mostly  exist  for backward compatibility in some old
programs; they will be discussed at the end. 1003.2
leaves some aspects of RE syntax and semantics open;
decisions on these aspects may not be fully portable to
other 1003.2 implementations.

A (modern) RE is one or more non-empty branches, separated
by `|'. It matches  anything  that  matches  one  of  the
branches.

A  branch is one or more pieces, concatenated.  It matches
a match for the first, followed by a match for the second,
etc.

A piece is an atom possibly followed by a single `*', `+',
`?', or bound.  An atom followed by `*' matches a sequence
of 0 or more matches of the atom.  An atom followed by `+'
matches a sequence of 1 or more matches of the  atom. An
atom  followed by `?' matches a sequence of 0 or 1 matches
of the atom.

A bound is `{' followed by an unsigned decimal integer,
possibly followed by `,' possibly followed by another
unsigned decimal integer, always followed by `}'. The
integers must lie between 0 and RE_DUP_MAX (255)
inclusive, and if there are two of them, the first may not
exceed the second. An atom followed by a bound containing
one integer i and no comma matches a sequence of exactly i
matches of the atom. An atom followed by a bound
containing one integer i and a comma matches a sequence of
i or more matches of the atom. An atom followed by a bound
containing two integers i and j matches a sequence of i
through j (inclusive) matches of the atom.

An atom is a regular expression enclosed in `()' (matching
a match for the regular expression), an empty set of `()'
(matching the null string), a bracket expression (see
below), `.' (matching any single character), `^' (matching
the null string at the beginning of a line), `$' (matching
the null string at the end of a line), a `\' followed by
one of the characters `^.[$()|*+?{\' (matching that
character taken as an ordinary character), a `\' followed
by any other character (matching that character taken as
an ordinary character, as if the `\' had not been
present), or a single character with no other significance
(matching that character). A `{' followed by a character
other than a digit is an ordinary character, not the
beginning of a bound. It is illegal to end an RE with `\'.

A bracket expression is a list of characters enclosed in
`[]'. It normally matches any single character from the
list (but see below). If the list begins with `^', it
matches any single character (but see below) not from the
rest of the list. If two characters in the list are
separated by `-', this is shorthand for the full range of
characters between those two (inclusive) in the collating
sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
It is illegal for two ranges to share an endpoint, e.g.
`a-c-e'. Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.

To include a literal `]' in the list, make it the first
character (following a possible `^'). To include a
literal `-', make it the first or last character, or the
second endpoint of a range. To use a literal `-' as the
first endpoint of a range, enclose it in `[.' and `.]' to
make it a collating element (see below). With the
exception of these and some combinations using `[' (see
next paragraphs), all other special characters, including
`\', lose their special significance within a bracket
expression.

Within a bracket expression, a collating element (a
character, a multi-character sequence that collates as if
it were a single character, or a collating-sequence name
for either) enclosed in `[.' and `.]' stands for the
sequence of characters of that collating element. The
sequence is a single element of the bracket expression's
list. A bracket expression containing a multi-character
collating element can thus match more than one character,
e.g. if the collating sequence includes a `ch' collating
element, then the RE `[[.ch.]]*c' matches the first five
characters of `chchcc'.

Within a bracket expression, a collating element enclosed
in `[=' and `=]' is an equivalence class, standing for the
sequences of characters of all collating elements
equivalent to that one, including itself. (If there are no
other equivalent collating elements, the treatment is as
if the enclosing delimiters were `[.' and `.]'.) For
example, if o and ô are the members of an equivalence
class, then `[[=o=]]', `[[=ô=]]', and `[oô]' are all
synonymous. An equivalence class may not be an endpoint of
a range.

Within a bracket expression, the name of a character class
enclosed in `[:' and `:]' stands for the list of all
characters belonging to that class. Standard character
class names are:

    alnum    digit    punct
    alpha    graph    space
    blank    lower    upper
    cntrl    print    xdigit

These stand for the character classes defined in ctype(3).
A locale may provide others.  A character class may not be
used as an endpoint of a range.

There are two special cases of bracket expressions: the
bracket expressions `[[:<:]]' and `[[:>:]]' match the null
string at the beginning and end of a word respectively. A
word is defined as a sequence of word characters which is
neither preceded nor followed by word characters. A word
character is an alnum character (as defined by ctype(3))
or an underscore. This is an extension, compatible with
but not specified by POSIX 1003.2, and should be used with
caution in software intended to be portable to other
systems.
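
Since `[[:<:]]' and `[[:>:]]' are an extension, code that must run on
other regex implementations may prefer an approximation built only
from standard constructs. The C sketch below is one such
approximation, not the mechanism this manual describes: it assumes
the POSIX <regex.h> interface, contains_word is a hypothetical helper
name, and the search word is assumed to consist of word characters
only, so it needs no escaping.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: 1 if `word` occurs in `text` as a whole
 * word, 0 if not, -1 on error. A word boundary is emulated as
 * "start/end of text, or a character that is neither alnum nor
 * underscore". */
static int contains_word(const char *text, const char *word)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(text != NULL && word != NULL);
    n = snprintf(pattern, sizeof pattern,
                 "(^|[^[:alnum:]_])%s([^[:alnum:]_]|$)", word);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;          /* formatting error or word too long */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Unlike the null-string matches of `[[:<:]]' and `[[:>:]]', this
emulation consumes the neighbouring non-word character, so it suits
yes/no tests better than extracting match offsets.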

In the event that an RE could match more than one
substring of a given string, the RE matches the one
starting earliest in the string. If the RE could match
more than one substring starting at that point, it matches
the longest. Subexpressions also match the longest
possible substrings, subject to the constraint that the
whole match be as long as possible, with subexpressions
starting earlier in the RE taking priority over ones
starting later. Note that higher-level subexpressions thus
take priority over their lower-level component
subexpressions.

Match lengths are measured in  characters,  not collating
elements. A  null  string  is considered longer than no
match at all.  For example, `bb*' matches the three middle
characters of `abbbc', `(wee|week)(knights|nights)'
matches all ten characters of `weeknights', when  `(.*).*'
is  matched  against `abc' the parenthesized subexpression
matches all three characters, and when `(a*)*' is  matched
against `bc'  both  the  whole RE  and the parenthesized
subexpression match the null string.

If case-independent matching is specified, the  effect  is
much  as  if  all  case distinctions had vanished from the
alphabet.  When an  alphabetic  that  exists  in  multiple
cases  appears  as an ordinary character outside a bracket
expression, it is effectively transformed into  a  bracket
expression containing both cases, e.g. `x' becomes `[xX]'.
When it appears inside  a  bracket  expression, all  case
counterparts of it are added to the bracket expression, so
that  (e.g.)  `[x]'  becomes  `[xX]'  and  `[^x]'  becomes
`[^xX]'.

No particular limit is imposed on the length of REs.
Programs intended to be portable should not employ REs
longer than 256 bytes, as an implementation can refuse to
accept such REs and remain POSIX-compliant.

Obsolete (``basic'') regular expressions differ in several
respects. `|', `+', and `?' are ordinary characters and
there is no equivalent for their functionality. The
delimiters for bounds are `\{' and `\}', with `{' and `}'
by themselves ordinary characters. The parentheses for
nested subexpressions are `\(' and `\)', with `(' and `)'
by themselves ordinary characters. `^' is an ordinary
character except at the beginning of the RE or the
beginning of a parenthesized subexpression, `$' is an
ordinary character except at the end of the RE or the end
of a parenthesized subexpression, and `*' is an ordinary
character if it appears at the beginning of the RE or the
beginning of a parenthesized subexpression (after a
possible leading `^'). Finally, there is one new type of
atom, a back reference: `\' followed by a non-zero decimal
digit d matches the same sequence of characters matched by
the dth parenthesized subexpression (numbering
subexpressions by the positions of their opening
parentheses, left to right), so that (e.g.) `\([bc]\)\1'
matches `bb' or `cc' but not `bc'.

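There are some parts I could not really follow. Could someone
more experienced help translate it? It seems fairly important,
since many programming languages rely on it.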
