--- pcre.3 2000/08/29 19:24:17 1.5
+++ pcre.3 2002/01/07 15:21:06 1.6
@@ -92,7 +92,9 @@
use these to include support for different releases.
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
-are used for compiling and matching regular expressions.
+are used for compiling and matching regular expressions. A sample program that
+demonstrates the simplest way of using them is given in the file
+\fIpcredemo.c\fR. The last section of this man page describes how to run it.
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
\fBpcre_get_substring_list()\fR are convenience functions for extracting
@@ -129,18 +131,22 @@
The function \fBpcre_compile()\fR is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument \fIpattern\fR. A pointer to a single block of memory
-that is obtained via \fBpcre_malloc\fR is returned. This contains the
-compiled code and related data. The \fBpcre\fR type is defined for this for
-convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the
-contents of the block are not externally defined. It is up to the caller to
-free the memory when it is no longer required.
-.PP
+that is obtained via \fBpcre_malloc\fR is returned. This contains the compiled
+code and related data. The \fBpcre\fR type is defined for the returned block;
+this is a typedef for a structure whose contents are not externally defined. It
+is up to the caller to free the memory when it is no longer required.
+
+Although the compiled code of a PCRE regex is relocatable, that is, it does not
+depend on memory location, the complete \fBpcre\fR data block is not
+fully relocatable, because it contains a copy of the \fItableptr\fR argument,
+which is an address (see below).
+
The size of a compiled pattern is roughly proportional to the length of the
pattern string, except that each character class (other than those containing
just a single character, negated or not) requires 33 bytes, and repeat
quantifiers with a minimum greater than one or a bounded maximum cause the
relevant portions of the compiled pattern to be replicated.
-.PP
+
The \fIoptions\fR argument contains independent bits that affect the
compilation. It should be zero if no options are required. Some of the options,
in particular, those that are compatible with Perl, can also be set and unset
@@ -149,19 +155,31 @@
their initial settings at the start of compilation and execution. The
PCRE_ANCHORED option can be set at the time of matching as well as at compile
time.
-.PP
+
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual
error message. The offset from the start of the pattern to the character where
the error was discovered is placed in the variable pointed to by
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given.
-.PP
+
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the default C
locale. Otherwise, \fItableptr\fR must be the result of a call to
\fBpcre_maketables()\fR. See the section on locale support below.
-.PP
+
+This code fragment shows a typical straightforward call to \fBpcre_compile()\fR:
+
+ pcre *re;
+ const char *error;
+ int erroffset;
+ re = pcre_compile(
+ "^A.*Z", /* the pattern */
+ 0, /* default options */
+ &error, /* for error message */
+ &erroffset, /* for error offset */
+ NULL); /* use default character tables */
+
The following option bits are defined in the header file:
PCRE_ANCHORED
@@ -248,10 +266,10 @@
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. The
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first
-argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR
-typedef) containing additional information about the pattern; this can be
-passed to \fBpcre_exec()\fR. If no additional information is available, NULL
-is returned.
+argument, and returns a pointer to a \fBpcre_extra\fR block (another typedef
+for a structure with hidden contents) containing additional information about
+the pattern; this can be passed to \fBpcre_exec()\fR. If no additional
+information is available, NULL is returned.
The second argument contains option bits. At present, no options are defined
for \fBpcre_study()\fR, and this argument should always be zero.
@@ -260,6 +278,14 @@
studying succeeds (even if no data is returned), the variable it points to is
set to NULL. Otherwise it points to a textual error message.
+This is a typical call to \fBpcre_study\fR():
+
+ pcre_extra *pe;
+ pe = pcre_study(
+ re, /* result of pcre_compile() */
+ 0, /* no options exist */
+ &error); /* set to NULL or points to a message */
+
At present, studying a pattern is useful only for non-anchored patterns that do
not have a single fixed starting character. A bitmap of possible starting
characters is created.
@@ -309,13 +335,24 @@
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid
+Here is a typical call of \fBpcre_fullinfo()\fR, to obtain the length of the
+compiled pattern:
+
+ int rc;
+ unsigned long int length;
+ rc = pcre_fullinfo(
+ re, /* result of pcre_compile() */
+ pe, /* result of pcre_study(), or NULL */
+ PCRE_INFO_SIZE, /* what is required */
+ &length); /* where to put the data */
+
The possible values for the third argument are defined in \fBpcre.h\fR, and are
as follows:
PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was compiled. The fourth
-argument should point to au \fBunsigned long int\fR variable. These option bits
+argument should point to an \fBunsigned long int\fR variable. These option bits
are those specified in the call to \fBpcre_compile()\fR, modified by any
top-level option settings within the pattern itself, and with the PCRE_ANCHORED
bit forcibly set if the form of the pattern implies that it can match only at
@@ -396,6 +433,20 @@
pattern has been studied, the result of the study should be passed in the
\fIextra\fR argument. Otherwise this must be NULL.
+Here is an example of a simple call to \fBpcre_exec()\fR:
+
+ int rc;
+ int ovector[30];
+ rc = pcre_exec(
+ re, /* result of pcre_compile() */
+ NULL, /* we didn't study the pattern */
+ "some string", /* the subject string */
+ 11, /* the length of the subject string */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ ovector, /* vector for substring information */
+ 30); /* number of elements in the vector */
+
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose
unused bits must be zero. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
@@ -437,9 +488,9 @@
The subject string is passed as a pointer in \fIsubject\fR, a length in
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern
-string, it may contain binary zero characters. When the starting offset is
-zero, the search for a match starts at the beginning of the subject, and this
-is by far the most common case.
+string, the subject may contain binary zero characters. When the starting
+offset is zero, the search for a match starts at the beginning of the subject,
+and this is by far the most common case.
A non-zero starting offset is useful when searching for another match in the
same subject by calling \fBpcre_exec()\fR again after a previous success.
@@ -626,8 +677,9 @@
practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes.
All values in repeating quantifiers must be less than 65536.
-The maximum number of capturing subpatterns is 99.
-The maximum number of all parenthesized subpatterns, including capturing
+There maximum number of capturing subpatterns is 65535.
+There is no limit to the number of non-capturing subpatterns, but the maximum
+depth of nesting of all kinds of parenthesized subpattern, including capturing
subpatterns, assertions, and other types of subpattern, is 200.
The maximum length of a subject string is the largest positive number that an
@@ -949,7 +1001,7 @@
Note that the sequences \\A, \\Z, and \\z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
-\\A is it always anchored, whether PCRE_MULTILINE is set or not.
+\\A it is always anchored, whether PCRE_MULTILINE is set or not.
.SH FULL STOP (PERIOD, DOT)
@@ -1053,7 +1105,7 @@
[12[:^digit:]]
-matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX
+matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
@@ -1151,7 +1203,7 @@
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1,
-2, and 3.
+2, and 3, respectively.
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when a grouping subpattern is required without a
@@ -1792,6 +1844,137 @@
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.
+
+.SH SAMPLE PROGRAM
+The code below is a simple, complete demonstration program, to get you started
+with using PCRE. This code is also supplied in the file \fIpcredemo.c\fR in the
+PCRE distribution.
+
+The program compiles the regular expression that is its first argument, and
+matches it against the subject string in its second argument. No options are
+set, and default character tables are used. If matching succeeds, the program
+outputs the portion of the subject that matched, together with the contents of
+any captured substrings.
+
+On a Unix system that has PCRE installed in \fI/usr/local\fR, you can compile
+the demonstration program using a command like this:
+
+ gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre
+
+Then you can run simple tests like this:
+
+ ./pcredemo 'cat|dog' 'the cat sat on the mat'
+
+Note that there is a much more comprehensive test program, called
+\fBpcretest\fR, which supports many more facilities for testing regular
+expressions. The \fBpcredemo\fR program is provided as a simple coding example.
+
+On some operating systems (e.g. Solaris) you may get an error like this when
+you try to run \fBpcredemo\fR:
+
+ ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
+
+This is caused by the way shared library support works on those systems. You
+need to add
+
+ -R/usr/local/lib
+
+to the compile command to get round this problem. Here's the code:
+
+ #include <stdio.h>
+ #include <string.h>
+ #include <pcre.h>
+
+ #define OVECCOUNT 30 /* should be a multiple of 3 */
+
+ int main(int argc, char **argv)
+ {
+ pcre *re;
+ const char *error;
+ int erroffset;
+ int ovector[OVECCOUNT];
+ int rc, i;
+
+ if (argc != 3)
+ {
+ printf("Two arguments required: a regex and a "
+ "subject string\\n");
+ return 1;
+ }
+
+ /* Compile the regular expression in the first argument */
+
+ re = pcre_compile(
+ argv[1], /* the pattern */
+ 0, /* default options */
+ &error, /* for error message */
+ &erroffset, /* for error offset */
+ NULL); /* use default character tables */
+
+ /* Compilation failed: print the error message and exit */
+
+ if (re == NULL)
+ {
+ printf("PCRE compilation failed at offset %d: %s\\n",
+ erroffset, error);
+ return 1;
+ }
+
+ /* Compilation succeeded: match the subject in the second
+ argument */
+
+ rc = pcre_exec(
+ re, /* the compiled pattern */
+ NULL, /* we didn't study the pattern */
+ argv[2], /* the subject string */
+ (int)strlen(argv[2]), /* the length of the subject */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ ovector, /* vector for substring information */
+ OVECCOUNT); /* number of elements in the vector */
+
+ /* Matching failed: handle error cases */
+
+ if (rc < 0)
+ {
+ switch(rc)
+ {
+ case PCRE_ERROR_NOMATCH: printf("No match\\n"); break;
+ /*
+ Handle other special cases if you like
+ */
+ default: printf("Matching error %d\\n", rc); break;
+ }
+ return 1;
+ }
+
+ /* Match succeded */
+
+ printf("Match succeeded\\n");
+
+ /* The output vector wasn't big enough */
+
+ if (rc == 0)
+ {
+ rc = OVECCOUNT/3;
+ printf("ovector only has room for %d captured "
+ substrings\\n", rc - 1);
+ }
+
+ /* Show substrings stored in the output vector */
+
+ for (i = 0; i < rc; i++)
+ {
+ char *substring_start = argv[2] + ovector[2*i];
+ int substring_length = ovector[2*i+1] - ovector[2*i];
+ printf("%2d: %.*s\\n", i, substring_length,
+ substring_start);
+ }
+
+ return 0;
+ }
+
+
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
@@ -1803,8 +1986,6 @@
.br
Phone: +44 1223 334714
-Last updated: 28 August 2000,
-.br
- the 250th anniversary of the death of J.S. Bach.
+Last updated: 15 August 2001
.br
-Copyright (c) 1997-2000 University of Cambridge.
+Copyright (c) 1997-2001 University of Cambridge.
|