OSSP: CVS Repository: Check-in [129]

Check-in Number:

129

Date:

2000-Aug-01 11:46:06 (local)
2000-Aug-01 09:46:06 (UTC)

User:

rse

Branch:

Comment:

Upgrade to PCRE 3.3 and GNU shtool 1.5.1

Tickets:

Inspections:

Files:

ossp-pkg/pcre/ChangeLog	added-> 1.6
ossp-pkg/pcre/ChangeLog.OSSP	1.13 -> 1.14	5 inserted, 0 deleted
ossp-pkg/pcre/LICENCE	added-> 1.4
ossp-pkg/pcre/README	added-> 1.4
ossp-pkg/pcre/configure	1.8 -> 1.9	39 inserted, 21 deleted
ossp-pkg/pcre/configure.in	added-> 1.6
ossp-pkg/pcre/pcre.3	added-> 1.4
ossp-pkg/pcre/pcre.c	added-> 1.5
ossp-pkg/pcre/pcre.h	added-> 1.6
ossp-pkg/pcre/pcre_get.c	added-> 1.2
ossp-pkg/pcre/pcre_internal.h	added-> 1.5
ossp-pkg/pcre/pcre_test.c	added-> 1.5
ossp-pkg/pcre/pcre_test.d/testinput1	added-> 1.3
ossp-pkg/pcre/pcre_test.d/testinput2	added-> 1.4
ossp-pkg/pcre/pcre_test.d/testinput3	added-> 1.4
ossp-pkg/pcre/pcre_test.d/testinput4	added-> 1.2
ossp-pkg/pcre/pcre_test.d/testinput5	added-> 1.1
ossp-pkg/pcre/pcre_test.d/testinput6	added-> 1.1
ossp-pkg/pcre/pcre_test.d/testoutput1	added-> 1.6
ossp-pkg/pcre/pcre_test.d/testoutput2	added-> 1.6
ossp-pkg/pcre/pcre_test.d/testoutput3	added-> 1.6
ossp-pkg/pcre/pcre_test.d/testoutput4	added-> 1.6
ossp-pkg/pcre/pcre_test.d/testoutput5	added-> 1.1
ossp-pkg/pcre/pcre_test.d/testoutput6	added-> 1.1
ossp-pkg/pcre/pcreposix.3	added-> 1.2
ossp-pkg/pcre/pcreposix.c	added-> 1.3
ossp-pkg/pcre/pgrep.1	added-> 1.2
ossp-pkg/pcre/pgrep.c	added-> 1.3
ossp-pkg/pcre/shtool	1.5 -> 1.6	3 inserted, 3 deleted

ossp-pkg/pcre/ChangeLog -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,647 ---- + ChangeLog for PCRE + ------------------ + + + Version 3.3 01-Aug-00 + --------------------- + + 1. If an octal character was given, but the value was greater than \377, it + was not getting masked to the least significant bits, as documented. This could + lead to crashes in some systems. + + 2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats + the hyphen as a literal. PCRE used to give an error; it now behaves like Perl. + + 3. Added the functions pcre_free_substring() and pcre_free_substring_list(). + These just pass their arguments on to (pcre_free)(), but they are provided + because some uses of PCRE bind it to non-C systems that can call its functions, + but cannot call free() or pcre_free() directly. + + 4. Add "make test" as a synonym for "make check". Corrected some comments in + the Makefile. + + 5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the + Makefile. + + 6. Changed the name of pgrep to pcregrep, because Solaris has introduced a + command called pgrep for grepping around the active processes. + + 7. Added the beginnings of support for UTF-8 character strings. + + 8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and + RANLIB to ./ltconfig so that they are used by libtool. I think these are all + the relevant ones. (AR is not passed because ./ltconfig does its own figuring + out for the ar command.) + + + Version 3.2 12-May-00 + --------------------- + + This is purely a bug fixing release. + + 1. If the pattern /((Z)+|A)*/ was matched agained ZABCDEFG it matched Z instead + of ZA. This was just one example of several cases that could provoke this bug, + which was introduced by change 9 of version 2.00. The code for breaking + infinite loops after an iteration that matches an empty string was't working + correctly. + + 2. The pcretest program was not imitating Perl correctly for the pattern /a*/g + when matched against abbab (for example). After matching an empty string, it + wasn't forcing anchoring when setting PCRE_NOTEMPTY for the next attempt; this + caused it to match further down the string than it should. + + 3. The code contained an inclusion of sys/types.h. It isn't clear why this + was there because it doesn't seem to be needed, and it causes trouble on some + systems, as it is not a Standard C header. It has been removed. + + 4. Made 4 silly changes to the source to avoid stupid compiler warnings that + were reported on the Macintosh. The changes were from + + while ((c = *(++ptr)) != 0 && c != '\n'); + to + while ((c = *(++ptr)) != 0 && c != '\n') ; + + Totally extraordinary, but if that's what it takes... + + 5. PCRE is being used in one environment where neither memmove() nor bcopy() is + available. Added HAVE_BCOPY and an autoconf test for it; if neither + HAVE_MEMMOVE nor HAVE_BCOPY is set, use a built-in emulation function which + assumes the way PCRE uses memmove() (always moving upwards). + + 6. PCRE is being used in one environment where strchr() is not available. There + was only one use in pcre.c, and writing it out to avoid strchr() probably gives + faster code anyway. + + + Version 3.1 09-Feb-00 + --------------------- + + The only change in this release is the fixing of some bugs in Makefile.in for + the "install" target: + + (1) It was failing to install pcreposix.h. + + (2) It was overwriting the pcre.3 man page with the pcreposix.3 man page. + + + Version 3.0 01-Feb-00 + --------------------- + + 1. Add support for the /+ modifier to perltest (to output $` like it does in + pcretest). + + 2. Add support for the /g modifier to perltest. + + 3. Fix pcretest so that it behaves even more like Perl for /g when the pattern + matches null strings. + + 4. Fix perltest so that it doesn't do unwanted things when fed an empty + pattern. Perl treats empty patterns specially - it reuses the most recent + pattern, which is not what we want. Replace // by /(?#)/ in order to avoid this + effect. + + 5. The POSIX interface was broken in that it was just handing over the POSIX + captured string vector to pcre_exec(), but (since release 2.00) PCRE has + required a bigger vector, with some working space on the end. This means that + the POSIX wrapper now has to get and free some memory, and copy the results. + + 6. Added some simple autoconf support, placing the test data and the + documentation in separate directories, re-organizing some of the + information files, and making it build pcre-config (a GNU standard). Also added + libtool support for building PCRE as a shared library, which is now the + default. + + 7. Got rid of the leading zero in the definition of PCRE_MINOR because 08 and + 09 are not valid octal constants. Single digits will be used for minor values + less than 10. + + 8. Defined REG_EXTENDED and REG_NOSUB as zero in the POSIX header, so that + existing programs that set these in the POSIX interface can use PCRE without + modification. + + 9. Added a new function, pcre_fullinfo() with an extensible interface. It can + return all that pcre_info() returns, plus additional data. The pcre_info() + function is retained for compatibility, but is considered to be obsolete. + + 10. Added experimental recursion feature (?R) to handle one common case that + Perl 5.6 will be able to do with (?p{...}). + + 11. Added support for POSIX character classes like [:alpha:], which Perl is + adopting. + + + Version 2.08 31-Aug-99 + ---------------------- + + 1. When startoffset was not zero and the pattern began with ".*", PCRE was not + trying to match at the startoffset position, but instead was moving forward to + the next newline as if a previous match had failed. + + 2. pcretest was not making use of PCRE_NOTEMPTY when repeating for /g and /G, + and could get into a loop if a null string was matched other than at the start + of the subject. + + 3. Added definitions of PCRE_MAJOR and PCRE_MINOR to pcre.h so the version can + be distinguished at compile time, and for completeness also added PCRE_DATE. + + 5. Added Paul Sokolovsky's minor changes to make it easy to compile a Win32 DLL + in GnuWin32 environments. + + + Version 2.07 29-Jul-99 + ---------------------- + + 1. The documentation is now supplied in plain text form and HTML as well as in + the form of man page sources. + + 2. C++ compilers don't like assigning (void *) values to other pointer types. + In particular this affects malloc(). Although there is no problem in Standard + C, I've put in casts to keep C++ compilers happy. + + 3. Typo on pcretest.c; a cast of (unsigned char *) in the POSIX regexec() call + should be (const char *). + + 4. If NOPOSIX is defined, pcretest.c compiles without POSIX support. This may + be useful for non-Unix systems who don't want to bother with the POSIX stuff. + However, I haven't made this a standard facility. The documentation doesn't + mention it, and the Makefile doesn't support it. + + 5. The Makefile now contains an "install" target, with editable destinations at + the top of the file. The pcretest program is not installed. + + 6. pgrep -V now gives the PCRE version number and date. + + 7. Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was + causing the entire string to be ignored, instead of just the last character. + + 8. If a pattern like /"([^\\"]+|\\.)*"/ is applied in the normal way to a + non-matching string, it can take a very, very long time, even for strings of + quite modest length, because of the nested recursion. PCRE now does better in + some of these cases. It does this by remembering the last required literal + character in the pattern, and pre-searching the subject to ensure it is present + before running the real match. In other words, it applies a heuristic to detect + some types of certain failure quickly, and in the above example, if presented + with a string that has no trailing " it gives "no match" very quickly. + + 9. A new runtime option PCRE_NOTEMPTY causes null string matches to be ignored; + other alternatives are tried instead. + + + Version 2.06 09-Jun-99 + ---------------------- + + 1. Change pcretest's output for amount of store used to show just the code + space, because the remainder (the data block) varies in size between 32-bit and + 64-bit systems. + + 2. Added an extra argument to pcre_exec() to supply an offset in the subject to + start matching at. This allows lookbehinds to work when searching for multiple + occurrences in a string. + + 3. Added additional options to pcretest for testing multiple occurrences: + + /+ outputs the rest of the string that follows a match + /g loops for multiple occurrences, using the new startoffset argument + /G loops for multiple occurrences by passing an incremented pointer + + 4. PCRE wasn't doing the "first character" optimization for patterns starting + with \b or \B, though it was doing it for other lookbehind assertions. That is, + it wasn't noticing that a match for a pattern such as /\bxyz/ has to start with + the letter 'x'. On long subject strings, this gives a significant speed-up. + + + Version 2.05 21-Apr-99 + ---------------------- + + 1. Changed the type of magic_number from int to long int so that it works + properly on 16-bit systems. + + 2. Fixed a bug which caused patterns starting with .* not to work correctly + when the subject string contained newline characters. PCRE was assuming + anchoring for such patterns in all cases, which is not correct because .* will + not pass a newline unless PCRE_DOTALL is set. It now assumes anchoring only if + DOTALL is set at top level; otherwise it knows that patterns starting with .* + must be retried after every newline in the subject. + + + Version 2.04 18-Feb-99 + ---------------------- + + 1. For parenthesized subpatterns with repeats whose minimum was zero, the + computation of the store needed to hold the pattern was incorrect (too large). + If such patterns were nested a few deep, this could multiply and become a real + problem. + + 2. Added /M option to pcretest to show the memory requirement of a specific + pattern. Made -m a synonym of -s (which does this globally) for compatibility. + + 3. Subpatterns of the form (regex){n,m} (i.e. limited maximum) were being + compiled in such a way that the backtracking after subsequent failure was + pessimal. Something like (a){0,3} was compiled as (a)?(a)?(a)? instead of + ((a)((a)(a)?)?)? with disastrous performance if the maximum was of any size. + + + Version 2.03 02-Feb-99 + ---------------------- + + 1. Fixed typo and small mistake in man page. + + 2. Added 4th condition (GPL supersedes if conflict) and created separate + LICENCE file containing the conditions. + + 3. Updated pcretest so that patterns such as /abc\/def/ work like they do in + Perl, that is the internal \ allows the delimiter to be included in the + pattern. Locked out the use of \ as a delimiter. If \ immediately follows + the final delimiter, add \ to the end of the pattern (to test the error). + + 4. Added the convenience functions for extracting substrings after a successful + match. Updated pcretest to make it able to test these functions. + + + Version 2.02 14-Jan-99 + ---------------------- + + 1. Initialized the working variables associated with each extraction so that + their saving and restoring doesn't refer to uninitialized store. + + 2. Put dummy code into study.c in order to trick the optimizer of the IBM C + compiler for OS/2 into generating correct code. Apparently IBM isn't going to + fix the problem. + + 3. Pcretest: the timing code wasn't using LOOPREPEAT for timing execution + calls, and wasn't printing the correct value for compiling calls. Increased the + default value of LOOPREPEAT, and the number of significant figures in the + times. + + 4. Changed "/bin/rm" in the Makefile to "-rm" so it works on Windows NT. + + 5. Renamed "deftables" as "dftables" to get it down to 8 characters, to avoid + a building problem on Windows NT with a FAT file system. + + + Version 2.01 21-Oct-98 + ---------------------- + + 1. Changed the API for pcre_compile() to allow for the provision of a pointer + to character tables built by pcre_maketables() in the current locale. If NULL + is passed, the default tables are used. + + + Version 2.00 24-Sep-98 + ---------------------- + + 1. Since the (>?) facility is in Perl 5.005, don't require PCRE_EXTRA to enable + it any more. + + 2. Allow quantification of (?>) groups, and make it work correctly. + + 3. The first character computation wasn't working for (?>) groups. + + 4. Correct the implementation of \Z (it is permitted to match on the \n at the + end of the subject) and add 5.005's \z, which really does match only at the + very end of the subject. + + 5. Remove the \X "cut" facility; Perl doesn't have it, and (?> is neater. + + 6. Remove the ability to specify CASELESS, MULTILINE, DOTALL, and + DOLLAR_END_ONLY at runtime, to make it possible to implement the Perl 5.005 + localized options. All options to pcre_study() were also removed. + + 7. Add other new features from 5.005: + + $(?<= positive lookbehind + $(?<! negative lookbehind + (?imsx-imsx) added the unsetting capability + such a setting is global if at outer level; local otherwise + (?imsx-imsx:) non-capturing groups with option setting + (?(cond)re|re) conditional pattern matching + + A backreference to itself in a repeated group matches the previous + captured string. + + 8. General tidying up of studying (both automatic and via "study") + consequential on the addition of new assertions. + + 9. As in 5.005, unlimited repeated groups that could match an empty substring + are no longer faulted at compile time. Instead, the loop is forcibly broken at + runtime if any iteration does actually match an empty substring. + + 10. Include the RunTest script in the distribution. + + 11. Added tests from the Perl 5.005_02 distribution. This showed up a few + discrepancies, some of which were old and were also with respect to 5.004. They + have now been fixed. + + + Version 1.09 28-Apr-98 + ---------------------- + + 1. A negated single character class followed by a quantifier with a minimum + value of one (e.g. [^x]{1,6} ) was not compiled correctly. This could lead to + program crashes, or just wrong answers. This did not apply to negated classes + containing more than one character, or to minima other than one. + + + Version 1.08 27-Mar-98 + ---------------------- + + 1. Add PCRE_UNGREEDY to invert the greediness of quantifiers. + + 2. Add (?U) and (?X) to set PCRE_UNGREEDY and PCRE_EXTRA respectively. The + latter must appear before anything that relies on it in the pattern. + + + Version 1.07 16-Feb-98 + ---------------------- + + 1. A pattern such as /((a)*)*/ was not being diagnosed as in error (unlimited + repeat of a potentially empty string). + + + Version 1.06 23-Jan-98 + ---------------------- + + 1. Added Markus Oberhumer's little patches for C++. + + 2. Literal strings longer than 255 characters were broken. + + + Version 1.05 23-Dec-97 + ---------------------- + + 1. Negated character classes containing more than one character were failing if + PCRE_CASELESS was set at run time. + + + Version 1.04 19-Dec-97 + ---------------------- + + 1. Corrected the man page, where some "const" qualifiers had been omitted. + + 2. Made debugging output print "{0,xxx}" instead of just "{,xxx}" to agree with + input syntax. + + 3. Fixed memory leak which occurred when a regex with back references was + matched with an offsets vector that wasn't big enough. The temporary memory + that is used in this case wasn't being freed if the match failed. + + 4. Tidied pcretest to ensure it frees memory that it gets. + + 5. Temporary memory was being obtained in the case where the passed offsets + vector was exactly big enough. + + 6. Corrected definition of offsetof() from change 5 below. + + 7. I had screwed up change 6 below and broken the rules for the use of + setjmp(). Now fixed. + + + Version 1.03 18-Dec-97 + ---------------------- + + 1. A erroneous regex with a missing opening parenthesis was correctly + diagnosed, but PCRE attempted to access brastack[-1], which could cause crashes + on some systems. + + 2. Replaced offsetof(real_pcre, code) by offsetof(real_pcre, code[0]) because + it was reported that one broken compiler failed on the former because "code" is + also an independent variable. + + 3. The erroneous regex a[]b caused an array overrun reference. + + 4. A regex ending with a one-character negative class (e.g. /[^k]$/) did not + fail on data ending with that character. (It was going on too far, and checking + the next character, typically a binary zero.) This was specific to the + optimized code for single-character negative classes. + + 5. Added a contributed patch from the TIN world which does the following: + + + Add an undef for memmove, in case the the system defines a macro for it. + + + Add a definition of offsetof(), in case there isn't one. (I don't know + the reason behind this - offsetof() is part of the ANSI standard - but + it does no harm). + + + Reduce the ifdef's in pcre.c using macro DPRINTF, thereby eliminating + most of the places where whitespace preceded '#'. I have given up and + allowed the remaining 2 cases to be at the margin. + + + Rename some variables in pcre to eliminate shadowing. This seems very + pedantic, but does no harm, of course. + + 6. Moved the call to setjmp() into its own function, to get rid of warnings + from gcc -Wall, and avoided calling it at all unless PCRE_EXTRA is used. + + 7. Constructs such as \d{8,} were compiling into the equivalent of + \d{8}\d{0,65527} instead of \d{8}\d* which didn't make much difference to the + outcome, but in this particular case used more store than had been allocated, + which caused the bug to be discovered because it threw up an internal error. + + 8. The debugging code in both pcre and pcretest for outputting the compiled + form of a regex was going wrong in the case of back references followed by + curly-bracketed repeats. + + + Version 1.02 12-Dec-97 + ---------------------- + + 1. Typos in pcre.3 and comments in the source fixed. + + 2. Applied a contributed patch to get rid of places where it used to remove + 'const' from variables, and fixed some signed/unsigned and uninitialized + variable warnings. + + 3. Added the "runtest" target to Makefile. + + 4. Set default compiler flag to -O2 rather than just -O. + + + Version 1.01 19-Nov-97 + ---------------------- + + 1. PCRE was failing to diagnose unlimited repeat of empty string for patterns + like /([ab]*)*/, that is, for classes with more than one character in them. + + 2. Likewise, it wasn't diagnosing patterns with "once-only" subpatterns, such + as /((?>a*))*/ (a PCRE_EXTRA facility). + + + Version 1.00 18-Nov-97 + ---------------------- + + 1. Added compile-time macros to support systems such as SunOS4 which don't have + memmove() or strerror() but have other things that can be used instead. + + 2. Arranged that "make clean" removes the executables. + + + Version 0.99 27-Oct-97 + ---------------------- + + 1. Fixed bug in code for optimizing classes with only one character. It was + initializing a 32-byte map regardless, which could cause it to run off the end + of the memory it had got. + + 2. Added, conditional on PCRE_EXTRA, the proposed (?>REGEX) construction. + + + Version 0.98 22-Oct-97 + ---------------------- + + 1. Fixed bug in code for handling temporary memory usage when there are more + back references than supplied space in the ovector. This could cause segfaults. + + + Version 0.97 21-Oct-97 + ---------------------- + + 1. Added the \X "cut" facility, conditional on PCRE_EXTRA. + + 2. Optimized negated single characters not to use a bit map. + + 3. Brought error texts together as macro definitions; clarified some of them; + fixed one that was wrong - it said "range out of order" when it meant "invalid + escape sequence". + + 4. Changed some char * arguments to const char *. + + 5. Added PCRE_NOTBOL and PCRE_NOTEOL (from POSIX). + + 6. Added the POSIX-style API wrapper in pcreposix.a and testing facilities in + pcretest. + + + Version 0.96 16-Oct-97 + ---------------------- + + 1. Added a simple "pgrep" utility to the distribution. + + 2. Fixed an incompatibility with Perl: "{" is now treated as a normal character + unless it appears in one of the precise forms "{ddd}", "{ddd,}", or "{ddd,ddd}" + where "ddd" means "one or more decimal digits". + + 3. Fixed serious bug. If a pattern had a back reference, but the call to + pcre_exec() didn't supply a large enough ovector to record the related + identifying subpattern, the match always failed. PCRE now remembers the number + of the largest back reference, and gets some temporary memory in which to save + the offsets during matching if necessary, in order to ensure that + backreferences always work. + + 4. Increased the compatibility with Perl in a number of ways: + + (a) . no longer matches \n by default; an option PCRE_DOTALL is provided + to request this handling. The option can be set at compile or exec time. + + (b) $ matches before a terminating newline by default; an option + PCRE_DOLLAR_ENDONLY is provided to override this (but not in multiline + mode). The option can be set at compile or exec time. + + (c) The handling of \ followed by a digit other than 0 is now supposed to be + the same as Perl's. If the decimal number it represents is less than 10 + or there aren't that many previous left capturing parentheses, an octal + escape is read. Inside a character class, it's always an octal escape, + even if it is a single digit. + + (d) An escaped but undefined alphabetic character is taken as a literal, + unless PCRE_EXTRA is set. Currently this just reserves the remaining + escapes. + + (e) {0} is now permitted. (The previous item is removed from the compiled + pattern). + + 5. Changed all the names of code files so that the basic parts are no longer + than 10 characters, and abolished the teeny "globals.c" file. + + 6. Changed the handling of character classes; they are now done with a 32-byte + bit map always. + + 7. Added the -d and /D options to pcretest to make it possible to look at the + internals of compilation without having to recompile pcre. + + + Version 0.95 23-Sep-97 + ---------------------- + + 1. Fixed bug in pre-pass concerning escaped "normal" characters such as \x5c or + \x20 at the start of a run of normal characters. These were being treated as + real characters, instead of the source characters being re-checked. + + + Version 0.94 18-Sep-97 + ---------------------- + + 1. The functions are now thread-safe, with the caveat that the global variables + containing pointers to malloc() and free() or alternative functions are the + same for all threads. + + 2. Get pcre_study() to generate a bitmap of initial characters for non- + anchored patterns when this is possible, and use it if passed to pcre_exec(). + + + Version 0.93 15-Sep-97 + ---------------------- + + 1. /(b)|(:+)/ was computing an incorrect first character. + + 2. Add pcre_study() to the API and the passing of pcre_extra to pcre_exec(), + but not actually doing anything yet. + + 3. Treat "-" characters in classes that cannot be part of ranges as literals, + as Perl does (e.g. [-az] or [az-]). + + 4. Set the anchored flag if a branch starts with .* or .*? because that tests + all possible positions. + + 5. Split up into different modules to avoid including unneeded functions in a + compiled binary. However, compile and exec are still in one module. The "study" + function is split off. + + 6. The character tables are now in a separate module whose source is generated + by an auxiliary program - but can then be edited by hand if required. There are + now no calls to isalnum(), isspace(), isdigit(), isxdigit(), tolower() or + toupper() in the code. + + 7. Turn the malloc/free funtions variables into pcre_malloc and pcre_free and + make them global. Abolish the function for setting them, as the caller can now + set them directly. + + + Version 0.92 11-Sep-97 + ---------------------- + + 1. A repeat with a fixed maximum and a minimum of 1 for an ordinary character + (e.g. /a{1,3}/) was broken (I mis-optimized it). + + 2. Caseless matching was not working in character classes if the characters in + the pattern were in upper case. + + 3. Make ranges like [W-c] work in the same way as Perl for caseless matching. + + 4. Make PCRE_ANCHORED public and accept as a compile option. + + 5. Add an options word to pcre_exec() and accept PCRE_ANCHORED and + PCRE_CASELESS at run time. Add escapes \A and \I to pcretest to cause it to + pass them. + + 6. Give an error if bad option bits passed at compile or run time. + + 7. Add PCRE_MULTILINE at compile and exec time, and (?m) as well. Add \M to + pcretest to cause it to pass that flag. + + 8. Add pcre_info(), to get the number of identifying subpatterns, the stored + options, and the first character, if set. + + 9. Recognize C+ or C{n,m} where n >= 1 as providing a fixed starting character. + + + Version 0.91 10-Sep-97 + ---------------------- + + 1. PCRE was failing to diagnose unlimited repeats of subpatterns that could + match the empty string as in /(a*)*/. It was looping and ultimately crashing. + + 2. PCRE was looping on encountering an indefinitely repeated back reference to + a subpattern that had matched an empty string, e.g. /(a|)\1*/. It now does what + Perl does - treats the match as successful. + + ****

ossp-pkg/pcre/ChangeLog.OSSP 1.13 -> 1.14

--- ChangeLog.OSSP 2000/07/07 06:18:50 1.13 +++ ChangeLog.OSSP 2000/08/02 09:46:06 1.14 @@ -1,6 +1,11 @@ OSSP ChangeLog for PCRE ======================= + + Changes made to OSSP pcre 3.3-1 (07-Jul-2000 to 01-Aug-2000): + + *) Merged in changes between PCRE 3.2 to 3.3 + *) Upgraded GNU shtool from 1.5.0 to 1.5.1 Changes made to OSSP pcre 3.2-2 (16-May-2000 to 07-Jul-2000):

ossp-pkg/pcre/LICENCE -> 1.4

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,46 ---- + PCRE LICENCE + ------------ + + PCRE is a library of functions to support regular expressions whose syntax + and semantics are as close as possible to those of the Perl 5 language. + + Written by: Philip Hazel <ph10@cam.ac.uk> + + University of Cambridge Computing Service, + Cambridge, England. Phone: +44 1223 334714. + + Copyright (c) 1997-2000 University of Cambridge + + Permission is granted to anyone to use this software for any purpose on any + computer system, and to redistribute it freely, subject to the following + restrictions: + + 1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + 2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. In practice, this means that if you use + PCRE in software which you distribute to others, commercially or + otherwise, you must put a sentence like this + + Regular expression support is provided by the PCRE library package, + which is open source software, written by Philip Hazel, and copyright + by the University of Cambridge, England. + + somewhere reasonably visible in your documentation and in any relevant + files or online help data or similar. A reference to the ftp site for + the source, that is, to + + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ + + should also be given in the documentation. + + 3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + + 4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + + End

ossp-pkg/pcre/README -> 1.4

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,270 ---- + README file for PCRE (Perl-compatible regular expression library) + ----------------------------------------------------------------- + + The latest release of PCRE is always available from + + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz + + Please read the NEWS file if you are upgrading from a previous release. + + PCRE has its own native API, but a set of "wrapper" functions that are based on + the POSIX API are also supplied in the library libpcreposix. Note that this + just provides a POSIX calling interface to PCRE: the regular expressions + themselves still follow Perl syntax and semantics. The header file + for the POSIX-style functions is called pcreposix.h. The official POSIX name is + regex.h, but I didn't want to risk possible problems with existing files of + that name by distributing it that way. To use it with an existing program that + uses the POSIX API, it will have to be renamed or pointed at by a link. + + + Building PCRE on a Unix system + ------------------------------ + + To build PCRE on a Unix system, run the "configure" command in the PCRE + distribution directory. This is a standard GNU "autoconf" configuration script, + for which generic instructions are supplied in INSTALL. On many systems just + running "./configure" is sufficient, but the usual methods of changing standard + defaults are available. For example, + + CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local + + specifies that the C compiler should be run with the flags '-O2 -Wall' instead + of the default, and that "make install" should install PCRE under /opt/local + instead of the default /usr/local. + + If you want to make use of the experimential, incomplete support for UTF-8 + character strings in PCRE, you must add --enable-utf8 to the "configure" + command. Without it, the code for handling UTF-8 is not included in the + library. (Even when included, it still has to be enabled by an option at run + time.) + + The "configure" script builds four files: + + . Makefile is built by copying Makefile.in and making substitutions. + . config.h is built by copying config.in and making substitutions. + . pcre-config is built by copying pcre-config.in and making substitutions. + . RunTest is a script for running tests + + Once "configure" has run, you can run "make". It builds two libraries called + libpcre and libpcreposix, a test program called pcretest, and the pcregrep + command. You can use "make install" to copy these, and the public header file + pcre.h, to appropriate live directories on your system, in the normal way. + + Running "make install" also installs the command pcre-config, which can be used + to recall information about the PCRE configuration and installation. For + example, + + pcre-config --version + + prints the version number, and + + pcre-config --libs + + outputs information about where the library is installed. This command can be + included in makefiles for programs that use PCRE, saving the programmer from + having to remember too many details. + + + Shared libraries on Unix systems + -------------------------------- + + The default distribution builds PCRE as two shared libraries. This support is + new and experimental and may not work on all systems. It relies on the + "libtool" scripts - these are distributed with PCRE. It should build a + "libtool" script and use this to compile and link shared libraries, which are + placed in a subdirectory called .libs. The programs pcretest and pcregrep are + built to use these uninstalled libraries by means of wrapper scripts. When you + use "make install" to install shared libraries, pcregrep and pcretest are + automatically re-built to use the newly installed libraries. However, only + pcregrep is installed, as pcretest is really just a test program. + + To build PCRE using static libraries you must use --disable-shared when + configuring it. For example + + ./configure --prefix=/usr/gnu --disable-shared + + Then run "make" in the usual way. + + + Building on non-Unix systems + ---------------------------- + + For a non-Unix system, read the comments in the file NON-UNIX-USE. PCRE has + been compiled on Windows systems and on Macintoshes, but I don't know the + details because I don't use those systems. It should be straightforward to + build PCRE on any system that has a Standard C compiler, because it uses only + Standard C functions. + + + Testing PCRE + ------------ + + To test PCRE on a Unix system, run the RunTest script in the pcre directory. + (This can also be run by "make runtest", "make check", or "make test".) For + other systems, see the instruction in NON-UNIX-USE. + + The script runs the pcretest test program (which is documented in + doc/pcretest.txt) on each of the testinput files (in the testdata directory) in + turn, and compares the output with the contents of the corresponding testoutput + file. A file called testtry is used to hold the output from pcretest. To run + pcretest on just one of the test files, give its number as an argument to + RunTest, for example: + + RunTest 3 + + The first and third test files can also be fed directly into the perltest + script to check that Perl gives the same results. The third file requires the + additional features of release 5.005, which is why it is kept separate from the + main test input, which needs only Perl 5.004. In the long run, when 5.005 (or + higher) is widespread, these two test files may get amalgamated. + + The second set of tests check pcre_fullinfo(), pcre_info(), pcre_study(), + pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error + detection, and run-time flags that are specific to PCRE, as well as the POSIX + wrapper API. It also uses the debugging flag to check some of the internals of + pcre_compile(). + + If you build PCRE with a locale setting that is not the standard C locale, the + character tables may be different (see next paragraph). In some cases, this may + cause failures in the second set of tests. For example, in a locale where the + isprint() function yields TRUE for characters in the range 128-255, the use of + [:isascii:] inside a character class defines a different set of characters, and + this shows up in this test as a difference in the compiled code, which is being + listed for checking. Where the comparison test output contains [\x00-\x7f] the + test will contain [\x00-\xff], and similarly in some other cases. This is not a + bug in PCRE. + + The fourth set of tests checks pcre_maketables(), the facility for building a + set of character tables for a specific locale and using them instead of the + default tables. The tests make use of the "fr" (French) locale. Before running + the test, the script checks for the presence of this locale by running the + "locale" command. If that command fails, or if it doesn't include "fr" in the + list of available locales, the fourth test cannot be run, and a comment is + output to say why. If running this test produces instances of the error + + ** Failed to set locale "fr" + + in the comparison output, it means that locale is not available on your system, + despite being listed by "locale". This does not mean that PCRE is broken. + + The fifth test checks the experimental, incomplete UTF-8 support. It is not run + automatically unless PCRE is built with UTF-8 support. This file can be fed + directly to the perltest8 script, which requires Perl 5.6 or higher. The sixth + file tests internal UTF-8 features of PCRE that are not relevant to Perl. + + + Character tables + ---------------- + + PCRE uses four tables for manipulating and identifying characters. The final + argument of the pcre_compile() function is a pointer to a block of memory + containing the concatenated tables. A call to pcre_maketables() can be used to + generate a set of tables in the current locale. If the final argument for + pcre_compile() is passed as NULL, a set of default tables that is built into + the binary is used. + + The source file called chartables.c contains the default set of tables. This is + not supplied in the distribution, but is built by the program dftables + (compiled from dftables.c), which uses the ANSI C character handling functions + such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table + sources. This means that the default C locale which is set for your system will + control the contents of these default tables. You can change the default tables + by editing chartables.c and then re-building PCRE. If you do this, you should + probably also edit Makefile to ensure that the file doesn't ever get + re-generated. + + The first two 256-byte tables provide lower casing and case flipping functions, + respectively. The next table consists of three 32-byte bit maps which identify + digits, "word" characters, and white space, respectively. These are used when + building 32-byte bit maps that represent character classes. + + The final 256-byte table has bits indicating various character types, as + follows: + + 1 white space character + 2 letter + 4 decimal digit + 8 hexadecimal digit + 16 alphanumeric or '_' + 128 regular expression metacharacter or binary zero + + You should not alter the set of characters that contain the 128 bit, as that + will cause PCRE to malfunction. + + + Manifest + -------- + + The distribution should contain the following files: + + (A) The actual source files of the PCRE library functions and their + headers: + + dftables.c auxiliary program for building chartables.c + get.c ) + maketables.c ) + study.c ) source of + pcre.c ) the functions + pcreposix.c ) + pcre.in "source" for the header for the external API; pcre.h + is built from this by "configure" + pcreposix.h header for the external POSIX wrapper API + internal.h header for internal use + config.in template for config.h, which is built by configure + + (B) Auxiliary files: + + AUTHORS information about the author of PCRE + ChangeLog log of changes to the code + INSTALL generic installation instructions + LICENCE conditions for the use of PCRE + COPYING the same, using GNU's standard name + Makefile.in template for Unix Makefile, which is built by configure + NEWS important changes in this release + NON-UNIX-USE notes on building PCRE on non-Unix systems + README this file + RunTest.in template for a Unix shell script for running tests + config.guess ) files used by libtool, + config.sub ) used only when building a shared library + configure a configuring shell script (built by autoconf) + configure.in the autoconf input used to build configure + doc/Tech.Notes notes on the encoding + doc/pcre.3 man page source for the PCRE functions + doc/pcre.html HTML version + doc/pcre.txt plain text version + doc/pcreposix.3 man page source for the POSIX wrapper API + doc/pcreposix.html HTML version + doc/pcreposix.txt plain text version + doc/pcretest.txt documentation of test program + doc/perltest.txt documentation of Perl test program + doc/pcregrep.1 man page source for the pcregrep utility + doc/pcregrep.html HTML version + doc/pcregrep.txt plain text version + install-sh a shell script for installing files + ltconfig ) files used to build "libtool", + ltmain.sh ) used only when building a shared library + pcretest.c test program + perltest Perl test program + perltest8 Perl test program for UTF-8 tests + pcregrep.c source of a grep utility that uses PCRE + pcre-config.in source of script which retains PCRE information + testdata/testinput1 test data, compatible with Perl 5.004 and 5.005 + testdata/testinput2 test data for error messages and non-Perl things + testdata/testinput3 test data, compatible with Perl 5.005 + testdata/testinput4 test data for locale-specific tests + testdata/testinput5 test data for UTF-8 tests compatible with Perl 5.6 + testdata/testinput6 test data for other UTF-8 tests + testdata/testoutput1 test results corresponding to testinput1 + testdata/testoutput2 test results corresponding to testinput2 + testdata/testoutput3 test results corresponding to testinput3 + testdata/testoutput4 test results corresponding to testinput4 + testdata/testoutput5 test results corresponding to testinput5 + testdata/testoutput6 test results corresponding to testinput6 + + (C) Auxiliary files for Win32 DLL + + dll.mk + pcre.def + + Philip Hazel <ph10@cam.ac.uk> + August 2000

ossp-pkg/pcre/configure 1.8 -> 1.9

--- configure 2000/07/07 06:18:50 1.8 +++ configure 2000/08/02 09:46:06 1.9 @@ -31,6 +31,8 @@ # Any additions from configure.in: ac_default_prefix=/usr/local ac_help="$ac_help + --enable-utf8 enable UTF8 support (incomplete)" +ac_help="$ac_help --enable-static build static libraries (default=yes)" ac_help="$ac_help --enable-shared build shared libraries (default=yes)" @@ -550,7 +552,7 @@ # Extract the first word of "gcc", so it can be a program name with args. set dummy gcc; ac_word=$2 echo $ac_n "checking for $ac_word""... $ac_c" 1>&6 -echo "configure:554: checking for $ac_word" >&5 +echo "configure:556: checking for $ac_word" >&5 if eval "test \"`echo '$''{'ac_cv_prog_CC'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else @@ -580,7 +582,7 @@ # Extract the first word of "cc", so it can be a program name with args. set dummy cc; ac_word=$2 echo $ac_n "checking for $ac_word""... $ac_c" 1>&6 -echo "configure:584: checking for $ac_word" >&5 +echo "configure:586: checking for $ac_word" >&5 if eval "test \"`echo '$''{'ac_cv_prog_CC'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else @@ -631,7 +633,7 @@ # Extract the first word of "cl", so it can be a program name with args. set dummy cl; ac_word=$2 echo $ac_n "checking for $ac_word""... $ac_c" 1>&6 -echo "configure:635: checking for $ac_word" >&5 +echo "configure:637: checking for $ac_word" >&5 if eval "test \"`echo '$''{'ac_cv_prog_CC'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else @@ -663,7 +665,7 @@ fi echo $ac_n "checking whether the C compiler ($CC $CFLAGS $LDFLAGS) works""... $ac_c" 1>&6 -echo "configure:667: checking whether the C compiler ($CC $CFLAGS $LDFLAGS) works" >&5 +echo "configure:669: checking whether the C compiler ($CC $CFLAGS $LDFLAGS) works" >&5 ac_ext=c # CFLAGS is not in ac_cpp because -g, -O, etc. are not valid cpp options. @@ -674,12 +676,12 @@ cat > conftest.$ac_ext << EOF -#line 678 "configure" +#line 680 "configure" #include "confdefs.h" main(){return(0);} EOF -if { (eval echo configure:683: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then +if { (eval echo configure:685: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then ac_cv_prog_cc_works=yes # If we can't run a trivial program, we are probably using a cross compiler. if (./conftest; exit) 2>/dev/null; then @@ -705,12 +707,12 @@ { echo "configure: error: installation or configuration problem: C compiler cannot create executables." 1>&2; exit 1; } fi echo $ac_n "checking whether the C compiler ($CC $CFLAGS $LDFLAGS) is a cross-compiler""... $ac_c" 1>&6 -echo "configure:709: checking whether the C compiler ($CC $CFLAGS $LDFLAGS) is a cross-compiler" >&5 +echo "configure:711: checking whether the C compiler ($CC $CFLAGS $LDFLAGS) is a cross-compiler" >&5 echo "$ac_t""$ac_cv_prog_cc_cross" 1>&6 cross_compiling=$ac_cv_prog_cc_cross echo $ac_n "checking whether we are using GNU C""... $ac_c" 1>&6 -echo "configure:714: checking whether we are using GNU C" >&5 +echo "configure:716: checking whether we are using GNU C" >&5 if eval "test \"`echo '$''{'ac_cv_prog_gcc'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else @@ -719,7 +721,7 @@ yes; #endif EOF -if { ac_try='${CC-cc} -E conftest.c'; { (eval echo configure:723: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; }; } | egrep yes >/dev/null 2>&1; then +if { ac_try='${CC-cc} -E conftest.c'; { (eval echo configure:725: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; }; } | egrep yes >/dev/null 2>&1; then ac_cv_prog_gcc=yes else ac_cv_prog_gcc=no @@ -738,7 +740,7 @@ ac_save_CFLAGS="$CFLAGS" CFLAGS= echo $ac_n "checking whether ${CC-cc} accepts -g""... $ac_c" 1>&6 -echo "configure:742: checking whether ${CC-cc} accepts -g" >&5 +echo "configure:744: checking whether ${CC-cc} accepts -g" >&5 if eval "test \"`echo '$''{'ac_cv_prog_cc_g'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else @@ -770,7 +772,7 @@ fi echo $ac_n "checking how to run the C preprocessor""... $ac_c" 1>&6 -echo "configure:774: checking how to run the C preprocessor" >&5 +echo "configure:776: checking how to run the C preprocessor" >&5 # On Suns, sometimes $CPP names a directory. if test -n "$CPP" && test -d "$CPP"; then CPP= @@ -785,13 +787,13 @@ # On the NeXT, cc -E runs the code through the compiler's parser, # not just through cpp. cat > conftest.$ac_ext <<EOF -#line 789 "configure" +#line 791 "configure" #include "confdefs.h" #include <assert.h> Syntax Error EOF ac_try="$ac_cpp conftest.$ac_ext >/dev/null 2>conftest.out" -{ (eval echo configure:795: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } +{ (eval echo configure:797: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } ac_err=`grep -v '^ *+' conftest.out | grep -v "^conftest.${ac_ext}\$"` if test -z "$ac_err"; then : @@ -802,13 +804,13 @@ rm -rf conftest* CPP="${CC-cc} -E -traditional-cpp" cat > conftest.$ac_ext <<EOF -#line 806 "configure" +#line 808 "configure" #include "confdefs.h" #include <assert.h> Syntax Error EOF ac_try="$ac_cpp conftest.$ac_ext >/dev/null 2>conftest.out" -{ (eval echo configure:812: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } +{ (eval echo configure:814: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } ac_err=`grep -v '^ *+' conftest.out | grep -v "^conftest.${ac_ext}\$"` if test -z "$ac_err"; then : @@ -819,13 +821,13 @@ rm -rf conftest* CPP="${CC-cc} -nologo -E" cat > conftest.$ac_ext <<EOF -#line 823 "configure" +#line 825 "configure" #include "confdefs.h" #include <assert.h> Syntax Error EOF ac_try="$ac_cpp conftest.$ac_ext >/dev/null 2>conftest.out" -{ (eval echo configure:829: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } +{ (eval echo configure:831: \"$ac_try\") 1>&5; (eval $ac_try) 2>&5; } ac_err=`grep -v '^ *+' conftest.out | grep -v "^conftest.${ac_ext}\$"` if test -z "$ac_err"; then : @@ -850,7 +852,7 @@ echo "$ac_t""$CPP" 1>&6 echo $ac_n "checking whether ${MAKE-make} sets \${MAKE}""... $ac_c" 1>&6 -echo "configure:854: checking whether ${MAKE-make} sets \${MAKE}" >&5 +echo "configure:856: checking whether ${MAKE-make} sets \${MAKE}" >&5 set dummy ${MAKE-make}; ac_make=`echo "$2" | sed 'y%./+-%__p_%'` if eval "test \"`echo '$''{'ac_cv_prog_make_${ac_make}_set'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 @@ -877,6 +879,22 @@ fi +# Check whether --enable-utf8 or --disable-utf8 was given. +if test "${enable_utf8+set}" = set; then + enableval="$enable_utf8" + enable_utf8="$enableval" +else + enable_utf8=yes + +fi +if test ".$enable_utf8" = .yes; then + if test ".$CFLAGS" = .; then + CFLAGS="-DSUPPORT_UTF8" + else + CFLAGS="$CFLAGS -DSUPPORT_UTF8" + fi +fi + # Check whether --enable-static or --disable-static was given. if test "${enable_static+set}" = set; then enableval="$enable_static" @@ -908,12 +926,12 @@ for ac_func in memmove strerror do echo $ac_n "checking for $ac_func""... $ac_c" 1>&6 -echo "configure:912: checking for $ac_func" >&5 +echo "configure:930: checking for $ac_func" >&5 if eval "test \"`echo '$''{'ac_cv_func_$ac_func'+set}'`\" = set"; then echo $ac_n "(cached) $ac_c" 1>&6 else cat > conftest.$ac_ext <<EOF -#line 917 "configure" +#line 935 "configure" #include "confdefs.h" /* System header to define __stub macros and hopefully few prototypes, which can conflict with char $ac_func(); below. */ @@ -936,7 +954,7 @@ ; return 0; } EOF -if { (eval echo configure:940: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then +if { (eval echo configure:958: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then rm -rf conftest* eval "ac_cv_func_$ac_func=yes" else

ossp-pkg/pcre/configure.in -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,75 ---- + dnl ## + dnl ## Autoconf specification for PCRE library + dnl ## + + AC_PREREQ(2.12)dnl + AC_REVISION($Revision: 1.6 $) + + SHTOOL="./shtool" + AC_SUBST(SHTOOL) + T_MD=`$SHTOOL echo -e %B 2>/dev/null` + T_ME=`$SHTOOL echo -e %b 2>/dev/null` + PCRE_MAJOR=`grep '#define PCRE_MAJOR' pcre.h | awk '{ print $3; }'` + PCRE_MINOR=`grep '#define PCRE_MINOR' pcre.h | awk '{ print $3; }'` + PCRE_VERSION="$PCRE_MAJOR.$PCRE_MINOR" + AC_SUBST(PCRE_VERSION) + echo "Configuring ${T_MD}OSSP pcre${T_ME} (Perl-Compatible Regular Expressions) ${T_MD}${PCRE_VERSION}${T_ME}" + echo "Copyright (c) 1997-2000 Philip Hazel <ph10@cam.ac.uk>, All rights reserved." + echo "Copyright (c) 1997-2000 University of Cambridge, All rights reserved." + echo "Copyright (c) 1999-2000 Ralf S. Engelschall <rse@engelschall.com>" + PLATFORM=`$SHTOOL guessos` + echo "Platform: ${T_MD}${PLATFORM}${T_ME}" + echo "" + + AC_INIT(README) + AC_CONFIG_HEADER(config.h) + AC_PREFIX_DEFAULT(/usr/local) + + AC_PROG_CC + AC_PROG_CPP + AC_SET_MAKE + + AC_ARG_ENABLE(utf8,dnl + [ --enable-utf8 enable UTF8 support (incomplete)], + enable_utf8="$enableval", + enable_utf8=yes + )dnl + if test ".$enable_utf8" = .yes; then + if test ".$CFLAGS" = .; then + CFLAGS="-DSUPPORT_UTF8" + else + CFLAGS="$CFLAGS -DSUPPORT_UTF8" + fi + fi + + AC_ARG_ENABLE(static,dnl + [ --enable-static build static libraries (default=yes)], + enable_static="$enableval", + enable_static=yes + )dnl + AC_ARG_ENABLE(shared,dnl + [ --enable-shared build shared libraries (default=yes)], + enable_shared="$enableval", + enable_shared=yes + )dnl + libtool_flags='' + test "$enable_static" = no && libtool_flags="$libtool_flags --disable-static" + test "$enable_shared" = no && libtool_flags="$libtool_flags --disable-shared" + test "$ac_cv_prog_gcc" = yes && libtool_flags="$libtool_flags --with-gcc" + test "$ac_cv_prog_gnu_ld" = yes && libtool_flags="$libtool_flags --with-gnu-ld" + CC="$CC" CFLAGS="$CFLAGS" CPPFLAGS="$CPPFLAGS" LD="$LD" \ + ${CONFIG_SHELL-/bin/sh} ltconfig --no-reexec \ + $libtool_flags --no-verify ltmain.sh $PLATFORM ||\ + AC_MSG_ERROR([libtool configuration failed]) + LIBTOOL="./libtool" + AC_SUBST(LIBTOOL) + + AC_CHECK_FUNCS(memmove strerror) + + AC_OUTPUT(dnl + Makefile dnl + pcre-config dnl + ,dnl + chmod a+x pcre-config + )dnl +

ossp-pkg/pcre/pcre.3 -> 1.4

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,1809 ---- + .TH PCRE 3 + .SH NAME + pcre - Perl-compatible regular expressions. + .SH SYNOPSIS + .B #include <pcre.h> + .PP + .SM + .br + .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR, + .ti +5n + .B const char **\fIerrptr\fR, int *\fIerroffset\fR, + .ti +5n + .B const unsigned char *\fItableptr\fR); + .PP + .br + .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR, + .ti +5n + .B const char **\fIerrptr\fR); + .PP + .br + .B int pcre_exec(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," + .ti +5n + .B "const char *\fIsubject\fR," int \fIlength\fR, int \fIstartoffset\fR, + .ti +5n + .B int \fIoptions\fR, int *\fIovector\fR, int \fIovecsize\fR); + .PP + .br + .B int pcre_copy_substring(const char *\fIsubject\fR, int *\fIovector\fR, + .ti +5n + .B int \fIstringcount\fR, int \fIstringnumber\fR, char *\fIbuffer\fR, + .ti +5n + .B int \fIbuffersize\fR); + .PP + .br + .B int pcre_get_substring(const char *\fIsubject\fR, int *\fIovector\fR, + .ti +5n + .B int \fIstringcount\fR, int \fIstringnumber\fR, + .ti +5n + .B const char **\fIstringptr\fR); + .PP + .br + .B int pcre_get_substring_list(const char *\fIsubject\fR, + .ti +5n + .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" + .PP + .br + .B void pcre_free_substring(const char *\fIstringptr\fR); + .PP + .br + .B void pcre_free_substring_list(const char **\fIstringptr\fR); + .PP + .br + .B const unsigned char *pcre_maketables(void); + .PP + .br + .B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," + .ti +5n + .B int \fIwhat\fR, void *\fIwhere\fR); + .PP + .br + .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int + .B *\fIfirstcharptr\fR); + .PP + .br + .B char *pcre_version(void); + .PP + .br + .B void *(*pcre_malloc)(size_t); + .PP + .br + .B void (*pcre_free)(void *); + + + + .SH DESCRIPTION + The PCRE library is a set of functions that implement regular expression + pattern matching using the same syntax and semantics as Perl 5, with just a few + differences (see below). The current implementation corresponds to Perl 5.005, + with some additional features from later versions. This includes some + experimental, incomplete support for UTF-8 encoded strings. Details of exactly + what is and what is not supported are given below. + + PCRE has its own native API, which is described in this document. There is also + a set of wrapper functions that correspond to the POSIX regular expression API. + These are described in the \fBpcreposix\fR documentation. + + The native API function prototypes are defined in the header file \fBpcre.h\fR, + and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be + accessed by adding \fB-lpcre\fR to the command for linking an application which + calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to + contain the major and minor release numbers for the library. Applications can + use these to include support for different releases. + + The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR + are used for compiling and matching regular expressions. + + The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and + \fBpcre_get_substring_list()\fR are convenience functions for extracting + captured substrings from a matched subject string; \fBpcre_free_substring()\fR + and \fBpcre_free_substring_list()\fR are also provided, to free the memory used + for extracted strings. + + The function \fBpcre_maketables()\fR is used (optionally) to build a set of + character tables in the current locale for passing to \fBpcre_compile()\fR. + + The function \fBpcre_fullinfo()\fR is used to find out information about a + compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only + some of the available information, but is retained for backwards compatibility. + The function \fBpcre_version()\fR returns a pointer to a string containing the + version of PCRE and its date of release. + + The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain + the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions + respectively. PCRE calls the memory management functions via these variables, + so a calling program can replace them if it wishes to intercept the calls. This + should be done before calling any PCRE functions. + + + .SH MULTI-THREADING + The PCRE functions can be used in multi-threading applications, with the + proviso that the memory management functions pointed to by \fBpcre_malloc\fR + and \fBpcre_free\fR are shared by all threads. + + The compiled form of a regular expression is not altered during matching, so + the same compiled pattern can safely be used by several threads at once. + + + .SH COMPILING A PATTERN + The function \fBpcre_compile()\fR is called to compile a pattern into an + internal form. The pattern is a C string terminated by a binary zero, and + is passed in the argument \fIpattern\fR. A pointer to a single block of memory + that is obtained via \fBpcre_malloc\fR is returned. This contains the + compiled code and related data. The \fBpcre\fR type is defined for this for + convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the + contents of the block are not externally defined. It is up to the caller to + free the memory when it is no longer required. + .PP + The size of a compiled pattern is roughly proportional to the length of the + pattern string, except that each character class (other than those containing + just a single character, negated or not) requires 33 bytes, and repeat + quantifiers with a minimum greater than one or a bounded maximum cause the + relevant portions of the compiled pattern to be replicated. + .PP + The \fIoptions\fR argument contains independent bits that affect the + compilation. It should be zero if no options are required. Some of the options, + in particular, those that are compatible with Perl, can also be set and unset + from within the pattern (see the detailed description of regular expressions + below). For these options, the contents of the \fIoptions\fR argument specifies + their initial settings at the start of compilation and execution. The + PCRE_ANCHORED option can be set at the time of matching as well as at compile + time. + .PP + If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. + Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns + NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual + error message. The offset from the start of the pattern to the character where + the error was discovered is placed in the variable pointed to by + \fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. + .PP + If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of + character tables which are built when it is compiled, using the default C + locale. Otherwise, \fItableptr\fR must be the result of a call to + \fBpcre_maketables()\fR. See the section on locale support below. + .PP + The following option bits are defined in the header file: + + PCRE_ANCHORED + + If this bit is set, the pattern is forced to be "anchored", that is, it is + constrained to match only at the start of the string which is being searched + (the "subject string"). This effect can also be achieved by appropriate + constructs in the pattern itself, which is the only way to do it in Perl. + + PCRE_CASELESS + + If this bit is set, letters in the pattern match both upper and lower case + letters. It is equivalent to Perl's /i option. + + PCRE_DOLLAR_ENDONLY + + If this bit is set, a dollar metacharacter in the pattern matches only at the + end of the subject string. Without this option, a dollar also matches + immediately before the final character if it is a newline (but not before any + other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is + set. There is no equivalent to this option in Perl. + + PCRE_DOTALL + + If this bit is set, a dot metacharater in the pattern matches all characters, + including newlines. Without it, newlines are excluded. This option is + equivalent to Perl's /s option. A negative class such as [^a] always matches a + newline character, independent of the setting of this option. + + PCRE_EXTENDED + + If this bit is set, whitespace data characters in the pattern are totally + ignored except when escaped or inside a character class, and characters between + an unescaped # outside a character class and the next newline character, + inclusive, are also ignored. This is equivalent to Perl's /x option, and makes + it possible to include comments inside complicated patterns. Note, however, + that this applies only to data characters. Whitespace characters may never + appear within special character sequences in a pattern, for example within the + sequence (?( which introduces a conditional subpattern. + + PCRE_EXTRA + + This option was invented in order to turn on additional functionality of PCRE + that is incompatible with Perl, but it is currently of very little use. When + set, any backslash in a pattern that is followed by a letter that has no + special meaning causes an error, thus reserving these combinations for future + expansion. By default, as in Perl, a backslash followed by a letter with no + special meaning is treated as a literal. There are at present no other features + controlled by this option. It can also be set by a (?X) option setting within a + pattern. + + PCRE_MULTILINE + + By default, PCRE treats the subject string as consisting of a single "line" of + characters (even if it actually contains several newlines). The "start of line" + metacharacter (^) matches only at the start of the string, while the "end of + line" metacharacter ($) matches only at the end of the string, or before a + terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as + Perl. + + When PCRE_MULTILINE it is set, the "start of line" and "end of line" constructs + match immediately following or immediately before any newline in the subject + string, respectively, as well as at the very start and end. This is equivalent + to Perl's /m option. If there are no "\\n" characters in a subject string, or + no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no + effect. + + PCRE_UNGREEDY + + This option inverts the "greediness" of the quantifiers so that they are not + greedy by default, but become greedy if followed by "?". It is not compatible + with Perl. It can also be set by a (?U) option setting within the pattern. + + PCRE_UTF8 + + This option causes PCRE to regard both the pattern and the subject as strings + of UTF-8 characters instead of just byte strings. However, it is available only + if PCRE has been built to include UTF-8 support. If not, the use of this option + provokes an error. Support for UTF-8 is new, experimental, and incomplete. + Details of exactly what it entails are given below. + + + .SH STUDYING A PATTERN + When a pattern is going to be used several times, it is worth spending more + time analyzing it in order to speed up the time taken for matching. The + function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first + argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR + typedef) containing additional information about the pattern; this can be + passed to \fBpcre_exec()\fR. If no additional information is available, NULL + is returned. + + The second argument contains option bits. At present, no options are defined + for \fBpcre_study()\fR, and this argument should always be zero. + + The third argument for \fBpcre_study()\fR is a pointer to an error message. If + studying succeeds (even if no data is returned), the variable it points to is + set to NULL. Otherwise it points to a textual error message. + + At present, studying a pattern is useful only for non-anchored patterns that do + not have a single fixed starting character. A bitmap of possible starting + characters is created. + + + .SH LOCALE SUPPORT + PCRE handles caseless matching, and determines whether characters are letters, + digits, or whatever, by reference to a set of tables. The library contains a + default set of tables which is created in the default C locale when PCRE is + compiled. This is used when the final argument of \fBpcre_compile()\fR is NULL, + and is sufficient for many applications. + + An alternative set of tables can, however, be supplied. Such tables are built + by calling the \fBpcre_maketables()\fR function, which has no arguments, in the + relevant locale. The result can then be passed to \fBpcre_compile()\fR as often + as necessary. For example, to build and use tables that are appropriate for the + French locale (where accented characters with codes greater than 128 are + treated as letters), the following code could be used: + + setlocale(LC_CTYPE, "fr"); + tables = pcre_maketables(); + re = pcre_compile(..., tables); + + The tables are built in memory that is obtained via \fBpcre_malloc\fR. The + pointer that is passed to \fBpcre_compile\fR is saved with the compiled + pattern, and the same tables are used via this pointer by \fBpcre_study()\fR + and \fBpcre_exec()\fR. Thus for any single pattern, compilation, studying and + matching all happen in the same locale, but different patterns can be compiled + in different locales. It is the caller's responsibility to ensure that the + memory containing the tables remains available for as long as it is needed. + + + .SH INFORMATION ABOUT A PATTERN + The \fBpcre_fullinfo()\fR function returns information about a compiled + pattern. It replaces the obsolete \fBpcre_info()\fR function, which is + nevertheless retained for backwards compability (and is documented below). + + The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled + pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if + the pattern was not studied. The third argument specifies which piece of + information is required, while the fourth argument is a pointer to a variable + to receive the data. The yield of the function is zero for success, or one of + the following negative numbers: + + PCRE_ERROR_NULL the argument \fIcode\fR was NULL + the argument \fIwhere\fR was NULL + PCRE_ERROR_BADMAGIC the "magic number" was not found + PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid + + The possible values for the third argument are defined in \fBpcre.h\fR, and are + as follows: + + PCRE_INFO_OPTIONS + + Return a copy of the options with which the pattern was compiled. The fourth + argument should point to au \fBunsigned long int\fR variable. These option bits + are those specified in the call to \fBpcre_compile()\fR, modified by any + top-level option settings within the pattern itself, and with the PCRE_ANCHORED + bit forcibly set if the form of the pattern implies that it can match only at + the start of a subject string. + + PCRE_INFO_SIZE + + Return the size of the compiled pattern, that is, the value that was passed as + the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to + place the compiled data. The fourth argument should point to a \fBsize_t\fR + variable. + + PCRE_INFO_CAPTURECOUNT + + Return the number of capturing subpatterns in the pattern. The fourth argument + should point to an \fbint\fR variable. + + PCRE_INFO_BACKREFMAX + + Return the number of the highest back reference in the pattern. The fourth + argument should point to an \fBint\fR variable. Zero is returned if there are + no back references. + + PCRE_INFO_FIRSTCHAR + + Return information about the first character of any matched string, for a + non-anchored pattern. If there is a fixed first character, e.g. from a pattern + such as (cat|cow|coyote), it is returned in the integer pointed to by + \fIwhere\fR. Otherwise, if either + + (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch + starts with "^", or + + (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set + (if it were set, the pattern would be anchored), + + -1 is returned, indicating that the pattern matches only at the start of a + subject string or after any "\\n" within the string. Otherwise -2 is returned. + For anchored patterns, -2 is returned. + + PCRE_INFO_FIRSTTABLE + + If the pattern was studied, and this resulted in the construction of a 256-bit + table indicating a fixed set of characters for the first character in any + matching string, a pointer to the table is returned. Otherwise NULL is + returned. The fourth argument should point to an \fBunsigned char *\fR + variable. + + PCRE_INFO_LASTLITERAL + + For a non-anchored pattern, return the value of the rightmost literal character + which must exist in any matched string, other than at its start. The fourth + argument should point to an \fBint\fR variable. If there is no such character, + or if the pattern is anchored, -1 is returned. For example, for the pattern + /a\\d+z\\d+/ the returned value is 'z'. + + The \fBpcre_info()\fR function is now obsolete because its interface is too + restrictive to return all the available data about a compiled pattern. New + programs should use \fBpcre_fullinfo()\fR instead. The yield of + \fBpcre_info()\fR is the number of capturing subpatterns, or one of the + following negative numbers: + + PCRE_ERROR_NULL the argument \fIcode\fR was NULL + PCRE_ERROR_BADMAGIC the "magic number" was not found + + If the \fIoptptr\fR argument is not NULL, a copy of the options with which the + pattern was compiled is placed in the integer it points to (see + PCRE_INFO_OPTIONS above). + + If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, + it is used to pass back information about the first character of any matched + string (see PCRE_INFO_FIRSTCHAR above). + + + .SH MATCHING A PATTERN + The function \fBpcre_exec()\fR is called to match a subject string against a + pre-compiled pattern, which is passed in the \fIcode\fR argument. If the + pattern has been studied, the result of the study should be passed in the + \fIextra\fR argument. Otherwise this must be NULL. + + The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose + unused bits must be zero. However, if a pattern was compiled with + PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it + cannot be made unachored at matching time. + + There are also three further options that can be set only at matching time: + + PCRE_NOTBOL + + The first character of the string is not the beginning of a line, so the + circumflex metacharacter should not match before it. Setting this without + PCRE_MULTILINE (at compile time) causes circumflex never to match. + + PCRE_NOTEOL + + The end of the string is not the end of a line, so the dollar metacharacter + should not match it nor (except in multiline mode) a newline immediately before + it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never + to match. + + PCRE_NOTEMPTY + + An empty string is not considered to be a valid match if this option is set. If + there are alternatives in the pattern, they are tried. If all the alternatives + match the empty string, the entire match fails. For example, if the pattern + + a?b? + + is applied to a string not beginning with "a" or "b", it matches the empty + string at the start of the subject. With PCRE_NOTEMPTY set, this match is not + valid, so PCRE searches further into the string for occurrences of "a" or "b". + + Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a special case + of a pattern match of the empty string within its \fBsplit()\fR function, and + when using the /g modifier. It is possible to emulate Perl's behaviour after + matching a null string by first trying the match again at the same offset with + PCRE_NOTEMPTY set, and then if that fails by advancing the starting offset (see + below) and trying an ordinary match again. + + The subject string is passed as a pointer in \fIsubject\fR, a length in + \fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern + string, it may contain binary zero characters. When the starting offset is + zero, the search for a match starts at the beginning of the subject, and this + is by far the most common case. + + A non-zero starting offset is useful when searching for another match in the + same subject by calling \fBpcre_exec()\fR again after a previous success. + Setting \fIstartoffset\fR differs from just passing over a shortened string and + setting PCRE_NOTBOL in the case of a pattern that begins with any kind of + lookbehind. For example, consider the pattern + + \\Biss\\B + + which finds occurrences of "iss" in the middle of words. (\\B matches only if + the current position in the subject is not a word boundary.) When applied to + the string "Mississipi" the first call to \fBpcre_exec()\fR finds the first + occurrence. If \fBpcre_exec()\fR is called again with just the remainder of the + subject, namely "issipi", it does not match, because \\B is always false at the + start of the subject, which is deemed to be a word boundary. However, if + \fBpcre_exec()\fR is passed the entire string again, but with \fIstartoffset\fR + set to 4, it finds the second occurrence of "iss" because it is able to look + behind the starting point to discover that it is preceded by a letter. + + If a non-zero starting offset is passed when the pattern is anchored, one + attempt to match at the given offset is tried. This can only succeed if the + pattern does not require the match to be at the start of the subject. + + In general, a pattern matches a certain portion of the subject, and in + addition, further substrings from the subject may be picked out by parts of the + pattern. Following the usage in Jeffrey Friedl's book, this is called + "capturing" in what follows, and the phrase "capturing subpattern" is used for + a fragment of a pattern that picks out a substring. PCRE supports several other + kinds of parenthesized subpattern that do not cause substrings to be captured. + + Captured substrings are returned to the caller via a vector of integer offsets + whose address is passed in \fIovector\fR. The number of elements in the vector + is passed in \fIovecsize\fR. The first two-thirds of the vector is used to pass + back captured substrings, each substring using a pair of integers. The + remaining third of the vector is used as workspace by \fBpcre_exec()\fR while + matching capturing subpatterns, and is not available for passing back + information. The length passed in \fIovecsize\fR should always be a multiple of + three. If it is not, it is rounded down. + + When a match has been successful, information about captured substrings is + returned in pairs of integers, starting at the beginning of \fIovector\fR, and + continuing up to two-thirds of its length at the most. The first element of a + pair is set to the offset of the first character in a substring, and the second + is set to the offset of the first character after the end of a substring. The + first pair, \fIovector[0]\fR and \fIovector[1]\fR, identify the portion of the + subject string matched by the entire pattern. The next pair is used for the + first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fR + is the number of pairs that have been set. If there are no capturing + subpatterns, the return value from a successful match is 1, indicating that + just the first pair of offsets has been set. + + Some convenience functions are provided for extracting the captured substrings + as separate strings. These are described in the following section. + + It is possible for an capturing subpattern number \fIn+1\fR to match some + part of the subject when subpattern \fIn\fR has not been used at all. For + example, if the string "abc" is matched against the pattern (a|(z))(bc) + subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset + values corresponding to the unused subpattern are set to -1. + + If a capturing subpattern is matched repeatedly, it is the last portion of the + string that it matched that gets returned. + + If the vector is too small to hold all the captured substrings, it is used as + far as possible (up to two-thirds of its length), and the function returns a + value of zero. In particular, if the substring offsets are not of interest, + \fBpcre_exec()\fR may be called with \fIovector\fR passed as NULL and + \fIovecsize\fR as zero. However, if the pattern contains back references and + the \fIovector\fR isn't big enough to remember the related substrings, PCRE has + to get additional memory for use during matching. Thus it is usually advisable + to supply an \fIovector\fR. + + Note that \fBpcre_info()\fR can be used to find out how many capturing + subpatterns there are in a compiled pattern. The smallest size for + \fIovector\fR that will allow for \fIn\fR captured substrings in addition to + the offsets of the substring matched by the whole pattern is (\fIn\fR+1)*3. + + If \fBpcre_exec()\fR fails, it returns a negative number. The following are + defined in the header file: + + PCRE_ERROR_NOMATCH (-1) + + The subject string did not match the pattern. + + PCRE_ERROR_NULL (-2) + + Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was + NULL and \fIovecsize\fR was not zero. + + PCRE_ERROR_BADOPTION (-3) + + An unrecognized bit was set in the \fIoptions\fR argument. + + PCRE_ERROR_BADMAGIC (-4) + + PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch + the case when it is passed a junk pointer. This is the error it gives when the + magic number isn't present. + + PCRE_ERROR_UNKNOWN_NODE (-5) + + While running the pattern match, an unknown item was encountered in the + compiled pattern. This error could be caused by a bug in PCRE or by overwriting + of the compiled pattern. + + PCRE_ERROR_NOMEMORY (-6) + + If a pattern contains back references, but the \fIovector\fR that is passed to + \fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE + gets a block of memory at the start of matching to use for this purpose. If the + call via \fBpcre_malloc()\fR fails, this error is given. The memory is freed at + the end of matching. + + + .SH EXTRACTING CAPTURED SUBSTRINGS + Captured substrings can be accessed directly by using the offsets returned by + \fBpcre_exec()\fR in \fIovector\fR. For convenience, the functions + \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and + \fBpcre_get_substring_list()\fR are provided for extracting captured substrings + as new, separate, zero-terminated strings. A substring that contains a binary + zero is correctly extracted and has a further zero added on the end, but the + result does not, of course, function as a C string. + + The first three arguments are the same for all three functions: \fIsubject\fR + is the subject string which has just been successfully matched, \fIovector\fR + is a pointer to the vector of integer offsets that was passed to + \fBpcre_exec()\fR, and \fIstringcount\fR is the number of substrings that + were captured by the match, including the substring that matched the entire + regular expression. This is the value returned by \fBpcre_exec\fR if it + is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it + ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should + be the size of the vector divided by three. + + The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR + extract a single substring, whose number is given as \fIstringnumber\fR. A + value of zero extracts the substring that matched the entire pattern, while + higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, + the string is placed in \fIbuffer\fR, whose length is given by + \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is + obtained via \fBpcre_malloc\fR, and its address is returned via + \fIstringptr\fR. The yield of the function is the length of the string, not + including the terminating zero, or one of + + PCRE_ERROR_NOMEMORY (-6) + + The buffer was too small for \fBpcre_copy_substring()\fR, or the attempt to get + memory failed for \fBpcre_get_substring()\fR. + + PCRE_ERROR_NOSUBSTRING (-7) + + There is no substring whose number is \fIstringnumber\fR. + + The \fBpcre_get_substring_list()\fR function extracts all available substrings + and builds a list of pointers to them. All this is done in a single block of + memory which is obtained via \fBpcre_malloc\fR. The address of the memory block + is returned via \fIlistptr\fR, which is also the start of the list of string + pointers. The end of the list is marked by a NULL pointer. The yield of the + function is zero if all went well, or + + PCRE_ERROR_NOMEMORY (-6) + + if the attempt to get the memory block failed. + + When any of these functions encounter a substring that is unset, which can + happen when capturing subpattern number \fIn+1\fR matches some part of the + subject, but subpattern \fIn\fR has not been used at all, they return an empty + string. This can be distinguished from a genuine zero-length substring by + inspecting the appropriate offset in \fIovector\fR, which is negative for unset + substrings. + + The two convenience functions \fBpcre_free_substring()\fR and + \fBpcre_free_substring_list()\fR can be used to free the memory returned by + a previous call of \fBpcre_get_substring()\fR or + \fBpcre_get_substring_list()\fR, respectively. They do nothing more than call + the function pointed to by \fBpcre_free\fR, which of course could be called + directly from a C program. However, PCRE is used in some situations where it is + linked via a special interface to another programming language which cannot use + \fBpcre_free\fR directly; it is for these cases that the functions are + provided. + + + .SH LIMITATIONS + There are some size limitations in PCRE but it is hoped that they will never in + practice be relevant. + The maximum length of a compiled pattern is 65539 (sic) bytes. + All values in repeating quantifiers must be less than 65536. + The maximum number of capturing subpatterns is 99. + The maximum number of all parenthesized subpatterns, including capturing + subpatterns, assertions, and other types of subpattern, is 200. + + The maximum length of a subject string is the largest positive number that an + integer variable can hold. However, PCRE uses recursion to handle subpatterns + and indefinite repetition. This means that the available stack space may limit + the size of a subject string that can be processed by certain patterns. + + + .SH DIFFERENCES FROM PERL + The differences described here are with respect to Perl 5.005. + + 1. By default, a whitespace character is any character that the C library + function \fBisspace()\fR recognizes, though it is possible to compile PCRE with + alternative character type tables. Normally \fBisspace()\fR matches space, + formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 + no longer includes vertical tab in its set of whitespace characters. The \\v + escape that was in the Perl documentation for a long time was never in fact + recognized. However, the character itself was treated as whitespace at least + up to 5.002. In 5.004 and 5.005 it does not match \\s. + + 2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits + them, but they do not mean what you might think. For example, (?!a){3} does + not assert that the next three characters are not "a". It just asserts that the + next character is not "a" three times. + + 3. Capturing subpatterns that occur inside negative lookahead assertions are + counted, but their entries in the offsets vector are never set. Perl sets its + numerical variables from any such patterns that are matched before the + assertion fails to match something (thereby succeeding), but only if the + negative lookahead assertion contains just one branch. + + 4. Though binary zero characters are supported in the subject string, they are + not allowed in a pattern string because it is passed as a normal C string, + terminated by zero. The escape sequence "\\0" can be used in the pattern to + represent a binary zero. + + 5. The following Perl escape sequences are not supported: \\l, \\u, \\L, \\U, + \\E, \\Q. In fact these are implemented by Perl's general string-handling and + are not part of its pattern matching engine. + + 6. The Perl \\G assertion is not supported as it is not relevant to single + pattern matches. + + 7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) + constructions. However, there is some experimental support for recursive + patterns using the non-Perl item (?R). + + 8. There are at the time of writing some oddities in Perl 5.005_02 concerned + with the settings of captured strings when part of a pattern is repeated. For + example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value + "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if + the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. + + In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the + future Perl changes to a consistent state that is different, PCRE may change to + follow. + + 9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern + /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. + However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset. + + 10. PCRE provides some extensions to the Perl regular expression facilities: + + (a) Although lookbehind assertions must match fixed length strings, each + alternative branch of a lookbehind assertion can match a different length of + string. Perl 5.005 requires them all to have the same length. + + (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta- + character matches only at the very end of the string. + + (c) If PCRE_EXTRA is set, a backslash followed by a letter with no special + meaning is faulted. + + (d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is + inverted, that is, by default they are not greedy, but if followed by a + question mark they are. + + (e) PCRE_ANCHORED can be used to force a pattern to be tried only at the start + of the subject. + + (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for + \fBpcre_exec()\fR have no Perl equivalents. + + (g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do + this using the (?p{code}) construct, which PCRE cannot of course support.) + + + .SH REGULAR EXPRESSION DETAILS + The syntax and semantics of the regular expressions supported by PCRE are + described below. Regular expressions are also described in the Perl + documentation and in a number of other books, some of which have copious + examples. Jeffrey Friedl's "Mastering Regular Expressions", published by + O'Reilly (ISBN 1-56592-257), covers them in great detail. + + The description here is intended as reference documentation. The basic + operation of PCRE is on strings of bytes. However, there is the beginnings of + some support for UTF-8 character strings. To use this support you must + configure PCRE to include it, and then call \fBpcre_compile()\fR with the + PCRE_UTF8 option. How this affects the pattern matching is described in the + final section of this document. + + A regular expression is a pattern that is matched against a subject string from + left to right. Most characters stand for themselves in a pattern, and match the + corresponding characters in the subject. As a trivial example, the pattern + + The quick brown fox + + matches a portion of a subject string that is identical to itself. The power of + regular expressions comes from the ability to include alternatives and + repetitions in the pattern. These are encoded in the pattern by the use of + \fImeta-characters\fR, which do not stand for themselves but instead are + interpreted in some special way. + + There are two different sets of meta-characters: those that are recognized + anywhere in the pattern except within square brackets, and those that are + recognized in square brackets. Outside square brackets, the meta-characters are + as follows: + + \\ general escape character with several uses + ^ assert start of subject (or line, in multiline mode) + $ assert end of subject (or line, in multiline mode) + . match any character except newline (by default) + [ start character class definition + | start of alternative branch + ( start subpattern + ) end subpattern + ? extends the meaning of ( + also 0 or 1 quantifier + also quantifier minimizer + * 0 or more quantifier + + 1 or more quantifier + { start min/max quantifier + + Part of a pattern that is in square brackets is called a "character class". In + a character class the only meta-characters are: + + \\ general escape character + ^ negate the class, but only if the first character + - indicates character range + ] terminates the character class + + The following sections describe the use of each of the meta-characters. + + + .SH BACKSLASH + The backslash character has several uses. Firstly, if it is followed by a + non-alphameric character, it takes away any special meaning that character may + have. This use of backslash as an escape character applies both inside and + outside character classes. + + For example, if you want to match a "*" character, you write "\\*" in the + pattern. This applies whether or not the following character would otherwise be + interpreted as a meta-character, so it is always safe to precede a + non-alphameric with "\\" to specify that it stands for itself. In particular, + if you want to match a backslash, you write "\\\\". + + If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the + pattern (other than in a character class) and characters between a "#" outside + a character class and the next newline character are ignored. An escaping + backslash can be used to include a whitespace or "#" character as part of the + pattern. + + A second use of backslash provides a way of encoding non-printing characters + in patterns in a visible manner. There is no restriction on the appearance of + non-printing characters, apart from the binary zero that terminates a pattern, + but when a pattern is being prepared by text editing, it is usually easier to + use one of the following escape sequences than the binary character it + represents: + + \\a alarm, that is, the BEL character (hex 07) + \\cx "control-x", where x is any character + \\e escape (hex 1B) + \\f formfeed (hex 0C) + \\n newline (hex 0A) + \\r carriage return (hex 0D) + \\t tab (hex 09) + \\xhh character with hex code hh + \\ddd character with octal code ddd, or backreference + + The precise effect of "\\cx" is as follows: if "x" is a lower case letter, it + is converted to upper case. Then bit 6 of the character (hex 40) is inverted. + Thus "\\cz" becomes hex 1A, but "\\c{" becomes hex 3B, while "\\c;" becomes hex + 7B. + + After "\\x", up to two hexadecimal digits are read (letters can be in upper or + lower case). + + After "\\0" up to two further octal digits are read. In both cases, if there + are fewer than two digits, just those that are present are used. Thus the + sequence "\\0\\x\\07" specifies two binary zeros followed by a BEL character. + Make sure you supply two digits after the initial zero if the character that + follows is itself an octal digit. + + The handling of a backslash followed by a digit other than 0 is complicated. + Outside a character class, PCRE reads it and any following digits as a decimal + number. If the number is less than 10, or if there have been at least that many + previous capturing left parentheses in the expression, the entire sequence is + taken as a \fIback reference\fR. A description of how this works is given + later, following the discussion of parenthesized subpatterns. + + Inside a character class, or if the decimal number is greater than 9 and there + have not been that many capturing subpatterns, PCRE re-reads up to three octal + digits following the backslash, and generates a single byte from the least + significant 8 bits of the value. Any subsequent digits stand for themselves. + For example: + + \\040 is another way of writing a space + \\40 is the same, provided there are fewer than 40 + previous capturing subpatterns + \\7 is always a back reference + \\11 might be a back reference, or another way of + writing a tab + \\011 is always a tab + \\0113 is a tab followed by the character "3" + \\113 is the character with octal code 113 (since there + can be no more than 99 back references) + \\377 is a byte consisting entirely of 1 bits + \\81 is either a back reference, or a binary zero + followed by the two characters "8" and "1" + + Note that octal values of 100 or greater must not be introduced by a leading + zero, because no more than three octal digits are ever read. + + All the sequences that define a single byte value can be used both inside and + outside character classes. In addition, inside a character class, the sequence + "\\b" is interpreted as the backspace character (hex 08). Outside a character + class it has a different meaning (see below). + + The third use of backslash is for specifying generic character types: + + \\d any decimal digit + \\D any character that is not a decimal digit + \\s any whitespace character + \\S any character that is not a whitespace character + \\w any "word" character + \\W any "non-word" character + + Each pair of escape sequences partitions the complete set of characters into + two disjoint sets. Any given character matches one, and only one, of each pair. + + A "word" character is any letter or digit or the underscore character, that is, + any character which can be part of a Perl "word". The definition of letters and + digits is controlled by PCRE's character tables, and may vary if locale- + specific matching is taking place (see "Locale support" above). For example, in + the "fr" (French) locale, some character codes greater than 128 are used for + accented letters, and these are matched by \\w. + + These character type sequences can appear both inside and outside character + classes. They each match one character of the appropriate type. If the current + matching point is at the end of the subject string, all of them fail, since + there is no character to match. + + The fourth use of backslash is for certain simple assertions. An assertion + specifies a condition that has to be met at a particular point in a match, + without consuming any characters from the subject string. The use of + subpatterns for more complicated assertions is described below. The backslashed + assertions are + + \\b word boundary + \\B not a word boundary + \\A start of subject (independent of multiline mode) + \\Z end of subject or newline at end (independent of multiline mode) + \\z end of subject (independent of multiline mode) + + These assertions may not appear in character classes (but note that "\\b" has a + different meaning, namely the backspace character, inside a character class). + + A word boundary is a position in the subject string where the current character + and the previous character do not both match \\w or \\W (i.e. one matches + \\w and the other matches \\W), or the start or end of the string if the + first or last character matches \\w, respectively. + + The \\A, \\Z, and \\z assertions differ from the traditional circumflex and + dollar (described below) in that they only ever match at the very start and end + of the subject string, whatever options are set. They are not affected by the + PCRE_NOTBOL or PCRE_NOTEOL options. If the \fIstartoffset\fR argument of + \fBpcre_exec()\fR is non-zero, \\A can never match. The difference between \\Z + and \\z is that \\Z matches before a newline that is the last character of the + string as well as at the end of the string, whereas \\z matches only at the + end. + + + .SH CIRCUMFLEX AND DOLLAR + Outside a character class, in the default matching mode, the circumflex + character is an assertion which is true only if the current matching point is + at the start of the subject string. If the \fIstartoffset\fR argument of + \fBpcre_exec()\fR is non-zero, circumflex can never match. Inside a character + class, circumflex has an entirely different meaning (see below). + + Circumflex need not be the first character of the pattern if a number of + alternatives are involved, but it should be the first thing in each alternative + in which it appears if the pattern is ever to match that branch. If all + possible alternatives start with a circumflex, that is, if the pattern is + constrained to match only at the start of the subject, it is said to be an + "anchored" pattern. (There are also other constructs that can cause a pattern + to be anchored.) + + A dollar character is an assertion which is true only if the current matching + point is at the end of the subject string, or immediately before a newline + character that is the last character in the string (by default). Dollar need + not be the last character of the pattern if a number of alternatives are + involved, but it should be the last item in any branch in which it appears. + Dollar has no special meaning in a character class. + + The meaning of dollar can be changed so that it matches only at the very end of + the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching + time. This does not affect the \\Z assertion. + + The meanings of the circumflex and dollar characters are changed if the + PCRE_MULTILINE option is set. When this is the case, they match immediately + after and immediately before an internal "\\n" character, respectively, in + addition to matching at the start and end of the subject string. For example, + the pattern /^abc$/ matches the subject string "def\\nabc" in multiline mode, + but not otherwise. Consequently, patterns that are anchored in single line mode + because all branches start with "^" are not anchored in multiline mode, and a + match for circumflex is possible when the \fIstartoffset\fR argument of + \fBpcre_exec()\fR is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if + PCRE_MULTILINE is set. + + Note that the sequences \\A, \\Z, and \\z can be used to match the start and + end of the subject in both modes, and if all branches of a pattern start with + \\A is it always anchored, whether PCRE_MULTILINE is set or not. + + + .SH FULL STOP (PERIOD, DOT) + Outside a character class, a dot in the pattern matches any one character in + the subject, including a non-printing character, but not (by default) newline. + If the PCRE_DOTALL option is set, dots match newlines as well. The handling of + dot is entirely independent of the handling of circumflex and dollar, the only + relationship being that they both involve newline characters. Dot has no + special meaning in a character class. + + + .SH SQUARE BRACKETS + An opening square bracket introduces a character class, terminated by a closing + square bracket. A closing square bracket on its own is not special. If a + closing square bracket is required as a member of the class, it should be the + first data character in the class (after an initial circumflex, if present) or + escaped with a backslash. + + A character class matches a single character in the subject; the character must + be in the set of characters defined by the class, unless the first character in + the class is a circumflex, in which case the subject character must not be in + the set defined by the class. If a circumflex is actually required as a member + of the class, ensure it is not the first character, or escape it with a + backslash. + + For example, the character class [aeiou] matches any lower case vowel, while + [^aeiou] matches any character that is not a lower case vowel. Note that a + circumflex is just a convenient notation for specifying the characters which + are in the class by enumerating those that are not. It is not an assertion: it + still consumes a character from the subject string, and fails if the current + pointer is at the end of the string. + + When caseless matching is set, any letters in a class represent both their + upper case and lower case versions, so for example, a caseless [aeiou] matches + "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a + caseful version would. + + The newline character is never treated in any special way in character classes, + whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class + such as [^a] will always match a newline. + + The minus (hyphen) character can be used to specify a range of characters in a + character class. For example, [d-m] matches any letter between d and m, + inclusive. If a minus character is required in a class, it must be escaped with + a backslash or appear in a position where it cannot be interpreted as + indicating a range, typically as the first or last character in the class. + + It is not possible to have the literal character "]" as the end character of a + range. A pattern such as [W-]46] is interpreted as a class of two characters + ("W" and "-") followed by a literal string "46]", so it would match "W46]" or + "-46]". However, if the "]" is escaped with a backslash it is interpreted as + the end of range, so [W-\\]46] is interpreted as a single class containing a + range followed by two separate characters. The octal or hexadecimal + representation of "]" can also be used to end a range. + + Ranges operate in ASCII collating sequence. They can also be used for + characters specified numerically, for example [\\000-\\037]. If a range that + includes letters is used when caseless matching is set, it matches the letters + in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched + caselessly, and if character tables for the "fr" locale are in use, + [\\xc8-\\xcb] matches accented E characters in both cases. + + The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a + character class, and add the characters that they match to the class. For + example, [\\dABCDEF] matches any hexadecimal digit. A circumflex can + conveniently be used with the upper case character types to specify a more + restricted set of characters than the matching lower case type. For example, + the class [^\\W_] matches any letter or digit, but not underscore. + + All non-alphameric characters other than \\, -, ^ (at the start) and the + terminating ] are non-special in character classes, but it does no harm if they + are escaped. + + + .SH POSIX CHARACTER CLASSES + Perl 5.6 (not yet released at the time of writing) is going to support the + POSIX notation for character classes, which uses names enclosed by [: and :] + within the enclosing square brackets. PCRE supports this notation. For example, + + [01[:alpha:]%] + + matches "0", "1", any alphabetic character, or "%". The supported class names + are + + alnum letters and digits + alpha letters + ascii character codes 0 - 127 + cntrl control characters + digit decimal digits (same as \\d) + graph printing characters, excluding space + lower lower case letters + print printing characters, including space + punct printing characters, excluding letters and digits + space white space (same as \\s) + upper upper case letters + word "word" characters (same as \\w) + xdigit hexadecimal digits + + The names "ascii" and "word" are Perl extensions. Another Perl extension is + negation, which is indicated by a ^ character after the colon. For example, + + [12[:^digit:]] + + matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX + syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not + supported, and an error is given if they are encountered. + + + .SH VERTICAL BAR + Vertical bar characters are used to separate alternative patterns. For example, + the pattern + + gilbert|sullivan + + matches either "gilbert" or "sullivan". Any number of alternatives may appear, + and an empty alternative is permitted (matching the empty string). + The matching process tries each alternative in turn, from left to right, + and the first one that succeeds is used. If the alternatives are within a + subpattern (defined below), "succeeds" means matching the rest of the main + pattern as well as the alternative in the subpattern. + + + .SH INTERNAL OPTION SETTING + The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED + can be changed from within the pattern by a sequence of Perl option letters + enclosed between "(?" and ")". The option letters are + + i for PCRE_CASELESS + m for PCRE_MULTILINE + s for PCRE_DOTALL + x for PCRE_EXTENDED + + For example, (?im) sets caseless, multiline matching. It is also possible to + unset these options by preceding the letter with a hyphen, and a combined + setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and + PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also + permitted. If a letter appears both before and after the hyphen, the option is + unset. + + The scope of these option changes depends on where in the pattern the setting + occurs. For settings that are outside any subpattern (defined below), the + effect is the same as if the options were set or unset at the start of + matching. The following patterns all behave in exactly the same way: + + (?i)abc + a(?i)bc + ab(?i)c + abc(?i) + + which in turn is the same as compiling the pattern abc with PCRE_CASELESS set. + In other words, such "top level" settings apply to the whole pattern (unless + there are other changes inside subpatterns). If there is more than one setting + of the same option at top level, the rightmost setting is used. + + If an option change occurs inside a subpattern, the effect is different. This + is a change of behaviour in Perl 5.005. An option change inside a subpattern + affects only that part of the subpattern that follows it, so + + (a(?i)b)c + + matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used). + By this means, options can be made to have different settings in different + parts of the pattern. Any changes made in one alternative do carry on + into subsequent branches within the same subpattern. For example, + + (a(?i)b|c) + + matches "ab", "aB", "c", and "C", even though when matching "C" the first + branch is abandoned before the option setting. This is because the effects of + option settings happen at compile time. There would be some very weird + behaviour otherwise. + + The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the + same way as the Perl-compatible options by using the characters U and X + respectively. The (?X) flag setting is special in that it must always occur + earlier in the pattern than any of the additional features it turns on, even + when it is at top level. It is best put at the start. + + + .SH SUBPATTERNS + Subpatterns are delimited by parentheses (round brackets), which can be nested. + Marking part of a pattern as a subpattern does two things: + + 1. It localizes a set of alternatives. For example, the pattern + + cat(aract|erpillar|) + + matches one of the words "cat", "cataract", or "caterpillar". Without the + parentheses, it would match "cataract", "erpillar" or the empty string. + + 2. It sets up the subpattern as a capturing subpattern (as defined above). + When the whole pattern matches, that portion of the subject string that matched + the subpattern is passed back to the caller via the \fIovector\fR argument of + \fBpcre_exec()\fR. Opening parentheses are counted from left to right (starting + from 1) to obtain the numbers of the capturing subpatterns. + + For example, if the string "the red king" is matched against the pattern + + the ((red|white) (king|queen)) + + the captured substrings are "red king", "red", and "king", and are numbered 1, + 2, and 3. + + The fact that plain parentheses fulfil two functions is not always helpful. + There are often times when a grouping subpattern is required without a + capturing requirement. If an opening parenthesis is followed by "?:", the + subpattern does not do any capturing, and is not counted when computing the + number of any subsequent capturing subpatterns. For example, if the string "the + white queen" is matched against the pattern + + the ((?:red|white) (king|queen)) + + the captured substrings are "white queen" and "queen", and are numbered 1 and + 2. The maximum number of captured substrings is 99, and the maximum number of + all subpatterns, both capturing and non-capturing, is 200. + + As a convenient shorthand, if any option settings are required at the start of + a non-capturing subpattern, the option letters may appear between the "?" and + the ":". Thus the two patterns + + (?i:saturday|sunday) + (?:(?i)saturday|sunday) + + match exactly the same set of strings. Because alternative branches are tried + from left to right, and options are not reset until the end of the subpattern + is reached, an option setting in one branch does affect subsequent branches, so + the above patterns match "SUNDAY" as well as "Saturday". + + + .SH REPETITION + Repetition is specified by quantifiers, which can follow any of the following + items: + + a single character, possibly escaped + the . metacharacter + a character class + a back reference (see next section) + a parenthesized subpattern (unless it is an assertion - see below) + + The general repetition quantifier specifies a minimum and maximum number of + permitted matches, by giving the two numbers in curly brackets (braces), + separated by a comma. The numbers must be less than 65536, and the first must + be less than or equal to the second. For example: + + z{2,4} + + matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special + character. If the second number is omitted, but the comma is present, there is + no upper limit; if the second number and the comma are both omitted, the + quantifier specifies an exact number of required matches. Thus + + [aeiou]{3,} + + matches at least 3 successive vowels, but may match many more, while + + \\d{8} + + matches exactly 8 digits. An opening curly bracket that appears in a position + where a quantifier is not allowed, or one that does not match the syntax of a + quantifier, is taken as a literal character. For example, {,6} is not a + quantifier, but a literal string of four characters. + + The quantifier {0} is permitted, causing the expression to behave as if the + previous item and the quantifier were not present. + + For convenience (and historical compatibility) the three most common + quantifiers have single-character abbreviations: + + * is equivalent to {0,} + + is equivalent to {1,} + ? is equivalent to {0,1} + + It is possible to construct infinite loops by following a subpattern that can + match no characters with a quantifier that has no upper limit, for example: + + (a?)* + + Earlier versions of Perl and PCRE used to give an error at compile time for + such patterns. However, because there are cases where this can be useful, such + patterns are now accepted, but if any repetition of the subpattern does in fact + match no characters, the loop is forcibly broken. + + By default, the quantifiers are "greedy", that is, they match as much as + possible (up to the maximum number of permitted times), without causing the + rest of the pattern to fail. The classic example of where this gives problems + is in trying to match comments in C programs. These appear between the + sequences /* and */ and within the sequence, individual * and / characters may + appear. An attempt to match C comments by applying the pattern + + /\\*.*\\*/ + + to the string + + /* first command */ not comment /* second comment */ + + fails, because it matches the entire string due to the greediness of the .* + item. + + However, if a quantifier is followed by a question mark, it ceases to be + greedy, and instead matches the minimum number of times possible, so the + pattern + + /\\*.*?\\*/ + + does the right thing with the C comments. The meaning of the various + quantifiers is not otherwise changed, just the preferred number of matches. + Do not confuse this use of question mark with its use as a quantifier in its + own right. Because it has two uses, it can sometimes appear doubled, as in + + \\d??\\d + + which matches one digit by preference, but can match two if that is the only + way the rest of the pattern matches. + + If the PCRE_UNGREEDY option is set (an option which is not available in Perl), + the quantifiers are not greedy by default, but individual ones can be made + greedy by following them with a question mark. In other words, it inverts the + default behaviour. + + When a parenthesized subpattern is quantified with a minimum repeat count that + is greater than 1 or with a limited maximum, more store is required for the + compiled pattern, in proportion to the size of the minimum or maximum. + + If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent + to Perl's /s) is set, thus allowing the . to match newlines, the pattern is + implicitly anchored, because whatever follows will be tried against every + character position in the subject string, so there is no point in retrying the + overall match at any position after the first. PCRE treats such a pattern as + though it were preceded by \\A. In cases where it is known that the subject + string contains no newlines, it is worth setting PCRE_DOTALL when the pattern + begins with .* in order to obtain this optimization, or alternatively using ^ + to indicate anchoring explicitly. + + When a capturing subpattern is repeated, the value captured is the substring + that matched the final iteration. For example, after + + (tweedle[dume]{3}\\s*)+ + + has matched "tweedledum tweedledee" the value of the captured substring is + "tweedledee". However, if there are nested capturing subpatterns, the + corresponding captured values may have been set in previous iterations. For + example, after + + /(a|(b))+/ + + matches "aba" the value of the second captured substring is "b". + + + .SH BACK REFERENCES + Outside a character class, a backslash followed by a digit greater than 0 (and + possibly further digits) is a back reference to a capturing subpattern earlier + (i.e. to its left) in the pattern, provided there have been that many previous + capturing left parentheses. + + However, if the decimal number following the backslash is less than 10, it is + always taken as a back reference, and causes an error only if there are not + that many capturing left parentheses in the entire pattern. In other words, the + parentheses that are referenced need not be to the left of the reference for + numbers less than 10. See the section entitled "Backslash" above for further + details of the handling of digits following a backslash. + + A back reference matches whatever actually matched the capturing subpattern in + the current subject string, rather than anything matching the subpattern + itself. So the pattern + + (sens|respons)e and \\1ibility + + matches "sense and sensibility" and "response and responsibility", but not + "sense and responsibility". If caseful matching is in force at the time of the + back reference, the case of letters is relevant. For example, + + ((?i)rah)\\s+\\1 + + matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original + capturing subpattern is matched caselessly. + + There may be more than one back reference to the same subpattern. If a + subpattern has not actually been used in a particular match, any back + references to it always fail. For example, the pattern + + (a|(bc))\\2 + + always fails if it starts to match "a" rather than "bc". Because there may be + up to 99 back references, all digits following the backslash are taken + as part of a potential back reference number. If the pattern continues with a + digit character, some delimiter must be used to terminate the back reference. + If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty + comment can be used. + + A back reference that occurs inside the parentheses to which it refers fails + when the subpattern is first used, so, for example, (a\\1) never matches. + However, such references can be useful inside repeated subpatterns. For + example, the pattern + + (a|b\\1)+ + + matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of + the subpattern, the back reference matches the character string corresponding + to the previous iteration. In order for this to work, the pattern must be such + that the first iteration does not need to match the back reference. This can be + done using alternation, as in the example above, or by a quantifier with a + minimum of zero. + + + .SH ASSERTIONS + An assertion is a test on the characters following or preceding the current + matching point that does not actually consume any characters. The simple + assertions coded as \\b, \\B, \\A, \\Z, \\z, ^ and $ are described above. More + complicated assertions are coded as subpatterns. There are two kinds: those + that look ahead of the current position in the subject string, and those that + look behind it. + + An assertion subpattern is matched in the normal way, except that it does not + cause the current matching position to be changed. Lookahead assertions start + with (?= for positive assertions and (?! for negative assertions. For example, + + \\w+(?=;) + + matches a word followed by a semicolon, but does not include the semicolon in + the match, and + + foo(?!bar) + + matches any occurrence of "foo" that is not followed by "bar". Note that the + apparently similar pattern + + (?!foo)bar + + does not find an occurrence of "bar" that is preceded by something other than + "foo"; it finds any occurrence of "bar" whatsoever, because the assertion + (?!foo) is always true when the next three characters are "bar". A + lookbehind assertion is needed to achieve this effect. + + Lookbehind assertions start with (?<= for positive assertions and (?<! for + negative assertions. For example, + + (?<!foo)bar + + does find an occurrence of "bar" that is not preceded by "foo". The contents of + a lookbehind assertion are restricted such that all the strings it matches must + have a fixed length. However, if there are several alternatives, they do not + all have to have the same fixed length. Thus + + (?<=bullock|donkey) + + is permitted, but + + (?<!dogs?|cats?) + + causes an error at compile time. Branches that match different length strings + are permitted only at the top level of a lookbehind assertion. This is an + extension compared with Perl 5.005, which requires all branches to match the + same length of string. An assertion such as + + (?<=ab(c|de)) + + is not permitted, because its single top-level branch can match two different + lengths, but it is acceptable if rewritten to use two top-level branches: + + (?<=abc|abde) + + The implementation of lookbehind assertions is, for each alternative, to + temporarily move the current position back by the fixed width and then try to + match. If there are insufficient characters before the current position, the + match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns + can be particularly useful for matching at the ends of strings; an example is + given at the end of the section on once-only subpatterns. + + Several assertions (of any sort) may occur in succession. For example, + + (?<=\\d{3})(?<!999)foo + + matches "foo" preceded by three digits that are not "999". Notice that each of + the assertions is applied independently at the same point in the subject + string. First there is a check that the previous three characters are all + digits, and then there is a check that the same three characters are not "999". + This pattern does \fInot\fR match "foo" preceded by six characters, the first + of which are digits and the last three of which are not "999". For example, it + doesn't match "123abcfoo". A pattern to do that is + + (?<=\\d{3}...)(?<!999)foo + + This time the first assertion looks at the preceding six characters, checking + that the first three are digits, and then the second assertion checks that the + preceding three characters are not "999". + + Assertions can be nested in any combination. For example, + + (?<=(?<!foo)bar)baz + + matches an occurrence of "baz" that is preceded by "bar" which in turn is not + preceded by "foo", while + + (?<=\\d{3}(?!999)...)foo + + is another pattern which matches "foo" preceded by three digits and any three + characters that are not "999". + + Assertion subpatterns are not capturing subpatterns, and may not be repeated, + because it makes no sense to assert the same thing several times. If any kind + of assertion contains capturing subpatterns within it, these are counted for + the purposes of numbering the capturing subpatterns in the whole pattern. + However, substring capturing is carried out only for positive assertions, + because it does not make sense for negative assertions. + + Assertions count towards the maximum of 200 parenthesized subpatterns. + + + .SH ONCE-ONLY SUBPATTERNS + With both maximizing and minimizing repetition, failure of what follows + normally causes the repeated item to be re-evaluated to see if a different + number of repeats allows the rest of the pattern to match. Sometimes it is + useful to prevent this, either to change the nature of the match, or to cause + it fail earlier than it otherwise might, when the author of the pattern knows + there is no point in carrying on. + + Consider, for example, the pattern \\d+foo when applied to the subject line + + 123456bar + + After matching all 6 digits and then failing to match "foo", the normal + action of the matcher is to try again with only 5 digits matching the \\d+ + item, and then with 4, and so on, before ultimately failing. Once-only + subpatterns provide the means for specifying that once a portion of the pattern + has matched, it is not to be re-evaluated in this way, so the matcher would + give up immediately on failing to match "foo" the first time. The notation is + another kind of special parenthesis, starting with (?> as in this example: + + (?>\\d+)bar + + This kind of parenthesis "locks up" the part of the pattern it contains once + it has matched, and a failure further into the pattern is prevented from + backtracking into it. Backtracking past it to previous items, however, works as + normal. + + An alternative description is that a subpattern of this type matches the string + of characters that an identical standalone pattern would match, if anchored at + the current point in the subject string. + + Once-only subpatterns are not capturing subpatterns. Simple cases such as the + above example can be thought of as a maximizing repeat that must swallow + everything it can. So, while both \\d+ and \\d+? are prepared to adjust the + number of digits they match in order to make the rest of the pattern match, + (?>\\d+) can only match an entire sequence of digits. + + This construction can of course contain arbitrarily complicated subpatterns, + and it can be nested. + + Once-only subpatterns can be used in conjunction with lookbehind assertions to + specify efficient matching at the end of the subject string. Consider a simple + pattern such as + + abcd$ + + when applied to a long string which does not match. Because matching proceeds + from left to right, PCRE will look for each "a" in the subject and then see if + what follows matches the rest of the pattern. If the pattern is specified as + + ^.*abcd$ + + the initial .* matches the entire string at first, but when this fails (because + there is no following "a"), it backtracks to match all but the last character, + then all but the last two characters, and so on. Once again the search for "a" + covers the entire string, from right to left, so we are no better off. However, + if the pattern is written as + + ^(?>.*)(?<=abcd) + + there can be no backtracking for the .* item; it can match only the entire + string. The subsequent lookbehind assertion does a single test on the last four + characters. If it fails, the match fails immediately. For long strings, this + approach makes a significant difference to the processing time. + + When a pattern contains an unlimited repeat inside a subpattern that can itself + be repeated an unlimited number of times, the use of a once-only subpattern is + the only way to avoid some failing matches taking a very long time indeed. + The pattern + + (\\D+|<\\d+>)*[!?] + + matches an unlimited number of substrings that either consist of non-digits, or + digits enclosed in <>, followed by either ! or ?. When it matches, it runs + quickly. However, if it is applied to + + aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + + it takes a long time before reporting failure. This is because the string can + be divided between the two repeats in a large number of ways, and all have to + be tried. (The example used [!?] rather than a single character at the end, + because both PCRE and Perl have an optimization that allows for fast failure + when a single character is used. They remember the last single character that + is required for a match, and fail early if it is not present in the string.) + If the pattern is changed to + + ((?>\\D+)|<\\d+>)*[!?] + + sequences of non-digits cannot be broken, and failure happens quickly. + + + .SH CONDITIONAL SUBPATTERNS + It is possible to cause the matching process to obey a subpattern + conditionally or to choose between two alternative subpatterns, depending on + the result of an assertion, or whether a previous capturing subpattern matched + or not. The two possible forms of conditional subpattern are + + (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) + + If the condition is satisfied, the yes-pattern is used; otherwise the + no-pattern (if present) is used. If there are more than two alternatives in the + subpattern, a compile-time error occurs. + + There are two kinds of condition. If the text between the parentheses consists + of a sequence of digits, the condition is satisfied if the capturing subpattern + of that number has previously matched. Consider the following pattern, which + contains non-significant white space to make it more readable (assume the + PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: + + ( \$ )? [^()]+ (?(1) \$ ) + + The first part matches an optional opening parenthesis, and if that + character is present, sets it as the first captured substring. The second part + matches one or more characters that are not parentheses. The third part is a + conditional subpattern that tests whether the first set of parentheses matched + or not. If they did, that is, if subject started with an opening parenthesis, + the condition is true, and so the yes-pattern is executed and a closing + parenthesis is required. Otherwise, since no-pattern is not present, the + subpattern matches nothing. In other words, this pattern matches a sequence of + non-parentheses, optionally enclosed in parentheses. + + If the condition is not a sequence of digits, it must be an assertion. This may + be a positive or negative lookahead or lookbehind assertion. Consider this + pattern, again containing non-significant white space, and with the two + alternatives on the second line: + + (?(?=[^a-z]*[a-z]) + \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} ) + + The condition is a positive lookahead assertion that matches an optional + sequence of non-letters followed by a letter. In other words, it tests for the + presence of at least one letter in the subject. If a letter is found, the + subject is matched against the first alternative; otherwise it is matched + against the second. This pattern matches strings in one of the two forms + dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. + + + .SH COMMENTS + The sequence (?# marks the start of a comment which continues up to the next + closing parenthesis. Nested parentheses are not permitted. The characters + that make up a comment play no part in the pattern matching at all. + + If the PCRE_EXTENDED option is set, an unescaped # character outside a + character class introduces a comment that continues up to the next newline + character in the pattern. + + + .SH RECURSIVE PATTERNS + Consider the problem of matching a string in parentheses, allowing for + unlimited nested parentheses. Without the use of recursion, the best that can + be done is to use a pattern that matches up to some fixed depth of nesting. It + is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an + experimental facility that allows regular expressions to recurse (amongst other + things). It does this by interpolating Perl code in the expression at run time, + and the code can refer to the expression itself. A Perl pattern to solve the + parentheses problem can be created like this: + + $re = qr{\$ (?: (?>[^()]+) | (?p{$re}) )* \$}x; + + The (?p{...}) item interpolates Perl code at run time, and in this case refers + recursively to the pattern in which it appears. Obviously, PCRE cannot support + the interpolation of Perl code. Instead, the special item (?R) is provided for + the specific case of recursion. This PCRE pattern solves the parentheses + problem (assume the PCRE_EXTENDED option is set so that white space is + ignored): + + \$ ( (?>[^()]+) | (?R) )* \$ + + First it matches an opening parenthesis. Then it matches any number of + substrings which can either be a sequence of non-parentheses, or a recursive + match of the pattern itself (i.e. a correctly parenthesized substring). Finally + there is a closing parenthesis. + + This particular example pattern contains nested unlimited repeats, and so the + use of a once-only subpattern for matching strings of non-parentheses is + important when applying the pattern to strings that do not match. For example, + when it is applied to + + (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() + + it yields "no match" quickly. However, if a once-only subpattern is not used, + the match runs for a very long time indeed because there are so many different + ways the + and * repeats can carve up the subject, and all have to be tested + before failure can be reported. + + The values set for any capturing subpatterns are those from the outermost level + of the recursion at which the subpattern value is set. If the pattern above is + matched against + + (ab(cd)ef) + + the value for the capturing parentheses is "ef", which is the last value taken + on at the top level. If additional parentheses are added, giving + + \$ ( ( (?>[^()]+) | (?R) )* ) \$ + ^ ^ + ^ ^ + the string they capture is "ab(cd)ef", the contents of the top level + parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE + has to obtain extra memory to store data during a recursion, which it does by + using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no + memory can be obtained, it saves data for the first 15 capturing parentheses + only, as there is no way to give an out-of-memory error from within a + recursion. + + + .SH PERFORMANCE + Certain items that may appear in patterns are more efficient than others. It is + more efficient to use a character class like [aeiou] than a set of alternatives + such as (a|e|i|o|u). In general, the simplest construction that provides the + required behaviour is usually the most efficient. Jeffrey Friedl's book + contains a lot of discussion about optimizing regular expressions for efficient + performance. + + When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is + implicitly anchored by PCRE, since it can match only at the start of a subject + string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, + because the . metacharacter does not then match a newline, and if the subject + string contains newlines, the pattern may match from the character immediately + following one of them instead of from the very start. For example, the pattern + + (.*) second + + matches the subject "first\\nand second" (where \\n stands for a newline + character) with the first captured substring being "and". In order to do this, + PCRE has to retry the match starting after every newline in the subject. + + If you are using such a pattern with subject strings that do not contain + newlines, the best performance is obtained by setting PCRE_DOTALL, or starting + the pattern with ^.* to indicate explicit anchoring. That saves PCRE from + having to scan along the subject looking for a newline to restart at. + + Beware of patterns that contain nested indefinite repeats. These can take a + long time to run when applied to a string that does not match. Consider the + pattern fragment + + (a+)* + + This can match "aaaa" in 33 different ways, and this number increases very + rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 + times, and for each of those cases other than 0, the + repeats can match + different numbers of times.) When the remainder of the pattern is such that the + entire match is going to fail, PCRE has in principle to try every possible + variation, and this can take an extremely long time. + + An optimization catches some of the more simple cases such as + + (a+)*b + + where a literal character follows. Before embarking on the standard matching + procedure, PCRE checks that there is a "b" later in the subject string, and if + there is not, it fails the match immediately. However, when there is no + following literal this optimization cannot be used. You can see the difference + by comparing the behaviour of + + (a+)*\\d + + with the pattern above. The former gives a failure almost instantly when + applied to a whole line of "a" characters, whereas the latter takes an + appreciable time with strings longer than about 20 characters. + + + .SH UTF-8 SUPPORT + Starting at release 3.3, PCRE has some support for character strings encoded + in the UTF-8 format. This is incomplete, and is regarded as experimental. In + order to use it, you must configure PCRE to include UTF-8 support in the code, + and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option + flag. When you do this, both the pattern and any subject strings that are + matched against it are treated as UTF-8 strings instead of just strings of + bytes, but only in the cases that are mentioned below. + + If you compile PCRE with UTF-8 support, but do not use it at run time, the + library will be a bit bigger, but the additional run time overhead is limited + to testing the PCRE_UTF8 flag in several places, so should not be very large. + + PCRE assumes that the strings it is given contain valid UTF-8 codes. It does + not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE, + the results are undefined. + + Running with PCRE_UTF8 set causes these changes in the way PCRE works: + + 1. In a pattern, the escape sequence \\x{...}, where the contents of the braces + is a string of hexadecimal digits, is interpreted as a UTF-8 character whose + code number is the given hexadecimal number, for example: \\x{1234}. This + inserts from one to six literal bytes into the pattern, using the UTF-8 + encoding. If a non-hexadecimal digit appears between the braces, the item is + not recognized. + + 2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8 + character if its value is greater than 127. + + 3. Repeat quantifiers are NOT correctly handled if they follow a multibyte + character. For example, \\x{100}* and \\xc3+ do not work. If you want to + repeat such characters, you must enclose them in non-capturing parentheses, + for example (?:\\x{100}), at present. + + 4. The dot metacharacter matches one UTF-8 character instead of a single byte. + + 5. Unlike literal UTF-8 characters, the dot metacharacter followed by a + repeat quantifier does operate correctly on UTF-8 characters instead of + single bytes. + + 4. Although the \\x{...} escape is permitted in a character class, characters + whose values are greater than 255 cannot be included in a class. + + 5. A class is matched against a UTF-8 character instead of just a single byte, + but it can match only characters whose values are less than 256. Characters + with greater values always fail to match a class. + + 6. Repeated classes work correctly on multiple characters. + + 7. Classes containing just a single character whose value is greater than 127 + (but less than 256), for example, [\\x80] or [^\\x{93}], do not work because + these are optimized into single byte matches. In the first case, of course, + the class brackets are just redundant. + + 8. Lookbehind assertions move backwards in the subject by a fixed number of + characters instead of a fixed number of bytes. Simple cases have been tested + to work correctly, but there may be hidden gotchas herein. + + 9. The character types such as \\d and \\w do not work correctly with UTF-8 + characters. They continue to test a single byte. + + 10. Anything not explicitly mentioned here continues to work in bytes rather + than in characters. + + The following UTF-8 features of Perl 5.6 are not implemented: + + 1. The escape sequence \\C to match a single byte. + + 2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. + + .SH AUTHOR + Philip Hazel <ph10@cam.ac.uk> + .br + University Computing Service, + .br + New Museums Site, + .br + Cambridge CB2 3QG, England. + .br + Phone: +44 1223 334714 + + Last updated: 28 August 2000, + .br + the 250th anniversary of the death of J.S. Bach. + .br + Copyright (c) 1997-2000 University of Cambridge.

ossp-pkg/pcre/pcre.c -> 1.5

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,5146 ---- + /************************************************* + * Perl-Compatible Regular Expressions * + *************************************************/ + + /* + This is a library of functions to support regular expressions whose syntax + and semantics are as close as possible to those of the Perl 5 language. See + the file Tech.Notes for some information on the internals. + + Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-2000 University of Cambridge + + ----------------------------------------------------------------------------- + Permission is granted to anyone to use this software for any purpose on any + computer system, and to redistribute it freely, subject to the following + restrictions: + + 1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + 2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + + 3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + + 4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + ----------------------------------------------------------------------------- + */ + + + /* Define DEBUG to get debugging output on stdout. */ + + /* #define DEBUG */ + + /* Use a macro for debugging printing, 'cause that eliminates the use of #ifdef + inline, and there are *still* stupid compilers about that don't like indented + pre-processor statements. I suppose it's only been 10 years... */ + + #ifdef DEBUG + #define DPRINTF(p) printf p + #else + #define DPRINTF(p) /*nothing*/ + #endif + + /* Include the internals header, which itself includes Standard C headers plus + the external pcre header. */ + + #include "pcre_internal.h" + + + /* Allow compilation as C++ source code, should anybody want to do that. */ + + #ifdef __cplusplus + #define class pcre_class + #endif + + + /* Number of items on the nested bracket stacks at compile time. This should + not be set greater than 200. */ + + #define BRASTACK_SIZE 200 + + + /* The number of bytes in a literal character string above which we can't add + any more is different when UTF-8 characters may be encountered. */ + + #ifdef SUPPORT_UTF8 + #define MAXLIT 250 + #else + #define MAXLIT 255 + #endif + + + /* Min and max values for the common repeats; for the maxima, 0 => infinity */ + + static const char rep_min[] = { 0, 0, 1, 1, 0, 0 }; + static const char rep_max[] = { 0, 0, 0, 0, 1, 1 }; + + /* Text forms of OP_ values and things, for debugging (not all used) */ + + #ifdef DEBUG + static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", "Recurse", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" + }; + #endif + + /* Table for handling escaped characters in the range '0'-'z'. Positive returns + are simple data values; negative values are for special things like \d and so + on. Zero means further processing is needed (for things like \x), or the escape + is invalid. */ + + static const short int escapes[] = { + 0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */ + 0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? */ + '@', -ESC_A, -ESC_B, 0, -ESC_D, 0, 0, 0, /* @ - G */ + 0, 0, 0, 0, 0, 0, 0, 0, /* H - O */ + 0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ + 0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */ + '`', 7, -ESC_b, 0, -ESC_d, 27, '\f', 0, /* ` - g */ + 0, 0, 0, 0, 0, 0, '\n', 0, /* h - o */ + 0, 0, '\r', -ESC_s, '\t', 0, 0, -ESC_w, /* p - w */ + 0, 0, -ESC_z /* x - z */ + }; + + /* Tables of names of POSIX character classes and their lengths. The list is + terminated by a zero length entry. The first three must be alpha, upper, lower, + as this is assumed for handling case independence. */ + + static const char *posix_names[] = { + "alpha", "lower", "upper", + "alnum", "ascii", "cntrl", "digit", "graph", + "print", "punct", "space", "word", "xdigit" }; + + static const uschar posix_name_lengths[] = { + 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 }; + + /* Table of class bit maps for each POSIX class; up to three may be combined + to form the class. */ + + static const int posix_class_maps[] = { + cbit_lower, cbit_upper, -1, /* alpha */ + cbit_lower, -1, -1, /* lower */ + cbit_upper, -1, -1, /* upper */ + cbit_digit, cbit_lower, cbit_upper, /* alnum */ + cbit_print, cbit_cntrl, -1, /* ascii */ + cbit_cntrl, -1, -1, /* cntrl */ + cbit_digit, -1, -1, /* digit */ + cbit_graph, -1, -1, /* graph */ + cbit_print, -1, -1, /* print */ + cbit_punct, -1, -1, /* punct */ + cbit_space, -1, -1, /* space */ + cbit_word, -1, -1, /* word */ + cbit_xdigit,-1, -1 /* xdigit */ + }; + + + /* Definition to allow mutual recursion */ + + static BOOL + compile_regex(int, int, int *, uschar **, const uschar **, const char **, + BOOL, int, int *, int *, compile_data *); + + /* Structure for building a chain of data that actually lives on the + stack, for holding the values of the subject pointer at the start of each + subpattern, so as to detect when an empty string has been matched by a + subpattern - to break infinite loops. */ + + typedef struct eptrblock { + struct eptrblock *prev; + const uschar *saved_eptr; + } eptrblock; + + /* Flag bits for the match() function */ + + #define match_condassert 0x01 /* Called to check a condition assertion */ + #define match_isgroup 0x02 /* Set if start of bracketed group */ + + + + /************************************************* + * Global variables * + *************************************************/ + + /* PCRE is thread-clean and doesn't use any global variables in the normal + sense. However, it calls memory allocation and free functions via the two + indirections below, which are can be changed by the caller, but are shared + between all threads. */ + + void *(*pcre_malloc)(size_t) = malloc; + void (*pcre_free)(void *) = free; + + + + /************************************************* + * Macros and tables for character handling * + *************************************************/ + + /* When UTF-8 encoding is being used, a character is no longer just a single + byte. The macros for character handling generate simple sequences when used in + byte-mode, and more complicated ones for UTF-8 characters. */ + + #ifndef SUPPORT_UTF8 + #define GETCHARINC(c, eptr) c = *eptr++; + #define GETCHARLEN(c, eptr, len) c = *eptr; + #define BACKCHAR(eptr) + + #else /* SUPPORT_UTF8 */ + + /* Get the next UTF-8 character, advancing the pointer */ + + #define GETCHARINC(c, eptr) \ + c = *eptr++; \ + if (md->utf8 && (c & 0xc0) == 0xc0) \ + { \ + int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \ + int s = 6 - a; /* Amount to shift next byte */ \ + c &= utf8_table3[a]; /* Low order bits from first byte */ \ + while (a-- > 0) \ + { \ + c |= (*eptr++ & 0x3f) << s; \ + s += 6; \ + } \ + } + + /* Get the next UTF-8 character, not advancing the pointer, setting length */ + + #define GETCHARLEN(c, eptr, len) \ + c = *eptr; \ + len = 1; \ + if (md->utf8 && (c & 0xc0) == 0xc0) \ + { \ + int i; \ + int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \ + int s = 6 - a; /* Amount to shift next byte */ \ + c &= utf8_table3[a]; /* Low order bits from first byte */ \ + for (i = 1; i <= a; i++) \ + { \ + c |= (eptr[i] & 0x3f) << s; \ + s += 6; \ + } \ + len += a; \ + } + + /* If the pointer is not at the start of a character, move it back until + it is. */ + + #define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--; + + #endif + + + + /************************************************* + * Default character tables * + *************************************************/ + + /* A default set of character tables is included in the PCRE binary. Its source + is built by the maketables auxiliary program, which uses the default C ctypes + functions, and put in the file chartables.c. These tables are used by PCRE + whenever the caller of pcre_compile() does not provide an alternate set of + tables. */ + + #include "pcre_chartables.c" + + + #ifdef SUPPORT_UTF8 + /************************************************* + * Tables for UTF-8 support * + *************************************************/ + + /* These are the breakpoints for different numbers of bytes in a UTF-8 + character. */ + + static int utf8_table1[] = { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff}; + + /* These are the indicator bits and the mask for the data bits to set in the + first byte of a character, indexed by the number of additional bytes. */ + + static int utf8_table2[] = { 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}; + static int utf8_table3[] = { 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01}; + + /* Table of the number of extra characters, indexed by the first character + masked with 0x3f. The highest number for a valid UTF-8 character is in fact + 0x3d. */ + + static uschar utf8_table4[] = { + 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, + 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, + 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, + 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 }; + + + /************************************************* + * Convert character value to UTF-8 * + *************************************************/ + + /* This function takes an integer value in the range 0 - 0x7fffffff + and encodes it as a UTF-8 character in 0 to 6 bytes. + + Arguments: + cvalue the character value + buffer pointer to buffer for result - at least 6 bytes long + + Returns: number of characters placed in the buffer + */ + + static int + ord2utf8(int cvalue, uschar *buffer) + { + register int i, j; + for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++) + if (cvalue <= utf8_table1[i]) break; + *buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]); + cvalue >>= 6 - i; + for (j = 0; j < i; j++) + { + *buffer++ = 0x80 | (cvalue & 0x3f); + cvalue >>= 6; + } + return i + 1; + } + #endif + + + + /************************************************* + >>>>>>> ../arc/pcre-3.3/pcre.c + * Return version string * + *************************************************/ + + #define STRING(a) # a + #define XSTRING(s) STRING(s) + + const char * + pcre_version(void) + { + return XSTRING(PCRE_MAJOR) "." XSTRING(PCRE_MINOR) " " XSTRING(PCRE_DATE); + } + + + + + /************************************************* + * (Obsolete) Return info about compiled pattern * + *************************************************/ + + /* This is the original "info" function. It picks potentially useful data out + of the private structure, but its interface was too rigid. It remains for + backwards compatibility. The public options are passed back in an int - though + the re->options field has been expanded to a long int, all the public options + at the low end of it, and so even on 16-bit systems this will still be OK. + Therefore, I haven't changed the API for pcre_info(). + + Arguments: + external_re points to compiled code + optptr where to pass back the options + first_char where to pass back the first character, + or -1 if multiline and all branches start ^, + or -2 otherwise + + Returns: number of capturing subpatterns + or negative values on error + */ + + int + pcre_info(const pcre *external_re, int *optptr, int *first_char) + { + const real_pcre *re = (const real_pcre *)external_re; + if (re == NULL) return PCRE_ERROR_NULL; + if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + if (optptr != NULL) *optptr = (int)(re->options & PUBLIC_OPTIONS); + if (first_char != NULL) + *first_char = ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; + return re->top_bracket; + } + + + + /************************************************* + * Return info about compiled pattern * + *************************************************/ + + /* This is a newer "info" function which has an extensible interface so + that additional items can be added compatibly. + + Arguments: + external_re points to compiled code + external_study points to study data, or NULL + what what information is required + where where to put the information + + Returns: 0 if data returned, negative on error + */ + + int + pcre_fullinfo(const pcre *external_re, const pcre_extra *study_data, int what, + void *where) + { + const real_pcre *re = (const real_pcre *)external_re; + const real_pcre_extra *study = (const real_pcre_extra *)study_data; + + if (re == NULL || where == NULL) return PCRE_ERROR_NULL; + if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + + switch (what) + { + case PCRE_INFO_OPTIONS: + *((unsigned long int *)where) = re->options & PUBLIC_OPTIONS; + break; + + case PCRE_INFO_SIZE: + *((size_t *)where) = re->size; + break; + + case PCRE_INFO_CAPTURECOUNT: + *((int *)where) = re->top_bracket; + break; + + case PCRE_INFO_BACKREFMAX: + *((int *)where) = re->top_backref; + break; + + case PCRE_INFO_FIRSTCHAR: + *((int *)where) = + ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; + break; + + case PCRE_INFO_FIRSTTABLE: + *((const uschar **)where) = + (study != NULL && (study->options & PCRE_STUDY_MAPPED) != 0)? + study->start_bits : NULL; + break; + + case PCRE_INFO_LASTLITERAL: + *((int *)where) = + ((re->options & PCRE_REQCHSET) != 0)? re->req_char : -1; + break; + + default: return PCRE_ERROR_BADOPTION; + } + + return 0; + } + + + + #ifdef DEBUG + /************************************************* + * Debugging function to print chars * + *************************************************/ + + /* Print a sequence of chars in printable format, stopping at the end of the + subject if the requested. + + Arguments: + p points to characters + length number to print + is_subject TRUE if printing from within md->start_subject + md pointer to matching data block, if is_subject is TRUE + + Returns: nothing + */ + + static void + pchars(const uschar *p, int length, BOOL is_subject, match_data *md) + { + int c; + if (is_subject && length > md->end_subject - p) length = md->end_subject - p; + while (length-- > 0) + if (isprint(c = *(p++))) printf("%c", c); else printf("\\x%02x", c); + } + #endif + + + + + /************************************************* + * Handle escapes * + *************************************************/ + + /* This function is called when a \ has been encountered. It either returns a + positive value for a simple escape such as \n, or a negative value which + encodes one of the more complicated things such as \d. When UTF-8 is enabled, + a positive value greater than 255 may be returned. On entry, ptr is pointing at + the \. On exit, it is on the final character of the escape sequence. + + Arguments: + ptrptr points to the pattern position pointer + errorptr points to the pointer to the error message + bracount number of previous extracting brackets + options the options bits + isclass TRUE if inside a character class + cd pointer to char tables block + + Returns: zero or positive => a data character + negative => a special escape sequence + on error, errorptr is set + */ + + static int + check_escape(const uschar **ptrptr, const char **errorptr, int bracount, + int options, BOOL isclass, compile_data *cd) + { + const uschar *ptr = *ptrptr; + int c, i; + + /* If backslash is at the end of the pattern, it's an error. */ + + c = *(++ptr); + if (c == 0) *errorptr = ERR1; + + /* Digits or letters may have special meaning; all others are literals. */ + + else if (c < '0' || c > 'z') {} + + /* Do an initial lookup in a table. A non-zero result is something that can be + returned immediately. Otherwise further processing may be required. */ + + else if ((i = escapes[c - '0']) != 0) c = i; + + /* Escapes that need further processing, or are illegal. */ + + else + { + const uschar *oldptr; + switch (c) + { + /* The handling of escape sequences consisting of a string of digits + starting with one that is not zero is not straightforward. By experiment, + the way Perl works seems to be as follows: + + Outside a character class, the digits are read as a decimal number. If the + number is less than 10, or if there are that many previous extracting + left brackets, then it is a back reference. Otherwise, up to three octal + digits are read to form an escaped byte. Thus \123 is likely to be octal + 123 (cf \0123, which is octal 012 followed by the literal 3). If the octal + value is greater than 377, the least significant 8 bits are taken. Inside a + character class, \ followed by a digit is always an octal number. */ + + case '1': case '2': case '3': case '4': case '5': + case '6': case '7': case '8': case '9': + + if (!isclass) + { + oldptr = ptr; + c -= '0'; + while ((cd->ctypes[ptr[1]] & ctype_digit) != 0) + c = c * 10 + *(++ptr) - '0'; + if (c < 10 || c <= bracount) + { + c = -(ESC_REF + c); + break; + } + ptr = oldptr; /* Put the pointer back and fall through */ + } + + /* Handle an octal number following \. If the first digit is 8 or 9, Perl + generates a binary zero byte and treats the digit as a following literal. + Thus we have to pull back the pointer by one. */ + + if ((c = *ptr) >= '8') + { + ptr--; + c = 0; + break; + } + + /* \0 always starts an octal number, but we may drop through to here with a + larger first octal digit. */ + + case '0': + c -= '0'; + while(i++ < 2 && (cd->ctypes[ptr[1]] & ctype_digit) != 0 && + ptr[1] != '8' && ptr[1] != '9') + c = c * 8 + *(++ptr) - '0'; + c &= 255; /* Take least significant 8 bits */ + break; + + /* \x is complicated when UTF-8 is enabled. \x{ddd} is a character number + which can be greater than 0xff, but only if the ddd are hex digits. */ + + case 'x': + #ifdef SUPPORT_UTF8 + if (ptr[1] == '{' && (options & PCRE_UTF8) != 0) + { + const uschar *pt = ptr + 2; + register int count = 0; + c = 0; + while ((cd->ctypes[*pt] & ctype_xdigit) != 0) + { + count++; + c = c * 16 + cd->lcc[*pt] - + (((cd->ctypes[*pt] & ctype_digit) != 0)? '0' : 'W'); + pt++; + } + if (*pt == '}') + { + if (c < 0 || count > 8) *errorptr = ERR34; + ptr = pt; + break; + } + /* If the sequence of hex digits does not end with '}', then we don't + recognize this construct; fall through to the normal \x handling. */ + } + #endif + + /* Read just a single hex char */ + + c = 0; + while (i++ < 2 && (cd->ctypes[ptr[1]] & ctype_xdigit) != 0) + { + ptr++; + c = c * 16 + cd->lcc[*ptr] - + (((cd->ctypes[*ptr] & ctype_digit) != 0)? '0' : 'W'); + } + break; + + /* Other special escapes not starting with a digit are straightforward */ + + case 'c': + c = *(++ptr); + if (c == 0) + { + *errorptr = ERR2; + return 0; + } + + /* A letter is upper-cased; then the 0x40 bit is flipped */ + + if (c >= 'a' && c <= 'z') c = cd->fcc[c]; + c ^= 0x40; + break; + + /* PCRE_EXTRA enables extensions to Perl in the matter of escapes. Any + other alphameric following \ is an error if PCRE_EXTRA was set; otherwise, + for Perl compatibility, it is a literal. This code looks a bit odd, but + there used to be some cases other than the default, and there may be again + in future, so I haven't "optimized" it. */ + + default: + if ((options & PCRE_EXTRA) != 0) switch(c) + { + default: + *errorptr = ERR3; + break; + } + break; + } + } + + *ptrptr = ptr; + return c; + } + + + + /************************************************* + * Check for counted repeat * + *************************************************/ + + /* This function is called when a '{' is encountered in a place where it might + start a quantifier. It looks ahead to see if it really is a quantifier or not. + It is only a quantifier if it is one of the forms {ddd} {ddd,} or {ddd,ddd} + where the ddds are digits. + + Arguments: + p pointer to the first char after '{' + cd pointer to char tables block + + Returns: TRUE or FALSE + */ + + static BOOL + is_counted_repeat(const uschar *p, compile_data *cd) + { + if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; + while ((cd->ctypes[*p] & ctype_digit) != 0) p++; + if (*p == '}') return TRUE; + + if (*p++ != ',') return FALSE; + if (*p == '}') return TRUE; + + if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; + while ((cd->ctypes[*p] & ctype_digit) != 0) p++; + return (*p == '}'); + } + + + + /************************************************* + * Read repeat counts * + *************************************************/ + + /* Read an item of the form {n,m} and return the values. This is called only + after is_counted_repeat() has confirmed that a repeat-count quantifier exists, + so the syntax is guaranteed to be correct, but we need to check the values. + + Arguments: + p pointer to first char after '{' + minp pointer to int for min + maxp pointer to int for max + returned as -1 if no max + errorptr points to pointer to error message + cd pointer to character tables clock + + Returns: pointer to '}' on success; + current ptr on error, with errorptr set + */ + + static const uschar * + read_repeat_counts(const uschar *p, int *minp, int *maxp, + const char **errorptr, compile_data *cd) + { + int min = 0; + int max = -1; + + while ((cd->ctypes[*p] & ctype_digit) != 0) min = min * 10 + *p++ - '0'; + + if (*p == '}') max = min; else + { + if (*(++p) != '}') + { + max = 0; + while((cd->ctypes[*p] & ctype_digit) != 0) max = max * 10 + *p++ - '0'; + if (max < min) + { + *errorptr = ERR4; + return p; + } + } + } + + /* Do paranoid checks, then fill in the required variables, and pass back the + pointer to the terminating '}'. */ + + if (min > 65535 || max > 65535) + *errorptr = ERR5; + else + { + *minp = min; + *maxp = max; + } + return p; + } + + + + /************************************************* + * Find the fixed length of a pattern * + *************************************************/ + + /* Scan a pattern and compute the fixed length of subject that will match it, + if the length is fixed. This is needed for dealing with backward assertions. + + Arguments: + code points to the start of the pattern (the bracket) + options the compiling options + + Returns: the fixed length, or -1 if there is no fixed length + */ + + static int + find_fixedlength(uschar *code, int options) + { + int length = -1; + + register int branchlength = 0; + register uschar *cc = code + 3; + + /* Scan along the opcodes for this branch. If we get to the end of the + branch, check the length against that of the other branches. */ + + for (;;) + { + int d; + register int op = *cc; + if (op >= OP_BRA) op = OP_BRA; + + switch (op) + { + case OP_BRA: + case OP_ONCE: + case OP_COND: + d = find_fixedlength(cc, options); + if (d < 0) return -1; + branchlength += d; + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Reached end of a branch; if it's a ket it is the end of a nested + call. If it's ALT it is an alternation in a nested call. If it is + END it's the end of the outer call. All can be handled by the same code. */ + + case OP_ALT: + case OP_KET: + case OP_KETRMAX: + case OP_KETRMIN: + case OP_END: + if (length < 0) length = branchlength; + else if (length != branchlength) return -1; + if (*cc != OP_ALT) return length; + cc += 3; + branchlength = 0; + break; + + /* Skip over assertive subpatterns */ + + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Skip over things that don't match chars */ + + case OP_REVERSE: + cc++; + /* Fall through */ + + case OP_CREF: + case OP_OPT: + cc++; + /* Fall through */ + + case OP_SOD: + case OP_EOD: + case OP_EODN: + case OP_CIRC: + case OP_DOLL: + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + cc++; + break; + + /* Handle char strings. In UTF-8 mode we must count characters, not bytes. + This requires a scan of the string, unfortunately. We assume valid UTF-8 + strings, so all we do is reduce the length by one for byte whose bits are + 10xxxxxx. */ + + case OP_CHARS: + branchlength += *(++cc); + #ifdef SUPPORT_UTF8 + for (d = 1; d <= *cc; d++) + if ((cc[d] & 0xc0) == 0x80) branchlength--; + #endif + cc += *cc + 1; + break; + + /* Handle exact repetitions */ + + case OP_EXACT: + case OP_TYPEEXACT: + branchlength += (cc[1] << 8) + cc[2]; + cc += 4; + break; + + /* Handle single-char matchers */ + + case OP_NOT_DIGIT: + case OP_DIGIT: + case OP_NOT_WHITESPACE: + case OP_WHITESPACE: + case OP_NOT_WORDCHAR: + case OP_WORDCHAR: + case OP_ANY: + branchlength++; + cc++; + break; + + + /* Check a class for variable quantification */ + + case OP_CLASS: + cc += (*cc == OP_REF)? 2 : 33; + + switch (*cc) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + return -1; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if ((cc[1] << 8) + cc[2] != (cc[3] << 8) + cc[4]) return -1; + branchlength += (cc[1] << 8) + cc[2]; + cc += 5; + break; + + default: + branchlength++; + } + break; + + /* Anything else is variable length */ + + default: + return -1; + } + } + /* Control never gets here */ + } + + + + + /************************************************* + * Check for POSIX class syntax * + *************************************************/ + + /* This function is called when the sequence "[:" or "[." or "[=" is + encountered in a character class. It checks whether this is followed by an + optional ^ and then a sequence of letters, terminated by a matching ":]" or + ".]" or "=]". + + Argument: + ptr pointer to the initial [ + endptr where to return the end pointer + cd pointer to compile data + + Returns: TRUE or FALSE + */ + + static BOOL + check_posix_syntax(const uschar *ptr, const uschar **endptr, compile_data *cd) + { + int terminator; /* Don't combine these lines; the Solaris cc */ + terminator = *(++ptr); /* compiler warns about "non-constant" initializer. */ + if (*(++ptr) == '^') ptr++; + while ((cd->ctypes[*ptr] & ctype_letter) != 0) ptr++; + if (*ptr == terminator && ptr[1] == ']') + { + *endptr = ptr; + return TRUE; + } + return FALSE; + } + + + + + /************************************************* + * Check POSIX class name * + *************************************************/ + + /* This function is called to check the name given in a POSIX-style class entry + such as [:alnum:]. + + Arguments: + ptr points to the first letter + len the length of the name + + Returns: a value representing the name, or -1 if unknown + */ + + static int + check_posix_name(const uschar *ptr, int len) + { + register int yield = 0; + while (posix_name_lengths[yield] != 0) + { + if (len == posix_name_lengths[yield] && + strncmp((const char *)ptr, posix_names[yield], len) == 0) return yield; + yield++; + } + return -1; + } + + + + + /************************************************* + * Compile one branch * + *************************************************/ + + /* Scan the pattern, compiling it into the code vector. + + Arguments: + options the option bits + brackets points to number of brackets used + code points to the pointer to the current code point + ptrptr points to the current pattern pointer + errorptr points to pointer to error message + optchanged set to the value of the last OP_OPT item compiled + reqchar set to the last literal character required, else -1 + countlits set to count of mandatory literal characters + cd contains pointers to tables + + Returns: TRUE on success + FALSE, with *errorptr set on error + */ + + static BOOL + compile_branch(int options, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, int *optchanged, + int *reqchar, int *countlits, compile_data *cd) + { + int repeat_type, op_type; + int repeat_min, repeat_max; + int bravalue, length; + int greedy_default, greedy_non_default; + int prevreqchar; + int condcount = 0; + int subcountlits = 0; + register int c; + register uschar *code = *codeptr; + uschar *tempcode; + const uschar *ptr = *ptrptr; + const uschar *tempptr; + uschar *previous = NULL; + uschar class[32]; + + /* Set up the default and non-default settings for greediness */ + + greedy_default = ((options & PCRE_UNGREEDY) != 0); + greedy_non_default = greedy_default ^ 1; + + /* Initialize no required char, and count of literals */ + + *reqchar = prevreqchar = -1; + *countlits = 0; + + /* Switch on next character until the end of the branch */ + + for (;; ptr++) + { + BOOL negate_class; + int class_charcount; + int class_lastchar; + int newoptions; + int condref; + int subreqchar; + + c = *ptr; + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + /* The space before the ; is to avoid a warning on a silly compiler + on the Macintosh. */ + while ((c = *(++ptr)) != 0 && c != '\n') ; + continue; + } + } + + switch(c) + { + /* The branch terminates at end of string, |, or ). */ + + case 0: + case '|': + case ')': + *codeptr = code; + *ptrptr = ptr; + return TRUE; + + /* Handle single-character metacharacters */ + + case '^': + previous = NULL; + *code++ = OP_CIRC; + break; + + case '$': + previous = NULL; + *code++ = OP_DOLL; + break; + + case '.': + previous = code; + *code++ = OP_ANY; + break; + + /* Character classes. These always build a 32-byte bitmap of the permitted + characters, except in the special case where there is only one character. + For negated classes, we build the map as usual, then invert it at the end. + */ + + case '[': + previous = code; + *code++ = OP_CLASS; + + /* If the first character is '^', set the negation flag and skip it. */ + + if ((c = *(++ptr)) == '^') + { + negate_class = TRUE; + c = *(++ptr); + } + else negate_class = FALSE; + + /* Keep a count of chars so that we can optimize the case of just a single + character. */ + + class_charcount = 0; + class_lastchar = -1; + + /* Initialize the 32-char bit map to all zeros. We have to build the + map in a temporary bit of store, in case the class contains only 1 + character, because in that case the compiled code doesn't use the + bit map. */ + + memset(class, 0, 32 * sizeof(uschar)); + + /* Process characters until ] is reached. By writing this as a "do" it + means that an initial ] is taken as a data character. */ + + do + { + if (c == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* Handle POSIX class names. Perl allows a negation extension of the + form [:^name]. A square bracket that doesn't match the syntax is + treated as a literal. We also recognize the POSIX constructions + [.ch.] and [=ch=] ("collating elements") and fault them, as Perl + 5.6 does. */ + + if (c == '[' && + (ptr[1] == ':' || ptr[1] == '.' || ptr[1] == '=') && + check_posix_syntax(ptr, &tempptr, cd)) + { + BOOL local_negate = FALSE; + int posix_class, i; + register const uschar *cbits = cd->cbits; + + if (ptr[1] != ':') + { + *errorptr = ERR31; + goto FAILED; + } + + ptr += 2; + if (*ptr == '^') + { + local_negate = TRUE; + ptr++; + } + + posix_class = check_posix_name(ptr, tempptr - ptr); + if (posix_class < 0) + { + *errorptr = ERR30; + goto FAILED; + } + + /* If matching is caseless, upper and lower are converted to + alpha. This relies on the fact that the class table starts with + alpha, lower, upper as the first 3 entries. */ + + if ((options & PCRE_CASELESS) != 0 && posix_class <= 2) + posix_class = 0; + + /* Or into the map we are building up to 3 of the static class + tables, or their negations. */ + + posix_class *= 3; + for (i = 0; i < 3; i++) + { + int taboffset = posix_class_maps[posix_class + i]; + if (taboffset < 0) break; + if (local_negate) + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+taboffset]; + else + for (c = 0; c < 32; c++) class[c] |= cbits[c+taboffset]; + } + + ptr = tempptr + 1; + class_charcount = 10; /* Set > 1; assumes more than 1 per class */ + continue; + } + + /* Backslash may introduce a single character, or it may introduce one + of the specials, which just set a flag. Escaped items are checked for + validity in the pre-compiling pass. The sequence \b is a special case. + Inside a class (and only there) it is treated as backspace. Elsewhere + it marks a word boundary. Other escapes have preset maps ready to + or into the one we are building. We assume they have more than one + character in them, so set class_count bigger than one. */ + + if (c == '\\') + { + c = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + if (-c == ESC_b) c = '\b'; + else if (c < 0) + { + register const uschar *cbits = cd->cbits; + class_charcount = 10; + switch (-c) + { + case ESC_d: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_digit]; + continue; + + case ESC_D: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_digit]; + continue; + + case ESC_w: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_word]; + continue; + + case ESC_W: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_word]; + continue; + + case ESC_s: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_space]; + continue; + + case ESC_S: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_space]; + continue; + + default: + *errorptr = ERR7; + goto FAILED; + } + } + + /* Fall through if single character, but don't at present allow + chars > 255 in UTF-8 mode. */ + + #ifdef SUPPORT_UTF8 + if (c > 255) + { + *errorptr = ERR33; + goto FAILED; + } + #endif + } + + /* A single character may be followed by '-' to form a range. However, + Perl does not permit ']' to be the end of the range. A '-' character + here is treated as a literal. */ + + if (ptr[1] == '-' && ptr[2] != ']') + { + int d; + ptr += 2; + d = *ptr; + + if (d == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* The second part of a range can be a single-character escape, but + not any of the other escapes. Perl 5.6 treats a hyphen as a literal + in such circumstances. */ + + if (d == '\\') + { + const uschar *oldptr = ptr; + d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + + #ifdef SUPPORT_UTF8 + if (d > 255) + { + *errorptr = ERR33; + goto FAILED; + } + #endif + /* \b is backslash; any other special means the '-' was literal */ + + if (d < 0) + { + if (d == -ESC_b) d = '\b'; else + { + ptr = oldptr - 2; + goto SINGLE_CHARACTER; /* A few lines below */ + } + } + } + + if (d < c) + { + *errorptr = ERR8; + goto FAILED; + } + + for (; c <= d; c++) + { + class[c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + int uc = cd->fcc[c]; /* flip case */ + class[uc/8] |= (1 << (uc&7)); + } + class_charcount++; /* in case a one-char range */ + class_lastchar = c; + } + continue; /* Go get the next char in the class */ + } + + /* Handle a lone single character - we can get here for a normal + non-escape char, or after \ that introduces a single character. */ + + SINGLE_CHARACTER: + + class [c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + c = cd->fcc[c]; /* flip case */ + class[c/8] |= (1 << (c&7)); + } + class_charcount++; + class_lastchar = c; + } + + /* Loop until ']' reached; the check for end of string happens inside the + loop. This "while" is the end of the "do" above. */ + + while ((c = *(++ptr)) != ']'); + + /* If class_charcount is 1 and class_lastchar is not negative, we saw + precisely one character. This doesn't need the whole 32-byte bit map. + We turn it into a 1-character OP_CHAR if it's positive, or OP_NOT if + it's negative. */ + + if (class_charcount == 1 && class_lastchar >= 0) + { + if (negate_class) + { + code[-1] = OP_NOT; + } + else + { + code[-1] = OP_CHARS; + *code++ = 1; + } + *code++ = class_lastchar; + } + + /* Otherwise, negate the 32-byte map if necessary, and copy it into + the code vector. */ + + else + { + if (negate_class) + for (c = 0; c < 32; c++) code[c] = ~class[c]; + else + memcpy(code, class, 32); + code += 32; + } + break; + + /* Various kinds of repeat */ + + case '{': + if (!is_counted_repeat(ptr+1, cd)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr, cd); + if (*errorptr != NULL) goto FAILED; + goto REPEAT; + + case '*': + repeat_min = 0; + repeat_max = -1; + goto REPEAT; + + case '+': + repeat_min = 1; + repeat_max = -1; + goto REPEAT; + + case '?': + repeat_min = 0; + repeat_max = 1; + + REPEAT: + if (previous == NULL) + { + *errorptr = ERR9; + goto FAILED; + } + + /* If the next character is '?' this is a minimizing repeat, by default, + but if PCRE_UNGREEDY is set, it works the other way round. Advance to the + next character. */ + + if (ptr[1] == '?') + { repeat_type = greedy_non_default; ptr++; } + else repeat_type = greedy_default; + + /* If previous was a string of characters, chop off the last one and use it + as the subject of the repeat. If there was only one character, we can + abolish the previous item altogether. A repeat with a zero minimum wipes + out any reqchar setting, backing up to the previous value. We must also + adjust the countlits value. */ + + if (*previous == OP_CHARS) + { + int len = previous[1]; + + if (repeat_min == 0) *reqchar = prevreqchar; + *countlits += repeat_min - 1; + + if (len == 1) + { + c = previous[2]; + code = previous; + } + else + { + c = previous[len+1]; + previous[1]--; + code--; + } + op_type = 0; /* Use single-char op codes */ + goto OUTPUT_SINGLE_REPEAT; /* Code shared with single character types */ + } + + /* If previous was a single negated character ([^a] or similar), we use + one of the special opcodes, replacing it. The code is shared with single- + character repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous == OP_NOT) + { + op_type = OP_NOTSTAR - OP_STAR; /* Use "not" opcodes */ + c = previous[1]; + code = previous; + goto OUTPUT_SINGLE_REPEAT; + } + + /* If previous was a character type match (\d or similar), abolish it and + create a suitable repeat item. The code is shared with single-character + repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous < OP_EODN || *previous == OP_ANY) + { + op_type = OP_TYPESTAR - OP_STAR; /* Use type opcodes */ + c = *previous; + code = previous; + + OUTPUT_SINGLE_REPEAT: + + /* If the maximum is zero then the minimum must also be zero; Perl allows + this case, so we do too - by simply omitting the item altogether. */ + + if (repeat_max == 0) goto END_REPEAT; + + /* Combine the op_type with the repeat_type */ + + repeat_type += op_type; + + /* A minimum of zero is handled either as the special case * or ?, or as + an UPTO, with the maximum given. */ + + if (repeat_min == 0) + { + if (repeat_max == -1) *code++ = OP_STAR + repeat_type; + else if (repeat_max == 1) *code++ = OP_QUERY + repeat_type; + else + { + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The case {1,} is handled as the special case + */ + + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_PLUS + repeat_type; + + /* The case {n,n} is just an EXACT, while the general case {n,m} is + handled as an EXACT followed by an UPTO. An EXACT of 1 is optimized. */ + + else + { + if (repeat_min != 1) + { + *code++ = OP_EXACT + op_type; /* NB EXACT doesn't have repeat_type */ + *code++ = repeat_min >> 8; + *code++ = (repeat_min & 255); + } + + /* If the mininum is 1 and the previous item was a character string, + we either have to put back the item that got cancelled if the string + length was 1, or add the character back onto the end of a longer + string. For a character type nothing need be done; it will just get + put back naturally. Note that the final character is always going to + get added below. */ + + else if (*previous == OP_CHARS) + { + if (code == previous) code += 2; else previous[1]++; + } + + /* For a single negated character we also have to put back the + item that got cancelled. */ + + else if (*previous == OP_NOT) code++; + + /* If the maximum is unlimited, insert an OP_STAR. */ + + if (repeat_max < 0) + { + *code++ = c; + *code++ = OP_STAR + repeat_type; + } + + /* Else insert an UPTO if the max is greater than the min. */ + + else if (repeat_max != repeat_min) + { + *code++ = c; + repeat_max -= repeat_min; + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The character or character type itself comes last in all cases. */ + + *code++ = c; + } + + /* If previous was a character class or a back reference, we put the repeat + stuff after it, but just skip the item if the repeat was {0,0}. */ + + else if (*previous == OP_CLASS || *previous == OP_REF) + { + if (repeat_max == 0) + { + code = previous; + goto END_REPEAT; + } + if (repeat_min == 0 && repeat_max == -1) + *code++ = OP_CRSTAR + repeat_type; + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_CRPLUS + repeat_type; + else if (repeat_min == 0 && repeat_max == 1) + *code++ = OP_CRQUERY + repeat_type; + else + { + *code++ = OP_CRRANGE + repeat_type; + *code++ = repeat_min >> 8; + *code++ = repeat_min & 255; + if (repeat_max == -1) repeat_max = 0; /* 2-byte encoding for max */ + *code++ = repeat_max >> 8; + *code++ = repeat_max & 255; + } + } + + /* If previous was a bracket group, we may have to replicate it in certain + cases. */ + + else if ((int)*previous >= OP_BRA || (int)*previous == OP_ONCE || + (int)*previous == OP_COND) + { + register int i; + int ketoffset = 0; + int len = code - previous; + uschar *bralink = NULL; + + /* If the maximum repeat count is unlimited, find the end of the bracket + by scanning through from the start, and compute the offset back to it + from the current code pointer. There may be an OP_OPT setting following + the final KET, so we can't find the end just by going back from the code + pointer. */ + + if (repeat_max == -1) + { + register uschar *ket = previous; + do ket += (ket[1] << 8) + ket[2]; while (*ket != OP_KET); + ketoffset = code - ket; + } + + /* The case of a zero minimum is special because of the need to stick + OP_BRAZERO in front of it, and because the group appears once in the + data, whereas in other cases it appears the minimum number of times. For + this reason, it is simplest to treat this case separately, as otherwise + the code gets far too mess. There are several special subcases when the + minimum is zero. */ + + if (repeat_min == 0) + { + /* If we set up a required char from the bracket, we must back off + to the previous value and reset the countlits value too. */ + + if (subcountlits > 0) + { + *reqchar = prevreqchar; + *countlits -= subcountlits; + } + + /* If the maximum is also zero, we just omit the group from the output + altogether. */ + + if (repeat_max == 0) + { + code = previous; + goto END_REPEAT; + } + + /* If the maximum is 1 or unlimited, we just have to stick in the + BRAZERO and do no more at this point. */ + + if (repeat_max <= 1) + { + memmove(previous+1, previous, len); + code++; + *previous++ = OP_BRAZERO + repeat_type; + } + + /* If the maximum is greater than 1 and limited, we have to replicate + in a nested fashion, sticking OP_BRAZERO before each set of brackets. + The first one has to be handled carefully because it's the original + copy, which has to be moved up. The remainder can be handled by code + that is common with the non-zero minimum case below. We just have to + adjust the value or repeat_max, since one less copy is required. */ + + else + { + int offset; + memmove(previous+4, previous, len); + code += 4; + *previous++ = OP_BRAZERO + repeat_type; + *previous++ = OP_BRA; + + /* We chain together the bracket offset fields that have to be + filled in later when the ends of the brackets are reached. */ + + offset = (bralink == NULL)? 0 : previous - bralink; + bralink = previous; + *previous++ = offset >> 8; + *previous++ = offset & 255; + } + + repeat_max--; + } + + /* If the minimum is greater than zero, replicate the group as many + times as necessary, and adjust the maximum to the number of subsequent + copies that we need. */ + + else + { + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } + if (repeat_max > 0) repeat_max -= repeat_min; + } + + /* This code is common to both the zero and non-zero minimum cases. If + the maximum is limited, it replicates the group in a nested fashion, + remembering the bracket starts on a stack. In the case of a zero minimum, + the first one was set up above. In all cases the repeat_max now specifies + the number of additional copies needed. */ + + if (repeat_max >= 0) + { + for (i = repeat_max - 1; i >= 0; i--) + { + *code++ = OP_BRAZERO + repeat_type; + + /* All but the final copy start a new nesting, maintaining the + chain of brackets outstanding. */ + + if (i != 0) + { + int offset; + *code++ = OP_BRA; + offset = (bralink == NULL)? 0 : code - bralink; + bralink = code; + *code++ = offset >> 8; + *code++ = offset & 255; + } + + memcpy(code, previous, len); + code += len; + } + + /* Now chain through the pending brackets, and fill in their length + fields (which are holding the chain links pro tem). */ + + while (bralink != NULL) + { + int oldlinkoffset; + int offset = code - bralink + 1; + uschar *bra = code - offset; + oldlinkoffset = (bra[1] << 8) + bra[2]; + bralink = (oldlinkoffset == 0)? NULL : bralink - oldlinkoffset; + *code++ = OP_KET; + *code++ = bra[1] = offset >> 8; + *code++ = bra[2] = (offset & 255); + } + } + + /* If the maximum is unlimited, set a repeater in the final copy. We + can't just offset backwards from the current code point, because we + don't know if there's been an options resetting after the ket. The + correct offset was computed above. */ + + else code[-ketoffset] = OP_KETRMAX + repeat_type; + } + + /* Else there's some kind of shambles */ + + else + { + *errorptr = ERR11; + goto FAILED; + } + + /* In all case we no longer have a previous item. */ + + END_REPEAT: + previous = NULL; + break; + + + /* Start of nested bracket sub-expression, or comment or lookahead or + lookbehind or option setting or condition. First deal with special things + that can come after a bracket; all are introduced by ?, and the appearance + of any of them means that this is not a referencing group. They were + checked for validity in the first pass over the string, so we don't have to + check for syntax errors here. */ + + case '(': + newoptions = options; + condref = -1; + + if (*(++ptr) == '?') + { + int set, unset; + int *optset; + + switch (*(++ptr)) + { + case '#': /* Comment; skip to ket */ + ptr++; + while (*ptr != ')') ptr++; + continue; + + case ':': /* Non-extracting bracket */ + bravalue = OP_BRA; + ptr++; + break; + + case '(': + bravalue = OP_COND; /* Conditional group */ + if ((cd->ctypes[*(++ptr)] & ctype_digit) != 0) + { + condref = *ptr - '0'; + while (*(++ptr) != ')') condref = condref*10 + *ptr - '0'; + ptr++; + } + else ptr--; + break; + + case '=': /* Positive lookahead */ + bravalue = OP_ASSERT; + ptr++; + break; + + case '!': /* Negative lookahead */ + bravalue = OP_ASSERT_NOT; + ptr++; + break; + + case '<': /* Lookbehinds */ + switch (*(++ptr)) + { + case '=': /* Positive lookbehind */ + bravalue = OP_ASSERTBACK; + ptr++; + break; + + case '!': /* Negative lookbehind */ + bravalue = OP_ASSERTBACK_NOT; + ptr++; + break; + + default: /* Syntax error */ + *errorptr = ERR24; + goto FAILED; + } + break; + + case '>': /* One-time brackets */ + bravalue = OP_ONCE; + ptr++; + break; + + case 'R': /* Pattern recursion */ + *code++ = OP_RECURSE; + ptr++; + continue; + + default: /* Option setting */ + set = unset = 0; + optset = &set; + + while (*ptr != ')' && *ptr != ':') + { + switch (*ptr++) + { + case '-': optset = &unset; break; + + case 'i': *optset |= PCRE_CASELESS; break; + case 'm': *optset |= PCRE_MULTILINE; break; + case 's': *optset |= PCRE_DOTALL; break; + case 'x': *optset |= PCRE_EXTENDED; break; + case 'U': *optset |= PCRE_UNGREEDY; break; + case 'X': *optset |= PCRE_EXTRA; break; + + default: + *errorptr = ERR12; + goto FAILED; + } + } + + /* Set up the changed option bits, but don't change anything yet. */ + + newoptions = (options | set) & (~unset); + + /* If the options ended with ')' this is not the start of a nested + group with option changes, so the options change at this level. At top + level there is nothing else to be done (the options will in fact have + been set from the start of compiling as a result of the first pass) but + at an inner level we must compile code to change the ims options if + necessary, and pass the new setting back so that it can be put at the + start of any following branches, and when this group ends, a resetting + item can be compiled. */ + + if (*ptr == ')') + { + if ((options & PCRE_INGROUP) != 0 && + (options & PCRE_IMS) != (newoptions & PCRE_IMS)) + { + *code++ = OP_OPT; + *code++ = *optchanged = newoptions & PCRE_IMS; + } + options = newoptions; /* Change options at this level */ + previous = NULL; /* This item can't be repeated */ + continue; /* It is complete */ + } + + /* If the options ended with ':' we are heading into a nested group + with possible change of options. Such groups are non-capturing and are + not assertions of any kind. All we need to do is skip over the ':'; + the newoptions value is handled below. */ + + bravalue = OP_BRA; + ptr++; + } + } + + /* Else we have a referencing group; adjust the opcode. */ + + else + { + if (++(*brackets) > EXTRACT_MAX) + { + *errorptr = ERR13; + goto FAILED; + } + bravalue = OP_BRA + *brackets; + } + + /* Process nested bracketed re. Assertions may not be repeated, but other + kinds can be. We copy code into a non-register variable in order to be able + to pass its address because some compilers complain otherwise. Pass in a + new setting for the ims options if they have changed. */ + + previous = (bravalue >= OP_ONCE)? code : NULL; + *code = bravalue; + tempcode = code; + + if (!compile_regex( + options | PCRE_INGROUP, /* Set for all nested groups */ + ((options & PCRE_IMS) != (newoptions & PCRE_IMS))? + newoptions & PCRE_IMS : -1, /* Pass ims options if changed */ + brackets, /* Bracket level */ + &tempcode, /* Where to put code (updated) */ + &ptr, /* Input pointer (updated) */ + errorptr, /* Where to put an error message */ + (bravalue == OP_ASSERTBACK || + bravalue == OP_ASSERTBACK_NOT), /* TRUE if back assert */ + condref, /* Condition reference number */ + &subreqchar, /* For possible last char */ + &subcountlits, /* For literal count */ + cd)) /* Tables block */ + goto FAILED; + + /* At the end of compiling, code is still pointing to the start of the + group, while tempcode has been updated to point past the end of the group + and any option resetting that may follow it. The pattern pointer (ptr) + is on the bracket. */ + + /* If this is a conditional bracket, check that there are no more than + two branches in the group. */ + + if (bravalue == OP_COND) + { + uschar *tc = code; + condcount = 0; + + do { + condcount++; + tc += (tc[1] << 8) | tc[2]; + } + while (*tc != OP_KET); + + if (condcount > 2) + { + *errorptr = ERR27; + goto FAILED; + } + } + + /* Handle updating of the required character. If the subpattern didn't + set one, leave it as it was. Otherwise, update it for normal brackets of + all kinds, forward assertions, and conditions with two branches. Don't + update the literal count for forward assertions, however. If the bracket + is followed by a quantifier with zero repeat, we have to back off. Hence + the definition of prevreqchar and subcountlits outside the main loop so + that they can be accessed for the back off. */ + + if (subreqchar > 0 && + (bravalue >= OP_BRA || bravalue == OP_ONCE || bravalue == OP_ASSERT || + (bravalue == OP_COND && condcount == 2))) + { + prevreqchar = *reqchar; + *reqchar = subreqchar; + if (bravalue != OP_ASSERT) *countlits += subcountlits; + } + + /* Now update the main code pointer to the end of the group. */ + + code = tempcode; + + /* Error if hit end of pattern */ + + if (*ptr != ')') + { + *errorptr = ERR14; + goto FAILED; + } + break; + + /* Check \ for being a real metacharacter; if not, fall through and handle + it as a data character at the start of a string. Escape items are checked + for validity in the pre-compiling pass. */ + + case '\\': + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + + /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values + are arranged to be the negation of the corresponding OP_values. For the + back references, the values are ESC_REF plus the reference number. Only + back references and those types that consume a character may be repeated. + We can test for values between ESC_b and ESC_Z for the latter; this may + have to change if any new ones are ever created. */ + + if (c < 0) + { + if (-c >= ESC_REF) + { + previous = code; + *code++ = OP_REF; + *code++ = -c - ESC_REF; + } + else + { + previous = (-c > ESC_b && -c < ESC_Z)? code : NULL; + *code++ = -c; + } + continue; + } + + /* Data character: reset and fall through */ + + ptr = tempptr; + c = '\\'; + + /* Handle a run of data characters until a metacharacter is encountered. + The first character is guaranteed not to be whitespace or # when the + extended flag is set. */ + + NORMAL_CHAR: + default: + previous = code; + *code = OP_CHARS; + code += 2; + length = 0; + + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + /* The space before the ; is to avoid a warning on a silly compiler + on the Macintosh. */ + while ((c = *(++ptr)) != 0 && c != '\n') ; + if (c == 0) break; + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter. Escaped items + are checked for validity in the pre-compiling pass. Stop the string + before a metaitem. */ + + if (c == '\\') + { + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + if (c < 0) { ptr = tempptr; break; } + + /* If a character is > 127 in UTF-8 mode, we have to turn it into + two or more characters in the UTF-8 encoding. */ + + #ifdef SUPPORT_UTF8 + if (c > 127 && (options & PCRE_UTF8) != 0) + { + uschar buffer[8]; + int len = ord2utf8(c, buffer); + for (c = 0; c < len; c++) *code++ = buffer[c]; + length += len; + continue; + } + #endif + } + + /* Ordinary character or single-char escape */ + + *code++ = c; + length++; + } + + /* This "while" is the end of the "do" above. */ + + while (length < MAXLIT && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0); + + /* Update the last character and the count of literals */ + + prevreqchar = (length > 1)? code[-2] : *reqchar; + *reqchar = code[-1]; + *countlits += length; + + /* Compute the length and set it in the data vector, and advance to + the next state. */ + + previous[1] = length; + if (length < MAXLIT) ptr--; + break; + } + } /* end of big loop */ + + /* Control never reaches here by falling through, only by a goto for all the + error states. Pass back the position in the pattern so that it can be displayed + to the user for diagnosing the error. */ + + FAILED: + *ptrptr = ptr; + return FALSE; + } + + + + + /************************************************* + * Compile sequence of alternatives * + *************************************************/ + + /* On entry, ptr is pointing past the bracket character, but on return + it points to the closing bracket, or vertical bar, or end of string. + The code variable is pointing at the byte into which the BRA operator has been + stored. If the ims options are changed at the start (for a (?ims: group) or + during any branch, we need to insert an OP_OPT item at the start of every + following branch to ensure they get set correctly at run time, and also pass + the new options into every subsequent branch compile. + + Argument: + options the option bits + optchanged new ims options to set as if (?ims) were at the start, or -1 + for no change + brackets -> int containing the number of extracting brackets used + codeptr -> the address of the current code pointer + ptrptr -> the address of the current pattern pointer + errorptr -> pointer to error message + lookbehind TRUE if this is a lookbehind assertion + condref > 0 for OPT_CREF setting at start of conditional group + reqchar -> place to put the last required character, or a negative number + countlits -> place to put the shortest literal count of any branch + cd points to the data block with tables pointers + + Returns: TRUE on success + */ + + static BOOL + compile_regex(int options, int optchanged, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int condref, + int *reqchar, int *countlits, compile_data *cd) + { + const uschar *ptr = *ptrptr; + uschar *code = *codeptr; + uschar *last_branch = code; + uschar *start_bracket = code; + uschar *reverse_count = NULL; + int oldoptions = options & PCRE_IMS; + int branchreqchar, branchcountlits; + + *reqchar = -1; + *countlits = INT_MAX; + code += 3; + + /* At the start of a reference-based conditional group, insert the reference + number as an OP_CREF item. */ + + if (condref > 0) + { + *code++ = OP_CREF; + *code++ = condref; + } + + /* Loop for each alternative branch */ + + for (;;) + { + int length; + + /* Handle change of options */ + + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = optchanged; + options = (options & ~PCRE_IMS) | optchanged; + } + + /* Set up dummy OP_REVERSE if lookbehind assertion */ + + if (lookbehind) + { + *code++ = OP_REVERSE; + reverse_count = code; + *code++ = 0; + *code++ = 0; + } + + /* Now compile the branch */ + + if (!compile_branch(options, brackets, &code, &ptr, errorptr, &optchanged, + &branchreqchar, &branchcountlits, cd)) + { + *ptrptr = ptr; + return FALSE; + } + + /* Fill in the length of the last branch */ + + length = code - last_branch; + last_branch[1] = length >> 8; + last_branch[2] = length & 255; + + /* Save the last required character if all branches have the same; a current + value of -1 means unset, while -2 means "previous branch had no last required + char". */ + + if (*reqchar != -2) + { + if (branchreqchar >= 0) + { + if (*reqchar == -1) *reqchar = branchreqchar; + else if (*reqchar != branchreqchar) *reqchar = -2; + } + else *reqchar = -2; + } + + /* Keep the shortest literal count */ + + if (branchcountlits < *countlits) *countlits = branchcountlits; + DPRINTF(("literal count = %d min=%d\n", branchcountlits, *countlits)); + + /* If lookbehind, check that this branch matches a fixed-length string, + and put the length into the OP_REVERSE item. Temporarily mark the end of + the branch with OP_END. */ + + if (lookbehind) + { + *code = OP_END; + length = find_fixedlength(last_branch, options); + DPRINTF(("fixed length = %d\n", length)); + if (length < 0) + { + *errorptr = ERR25; + *ptrptr = ptr; + return FALSE; + } + reverse_count[0] = (length >> 8); + reverse_count[1] = length & 255; + } + + /* Reached end of expression, either ')' or end of pattern. Insert a + terminating ket and the length of the whole bracketed item, and return, + leaving the pointer at the terminating char. If any of the ims options + were changed inside the group, compile a resetting op-code following. */ + + if (*ptr != '|') + { + length = code - start_bracket; + *code++ = OP_KET; + *code++ = length >> 8; + *code++ = length & 255; + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = oldoptions; + } + *codeptr = code; + *ptrptr = ptr; + return TRUE; + } + + /* Another branch follows; insert an "or" node and advance the pointer. */ + + *code = OP_ALT; + last_branch = code; + code += 3; + ptr++; + } + /* Control never reaches here */ + } + + + + + /************************************************* + * Find first significant op code * + *************************************************/ + + /* This is called by several functions that scan a compiled expression looking + for a fixed first character, or an anchoring op code etc. It skips over things + that do not influence this. For one application, a change of caseless option is + important. + + Arguments: + code pointer to the start of the group + options pointer to external options + optbit the option bit whose changing is significant, or + zero if none are + optstop TRUE to return on option change, otherwise change the options + value and continue + + Returns: pointer to the first significant opcode + */ + + static const uschar* + first_significant_code(const uschar *code, int *options, int optbit, + BOOL optstop) + { + for (;;) + { + switch ((int)*code) + { + case OP_OPT: + if (optbit > 0 && ((int)code[1] & optbit) != (*options & optbit)) + { + if (optstop) return code; + *options = (int)code[1]; + } + code += 2; + break; + + case OP_CREF: + code += 2; + break; + + case OP_WORD_BOUNDARY: + case OP_NOT_WORD_BOUNDARY: + code++; + break; + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do code += (code[1] << 8) + code[2]; while (*code == OP_ALT); + code += 3; + break; + + default: + return code; + } + } + /* Control never reaches here */ + } + + + + + /************************************************* + * Check for anchored expression * + *************************************************/ + + /* Try to find out if this is an anchored regular expression. Consider each + alternative branch. If they all start with OP_SOD or OP_CIRC, or with a bracket + all of whose alternatives start with OP_SOD or OP_CIRC (recurse ad lib), then + it's anchored. However, if this is a multiline pattern, then only OP_SOD + counts, since OP_CIRC can match in the middle. + + A branch is also implicitly anchored if it starts with .* and DOTALL is set, + because that will try the rest of the pattern at all possible matching points, + so there is no point trying them again. + + Arguments: + code points to start of expression (the bracket) + options points to the options setting + + Returns: TRUE or FALSE + */ + + static BOOL + is_anchored(register const uschar *code, int *options) + { + do { + const uschar *scode = first_significant_code(code + 3, options, + PCRE_MULTILINE, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_anchored(scode, options)) return FALSE; } + else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR) && + (*options & PCRE_DOTALL) != 0) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_SOD && + ((*options & PCRE_MULTILINE) != 0 || op != OP_CIRC)) + return FALSE; + code += (code[1] << 8) + code[2]; + } + while (*code == OP_ALT); + return TRUE; + } + + + + /************************************************* + * Check for starting with ^ or .* * + *************************************************/ + + /* This is called to find out if every branch starts with ^ or .* so that + "first char" processing can be done to speed things up in multiline + matching and for non-DOTALL patterns that start with .* (which must start at + the beginning or after \n). + + Argument: points to start of expression (the bracket) + Returns: TRUE or FALSE + */ + + static BOOL + is_startline(const uschar *code) + { + do { + const uschar *scode = first_significant_code(code + 3, NULL, 0, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_startline(scode)) return FALSE; } + else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_CIRC) return FALSE; + code += (code[1] << 8) + code[2]; + } + while (*code == OP_ALT); + return TRUE; + } + + + + /************************************************* + * Check for fixed first char * + *************************************************/ + + /* Try to find out if there is a fixed first character. This is called for + unanchored expressions, as it speeds up their processing quite considerably. + Consider each alternative branch. If they all start with the same char, or with + a bracket all of whose alternatives start with the same char (recurse ad lib), + then we return that char, otherwise -1. + + Arguments: + code points to start of expression (the bracket) + options pointer to the options (used to check casing changes) + + Returns: -1 or the fixed first char + */ + + static int + find_firstchar(const uschar *code, int *options) + { + register int c = -1; + do { + int d; + const uschar *scode = first_significant_code(code + 3, options, + PCRE_CASELESS, TRUE); + register int op = *scode; + + if (op >= OP_BRA) op = OP_BRA; + + switch(op) + { + default: + return -1; + + case OP_BRA: + case OP_ASSERT: + case OP_ONCE: + case OP_COND: + if ((d = find_firstchar(scode, options)) < 0) return -1; + if (c < 0) c = d; else if (c != d) return -1; + break; + + case OP_EXACT: /* Fall through */ + scode++; + + case OP_CHARS: /* Fall through */ + scode++; + + case OP_PLUS: + case OP_MINPLUS: + if (c < 0) c = scode[1]; else if (c != scode[1]) return -1; + break; + } + + code += (code[1] << 8) + code[2]; + } + while (*code == OP_ALT); + return c; + } + + + + + + /************************************************* + * Compile a Regular Expression * + *************************************************/ + + /* This function takes a string and returns a pointer to a block of store + holding a compiled version of the expression. + + Arguments: + pattern the regular expression + options various option bits + errorptr pointer to pointer to error text + erroroffset ptr offset in pattern where error was detected + tables pointer to character tables or NULL + + Returns: pointer to compiled data block, or NULL on error, + with errorptr and erroroffset set + */ + + pcre * + pcre_compile(const char *pattern, int options, const char **errorptr, + int *erroroffset, const unsigned char *tables) + { + real_pcre *re; + int length = 3; /* For initial BRA plus length */ + int runlength; + int c, reqchar, countlits; + int bracount = 0; + int top_backref = 0; + int branch_extra = 0; + int branch_newextra; + unsigned int brastackptr = 0; + size_t size; + uschar *code; + const uschar *ptr; + compile_data compile_block; + int brastack[BRASTACK_SIZE]; + uschar bralenstack[BRASTACK_SIZE]; + + #ifdef DEBUG + uschar *code_base, *code_end; + #endif + + /* Can't support UTF8 unless PCRE has been compiled to include the code. */ + + #ifndef SUPPORT_UTF8 + if ((options & PCRE_UTF8) != 0) + { + *errorptr = ERR32; + return NULL; + } + #endif + + /* We can't pass back an error message if errorptr is NULL; I guess the best we + can do is just return NULL. */ + + if (errorptr == NULL) return NULL; + *errorptr = NULL; + + /* However, we can give a message for this error */ + + if (erroroffset == NULL) + { + *errorptr = ERR16; + return NULL; + } + *erroroffset = 0; + + if ((options & ~PUBLIC_OPTIONS) != 0) + { + *errorptr = ERR17; + return NULL; + } + + /* Set up pointers to the individual character tables */ + + if (tables == NULL) tables = pcre_default_tables; + compile_block.lcc = tables + lcc_offset; + compile_block.fcc = tables + fcc_offset; + compile_block.cbits = tables + cbits_offset; + compile_block.ctypes = tables + ctypes_offset; + + /* Reflect pattern for debugging output */ + + DPRINTF(("------------------------------------------------------------------\n")); + DPRINTF(("%s\n", pattern)); + + /* The first thing to do is to make a pass over the pattern to compute the + amount of store required to hold the compiled code. This does not have to be + perfect as long as errors are overestimates. At the same time we can detect any + internal flag settings. Make an attempt to correct for any counted white space + if an "extended" flag setting appears late in the pattern. We can't be so + clever for #-comments. */ + + ptr = (const uschar *)(pattern - 1); + while ((c = *(++ptr)) != 0) + { + int min, max; + int class_charcount; + + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + /* The space before the ; is to avoid a warning on a silly compiler + on the Macintosh. */ + while ((c = *(++ptr)) != 0 && c != '\n') ; + continue; + } + } + + switch(c) + { + /* A backslashed item may be an escaped "normal" character or a + character type. For a "normal" character, put the pointers and + character back so that tests for whitespace etc. in the input + are done correctly. */ + + case '\\': + { + const uschar *save_ptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c >= 0) + { + ptr = save_ptr; + c = '\\'; + goto NORMAL_CHAR; + } + } + length++; + + /* A back reference needs an additional char, plus either one or 5 + bytes for a repeat. We also need to keep the value of the highest + back reference. */ + + if (c <= -ESC_REF) + { + int refnum = -c - ESC_REF; + if (refnum > top_backref) top_backref = refnum; + length++; /* For single back reference */ + if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + case '^': + case '.': + case '$': + case '*': /* These repeats won't be after brackets; */ + case '+': /* those are handled separately */ + case '?': + length++; + continue; + + /* This covers the cases of repeats after a single char, metachar, class, + or back reference. */ + + case '{': + if (!is_counted_repeat(ptr+1, &compile_block)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else + { + length--; /* Uncount the original char or metachar */ + if (min == 1) length++; else if (min > 0) length += 4; + if (max > 0) length += 4; else length += 2; + } + if (ptr[1] == '?') ptr++; + continue; + + /* An alternation contains an offset to the next branch or ket. If any ims + options changed in the previous branch(es), and/or if we are in a + lookbehind assertion, extra space will be needed at the start of the + branch. This is handled by branch_extra. */ + + case '|': + length += 3 + branch_extra; + continue; + + /* A character class uses 33 characters. Don't worry about character types + that aren't allowed in classes - they'll get picked up during the compile. + A character class that contains only one character uses 2 or 3 bytes, + depending on whether it is negated or not. Notice this where we can. */ + + case '[': + class_charcount = 0; + if (*(++ptr) == '^') ptr++; + do + { + if (*ptr == '\\') + { + int ch = check_escape(&ptr, errorptr, bracount, options, TRUE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (-ch == ESC_b) class_charcount++; else class_charcount = 10; + } + else class_charcount++; + ptr++; + } + while (*ptr != 0 && *ptr != ']'); + + /* Repeats for negated single chars are handled by the general code */ + + if (class_charcount == 1) length += 3; else + { + length += 33; + + /* A repeat needs either 1 or 5 bytes. */ + + if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + /* Brackets may be genuine groups or special things */ + + case '(': + branch_newextra = 0; + + /* Handle special forms of bracket, which all start (? */ + + if (ptr[1] == '?') + { + int set, unset; + int *optset; + + switch (c = ptr[2]) + { + /* Skip over comments entirely */ + case '#': + ptr += 3; + while (*ptr != 0 && *ptr != ')') ptr++; + if (*ptr == 0) + { + *errorptr = ERR18; + goto PCRE_ERROR_RETURN; + } + continue; + + /* Non-referencing groups and lookaheads just move the pointer on, and + then behave like a non-special bracket, except that they don't increment + the count of extracting brackets. Ditto for the "once only" bracket, + which is in Perl from version 5.005. */ + + case ':': + case '=': + case '!': + case '>': + ptr += 2; + break; + + /* A recursive call to the regex is an extension, to provide the + facility which can be obtained by $(?p{perl-code}) in Perl 5.6. */ + + case 'R': + if (ptr[3] != ')') + { + *errorptr = ERR29; + goto PCRE_ERROR_RETURN; + } + ptr += 3; + length += 1; + break; + + /* Lookbehinds are in Perl from version 5.005 */ + + case '<': + if (ptr[3] == '=' || ptr[3] == '!') + { + ptr += 3; + branch_newextra = 3; + length += 3; /* For the first branch */ + break; + } + *errorptr = ERR24; + goto PCRE_ERROR_RETURN; + + /* Conditionals are in Perl from version 5.005. The bracket must either + be followed by a number (for bracket reference) or by an assertion + group. */ + + case '(': + if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0) + { + ptr += 4; + length += 2; + while ((compile_block.ctypes[*ptr] & ctype_digit) != 0) ptr++; + if (*ptr != ')') + { + *errorptr = ERR26; + goto PCRE_ERROR_RETURN; + } + } + else /* An assertion must follow */ + { + ptr++; /* Can treat like ':' as far as spacing is concerned */ + if (ptr[2] != '?' || + (ptr[3] != '=' && ptr[3] != '!' && ptr[3] != '<') ) + { + ptr += 2; /* To get right offset in message */ + *errorptr = ERR28; + goto PCRE_ERROR_RETURN; + } + } + break; + + /* Else loop checking valid options until ) is met. Anything else is an + error. If we are without any brackets, i.e. at top level, the settings + act as if specified in the options, so massage the options immediately. + This is for backward compatibility with Perl 5.004. */ + + default: + set = unset = 0; + optset = &set; + ptr += 2; + + for (;; ptr++) + { + c = *ptr; + switch (c) + { + case 'i': + *optset |= PCRE_CASELESS; + continue; + + case 'm': + *optset |= PCRE_MULTILINE; + continue; + + case 's': + *optset |= PCRE_DOTALL; + continue; + + case 'x': + *optset |= PCRE_EXTENDED; + continue; + + case 'X': + *optset |= PCRE_EXTRA; + continue; + + case 'U': + *optset |= PCRE_UNGREEDY; + continue; + + case '-': + optset = &unset; + continue; + + /* A termination by ')' indicates an options-setting-only item; + this is global at top level; otherwise nothing is done here and + it is handled during the compiling process on a per-bracket-group + basis. */ + + case ')': + if (brastackptr == 0) + { + options = (options | set) & (~unset); + set = unset = 0; /* To save length */ + } + /* Fall through */ + + /* A termination by ':' indicates the start of a nested group with + the given options set. This is again handled at compile time, but + we must allow for compiled space if any of the ims options are + set. We also have to allow for resetting space at the end of + the group, which is why 4 is added to the length and not just 2. + If there are several changes of options within the same group, this + will lead to an over-estimate on the length, but this shouldn't + matter very much. We also have to allow for resetting options at + the start of any alternations, which we do by setting + branch_newextra to 2. Finally, we record whether the case-dependent + flag ever changes within the regex. This is used by the "required + character" code. */ + + case ':': + if (((set|unset) & PCRE_IMS) != 0) + { + length += 4; + branch_newextra = 2; + if (((set|unset) & PCRE_CASELESS) != 0) options |= PCRE_ICHANGED; + } + goto END_OPTIONS; + + /* Unrecognized option character */ + + default: + *errorptr = ERR12; + goto PCRE_ERROR_RETURN; + } + } + + /* If we hit a closing bracket, that's it - this is a freestanding + option-setting. We need to ensure that branch_extra is updated if + necessary. The only values branch_newextra can have here are 0 or 2. + If the value is 2, then branch_extra must either be 2 or 5, depending + on whether this is a lookbehind group or not. */ + + END_OPTIONS: + if (c == ')') + { + if (branch_newextra == 2 && (branch_extra == 0 || branch_extra == 3)) + branch_extra += branch_newextra; + continue; + } + + /* If options were terminated by ':' control comes here. Fall through + to handle the group below. */ + } + } + + /* Extracting brackets must be counted so we can process escapes in a + Perlish way. */ + + else bracount++; + + /* Non-special forms of bracket. Save length for computing whole length + at end if there's a repeat that requires duplication of the group. Also + save the current value of branch_extra, and start the new group with + the new value. If non-zero, this will either be 2 for a (?imsx: group, or 3 + for a lookbehind assertion. */ + + if (brastackptr >= sizeof(brastack)/sizeof(int)) + { + *errorptr = ERR19; + goto PCRE_ERROR_RETURN; + } + + bralenstack[brastackptr] = branch_extra; + branch_extra = branch_newextra; + + brastack[brastackptr++] = length; + length += 3; + continue; + + /* Handle ket. Look for subsequent max/min; for certain sets of values we + have to replicate this bracket up to that many times. If brastackptr is + 0 this is an unmatched bracket which will generate an error, but take care + not to try to access brastack[-1] when computing the length and restoring + the branch_extra value. */ + + case ')': + length += 3; + { + int minval = 1; + int maxval = 1; + int duplength; + + if (brastackptr > 0) + { + duplength = length - brastack[--brastackptr]; + branch_extra = bralenstack[brastackptr]; + } + else duplength = 0; + + /* Leave ptr at the final char; for read_repeat_counts this happens + automatically; for the others we need an increment. */ + + if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &minval, &maxval, errorptr, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + } + else if (c == '*') { minval = 0; maxval = -1; ptr++; } + else if (c == '+') { maxval = -1; ptr++; } + else if (c == '?') { minval = 0; ptr++; } + + /* If the minimum is zero, we have to allow for an OP_BRAZERO before the + group, and if the maximum is greater than zero, we have to replicate + maxval-1 times; each replication acquires an OP_BRAZERO plus a nesting + bracket set - hence the 7. */ + + if (minval == 0) + { + length++; + if (maxval > 0) length += (maxval - 1) * (duplength + 7); + } + + /* When the minimum is greater than zero, 1 we have to replicate up to + minval-1 times, with no additions required in the copies. Then, if + there is a limited maximum we have to replicate up to maxval-1 times + allowing for a BRAZERO item before each optional copy and nesting + brackets for all but one of the optional copies. */ + + else + { + length += (minval - 1) * duplength; + if (maxval > minval) /* Need this test as maxval=-1 means no limit */ + length += (maxval - minval) * (duplength + 7) - 6; + } + } + continue; + + /* Non-special character. For a run of such characters the length required + is the number of characters + 2, except that the maximum run length is 255. + We won't get a skipped space or a non-data escape or the start of a # + comment as the first character, so the length can't be zero. */ + + NORMAL_CHAR: + default: + length += 2; + runlength = 0; + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + /* The space before the ; is to avoid a warning on a silly compiler + on the Macintosh. */ + while ((c = *(++ptr)) != 0 && c != '\n') ; + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter; stop the + string before the latter. */ + + if (c == '\\') + { + const uschar *saveptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c < 0) { ptr = saveptr; break; } + + #ifdef SUPPORT_UTF8 + if (c > 127 && (options & PCRE_UTF8) != 0) + { + int i; + for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++) + if (c <= utf8_table1[i]) break; + runlength += i; + } + #endif + } + + /* Ordinary character or single-char escape */ + + runlength++; + } + + /* This "while" is the end of the "do" above. */ + + while (runlength < MAXLIT && + (compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0); + + ptr--; + length += runlength; + continue; + } + } + + length += 4; /* For final KET and END */ + + if (length > 65539) + { + *errorptr = ERR20; + return NULL; + } + + /* Compute the size of data block needed and get it, either from malloc or + externally provided function. We specify "code[0]" in the offsetof() expression + rather than just "code", because it has been reported that one broken compiler + fails on "code" because it is also an independent variable. It should make no + difference to the value of the offsetof(). */ + + size = length + offsetof(real_pcre, code[0]); + re = (real_pcre *)(pcre_malloc)(size); + + if (re == NULL) + { + *errorptr = ERR21; + return NULL; + } + + /* Put in the magic number, and save the size, options, and table pointer */ + + re->magic_number = MAGIC_NUMBER; + re->size = size; + re->options = options; + re->tables = tables; + + /* Set up a starting, non-extracting bracket, then compile the expression. On + error, *errorptr will be set non-NULL, so we don't need to look at the result + of the function here. */ + + ptr = (const uschar *)pattern; + code = re->code; + *code = OP_BRA; + bracount = 0; + (void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, -1, + &reqchar, &countlits, &compile_block); + re->top_bracket = bracount; + re->top_backref = top_backref; + + /* If not reached end of pattern on success, there's an excess bracket. */ + + if (*errorptr == NULL && *ptr != 0) *errorptr = ERR22; + + /* Fill in the terminating state and check for disastrous overflow, but + if debugging, leave the test till after things are printed out. */ + + *code++ = OP_END; + + #ifndef DEBUG + if (code - re->code > length) *errorptr = ERR23; + #endif + + /* Give an error if there's back reference to a non-existent capturing + subpattern. */ + + if (top_backref > re->top_bracket) *errorptr = ERR15; + + /* Failed to compile */ + + if (*errorptr != NULL) + { + (pcre_free)(re); + PCRE_ERROR_RETURN: + *erroroffset = ptr - (const uschar *)pattern; + return NULL; + } + + /* If the anchored option was not passed, set flag if we can determine that the + pattern is anchored by virtue of ^ characters or \A or anything else (such as + starting with .* when DOTALL is set). + + Otherwise, see if we can determine what the first character has to be, because + that speeds up unanchored matches no end. If not, see if we can set the + PCRE_STARTLINE flag. This is helpful for multiline matches when all branches + start with ^. and also when all branches start with .* for non-DOTALL matches. + */ + + if ((options & PCRE_ANCHORED) == 0) + { + int temp_options = options; + if (is_anchored(re->code, &temp_options)) + re->options |= PCRE_ANCHORED; + else + { + int ch = find_firstchar(re->code, &temp_options); + if (ch >= 0) + { + re->first_char = ch; + re->options |= PCRE_FIRSTSET; + } + else if (is_startline(re->code)) + re->options |= PCRE_STARTLINE; + } + } + + /* Save the last required character if there are at least two literal + characters on all paths, or if there is no first character setting. */ + + if (reqchar >= 0 && (countlits > 1 || (re->options & PCRE_FIRSTSET) == 0)) + { + re->req_char = reqchar; + re->options |= PCRE_REQCHSET; + } + + /* Print out the compiled data for debugging */ + + #ifdef DEBUG + + printf("Length = %d top_bracket = %d top_backref = %d\n", + length, re->top_bracket, re->top_backref); + + if (re->options != 0) + { + printf("%s%s%s%s%s%s%s%s%s\n", + ((re->options & PCRE_ANCHORED) != 0)? "anchored " : "", + ((re->options & PCRE_CASELESS) != 0)? "caseless " : "", + ((re->options & PCRE_ICHANGED) != 0)? "case state changed " : "", + ((re->options & PCRE_EXTENDED) != 0)? "extended " : "", + ((re->options & PCRE_MULTILINE) != 0)? "multiline " : "", + ((re->options & PCRE_DOTALL) != 0)? "dotall " : "", + ((re->options & PCRE_DOLLAR_ENDONLY) != 0)? "endonly " : "", + ((re->options & PCRE_EXTRA) != 0)? "extra " : "", + ((re->options & PCRE_UNGREEDY) != 0)? "ungreedy " : ""); + } + + if ((re->options & PCRE_FIRSTSET) != 0) + { + if (isprint(re->first_char)) printf("First char = %c\n", re->first_char); + else printf("First char = \\x%02x\n", re->first_char); + } + + if ((re->options & PCRE_REQCHSET) != 0) + { + if (isprint(re->req_char)) printf("Req char = %c\n", re->req_char); + else printf("Req char = \\x%02x\n", re->req_char); + } + + code_end = code; + code_base = code = re->code; + + while (code < code_end) + { + int charlength; + + printf("%3d ", code - code_base); + + if (*code >= OP_BRA) + { + printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_OPT: + printf(" %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + printf("%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + printf(" %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + printf("%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_REVERSE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + printf(" %s", OP_names[code[1]]); + else if (isprint(c = code[1])) printf(" %c", c); + else printf(" \\x%02x", c); + printf("%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) printf(" %c{", c); + else printf(" \\x%02x{", c); + if (*code != OP_EXACT) printf("0,"); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) printf("?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + printf(" %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) printf("?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + printf("%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) printf(" [^%c]{", c); + else printf(" [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) printf("?"); + code += 3; + break; + + case OP_REF: + printf(" \\%d", *(++code)); + code ++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + printf(" ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') printf("\\"); + if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); + if (--j > i) + { + printf("-"); + if (j == '-' || j == ']') printf("\\"); + if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); + } + i = j; + } + } + printf("]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + printf("%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) printf("{%d,}", min); + else printf("{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) printf("?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + printf(" %s", OP_names[*code]); + break; + } + + code++; + printf("\n"); + } + printf("------------------------------------------------------------------\n"); + + /* This check is done here in the debugging case so that the code that + was compiled can be seen. */ + + if (code - re->code > length) + { + *errorptr = ERR23; + (pcre_free)(re); + *erroroffset = ptr - (uschar *)pattern; + return NULL; + } + #endif + + return (pcre *)re; + } + + + + /************************************************* + * Match a back-reference * + *************************************************/ + + /* If a back reference hasn't been set, the length that is passed is greater + than the number of characters left in the string, so the match fails. + + Arguments: + offset index into the offset vector + eptr points into the subject + length length to be matched + md points to match data block + ims the ims flags + + Returns: TRUE if matched + */ + + static BOOL + match_ref(int offset, register const uschar *eptr, int length, match_data *md, + unsigned long int ims) + { + const uschar *p = md->start_subject + md->offset_vector[offset]; + + #ifdef DEBUG + if (eptr >= md->end_subject) + printf("matching subject <null>"); + else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + } + printf(" against backref "); + pchars(p, length, FALSE, md); + printf("\n"); + #endif + + /* Always fail if not enough characters left */ + + if (length > md->end_subject - eptr) return FALSE; + + /* Separate the caselesss case for speed */ + + if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*p++] != md->lcc[*eptr++]) return FALSE; + } + else + { while (length-- > 0) if (*p++ != *eptr++) return FALSE; } + + return TRUE; + } + + + + /************************************************* + * Match from current position * + *************************************************/ + + /* On entry ecode points to the first opcode, and eptr to the first character + in the subject string, while eptrb holds the value of eptr at the start of the + last bracketed group - used for breaking infinite loops matching zero-length + strings. + + Arguments: + eptr pointer in subject + ecode position in code + offset_top current top pointer + md pointer to "static" info for the match + ims current /i, /m, and /s options + eptrb pointer to chain of blocks containing eptr at start of + brackets - for testing for empty matches + flags can contain + match_condassert - this is an assertion condition + match_isgroup - this is the start of a bracketed group + + Returns: TRUE if matched + */ + + static BOOL + match(register const uschar *eptr, register const uschar *ecode, + int offset_top, match_data *md, unsigned long int ims, eptrblock *eptrb, + int flags) + { + unsigned long int original_ims = ims; /* Save for resetting on ')' */ + eptrblock newptrb; + + /* At the start of a bracketed group, add the current subject pointer to the + stack of such pointers, to be re-instated at the end of the group when we hit + the closing ket. When match() is called in other circumstances, we don't add to + the stack. */ + + if ((flags & match_isgroup) != 0) + { + newptrb.prev = eptrb; + newptrb.saved_eptr = eptr; + eptrb = &newptrb; + } + + /* Now start processing the operations. */ + + for (;;) + { + int op = (int)*ecode; + int min, max, ctype; + register int i; + register int c; + BOOL minimize = FALSE; + + /* Opening capturing bracket. If there is space in the offset vector, save + the current subject position in the working slot at the top of the vector. We + mustn't change the current values of the data slot, because they may be set + from a previous iteration of this group, and be referred to by a reference + inside the group. + + If the bracket fails to match, we need to restore this value and also the + values of the final offsets, in case they were set by a previous iteration of + the same bracket. + + If there isn't enough space in the offset vector, treat this as if it were a + non-capturing bracket. Don't worry about setting the flag for the error case + here; that is handled in the code for KET. */ + + if (op > OP_BRA) + { + int number = op - OP_BRA; + int offset = number << 1; + + #ifdef DEBUG + printf("start bracket %d subject=", number); + pchars(eptr, 16, TRUE, md); + printf("\n"); + #endif + + if (offset < md->offset_max) + { + int save_offset1 = md->offset_vector[offset]; + int save_offset2 = md->offset_vector[offset+1]; + int save_offset3 = md->offset_vector[md->offset_end - number]; + + DPRINTF(("saving %d %d %d\n", save_offset1, save_offset2, save_offset3)); + md->offset_vector[md->offset_end - number] = eptr - md->start_subject; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + DPRINTF(("bracket %d failed\n", number)); + + md->offset_vector[offset] = save_offset1; + md->offset_vector[offset+1] = save_offset2; + md->offset_vector[md->offset_end - number] = save_offset3; + return FALSE; + } + + /* Insufficient room for saving captured contents */ + + else op = OP_BRA; + } + + /* Other types of node can be handled by a switch */ + + switch(op) + { + case OP_BRA: /* Non-capturing bracket: optimized */ + DPRINTF(("start bracket 0\n")); + do + { + if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + DPRINTF(("bracket 0 failed\n")); + return FALSE; + + /* Conditional group: compilation checked that there are no more than + two branches. If the condition is false, skipping the first branch takes us + past the end if there is only one branch, but that's OK because that is + exactly what going to the ket would do. */ + + case OP_COND: + if (ecode[3] == OP_CREF) /* Condition is extraction test */ + { + int offset = ecode[4] << 1; /* Doubled reference number */ + return match(eptr, + ecode + ((offset < offset_top && md->offset_vector[offset] >= 0)? + 5 : 3 + (ecode[1] << 8) + ecode[2]), + offset_top, md, ims, eptrb, match_isgroup); + } + + /* The condition is an assertion. Call match() to evaluate it - setting + the final argument TRUE causes it to stop at the end of an assertion. */ + + else + { + if (match(eptr, ecode+3, offset_top, md, ims, NULL, + match_condassert | match_isgroup)) + { + ecode += 3 + (ecode[4] << 8) + ecode[5]; + while (*ecode == OP_ALT) ecode += (ecode[1] << 8) + ecode[2]; + } + else ecode += (ecode[1] << 8) + ecode[2]; + return match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup); + } + /* Control never reaches here */ + + /* Skip over conditional reference data if encountered (should not be) */ + + case OP_CREF: + ecode += 2; + break; + + /* End of the pattern. If PCRE_NOTEMPTY is set, fail if we have matched + an empty string - recursion will then try other alternatives, if any. */ + + case OP_END: + if (md->notempty && eptr == md->start_match) return FALSE; + md->end_match_ptr = eptr; /* Record where we ended */ + md->end_offset_top = offset_top; /* and how many extracts were taken */ + return TRUE; + + /* Change option settings */ + + case OP_OPT: + ims = ecode[1]; + ecode += 2; + DPRINTF(("ims set to %02lx\n", ims)); + break; + + /* Assertion brackets. Check the alternative branches in turn - the + matching won't pass the KET for an assertion. If any one branch matches, + the assertion is true. Lookbehind assertions have an OP_REVERSE item at the + start of each branch to move the current point backwards, so the code at + this level is identical to the lookahead case. */ + + case OP_ASSERT: + case OP_ASSERTBACK: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, NULL, match_isgroup)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + if (*ecode == OP_KET) return FALSE; + + /* If checking an assertion for a condition, return TRUE. */ + + if ((flags & match_condassert) != 0) return TRUE; + + /* Continue from after the assertion, updating the offsets high water + mark, since extracts may have been taken during the assertion. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + ecode += 3; + offset_top = md->end_offset_top; + continue; + + /* Negative assertion: all branches must fail to match */ + + case OP_ASSERT_NOT: + case OP_ASSERTBACK_NOT: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, NULL, match_isgroup)) + return FALSE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + if ((flags & match_condassert) != 0) return TRUE; + + ecode += 3; + continue; + + /* Move the subject pointer back. This occurs only at the start of + each branch of a lookbehind assertion. If we are too close to the start to + move back, this match function fails. When working with UTF-8 we move + back a number of characters, not bytes. */ + + case OP_REVERSE: + #ifdef SUPPORT_UTF8 + c = (ecode[1] << 8) + ecode[2]; + for (i = 0; i < c; i++) + { + eptr--; + BACKCHAR(eptr) + } + #else + eptr -= (ecode[1] << 8) + ecode[2]; + #endif + + if (eptr < md->start_subject) return FALSE; + ecode += 3; + break; + + /* Recursion matches the current regex, nested. If there are any capturing + brackets started but not finished, we have to save their starting points + and reinstate them after the recursion. However, we don't know how many + such there are (offset_top records the completed total) so we just have + to save all the potential data. There may be up to 99 such values, which + is a bit large to put on the stack, but using malloc for small numbers + seems expensive. As a compromise, the stack is used when there are fewer + than 16 values to store; otherwise malloc is used. A problem is what to do + if the malloc fails ... there is no way of returning to the top level with + an error. Save the top 15 values on the stack, and accept that the rest + may be wrong. */ + + case OP_RECURSE: + { + BOOL rc; + int *save; + int stacksave[15]; + + c = md->offset_max; + + if (c < 16) save = stacksave; else + { + save = (int *)(pcre_malloc)((c+1) * sizeof(int)); + if (save == NULL) + { + save = stacksave; + c = 15; + } + } + + for (i = 1; i <= c; i++) + save[i] = md->offset_vector[md->offset_end - i]; + rc = match(eptr, md->start_pattern, offset_top, md, ims, eptrb, + match_isgroup); + for (i = 1; i <= c; i++) + md->offset_vector[md->offset_end - i] = save[i]; + if (save != stacksave) (pcre_free)(save); + if (!rc) return FALSE; + + /* In case the recursion has set more capturing values, save the final + number, then move along the subject till after the recursive match, + and advance one byte in the pattern code. */ + + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + ecode++; + } + break; + + /* "Once" brackets are like assertion brackets except that after a match, + the point in the subject string is not moved back. Thus there can never be + a move back into the brackets. Check the alternative branches in turn - the + matching won't pass the KET for this kind of subpattern. If any one branch + matches, we carry on as at the end of a normal bracket, leaving the subject + pointer. */ + + case OP_ONCE: + { + const uschar *prev = ecode; + const uschar *saved_eptr = eptr; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) + break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + /* If hit the end of the group (which could be repeated), fail */ + + if (*ecode != OP_ONCE && *ecode != OP_ALT) return FALSE; + + /* Continue as from after the assertion, updating the offsets high water + mark, since extracts may have been taken. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + + /* For a non-repeating ket, just continue at this level. This also + happens for a repeating ket if no characters were matched in the group. + This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == saved_eptr) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. We need to reset any options + that changed within the bracket before re-running it, so check the next + opcode. */ + + if (ecode[3] == OP_OPT) + { + ims = (ims & ~PCRE_IMS) | ecode[4]; + DPRINTF(("ims set to %02lx at group repeat\n", ims)); + } + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, eptrb, 0) || + match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup) || + match(eptr, ecode+3, offset_top, md, ims, eptrb, 0)) return TRUE; + } + } + return FALSE; + + /* An alternation is the end of a branch; scan along to find the end of the + bracketed group and go to there. */ + + case OP_ALT: + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + break; + + /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating + that it may occur zero times. It may repeat infinitely, or not at all - + i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper + repeat limits are compiled as a number of copies, with the optional ones + preceded by BRAZERO or BRAMINZERO. */ + + case OP_BRAZERO: + { + const uschar *next = ecode+1; + if (match(eptr, next, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + ecode = next + 3; + } + break; + + case OP_BRAMINZERO: + { + const uschar *next = ecode+1; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + if (match(eptr, next+3, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + ecode++; + } + break; + + /* End of a group, repeated or non-repeating. If we are at the end of + an assertion "group", stop matching and return TRUE, but record the + current high water mark for use by positive assertions. Do this also + for the "once" (not-backup up) groups. */ + + case OP_KET: + case OP_KETRMIN: + case OP_KETRMAX: + { + const uschar *prev = ecode - (ecode[1] << 8) - ecode[2]; + const uschar *saved_eptr = eptrb->saved_eptr; + + eptrb = eptrb->prev; /* Back up the stack of bracket start pointers */ + + if (*prev == OP_ASSERT || *prev == OP_ASSERT_NOT || + *prev == OP_ASSERTBACK || *prev == OP_ASSERTBACK_NOT || + *prev == OP_ONCE) + { + md->end_match_ptr = eptr; /* For ONCE */ + md->end_offset_top = offset_top; + return TRUE; + } + + /* In all other cases except a conditional group we have to check the + group number back at the start and if necessary complete handling an + extraction by setting the offsets and bumping the high water mark. */ + + if (*prev != OP_COND) + { + int number = *prev - OP_BRA; + int offset = number << 1; + + #ifdef DEBUG + printf("end bracket %d", number); + printf("\n"); + #endif + + if (number > 0) + { + if (offset >= md->offset_max) md->offset_overflow = TRUE; else + { + md->offset_vector[offset] = + md->offset_vector[md->offset_end - number]; + md->offset_vector[offset+1] = eptr - md->start_subject; + if (offset_top <= offset) offset_top = offset + 2; + } + } + } + + /* Reset the value of the ims flags, in case they got changed during + the group. */ + + ims = original_ims; + DPRINTF(("ims reset to %02lx\n", ims)); + + /* For a non-repeating ket, just continue at this level. This also + happens for a repeating ket if no characters were matched in the group. + This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == saved_eptr) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. */ + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, eptrb, 0) || + match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup)) + return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup) || + match(eptr, ecode+3, offset_top, md, ims, eptrb, 0)) return TRUE; + } + } + return FALSE; + + /* Start of subject unless notbol, or after internal newline if multiline */ + + case OP_CIRC: + if (md->notbol && eptr == md->start_subject) return FALSE; + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr != md->start_subject && eptr[-1] != '\n') return FALSE; + ecode++; + break; + } + /* ... else fall through */ + + /* Start of subject assertion */ + + case OP_SOD: + if (eptr != md->start_subject) return FALSE; + ecode++; + break; + + /* Assert before internal newline if multiline, or before a terminating + newline unless endonly is set, else end of subject unless noteol is set. */ + + case OP_DOLL: + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr < md->end_subject) { if (*eptr != '\n') return FALSE; } + else { if (md->noteol) return FALSE; } + ecode++; + break; + } + else + { + if (md->noteol) return FALSE; + if (!md->endonly) + { + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + + ecode++; + break; + } + } + /* ... else fall through */ + + /* End of subject assertion (\z) */ + + case OP_EOD: + if (eptr < md->end_subject) return FALSE; + ecode++; + break; + + /* End of subject or ending \n assertion (\Z) */ + + case OP_EODN: + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + ecode++; + break; + + /* Word boundary assertions */ + + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + { + BOOL prev_is_word = (eptr != md->start_subject) && + ((md->ctypes[eptr[-1]] & ctype_word) != 0); + BOOL cur_is_word = (eptr < md->end_subject) && + ((md->ctypes[*eptr] & ctype_word) != 0); + if ((*ecode++ == OP_WORD_BOUNDARY)? + cur_is_word == prev_is_word : cur_is_word != prev_is_word) + return FALSE; + } + break; + + /* Match a single character type; inline for speed */ + + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n') + return FALSE; + if (eptr++ >= md->end_subject) return FALSE; + #ifdef SUPPORT_UTF8 + if (md->utf8) + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + #endif + ecode++; + break; + + case OP_NOT_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) != 0) + return FALSE; + ecode++; + break; + + case OP_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) != 0) + return FALSE; + ecode++; + break; + + case OP_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + ecode++; + break; + + case OP_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + ecode++; + break; + + /* Match a back reference, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. The code is similar + to that for character classes, but repeated for efficiency. Then obey + similar code to character type repeats - written out again for speed. + However, if the referenced string is the empty string, always treat + it as matched, any number of times (otherwise there could be infinite + loops). */ + + case OP_REF: + { + int length; + int offset = ecode[1] << 1; /* Doubled reference number */ + ecode += 2; /* Advance past the item */ + + /* If the reference is unset, set the length to be longer than the amount + of subject left; this ensures that every attempt at a match fails. We + can't just fail here, because of the possibility of quantifiers with zero + minima. */ + + length = (offset >= offset_top || md->offset_vector[offset] < 0)? + md->end_subject - eptr + 1 : + md->offset_vector[offset+1] - md->offset_vector[offset]; + + /* Set up for repetition, or handle the non-repeated case */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + continue; /* With the main loop */ + } + + /* If the length of the reference is zero, just continue with the + main loop. */ + + if (length == 0) continue; + + /* First, ensure the minimum number of matches are present. We get back + the length of the reference string explicitly rather than passing the + address of eptr, so that eptr can be a register variable. */ + + for (i = 1; i <= min; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + } + + /* If min = max, continue at the same level without recursion. + They are not both allowed to be zero. */ + + if (min == max) continue; + + /* If minimizing, keep trying and advancing the pointer */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || !match_ref(offset, eptr, length, md, ims)) + return FALSE; + eptr += length; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest string and work backwards */ + + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) break; + eptr += length; + } + while (eptr >= pp) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + eptr -= length; + } + return FALSE; + } + } + /* Control never gets here */ + + + + /* Match a character class, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. Then obey similar + code to character type repeats - written out again for speed. */ + + case OP_CLASS: + { + const uschar *data = ecode + 1; /* Save for matching */ + ecode += 33; /* Advance past the item */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + min = max = 1; + break; + } + + /* First, ensure the minimum number of matches are present. */ + + for (i = 1; i <= min; i++) + { + if (eptr >= md->end_subject) return FALSE; + GETCHARINC(c, eptr) /* Get character; increment eptr */ + + #ifdef SUPPORT_UTF8 + /* We do not yet support class members > 255 */ + if (c > 255) return FALSE; + #endif + + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + + /* If max == min we can continue with the main loop without the + need to recurse. */ + + if (min == max) continue; + + /* If minimizing, keep testing the rest of the expression and advancing + the pointer while it matches the class. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + GETCHARINC(c, eptr) /* Get character; increment eptr */ + + #ifdef SUPPORT_UTF8 + /* We do not yet support class members > 255 */ + if (c > 255) return FALSE; + #endif + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest possible run, then work backwards. */ + + else + { + const uschar *pp = eptr; + int len = 1; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject) break; + GETCHARLEN(c, eptr, len) /* Get character, set length if UTF-8 */ + + #ifdef SUPPORT_UTF8 + /* We do not yet support class members > 255 */ + if (c > 255) break; + #endif + if ((data[c/8] & (1 << (c&7))) == 0) break; + eptr += len; + } + + while (eptr >= pp) + { + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + + #ifdef SUPPORT_UTF8 + BACKCHAR(eptr) + #endif + } + return FALSE; + } + } + /* Control never gets here */ + + /* Match a run of characters */ + + case OP_CHARS: + { + register int length = ecode[1]; + ecode += 2; + + #ifdef DEBUG /* Sigh. Some compilers never learn. */ + if (eptr >= md->end_subject) + printf("matching subject <null> against pattern "); + else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + printf(" against pattern "); + } + pchars(ecode, length, FALSE, md); + printf("\n"); + #endif + + if (length > md->end_subject - eptr) return FALSE; + if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*ecode++] != md->lcc[*eptr++]) + return FALSE; + } + else + { + while (length-- > 0) if (*ecode++ != *eptr++) return FALSE; + } + } + break; + + /* Match a single character repeatedly; different opcodes share code. */ + + case OP_EXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATCHAR; + + case OP_UPTO: + case OP_MINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_MINUPTO; + ecode += 3; + goto REPEATCHAR; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + c = *ecode++ - OP_STAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c != md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c != md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c != *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c != *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a negated single character */ + + case OP_NOT: + if (eptr >= md->end_subject) return FALSE; + ecode++; + if ((ims & PCRE_CASELESS) != 0) + { + if (md->lcc[*ecode++] == md->lcc[*eptr++]) return FALSE; + } + else + { + if (*ecode++ == *eptr++) return FALSE; + } + break; + + /* Match a negated single character repeatedly. This is almost a repeat of + the code for a repeated single character, but I haven't found a nice way of + commoning these up that doesn't require a test of the positive/negative + option for each character match. Maybe that wouldn't add very much to the + time taken, but character matching *is* what this is all about... */ + + case OP_NOTEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTUPTO: + case OP_NOTMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_NOTMINUPTO; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + c = *ecode++ - OP_NOTSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATNOTCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("negative matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c == md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c == md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c == *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c == *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a single character type repeatedly; several different opcodes + share code. This is very similar to the code for single characters, but we + repeat it in the interests of efficiency. */ + + case OP_TYPEEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + minimize = TRUE; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_TYPEMINUPTO; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + c = *ecode++ - OP_TYPESTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single character type matches */ + + REPEATTYPE: + ctype = *ecode++; /* Code for the character type */ + + /* First, ensure the minimum number of matches are present. Use inline + code for maximizing the speed, and do the type test once at the start + (i.e. keep it out of the loop). Also we can test that there are at least + the minimum number of bytes before we start, except when doing '.' in + UTF8 mode. Leave the test in in all cases; in the special case we have + to test after each character. */ + + if (min > md->end_subject - eptr) return FALSE; + if (min > 0) switch(ctype) + { + case OP_ANY: + #ifdef SUPPORT_UTF8 + if (md->utf8) + { + for (i = 1; i <= min; i++) + { + if (eptr >= md->end_subject || + (*eptr++ == '\n' && (ims & PCRE_DOTALL) == 0)) + return FALSE; + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + } + break; + } + #endif + /* Non-UTF8 can be faster */ + if ((ims & PCRE_DOTALL) == 0) + { for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; } + else eptr += min; + break; + + case OP_NOT_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + break; + + case OP_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + break; + } + + /* If min = max, continue at the same level without recursing */ + + if (min == max) continue; + + /* If minimizing, we have to test the rest of the pattern before each + subsequent match. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, eptrb, 0)) return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + + c = *eptr++; + switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE; + #ifdef SUPPORT_UTF8 + if (md->utf8) + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + #endif + break; + + case OP_NOT_DIGIT: + if ((md->ctypes[c] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + if ((md->ctypes[c] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + if ((md->ctypes[c] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + if ((md->ctypes[c] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + if ((md->ctypes[c] & ctype_word) != 0) return FALSE; + break; + + case OP_WORDCHAR: + if ((md->ctypes[c] & ctype_word) == 0) return FALSE; + break; + } + } + /* Control never gets here */ + } + + /* If maximizing it is worth using inline code for speed, doing the type + test once at the start (i.e. keep it out of the loop). */ + + else + { + const uschar *pp = eptr; + switch(ctype) + { + case OP_ANY: + + /* Special code is required for UTF8, but when the maximum is unlimited + we don't need it. */ + + #ifdef SUPPORT_UTF8 + if (md->utf8 && max < INT_MAX) + { + if ((ims & PCRE_DOTALL) == 0) + { + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || *eptr++ == '\n') break; + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + } + } + else + { + for (i = min; i < max; i++) + { + eptr++; + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + } + } + break; + } + #endif + /* Non-UTF8 can be faster */ + if ((ims & PCRE_DOTALL) == 0) + { + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || *eptr == '\n') break; + eptr++; + } + } + else + { + c = max - min; + if (c > md->end_subject - eptr) c = md->end_subject - eptr; + eptr += c; + } + break; + + case OP_NOT_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) != 0) + break; + eptr++; + } + break; + + case OP_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) != 0) + break; + eptr++; + } + break; + + case OP_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) != 0) + break; + eptr++; + } + break; + + case OP_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) == 0) + break; + eptr++; + } + break; + } + + while (eptr >= pp) + { + if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0)) + return TRUE; + #ifdef SUPPORT_UTF8 + if (md->utf8) + while (eptr > pp && (*eptr & 0xc0) == 0x80) eptr--; + #endif + } + return FALSE; + } + /* Control never gets here */ + + /* There's been some horrible disaster. */ + + default: + DPRINTF(("Unknown opcode %d\n", *ecode)); + md->errorcode = PCRE_ERROR_UNKNOWN_NODE; + return FALSE; + } + + /* Do not stick any code in here without much thought; it is assumed + that "continue" in the code above comes out to here to repeat the main + loop. */ + + } /* End of main loop */ + /* Control never reaches here */ + } + + + + + /************************************************* + * Execute a Regular Expression * + *************************************************/ + + /* This function applies a compiled re to a subject string and picks out + portions of the string if it matches. Two elements in the vector are set for + each substring: the offsets to the start and end of the substring. + + Arguments: + external_re points to the compiled expression + external_extra points to "hints" from pcre_study() or is NULL + subject points to the subject string + length length of subject string (may contain binary zeros) + start_offset where to start in the subject string + options option bits + offsets points to a vector of ints to be filled in with offsets + offsetcount the number of elements in the vector + + Returns: > 0 => success; value is the number of elements filled in + = 0 => success, but offsets is not big enough + -1 => failed to match + < -1 => some kind of unexpected problem + */ + + int + pcre_exec(const pcre *external_re, const pcre_extra *external_extra, + const char *subject, int length, int start_offset, int options, int *offsets, + int offsetcount) + { + int resetcount, ocount; + int first_char = -1; + int req_char = -1; + int req_char2 = -1; + unsigned long int ims = 0; + match_data match_block; + const uschar *start_bits = NULL; + const uschar *start_match = (const uschar *)subject + start_offset; + const uschar *end_subject; + const uschar *req_char_ptr = start_match - 1; + const real_pcre *re = (const real_pcre *)external_re; + const real_pcre_extra *extra = (const real_pcre_extra *)external_extra; + BOOL using_temporary_offsets = FALSE; + BOOL anchored = ((re->options | options) & PCRE_ANCHORED) != 0; + BOOL startline = (re->options & PCRE_STARTLINE) != 0; + + if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION; + + if (re == NULL || subject == NULL || + (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL; + if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + + match_block.start_pattern = re->code; + match_block.start_subject = (const uschar *)subject; + match_block.end_subject = match_block.start_subject + length; + end_subject = match_block.end_subject; + + match_block.endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0; + match_block.utf8 = (re->options & PCRE_UTF8) != 0; + + match_block.notbol = (options & PCRE_NOTBOL) != 0; + match_block.noteol = (options & PCRE_NOTEOL) != 0; + match_block.notempty = (options & PCRE_NOTEMPTY) != 0; + + match_block.errorcode = PCRE_ERROR_NOMATCH; /* Default error */ + + match_block.lcc = re->tables + lcc_offset; + match_block.ctypes = re->tables + ctypes_offset; + + /* The ims options can vary during the matching as a result of the presence + of (?ims) items in the pattern. They are kept in a local variable so that + restoring at the exit of a group is easy. */ + + ims = re->options & (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL); + + /* If the expression has got more back references than the offsets supplied can + hold, we get a temporary bit of working store to use during the matching. + Otherwise, we can use the vector supplied, rounding down its size to a multiple + of 3. */ + + ocount = offsetcount - (offsetcount % 3); + + if (re->top_backref > 0 && re->top_backref >= ocount/3) + { + ocount = re->top_backref * 3 + 3; + match_block.offset_vector = (int *)(pcre_malloc)(ocount * sizeof(int)); + if (match_block.offset_vector == NULL) return PCRE_ERROR_NOMEMORY; + using_temporary_offsets = TRUE; + DPRINTF(("Got memory to hold back references\n")); + } + else match_block.offset_vector = offsets; + + match_block.offset_end = ocount; + match_block.offset_max = (2*ocount)/3; + match_block.offset_overflow = FALSE; + + /* Compute the minimum number of offsets that we need to reset each time. Doing + this makes a huge difference to execution time when there aren't many brackets + in the pattern. */ + + resetcount = 2 + re->top_bracket * 2; + if (resetcount > offsetcount) resetcount = ocount; + + /* Reset the working variable associated with each extraction. These should + never be used unless previously set, but they get saved and restored, and so we + initialize them to avoid reading uninitialized locations. */ + + if (match_block.offset_vector != NULL) + { + register int *iptr = match_block.offset_vector + ocount; + register int *iend = iptr - resetcount/2 + 1; + while (--iptr >= iend) *iptr = -1; + } + + /* Set up the first character to match, if available. The first_char value is + never set for an anchored regular expression, but the anchoring may be forced + at run time, so we have to test for anchoring. The first char may be unset for + an unanchored pattern, of course. If there's no first char and the pattern was + studied, there may be a bitmap of possible first characters. */ + + if (!anchored) + { + if ((re->options & PCRE_FIRSTSET) != 0) + { + first_char = re->first_char; + if ((ims & PCRE_CASELESS) != 0) first_char = match_block.lcc[first_char]; + } + else + if (!startline && extra != NULL && + (extra->options & PCRE_STUDY_MAPPED) != 0) + start_bits = extra->start_bits; + } + + /* For anchored or unanchored matches, there may be a "last known required + character" set. If the PCRE_CASELESS is set, implying that the match starts + caselessly, or if there are any changes of this flag within the regex, set up + both cases of the character. Otherwise set the two values the same, which will + avoid duplicate testing (which takes significant time). This covers the vast + majority of cases. It will be suboptimal when the case flag changes in a regex + and the required character in fact is caseful. */ + + if ((re->options & PCRE_REQCHSET) != 0) + { + req_char = re->req_char; + req_char2 = ((re->options & (PCRE_CASELESS | PCRE_ICHANGED)) != 0)? + (re->tables + fcc_offset)[req_char] : req_char; + } + + /* Loop for handling unanchored repeated matching attempts; for anchored regexs + the loop runs just once. */ + + do + { + int rc; + register int *iptr = match_block.offset_vector; + register int *iend = iptr + resetcount; + + /* Reset the maximum number of extractions we might see. */ + + while (iptr < iend) *iptr++ = -1; + + /* Advance to a unique first char if possible */ + + if (first_char >= 0) + { + if ((ims & PCRE_CASELESS) != 0) + while (start_match < end_subject && + match_block.lcc[*start_match] != first_char) + start_match++; + else + while (start_match < end_subject && *start_match != first_char) + start_match++; + } + + /* Or to just after \n for a multiline match if possible */ + + else if (startline) + { + if (start_match > match_block.start_subject + start_offset) + { + while (start_match < end_subject && start_match[-1] != '\n') + start_match++; + } + } + + /* Or to a non-unique first char after study */ + + else if (start_bits != NULL) + { + while (start_match < end_subject) + { + register int c = *start_match; + if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; + } + } + + #ifdef DEBUG /* Sigh. Some compilers never learn. */ + printf(">>>> Match against: "); + pchars(start_match, end_subject - start_match, TRUE, &match_block); + printf("\n"); + #endif + + /* If req_char is set, we know that that character must appear in the subject + for the match to succeed. If the first character is set, req_char must be + later in the subject; otherwise the test starts at the match point. This + optimization can save a huge amount of backtracking in patterns with nested + unlimited repeats that aren't going to match. We don't know what the state of + case matching may be when this character is hit, so test for it in both its + cases if necessary. However, the different cased versions will not be set up + unless PCRE_CASELESS was given or the casing state changes within the regex. + Writing separate code makes it go faster, as does using an autoincrement and + backing off on a match. */ + + if (req_char >= 0) + { + register const uschar *p = start_match + ((first_char >= 0)? 1 : 0); + + /* We don't need to repeat the search if we haven't yet reached the + place we found it at last time. */ + + if (p > req_char_ptr) + { + /* Do a single test if no case difference is set up */ + + if (req_char == req_char2) + { + while (p < end_subject) + { + if (*p++ == req_char) { p--; break; } + } + } + + /* Otherwise test for either case */ + + else + { + while (p < end_subject) + { + register int pp = *p++; + if (pp == req_char || pp == req_char2) { p--; break; } + } + } + + /* If we can't find the required character, break the matching loop */ + + if (p >= end_subject) break; + + /* If we have found the required character, save the point where we + found it, so that we don't search again next time round the loop if + the start hasn't passed this character yet. */ + + req_char_ptr = p; + } + } + + /* When a match occurs, substrings will be set for all internal extractions; + we just need to set up the whole thing as substring 0 before returning. If + there were too many extractions, set the return code to zero. In the case + where we had to get some local store to hold offsets for backreferences, copy + those back references that we can. In this case there need not be overflow + if certain parts of the pattern were not used. */ + + match_block.start_match = start_match; + if (!match(start_match, re->code, 2, &match_block, ims, NULL, match_isgroup)) + continue; + + /* Copy the offset information from temporary store if necessary */ + + if (using_temporary_offsets) + { + if (offsetcount >= 4) + { + memcpy(offsets + 2, match_block.offset_vector + 2, + (offsetcount - 2) * sizeof(int)); + DPRINTF(("Copied offsets from temporary memory\n")); + } + if (match_block.end_offset_top > offsetcount) + match_block.offset_overflow = TRUE; + + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + + rc = match_block.offset_overflow? 0 : match_block.end_offset_top/2; + + if (match_block.offset_end < 2) rc = 0; else + { + offsets[0] = start_match - match_block.start_subject; + offsets[1] = match_block.end_match_ptr - match_block.start_subject; + } + + DPRINTF((">>>> returning %d\n", rc)); + return rc; + } + + /* This "while" is the end of the "do" above */ + + while (!anchored && + match_block.errorcode == PCRE_ERROR_NOMATCH && + start_match++ < end_subject); + + if (using_temporary_offsets) + { + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + + DPRINTF((">>>> returning %d\n", match_block.errorcode)); + + return match_block.errorcode; + } + + /* End of pcre.c */

ossp-pkg/pcre/pcre.h -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,114 ---- + /************************************************* + * Perl-Compatible Regular Expressions * + *************************************************/ + + /* Copyright (c) 1997-2000 University of Cambridge */ + + #ifndef _PCRE_H + #define _PCRE_H + + #define PCRE_MAJOR 3 + #define PCRE_MINOR 3 + #define PCRE_DATE 01-Aug-2000 + + #define __PCRE_STRING(a) #a + #define __PCRE_XSTRING(s) _STRING(s) + #define PCRE_VERSION __PCRE_XSTRING(PCRE_MAJOR) "." \ + __PCRE_XSTRING(PCRE_MINOR) " " \ + __PCRE_XSTRING(PCRE_DATE); + + /* Win32 uses DLL by default */ + + #ifdef _WIN32 + # ifdef STATIC + # define PCRE_DL_IMPORT + # else + # define PCRE_DL_IMPORT __declspec(dllimport) + # endif + #else + # define PCRE_DL_IMPORT + #endif + + /* Have to include stdlib.h in order to ensure that size_t is defined; + it is needed here for malloc. */ + + #include <stdlib.h> + + /* Allow for C++ users */ + + #ifdef __cplusplus + extern "C" { + #endif + + /* Options */ + + #define PCRE_CASELESS 0x0001 + #define PCRE_MULTILINE 0x0002 + #define PCRE_DOTALL 0x0004 + #define PCRE_EXTENDED 0x0008 + #define PCRE_ANCHORED 0x0010 + #define PCRE_DOLLAR_ENDONLY 0x0020 + #define PCRE_EXTRA 0x0040 + #define PCRE_NOTBOL 0x0080 + #define PCRE_NOTEOL 0x0100 + #define PCRE_UNGREEDY 0x0200 + #define PCRE_NOTEMPTY 0x0400 + #define PCRE_UTF8 0x0800 + + /* Exec-time and get-time error codes */ + + #define PCRE_ERROR_NOMATCH (-1) + #define PCRE_ERROR_NULL (-2) + #define PCRE_ERROR_BADOPTION (-3) + #define PCRE_ERROR_BADMAGIC (-4) + #define PCRE_ERROR_UNKNOWN_NODE (-5) + #define PCRE_ERROR_NOMEMORY (-6) + #define PCRE_ERROR_NOSUBSTRING (-7) + + /* Request types for pcre_fullinfo() */ + + #define PCRE_INFO_OPTIONS 0 + #define PCRE_INFO_SIZE 1 + #define PCRE_INFO_CAPTURECOUNT 2 + #define PCRE_INFO_BACKREFMAX 3 + #define PCRE_INFO_FIRSTCHAR 4 + #define PCRE_INFO_FIRSTTABLE 5 + #define PCRE_INFO_LASTLITERAL 6 + + /* Types */ + + typedef void pcre; + typedef void pcre_extra; + + /* Store get and free functions. These can be set to alternative malloc/free + functions if required. Some magic is required for Win32 DLL; it is null on + other OS. */ + + PCRE_DL_IMPORT extern void *(*pcre_malloc)(size_t); + PCRE_DL_IMPORT extern void (*pcre_free)(void *); + + #undef PCRE_DL_IMPORT + + /* Functions */ + + extern pcre *pcre_compile(const char *, int, const char **, int *, + const unsigned char *); + extern int pcre_copy_substring(const char *, int *, int, int, char *, int); + extern int pcre_exec(const pcre *, const pcre_extra *, const char *, + int, int, int, int *, int); + extern void pcre_free_substring(const char *); + extern void pcre_free_substring_list(const char **); + extern int pcre_get_substring(const char *, int *, int, int, const char **); + extern int pcre_get_substring_list(const char *, int *, int, const char ***); + extern int pcre_info(const pcre *, int *, int *); + extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *); + extern unsigned const char *pcre_maketables(void); + extern pcre_extra *pcre_study(const pcre *, int, const char **); + extern const char *pcre_version(void); + extern int pcre_match(const char *, const char *, ...); + + #ifdef __cplusplus + } /* extern "C" */ + #endif + + #endif /* End of pcre.h */

ossp-pkg/pcre/pcre_get.c -> 1.2

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,227 ---- + /************************************************* + * Perl-Compatible Regular Expressions * + *************************************************/ + + /* + This is a library of functions to support regular expressions whose syntax + and semantics are as close as possible to those of the Perl 5 language. See + the file Tech.Notes for some information on the internals. + + Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-2000 University of Cambridge + + ----------------------------------------------------------------------------- + Permission is granted to anyone to use this software for any purpose on any + computer system, and to redistribute it freely, subject to the following + restrictions: + + 1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + 2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + + 3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + + 4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + ----------------------------------------------------------------------------- + */ + + /* This module contains some convenience functions for extracting substrings + from the subject string after a regex match has succeeded. The original idea + for these functions came from Scott Wimer <scottw@cgibuilder.com>. */ + + + /* Include the internals header, which itself includes Standard C headers plus + the external pcre header. */ + + #include "pcre_internal.h" + + + + /************************************************* + * Copy captured string to given buffer * + *************************************************/ + + /* This function copies a single captured substring into a given buffer. + Note that we use memcpy() rather than strncpy() in case there are binary zeros + in the string. + + Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + buffer where to put the substring + size the size of the buffer + + Returns: if successful: + the length of the copied string, not including the zero + that is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) buffer too small + PCRE_ERROR_NOSUBSTRING (-7) no such captured substring + */ + + int + pcre_copy_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, char *buffer, int size) + { + int yield; + if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; + stringnumber *= 2; + yield = ovector[stringnumber+1] - ovector[stringnumber]; + if (size < yield + 1) return PCRE_ERROR_NOMEMORY; + memcpy(buffer, subject + ovector[stringnumber], yield); + buffer[yield] = 0; + return yield; + } + + + + /************************************************* + * Copy all captured strings to new store * + *************************************************/ + + /* This function gets one chunk of store and builds a list of pointers and all + of the captured substrings in it. A NULL pointer is put on the end of the list. + + Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + listptr set to point to the list of pointers + + Returns: if successful: 0 + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store + */ + + int + pcre_get_substring_list(const char *subject, int *ovector, int stringcount, + const char ***listptr) + { + int i; + int size = sizeof(char *); + int double_count = stringcount * 2; + char **stringlist; + char *p; + + for (i = 0; i < double_count; i += 2) + size += sizeof(char *) + ovector[i+1] - ovector[i] + 1; + + stringlist = (char **)(pcre_malloc)(size); + if (stringlist == NULL) return PCRE_ERROR_NOMEMORY; + + *listptr = (const char **)stringlist; + p = (char *)(stringlist + stringcount + 1); + + for (i = 0; i < double_count; i += 2) + { + int len = ovector[i+1] - ovector[i]; + memcpy(p, subject + ovector[i], len); + *stringlist++ = p; + p += len; + *p++ = 0; + } + + *stringlist = NULL; + return 0; + } + + + + /************************************************* + * Free store obtained by get_substring_list * + *************************************************/ + + /* This function exists for the benefit of people calling PCRE from non-C + programs that can call its functions, but not free() or (pcre_free)() directly. + + Argument: the result of a previous pcre_get_substring_list() + Returns: nothing + */ + + void + pcre_free_substring_list(const char **pointer) + { + (pcre_free)((void *)pointer); + } + + + + /************************************************* + * Copy captured string to new store * + *************************************************/ + + /* This function copies a single captured substring into a piece of new + store + + Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + stringptr where to put a pointer to the substring + + Returns: if successful: + the length of the string, not including the zero that + is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store + PCRE_ERROR_NOSUBSTRING (-7) substring not present + */ + + int + pcre_get_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, const char **stringptr) + { + int yield; + char *substring; + if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; + stringnumber *= 2; + yield = ovector[stringnumber+1] - ovector[stringnumber]; + substring = (char *)(pcre_malloc)(yield + 1); + if (substring == NULL) return PCRE_ERROR_NOMEMORY; + memcpy(substring, subject + ovector[stringnumber], yield); + substring[yield] = 0; + *stringptr = substring; + return yield; + } + + + + /************************************************* + * Free store obtained by get_substring * + *************************************************/ + + /* This function exists for the benefit of people calling PCRE from non-C + programs that can call its functions, but not free() or (pcre_free)() directly. + + Argument: the result of a previous pcre_get_substring() + Returns: nothing + */ + + void + pcre_free_substring(const char *pointer) + { + (pcre_free)((void *)pointer); + } + + /* End of get.c */

ossp-pkg/pcre/pcre_internal.h -> 1.5

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,380 ---- + /************************************************* + * Perl-Compatible Regular Expressions * + *************************************************/ + + + /* This is a library of functions to support regular expressions whose syntax + and semantics are as close as possible to those of the Perl 5 language. See + the file Tech.Notes for some information on the internals. + + Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-2000 University of Cambridge + + ----------------------------------------------------------------------------- + Permission is granted to anyone to use this software for any purpose on any + computer system, and to redistribute it freely, subject to the following + restrictions: + + 1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + 2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + + 3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + + 4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + ----------------------------------------------------------------------------- + */ + + /* This header contains definitions that are shared between the different + modules, but which are not relevant to the outside. */ + + /* Get the definitions provided by running "configure" */ + + #include "config.h" + + /* To cope with SunOS4 and other systems that lack memmove() but have bcopy(), + define a macro for memmove() if HAVE_MEMMOVE is false, provided that HAVE_BCOPY + is set. Otherwise, include an emulating function for those systems that have + neither (there some non-Unix environments where this is the case). This assumes + that all calls to memmove are moving strings upwards in store, which is the + case in PCRE. */ + + #if ! HAVE_MEMMOVE + #undef memmove /* some systems may have a macro */ + #if HAVE_BCOPY + #define memmove(a, b, c) bcopy(b, a, c) + #else + void * + pcre_memmove(unsigned char *dest, const unsigned char *src, size_t n) + { + int i; + dest += n; + src += n; + for (i = 0; i < n; ++i) *(--dest) = *(--src); + } + #define memmove(a, b, c) pcre_memmove(a, b, c) + #endif + #endif + + /* Standard C headers plus the external interface definition */ + + #include <ctype.h> + #include <limits.h> + #include <stddef.h> + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include "pcre.h" + + /* In case there is no definition of offsetof() provided - though any proper + Standard C system should have one. */ + + #ifndef offsetof + #define offsetof(p_type,field) ((size_t)&(((p_type *)0)->field)) + #endif + + /* These are the public options that can change during matching. */ + + #define PCRE_IMS (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL) + + /* Private options flags start at the most significant end of the four bytes, + but skip the top bit so we can use ints for convenience without getting tangled + with negative values. The public options defined in pcre.h start at the least + significant end. Make sure they don't overlap, though now that we have expanded + to four bytes there is plenty of space. */ + + #define PCRE_FIRSTSET 0x40000000 /* first_char is set */ + #define PCRE_REQCHSET 0x20000000 /* req_char is set */ + #define PCRE_STARTLINE 0x10000000 /* start after \n for multiline */ + #define PCRE_INGROUP 0x08000000 /* compiling inside a group */ + #define PCRE_ICHANGED 0x04000000 /* i option changes within regex */ + + /* Options for the "extra" block produced by pcre_study(). */ + + #define PCRE_STUDY_MAPPED 0x01 /* a map of starting chars exists */ + + /* Masks for identifying the public options which are permitted at compile + time, run time or study time, respectively. */ + + #define PUBLIC_OPTIONS \ + (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \ + PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8) + + #define PUBLIC_EXEC_OPTIONS \ + (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY) + + #define PUBLIC_STUDY_OPTIONS 0 /* None defined */ + + /* Magic number to provide a small check against being handed junk. */ + + #define MAGIC_NUMBER 0x50435245UL /* 'PCRE' */ + + /* Miscellaneous definitions */ + + typedef int BOOL; + + #define FALSE 0 + #define TRUE 1 + + /* These are escaped items that aren't just an encoding of a particular data + value such as \n. They must have non-zero values, as check_escape() returns + their negation. Also, they must appear in the same order as in the opcode + definitions below, up to ESC_z. The final one must be ESC_REF as subsequent + values are used for \1, \2, \3, etc. There is a test in the code for an escape + greater than ESC_b and less than ESC_X to detect the types that may be + repeated. If any new escapes are put in-between that don't consume a character, + that code will have to change. */ + + enum { ESC_A = 1, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W, ESC_w, + ESC_Z, ESC_z, ESC_REF }; + + /* Opcode table: OP_BRA must be last, as all values >= it are used for brackets + that extract substrings. Starting from 1 (i.e. after OP_END), the values up to + OP_EOD must correspond in order to the list of escapes immediately above. */ + + enum { + OP_END, /* End of pattern */ + + /* Values corresponding to backslashed metacharacters */ + + OP_SOD, /* Start of data: \A */ + OP_NOT_WORD_BOUNDARY, /* \B */ + OP_WORD_BOUNDARY, /* \b */ + OP_NOT_DIGIT, /* \D */ + OP_DIGIT, /* \d */ + OP_NOT_WHITESPACE, /* \S */ + OP_WHITESPACE, /* \s */ + OP_NOT_WORDCHAR, /* \W */ + OP_WORDCHAR, /* \w */ + OP_EODN, /* End of data or \n at end of data: \Z. */ + OP_EOD, /* End of data: \z */ + + OP_OPT, /* Set runtime options */ + OP_CIRC, /* Start of line - varies with multiline switch */ + OP_DOLL, /* End of line - varies with multiline switch */ + OP_ANY, /* Match any character */ + OP_CHARS, /* Match string of characters */ + OP_NOT, /* Match anything but the following char */ + + OP_STAR, /* The maximizing and minimizing versions of */ + OP_MINSTAR, /* all these opcodes must come in pairs, with */ + OP_PLUS, /* the minimizing one second. */ + OP_MINPLUS, /* This first set applies to single characters */ + OP_QUERY, + OP_MINQUERY, + OP_UPTO, /* From 0 to n matches */ + OP_MINUPTO, + OP_EXACT, /* Exactly n matches */ + + OP_NOTSTAR, /* The maximizing and minimizing versions of */ + OP_NOTMINSTAR, /* all these opcodes must come in pairs, with */ + OP_NOTPLUS, /* the minimizing one second. */ + OP_NOTMINPLUS, /* This first set applies to "not" single characters */ + OP_NOTQUERY, + OP_NOTMINQUERY, + OP_NOTUPTO, /* From 0 to n matches */ + OP_NOTMINUPTO, + OP_NOTEXACT, /* Exactly n matches */ + + OP_TYPESTAR, /* The maximizing and minimizing versions of */ + OP_TYPEMINSTAR, /* all these opcodes must come in pairs, with */ + OP_TYPEPLUS, /* the minimizing one second. These codes must */ + OP_TYPEMINPLUS, /* be in exactly the same order as those above. */ + OP_TYPEQUERY, /* This set applies to character types such as \d */ + OP_TYPEMINQUERY, + OP_TYPEUPTO, /* From 0 to n matches */ + OP_TYPEMINUPTO, + OP_TYPEEXACT, /* Exactly n matches */ + + OP_CRSTAR, /* The maximizing and minimizing versions of */ + OP_CRMINSTAR, /* all these opcodes must come in pairs, with */ + OP_CRPLUS, /* the minimizing one second. These codes must */ + OP_CRMINPLUS, /* be in exactly the same order as those above. */ + OP_CRQUERY, /* These are for character classes and back refs */ + OP_CRMINQUERY, + OP_CRRANGE, /* These are different to the three seta above. */ + OP_CRMINRANGE, + + OP_CLASS, /* Match a character class */ + OP_REF, /* Match a back reference */ + OP_RECURSE, /* Match this pattern recursively */ + + OP_ALT, /* Start of alternation */ + OP_KET, /* End of group that doesn't have an unbounded repeat */ + OP_KETRMAX, /* These two must remain together and in this */ + OP_KETRMIN, /* order. They are for groups the repeat for ever. */ + + /* The assertions must come before ONCE and COND */ + + OP_ASSERT, /* Positive lookahead */ + OP_ASSERT_NOT, /* Negative lookahead */ + OP_ASSERTBACK, /* Positive lookbehind */ + OP_ASSERTBACK_NOT, /* Negative lookbehind */ + OP_REVERSE, /* Move pointer back - used in lookbehind assertions */ + + /* ONCE and COND must come after the assertions, with ONCE first, as there's + a test for >= ONCE for a subpattern that isn't an assertion. */ + + OP_ONCE, /* Once matched, don't back up into the subpattern */ + OP_COND, /* Conditional group */ + OP_CREF, /* Used to hold an extraction string number */ + + OP_BRAZERO, /* These two must remain together and in this */ + OP_BRAMINZERO, /* order. */ + + OP_BRA /* This and greater values are used for brackets that + extract substrings. */ + }; + + /* The highest extraction number. This is limited by the number of opcodes + left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower. */ + + #define EXTRACT_MAX 99 + + /* The texts of compile-time error messages are defined as macros here so that + they can be accessed by the POSIX wrapper and converted into error codes. Yes, + I could have used error codes in the first place, but didn't feel like changing + just to accommodate the POSIX wrapper. */ + + #define ERR1 "\\ at end of pattern" + #define ERR2 "\\c at end of pattern" + #define ERR3 "unrecognized character follows \\" + #define ERR4 "numbers out of order in {} quantifier" + #define ERR5 "number too big in {} quantifier" + #define ERR6 "missing terminating ] for character class" + #define ERR7 "invalid escape sequence in character class" + #define ERR8 "range out of order in character class" + #define ERR9 "nothing to repeat" + #define ERR10 "operand of unlimited repeat could match the empty string" + #define ERR11 "internal error: unexpected repeat" + #define ERR12 "unrecognized character after (?" + #define ERR13 "too many capturing parenthesized sub-patterns" + #define ERR14 "missing )" + #define ERR15 "back reference to non-existent subpattern" + #define ERR16 "erroffset passed as NULL" + #define ERR17 "unknown option bit(s) set" + #define ERR18 "missing ) after comment" + #define ERR19 "too many sets of parentheses" + #define ERR20 "regular expression too large" + #define ERR21 "failed to get memory" + #define ERR22 "unmatched parentheses" + #define ERR23 "internal error: code overflow" + #define ERR24 "unrecognized character after (?<" + #define ERR25 "lookbehind assertion is not fixed length" + #define ERR26 "malformed number after (?(" + #define ERR27 "conditional group contains more than two branches" + #define ERR28 "assertion expected after (?(" + #define ERR29 "(?p must be followed by )" + #define ERR30 "unknown POSIX class name" + #define ERR31 "POSIX collating elements are not supported" + #define ERR32 "this version of PCRE is not compiled with PCRE_UTF8 support" + #define ERR33 "characters with values > 255 are not yet supported in classes" + #define ERR34 "character value in \\x{...} sequence is too large" + + /* All character handling must be done as unsigned characters. Otherwise there + are problems with top-bit-set characters and functions such as isspace(). + However, we leave the interface to the outside world as char *, because that + should make things easier for callers. We define a short type for unsigned char + to save lots of typing. I tried "uchar", but it causes problems on Digital + Unix, where it is defined in sys/types, so use "uschar" instead. */ + + typedef unsigned char uschar; + + /* The real format of the start of the pcre block; the actual code vector + runs on as long as necessary after the end. */ + + typedef struct real_pcre { + unsigned long int magic_number; + size_t size; + const unsigned char *tables; + unsigned long int options; + uschar top_bracket; + uschar top_backref; + uschar first_char; + uschar req_char; + uschar code[1]; + } real_pcre; + + /* The real format of the extra block returned by pcre_study(). */ + + typedef struct real_pcre_extra { + uschar options; + uschar start_bits[32]; + } real_pcre_extra; + + + /* Structure for passing "static" information around between the functions + doing the compiling, so that they are thread-safe. */ + + typedef struct compile_data { + const uschar *lcc; /* Points to lower casing table */ + const uschar *fcc; /* Points to case-flipping table */ + const uschar *cbits; /* Points to character type table */ + const uschar *ctypes; /* Points to table of type maps */ + } compile_data; + + /* Structure for passing "static" information around between the functions + doing the matching, so that they are thread-safe. */ + + typedef struct match_data { + int errorcode; /* As it says */ + int *offset_vector; /* Offset vector */ + int offset_end; /* One past the end */ + int offset_max; /* The maximum usable for return data */ + const uschar *lcc; /* Points to lower casing table */ + const uschar *ctypes; /* Points to table of type maps */ + BOOL offset_overflow; /* Set if too many extractions */ + BOOL notbol; /* NOTBOL flag */ + BOOL noteol; /* NOTEOL flag */ + BOOL utf8; /* UTF8 flag */ + BOOL endonly; /* Dollar not before final \n */ + BOOL notempty; /* Empty string match not wanted */ + const uschar *start_pattern; /* For use when recursing */ + const uschar *start_subject; /* Start of the subject string */ + const uschar *end_subject; /* End of the subject string */ + const uschar *start_match; /* Start of this match attempt */ + const uschar *end_match_ptr; /* Subject position at end match */ + int end_offset_top; /* Highwater mark at end of match */ + } match_data; + + /* Bit definitions for entries in the pcre_ctypes table. */ + + #define ctype_space 0x01 + #define ctype_letter 0x02 + #define ctype_digit 0x04 + #define ctype_xdigit 0x08 + #define ctype_word 0x10 /* alphameric or '_' */ + #define ctype_meta 0x80 /* regexp meta char or zero (end pattern) */ + + /* Offsets for the bitmap tables in pcre_cbits. Each table contains a set + of bits for a class map. Some classes are built by combining these tables. */ + + #define cbit_space 0 /* [:space:] or \s */ + #define cbit_xdigit 32 /* [:xdigit:] */ + #define cbit_digit 64 /* [:digit:] or \d */ + #define cbit_upper 96 /* [:upper:] */ + #define cbit_lower 128 /* [:lower:] */ + #define cbit_word 160 /* [:word:] or \w */ + #define cbit_graph 192 /* [:graph:] */ + #define cbit_print 224 /* [:print:] */ + #define cbit_punct 256 /* [:punct:] */ + #define cbit_cntrl 288 /* [:cntrl:] */ + #define cbit_length 320 /* Length of the cbits table */ + + /* Offsets of the various tables from the base tables pointer, and + total length. */ + + #define lcc_offset 0 + #define fcc_offset 256 + #define cbits_offset 512 + #define ctypes_offset (cbits_offset + cbit_length) + #define tables_length (ctypes_offset + 256) + + /* End of internal.h */

ossp-pkg/pcre/pcre_test.c -> 1.5

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,1225 ---- + /************************************************* + * PCRE testing program * + *************************************************/ + + #include <ctype.h> + #include <stdio.h> + #include <string.h> + #include <stdlib.h> + #include <time.h> + #include <locale.h> + + /* Use the internal info for displaying the results of pcre_study(). */ + + #include "pcre_internal.h" + + /* It is possible to compile this test program without including support for + testing the POSIX interface, though this is not available via the standard + Makefile. */ + + #if !defined NOPOSIX + #include "pcreposix.h" + #endif + + #ifndef CLOCKS_PER_SEC + #ifdef CLK_TCK + #define CLOCKS_PER_SEC CLK_TCK + #else + #define CLOCKS_PER_SEC 100 + #endif + #endif + + #define LOOPREPEAT 20000 + + + static FILE *outfile; + static int log_store = 0; + static size_t gotten_store; + + + + static int utf8_table1[] = { + 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff}; + + static int utf8_table2[] = { + 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}; + + static int utf8_table3[] = { + 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01}; + + + /************************************************* + * Convert character value to UTF-8 * + *************************************************/ + + /* This function takes an integer value in the range 0 - 0x7fffffff + and encodes it as a UTF-8 character in 0 to 6 bytes. + + Arguments: + cvalue the character value + buffer pointer to buffer for result - at least 6 bytes long + + Returns: number of characters placed in the buffer + -1 if input character is negative + 0 if input character is positive but too big (only when + int is longer than 32 bits) + */ + + static int + ord2utf8(int cvalue, unsigned char *buffer) + { + register int i, j; + for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++) + if (cvalue <= utf8_table1[i]) break; + if (i >= sizeof(utf8_table1)/sizeof(int)) return 0; + if (cvalue < 0) return -1; + *buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]); + cvalue >>= 6 - i; + for (j = 0; j < i; j++) + { + *buffer++ = 0x80 | (cvalue & 0x3f); + cvalue >>= 6; + } + return i + 1; + } + + + /************************************************* + * Convert UTF-8 string to value * + *************************************************/ + + /* This function takes one or more bytes that represents a UTF-8 character, + and returns the value of the character. + + Argument: + buffer a pointer to the byte vector + vptr a pointer to an int to receive the value + + Returns: > 0 => the number of bytes consumed + -6 to 0 => malformed UTF-8 character at offset = (-return) + */ + + static int + utf82ord(unsigned char *buffer, int *vptr) + { + int c = *buffer++; + int d = c; + int i, j, s; + + for (i = -1; i < 6; i++) /* i is number of additional bytes */ + { + if ((d & 0x80) == 0) break; + d <<= 1; + } + + if (i == -1) { *vptr = c; return 1; } /* ascii character */ + if (i == 0 || i == 6) return 0; /* invalid UTF-8 */ + + /* i now has a value in the range 1-5 */ + + d = c & utf8_table3[i]; + s = 6 - i; + + for (j = 0; j < i; j++) + { + c = *buffer++; + if ((c & 0xc0) != 0x80) return -(j+1); + d |= (c & 0x3f) << s; + s += 6; + } + + /* Check that encoding was the correct unique one */ + + for (j = 0; j < sizeof(utf8_table1)/sizeof(int); j++) + if (d <= utf8_table1[j]) break; + if (j != i) return -(i+1); + + /* Valid value */ + + *vptr = d; + return i+1; + } + + + + + + + /* Debugging function to print the internal form of the regex. This is the same + code as contained in pcre.c under the DEBUG macro. */ + + static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", "Recurse", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" + }; + + + static void print_internals(pcre *re) + { + unsigned char *code = ((real_pcre *)re)->code; + + fprintf(outfile, "------------------------------------------------------------------\n"); + + for(;;) + { + int c; + int charlength; + + fprintf(outfile, "%3d ", (int)(code - ((real_pcre *)re)->code)); + + if (*code >= OP_BRA) + { + fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_END: + fprintf(outfile, " %s\n", OP_names[*code]); + fprintf(outfile, "------------------------------------------------------------------\n"); + return; + + case OP_OPT: + fprintf(outfile, " %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + fprintf(outfile, "%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + fprintf(outfile, " %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + fprintf(outfile, "%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_REVERSE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + fprintf(outfile, " %s", OP_names[code[1]]); + else if (isprint(c = code[1])) fprintf(outfile, " %c", c); + else fprintf(outfile, " \\x%02x", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " %c{", c); + else fprintf(outfile, " \\x%02x{", c); + if (*code != OP_EXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + fprintf(outfile, " %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) fprintf(outfile, "0,"); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " [^%c]{", c); + else fprintf(outfile, " [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_REF: + fprintf(outfile, " \\%d", *(++code)); + code++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + fprintf(outfile, " ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') fprintf(outfile, "\\"); + if (isprint(i)) fprintf(outfile, "%c", i); else fprintf(outfile, "\\x%02x", i); + if (--j > i) + { + fprintf(outfile, "-"); + if (j == '-' || j == ']') fprintf(outfile, "\\"); + if (isprint(j)) fprintf(outfile, "%c", j); else fprintf(outfile, "\\x%02x", j); + } + i = j; + } + } + fprintf(outfile, "]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + fprintf(outfile, "%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) fprintf(outfile, "{%d,}", min); + else fprintf(outfile, "{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) fprintf(outfile, "?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + fprintf(outfile, " %s", OP_names[*code]); + break; + } + + code++; + fprintf(outfile, "\n"); + } + } + + + + /* Character string printing function. A "normal" and a UTF-8 version. */ + + static void pchars(unsigned char *p, int length, int utf8) + { + int c; + while (length-- > 0) + { + if (utf8) + { + int rc = utf82ord(p, &c); + if (rc > 0) + { + length -= rc - 1; + p += rc; + if (c < 256 && isprint(c)) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x{%02x}", c); + continue; + } + } + + /* Not UTF-8, or malformed UTF-8 */ + + if (isprint(c = *(p++))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); + } + } + + + + /* Alternative malloc function, to test functionality and show the size of the + compiled re. */ + + static void *new_malloc(size_t size) + { + gotten_store = size; + if (log_store) + fprintf(outfile, "Memory allocation (code space): %d\n", + (int)((int)size - offsetof(real_pcre, code[0]))); + return malloc(size); + } + + + + + /* Get one piece of information from the pcre_fullinfo() function */ + + static void new_info(pcre *re, pcre_extra *study, int option, void *ptr) + { + int rc; + if ((rc = pcre_fullinfo(re, study, option, ptr)) < 0) + fprintf(outfile, "Error %d from pcre_fullinfo(%d)\n", rc, option); + } + + + + + /* Read lines from named file or stdin and write to named file or stdout; lines + consist of a regular expression, in delimiters and optionally followed by + options, followed by a set of test data, terminated by an empty line. */ + + int main(int argc, char **argv) + { + FILE *infile = stdin; + int options = 0; + int study_options = 0; + int op = 1; + int timeit = 0; + int showinfo = 0; + int showstore = 0; + int posix = 0; + int debug = 0; + int done = 0; + unsigned char buffer[30000]; + unsigned char dbuffer[1024]; + + /* Static so that new_malloc can use it. */ + + outfile = stdout; + + /* Scan options */ + + while (argc > 1 && argv[op][0] == '-') + { + if (strcmp(argv[op], "-s") == 0 || strcmp(argv[op], "-m") == 0) + showstore = 1; + else if (strcmp(argv[op], "-t") == 0) timeit = 1; + else if (strcmp(argv[op], "-i") == 0) showinfo = 1; + else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1; + else if (strcmp(argv[op], "-p") == 0) posix = 1; + else + { + printf("*** Unknown option %s\n", argv[op]); + printf("Usage: pcretest [-d] [-i] [-p] [-s] [-t] [<input> [<output>]]\n"); + printf(" -d debug: show compiled code; implies -i\n" + " -i show information about compiled pattern\n" + " -p use POSIX interface\n" + " -s output store information\n" + " -t time compilation and execution\n"); + return 1; + } + op++; + argc--; + } + + /* Sort out the input and output files */ + + if (argc > 1) + { + infile = fopen(argv[op], "r"); + if (infile == NULL) + { + printf("** Failed to open %s\n", argv[op]); + return 1; + } + } + + if (argc > 2) + { + outfile = fopen(argv[op+1], "w"); + if (outfile == NULL) + { + printf("** Failed to open %s\n", argv[op+1]); + return 1; + } + } + + /* Set alternative malloc function */ + + pcre_malloc = new_malloc; + + /* Heading line, then prompt for first regex if stdin */ + + fprintf(outfile, "PCRE version %s\n\n", pcre_version()); + + /* Main loop */ + + while (!done) + { + pcre *re = NULL; + pcre_extra *extra = NULL; + + #if !defined NOPOSIX /* There are still compilers that require no indent */ + regex_t preg; + #endif + + const char *error; + unsigned char *p, *pp, *ppp; + unsigned const char *tables = NULL; + int do_study = 0; + int do_debug = debug; + int do_G = 0; + int do_g = 0; + int do_showinfo = showinfo; + int do_showrest = 0; + int do_posix = 0; + int utf8 = 0; + int erroroffset, len, delimiter; + + if (infile == stdin) printf(" re> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) break; + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + p = buffer; + while (isspace(*p)) p++; + if (*p == 0) continue; + + /* Get the delimiter and seek the end of the pattern; if is isn't + complete, read more. */ + + delimiter = *p++; + + if (isalnum(delimiter) || delimiter == '\\') + { + fprintf(outfile, "** Delimiter must not be alphameric or \\\n"); + goto SKIP_DATA; + } + + pp = p; + + for(;;) + { + while (*pp != 0) + { + if (*pp == '\\' && pp[1] != 0) pp++; + else if (*pp == delimiter) break; + pp++; + } + if (*pp != 0) break; + + len = sizeof(buffer) - (pp - buffer); + if (len < 256) + { + fprintf(outfile, "** Expression too long - missing delimiter?\n"); + goto SKIP_DATA; + } + + if (infile == stdin) printf(" > "); + if (fgets((char *)pp, len, infile) == NULL) + { + fprintf(outfile, "** Unexpected EOF\n"); + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)pp); + } + + /* If the first character after the delimiter is backslash, make + the pattern end with backslash. This is purely to provide a way + of testing for the error message when a pattern ends with backslash. */ + + if (pp[1] == '\\') *pp++ = '\\'; + + /* Terminate the pattern at the delimiter */ + + *pp++ = 0; + + /* Look for options after final delimiter */ + + options = 0; + study_options = 0; + log_store = showstore; /* default from command line */ + + while (*pp != 0) + { + switch (*pp++) + { + case 'g': do_g = 1; break; + case 'i': options |= PCRE_CASELESS; break; + case 'm': options |= PCRE_MULTILINE; break; + case 's': options |= PCRE_DOTALL; break; + case 'x': options |= PCRE_EXTENDED; break; + + case '+': do_showrest = 1; break; + case 'A': options |= PCRE_ANCHORED; break; + case 'D': do_debug = do_showinfo = 1; break; + case 'E': options |= PCRE_DOLLAR_ENDONLY; break; + case 'G': do_G = 1; break; + case 'I': do_showinfo = 1; break; + case 'M': log_store = 1; break; + + #if !defined NOPOSIX + case 'P': do_posix = 1; break; + #endif + + case 'S': do_study = 1; break; + case 'U': options |= PCRE_UNGREEDY; break; + case 'X': options |= PCRE_EXTRA; break; + case '8': options |= PCRE_UTF8; utf8 = 1; break; + + case 'L': + ppp = pp; + while (*ppp != '\n' && *ppp != ' ') ppp++; + *ppp = 0; + if (setlocale(LC_CTYPE, (const char *)pp) == NULL) + { + fprintf(outfile, "** Failed to set locale \"%s\"\n", pp); + goto SKIP_DATA; + } + tables = pcre_maketables(); + pp = ppp; + break; + + case '\n': case ' ': break; + default: + fprintf(outfile, "** Unknown option '%c'\n", pp[-1]); + goto SKIP_DATA; + } + } + + /* Handle compiling via the POSIX interface, which doesn't support the + timing, showing, or debugging options, nor the ability to pass over + local character tables. */ + + #if !defined NOPOSIX + if (posix || do_posix) + { + int rc; + int cflags = 0; + if ((options & PCRE_CASELESS) != 0) cflags |= REG_ICASE; + if ((options & PCRE_MULTILINE) != 0) cflags |= REG_NEWLINE; + rc = regcomp(&preg, (char *)p, cflags); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "Failed: POSIX code %d: %s\n", rc, buffer); + goto SKIP_DATA; + } + } + + /* Handle compiling via the native interface */ + + else + #endif /* !defined NOPOSIX */ + + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + { + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + if (re != NULL) free(re); + } + time_taken = clock() - start_time; + fprintf(outfile, "Compile time %.3f milliseconds\n", + ((double)time_taken * 1000.0) / + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (re == NULL) + { + fprintf(outfile, "Failed: %s at offset %d\n", error, erroroffset); + SKIP_DATA: + if (infile != stdin) + { + for (;;) + { + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + if (len == 0) break; + } + fprintf(outfile, "\n"); + } + goto CONTINUE; + } + + /* Compilation succeeded; print data if required. There are now two + info-returning functions. The old one has a limited interface and + returns only limited data. Check that it agrees with the newer one. */ + + if (do_showinfo) + { + int old_first_char, old_options, old_count; + int count, backrefmax, first_char, need_char; + size_t size; + + if (do_debug) print_internals(re); + + new_info(re, NULL, PCRE_INFO_OPTIONS, &options); + new_info(re, NULL, PCRE_INFO_SIZE, &size); + new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count); + new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax); + new_info(re, NULL, PCRE_INFO_FIRSTCHAR, &first_char); + new_info(re, NULL, PCRE_INFO_LASTLITERAL, &need_char); + + old_count = pcre_info(re, &old_options, &old_first_char); + if (count < 0) fprintf(outfile, + "Error %d from pcre_info()\n", count); + else + { + if (old_count != count) fprintf(outfile, + "Count disagreement: pcre_fullinfo=%d pcre_info=%d\n", count, + old_count); + + if (old_first_char != first_char) fprintf(outfile, + "First char disagreement: pcre_fullinfo=%d pcre_info=%d\n", + first_char, old_first_char); + + if (old_options != options) fprintf(outfile, + "Options disagreement: pcre_fullinfo=%d pcre_info=%d\n", options, + old_options); + } + + if (size != gotten_store) fprintf(outfile, + "Size disagreement: pcre_fullinfo=%d call to malloc for %d\n", + size, gotten_store); + + fprintf(outfile, "Capturing subpattern count = %d\n", count); + if (backrefmax > 0) + fprintf(outfile, "Max back reference = %d\n", backrefmax); + if (options == 0) fprintf(outfile, "No options\n"); + else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n", + ((options & PCRE_ANCHORED) != 0)? " anchored" : "", + ((options & PCRE_CASELESS) != 0)? " caseless" : "", + ((options & PCRE_EXTENDED) != 0)? " extended" : "", + ((options & PCRE_MULTILINE) != 0)? " multiline" : "", + ((options & PCRE_DOTALL) != 0)? " dotall" : "", + ((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "", + ((options & PCRE_EXTRA) != 0)? " extra" : "", + ((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "", + ((options & PCRE_UTF8) != 0)? " utf8" : ""); + + if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0) + fprintf(outfile, "Case state changes\n"); + + if (first_char == -1) + { + fprintf(outfile, "First char at start or follows \\n\n"); + } + else if (first_char < 0) + { + fprintf(outfile, "No first char\n"); + } + else + { + if (isprint(first_char)) + fprintf(outfile, "First char = \'%c\'\n", first_char); + else + fprintf(outfile, "First char = %d\n", first_char); + } + + if (need_char < 0) + { + fprintf(outfile, "No need char\n"); + } + else + { + if (isprint(need_char)) + fprintf(outfile, "Need char = \'%c\'\n", need_char); + else + fprintf(outfile, "Need char = %d\n", need_char); + } + } + + /* If /S was present, study the regexp to generate additional info to + help with the matching. */ + + if (do_study) + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + extra = pcre_study(re, study_options, &error); + time_taken = clock() - start_time; + if (extra != NULL) free(extra); + fprintf(outfile, " Study time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + extra = pcre_study(re, study_options, &error); + if (error != NULL) + fprintf(outfile, "Failed to study: %s\n", error); + else if (extra == NULL) + fprintf(outfile, "Study returned NULL\n"); + + else if (do_showinfo) + { + uschar *start_bits = NULL; + new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits); + if (start_bits == NULL) + fprintf(outfile, "No starting character set\n"); + else + { + int i; + int c = 24; + fprintf(outfile, "Starting character set: "); + for (i = 0; i < 256; i++) + { + if ((start_bits[i/8] & (1<<(i%8))) != 0) + { + if (c > 75) + { + fprintf(outfile, "\n "); + c = 2; + } + if (isprint(i) && i != ' ') + { + fprintf(outfile, "%c ", i); + c += 2; + } + else + { + fprintf(outfile, "\\x%02x ", i); + c += 5; + } + } + } + fprintf(outfile, "\n"); + } + } + } + } + + /* Read data lines and test them */ + + for (;;) + { + unsigned char *q; + unsigned char *bptr = dbuffer; + int count, c; + int copystrings = 0; + int getstrings = 0; + int getlist = 0; + int gmatched = 0; + int start_offset = 0; + int g_notempty = 0; + int offsets[45]; + int size_offsets = sizeof(offsets)/sizeof(int); + + options = 0; + + if (infile == stdin) printf("data> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + buffer[len] = 0; + if (len == 0) break; + + p = buffer; + while (isspace(*p)) p++; + + q = dbuffer; + while ((c = *p++) != 0) + { + int i = 0; + int n = 0; + if (c == '\\') switch ((c = *p++)) + { + case 'a': c = 7; break; + case 'b': c = '\b'; break; + case 'e': c = 27; break; + case 'f': c = '\f'; break; + case 'n': c = '\n'; break; + case 'r': c = '\r'; break; + case 't': c = '\t'; break; + case 'v': c = '\v'; break; + + case '0': case '1': case '2': case '3': + case '4': case '5': case '6': case '7': + c -= '0'; + while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9') + c = c * 8 + *p++ - '0'; + break; + + case 'x': + + /* Handle \x{..} specially - new Perl thing for utf8 */ + + if (*p == '{') + { + unsigned char *pt = p; + c = 0; + while (isxdigit(*(++pt))) + c = c * 16 + tolower(*pt) - ((isdigit(*pt))? '0' : 'W'); + if (*pt == '}') + { + unsigned char utf8_buffer[8]; + int ii, utn; + utn = ord2utf8(c, utf8_buffer); + for (ii = 0; ii < utn - 1; ii++) *q++ = utf8_buffer[ii]; + c = utf8_buffer[ii]; /* Last byte */ + p = pt + 1; + break; + } + /* Not correct form; fall through */ + } + + /* Ordinary \x */ + + c = 0; + while (i++ < 2 && isxdigit(*p)) + { + c = c * 16 + tolower(*p) - ((isdigit(*p))? '0' : 'W'); + p++; + } + break; + + case 0: /* Allows for an empty line */ + p--; + continue; + + case 'A': /* Option setting */ + options |= PCRE_ANCHORED; + continue; + + case 'B': + options |= PCRE_NOTBOL; + continue; + + case 'C': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + copystrings |= 1 << n; + continue; + + case 'G': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + getstrings |= 1 << n; + continue; + + case 'L': + getlist = 1; + continue; + + case 'N': + options |= PCRE_NOTEMPTY; + continue; + + case 'O': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + if (n <= (int)(sizeof(offsets)/sizeof(int))) size_offsets = n; + continue; + + case 'Z': + options |= PCRE_NOTEOL; + continue; + } + *q++ = c; + } + *q = 0; + len = q - dbuffer; + + /* Handle matching via the POSIX interface, which does not + support timing. */ + + #if !defined NOPOSIX + if (posix || do_posix) + { + int rc; + int eflags = 0; + regmatch_t pmatch[sizeof(offsets)/sizeof(int)]; + if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL; + if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL; + + rc = regexec(&preg, (const char *)bptr, size_offsets, pmatch, eflags); + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "No match: POSIX code %d: %s\n", rc, buffer); + } + else + { + size_t i; + for (i = 0; i < size_offsets; i++) + { + if (pmatch[i].rm_so >= 0) + { + fprintf(outfile, "%2d: ", (int)i); + pchars(dbuffer + pmatch[i].rm_so, + pmatch[i].rm_eo - pmatch[i].rm_so, utf8); + fprintf(outfile, "\n"); + if (i == 0 && do_showrest) + { + fprintf(outfile, " 0+ "); + pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo, utf8); + fprintf(outfile, "\n"); + } + } + } + } + } + + /* Handle matching via the native interface - repeats for /g and /G */ + + else + #endif /* !defined NOPOSIX */ + + for (;; gmatched++) /* Loop for /g or /G */ + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + count = pcre_exec(re, extra, (char *)bptr, len, + start_offset, options | g_notempty, offsets, size_offsets); + time_taken = clock() - start_time; + fprintf(outfile, "Execute time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + count = pcre_exec(re, extra, (char *)bptr, len, + start_offset, options | g_notempty, offsets, size_offsets); + + if (count == 0) + { + fprintf(outfile, "Matched, but too many substrings\n"); + count = size_offsets/3; + } + + /* Matched */ + + if (count >= 0) + { + int i; + for (i = 0; i < count * 2; i += 2) + { + if (offsets[i] < 0) + fprintf(outfile, "%2d: <unset>\n", i/2); + else + { + fprintf(outfile, "%2d: ", i/2); + pchars(bptr + offsets[i], offsets[i+1] - offsets[i], utf8); + fprintf(outfile, "\n"); + if (i == 0) + { + if (do_showrest) + { + fprintf(outfile, " 0+ "); + pchars(bptr + offsets[i+1], len - offsets[i+1], utf8); + fprintf(outfile, "\n"); + } + } + } + } + + for (i = 0; i < 32; i++) + { + if ((copystrings & (1 << i)) != 0) + { + char copybuffer[16]; + int rc = pcre_copy_substring((char *)bptr, offsets, count, + i, copybuffer, sizeof(copybuffer)); + if (rc < 0) + fprintf(outfile, "copy substring %d failed %d\n", i, rc); + else + fprintf(outfile, "%2dC %s (%d)\n", i, copybuffer, rc); + } + } + + for (i = 0; i < 32; i++) + { + if ((getstrings & (1 << i)) != 0) + { + const char *substring; + int rc = pcre_get_substring((char *)bptr, offsets, count, + i, &substring); + if (rc < 0) + fprintf(outfile, "get substring %d failed %d\n", i, rc); + else + { + fprintf(outfile, "%2dG %s (%d)\n", i, substring, rc); + /* free((void *)substring); */ + pcre_free_substring(substring); + } + } + } + + if (getlist) + { + const char **stringlist; + int rc = pcre_get_substring_list((char *)bptr, offsets, count, + &stringlist); + if (rc < 0) + fprintf(outfile, "get substring list failed %d\n", rc); + else + { + for (i = 0; i < count; i++) + fprintf(outfile, "%2dL %s\n", i, stringlist[i]); + if (stringlist[i] != NULL) + fprintf(outfile, "string list not terminated by NULL\n"); + /* free((void *)stringlist); */ + pcre_free_substring_list(stringlist); + } + } + } + + /* Failed to match. If this is a /g or /G loop and we previously set + g_notempty after a null match, this is not necessarily the end. + We want to advance the start offset, and continue. Fudge the offset + values to achieve this. We won't be at the end of the string - that + was checked before setting g_notempty. */ + + else + { + if (g_notempty != 0) + { + offsets[0] = start_offset; + offsets[1] = start_offset + 1; + } + else + { + if (gmatched == 0) /* Error if no previous matches */ + { + if (count == -1) fprintf(outfile, "No match\n"); + else fprintf(outfile, "Error %d\n", count); + } + break; /* Out of the /g loop */ + } + } + + /* If not /g or /G we are done */ + + if (!do_g && !do_G) break; + + /* If we have matched an empty string, first check to see if we are at + the end of the subject. If so, the /g loop is over. Otherwise, mimic + what Perl's /g options does. This turns out to be rather cunning. First + we set PCRE_NOTEMPTY and PCRE_ANCHORED and try the match again at the + same point. If this fails (picked up above) we advance to the next + character. */ + + g_notempty = 0; + if (offsets[0] == offsets[1]) + { + if (offsets[0] == len) break; + g_notempty = PCRE_NOTEMPTY | PCRE_ANCHORED; + } + + /* For /g, update the start offset, leaving the rest alone */ + + if (do_g) start_offset = offsets[1]; + + /* For /G, update the pointer and length */ + + else + { + bptr += offsets[1]; + len -= offsets[1]; + } + } /* End of loop for /g and /G */ + } /* End of loop for data lines */ + + CONTINUE: + + #if !defined NOPOSIX + if (posix || do_posix) regfree(&preg); + #endif + + if (re != NULL) free(re); + if (extra != NULL) free(extra); + if (tables != NULL) + { + free((void *)tables); + setlocale(LC_CTYPE, "C"); + } + } + + fprintf(outfile, "\n"); + return 0; + } + + /* End */

ossp-pkg/pcre/pcre_test.d/testinput1 -> 1.3

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,1921 ---- + /the quick brown fox/ + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + + /The quick brown fox/i + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + + /abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + + /a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + *** Failers + abxyzpqrrabbxyyyypqAzz + abxyzpqrrrrabbxyyyypqAzz + abxyzpqrrrabxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyypqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz + + /^(abc){1,2}zz/ + abczz + abcabczz + *** Failers + zz + abcabcabczz + >>abczz + + /^(b+?|a){1,2}?c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + aaac + abbbbbbbbbbbac + + /^(b+|a){1,2}c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + aaac + abbbbbbbbbbbac + + /^(b+|a){1,2}?bc/ + bbc + + /^(b*|ba){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + + /^(ba|b*){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + + /^\ca\cA\c[\c{\c:/ + \x01\x01\e;z + + /^[ab\]cde]/ + athing + bthing + ]thing + cthing + dthing + ething + *** Failers + fthing + [thing + \\thing + + /^[]cde]/ + ]thing + cthing + dthing + ething + *** Failers + athing + fthing + + /^[^ab\]cde]/ + fthing + [thing + \\thing + *** Failers + athing + bthing + ]thing + cthing + dthing + ething + + /^[^]cde]/ + athing + fthing + *** Failers + ]thing + cthing + dthing + ething + + /^\�/ + � + + /^�/ + � + + /^[0-9]+$/ + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 100 + *** Failers + abc + + /^.*nter/ + enter + inter + uponter + + /^xxx[0-9]+$/ + xxx0 + xxx1234 + *** Failers + xxx + + /^.+[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + + /^.+?[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + + /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + *** Failers + !pqr=apquxz.ixr.zzz.ac.uk + abc!=apquxz.ixr.zzz.ac.uk + abc!pqr=apquxz:ixr.zzz.ac.uk + abc!pqr=apquxz.ixr.zzz.ac.ukk + + /:/ + Well, we need a colon: somewhere + *** Fail if we don't + + /([\da-f:]+)$/i + 0abc + abc + fed + E + :: + 5f03:12C0::932e + fed def + Any old stuff + *** Failers + 0zzz + gzzz + fed\x20 + Any old rubbish + + /^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + A.12.123.0 + *** Failers + .1.2.3333 + 1.2.3 + 1234.2.3 + + /^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 1 IN SOA non-sp1 non-sp2 ( + *** Failers + 1IN SOA non-sp1 non-sp2( + + /^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + Z. + 2. + ab-c.pq-r. + sxk.zzz.ac.uk. + x-.y-. + *** Failers + -abc.peq. + + /^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + *.b0-a + *.c3-b.c + *.c-a.b-c + *** Failers + *.0 + *.a- + *.a-b.c- + *.c-a.0-c + + /^(?=ab(de))(abd)(e)/ + abde + + /^(?!(ab)de|x)(abd)(f)/ + abdf + + /^(?=(ab(cd)))(ab)/ + abcd + + /^[\da-f](\.[\da-f])*$/i + a.b.c.d + A.B.C.D + a.b.c.1.2.3.C + + /^\".*\"\s*(;.*)?$/ + \"1234\" + \"abcd\" ; + \"\" ; rhubarb + *** Failers + \"1234\" : things + + /^$/ + \ + *** Failers + + / ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + *** Failers + abc + ab cde + + /(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/ + ab c + *** Failers + abc + ab cde + + /^ a\ b[c ]d $/x + a bcd + a b d + *** Failers + abcd + ab d + + /^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + + /^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + + /^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + + /^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + + /^a*\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + + /^a*?\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + + /^a+\w/ + az + aaaz + aa + aaaa + aa+ + + /^a+?\w/ + az + aaaz + aa + aaaa + aa+ + + /^\d{8}\w{2,}/ + 1234567890 + 12345678ab + 12345678__ + *** Failers + 1234567 + + /^[aeiou\d]{4,5}$/ + uoie + 1234 + 12345 + aaaaa + *** Failers + 123456 + + /^[aeiou\d]{4,5}?/ + uoie + 1234 + 12345 + aaaaa + 123456 + + /\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + def=defdefdef + *** Failers + abc=defdef + + /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + abcdefghijkkkkcda2 + + /(cat(a(ract|tonic)|erpillar)) \1()2(3)/ + cataract cataract23 + catatonic catatonic23 + caterpillar caterpillar23 + + + /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + + /^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + From abcd Mon Sep 1 12:33:02 1997 + *** Failers + From abcd Sep 01 12:33:02 1997 + + /^12.34/s + 12\n34 + 12\r34 + + /\w+(?=\t)/ + the quick brown\t fox + + /foo(?!bar)(.*)/ + foobar is foolish see? + + /(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + barrel + 2barrel + A barrel + + /^(\D*)(?=\d)(?!123)/ + abc456 + *** Failers + abc123 + + /^1234(?# test newlines + inside)/ + 1234 + + /^1234 #comment in extended re + /x + 1234 + + /#rhubarb + abcd/x + abcd + + /^abcd#rhubarb/x + abcd + + /^(a)\1{2,3}(.)/ + aaab + aaaab + aaaaab + aaaaaab + + /(?!^)abc/ + the abc + *** Failers + abc + + /(?=^)abc/ + abc + *** Failers + the abc + + /^[ab]{1,3}(ab*|b)/ + aabbbbb + + /^[ab]{1,3}?(ab*|b)/ + aabbbbb + + /^[ab]{1,3}?(ab*?|b)/ + aabbbbb + + /^[ab]{1,3}(ab*?|b)/ + aabbbbb + + / (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # optional leading comment + (?: (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # initial word + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) )* # further okay, if led by a period + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + # address + | # or + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # one word, optionally followed by.... + (?: + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... + $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) | # comments, or... + + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + # quoted strings + )* + < (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # leading < + (?: @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* , (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + )* # further okay, if led by comma + : # closing colon + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* )? # optional route + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # initial word + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) )* # further okay, if led by a period + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + # address spec + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* > # trailing > + # name and address + ) (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # optional trailing comment + /x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + + /[\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional leading comment + (?: + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # additional words + )* + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + # address + | # or + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + # leading word + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces + (?: + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + | + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + ) # "special" comment or quoted string + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" + )* + < + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # < + (?: + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + (?: , + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + )* # additional domains + : + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )? # optional route + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # additional words + )* + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + # address spec + > # > + # name and address + ) + /x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + + /abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + + /abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + + /^[\000-\037]/ + \0A + \01B + \037C + + /\0*/ + \0\0\0\0 + + /A\x0{2,3}Z/ + The A\x0\x0Z + An A\0\x0\0Z + *** Failers + A\0Z + A\0\x0\0\x0Z + + /^(cow|)\1(bell)/ + cowcowbell + bell + *** Failers + cowbell + + /^\s/ + \040abc + \x0cabc + \nabc + \rabc + \tabc + *** Failers + abc + + /^a b + c/x + abc + + /^(a|)\1*b/ + ab + aaaab + b + *** Failers + acb + + /^(a|)\1+b/ + aab + aaaab + b + *** Failers + ab + + /^(a|)\1?b/ + ab + aab + b + *** Failers + acb + + /^(a|)\1{2}b/ + aaab + b + *** Failers + ab + aab + aaaab + + /^(a|)\1{2,3}b/ + aaab + aaaab + b + *** Failers + ab + aab + aaaaab + + /ab{1,3}bc/ + abbbbc + abbbc + abbc + *** Failers + abc + abbbbbc + + /([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + + /([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + + /([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + + /^[W-c]+$/ + WXY_^abc + ***Failers + wxy + + /^[W-c]+$/i + WXY_^abc + wxy_^ABC + + /^[\x3f-\x5F]+$/i + WXY_^abc + wxy_^ABC + + /^abc$/m + abc + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + + /^abc$/ + abc + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + + /\Aabc\Z/m + abc + abc\n + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + + /\A(.)*\Z/s + abc\ndef + + /\A(.)*\Z/m + *** Failers + abc\ndef + + /(?:b)|(?::+)/ + b::c + c::b + + /[-az]+/ + az- + *** Failers + b + + /[az-]+/ + za- + *** Failers + b + + /[a\-z]+/ + a-z + *** Failers + b + + /[a-z]+/ + abcdxyz + + /[\d-]+/ + 12-34 + *** Failers + aaa + + /[\d-z]+/ + 12-34z + *** Failers + aaa + + /\x5c/ + \\ + + /\x20Z/ + the Zoo + *** Failers + Zulu + + /(abc)\1/i + abcabc + ABCabc + abcABC + + /(main(O)?)+/ + mainmain + mainOmain + + /ab{3cd/ + ab{3cd + + /ab{3,cd/ + ab{3,cd + + /ab{3,4a}cd/ + ab{3,4a}cd + + /{4,5a}bc/ + {4,5a}bc + + /^a.b/ + a\rb + *** Failers + a\nb + + /abc$/ + abc + abc\n + *** Failers + abc\ndef + + /(abc)\123/ + abc\x53 + + /(abc)\223/ + abc\x93 + + /(abc)\323/ + abc\xd3 + + /(abc)\500/ + abc\x40 + abc\100 + + /(abc)\5000/ + abc\x400 + abc\x40\x30 + abc\1000 + abc\100\x30 + abc\100\060 + abc\100\60 + + /abc\81/ + abc\081 + abc\0\x38\x31 + + /abc\91/ + abc\091 + abc\0\x39\x31 + + /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + + /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + + /ab\gdef/ + abgdef + + /a{0}bc/ + bc + + /(a|(bc)){0,0}?xyz/ + xyz + + /abc[\10]de/ + abc\010de + + /abc[\1]de/ + abc\1de + + /(abc)[\1]de/ + abc\1de + + /a.b(?s)/ + a\nb + + /^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + baNOTcccd + baNOTccd + bacccd + *** Failers + anything + b\bc + baccd + + /[^a]/ + Abc + + /[^a]/i + Abc + + /[^a]+/ + AAAaAbc + + /[^a]+/i + AAAaAbc + + /[^a]+/ + bbb\nccc + + /[^k]$/ + abc + *** Failers + abk + + /[^k]{2,3}$/ + abc + kbc + kabc + *** Failers + abk + akb + akk + + /^\d{8,}\@.+[^k]$/ + 12345678\@a.b.c.d + 123456789\@x.y.z + *** Failers + 12345678\@x.y.uk + 1234567\@a.b.c.d + + /(a)\1{8,}/ + aaaaaaaaa + aaaaaaaaaa + *** Failers + aaaaaaa + + /[^a]/ + aaaabcd + aaAabcd + + /[^a]/i + aaaabcd + aaAabcd + + /[^az]/ + aaaabcd + aaAabcd + + /[^az]/i + aaaabcd + aaAabcd + + /\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377/ + \000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377 + + /P[^*]TAIRE[^*]{1,6}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + + /P[^*]TAIRE[^*]{1,}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + + /(\.\d\d[1-9]?)\d+/ + 1.230003938 + 1.875000282 + 1.235 + + /(\.\d\d((?=0)|\d(?=\d)))/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + + /a(?)b/ + ab + + /\b(foo)\s+(\w+)/i + Food is on the foo table + + /foo(.*)bar/ + The food is under the bar in the barn. + + /foo(.*?)bar/ + The food is under the bar in the barn. + + /(.*)(\d*)/ + I have 2 numbers: 53147 + + /(.*)(\d+)/ + I have 2 numbers: 53147 + + /(.*?)(\d*)/ + I have 2 numbers: 53147 + + /(.*?)(\d+)/ + I have 2 numbers: 53147 + + /(.*)(\d+)$/ + I have 2 numbers: 53147 + + /(.*?)(\d+)$/ + I have 2 numbers: 53147 + + /(.*)\b(\d+)$/ + I have 2 numbers: 53147 + + /(.*\D)(\d+)$/ + I have 2 numbers: 53147 + + /^\D*(?!123)/ + ABC123 + + /^(\D*)(?=\d)(?!123)/ + ABC445 + *** Failers + ABC123 + + /^[W-]46]/ + W46]789 + -46]789 + *** Failers + Wall + Zebra + 42 + [abcd] + ]abcd[ + + /^[W-\]46]/ + W46]789 + Wall + Zebra + Xylophone + 42 + [abcd] + ]abcd[ + \\backslash + *** Failers + -46]789 + well + + /\d\d\/\d\d\/\d\d\d\d/ + 01/01/2000 + + /word (?:[a-zA-Z0-9]+ ){0,10}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + + /word (?:[a-zA-Z0-9]+ ){0,300}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + + /^(a){0,0}/ + bcd + abc + aab + + /^(a){0,1}/ + bcd + abc + aab + + /^(a){0,2}/ + bcd + abc + aab + + /^(a){0,3}/ + bcd + abc + aab + aaa + + /^(a){0,}/ + bcd + abc + aab + aaa + aaaaaaaa + + /^(a){1,1}/ + bcd + abc + aab + + /^(a){1,2}/ + bcd + abc + aab + + /^(a){1,3}/ + bcd + abc + aab + aaa + + /^(a){1,}/ + bcd + abc + aab + aaa + aaaaaaaa + + /.*\.gif/ + borfle\nbib.gif\nno + + /.{0,}\.gif/ + borfle\nbib.gif\nno + + /.*\.gif/m + borfle\nbib.gif\nno + + /.*\.gif/s + borfle\nbib.gif\nno + + /.*\.gif/ms + borfle\nbib.gif\nno + + /.*$/ + borfle\nbib.gif\nno + + /.*$/m + borfle\nbib.gif\nno + + /.*$/s + borfle\nbib.gif\nno + + /.*$/ms + borfle\nbib.gif\nno + + /.*$/ + borfle\nbib.gif\nno\n + + /.*$/m + borfle\nbib.gif\nno\n + + /.*$/s + borfle\nbib.gif\nno\n + + /.*$/ms + borfle\nbib.gif\nno\n + + /(.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + + /(.*X|^B)/m + abcde\n1234Xyz + BarFoo + abcde\nBar + + /(.*X|^B)/s + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + + /(.*X|^B)/ms + abcde\n1234Xyz + BarFoo + abcde\nBar + + /(?s)(.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + + /(?s:.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + + /^.*B/ + **** Failers + abc\nB + + /(?s)^.*B/ + abc\nB + + /(?m)^.*B/ + abc\nB + + /(?ms)^.*B/ + abc\nB + + /(?ms)^B/ + abc\nB + + /(?s)B$/ + B\n + + /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ + 123456654321 + + /^\d\d\d\d\d\d\d\d\d\d\d\d/ + 123456654321 + + /^[\d][\d][\d][\d][\d][\d][\d][\d][\d][\d][\d][\d]/ + 123456654321 + + /^[abc]{12}/ + abcabcabcabc + + /^[a-c]{12}/ + abcabcabcabc + + /^(a|b|c){12}/ + abcabcabcabc + + /^[abcdefghijklmnopqrstuvwxy0123456789]/ + n + *** Failers + z + + /abcde{0,0}/ + abcd + *** Failers + abce + + /ab[cd]{0,0}e/ + abe + *** Failers + abcde + + /ab(c){0,0}d/ + abd + *** Failers + abcd + + /a(b*)/ + a + ab + abbbb + *** Failers + bbbbb + + /ab\d{0}e/ + abe + *** Failers + ab1e + + /"([^\\"]+|\\.)*"/ + the \"quick\" brown fox + \"the \\\"quick\\\" brown fox\" + + /.*?/g+ + abc + + /\b/g+ + abc + + /\b/+g + abc + + //g + abc + + /<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)(( ([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is + <TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor (N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR> + + /a[^a]b/ + acb + a\nb + + /a.b/ + acb + *** Failers + a\nb + + /a[^a]b/s + acb + a\nb + + /a.b/s + acb + a\nb + + / End of testinput1 /

ossp-pkg/pcre/pcre_test.d/testinput2 -> 1.4

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,710 ---- + /(a)b|/ + + /abc/ + abc + defabc + \Aabc + *** Failers + \Adefabc + ABC + + /^abc/ + abc + \Aabc + *** Failers + defabc + \Adefabc + + /a+bc/ + + /a*bc/ + + /a{3}bc/ + + /(abc|a+z)/ + + /^abc$/ + abc + *** Failers + def\nabc + + /ab\gdef/X + + /(?X)ab\gdef/X + + /x{5,4}/ + + /z{65536}/ + + /[abcd/ + + /[\B]/ + + /[z-a]/ + + /^*/ + + /(abc/ + + /(?# abc/ + + /(?z)abc/ + + /.*b/ + + /.*?b/ + + /cat|dog|elephant/ + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + + /cat|dog|elephant/S + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + + /cat|dog|elephant/iS + this sentence eventually mentions a CAT cat + this sentences rambles on and on for a while to elephant ElePhant + + /a|[bcd]/S + + /(a|[^\dZ])/S + + /(a|b)*[\s]/S + + /(ab\2)/ + + /{4,5}abc/ + + /(a)(b)(c)\2/ + abcb + \O0abcb + \O3abcb + \O6abcb + \O9abcb + \O12abcb + + /(a)bc|(a)(b)\2/ + abc + \O0abc + \O3abc + \O6abc + aba + \O0aba + \O3aba + \O6aba + \O9aba + \O12aba + + /abc$/E + abc + *** Failers + abc\n + abc\ndef + + /(a)(b)(c)(d)(e)\6/ + + /the quick brown fox/ + the quick brown fox + this is a line with the quick brown fox + + /the quick brown fox/A + the quick brown fox + *** Failers + this is a line with the quick brown fox + + /ab(?z)cd/ + + /^abc|def/ + abcdef + abcdef\B + + /.*((abc)$|(def))/ + defabc + \Zdefabc + + /abc/P + abc + *** Failers + + /^abc|def/P + abcdef + abcdef\B + + /.*((abc)$|(def))/P + defabc + \Zdefabc + + /the quick brown fox/P + the quick brown fox + *** Failers + The Quick Brown Fox + + /the quick brown fox/Pi + the quick brown fox + The Quick Brown Fox + + /abc.def/P + *** Failers + abc\ndef + + /abc$/P + abc + abc\n + + /(abc)\2/P + + /(abc\1)/P + abc + + /)/ + + /a[]b/ + + /[^aeiou ]{3,}/ + co-processors, and for + + /<.*>/ + abc<def>ghi<klm>nop + + /<.*?>/ + abc<def>ghi<klm>nop + + /<.*>/U + abc<def>ghi<klm>nop + + /<.*>(?U)/ + abc<def>ghi<klm>nop + + /<.*?>/U + abc<def>ghi<klm>nop + + /={3,}/U + abc========def + + /(?U)={3,}?/ + abc========def + + /(?<!bar|cattle)foo/ + foo + catfoo + *** Failers + the barfoo + and cattlefoo + + /(?<=a+)b/ + + /(?<=aaa|b{0,3})b/ + + /(?<!(foo)a\1)bar/ + + /(?i)abc/ + + /(a|(?m)a)/ + + /(?i)^1234/ + + /(^b|(?i)^d)/ + + /(?s).*/ + + /[abcd]/S + + /(?i)[abcd]/S + + /(?m)[xy]|(b|c)/S + + /(^a|^b)/m + + /(?i)(^a|^b)/m + + /(a)(?(1)a|b|c)/ + + /(?(?=a)a|b|c)/ + + /(?(1a)/ + + /(?(?i))/ + + /(?(abc))/ + + /(?(?<ab))/ + + /((?s)blah)\s+\1/ + + /((?i)blah)\s+\1/ + + /((?i)b)/DS + + /(a*b|(?i:c*(?-i)d))/S + + /a$/ + a + a\n + *** Failers + \Za + \Za\n + + /a$/m + a + a\n + \Za\n + *** Failers + \Za + + /\Aabc/m + + /^abc/m + + /^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/ + aaaaabbbbbcccccdef + + /(?<=foo)[ab]/S + + /(?<!foo)(alpha|omega)/S + + /(?!alphabet)[ab]/S + + /(?<=foo\n)^bar/m + + /(?>^abc)/m + abc + def\nabc + *** Failers + defabc + + /(?<=ab(c+)d)ef/ + + /(?<=ab(?<=c+)d)ef/ + + /(?<=ab(c|de)f)g/ + + /The next three are in testinput2 because they have variable length branches/ + + /(?<=bullock|donkey)-cart/ + the bullock-cart + a donkey-cart race + *** Failers + cart + horse-and-cart + + /(?<=ab(?i)x|y|z)/ + + /(?>.*)(?<=(abcd)|(xyz))/ + alphabetabcd + endingxyz + + /(?<=ab(?i)x(?-i)y|(?i)z|b)ZZ/ + abxyZZ + abXyZZ + ZZZ + zZZ + bZZ + BZZ + *** Failers + ZZ + abXYZZ + zzz + bzz + + /(?<!(foo)a)bar/ + bar + foobbar + *** Failers + fooabar + + /This one is here because Perl 5.005_02 doesn't fail it/ + + /^(a)?(?(1)a|b)+$/ + *** Failers + a + + /This one is here because I think Perl 5.005_02 gets the setting of $1 wrong/ + + /^(a\1?){4}$/ + aaaaaa + + /These are syntax tests from Perl 5.005/ + + /a[b-a]/ + + /a[]b/ + + /a[/ + + /*a/ + + /(*)b/ + + /abc)/ + + /(abc/ + + /a**/ + + /)(/ + + /\1/ + + /\2/ + + /(a)|\2/ + + /a[b-a]/i + + /a[]b/i + + /a[/i + + /*a/i + + /(*)b/i + + /abc)/i + + /(abc/i + + /a**/i + + /)(/i + + /:(?:/ + + /(?<%)b/ + + /a(?{)b/ + + /a(?{{})b/ + + /a(?{}})b/ + + /a(?{"{"})b/ + + /a(?{"{"}})b/ + + /(?(1?)a|b)/ + + /(?(1)a|b|c)/ + + /[a[:xyz:/ + + /(?<=x+)y/ + + /a{37,17}/ + + /abc/\ + + /abc/\P + + /abc/\i + + /(a)bc(d)/ + abcd + abcd\C2 + abcd\C5 + + /(.{20})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1 + abcdefghijklmnopqrstuvwxyz\G1 + + /(.{15})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1\G1 + + /(.{16})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1\G1\L + + /^(a|(bc))de(f)/ + adef\G1\G2\G3\G4\L + bcdef\G1\G2\G3\G4\L + adefghijk\C0 + + /^abc\00def/ + abc\00def\L\C0 + + /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ + )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ + )?)?)?)?)?)?)?)?)?otherword/M + + /.*X/D + + /.*X/Ds + + /(.*X|^B)/D + + /(.*X|^B)/Ds + + /(?s)(.*X|^B)/D + + /(?s:.*X|^B)/D + + /\Biss\B/+ + Mississippi + + /\Biss\B/+P + Mississippi + + /iss/G+ + Mississippi + + /\Biss\B/G+ + Mississippi + + /\Biss\B/g+ + Mississippi + *** Failers + Mississippi\A + + /(?<=[Ms])iss/g+ + Mississippi + + /(?<=[Ms])iss/G+ + Mississippi + + /^iss/g+ + ississippi + + /.*iss/g+ + abciss\nxyzisspqr + + /.i./+g + Mississippi + Mississippi\A + Missouri river + Missouri river\A + + /^.is/+g + Mississippi + + /^ab\n/g+ + ab\nab\ncd + + /^ab\n/mg+ + ab\nab\ncd + + /abc/ + + /abc|bac/ + + /(abc|bac)/ + + /(abc|(c|dc))/ + + /(abc|(d|de)c)/ + + /a*/ + + /a+/ + + /(baa|a+)/ + + /a{0,3}/ + + /baa{3,}/ + + /"([^\\"]+|\\.)*"/ + + /(abc|ab[cd])/ + + /(a|.)/ + + /a|ba|\w/ + + /abc(?=pqr)/ + + /...(?<=abc)/ + + /abc(?!pqr)/ + + /ab./ + + /ab[xyz]/ + + /abc*/ + + /ab.c*/ + + /a.c*/ + + /.c*/ + + /ac*/ + + /(a.c*|b.c*)/ + + /a.c*|aba/ + + /.+a/ + + /(?=abcda)a.*/ + + /(?=a)a.*/ + + /a(b)*/ + + /a\d*/ + + /ab\d*/ + + /a(\d)*/ + + /abcde{0,0}/ + + /ab\d+/ + + /a(?(1)b)/ + + /a(?(1)bag|big)/ + + /a(?(1)bag|big)*/ + + /a(?(1)bag|big)+/ + + /a(?(1)b..|b..)/ + + /ab\d{0}e/ + + /a?b?/ + a + b + ab + \ + *** Failers + \N + + /|-/ + abcd + -abc + \Nab-c + *** Failers + \Nabc + + /a*(b+)(z)(z)/P + aaaabbbbzzzz + aaaabbbbzzzz\O0 + aaaabbbbzzzz\O1 + aaaabbbbzzzz\O2 + aaaabbbbzzzz\O3 + aaaabbbbzzzz\O4 + aaaabbbbzzzz\O5 + + /^.?abcd/S + + /$ # ( at start + (?: # Non-capturing bracket + (?>[^()]+) # Either a sequence of non-brackets (no backtracking) + | # Or + (?R) # Recurse - i.e. nested bracketed string + )* # Zero or more contents + $ # Closing ) + /x + (abcd) + (abcd)xyz + xyz(abcd) + (ab(xy)cd)pqr + (ab(xycd)pqr + () abc () + 12(abcde(fsh)xyz(foo(bar))lmno)89 + *** Failers + abcd + abcd) + (abcd + + /$ ( (?>[^()]+) | (?R) )* $ /xg + (ab(xy)cd)pqr + 1(abcd)(x(y)z)pqr + + /$ (?: (?>[^()]+) | (?R) ) $ /x + (abcd) + (ab(xy)cd) + (a(b(c)d)e) + ((ab)) + *** Failers + () + + /$ (?: (?>[^()]+) | (?R) )? $ /x + () + 12(abcde(fsh)xyz(foo(bar))lmno)89 + + /$ ( (?>[^()]+) | (?R) )* $ /x + (ab(xy)cd) + + /$ ( ( (?>[^()]+) | (?R) )* ) $ /x + (ab(xy)cd) + + /$ (123)? ( ( (?>[^()]+) | (?R) )* ) $ /x + (ab(xy)cd) + (123ab(xy)cd) + + /$ ( (123)? ( (?>[^()]+) | (?R) )* ) $ /x + (ab(xy)cd) + (123ab(xy)cd) + + /$ (((((((((( ( (?>[^()]+) | (?R) )* )))))))))) $ /x + (ab(xy)cd) + + /$ ( ( (?>[^()<>]+) | ((?>[^()]+)) | (?R) )* ) $ /x + (abcd(xyzqrs)123) + + /$ ( ( (?>[^()]+) | ((?R)) )* ) $ /x + (ab(cd)ef) + (ab(cd(ef)gh)ij) + + /^[[:alnum:]]/D + + /^[[:alpha:]]/D + + /^[[:ascii:]]/D + + /^[[:cntrl:]]/D + + /^[[:digit:]]/D + + /^[[:graph:]]/D + + /^[[:lower:]]/D + + /^[[:print:]]/D + + /^[[:punct:]]/D + + /^[[:space:]]/D + + /^[[:upper:]]/D + + /^[[:xdigit:]]/D + + /^[[:word:]]/D + + /^[[:^cntrl:]]/D + + /^[12[:^digit:]]/D + + /[01[:alpha:]%]/D + + /[[.ch.]]/ + + /[[=ch=]]/ + + /[[:rhubarb:]]/ + + /[[:upper:]]/i + A + a + + /[[:lower:]]/i + A + a + + /((?-i)[[:lower:]])[[:lower:]]/i + ab + aB + *** Failers + Ab + AB + + /[\200-\410]/ + + / End of testinput2 /

ossp-pkg/pcre/pcre_test.d/testinput3 -> 1.4

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,1724 ---- + /(?<!bar)foo/ + foo + catfood + arfootle + rfoosh + *** Failers + barfoo + towbarfoo + + /\w{3}(?<!bar)foo/ + catfood + *** Failers + foo + barfoo + towbarfoo + + /(?<=(foo)a)bar/ + fooabar + *** Failers + bar + foobbar + + /\Aabc\z/m + abc + *** Failers + abc\n + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + + "(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + + "(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + + /(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + + /^((?>\w+)|(?>\s+))*$/ + now is the time for all good men to come to the aid of the party + *** Failers + this is not a line with only words and spaces! + + /(\d+)(\w)/ + 12345a + 12345+ + + /((?>\d+))(\w)/ + 12345a + *** Failers + 12345+ + + /(?>a+)b/ + aaab + + /((?>a+)b)/ + aaab + + /(?>(a+))b/ + aaab + + /(?>b)+/ + aaabbbccc + + /(?>a+|b+|c+)*c/ + aaabbbbccccd + + /((?>[^()]+)|$[^()]*$)+/ + ((abc(ade)ufh()()x + + /$((?>[^()]+)|\([^()]+$)+\)/ + (abc) + (abc(def)xyz) + *** Failers + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + + /a(?-i)b/i + ab + *** Failers + Ab + aB + AB + + /(a (?x)b c)d e/ + a bcd e + *** Failers + a b cd e + abcd e + a bcde + + /(a b(?x)c d (?-x)e f)/ + a bcde f + *** Failers + abcdef + + /(a(?i)b)c/ + abc + aBc + *** Failers + abC + aBC + Abc + ABc + ABC + AbC + + /a(?i:b)c/ + abc + aBc + *** Failers + ABC + abC + aBC + + /a(?i:b)*c/ + aBc + aBBc + *** Failers + aBC + aBBC + + /a(?=b(?i)c)\w\wd/ + abcd + abCd + *** Failers + aBCd + abcD + + /(?s-i:more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + + /(?:(?s-i)more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + + /(?>a(?i)b+)+c/ + abc + aBbc + aBBc + *** Failers + Abc + abAb + abbC + + /(?=a(?i)b)\w\wc/ + abc + aBc + *** Failers + Ab + abC + aBC + + /(?<=a(?i)b)(\w\w)c/ + abxxc + aBxxc + *** Failers + Abxxc + ABxxc + abxxC + + /(?:(a)|b)(?(1)A|B)/ + aA + bB + *** Failers + aB + bA + + /^(a)?(?(1)a|b)+$/ + aa + b + bb + *** Failers + ab + + /^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 12 + *** Failers + 123 + xyz + + /^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 12 + *** Failers + 123 + xyz + + /(?(?<=foo)bar|cat)/ + foobar + cat + fcat + focat + *** Failers + foocat + + /(?(?<!foo)cat|bar)/ + foobar + cat + fcat + focat + *** Failers + foocat + + /( $ )? [^()]+ (?(1) $ |) /x + abcd + (abcd) + the quick (abcd) fox + (abcd + + /( $ )? [^()]+ (?(1) $ ) /x + abcd + (abcd) + the quick (abcd) fox + (abcd + + /^(?(2)a|(1)(2))+$/ + 12 + 12a + 12aa + *** Failers + 1234 + + /((?i)blah)\s+\1/ + blah blah + BLAH BLAH + Blah Blah + blaH blaH + *** Failers + blah BLAH + Blah blah + blaH blah + + /((?i)blah)\s+(?i:\1)/ + blah blah + BLAH BLAH + Blah Blah + blaH blaH + blah BLAH + Blah blah + blaH blah + + /(?>a*)*/ + a + aa + aaaa + + /(abc|)+/ + abc + abcabc + abcabcabc + xyz + + /([a]*)*/ + a + aaaaa + + /([ab]*)*/ + a + b + ababab + aaaabcde + bbbb + + /([^a]*)*/ + b + bbbb + aaa + + /([^ab]*)*/ + cccc + abab + + /([a]*?)*/ + a + aaaa + + /([ab]*?)*/ + a + b + abab + baba + + /([^a]*?)*/ + b + bbbb + aaa + + /([^ab]*?)*/ + c + cccc + baba + + /(?>a*)*/ + a + aaabcde + + /((?>a*))*/ + aaaaa + aabbaa + + /((?>a*?))*/ + aaaaa + aabbaa + + /(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 12-09-98 + *** Failers + sep-12-98 + + /(?<=(foo))bar\1/ + foobarfoo + foobarfootling + *** Failers + foobar + barfoo + + /(?i:saturday|sunday)/ + saturday + sunday + Saturday + Sunday + SATURDAY + SUNDAY + SunDay + + /(a(?i)bc|BB)x/ + abcx + aBCx + bbx + BBx + *** Failers + abcX + aBCX + bbX + BBX + + /^([ab](?i)[cd]|[ef])/ + ac + aC + bD + elephant + Europe + frog + France + *** Failers + Africa + + /^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + aBd + xy + xY + zebra + Zambesi + *** Failers + aCD + XY + + /(?<=foo\n)^bar/m + foo\nbar + *** Failers + bar + baz\nbar + + /(?<=(?<!foo)bar)baz/ + barbaz + barbarbaz + koobarbaz + *** Failers + baz + foobarbaz + + /The case of aaaaaa is missed out below because I think Perl 5.005_02 gets/ + /it wrong; it sets $1 to aaa rather than aa. Compare the following test,/ + /where it does set $1 to aa when matching aaaaaa./ + + /^(a\1?){4}$/ + a + aa + aaa + aaaa + aaaaa + aaaaaaa + aaaaaaaa + aaaaaaaaa + aaaaaaaaaa + aaaaaaaaaaa + aaaaaaaaaaaa + aaaaaaaaaaaaa + aaaaaaaaaaaaaa + aaaaaaaaaaaaaaa + aaaaaaaaaaaaaaaa + + /^(a\1?)(a\1?)(a\2?)(a\3?)$/ + a + aa + aaa + aaaa + aaaaa + aaaaaa + aaaaaaa + aaaaaaaa + aaaaaaaaa + aaaaaaaaaa + aaaaaaaaaaa + aaaaaaaaaaaa + aaaaaaaaaaaaa + aaaaaaaaaaaaaa + aaaaaaaaaaaaaaa + aaaaaaaaaaaaaaaa + + /The following tests are taken from the Perl 5.005 test suite; some of them/ + /are compatible with 5.004, but I'd rather not have to sort them out./ + + /abc/ + abc + xabcy + ababc + *** Failers + xbc + axc + abx + + /ab*c/ + abc + + /ab*bc/ + abc + abbc + abbbbc + + /.{1}/ + abbbbc + + /.{3,4}/ + abbbbc + + /ab{0,}bc/ + abbbbc + + /ab+bc/ + abbc + *** Failers + abc + abq + + /ab{1,}bc/ + + /ab+bc/ + abbbbc + + /ab{1,}bc/ + abbbbc + + /ab{1,3}bc/ + abbbbc + + /ab{3,4}bc/ + abbbbc + + /ab{4,5}bc/ + *** Failers + abq + abbbbc + + /ab?bc/ + abbc + abc + + /ab{0,1}bc/ + abc + + /ab?bc/ + + /ab?c/ + abc + + /ab{0,1}c/ + abc + + /^abc$/ + abc + *** Failers + abbbbc + abcc + + /^abc/ + abcc + + /^abc$/ + + /abc$/ + aabc + *** Failers + aabc + aabcd + + /^/ + abc + + /$/ + abc + + /a.c/ + abc + axc + + /a.*c/ + axyzc + + /a[bc]d/ + abd + *** Failers + axyzd + abc + + /a[b-d]e/ + ace + + /a[b-d]/ + aac + + /a[-b]/ + a- + + /a[b-]/ + a- + + /a]/ + a] + + /a[]]b/ + a]b + + /a[^bc]d/ + aed + *** Failers + abd + abd + + /a[^-b]c/ + adc + + /a[^]b]c/ + adc + *** Failers + a-c + a]c + + /\ba\b/ + a- + -a + -a- + + /\by\b/ + *** Failers + xy + yz + xyz + + /\Ba\B/ + *** Failers + a- + -a + -a- + + /\By\b/ + xy + + /\by\B/ + yz + + /\By\B/ + xyz + + /\w/ + a + + /\W/ + - + *** Failers + - + a + + /a\sb/ + a b + + /a\Sb/ + a-b + *** Failers + a-b + a b + + /\d/ + 1 + + /\D/ + - + *** Failers + - + 1 + + /[\w]/ + a + + /[\W]/ + - + *** Failers + - + a + + /a[\s]b/ + a b + + /a[\S]b/ + a-b + *** Failers + a-b + a b + + /[\d]/ + 1 + + /[\D]/ + - + *** Failers + - + 1 + + /ab|cd/ + abc + abcd + + /()ef/ + def + + /$b/ + + /a\(b/ + a(b + + /a\(*b/ + ab + a((b + + /a\\b/ + a\b + + /((a))/ + abc + + /(a)b(c)/ + abc + + /a+b+c/ + aabbabc + + /a{1,}b{1,}c/ + aabbabc + + /a.+?c/ + abcabc + + /(a+|b)*/ + ab + + /(a+|b){0,}/ + ab + + /(a+|b)+/ + ab + + /(a+|b){1,}/ + ab + + /(a+|b)?/ + ab + + /(a+|b){0,1}/ + ab + + /[^ab]*/ + cde + + /abc/ + *** Failers + b + + + /a*/ + + + /([abc])*d/ + abbbcd + + /([abc])*bcd/ + abcd + + /a|b|c|d|e/ + e + + /(a|b|c|d|e)f/ + ef + + /abcd*efg/ + abcdefg + + /ab*/ + xabyabbbz + xayabbbz + + /(ab|cd)e/ + abcde + + /[abhgefdc]ij/ + hij + + /^(ab|cd)e/ + + /(abc|)ef/ + abcdef + + /(a|b)c*d/ + abcd + + /(ab|ab*)bc/ + abc + + /a([bc]*)c*/ + abc + + /a([bc]*)(c*d)/ + abcd + + /a([bc]+)(c*d)/ + abcd + + /a([bc]*)(c+d)/ + abcd + + /a[bcd]*dcdcde/ + adcdcde + + /a[bcd]+dcdcde/ + *** Failers + abcde + adcdcde + + /(ab|a)b*c/ + abc + + /((a)(b)c)(d)/ + abcd + + /[a-zA-Z_][a-zA-Z0-9_]*/ + alpha + + /^a(bc+|b[eh])g|.h$/ + abh + + /(bc+d$|ef*g.|h?i(j|k))/ + effgz + ij + reffgz + *** Failers + effg + bcdd + + /((((((((((a))))))))))/ + a + + /((((((((((a))))))))))\10/ + aa + + /(((((((((a)))))))))/ + a + + /multiple words of text/ + *** Failers + aa + uh-uh + + /multiple words/ + multiple words, yeah + + /(.*)c(.*)/ + abcde + + /$(.*), (.*)$/ + (a, b) + + /[k]/ + + /abcd/ + abcd + + /a(bc)d/ + abcd + + /a[-]?c/ + ac + + /(abc)\1/ + abcabc + + /([a-c]*)\1/ + abcabc + + /(a)|\1/ + a + *** Failers + ab + x + + /(([a-c])b*?\2)*/ + ababbbcbc + + /(([a-c])b*?\2){3}/ + ababbbcbc + + /((\3|b)\2(a)x)+/ + aaaxabaxbaaxbbax + + /((\3|b)\2(a)){2,}/ + bbaababbabaaaaabbaaaabba + + /abc/i + ABC + XABCY + ABABC + *** Failers + aaxabxbaxbbx + XBC + AXC + ABX + + /ab*c/i + ABC + + /ab*bc/i + ABC + ABBC + + /ab*?bc/i + ABBBBC + + /ab{0,}?bc/i + ABBBBC + + /ab+?bc/i + ABBC + + /ab+bc/i + *** Failers + ABC + ABQ + + /ab{1,}bc/i + + /ab+bc/i + ABBBBC + + /ab{1,}?bc/i + ABBBBC + + /ab{1,3}?bc/i + ABBBBC + + /ab{3,4}?bc/i + ABBBBC + + /ab{4,5}?bc/i + *** Failers + ABQ + ABBBBC + + /ab??bc/i + ABBC + ABC + + /ab{0,1}?bc/i + ABC + + /ab??bc/i + + /ab??c/i + ABC + + /ab{0,1}?c/i + ABC + + /^abc$/i + ABC + *** Failers + ABBBBC + ABCC + + /^abc/i + ABCC + + /^abc$/i + + /abc$/i + AABC + + /^/i + ABC + + /$/i + ABC + + /a.c/i + ABC + AXC + + /a.*?c/i + AXYZC + + /a.*c/i + *** Failers + AABC + AXYZD + + /a[bc]d/i + ABD + + /a[b-d]e/i + ACE + *** Failers + ABC + ABD + + /a[b-d]/i + AAC + + /a[-b]/i + A- + + /a[b-]/i + A- + + /a]/i + A] + + /a[]]b/i + A]B + + /a[^bc]d/i + AED + + /a[^-b]c/i + ADC + *** Failers + ABD + A-C + + /a[^]b]c/i + ADC + + /ab|cd/i + ABC + ABCD + + /()ef/i + DEF + + /$b/i + *** Failers + A]C + B + + /a\(b/i + A(B + + /a\(*b/i + AB + A((B + + /a\\b/i + A\B + + /((a))/i + ABC + + /(a)b(c)/i + ABC + + /a+b+c/i + AABBABC + + /a{1,}b{1,}c/i + AABBABC + + /a.+?c/i + ABCABC + + /a.*?c/i + ABCABC + + /a.{0,5}?c/i + ABCABC + + /(a+|b)*/i + AB + + /(a+|b){0,}/i + AB + + /(a+|b)+/i + AB + + /(a+|b){1,}/i + AB + + /(a+|b)?/i + AB + + /(a+|b){0,1}/i + AB + + /(a+|b){0,1}?/i + AB + + /[^ab]*/i + CDE + + /abc/i + + /a*/i + + + /([abc])*d/i + ABBBCD + + /([abc])*bcd/i + ABCD + + /a|b|c|d|e/i + E + + /(a|b|c|d|e)f/i + EF + + /abcd*efg/i + ABCDEFG + + /ab*/i + XABYABBBZ + XAYABBBZ + + /(ab|cd)e/i + ABCDE + + /[abhgefdc]ij/i + HIJ + + /^(ab|cd)e/i + ABCDE + + /(abc|)ef/i + ABCDEF + + /(a|b)c*d/i + ABCD + + /(ab|ab*)bc/i + ABC + + /a([bc]*)c*/i + ABC + + /a([bc]*)(c*d)/i + ABCD + + /a([bc]+)(c*d)/i + ABCD + + /a([bc]*)(c+d)/i + ABCD + + /a[bcd]*dcdcde/i + ADCDCDE + + /a[bcd]+dcdcde/i + + /(ab|a)b*c/i + ABC + + /((a)(b)c)(d)/i + ABCD + + /[a-zA-Z_][a-zA-Z0-9_]*/i + ALPHA + + /^a(bc+|b[eh])g|.h$/i + ABH + + /(bc+d$|ef*g.|h?i(j|k))/i + EFFGZ + IJ + REFFGZ + *** Failers + ADCDCDE + EFFG + BCDD + + /((((((((((a))))))))))/i + A + + /((((((((((a))))))))))\10/i + AA + + /(((((((((a)))))))))/i + A + + /(?:(?:(?:(?:(?:(?:(?:(?:(?:(a))))))))))/i + A + + /(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/i + C + + /multiple words of text/i + *** Failers + AA + UH-UH + + /multiple words/i + MULTIPLE WORDS, YEAH + + /(.*)c(.*)/i + ABCDE + + /$(.*), (.*)$/i + (A, B) + + /[k]/i + + /abcd/i + ABCD + + /a(bc)d/i + ABCD + + /a[-]?c/i + AC + + /(abc)\1/i + ABCABC + + /([a-c]*)\1/i + ABCABC + + /a(?!b)./ + abad + + /a(?=d)./ + abad + + /a(?=c|d)./ + abad + + /a(?:b|c|d)(.)/ + ace + + /a(?:b|c|d)*(.)/ + ace + + /a(?:b|c|d)+?(.)/ + ace + acdbcdbe + + /a(?:b|c|d)+(.)/ + acdbcdbe + + /a(?:b|c|d){2}(.)/ + acdbcdbe + + /a(?:b|c|d){4,5}(.)/ + acdbcdbe + + /a(?:b|c|d){4,5}?(.)/ + acdbcdbe + + /((foo)|(bar))*/ + foobar + + /a(?:b|c|d){6,7}(.)/ + acdbcdbe + + /a(?:b|c|d){6,7}?(.)/ + acdbcdbe + + /a(?:b|c|d){5,6}(.)/ + acdbcdbe + + /a(?:b|c|d){5,6}?(.)/ + acdbcdbe + + /a(?:b|c|d){5,7}(.)/ + acdbcdbe + + /a(?:b|c|d){5,7}?(.)/ + acdbcdbe + + /a(?:b|(c|e){1,2}?|d)+?(.)/ + ace + + /^(.+)?B/ + AB + + /^([^a-z])|(\^)$/ + . + + /^[<>]&/ + <&OUT + + /^(a\1?){4}$/ + aaaaaaaaaa + *** Failers + AB + aaaaaaaaa + aaaaaaaaaaa + + /^(a(?(1)\1)){4}$/ + aaaaaaaaaa + *** Failers + aaaaaaaaa + aaaaaaaaaaa + + /(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + + /(?<=a)b/ + ab + *** Failers + cb + b + + /(?<!c)b/ + ab + b + b + + /(?:..)*a/ + aba + + /(?:..)*?a/ + aba + + /^(?:b|a(?=(.)))*\1/ + abc + + /^(){3,5}/ + abc + + /^(a+)*ax/ + aax + + /^((a|b)+)*ax/ + aax + + /^((a|bc)+)*ax/ + aax + + /(a|x)*ab/ + cab + + /(a)*ab/ + cab + + /(?:(?i)a)b/ + ab + + /((?i)a)b/ + ab + + /(?:(?i)a)b/ + Ab + + /((?i)a)b/ + Ab + + /(?:(?i)a)b/ + *** Failers + cb + aB + + /((?i)a)b/ + + /(?i:a)b/ + ab + + /((?i:a))b/ + ab + + /(?i:a)b/ + Ab + + /((?i:a))b/ + Ab + + /(?i:a)b/ + *** Failers + aB + aB + + /((?i:a))b/ + + /(?:(?-i)a)b/i + ab + + /((?-i)a)b/i + ab + + /(?:(?-i)a)b/i + aB + + /((?-i)a)b/i + aB + + /(?:(?-i)a)b/i + *** Failers + aB + Ab + + /((?-i)a)b/i + + /(?:(?-i)a)b/i + aB + + /((?-i)a)b/i + aB + + /(?:(?-i)a)b/i + *** Failers + Ab + AB + + /((?-i)a)b/i + + /(?-i:a)b/i + ab + + /((?-i:a))b/i + ab + + /(?-i:a)b/i + aB + + /((?-i:a))b/i + aB + + /(?-i:a)b/i + *** Failers + AB + Ab + + /((?-i:a))b/i + + /(?-i:a)b/i + aB + + /((?-i:a))b/i + aB + + /(?-i:a)b/i + *** Failers + Ab + AB + + /((?-i:a))b/i + + /((?-i:a.))b/i + *** Failers + AB + a\nB + + /((?s-i:a.))b/i + a\nB + + /(?:c|d)(?:)(?:a(?:)(?:b)(?:b(?:))(?:b(?:)(?:b)))/ + cabbbb + + /(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/ + caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + + /(ab)\d\1/i + Ab4ab + ab4Ab + + /foo\w*\d{4}baz/ + foobar1234baz + + /x(~~)*(?:(?:F)?)?/ + x~~ + + /^a(?#xxx){3}c/ + aaac + + /^a (?#xxx) (?#yyy) {3}c/x + aaac + + /(?<![cd])b/ + *** Failers + B\nB + dbcb + + /(?<![cd])[ab]/ + dbaacb + + /(?<!(c|d))b/ + + /(?<!(c|d))[ab]/ + dbaacb + + /(?<!cd)[ab]/ + cdaccb + + /^(?:a?b?)*$/ + *** Failers + dbcb + a-- + + /((?s)^a(.))((?m)^b$)/ + a\nb\nc\n + + /((?m)^b$)/ + a\nb\nc\n + + /(?m)^b/ + a\nb\n + + /(?m)^(b)/ + a\nb\n + + /((?m)^b)/ + a\nb\n + + /\n((?m)^b)/ + a\nb\n + + /((?s).)c(?!.)/ + a\nb\nc\n + a\nb\nc\n + + /((?s)b.)c(?!.)/ + a\nb\nc\n + a\nb\nc\n + + /^b/ + + /()^b/ + *** Failers + a\nb\nc\n + a\nb\nc\n + + /((?m)^b)/ + a\nb\nc\n + + /(?(1)a|b)/ + + /(?(1)b|a)/ + a + + /(x)?(?(1)a|b)/ + *** Failers + a + a + + /(x)?(?(1)b|a)/ + a + + /()?(?(1)b|a)/ + a + + /()(?(1)b|a)/ + + /()?(?(1)a|b)/ + a + + /^($)?blah(?(1)($))$/ + (blah) + blah + *** Failers + a + blah) + (blah + + /^($+)?blah(?(1)($))$/ + (blah) + blah + *** Failers + blah) + (blah + + /(?(?!a)a|b)/ + + /(?(?!a)b|a)/ + a + + /(?(?=a)b|a)/ + *** Failers + a + a + + /(?(?=a)a|b)/ + a + + /(?=(a+?))(\1ab)/ + aaab + + /^(?=(a+?))\1ab/ + + /(\w+:)+/ + one: + + /$(?<=^(a))/ + a + + /(?=(a+?))(\1ab)/ + aaab + + /^(?=(a+?))\1ab/ + *** Failers + aaab + aaab + + /([\w:]+::)?(\w+)$/ + abcd + xy:z:::abcd + + /^[^bcd]*(c+)/ + aexycd + + /(a*)b+/ + caab + + /([\w:]+::)?(\w+)$/ + abcd + xy:z:::abcd + *** Failers + abcd: + abcd: + + /^[^bcd]*(c+)/ + aexycd + + /(>a+)ab/ + + /(?>a+)b/ + aaab + + /([[:]+)/ + a:[b]: + + /([[=]+)/ + a=[b]= + + /([[.]+)/ + a.[b]. + + /((?>a+)b)/ + aaab + + /(?>(a+))b/ + aaab + + /((?>[^()]+)|$[^()]*$)+/ + ((abc(ade)ufh()()x + + /a\Z/ + *** Failers + aaab + a\nb\n + + /b\Z/ + a\nb\n + + /b\z/ + + /b\Z/ + a\nb + + /b\z/ + a\nb + *** Failers + + /^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + abc + a-b + 0-9 + a.b + 5.6.7 + the.quick.brown.fox + a100.b200.300c + 12-ab.1245 + ***Failers + \ + .a + -a + a- + a. + a_b + a.- + a.. + ab..bc + the.quick.brown.fox- + the.quick.brown.fox. + the.quick.brown.fox_ + the.quick.brown.fox+ + + /(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + endingwxyz + *** Failers + a rather long string that doesn't end with one of them + + /word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + + /word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + + /(?<=\d{3}(?!999))foo/ + 999foo + 123999foo + *** Failers + 123abcfoo + + /(?<=(?!...999)\d{3})foo/ + 999foo + 123999foo + *** Failers + 123abcfoo + + /(?<=\d{3}(?!999)...)foo/ + 123abcfoo + 123456foo + *** Failers + 123999foo + + /(?<=\d{3}...)(?<!999)foo/ + 123abcfoo + 123456foo + *** Failers + 123999foo + + /<a[\s]+href[\s]*=[\s]* # find <a href= + ([\"\'])? # find single or double quote + (?(1) (.*?)\1 | ([^\s]+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + <a href=\"abcd xyz pqr\" cats + <a href=\'abcd xyz pqr\' cats + + /<a\s+href\s*=\s* # find <a href= + (["'])? # find single or double quote + (?(1) (.*?)\1 | (\S+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + <a href=\"abcd xyz pqr\" cats + <a href = \'abcd xyz pqr\' cats + + /<a\s+href(?>\s*)=(?>\s*) # find <a href= + (["'])? # find single or double quote + (?(1) (.*?)\1 | (\S+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + <a href=\"abcd xyz pqr\" cats + <a href = \'abcd xyz pqr\' cats + + /((Z)+|A)*/ + ZABCDEFG + + /(Z()|A)*/ + ZABCDEFG + + /(Z(())|A)*/ + ZABCDEFG + + /((?>Z)+|A)*/ + ZABCDEFG + + /((?>)+|A)*/ + ZABCDEFG + + /a*/g + abbab + + /^[a-\d]/ + abcde + -things + 0digit + *** Failers + bcdef + + /^[\d-a]/ + abcde + -things + 0digit + *** Failers + bcdef + + / End of testinput3 /

ossp-pkg/pcre/pcre_test.d/testinput4 -> 1.2

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,65 ---- + /^[\w]+/ + *** Failers + �cole + + /^[\w]+/Lfr + �cole + + /^[\w]+/ + *** Failers + �cole + + /^[\W]+/ + �cole + + /^[\W]+/Lfr + *** Failers + �cole + + /[\b]/ + \b + *** Failers + a + + /[\b]/Lfr + \b + *** Failers + a + + /^\w+/ + *** Failers + �cole + + /^\w+/Lfr + �cole + + /(.+)\b(.+)/ + �cole + + /(.+)\b(.+)/Lfr + *** Failers + �cole + + /�cole/i + �cole + *** Failers + �cole + + /�cole/iLfr + �cole + �cole + + /\w/IS + + /\w/ISLfr + + /^[\xc8-\xc9]/iLfr + �cole + �cole + + /^[\xc8-\xc9]/Lfr + �cole + *** Failers + �cole + + / End of testinput4 /

ossp-pkg/pcre/pcre_test.d/testinput5 -> 1.1

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,118 ---- + /-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/ + /-- strings automatically, do not use the \x{} construct except with --/ + /-- patterns that have the /8 option set, and don't use them without! --/ + + /a.b/8 + acb + a\x7fb + a\x{100}b + *** Failers + a\nb + + /a(.{3})b/8 + a\x{4000}xyb + a\x{4000}\x7fyb + a\x{4000}\x{100}yb + *** Failers + a\x{4000}b + ac\ncb + + /a(.*?)(.)/ + a\xc0\x88b + + /a(.*?)(.)/8 + a\x{100}b + + /a(.*)(.)/ + a\xc0\x88b + + /a(.*)(.)/8 + a\x{100}b + + /a(.)(.)/ + a\xc0\x92bcd + + /a(.)(.)/8 + a\x{240}bcd + + /a(.?)(.)/ + a\xc0\x92bcd + + /a(.?)(.)/8 + a\x{240}bcd + + /a(.??)(.)/ + a\xc0\x92bcd + + /a(.??)(.)/8 + a\x{240}bcd + + /a(.{3})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + *** Failers + a\x{1234}b + ac\ncb + + /a(.{3,})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + *** Failers + a\x{1234}b + + /a(.{3,}?)b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + *** Failers + a\x{1234}b + + /a(.{3,5})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + axbxxbcdefghijb + axxxxxbcdefghijb + *** Failers + a\x{1234}b + axxxxxxbcdefghijb + + /a(.{3,5}?)b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + axbxxbcdefghijb + axxxxxbcdefghijb + *** Failers + a\x{1234}b + axxxxxxbcdefghijb + + /^[a\x{c0}]/8 + *** Failers + \x{100} + + /(?<=aXb)cd/8 + aXbcd + + /(?<=a\x{100}b)cd/8 + a\x{100}bcd + + /(?<=a\x{100000}b)cd/8 + a\x{100000}bcd + + /(?:\x{100}){3}b/8 + \x{100}\x{100}\x{100}b + *** Failers + \x{100}\x{100}b + + / End of testinput5 /

ossp-pkg/pcre/pcre_test.d/testinput6 -> 1.1

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,52 ---- + /\x{100}/8DM + + /\x{1000}/8DM + + /\x{10000}/8DM + + /\x{100000}/8DM + + /\x{1000000}/8DM + + /\x{4000000}/8DM + + /\x{7fffFFFF}/8DM + + /[\x{ff}]/8DM + + /[\x{100}]/8DM + + /\x{ffffffff}/8 + + /\x{100000000}/8 + + /^\x{100}a\x{1234}/8 + \x{100}a\x{1234}bcd + + /\x80/8D + + /\xff/8D + + /-- These tests are here rather than in testinput5 because Perl 5.6 has --/ + /-- some problems with UTF-8 support, in the area of \x{..} where the --/ + /-- value is < 255. It grumbles about invalid UTF-8 strings. --/ + + /^[a\x{c0}]b/8 + \x{c0}b + + /^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + + /^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + + /^([a\x{c0}]*)aa/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + + /^([a\x{c0}]*)a\x{c0}/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + + / End of testinput6 /

ossp-pkg/pcre/pcre_test.d/testoutput1 -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,2965 ---- + PCRE version 3.3 01-Aug-2000 + + /the quick brown fox/ + the quick brown fox + 0: the quick brown fox + The quick brown FOX + No match + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? + No match + + /The quick brown fox/i + the quick brown fox + 0: the quick brown fox + The quick brown FOX + 0: The quick brown FOX + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? + 0: THE QUICK BROWN FOX + + /abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + 0: abcd\x09\x0a\x0d\x0c\x07\x1b9;$\?caxyz + + /a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + 0: aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + 0: aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + 0: aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + 0: aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + 0: aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + 0: abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + 0: aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + 0: abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + 0: aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + 0: aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + 0: aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + *** Failers + No match + abxyzpqrrabbxyyyypqAzz + No match + abxyzpqrrrrabbxyyyypqAzz + No match + abxyzpqrrrabxyyyypqAzz + No match + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz + No match + aaaabcxyzzzzpqrrrabbbxyyypqAzz + No match + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz + No match + + /^(abc){1,2}zz/ + abczz + 0: abczz + 1: abc + abcabczz + 0: abcabczz + 1: abc + *** Failers + No match + zz + No match + abcabcabczz + No match + >>abczz + No match + + /^(b+?|a){1,2}?c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: b + bbbc + 0: bbbc + 1: bb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers + No match + aaac + No match + abbbbbbbbbbbac + No match + + /^(b+|a){1,2}c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: bb + bbbc + 0: bbbc + 1: bbb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers + No match + aaac + No match + abbbbbbbbbbbac + No match + + /^(b+|a){1,2}?bc/ + bbc + 0: bbc + 1: b + + /^(b*|ba){1,2}?bc/ + babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers + No match + bababbc + No match + babababc + No match + + /^(ba|b*){1,2}?bc/ + babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers + No match + bababbc + No match + babababc + No match + + /^\ca\cA\c[\c{\c:/ + \x01\x01\e;z + 0: \x01\x01\x1b;z + + /^[ab\]cde]/ + athing + 0: a + bthing + 0: b + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers + No match + fthing + No match + [thing + No match + \\thing + No match + + /^[]cde]/ + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers + No match + athing + No match + fthing + No match + + /^[^ab\]cde]/ + fthing + 0: f + [thing + 0: [ + \\thing + 0: \ + *** Failers + 0: * + athing + No match + bthing + No match + ]thing + No match + cthing + No match + dthing + No match + ething + No match + + /^[^]cde]/ + athing + 0: a + fthing + 0: f + *** Failers + 0: * + ]thing + No match + cthing + No match + dthing + No match + ething + No match + + /^\�/ + � + 0: \x81 + + /^�/ + � + 0: \xff + + /^[0-9]+$/ + 0 + 0: 0 + 1 + 0: 1 + 2 + 0: 2 + 3 + 0: 3 + 4 + 0: 4 + 5 + 0: 5 + 6 + 0: 6 + 7 + 0: 7 + 8 + 0: 8 + 9 + 0: 9 + 10 + 0: 10 + 100 + 0: 100 + *** Failers + No match + abc + No match + + /^.*nter/ + enter + 0: enter + inter + 0: inter + uponter + 0: uponter + + /^xxx[0-9]+$/ + xxx0 + 0: xxx0 + xxx1234 + 0: xxx1234 + *** Failers + No match + xxx + No match + + /^.+[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers + No match + 123 + No match + x1234 + 0: x1234 + + /^.+?[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers + No match + 123 + No match + x1234 + 0: x1234 + + /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + 0: abc!pqr=apquxz.ixr.zzz.ac.uk + 1: abc + 2: pqr + *** Failers + No match + !pqr=apquxz.ixr.zzz.ac.uk + No match + abc!=apquxz.ixr.zzz.ac.uk + No match + abc!pqr=apquxz:ixr.zzz.ac.uk + No match + abc!pqr=apquxz.ixr.zzz.ac.ukk + No match + + /:/ + Well, we need a colon: somewhere + 0: : + *** Fail if we don't + No match + + /([\da-f:]+)$/i + 0abc + 0: 0abc + 1: 0abc + abc + 0: abc + 1: abc + fed + 0: fed + 1: fed + E + 0: E + 1: E + :: + 0: :: + 1: :: + 5f03:12C0::932e + 0: 5f03:12C0::932e + 1: 5f03:12C0::932e + fed def + 0: def + 1: def + Any old stuff + 0: ff + 1: ff + *** Failers + No match + 0zzz + No match + gzzz + No match + fed\x20 + No match + Any old rubbish + No match + + /^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + 0: .1.2.3 + 1: 1 + 2: 2 + 3: 3 + A.12.123.0 + 0: A.12.123.0 + 1: 12 + 2: 123 + 3: 0 + *** Failers + No match + .1.2.3333 + No match + 1.2.3 + No match + 1234.2.3 + No match + + /^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 0: 1 IN SOA non-sp1 non-sp2( + 1: 1 + 2: non-sp1 + 3: non-sp2 + 1 IN SOA non-sp1 non-sp2 ( + 0: 1 IN SOA non-sp1 non-sp2 ( + 1: 1 + 2: non-sp1 + 3: non-sp2 + *** Failers + No match + 1IN SOA non-sp1 non-sp2( + No match + + /^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + 0: a. + Z. + 0: Z. + 2. + 0: 2. + ab-c.pq-r. + 0: ab-c.pq-r. + 1: .pq-r + sxk.zzz.ac.uk. + 0: sxk.zzz.ac.uk. + 1: .uk + x-.y-. + 0: x-.y-. + 1: .y- + *** Failers + No match + -abc.peq. + No match + + /^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + 0: *.a + *.b0-a + 0: *.b0-a + 1: 0-a + *.c3-b.c + 0: *.c3-b.c + 1: 3-b + 2: .c + *.c-a.b-c + 0: *.c-a.b-c + 1: -a + 2: .b-c + 3: -c + *** Failers + No match + *.0 + No match + *.a- + No match + *.a-b.c- + No match + *.c-a.0-c + No match + + /^(?=ab(de))(abd)(e)/ + abde + 0: abde + 1: de + 2: abd + 3: e + + /^(?!(ab)de|x)(abd)(f)/ + abdf + 0: abdf + 1: <unset> + 2: abd + 3: f + + /^(?=(ab(cd)))(ab)/ + abcd + 0: ab + 1: abcd + 2: cd + 3: ab + + /^[\da-f](\.[\da-f])*$/i + a.b.c.d + 0: a.b.c.d + 1: .d + A.B.C.D + 0: A.B.C.D + 1: .D + a.b.c.1.2.3.C + 0: a.b.c.1.2.3.C + 1: .C + + /^\".*\"\s*(;.*)?$/ + \"1234\" + 0: "1234" + \"abcd\" ; + 0: "abcd" ; + 1: ; + \"\" ; rhubarb + 0: "" ; rhubarb + 1: ; rhubarb + *** Failers + No match + \"1234\" : things + No match + + /^$/ + \ + 0: + *** Failers + No match + + / ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + 0: ab c + *** Failers + No match + abc + No match + ab cde + No match + + /(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/ + ab c + 0: ab c + *** Failers + No match + abc + No match + ab cde + No match + + /^ a\ b[c ]d $/x + a bcd + 0: a bcd + a b d + 0: a b d + *** Failers + No match + abcd + No match + ab d + No match + + /^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: abc + 2: bc + 3: c + 4: def + 5: ef + 6: f + 7: hij + 8: ij + 9: j + 10: klm + 11: lm + 12: m + + /^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: bc + 2: c + 3: ef + 4: f + 5: ij + 6: j + 7: lm + 8: m + + /^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + 0: a+ Z0+\x08\x0a\x1d\x12 + + /^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + 0: .^$(*+)|{?,?} + + /^a*\w/ + z + 0: z + az + 0: az + aaaz + 0: aaaz + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + a+ + 0: a + aa+ + 0: aa + + /^a*?\w/ + z + 0: z + az + 0: a + aaaz + 0: a + a + 0: a + aa + 0: a + aaaa + 0: a + a+ + 0: a + aa+ + 0: a + + /^a+\w/ + az + 0: az + aaaz + 0: aaaz + aa + 0: aa + aaaa + 0: aaaa + aa+ + 0: aa + + /^a+?\w/ + az + 0: az + aaaz + 0: aa + aa + 0: aa + aaaa + 0: aa + aa+ + 0: aa + + /^\d{8}\w{2,}/ + 1234567890 + 0: 1234567890 + 12345678ab + 0: 12345678ab + 12345678__ + 0: 12345678__ + *** Failers + No match + 1234567 + No match + + /^[aeiou\d]{4,5}$/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 12345 + aaaaa + 0: aaaaa + *** Failers + No match + 123456 + No match + + /^[aeiou\d]{4,5}?/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 1234 + aaaaa + 0: aaaa + 123456 + 0: 1234 + + /\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + 0: abc=abcabc + 1: abc + 2: abc + def=defdefdef + 0: def=defdefdef + 1: def + 2: def + *** Failers + No match + abc=defdef + No match + + /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + 0: abcdefghijkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i + 10: j + 11: k + 12: cd + abcdefghijkkkkcda2 + 0: abcdefghijkkkkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i + 10: j + 11: k + 12: cd + + /(cat(a(ract|tonic)|erpillar)) \1()2(3)/ + cataract cataract23 + 0: cataract cataract23 + 1: cataract + 2: aract + 3: ract + 4: + 5: 3 + catatonic catatonic23 + 0: catatonic catatonic23 + 1: catatonic + 2: atonic + 3: tonic + 4: + 5: 3 + caterpillar caterpillar23 + 0: caterpillar caterpillar23 + 1: caterpillar + 2: erpillar + 3: <unset> + 4: + 5: 3 + + + /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: abcd + + /^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: Sep + From abcd Mon Sep 1 12:33:02 1997 + 0: From abcd Mon Sep 1 12:33 + 1: Sep + *** Failers + No match + From abcd Sep 01 12:33:02 1997 + No match + + /^12.34/s + 12\n34 + 0: 12\x0a34 + 12\r34 + 0: 12\x0d34 + + /\w+(?=\t)/ + the quick brown\t fox + 0: brown + + /foo(?!bar)(.*)/ + foobar is foolish see? + 0: foolish see? + 1: lish see? + + /(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + 0: rowbar etc + 1: etc + barrel + 0: barrel + 1: rel + 2barrel + 0: 2barrel + 1: rel + A barrel + 0: A barrel + 1: rel + + /^(\D*)(?=\d)(?!123)/ + abc456 + 0: abc + 1: abc + *** Failers + No match + abc123 + No match + + /^1234(?# test newlines + inside)/ + 1234 + 0: 1234 + + /^1234 #comment in extended re + /x + 1234 + 0: 1234 + + /#rhubarb + abcd/x + abcd + 0: abcd + + /^abcd#rhubarb/x + abcd + 0: abcd + + /^(a)\1{2,3}(.)/ + aaab + 0: aaab + 1: a + 2: b + aaaab + 0: aaaab + 1: a + 2: b + aaaaab + 0: aaaaa + 1: a + 2: a + aaaaaab + 0: aaaaa + 1: a + 2: a + + /(?!^)abc/ + the abc + 0: abc + *** Failers + No match + abc + No match + + /(?=^)abc/ + abc + 0: abc + *** Failers + No match + the abc + No match + + /^[ab]{1,3}(ab*|b)/ + aabbbbb + 0: aabb + 1: b + + /^[ab]{1,3}?(ab*|b)/ + aabbbbb + 0: aabbbbb + 1: abbbbb + + /^[ab]{1,3}?(ab*?|b)/ + aabbbbb + 0: aa + 1: a + + /^[ab]{1,3}(ab*?|b)/ + aabbbbb + 0: aabb + 1: b + + / (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # optional leading comment + (?: (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # initial word + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) )* # further okay, if led by a period + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + # address + | # or + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # one word, optionally followed by.... + (?: + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... + $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) | # comments, or... + + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + # quoted strings + )* + < (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # leading < + (?: @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* , (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + )* # further okay, if led by comma + : # closing colon + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* )? # optional route + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) # initial word + (?: (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + " (?: # opening quote... + [^\\\x80-\xff\n\015"] # Anything except backslash and quote + | # or + \\ [^\x80-\xff] # Escaped something (something != CR) + )* " # closing quote + ) )* # further okay, if led by a period + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* @ (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # initial subdomain + (?: # + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* \. # if led by a period... + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) # ...further okay + )* + # address spec + (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* > # trailing > + # name and address + ) (?: [\040\t] | $ + (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* $ )* + \) )* # optional trailing comment + /x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> (a comment) + A. Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers + No match + The quick brown fox + No match + + /[\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional leading comment + (?: + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # additional words + )* + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + # address + | # or + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + # leading word + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces + (?: + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + | + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + ) # "special" comment or quoted string + [^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" + )* + < + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # < + (?: + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + (?: , + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + )* # additional domains + : + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )? # optional route + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + # Atom + | # or + " # " + [^\\\x80-\xff\n\015"] * # normal + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* + " # " + # Quoted string + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # additional words + )* + @ + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + (?: + \. + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + (?: + [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... + (?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom + | + \[ # [ + (?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff + \] # ] + ) + [\040\t]* # Nab whitespace. + (?: + $ # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: # ( + (?: \\ [^\x80-\xff] | + \( # ( + [^\\\x80-\xff\n\015()] * # normal* + (?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* + $ # ) + ) # special + [^\\\x80-\xff\n\015()] * # normal* + )* # )* + \) # ) + [\040\t]* )* # If comment found, allow more spaces. + # optional trailing comments + )* + # address spec + > # > + # name and address + ) + /x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> + A. Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers + No match + The quick brown fox + No match + + /abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + 0: abc\x00def\x00pqr\x00xyz\x000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + 0: abc\x00def\x00pqr\x00xyz\x000AB + + /abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + 0: abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + 0: abc\x0def\x00pqr\x000xyz\x0000AB + + /^[\000-\037]/ + \0A + 0: \x00 + \01B + 0: \x01 + \037C + 0: \x1f + + /\0*/ + \0\0\0\0 + 0: \x00\x00\x00\x00 + + /A\x0{2,3}Z/ + The A\x0\x0Z + 0: A\x00\x00Z + An A\0\x0\0Z + 0: A\x00\x00\x00Z + *** Failers + No match + A\0Z + No match + A\0\x0\0\x0Z + No match + + /^(cow|)\1(bell)/ + cowcowbell + 0: cowcowbell + 1: cow + 2: bell + bell + 0: bell + 1: + 2: bell + *** Failers + No match + cowbell + No match + + /^\s/ + \040abc + 0: + \x0cabc + 0: \x0c + \nabc + 0: \x0a + \rabc + 0: \x0d + \tabc + 0: \x09 + *** Failers + No match + abc + No match + + /^a b + c/x + abc + 0: abc + + /^(a|)\1*b/ + ab + 0: ab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers + No match + acb + No match + + /^(a|)\1+b/ + aab + 0: aab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers + No match + ab + No match + + /^(a|)\1?b/ + ab + 0: ab + 1: a + aab + 0: aab + 1: a + b + 0: b + 1: + *** Failers + No match + acb + No match + + /^(a|)\1{2}b/ + aaab + 0: aaab + 1: a + b + 0: b + 1: + *** Failers + No match + ab + No match + aab + No match + aaaab + No match + + /^(a|)\1{2,3}b/ + aaab + 0: aaab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers + No match + ab + No match + aab + No match + aaaaab + No match + + /ab{1,3}bc/ + abbbbc + 0: abbbbc + abbbc + 0: abbbc + abbc + 0: abbc + *** Failers + No match + abc + No match + abbbbbc + No match + + /([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + + /([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + + /([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + + /^[W-c]+$/ + WXY_^abc + 0: WXY_^abc + ***Failers + No match + wxy + No match + + /^[W-c]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + + /^[\x3f-\x5F]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + + /^abc$/m + abc + 0: abc + qqq\nabc + 0: abc + abc\nzzz + 0: abc + qqq\nabc\nzzz + 0: abc + + /^abc$/ + abc + 0: abc + *** Failers + No match + qqq\nabc + No match + abc\nzzz + No match + qqq\nabc\nzzz + No match + + /\Aabc\Z/m + abc + 0: abc + abc\n + 0: abc + *** Failers + No match + qqq\nabc + No match + abc\nzzz + No match + qqq\nabc\nzzz + No match + + /\A(.)*\Z/s + abc\ndef + 0: abc\x0adef + 1: f + + /\A(.)*\Z/m + *** Failers + 0: *** Failers + 1: s + abc\ndef + No match + + /(?:b)|(?::+)/ + b::c + 0: b + c::b + 0: :: + + /[-az]+/ + az- + 0: az- + *** Failers + 0: a + b + No match + + /[az-]+/ + za- + 0: za- + *** Failers + 0: a + b + No match + + /[a\-z]+/ + a-z + 0: a-z + *** Failers + 0: a + b + No match + + /[a-z]+/ + abcdxyz + 0: abcdxyz + + /[\d-]+/ + 12-34 + 0: 12-34 + *** Failers + No match + aaa + No match + + /[\d-z]+/ + 12-34z + 0: 12-34z + *** Failers + No match + aaa + No match + + /\x5c/ + \\ + 0: \ + + /\x20Z/ + the Zoo + 0: Z + *** Failers + No match + Zulu + No match + + /(abc)\1/i + abcabc + 0: abcabc + 1: abc + ABCabc + 0: ABCabc + 1: ABC + abcABC + 0: abcABC + 1: abc + + /(main(O)?)+/ + mainmain + 0: mainmain + 1: main + mainOmain + 0: mainOmain + 1: main + 2: O + + /ab{3cd/ + ab{3cd + 0: ab{3cd + + /ab{3,cd/ + ab{3,cd + 0: ab{3,cd + + /ab{3,4a}cd/ + ab{3,4a}cd + 0: ab{3,4a}cd + + /{4,5a}bc/ + {4,5a}bc + 0: {4,5a}bc + + /^a.b/ + a\rb + 0: a\x0db + *** Failers + No match + a\nb + No match + + /abc$/ + abc + 0: abc + abc\n + 0: abc + *** Failers + No match + abc\ndef + No match + + /(abc)\123/ + abc\x53 + 0: abcS + 1: abc + + /(abc)\223/ + abc\x93 + 0: abc\x93 + 1: abc + + /(abc)\323/ + abc\xd3 + 0: abc\xd3 + 1: abc + + /(abc)\500/ + abc\x40 + 0: abc@ + 1: abc + abc\100 + 0: abc@ + 1: abc + + /(abc)\5000/ + abc\x400 + 0: abc@0 + 1: abc + abc\x40\x30 + 0: abc@0 + 1: abc + abc\1000 + 0: abc@0 + 1: abc + abc\100\x30 + 0: abc@0 + 1: abc + abc\100\060 + 0: abc@0 + 1: abc + abc\100\60 + 0: abc@0 + 1: abc + + /abc\81/ + abc\081 + 0: abc\x0081 + abc\0\x38\x31 + 0: abc\x0081 + + /abc\91/ + abc\091 + 0: abc\x0091 + abc\0\x39\x31 + 0: abc\x0091 + + /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + 0: abcdefghijkllS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i + 10: j + 11: k + 12: l + + /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + 0: abcdefghijk\x0aS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i + 10: j + 11: k + + /ab\gdef/ + abgdef + 0: abgdef + + /a{0}bc/ + bc + 0: bc + + /(a|(bc)){0,0}?xyz/ + xyz + 0: xyz + + /abc[\10]de/ + abc\010de + 0: abc\x08de + + /abc[\1]de/ + abc\1de + 0: abc\x01de + + /(abc)[\1]de/ + abc\1de + 0: abc\x01de + 1: abc + + /a.b(?s)/ + a\nb + 0: a\x0ab + + /^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + 0: baNOTcccc + 1: b + 2: a + 3: NOT + 4: cccc + baNOTcccd + 0: baNOTccc + 1: b + 2: a + 3: NOT + 4: ccc + baNOTccd + 0: baNOTcc + 1: b + 2: a + 3: NO + 4: Tcc + bacccd + 0: baccc + 1: b + 2: a + 3: + 4: ccc + *** Failers + 0: *** Failers + 1: * + 2: * + 3: * Fail + 4: ers + anything + No match + b\bc + No match + baccd + No match + + /[^a]/ + Abc + 0: A + + /[^a]/i + Abc + 0: b + + /[^a]+/ + AAAaAbc + 0: AAA + + /[^a]+/i + AAAaAbc + 0: bc + + /[^a]+/ + bbb\nccc + 0: bbb\x0accc + + /[^k]$/ + abc + 0: c + *** Failers + 0: s + abk + No match + + /[^k]{2,3}$/ + abc + 0: abc + kbc + 0: bc + kabc + 0: abc + *** Failers + 0: ers + abk + No match + akb + No match + akk + No match + + /^\d{8,}\@.+[^k]$/ + 12345678\@a.b.c.d + 0: 12345678@a.b.c.d + 123456789\@x.y.z + 0: 123456789@x.y.z + *** Failers + No match + 12345678\@x.y.uk + No match + 1234567\@a.b.c.d + No match + + /(a)\1{8,}/ + aaaaaaaaa + 0: aaaaaaaaa + 1: a + aaaaaaaaaa + 0: aaaaaaaaaa + 1: a + *** Failers + No match + aaaaaaa + No match + + /[^a]/ + aaaabcd + 0: b + aaAabcd + 0: A + + /[^a]/i + aaaabcd + 0: b + aaAabcd + 0: b + + /[^az]/ + aaaabcd + 0: b + aaAabcd + 0: A + + /[^az]/i + aaaabcd + 0: b + aaAabcd + 0: b + + /\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377/ + \000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377 + 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff + + /P[^*]TAIRE[^*]{1,6}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + 0: PSTAIREISLL + + /P[^*]TAIRE[^*]{1,}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + 0: PSTAIREISLL + + /(\.\d\d[1-9]?)\d+/ + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + 1.235 + 0: .235 + 1: .23 + + /(\.\d\d((?=0)|\d(?=\d)))/ + 1.230003938 + 0: .23 + 1: .23 + 2: + 1.875000282 + 0: .875 + 1: .875 + 2: 5 + *** Failers + No match + 1.235 + No match + + /a(?)b/ + ab + 0: ab + + /\b(foo)\s+(\w+)/i + Food is on the foo table + 0: foo table + 1: foo + 2: table + + /foo(.*)bar/ + The food is under the bar in the barn. + 0: food is under the bar in the bar + 1: d is under the bar in the + + /foo(.*?)bar/ + The food is under the bar in the barn. + 0: food is under the bar + 1: d is under the + + /(.*)(\d*)/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 53147 + 2: + + /(.*)(\d+)/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 5314 + 2: 7 + + /(.*?)(\d*)/ + I have 2 numbers: 53147 + 0: + 1: + 2: + + /(.*?)(\d+)/ + I have 2 numbers: 53147 + 0: I have 2 + 1: I have + 2: 2 + + /(.*)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 5314 + 2: 7 + + /(.*?)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + + /(.*)\b(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + + /(.*\D)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + + /^\D*(?!123)/ + ABC123 + 0: AB + + /^(\D*)(?=\d)(?!123)/ + ABC445 + 0: ABC + 1: ABC + *** Failers + No match + ABC123 + No match + + /^[W-]46]/ + W46]789 + 0: W46] + -46]789 + 0: -46] + *** Failers + No match + Wall + No match + Zebra + No match + 42 + No match + [abcd] + No match + ]abcd[ + No match + + /^[W-\]46]/ + W46]789 + 0: W + Wall + 0: W + Zebra + 0: Z + Xylophone + 0: X + 42 + 0: 4 + [abcd] + 0: [ + ]abcd[ + 0: ] + \\backslash + 0: \ + *** Failers + No match + -46]789 + No match + well + No match + + /\d\d\/\d\d\/\d\d\d\d/ + 01/01/2000 + 0: 01/01/2000 + + /word (?:[a-zA-Z0-9]+ ){0,10}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + 0: word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + No match + + /word (?:[a-zA-Z0-9]+ ){0,300}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + No match + + /^(a){0,0}/ + bcd + 0: + abc + 0: + aab + 0: + + /^(a){0,1}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: a + 1: a + + /^(a){0,2}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + + /^(a){0,3}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + + /^(a){0,}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + aaaaaaaa + 0: aaaaaaaa + 1: a + + /^(a){1,1}/ + bcd + No match + abc + 0: a + 1: a + aab + 0: a + 1: a + + /^(a){1,2}/ + bcd + No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + + /^(a){1,3}/ + bcd + No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + + /^(a){1,}/ + bcd + No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + aaaaaaaa + 0: aaaaaaaa + 1: a + + /.*\.gif/ + borfle\nbib.gif\nno + 0: bib.gif + + /.{0,}\.gif/ + borfle\nbib.gif\nno + 0: bib.gif + + /.*\.gif/m + borfle\nbib.gif\nno + 0: bib.gif + + /.*\.gif/s + borfle\nbib.gif\nno + 0: borfle\x0abib.gif + + /.*\.gif/ms + borfle\nbib.gif\nno + 0: borfle\x0abib.gif + + /.*$/ + borfle\nbib.gif\nno + 0: no + + /.*$/m + borfle\nbib.gif\nno + 0: borfle + + /.*$/s + borfle\nbib.gif\nno + 0: borfle\x0abib.gif\x0ano + + /.*$/ms + borfle\nbib.gif\nno + 0: borfle\x0abib.gif\x0ano + + /.*$/ + borfle\nbib.gif\nno\n + 0: no + + /.*$/m + borfle\nbib.gif\nno\n + 0: borfle + + /.*$/s + borfle\nbib.gif\nno\n + 0: borfle\x0abib.gif\x0ano\x0a + + /.*$/ms + borfle\nbib.gif\nno\n + 0: borfle\x0abib.gif\x0ano\x0a + + /(.*X|^B)/ + abcde\n1234Xyz + 0: 1234X + 1: 1234X + BarFoo + 0: B + 1: B + *** Failers + No match + abcde\nBar + No match + + /(.*X|^B)/m + abcde\n1234Xyz + 0: 1234X + 1: 1234X + BarFoo + 0: B + 1: B + abcde\nBar + 0: B + 1: B + + /(.*X|^B)/s + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + *** Failers + No match + abcde\nBar + No match + + /(.*X|^B)/ms + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + abcde\nBar + 0: B + 1: B + + /(?s)(.*X|^B)/ + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + *** Failers + No match + abcde\nBar + No match + + /(?s:.*X|^B)/ + abcde\n1234Xyz + 0: abcde\x0a1234X + BarFoo + 0: B + *** Failers + No match + abcde\nBar + No match + + /^.*B/ + **** Failers + No match + abc\nB + No match + + /(?s)^.*B/ + abc\nB + 0: abc\x0aB + + /(?m)^.*B/ + abc\nB + 0: B + + /(?ms)^.*B/ + abc\nB + 0: abc\x0aB + + /(?ms)^B/ + abc\nB + 0: B + + /(?s)B$/ + B\n + 0: B + + /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ + 123456654321 + 0: 123456654321 + + /^\d\d\d\d\d\d\d\d\d\d\d\d/ + 123456654321 + 0: 123456654321 + + /^[\d][\d][\d][\d][\d][\d][\d][\d][\d][\d][\d][\d]/ + 123456654321 + 0: 123456654321 + + /^[abc]{12}/ + abcabcabcabc + 0: abcabcabcabc + + /^[a-c]{12}/ + abcabcabcabc + 0: abcabcabcabc + + /^(a|b|c){12}/ + abcabcabcabc + 0: abcabcabcabc + 1: c + + /^[abcdefghijklmnopqrstuvwxy0123456789]/ + n + 0: n + *** Failers + No match + z + No match + + /abcde{0,0}/ + abcd + 0: abcd + *** Failers + No match + abce + No match + + /ab[cd]{0,0}e/ + abe + 0: abe + *** Failers + No match + abcde + No match + + /ab(c){0,0}d/ + abd + 0: abd + *** Failers + No match + abcd + No match + + /a(b*)/ + a + 0: a + 1: + ab + 0: ab + 1: b + abbbb + 0: abbbb + 1: bbbb + *** Failers + 0: a + 1: + bbbbb + No match + + /ab\d{0}e/ + abe + 0: abe + *** Failers + No match + ab1e + No match + + /"([^\\"]+|\\.)*"/ + the \"quick\" brown fox + 0: "quick" + 1: quick + \"the \\\"quick\\\" brown fox\" + 0: "the \"quick\" brown fox" + 1: brown fox + + /.*?/g+ + abc + 0: + 0+ abc + 0: a + 0+ bc + 0: + 0+ bc + 0: b + 0+ c + 0: + 0+ c + 0: c + 0+ + 0: + 0+ + + /\b/g+ + abc + 0: + 0+ abc + 0: + 0+ + + /\b/+g + abc + 0: + 0+ abc + 0: + 0+ + + //g + abc + 0: + 0: + 0: + 0: + + /<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)(( ([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is + <TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor (N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR> + 0: <TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor (N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR> + 1: BGCOLOR='#DBE9E9' + 2: align=left valign=top + 3: 43. + 4: <a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor (N-1286) + 5: + 6: + 7: <unset> + 8: align=left valign=top + 9: Lega lstaff.com + 10: align=left valign=top + 11: CA - Statewide + + /a[^a]b/ + acb + 0: acb + a\nb + 0: a\x0ab + + /a.b/ + acb + 0: acb + *** Failers + No match + a\nb + No match + + /a[^a]b/s + acb + 0: acb + a\nb + 0: a\x0ab + + /a.b/s + acb + 0: acb + a\nb + 0: a\x0ab + + / End of testinput1 / +

ossp-pkg/pcre/pcre_test.d/testoutput2 -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,2072 ---- + PCRE version 3.3 01-Aug-2000 + + /(a)b|/ + Capturing subpattern count = 1 + No options + No first char + No need char + + /abc/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'c' + abc + 0: abc + defabc + 0: abc + \Aabc + 0: abc + *** Failers + No match + \Adefabc + No match + ABC + No match + + /^abc/ + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 'c' + abc + 0: abc + \Aabc + 0: abc + *** Failers + No match + defabc + No match + \Adefabc + No match + + /a+bc/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'c' + + /a*bc/ + Capturing subpattern count = 0 + No options + No first char + Need char = 'c' + + /a{3}bc/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'c' + + /(abc|a+z)/ + Capturing subpattern count = 1 + No options + First char = 'a' + No need char + + /^abc$/ + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 'c' + abc + 0: abc + *** Failers + No match + def\nabc + No match + + /ab\gdef/X + Failed: unrecognized character follows \ at offset 3 + + /(?X)ab\gdef/X + Failed: unrecognized character follows \ at offset 7 + + /x{5,4}/ + Failed: numbers out of order in {} quantifier at offset 5 + + /z{65536}/ + Failed: number too big in {} quantifier at offset 7 + + /[abcd/ + Failed: missing terminating ] for character class at offset 5 + + /[\B]/ + Failed: invalid escape sequence in character class at offset 2 + + /[z-a]/ + Failed: range out of order in character class at offset 3 + + /^*/ + Failed: nothing to repeat at offset 1 + + /(abc/ + Failed: missing ) at offset 4 + + /(?# abc/ + Failed: missing ) after comment at offset 7 + + /(?z)abc/ + Failed: unrecognized character after (? at offset 2 + + /.*b/ + Capturing subpattern count = 0 + No options + First char at start or follows \n + Need char = 'b' + + /.*?b/ + Capturing subpattern count = 0 + No options + First char at start or follows \n + Need char = 'b' + + /cat|dog|elephant/ + Capturing subpattern count = 0 + No options + No first char + No need char + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + + /cat|dog|elephant/S + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: c d e + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + + /cat|dog|elephant/iS + Capturing subpattern count = 0 + Options: caseless + No first char + No need char + Starting character set: C D E c d e + this sentence eventually mentions a CAT cat + 0: CAT + this sentences rambles on and on for a while to elephant ElePhant + 0: elephant + + /a|[bcd]/S + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: a b c d + + /(a|[^\dZ])/S + Capturing subpattern count = 1 + No options + No first char + No need char + Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a + \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 + \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = > + ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y [ \ ] ^ _ ` a b c d + e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80 \x81 \x82 \x83 + \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f \x90 \x91 \x92 + \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e \x9f \xa0 \xa1 + \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae \xaf \xb0 + \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd \xbe \xbf + \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce + \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd + \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec + \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb + \xfc \xfd \xfe \xff + + /(a|b)*[\s]/S + Capturing subpattern count = 1 + No options + No first char + No need char + Starting character set: \x09 \x0a \x0b \x0c \x0d \x20 a b + + /(ab\2)/ + Failed: back reference to non-existent subpattern at offset 6 + + /{4,5}abc/ + Failed: nothing to repeat at offset 4 + + /(a)(b)(c)\2/ + Capturing subpattern count = 3 + Max back reference = 2 + No options + First char = 'a' + Need char = 'c' + abcb + 0: abcb + 1: a + 2: b + 3: c + \O0abcb + Matched, but too many substrings + \O3abcb + Matched, but too many substrings + 0: abcb + \O6abcb + Matched, but too many substrings + 0: abcb + 1: a + \O9abcb + Matched, but too many substrings + 0: abcb + 1: a + 2: b + \O12abcb + 0: abcb + 1: a + 2: b + 3: c + + /(a)bc|(a)(b)\2/ + Capturing subpattern count = 3 + Max back reference = 2 + No options + First char = 'a' + No need char + abc + 0: abc + 1: a + \O0abc + Matched, but too many substrings + \O3abc + Matched, but too many substrings + 0: abc + \O6abc + 0: abc + 1: a + aba + 0: aba + 1: <unset> + 2: a + 3: b + \O0aba + Matched, but too many substrings + \O3aba + Matched, but too many substrings + 0: aba + \O6aba + Matched, but too many substrings + 0: aba + 1: <unset> + \O9aba + Matched, but too many substrings + 0: aba + 1: <unset> + 2: a + \O12aba + 0: aba + 1: <unset> + 2: a + 3: b + + /abc$/E + Capturing subpattern count = 0 + Options: dollar_endonly + First char = 'a' + Need char = 'c' + abc + 0: abc + *** Failers + No match + abc\n + No match + abc\ndef + No match + + /(a)(b)(c)(d)(e)\6/ + Failed: back reference to non-existent subpattern at offset 17 + + /the quick brown fox/ + Capturing subpattern count = 0 + No options + First char = 't' + Need char = 'x' + the quick brown fox + 0: the quick brown fox + this is a line with the quick brown fox + 0: the quick brown fox + + /the quick brown fox/A + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 'x' + the quick brown fox + 0: the quick brown fox + *** Failers + No match + this is a line with the quick brown fox + No match + + /ab(?z)cd/ + Failed: unrecognized character after (? at offset 4 + + /^abc|def/ + Capturing subpattern count = 0 + No options + No first char + No need char + abcdef + 0: abc + abcdef\B + 0: def + + /.*((abc)$|(def))/ + Capturing subpattern count = 3 + No options + First char at start or follows \n + No need char + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 2: <unset> + 3: def + + /abc/P + abc + 0: abc + *** Failers + No match: POSIX code 17: match failed + + /^abc|def/P + abcdef + 0: abc + abcdef\B + 0: def + + /.*((abc)$|(def))/P + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 3: def + + /the quick brown fox/P + the quick brown fox + 0: the quick brown fox + *** Failers + No match: POSIX code 17: match failed + The Quick Brown Fox + No match: POSIX code 17: match failed + + /the quick brown fox/Pi + the quick brown fox + 0: the quick brown fox + The Quick Brown Fox + 0: The Quick Brown Fox + + /abc.def/P + *** Failers + No match: POSIX code 17: match failed + abc\ndef + No match: POSIX code 17: match failed + + /abc$/P + abc + 0: abc + abc\n + 0: abc + + /(abc)\2/P + Failed: POSIX code 15: bad back reference at offset 7 + + /(abc\1)/P + abc + No match: POSIX code 17: match failed + + /)/ + Failed: unmatched parentheses at offset 0 + + /a[]b/ + Failed: missing terminating ] for character class at offset 4 + + /[^aeiou ]{3,}/ + Capturing subpattern count = 0 + No options + No first char + No need char + co-processors, and for + 0: -pr + + /<.*>/ + Capturing subpattern count = 0 + No options + First char = '<' + Need char = '>' + abc<def>ghi<klm>nop + 0: <def>ghi<klm> + + /<.*?>/ + Capturing subpattern count = 0 + No options + First char = '<' + Need char = '>' + abc<def>ghi<klm>nop + 0: <def> + + /<.*>/U + Capturing subpattern count = 0 + Options: ungreedy + First char = '<' + Need char = '>' + abc<def>ghi<klm>nop + 0: <def> + + /<.*>(?U)/ + Capturing subpattern count = 0 + Options: ungreedy + First char = '<' + Need char = '>' + abc<def>ghi<klm>nop + 0: <def> + + /<.*?>/U + Capturing subpattern count = 0 + Options: ungreedy + First char = '<' + Need char = '>' + abc<def>ghi<klm>nop + 0: <def>ghi<klm> + + /={3,}/U + Capturing subpattern count = 0 + Options: ungreedy + First char = '=' + Need char = '=' + abc========def + 0: === + + /(?U)={3,}?/ + Capturing subpattern count = 0 + Options: ungreedy + First char = '=' + Need char = '=' + abc========def + 0: ======== + + /(?<!bar|cattle)foo/ + Capturing subpattern count = 0 + No options + First char = 'f' + Need char = 'o' + foo + 0: foo + catfoo + 0: foo + *** Failers + No match + the barfoo + No match + and cattlefoo + No match + + /(?<=a+)b/ + Failed: lookbehind assertion is not fixed length at offset 6 + + /(?<=aaa|b{0,3})b/ + Failed: lookbehind assertion is not fixed length at offset 14 + + /(?<!(foo)a\1)bar/ + Failed: lookbehind assertion is not fixed length at offset 12 + + /(?i)abc/ + Capturing subpattern count = 0 + Options: caseless + First char = 'a' + Need char = 'c' + + /(a|(?m)a)/ + Capturing subpattern count = 1 + No options + First char = 'a' + No need char + + /(?i)^1234/ + Capturing subpattern count = 0 + Options: anchored caseless + No first char + Need char = '4' + + /(^b|(?i)^d)/ + Capturing subpattern count = 1 + Options: anchored + Case state changes + No first char + No need char + + /(?s).*/ + Capturing subpattern count = 0 + Options: anchored dotall + No first char + No need char + + /[abcd]/S + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: a b c d + + /(?i)[abcd]/S + Capturing subpattern count = 0 + Options: caseless + No first char + No need char + Starting character set: A B C D a b c d + + /(?m)[xy]|(b|c)/S + Capturing subpattern count = 1 + Options: multiline + No first char + No need char + Starting character set: b c x y + + /(^a|^b)/m + Capturing subpattern count = 1 + Options: multiline + First char at start or follows \n + No need char + + /(?i)(^a|^b)/m + Capturing subpattern count = 1 + Options: caseless multiline + First char at start or follows \n + No need char + + /(a)(?(1)a|b|c)/ + Failed: conditional group contains more than two branches at offset 13 + + /(?(?=a)a|b|c)/ + Failed: conditional group contains more than two branches at offset 12 + + /(?(1a)/ + Failed: malformed number after (?( at offset 4 + + /(?(?i))/ + Failed: assertion expected after (?( at offset 3 + + /(?(abc))/ + Failed: assertion expected after (?( at offset 3 + + /(?(?<ab))/ + Failed: unrecognized character after (?< at offset 2 + + /((?s)blah)\s+\1/ + Capturing subpattern count = 1 + Max back reference = 1 + No options + First char = 'b' + Need char = 'h' + + /((?i)blah)\s+\1/ + Capturing subpattern count = 1 + Max back reference = 1 + No options + Case state changes + No first char + Need char = 'h' + + /((?i)b)/DS + ------------------------------------------------------------------ + 0 16 Bra 0 + 3 8 Bra 1 + 6 01 Opt + 8 1 b + 11 8 Ket + 14 00 Opt + 16 16 Ket + 19 End + ------------------------------------------------------------------ + Capturing subpattern count = 1 + No options + Case state changes + No first char + Need char = 'b' + Starting character set: B b + + /(a*b|(?i:c*(?-i)d))/S + Capturing subpattern count = 1 + No options + Case state changes + No first char + No need char + Starting character set: C a b c d + + /a$/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + a + 0: a + a\n + 0: a + *** Failers + No match + \Za + No match + \Za\n + No match + + /a$/m + Capturing subpattern count = 0 + Options: multiline + First char = 'a' + No need char + a + 0: a + a\n + 0: a + \Za\n + 0: a + *** Failers + No match + \Za + No match + + /\Aabc/m + Capturing subpattern count = 0 + Options: anchored multiline + No first char + Need char = 'c' + + /^abc/m + Capturing subpattern count = 0 + Options: multiline + First char at start or follows \n + Need char = 'c' + + /^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/ + Capturing subpattern count = 5 + Options: anchored + No first char + Need char = 'a' + aaaaabbbbbcccccdef + 0: aaaaabbbbbcccccdef + 1: aaaaabbbbbcccccdef + 2: aaaaa + 3: b + 4: bbbbccccc + 5: def + + /(?<=foo)[ab]/S + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: a b + + /(?<!foo)(alpha|omega)/S + Capturing subpattern count = 1 + No options + No first char + Need char = 'a' + Starting character set: a o + + /(?!alphabet)[ab]/S + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: a b + + /(?<=foo\n)^bar/m + Capturing subpattern count = 0 + Options: multiline + First char at start or follows \n + Need char = 'r' + + /(?>^abc)/m + Capturing subpattern count = 0 + Options: multiline + First char at start or follows \n + Need char = 'c' + abc + 0: abc + def\nabc + 0: abc + *** Failers + No match + defabc + No match + + /(?<=ab(c+)d)ef/ + Failed: lookbehind assertion is not fixed length at offset 11 + + /(?<=ab(?<=c+)d)ef/ + Failed: lookbehind assertion is not fixed length at offset 12 + + /(?<=ab(c|de)f)g/ + Failed: lookbehind assertion is not fixed length at offset 13 + + /The next three are in testinput2 because they have variable length branches/ + Capturing subpattern count = 0 + No options + First char = 'T' + Need char = 's' + + /(?<=bullock|donkey)-cart/ + Capturing subpattern count = 0 + No options + First char = '-' + Need char = 't' + the bullock-cart + 0: -cart + a donkey-cart race + 0: -cart + *** Failers + No match + cart + No match + horse-and-cart + No match + + /(?<=ab(?i)x|y|z)/ + Capturing subpattern count = 0 + No options + Case state changes + No first char + No need char + + /(?>.*)(?<=(abcd)|(xyz))/ + Capturing subpattern count = 2 + No options + First char at start or follows \n + No need char + alphabetabcd + 0: alphabetabcd + 1: abcd + endingxyz + 0: endingxyz + 1: <unset> + 2: xyz + + /(?<=ab(?i)x(?-i)y|(?i)z|b)ZZ/ + Capturing subpattern count = 0 + No options + Case state changes + First char = 'Z' + Need char = 'Z' + abxyZZ + 0: ZZ + abXyZZ + 0: ZZ + ZZZ + 0: ZZ + zZZ + 0: ZZ + bZZ + 0: ZZ + BZZ + 0: ZZ + *** Failers + No match + ZZ + No match + abXYZZ + No match + zzz + No match + bzz + No match + + /(?<!(foo)a)bar/ + Capturing subpattern count = 1 + No options + First char = 'b' + Need char = 'r' + bar + 0: bar + foobbar + 0: bar + *** Failers + No match + fooabar + No match + + /This one is here because Perl 5.005_02 doesn't fail it/ + Capturing subpattern count = 0 + No options + First char = 'T' + Need char = 't' + + /^(a)?(?(1)a|b)+$/ + Capturing subpattern count = 1 + Options: anchored + No first char + No need char + *** Failers + No match + a + No match + + /This one is here because I think Perl 5.005_02 gets the setting of $1 wrong/ + Capturing subpattern count = 0 + No options + First char = 'T' + Need char = 'g' + + /^(a\1?){4}$/ + Capturing subpattern count = 1 + Max back reference = 1 + Options: anchored + No first char + Need char = 'a' + aaaaaa + 0: aaaaaa + 1: aa + + /These are syntax tests from Perl 5.005/ + Capturing subpattern count = 0 + No options + First char = 'T' + Need char = '5' + + /a[b-a]/ + Failed: range out of order in character class at offset 4 + + /a[]b/ + Failed: missing terminating ] for character class at offset 4 + + /a[/ + Failed: missing terminating ] for character class at offset 2 + + /*a/ + Failed: nothing to repeat at offset 0 + + /(*)b/ + Failed: nothing to repeat at offset 1 + + /abc)/ + Failed: unmatched parentheses at offset 3 + + /(abc/ + Failed: missing ) at offset 4 + + /a**/ + Failed: nothing to repeat at offset 2 + + /)(/ + Failed: unmatched parentheses at offset 0 + + /\1/ + Failed: back reference to non-existent subpattern at offset 2 + + /\2/ + Failed: back reference to non-existent subpattern at offset 2 + + /(a)|\2/ + Failed: back reference to non-existent subpattern at offset 6 + + /a[b-a]/i + Failed: range out of order in character class at offset 4 + + /a[]b/i + Failed: missing terminating ] for character class at offset 4 + + /a[/i + Failed: missing terminating ] for character class at offset 2 + + /*a/i + Failed: nothing to repeat at offset 0 + + /(*)b/i + Failed: nothing to repeat at offset 1 + + /abc)/i + Failed: unmatched parentheses at offset 3 + + /(abc/i + Failed: missing ) at offset 4 + + /a**/i + Failed: nothing to repeat at offset 2 + + /)(/i + Failed: unmatched parentheses at offset 0 + + /:(?:/ + Failed: missing ) at offset 4 + + /(?<%)b/ + Failed: unrecognized character after (?< at offset 0 + + /a(?{)b/ + Failed: unrecognized character after (? at offset 3 + + /a(?{{})b/ + Failed: unrecognized character after (? at offset 3 + + /a(?{}})b/ + Failed: unrecognized character after (? at offset 3 + + /a(?{"{"})b/ + Failed: unrecognized character after (? at offset 3 + + /a(?{"{"}})b/ + Failed: unrecognized character after (? at offset 3 + + /(?(1?)a|b)/ + Failed: malformed number after (?( at offset 4 + + /(?(1)a|b|c)/ + Failed: conditional group contains more than two branches at offset 10 + + /[a[:xyz:/ + Failed: missing terminating ] for character class at offset 8 + + /(?<=x+)y/ + Failed: lookbehind assertion is not fixed length at offset 6 + + /a{37,17}/ + Failed: numbers out of order in {} quantifier at offset 7 + + /abc/\ + Failed: \ at end of pattern at offset 4 + + /abc/\P + Failed: POSIX code 9: bad escape sequence at offset 4 + + /abc/\i + Failed: \ at end of pattern at offset 4 + + /(a)bc(d)/ + Capturing subpattern count = 2 + No options + First char = 'a' + Need char = 'd' + abcd + 0: abcd + 1: a + 2: d + abcd\C2 + 0: abcd + 1: a + 2: d + 2C d (1) + abcd\C5 + 0: abcd + 1: a + 2: d + copy substring 5 failed -7 + + /(.{20})/ + Capturing subpattern count = 1 + No options + No first char + No need char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst + abcdefghijklmnopqrstuvwxyz\C1 + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst + copy substring 1 failed -6 + abcdefghijklmnopqrstuvwxyz\G1 + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst + 1G abcdefghijklmnopqrst (20) + + /(.{15})/ + Capturing subpattern count = 1 + No options + No first char + No need char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmno + 1: abcdefghijklmno + abcdefghijklmnopqrstuvwxyz\C1\G1 + 0: abcdefghijklmno + 1: abcdefghijklmno + 1C abcdefghijklmno (15) + 1G abcdefghijklmno (15) + + /(.{16})/ + Capturing subpattern count = 1 + No options + No first char + No need char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmnop + 1: abcdefghijklmnop + abcdefghijklmnopqrstuvwxyz\C1\G1\L + 0: abcdefghijklmnop + 1: abcdefghijklmnop + copy substring 1 failed -6 + 1G abcdefghijklmnop (16) + 0L abcdefghijklmnop + 1L abcdefghijklmnop + + /^(a|(bc))de(f)/ + Capturing subpattern count = 3 + Options: anchored + No first char + Need char = 'f' + adef\G1\G2\G3\G4\L + 0: adef + 1: a + 2: <unset> + 3: f + 1G a (1) + 2G (0) + 3G f (1) + get substring 4 failed -7 + 0L adef + 1L a + 2L + 3L f + bcdef\G1\G2\G3\G4\L + 0: bcdef + 1: bc + 2: bc + 3: f + 1G bc (2) + 2G bc (2) + 3G f (1) + get substring 4 failed -7 + 0L bcdef + 1L bc + 2L bc + 3L f + adefghijk\C0 + 0: adef + 1: a + 2: <unset> + 3: f + 0C adef (4) + + /^abc\00def/ + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 'f' + abc\00def\L\C0 + 0: abc\x00def + 0C abc (7) + 0L abc + + /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ + )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ + )?)?)?)?)?)?)?)?)?otherword/M + Memory allocation (code space): 428 + Capturing subpattern count = 8 + No options + First char = 'w' + Need char = 'd' + + /.*X/D + ------------------------------------------------------------------ + 0 8 Bra 0 + 3 Any* + 5 1 X + 8 8 Ket + 11 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + No options + First char at start or follows \n + Need char = 'X' + + /.*X/Ds + ------------------------------------------------------------------ + 0 8 Bra 0 + 3 Any* + 5 1 X + 8 8 Ket + 11 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored dotall + No first char + Need char = 'X' + + /(.*X|^B)/D + ------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End + ------------------------------------------------------------------ + Capturing subpattern count = 1 + No options + First char at start or follows \n + No need char + + /(.*X|^B)/Ds + ------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End + ------------------------------------------------------------------ + Capturing subpattern count = 1 + Options: anchored dotall + No first char + No need char + + /(?s)(.*X|^B)/D + ------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End + ------------------------------------------------------------------ + Capturing subpattern count = 1 + Options: anchored dotall + No first char + No need char + + /(?s:.*X|^B)/D + ------------------------------------------------------------------ + 0 27 Bra 0 + 3 10 Bra 0 + 6 04 Opt + 8 Any* + 10 1 X + 13 9 Alt + 16 04 Opt + 18 ^ + 19 1 B + 22 19 Ket + 25 00 Opt + 27 27 Ket + 30 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + No options + First char at start or follows \n + No need char + + /\Biss\B/+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + + /\Biss\B/+P + Mississippi + 0: iss + 0+ issippi + + /iss/G+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + 0: iss + 0+ ippi + + /\Biss\B/G+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + + /\Biss\B/g+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + 0: iss + 0+ ippi + *** Failers + No match + Mississippi\A + No match + + /(?<=[Ms])iss/g+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + 0: iss + 0+ ippi + + /(?<=[Ms])iss/G+ + Capturing subpattern count = 0 + No options + First char = 'i' + Need char = 's' + Mississippi + 0: iss + 0+ issippi + + /^iss/g+ + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 's' + ississippi + 0: iss + 0+ issippi + + /.*iss/g+ + Capturing subpattern count = 0 + No options + First char at start or follows \n + Need char = 's' + abciss\nxyzisspqr + 0: abciss + 0+ \x0axyzisspqr + 0: xyziss + 0+ pqr + + /.i./+g + Capturing subpattern count = 0 + No options + No first char + Need char = 'i' + Mississippi + 0: Mis + 0+ sissippi + 0: sis + 0+ sippi + 0: sip + 0+ pi + Mississippi\A + 0: Mis + 0+ sissippi + 0: sis + 0+ sippi + 0: sip + 0+ pi + Missouri river + 0: Mis + 0+ souri river + 0: ri + 0+ river + 0: riv + 0+ er + Missouri river\A + 0: Mis + 0+ souri river + + /^.is/+g + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 's' + Mississippi + 0: Mis + 0+ sissippi + + /^ab\n/g+ + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 10 + ab\nab\ncd + 0: ab\x0a + 0+ ab\x0acd + + /^ab\n/mg+ + Capturing subpattern count = 0 + Options: multiline + First char at start or follows \n + Need char = 10 + ab\nab\ncd + 0: ab\x0a + 0+ ab\x0acd + 0: ab\x0a + 0+ cd + + /abc/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'c' + + /abc|bac/ + Capturing subpattern count = 0 + No options + No first char + Need char = 'c' + + /(abc|bac)/ + Capturing subpattern count = 1 + No options + No first char + Need char = 'c' + + /(abc|(c|dc))/ + Capturing subpattern count = 2 + No options + No first char + Need char = 'c' + + /(abc|(d|de)c)/ + Capturing subpattern count = 2 + No options + No first char + Need char = 'c' + + /a*/ + Capturing subpattern count = 0 + No options + No first char + No need char + + /a+/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /(baa|a+)/ + Capturing subpattern count = 1 + No options + No first char + Need char = 'a' + + /a{0,3}/ + Capturing subpattern count = 0 + No options + No first char + No need char + + /baa{3,}/ + Capturing subpattern count = 0 + No options + First char = 'b' + Need char = 'a' + + /"([^\\"]+|\\.)*"/ + Capturing subpattern count = 1 + No options + First char = '"' + Need char = '"' + + /(abc|ab[cd])/ + Capturing subpattern count = 1 + No options + First char = 'a' + No need char + + /(a|.)/ + Capturing subpattern count = 1 + No options + No first char + No need char + + /a|ba|\w/ + Capturing subpattern count = 0 + No options + No first char + No need char + + /abc(?=pqr)/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'r' + + /...(?<=abc)/ + Capturing subpattern count = 0 + No options + No first char + No need char + + /abc(?!pqr)/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'c' + + /ab./ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /ab[xyz]/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /abc*/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /ab.c*/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /a.c*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /.c*/ + Capturing subpattern count = 0 + No options + No first char + No need char + + /ac*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /(a.c*|b.c*)/ + Capturing subpattern count = 1 + No options + No first char + No need char + + /a.c*|aba/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /.+a/ + Capturing subpattern count = 0 + No options + No first char + Need char = 'a' + + /(?=abcda)a.*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /(?=a)a.*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /a(b)*/ + Capturing subpattern count = 1 + No options + First char = 'a' + No need char + + /a\d*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /ab\d*/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /a(\d)*/ + Capturing subpattern count = 1 + No options + First char = 'a' + No need char + + /abcde{0,0}/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'd' + + /ab\d+/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /a(?(1)b)/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /a(?(1)bag|big)/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'g' + + /a(?(1)bag|big)*/ + Capturing subpattern count = 0 + No options + First char = 'a' + No need char + + /a(?(1)bag|big)+/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'g' + + /a(?(1)b..|b..)/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'b' + + /ab\d{0}e/ + Capturing subpattern count = 0 + No options + First char = 'a' + Need char = 'e' + + /a?b?/ + Capturing subpattern count = 0 + No options + No first char + No need char + a + 0: a + b + 0: b + ab + 0: ab + \ + 0: + *** Failers + 0: + \N + No match + + /|-/ + Capturing subpattern count = 0 + No options + No first char + No need char + abcd + 0: + -abc + 0: + \Nab-c + 0: - + *** Failers + 0: + \Nabc + No match + + /a*(b+)(z)(z)/P + aaaabbbbzzzz + 0: aaaabbbbzz + 1: bbbb + 2: z + 3: z + aaaabbbbzzzz\O0 + aaaabbbbzzzz\O1 + 0: aaaabbbbzz + aaaabbbbzzzz\O2 + 0: aaaabbbbzz + 1: bbbb + aaaabbbbzzzz\O3 + 0: aaaabbbbzz + 1: bbbb + 2: z + aaaabbbbzzzz\O4 + 0: aaaabbbbzz + 1: bbbb + 2: z + 3: z + aaaabbbbzzzz\O5 + 0: aaaabbbbzz + 1: bbbb + 2: z + 3: z + + /^.?abcd/S + Capturing subpattern count = 0 + Options: anchored + No first char + Need char = 'd' + Study returned NULL + + /$ # ( at start + (?: # Non-capturing bracket + (?>[^()]+) # Either a sequence of non-brackets (no backtracking) + | # Or + (?R) # Recurse - i.e. nested bracketed string + )* # Zero or more contents + $ # Closing ) + /x + Capturing subpattern count = 0 + Options: extended + First char = '(' + Need char = ')' + (abcd) + 0: (abcd) + (abcd)xyz + 0: (abcd) + xyz(abcd) + 0: (abcd) + (ab(xy)cd)pqr + 0: (ab(xy)cd) + (ab(xycd)pqr + 0: (xycd) + () abc () + 0: () + 12(abcde(fsh)xyz(foo(bar))lmno)89 + 0: (abcde(fsh)xyz(foo(bar))lmno) + *** Failers + No match + abcd + No match + abcd) + No match + (abcd + No match + + /$ ( (?>[^()]+) | (?R) )* $ /xg + Capturing subpattern count = 1 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd)pqr + 0: (ab(xy)cd) + 1: cd + 1(abcd)(x(y)z)pqr + 0: (abcd) + 1: abcd + 0: (x(y)z) + 1: z + + /$ (?: (?>[^()]+) | (?R) ) $ /x + Capturing subpattern count = 0 + Options: extended + First char = '(' + Need char = ')' + (abcd) + 0: (abcd) + (ab(xy)cd) + 0: (xy) + (a(b(c)d)e) + 0: (c) + ((ab)) + 0: ((ab)) + *** Failers + No match + () + No match + + /$ (?: (?>[^()]+) | (?R) )? $ /x + Capturing subpattern count = 0 + Options: extended + First char = '(' + Need char = ')' + () + 0: () + 12(abcde(fsh)xyz(foo(bar))lmno)89 + 0: (fsh) + + /$ ( (?>[^()]+) | (?R) )* $ /x + Capturing subpattern count = 1 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd) + 0: (ab(xy)cd) + 1: cd + + /$ ( ( (?>[^()]+) | (?R) )* ) $ /x + Capturing subpattern count = 2 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd) + 0: (ab(xy)cd) + 1: ab(xy)cd + 2: cd + + /$ (123)? ( ( (?>[^()]+) | (?R) )* ) $ /x + Capturing subpattern count = 3 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd) + 0: (ab(xy)cd) + 1: <unset> + 2: ab(xy)cd + 3: cd + (123ab(xy)cd) + 0: (123ab(xy)cd) + 1: 123 + 2: ab(xy)cd + 3: cd + + /$ ( (123)? ( (?>[^()]+) | (?R) )* ) $ /x + Capturing subpattern count = 3 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd) + 0: (ab(xy)cd) + 1: ab(xy)cd + 2: <unset> + 3: cd + (123ab(xy)cd) + 0: (123ab(xy)cd) + 1: 123ab(xy)cd + 2: 123 + 3: cd + + /$ (((((((((( ( (?>[^()]+) | (?R) )* )))))))))) $ /x + Capturing subpattern count = 11 + Options: extended + First char = '(' + Need char = ')' + (ab(xy)cd) + 0: (ab(xy)cd) + 1: ab(xy)cd + 2: ab(xy)cd + 3: ab(xy)cd + 4: ab(xy)cd + 5: ab(xy)cd + 6: ab(xy)cd + 7: ab(xy)cd + 8: ab(xy)cd + 9: ab(xy)cd + 10: ab(xy)cd + 11: cd + + /$ ( ( (?>[^()<>]+) | ((?>[^()]+)) | (?R) )* ) $ /x + Capturing subpattern count = 3 + Options: extended + First char = '(' + Need char = ')' + (abcd(xyzqrs)123) + 0: (abcd(xyzqrs)123) + 1: abcd(xyzqrs)123 + 2: 123 + 3: qrs + + /$ ( ( (?>[^()]+) | ((?R)) )* ) $ /x + Capturing subpattern count = 3 + Options: extended + First char = '(' + Need char = ')' + (ab(cd)ef) + 0: (ab(cd)ef) + 1: ab(cd)ef + 2: ef + 3: (cd) + (ab(cd(ef)gh)ij) + 0: (ab(cd(ef)gh)ij) + 1: ab(cd(ef)gh)ij + 2: ij + 3: (cd(ef)gh) + + /^[[:alnum:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [0-9A-Za-z] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:alpha:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [A-Za-z] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:ascii:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x00-\x7f] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:cntrl:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x00-\x1f\x7f] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:digit:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [0-9] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:graph:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [!-~] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:lower:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [a-z] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:print:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [ -~] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:punct:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [!-/:-@[-`{-~] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:space:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x09-\x0d ] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:upper:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [A-Z] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:xdigit:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [0-9A-Fa-f] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:word:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [0-9A-Z_a-z] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[[:^cntrl:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [ -~\x80-\xff] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /^[12[:^digit:]]/D + ------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x00-/1-2:-\xff] + 37 37 Ket + 40 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: anchored + No first char + No need char + + /[01[:alpha:]%]/D + ------------------------------------------------------------------ + 0 36 Bra 0 + 3 [%0-1A-Za-z] + 36 36 Ket + 39 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + No options + No first char + No need char + + /[[.ch.]]/ + Failed: POSIX collating elements are not supported at offset 1 + + /[[=ch=]]/ + Failed: POSIX collating elements are not supported at offset 1 + + /[[:rhubarb:]]/ + Failed: unknown POSIX class name at offset 3 + + /[[:upper:]]/i + Capturing subpattern count = 0 + Options: caseless + No first char + No need char + A + 0: A + a + 0: a + + /[[:lower:]]/i + Capturing subpattern count = 0 + Options: caseless + No first char + No need char + A + 0: A + a + 0: a + + /((?-i)[[:lower:]])[[:lower:]]/i + Capturing subpattern count = 1 + Options: caseless + Case state changes + No first char + No need char + ab + 0: ab + 1: a + aB + 0: aB + 1: a + *** Failers + 0: ai + 1: a + Ab + No match + AB + No match + + /[\200-\410]/ + Failed: range out of order in character class at offset 9 + + / End of testinput2 / + Capturing subpattern count = 0 + No options + First char = ' ' + Need char = ' ' +

ossp-pkg/pcre/pcre_test.d/testoutput3 -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,2991 ---- + PCRE version 3.3 01-Aug-2000 + + /(?<!bar)foo/ + foo + 0: foo + catfood + 0: foo + arfootle + 0: foo + rfoosh + 0: foo + *** Failers + No match + barfoo + No match + towbarfoo + No match + + /\w{3}(?<!bar)foo/ + catfood + 0: catfoo + *** Failers + No match + foo + No match + barfoo + No match + towbarfoo + No match + + /(?<=(foo)a)bar/ + fooabar + 0: bar + 1: foo + *** Failers + No match + bar + No match + foobbar + No match + + /\Aabc\z/m + abc + 0: abc + *** Failers + No match + abc\n + No match + qqq\nabc + No match + abc\nzzz + No match + qqq\nabc\nzzz + No match + + "(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + No match + + "(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + + /(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + *** Failers + No match + 1.235 + No match + + /^((?>\w+)|(?>\s+))*$/ + now is the time for all good men to come to the aid of the party + 0: now is the time for all good men to come to the aid of the party + 1: party + *** Failers + No match + this is not a line with only words and spaces! + No match + + /(\d+)(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + 12345+ + 0: 12345 + 1: 1234 + 2: 5 + + /((?>\d+))(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + *** Failers + No match + 12345+ + No match + + /(?>a+)b/ + aaab + 0: aaab + + /((?>a+)b)/ + aaab + 0: aaab + 1: aaab + + /(?>(a+))b/ + aaab + 0: aaab + 1: aaa + + /(?>b)+/ + aaabbbccc + 0: bbb + + /(?>a+|b+|c+)*c/ + aaabbbbccccd + 0: aaabbbbc + + /((?>[^()]+)|$[^()]*$)+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + + /$((?>[^()]+)|\([^()]+$)+\)/ + (abc) + 0: (abc) + 1: abc + (abc(def)xyz) + 0: (abc(def)xyz) + 1: xyz + *** Failers + No match + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + No match + + /a(?-i)b/i + ab + 0: ab + *** Failers + No match + Ab + No match + aB + No match + AB + No match + + /(a (?x)b c)d e/ + a bcd e + 0: a bcd e + 1: a bc + *** Failers + No match + a b cd e + No match + abcd e + No match + a bcde + No match + + /(a b(?x)c d (?-x)e f)/ + a bcde f + 0: a bcde f + 1: a bcde f + *** Failers + No match + abcdef + No match + + /(a(?i)b)c/ + abc + 0: abc + 1: ab + aBc + 0: aBc + 1: aB + *** Failers + No match + abC + No match + aBC + No match + Abc + No match + ABc + No match + ABC + No match + AbC + No match + + /a(?i:b)c/ + abc + 0: abc + aBc + 0: aBc + *** Failers + No match + ABC + No match + abC + No match + aBC + No match + + /a(?i:b)*c/ + aBc + 0: aBc + aBBc + 0: aBBc + *** Failers + No match + aBC + No match + aBBC + No match + + /a(?=b(?i)c)\w\wd/ + abcd + 0: abcd + abCd + 0: abCd + *** Failers + No match + aBCd + No match + abcD + No match + + /(?s-i:more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers + No match + MORE THAN MILLION + No match + more \n than \n million + No match + + /(?:(?s-i)more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers + No match + MORE THAN MILLION + No match + more \n than \n million + No match + + /(?>a(?i)b+)+c/ + abc + 0: abc + aBbc + 0: aBbc + aBBc + 0: aBBc + *** Failers + No match + Abc + No match + abAb + No match + abbC + No match + + /(?=a(?i)b)\w\wc/ + abc + 0: abc + aBc + 0: aBc + *** Failers + No match + Ab + No match + abC + No match + aBC + No match + + /(?<=a(?i)b)(\w\w)c/ + abxxc + 0: xxc + 1: xx + aBxxc + 0: xxc + 1: xx + *** Failers + No match + Abxxc + No match + ABxxc + No match + abxxC + No match + + /(?:(a)|b)(?(1)A|B)/ + aA + 0: aA + 1: a + bB + 0: bB + *** Failers + No match + aB + No match + bA + No match + + /^(a)?(?(1)a|b)+$/ + aa + 0: aa + 1: a + b + 0: b + bb + 0: bb + *** Failers + No match + ab + No match + + /^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers + No match + 123 + No match + xyz + No match + + /^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers + No match + 123 + No match + xyz + No match + + /(?(?<=foo)bar|cat)/ + foobar + 0: bar + cat + 0: cat + fcat + 0: cat + focat + 0: cat + *** Failers + No match + foocat + No match + + /(?(?<!foo)cat|bar)/ + foobar + 0: bar + cat + 0: cat + fcat + 0: cat + focat + 0: cat + *** Failers + No match + foocat + No match + + /( $ )? [^()]+ (?(1) $ |) /x + abcd + 0: abcd + (abcd) + 0: (abcd) + 1: ( + the quick (abcd) fox + 0: the quick + (abcd + 0: abcd + + /( $ )? [^()]+ (?(1) $ ) /x + abcd + 0: abcd + (abcd) + 0: (abcd) + 1: ( + the quick (abcd) fox + 0: the quick + (abcd + 0: abcd + + /^(?(2)a|(1)(2))+$/ + 12 + 0: 12 + 1: 1 + 2: 2 + 12a + 0: 12a + 1: 1 + 2: 2 + 12aa + 0: 12aa + 1: 1 + 2: 2 + *** Failers + No match + 1234 + No match + + /((?i)blah)\s+\1/ + blah blah + 0: blah blah + 1: blah + BLAH BLAH + 0: BLAH BLAH + 1: BLAH + Blah Blah + 0: Blah Blah + 1: Blah + blaH blaH + 0: blaH blaH + 1: blaH + *** Failers + No match + blah BLAH + No match + Blah blah + No match + blaH blah + No match + + /((?i)blah)\s+(?i:\1)/ + blah blah + 0: blah blah + 1: blah + BLAH BLAH + 0: BLAH BLAH + 1: BLAH + Blah Blah + 0: Blah Blah + 1: Blah + blaH blaH + 0: blaH blaH + 1: blaH + blah BLAH + 0: blah BLAH + 1: blah + Blah blah + 0: Blah blah + 1: Blah + blaH blah + 0: blaH blah + 1: blaH + + /(?>a*)*/ + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + + /(abc|)+/ + abc + 0: abc + 1: + abcabc + 0: abcabc + 1: + abcabcabc + 0: abcabcabc + 1: + xyz + 0: + 1: + + /([a]*)*/ + a + 0: a + 1: + aaaaa + 0: aaaaa + 1: + + /([ab]*)*/ + a + 0: a + 1: + b + 0: b + 1: + ababab + 0: ababab + 1: + aaaabcde + 0: aaaab + 1: + bbbb + 0: bbbb + 1: + + /([^a]*)*/ + b + 0: b + 1: + bbbb + 0: bbbb + 1: + aaa + 0: + 1: + + /([^ab]*)*/ + cccc + 0: cccc + 1: + abab + 0: + 1: + + /([a]*?)*/ + a + 0: + 1: + aaaa + 0: + 1: + + /([ab]*?)*/ + a + 0: + 1: + b + 0: + 1: + abab + 0: + 1: + baba + 0: + 1: + + /([^a]*?)*/ + b + 0: + 1: + bbbb + 0: + 1: + aaa + 0: + 1: + + /([^ab]*?)*/ + c + 0: + 1: + cccc + 0: + 1: + baba + 0: + 1: + + /(?>a*)*/ + a + 0: a + aaabcde + 0: aaa + + /((?>a*))*/ + aaaaa + 0: aaaaa + 1: + aabbaa + 0: aa + 1: + + /((?>a*?))*/ + aaaaa + 0: + 1: + aabbaa + 0: + 1: + + /(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 0: 12-sep-98 + 12-09-98 + 0: 12-09-98 + *** Failers + No match + sep-12-98 + No match + + /(?<=(foo))bar\1/ + foobarfoo + 0: barfoo + 1: foo + foobarfootling + 0: barfoo + 1: foo + *** Failers + No match + foobar + No match + barfoo + No match + + /(?i:saturday|sunday)/ + saturday + 0: saturday + sunday + 0: sunday + Saturday + 0: Saturday + Sunday + 0: Sunday + SATURDAY + 0: SATURDAY + SUNDAY + 0: SUNDAY + SunDay + 0: SunDay + + /(a(?i)bc|BB)x/ + abcx + 0: abcx + 1: abc + aBCx + 0: aBCx + 1: aBC + bbx + 0: bbx + 1: bb + BBx + 0: BBx + 1: BB + *** Failers + No match + abcX + No match + aBCX + No match + bbX + No match + BBX + No match + + /^([ab](?i)[cd]|[ef])/ + ac + 0: ac + 1: ac + aC + 0: aC + 1: aC + bD + 0: bD + 1: bD + elephant + 0: e + 1: e + Europe + 0: E + 1: E + frog + 0: f + 1: f + France + 0: F + 1: F + *** Failers + No match + Africa + No match + + /^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + 0: ab + 1: ab + aBd + 0: aBd + 1: aBd + xy + 0: xy + 1: xy + xY + 0: xY + 1: xY + zebra + 0: z + 1: z + Zambesi + 0: Z + 1: Z + *** Failers + No match + aCD + No match + XY + No match + + /(?<=foo\n)^bar/m + foo\nbar + 0: bar + *** Failers + No match + bar + No match + baz\nbar + No match + + /(?<=(?<!foo)bar)baz/ + barbaz + 0: baz + barbarbaz + 0: baz + koobarbaz + 0: baz + *** Failers + No match + baz + No match + foobarbaz + No match + + /The case of aaaaaa is missed out below because I think Perl 5.005_02 gets/ + /it wrong; it sets $1 to aaa rather than aa. Compare the following test,/ + No match + /where it does set $1 to aa when matching aaaaaa./ + No match + + /^(a\1?){4}$/ + a + No match + aa + No match + aaa + No match + aaaa + 0: aaaa + 1: a + aaaaa + 0: aaaaa + 1: a + aaaaaaa + 0: aaaaaaa + 1: a + aaaaaaaa + No match + aaaaaaaaa + No match + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + aaaaaaaaaaa + No match + aaaaaaaaaaaa + No match + aaaaaaaaaaaaa + No match + aaaaaaaaaaaaaa + No match + aaaaaaaaaaaaaaa + No match + aaaaaaaaaaaaaaaa + No match + + /^(a\1?)(a\1?)(a\2?)(a\3?)$/ + a + No match + aa + No match + aaa + No match + aaaa + 0: aaaa + 1: a + 2: a + 3: a + 4: a + aaaaa + 0: aaaaa + 1: a + 2: aa + 3: a + 4: a + aaaaaa + 0: aaaaaa + 1: a + 2: aa + 3: a + 4: aa + aaaaaaa + 0: aaaaaaa + 1: a + 2: aa + 3: aaa + 4: a + aaaaaaaa + No match + aaaaaaaaa + No match + aaaaaaaaaa + 0: aaaaaaaaaa + 1: a + 2: aa + 3: aaa + 4: aaaa + aaaaaaaaaaa + No match + aaaaaaaaaaaa + No match + aaaaaaaaaaaaa + No match + aaaaaaaaaaaaaa + No match + aaaaaaaaaaaaaaa + No match + aaaaaaaaaaaaaaaa + No match + + /The following tests are taken from the Perl 5.005 test suite; some of them/ + /are compatible with 5.004, but I'd rather not have to sort them out./ + No match + + /abc/ + abc + 0: abc + xabcy + 0: abc + ababc + 0: abc + *** Failers + No match + xbc + No match + axc + No match + abx + No match + + /ab*c/ + abc + 0: abc + + /ab*bc/ + abc + 0: abc + abbc + 0: abbc + abbbbc + 0: abbbbc + + /.{1}/ + abbbbc + 0: a + + /.{3,4}/ + abbbbc + 0: abbb + + /ab{0,}bc/ + abbbbc + 0: abbbbc + + /ab+bc/ + abbc + 0: abbc + *** Failers + No match + abc + No match + abq + No match + + /ab{1,}bc/ + + /ab+bc/ + abbbbc + 0: abbbbc + + /ab{1,}bc/ + abbbbc + 0: abbbbc + + /ab{1,3}bc/ + abbbbc + 0: abbbbc + + /ab{3,4}bc/ + abbbbc + 0: abbbbc + + /ab{4,5}bc/ + *** Failers + No match + abq + No match + abbbbc + No match + + /ab?bc/ + abbc + 0: abbc + abc + 0: abc + + /ab{0,1}bc/ + abc + 0: abc + + /ab?bc/ + + /ab?c/ + abc + 0: abc + + /ab{0,1}c/ + abc + 0: abc + + /^abc$/ + abc + 0: abc + *** Failers + No match + abbbbc + No match + abcc + No match + + /^abc/ + abcc + 0: abc + + /^abc$/ + + /abc$/ + aabc + 0: abc + *** Failers + No match + aabc + 0: abc + aabcd + No match + + /^/ + abc + 0: + + /$/ + abc + 0: + + /a.c/ + abc + 0: abc + axc + 0: axc + + /a.*c/ + axyzc + 0: axyzc + + /a[bc]d/ + abd + 0: abd + *** Failers + No match + axyzd + No match + abc + No match + + /a[b-d]e/ + ace + 0: ace + + /a[b-d]/ + aac + 0: ac + + /a[-b]/ + a- + 0: a- + + /a[b-]/ + a- + 0: a- + + /a]/ + a] + 0: a] + + /a[]]b/ + a]b + 0: a]b + + /a[^bc]d/ + aed + 0: aed + *** Failers + No match + abd + No match + abd + No match + + /a[^-b]c/ + adc + 0: adc + + /a[^]b]c/ + adc + 0: adc + *** Failers + No match + a-c + 0: a-c + a]c + No match + + /\ba\b/ + a- + 0: a + -a + 0: a + -a- + 0: a + + /\by\b/ + *** Failers + No match + xy + No match + yz + No match + xyz + No match + + /\Ba\B/ + *** Failers + 0: a + a- + No match + -a + No match + -a- + No match + + /\By\b/ + xy + 0: y + + /\by\B/ + yz + 0: y + + /\By\B/ + xyz + 0: y + + /\w/ + a + 0: a + + /\W/ + - + 0: - + *** Failers + 0: * + - + 0: - + a + No match + + /a\sb/ + a b + 0: a b + + /a\Sb/ + a-b + 0: a-b + *** Failers + No match + a-b + 0: a-b + a b + No match + + /\d/ + 1 + 0: 1 + + /\D/ + - + 0: - + *** Failers + 0: * + - + 0: - + 1 + No match + + /[\w]/ + a + 0: a + + /[\W]/ + - + 0: - + *** Failers + 0: * + - + 0: - + a + No match + + /a[\s]b/ + a b + 0: a b + + /a[\S]b/ + a-b + 0: a-b + *** Failers + No match + a-b + 0: a-b + a b + No match + + /[\d]/ + 1 + 0: 1 + + /[\D]/ + - + 0: - + *** Failers + 0: * + - + 0: - + 1 + No match + + /ab|cd/ + abc + 0: ab + abcd + 0: ab + + /()ef/ + def + 0: ef + 1: + + /$b/ + + /a\(b/ + a(b + 0: a(b + + /a\(*b/ + ab + 0: ab + a((b + 0: a((b + + /a\\b/ + a\b + No match + + /((a))/ + abc + 0: a + 1: a + 2: a + + /(a)b(c)/ + abc + 0: abc + 1: a + 2: c + + /a+b+c/ + aabbabc + 0: abc + + /a{1,}b{1,}c/ + aabbabc + 0: abc + + /a.+?c/ + abcabc + 0: abc + + /(a+|b)*/ + ab + 0: ab + 1: b + + /(a+|b){0,}/ + ab + 0: ab + 1: b + + /(a+|b)+/ + ab + 0: ab + 1: b + + /(a+|b){1,}/ + ab + 0: ab + 1: b + + /(a+|b)?/ + ab + 0: a + 1: a + + /(a+|b){0,1}/ + ab + 0: a + 1: a + + /[^ab]*/ + cde + 0: cde + + /abc/ + *** Failers + No match + b + No match + + + /a*/ + + + /([abc])*d/ + abbbcd + 0: abbbcd + 1: c + + /([abc])*bcd/ + abcd + 0: abcd + 1: a + + /a|b|c|d|e/ + e + 0: e + + /(a|b|c|d|e)f/ + ef + 0: ef + 1: e + + /abcd*efg/ + abcdefg + 0: abcdefg + + /ab*/ + xabyabbbz + 0: ab + xayabbbz + 0: a + + /(ab|cd)e/ + abcde + 0: cde + 1: cd + + /[abhgefdc]ij/ + hij + 0: hij + + /^(ab|cd)e/ + + /(abc|)ef/ + abcdef + 0: ef + 1: + + /(a|b)c*d/ + abcd + 0: bcd + 1: b + + /(ab|ab*)bc/ + abc + 0: abc + 1: a + + /a([bc]*)c*/ + abc + 0: abc + 1: bc + + /a([bc]*)(c*d)/ + abcd + 0: abcd + 1: bc + 2: d + + /a([bc]+)(c*d)/ + abcd + 0: abcd + 1: bc + 2: d + + /a([bc]*)(c+d)/ + abcd + 0: abcd + 1: b + 2: cd + + /a[bcd]*dcdcde/ + adcdcde + 0: adcdcde + + /a[bcd]+dcdcde/ + *** Failers + No match + abcde + No match + adcdcde + No match + + /(ab|a)b*c/ + abc + 0: abc + 1: ab + + /((a)(b)c)(d)/ + abcd + 0: abcd + 1: abc + 2: a + 3: b + 4: d + + /[a-zA-Z_][a-zA-Z0-9_]*/ + alpha + 0: alpha + + /^a(bc+|b[eh])g|.h$/ + abh + 0: bh + + /(bc+d$|ef*g.|h?i(j|k))/ + effgz + 0: effgz + 1: effgz + ij + 0: ij + 1: ij + 2: j + reffgz + 0: effgz + 1: effgz + *** Failers + No match + effg + No match + bcdd + No match + + /((((((((((a))))))))))/ + a + 0: a + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a + 10: a + + /((((((((((a))))))))))\10/ + aa + 0: aa + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a + 10: a + + /(((((((((a)))))))))/ + a + 0: a + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a + + /multiple words of text/ + *** Failers + No match + aa + No match + uh-uh + No match + + /multiple words/ + multiple words, yeah + 0: multiple words + + /(.*)c(.*)/ + abcde + 0: abcde + 1: ab + 2: de + + /$(.*), (.*)$/ + (a, b) + 0: (a, b) + 1: a + 2: b + + /[k]/ + + /abcd/ + abcd + 0: abcd + + /a(bc)d/ + abcd + 0: abcd + 1: bc + + /a[-]?c/ + ac + 0: ac + + /(abc)\1/ + abcabc + 0: abcabc + 1: abc + + /([a-c]*)\1/ + abcabc + 0: abcabc + 1: abc + + /(a)|\1/ + a + 0: a + 1: a + *** Failers + 0: a + 1: a + ab + 0: a + 1: a + x + No match + + /(([a-c])b*?\2)*/ + ababbbcbc + 0: ababb + 1: bb + 2: b + + /(([a-c])b*?\2){3}/ + ababbbcbc + 0: ababbbcbc + 1: cbc + 2: c + + /((\3|b)\2(a)x)+/ + aaaxabaxbaaxbbax + 0: bbax + 1: bbax + 2: b + 3: a + + /((\3|b)\2(a)){2,}/ + bbaababbabaaaaabbaaaabba + 0: bbaaaabba + 1: bba + 2: b + 3: a + + /abc/i + ABC + 0: ABC + XABCY + 0: ABC + ABABC + 0: ABC + *** Failers + No match + aaxabxbaxbbx + No match + XBC + No match + AXC + No match + ABX + No match + + /ab*c/i + ABC + 0: ABC + + /ab*bc/i + ABC + 0: ABC + ABBC + 0: ABBC + + /ab*?bc/i + ABBBBC + 0: ABBBBC + + /ab{0,}?bc/i + ABBBBC + 0: ABBBBC + + /ab+?bc/i + ABBC + 0: ABBC + + /ab+bc/i + *** Failers + No match + ABC + No match + ABQ + No match + + /ab{1,}bc/i + + /ab+bc/i + ABBBBC + 0: ABBBBC + + /ab{1,}?bc/i + ABBBBC + 0: ABBBBC + + /ab{1,3}?bc/i + ABBBBC + 0: ABBBBC + + /ab{3,4}?bc/i + ABBBBC + 0: ABBBBC + + /ab{4,5}?bc/i + *** Failers + No match + ABQ + No match + ABBBBC + No match + + /ab??bc/i + ABBC + 0: ABBC + ABC + 0: ABC + + /ab{0,1}?bc/i + ABC + 0: ABC + + /ab??bc/i + + /ab??c/i + ABC + 0: ABC + + /ab{0,1}?c/i + ABC + 0: ABC + + /^abc$/i + ABC + 0: ABC + *** Failers + No match + ABBBBC + No match + ABCC + No match + + /^abc/i + ABCC + 0: ABC + + /^abc$/i + + /abc$/i + AABC + 0: ABC + + /^/i + ABC + 0: + + /$/i + ABC + 0: + + /a.c/i + ABC + 0: ABC + AXC + 0: AXC + + /a.*?c/i + AXYZC + 0: AXYZC + + /a.*c/i + *** Failers + No match + AABC + 0: AABC + AXYZD + No match + + /a[bc]d/i + ABD + 0: ABD + + /a[b-d]e/i + ACE + 0: ACE + *** Failers + No match + ABC + No match + ABD + No match + + /a[b-d]/i + AAC + 0: AC + + /a[-b]/i + A- + 0: A- + + /a[b-]/i + A- + 0: A- + + /a]/i + A] + 0: A] + + /a[]]b/i + A]B + 0: A]B + + /a[^bc]d/i + AED + 0: AED + + /a[^-b]c/i + ADC + 0: ADC + *** Failers + No match + ABD + No match + A-C + No match + + /a[^]b]c/i + ADC + 0: ADC + + /ab|cd/i + ABC + 0: AB + ABCD + 0: AB + + /()ef/i + DEF + 0: EF + 1: + + /$b/i + *** Failers + No match + A]C + No match + B + No match + + /a\(b/i + A(B + 0: A(B + + /a\(*b/i + AB + 0: AB + A((B + 0: A((B + + /a\\b/i + A\B + No match + + /((a))/i + ABC + 0: A + 1: A + 2: A + + /(a)b(c)/i + ABC + 0: ABC + 1: A + 2: C + + /a+b+c/i + AABBABC + 0: ABC + + /a{1,}b{1,}c/i + AABBABC + 0: ABC + + /a.+?c/i + ABCABC + 0: ABC + + /a.*?c/i + ABCABC + 0: ABC + + /a.{0,5}?c/i + ABCABC + 0: ABC + + /(a+|b)*/i + AB + 0: AB + 1: B + + /(a+|b){0,}/i + AB + 0: AB + 1: B + + /(a+|b)+/i + AB + 0: AB + 1: B + + /(a+|b){1,}/i + AB + 0: AB + 1: B + + /(a+|b)?/i + AB + 0: A + 1: A + + /(a+|b){0,1}/i + AB + 0: A + 1: A + + /(a+|b){0,1}?/i + AB + 0: + + /[^ab]*/i + CDE + 0: CDE + + /abc/i + + /a*/i + + + /([abc])*d/i + ABBBCD + 0: ABBBCD + 1: C + + /([abc])*bcd/i + ABCD + 0: ABCD + 1: A + + /a|b|c|d|e/i + E + 0: E + + /(a|b|c|d|e)f/i + EF + 0: EF + 1: E + + /abcd*efg/i + ABCDEFG + 0: ABCDEFG + + /ab*/i + XABYABBBZ + 0: AB + XAYABBBZ + 0: A + + /(ab|cd)e/i + ABCDE + 0: CDE + 1: CD + + /[abhgefdc]ij/i + HIJ + 0: HIJ + + /^(ab|cd)e/i + ABCDE + No match + + /(abc|)ef/i + ABCDEF + 0: EF + 1: + + /(a|b)c*d/i + ABCD + 0: BCD + 1: B + + /(ab|ab*)bc/i + ABC + 0: ABC + 1: A + + /a([bc]*)c*/i + ABC + 0: ABC + 1: BC + + /a([bc]*)(c*d)/i + ABCD + 0: ABCD + 1: BC + 2: D + + /a([bc]+)(c*d)/i + ABCD + 0: ABCD + 1: BC + 2: D + + /a([bc]*)(c+d)/i + ABCD + 0: ABCD + 1: B + 2: CD + + /a[bcd]*dcdcde/i + ADCDCDE + 0: ADCDCDE + + /a[bcd]+dcdcde/i + + /(ab|a)b*c/i + ABC + 0: ABC + 1: AB + + /((a)(b)c)(d)/i + ABCD + 0: ABCD + 1: ABC + 2: A + 3: B + 4: D + + /[a-zA-Z_][a-zA-Z0-9_]*/i + ALPHA + 0: ALPHA + + /^a(bc+|b[eh])g|.h$/i + ABH + 0: BH + + /(bc+d$|ef*g.|h?i(j|k))/i + EFFGZ + 0: EFFGZ + 1: EFFGZ + IJ + 0: IJ + 1: IJ + 2: J + REFFGZ + 0: EFFGZ + 1: EFFGZ + *** Failers + No match + ADCDCDE + No match + EFFG + No match + BCDD + No match + + /((((((((((a))))))))))/i + A + 0: A + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A + 10: A + + /((((((((((a))))))))))\10/i + AA + 0: AA + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A + 10: A + + /(((((((((a)))))))))/i + A + 0: A + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A + + /(?:(?:(?:(?:(?:(?:(?:(?:(?:(a))))))))))/i + A + 0: A + 1: A + + /(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/i + C + 0: C + 1: C + + /multiple words of text/i + *** Failers + No match + AA + No match + UH-UH + No match + + /multiple words/i + MULTIPLE WORDS, YEAH + 0: MULTIPLE WORDS + + /(.*)c(.*)/i + ABCDE + 0: ABCDE + 1: AB + 2: DE + + /$(.*), (.*)$/i + (A, B) + 0: (A, B) + 1: A + 2: B + + /[k]/i + + /abcd/i + ABCD + 0: ABCD + + /a(bc)d/i + ABCD + 0: ABCD + 1: BC + + /a[-]?c/i + AC + 0: AC + + /(abc)\1/i + ABCABC + 0: ABCABC + 1: ABC + + /([a-c]*)\1/i + ABCABC + 0: ABCABC + 1: ABC + + /a(?!b)./ + abad + 0: ad + + /a(?=d)./ + abad + 0: ad + + /a(?=c|d)./ + abad + 0: ad + + /a(?:b|c|d)(.)/ + ace + 0: ace + 1: e + + /a(?:b|c|d)*(.)/ + ace + 0: ace + 1: e + + /a(?:b|c|d)+?(.)/ + ace + 0: ace + 1: e + acdbcdbe + 0: acd + 1: d + + /a(?:b|c|d)+(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + + /a(?:b|c|d){2}(.)/ + acdbcdbe + 0: acdb + 1: b + + /a(?:b|c|d){4,5}(.)/ + acdbcdbe + 0: acdbcdb + 1: b + + /a(?:b|c|d){4,5}?(.)/ + acdbcdbe + 0: acdbcd + 1: d + + /((foo)|(bar))*/ + foobar + 0: foobar + 1: bar + 2: foo + 3: bar + + /a(?:b|c|d){6,7}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + + /a(?:b|c|d){6,7}?(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + + /a(?:b|c|d){5,6}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + + /a(?:b|c|d){5,6}?(.)/ + acdbcdbe + 0: acdbcdb + 1: b + + /a(?:b|c|d){5,7}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + + /a(?:b|c|d){5,7}?(.)/ + acdbcdbe + 0: acdbcdb + 1: b + + /a(?:b|(c|e){1,2}?|d)+?(.)/ + ace + 0: ace + 1: c + 2: e + + /^(.+)?B/ + AB + 0: AB + 1: A + + /^([^a-z])|(\^)$/ + . + 0: . + 1: . + + /^[<>]&/ + <&OUT + 0: <& + + /^(a\1?){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers + No match + AB + No match + aaaaaaaaa + No match + aaaaaaaaaaa + No match + + /^(a(?(1)\1)){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers + No match + aaaaaaaaa + No match + aaaaaaaaaaa + No match + + /(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + 0: foobar + 1: f + 2: o + 3: o + 4: b + 5: a + 6: r + + /(?<=a)b/ + ab + 0: b + *** Failers + No match + cb + No match + b + No match + + /(?<!c)b/ + ab + 0: b + b + 0: b + b + 0: b + + /(?:..)*a/ + aba + 0: aba + + /(?:..)*?a/ + aba + 0: a + + /^(?:b|a(?=(.)))*\1/ + abc + 0: ab + 1: b + + /^(){3,5}/ + abc + 0: + 1: + + /^(a+)*ax/ + aax + 0: aax + 1: a + + /^((a|b)+)*ax/ + aax + 0: aax + 1: a + 2: a + + /^((a|bc)+)*ax/ + aax + 0: aax + 1: a + 2: a + + /(a|x)*ab/ + cab + 0: ab + + /(a)*ab/ + cab + 0: ab + + /(?:(?i)a)b/ + ab + 0: ab + + /((?i)a)b/ + ab + 0: ab + 1: a + + /(?:(?i)a)b/ + Ab + 0: Ab + + /((?i)a)b/ + Ab + 0: Ab + 1: A + + /(?:(?i)a)b/ + *** Failers + No match + cb + No match + aB + No match + + /((?i)a)b/ + + /(?i:a)b/ + ab + 0: ab + + /((?i:a))b/ + ab + 0: ab + 1: a + + /(?i:a)b/ + Ab + 0: Ab + + /((?i:a))b/ + Ab + 0: Ab + 1: A + + /(?i:a)b/ + *** Failers + No match + aB + No match + aB + No match + + /((?i:a))b/ + + /(?:(?-i)a)b/i + ab + 0: ab + + /((?-i)a)b/i + ab + 0: ab + 1: a + + /(?:(?-i)a)b/i + aB + 0: aB + + /((?-i)a)b/i + aB + 0: aB + 1: a + + /(?:(?-i)a)b/i + *** Failers + No match + aB + 0: aB + Ab + No match + + /((?-i)a)b/i + + /(?:(?-i)a)b/i + aB + 0: aB + + /((?-i)a)b/i + aB + 0: aB + 1: a + + /(?:(?-i)a)b/i + *** Failers + No match + Ab + No match + AB + No match + + /((?-i)a)b/i + + /(?-i:a)b/i + ab + 0: ab + + /((?-i:a))b/i + ab + 0: ab + 1: a + + /(?-i:a)b/i + aB + 0: aB + + /((?-i:a))b/i + aB + 0: aB + 1: a + + /(?-i:a)b/i + *** Failers + No match + AB + No match + Ab + No match + + /((?-i:a))b/i + + /(?-i:a)b/i + aB + 0: aB + + /((?-i:a))b/i + aB + 0: aB + 1: a + + /(?-i:a)b/i + *** Failers + No match + Ab + No match + AB + No match + + /((?-i:a))b/i + + /((?-i:a.))b/i + *** Failers + No match + AB + No match + a\nB + No match + + /((?s-i:a.))b/i + a\nB + 0: a\x0aB + 1: a\x0a + + /(?:c|d)(?:)(?:a(?:)(?:b)(?:b(?:))(?:b(?:)(?:b)))/ + cabbbb + 0: cabbbb + + /(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/ + caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + 0: caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + + /(ab)\d\1/i + Ab4ab + 0: Ab4ab + 1: Ab + ab4Ab + 0: ab4Ab + 1: ab + + /foo\w*\d{4}baz/ + foobar1234baz + 0: foobar1234baz + + /x(~~)*(?:(?:F)?)?/ + x~~ + 0: x~~ + 1: ~~ + + /^a(?#xxx){3}c/ + aaac + 0: aaac + + /^a (?#xxx) (?#yyy) {3}c/x + aaac + 0: aaac + + /(?<![cd])b/ + *** Failers + No match + B\nB + No match + dbcb + No match + + /(?<![cd])[ab]/ + dbaacb + 0: a + + /(?<!(c|d))b/ + + /(?<!(c|d))[ab]/ + dbaacb + 0: a + + /(?<!cd)[ab]/ + cdaccb + 0: b + + /^(?:a?b?)*$/ + *** Failers + No match + dbcb + No match + a-- + No match + + /((?s)^a(.))((?m)^b$)/ + a\nb\nc\n + 0: a\x0ab + 1: a\x0a + 2: \x0a + 3: b + + /((?m)^b$)/ + a\nb\nc\n + 0: b + 1: b + + /(?m)^b/ + a\nb\n + 0: b + + /(?m)^(b)/ + a\nb\n + 0: b + 1: b + + /((?m)^b)/ + a\nb\n + 0: b + 1: b + + /\n((?m)^b)/ + a\nb\n + 0: \x0ab + 1: b + + /((?s).)c(?!.)/ + a\nb\nc\n + 0: \x0ac + 1: \x0a + a\nb\nc\n + 0: \x0ac + 1: \x0a + + /((?s)b.)c(?!.)/ + a\nb\nc\n + 0: b\x0ac + 1: b\x0a + a\nb\nc\n + 0: b\x0ac + 1: b\x0a + + /^b/ + + /()^b/ + *** Failers + No match + a\nb\nc\n + No match + a\nb\nc\n + No match + + /((?m)^b)/ + a\nb\nc\n + 0: b + 1: b + + /(?(1)a|b)/ + + /(?(1)b|a)/ + a + 0: a + + /(x)?(?(1)a|b)/ + *** Failers + No match + a + No match + a + No match + + /(x)?(?(1)b|a)/ + a + 0: a + + /()?(?(1)b|a)/ + a + 0: a + + /()(?(1)b|a)/ + + /()?(?(1)a|b)/ + a + 0: a + 1: + + /^($)?blah(?(1)($))$/ + (blah) + 0: (blah) + 1: ( + 2: ) + blah + 0: blah + *** Failers + No match + a + No match + blah) + No match + (blah + No match + + /^($+)?blah(?(1)($))$/ + (blah) + 0: (blah) + 1: ( + 2: ) + blah + 0: blah + *** Failers + No match + blah) + No match + (blah + No match + + /(?(?!a)a|b)/ + + /(?(?!a)b|a)/ + a + 0: a + + /(?(?=a)b|a)/ + *** Failers + No match + a + No match + a + No match + + /(?(?=a)a|b)/ + a + 0: a + + /(?=(a+?))(\1ab)/ + aaab + 0: aab + 1: a + 2: aab + + /^(?=(a+?))\1ab/ + + /(\w+:)+/ + one: + 0: one: + 1: one: + + /$(?<=^(a))/ + a + 0: + 1: a + + /(?=(a+?))(\1ab)/ + aaab + 0: aab + 1: a + 2: aab + + /^(?=(a+?))\1ab/ + *** Failers + No match + aaab + No match + aaab + No match + + /([\w:]+::)?(\w+)$/ + abcd + 0: abcd + 1: <unset> + 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + + /^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + + /(a*)b+/ + caab + 0: aab + 1: aa + + /([\w:]+::)?(\w+)$/ + abcd + 0: abcd + 1: <unset> + 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + *** Failers + 0: Failers + 1: <unset> + 2: Failers + abcd: + No match + abcd: + No match + + /^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + + /(>a+)ab/ + + /(?>a+)b/ + aaab + 0: aaab + + /([[:]+)/ + a:[b]: + 0: :[ + 1: :[ + + /([[=]+)/ + a=[b]= + 0: =[ + 1: =[ + + /([[.]+)/ + a.[b]. + 0: .[ + 1: .[ + + /((?>a+)b)/ + aaab + 0: aaab + 1: aaab + + /(?>(a+))b/ + aaab + 0: aaab + 1: aaa + + /((?>[^()]+)|$[^()]*$)+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + + /a\Z/ + *** Failers + No match + aaab + No match + a\nb\n + No match + + /b\Z/ + a\nb\n + 0: b + + /b\z/ + + /b\Z/ + a\nb + 0: b + + /b\z/ + a\nb + 0: b + *** Failers + No match + + /^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + 0: a + 1: + abc + 0: abc + 1: + a-b + 0: a-b + 1: + 0-9 + 0: 0-9 + 1: + a.b + 0: a.b + 1: + 5.6.7 + 0: 5.6.7 + 1: + the.quick.brown.fox + 0: the.quick.brown.fox + 1: + a100.b200.300c + 0: a100.b200.300c + 1: + 12-ab.1245 + 0: 12-ab.1245 + 1: + ***Failers + No match + \ + No match + .a + No match + -a + No match + a- + No match + a. + No match + a_b + No match + a.- + No match + a.. + No match + ab..bc + No match + the.quick.brown.fox- + No match + the.quick.brown.fox. + No match + the.quick.brown.fox_ + No match + the.quick.brown.fox+ + No match + + /(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + 0: alphabetabcd + 1: abcd + endingwxyz + 0: endingwxyz + 1: wxyz + *** Failers + No match + a rather long string that doesn't end with one of them + No match + + /word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + 0: word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + No match + + /word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + No match + + /(?<=\d{3}(?!999))foo/ + 999foo + 0: foo + 123999foo + 0: foo + *** Failers + No match + 123abcfoo + No match + + /(?<=(?!...999)\d{3})foo/ + 999foo + 0: foo + 123999foo + 0: foo + *** Failers + No match + 123abcfoo + No match + + /(?<=\d{3}(?!999)...)foo/ + 123abcfoo + 0: foo + 123456foo + 0: foo + *** Failers + No match + 123999foo + No match + + /(?<=\d{3}...)(?<!999)foo/ + 123abcfoo + 0: foo + 123456foo + 0: foo + *** Failers + No match + 123999foo + No match + + /<a[\s]+href[\s]*=[\s]* # find <a href= + ([\"\'])? # find single or double quote + (?(1) (.*?)\1 | ([^\s]+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + 0: <a href=abcd + 1: <unset> + 2: <unset> + 3: abcd + <a href=\"abcd xyz pqr\" cats + 0: <a href="abcd xyz pqr" + 1: " + 2: abcd xyz pqr + <a href=\'abcd xyz pqr\' cats + 0: <a href='abcd xyz pqr' + 1: ' + 2: abcd xyz pqr + + /<a\s+href\s*=\s* # find <a href= + (["'])? # find single or double quote + (?(1) (.*?)\1 | (\S+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + 0: <a href=abcd + 1: <unset> + 2: <unset> + 3: abcd + <a href=\"abcd xyz pqr\" cats + 0: <a href="abcd xyz pqr" + 1: " + 2: abcd xyz pqr + <a href = \'abcd xyz pqr\' cats + 0: <a href = 'abcd xyz pqr' + 1: ' + 2: abcd xyz pqr + + /<a\s+href(?>\s*)=(?>\s*) # find <a href= + (["'])? # find single or double quote + (?(1) (.*?)\1 | (\S+)) # if quote found, match up to next matching + # quote, otherwise match up to next space + /isx + <a href=abcd xyz + 0: <a href=abcd + 1: <unset> + 2: <unset> + 3: abcd + <a href=\"abcd xyz pqr\" cats + 0: <a href="abcd xyz pqr" + 1: " + 2: abcd xyz pqr + <a href = \'abcd xyz pqr\' cats + 0: <a href = 'abcd xyz pqr' + 1: ' + 2: abcd xyz pqr + + /((Z)+|A)*/ + ZABCDEFG + 0: ZA + 1: A + 2: Z + + /(Z()|A)*/ + ZABCDEFG + 0: ZA + 1: A + 2: + + /(Z(())|A)*/ + ZABCDEFG + 0: ZA + 1: A + 2: + 3: + + /((?>Z)+|A)*/ + ZABCDEFG + 0: ZA + 1: A + + /((?>)+|A)*/ + ZABCDEFG + 0: + 1: + + /a*/g + abbab + 0: a + 0: + 0: + 0: a + 0: + 0: + + /^[a-\d]/ + abcde + 0: a + -things + 0: - + 0digit + 0: 0 + *** Failers + No match + bcdef + No match + + /^[\d-a]/ + abcde + 0: a + -things + 0: - + 0digit + 0: 0 + *** Failers + No match + bcdef + No match + + / End of testinput3 / +

ossp-pkg/pcre/pcre_test.d/testoutput4 -> 1.6

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,116 ---- + PCRE version 3.3 01-Aug-2000 + + /^[\w]+/ + *** Failers + No match + �cole + No match + + /^[\w]+/Lfr + �cole + 0: �cole + + /^[\w]+/ + *** Failers + No match + �cole + No match + + /^[\W]+/ + �cole + 0: \xc9 + + /^[\W]+/Lfr + *** Failers + 0: *** + �cole + No match + + /[\b]/ + \b + 0: \x08 + *** Failers + No match + a + No match + + /[\b]/Lfr + \b + 0: \x08 + *** Failers + No match + a + No match + + /^\w+/ + *** Failers + No match + �cole + No match + + /^\w+/Lfr + �cole + 0: �cole + + /(.+)\b(.+)/ + �cole + 0: \xc9cole + 1: \xc9 + 2: cole + + /(.+)\b(.+)/Lfr + *** Failers + 0: *** Failers + 1: *** + 2: Failers + �cole + No match + + /�cole/i + �cole + 0: \xc9cole + *** Failers + No match + �cole + No match + + /�cole/iLfr + �cole + 0: �cole + �cole + 0: �cole + + /\w/IS + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z + + /\w/ISLfr + Capturing subpattern count = 0 + No options + No first char + No need char + Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z + � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � + � � � � � � � � � � � � � � � � � � � � � � � � � + + /^[\xc8-\xc9]/iLfr + �cole + 0: � + �cole + 0: � + + /^[\xc8-\xc9]/Lfr + �cole + 0: � + *** Failers + No match + �cole + No match + + / End of testinput4 / +

ossp-pkg/pcre/pcre_test.d/testoutput5 -> 1.1

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,242 ---- + PCRE version 3.3 01-Aug-2000 + + /-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/ + /-- strings automatically, do not use the \x{} construct except with --/ + No match + /-- patterns that have the /8 option set, and don't use them without! --/ + No match + + /a.b/8 + acb + 0: acb + a\x7fb + 0: a\x{7f}b + a\x{100}b + 0: a\x{100}b + *** Failers + No match + a\nb + No match + + /a(.{3})b/8 + a\x{4000}xyb + 0: a\x{4000}xyb + 1: \x{4000}xy + a\x{4000}\x7fyb + 0: a\x{4000}\x{7f}yb + 1: \x{4000}\x{7f}y + a\x{4000}\x{100}yb + 0: a\x{4000}\x{100}yb + 1: \x{4000}\x{100}y + *** Failers + No match + a\x{4000}b + No match + ac\ncb + No match + + /a(.*?)(.)/ + a\xc0\x88b + 0: a\xc0 + 1: + 2: \xc0 + + /a(.*?)(.)/8 + a\x{100}b + 0: a\x{100} + 1: + 2: \x{100} + + /a(.*)(.)/ + a\xc0\x88b + 0: a\xc0\x88b + 1: \xc0\x88 + 2: b + + /a(.*)(.)/8 + a\x{100}b + 0: a\x{100}b + 1: \x{100} + 2: b + + /a(.)(.)/ + a\xc0\x92bcd + 0: a\xc0\x92 + 1: \xc0 + 2: \x92 + + /a(.)(.)/8 + a\x{240}bcd + 0: a\x{240}b + 1: \x{240} + 2: b + + /a(.?)(.)/ + a\xc0\x92bcd + 0: a\xc0\x92 + 1: \xc0 + 2: \x92 + + /a(.?)(.)/8 + a\x{240}bcd + 0: a\x{240}b + 1: \x{240} + 2: b + + /a(.??)(.)/ + a\xc0\x92bcd + 0: a\xc0 + 1: + 2: \xc0 + + /a(.??)(.)/8 + a\x{240}bcd + 0: a\x{240} + 1: + 2: \x{240} + + /a(.{3})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + *** Failers + No match + a\x{1234}b + No match + ac\ncb + No match + + /a(.{3,})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxbcdefghijb + 1: xxxxbcdefghij + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + *** Failers + No match + a\x{1234}b + No match + + /a(.{3,}?)b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + *** Failers + No match + a\x{1234}b + No match + + /a(.{3,5})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + axbxxbcdefghijb + 0: axbxxb + 1: xbxx + axxxxxbcdefghijb + 0: axxxxxb + 1: xxxxx + *** Failers + No match + a\x{1234}b + No match + axxxxxxbcdefghijb + No match + + /a(.{3,5}?)b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + axbxxbcdefghijb + 0: axbxxb + 1: xbxx + axxxxxbcdefghijb + 0: axxxxxb + 1: xxxxx + *** Failers + No match + a\x{1234}b + No match + axxxxxxbcdefghijb + No match + + /^[a\x{c0}]/8 + *** Failers + No match + \x{100} + No match + + /(?<=aXb)cd/8 + aXbcd + 0: cd + + /(?<=a\x{100}b)cd/8 + a\x{100}bcd + 0: cd + + /(?<=a\x{100000}b)cd/8 + a\x{100000}bcd + 0: cd + + /(?:\x{100}){3}b/8 + \x{100}\x{100}\x{100}b + 0: \x{100}\x{100}\x{100}b + *** Failers + No match + \x{100}\x{100}b + No match + + / End of testinput5 / +

ossp-pkg/pcre/pcre_test.d/testoutput6 -> 1.1

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,185 ---- + PCRE version 3.3 01-Aug-2000 + + /\x{100}/8DM + Memory allocation (code space): 11 + ------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc0\x88 + 7 7 Ket + 10 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 192 + Need char = 136 + + /\x{1000}/8DM + Memory allocation (code space): 12 + ------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 \xe0\x80\x84 + 8 8 Ket + 11 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 224 + Need char = 132 + + /\x{10000}/8DM + Memory allocation (code space): 13 + ------------------------------------------------------------------ + 0 9 Bra 0 + 3 4 \xf0\x80\x80\x82 + 9 9 Ket + 12 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 240 + Need char = 130 + + /\x{100000}/8DM + Memory allocation (code space): 13 + ------------------------------------------------------------------ + 0 9 Bra 0 + 3 4 \xf0\x80\x80\xa0 + 9 9 Ket + 12 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 240 + Need char = 160 + + /\x{1000000}/8DM + Memory allocation (code space): 14 + ------------------------------------------------------------------ + 0 10 Bra 0 + 3 5 \xf8\x80\x80\x80\x90 + 10 10 Ket + 13 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 248 + Need char = 144 + + /\x{4000000}/8DM + Memory allocation (code space): 15 + ------------------------------------------------------------------ + 0 11 Bra 0 + 3 6 \xfc\x80\x80\x80\x80\x82 + 11 11 Ket + 14 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 252 + Need char = 130 + + /\x{7fffFFFF}/8DM + Memory allocation (code space): 15 + ------------------------------------------------------------------ + 0 11 Bra 0 + 3 6 \xfd\xbf\xbf\xbf\xbf\xbf + 11 11 Ket + 14 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 253 + Need char = 191 + + /[\x{ff}]/8DM + Memory allocation (code space): 40 + ------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 \xff + 6 6 Ket + 9 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 255 + No need char + + /[\x{100}]/8DM + Memory allocation (code space): 40 + Failed: characters with values > 255 are not yet supported in classes at offset 7 + + /\x{ffffffff}/8 + Failed: character value in \x{...} sequence is too large at offset 11 + + /\x{100000000}/8 + Failed: character value in \x{...} sequence is too large at offset 12 + + /^\x{100}a\x{1234}/8 + \x{100}a\x{1234}bcd + 0: \x{100}a\x{1234} + + /\x80/8D + ------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc0\x84 + 7 7 Ket + 10 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 192 + Need char = 132 + + /\xff/8D + ------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xdf\x87 + 7 7 Ket + 10 End + ------------------------------------------------------------------ + Capturing subpattern count = 0 + Options: utf8 + First char = 223 + Need char = 135 + + /-- These tests are here rather than in testinput5 because Perl 5.6 has --/ + /-- some problems with UTF-8 support, in the area of \x{..} where the --/ + No match + /-- value is < 255. It grumbles about invalid UTF-8 strings. --/ + No match + + /^[a\x{c0}]b/8 + \x{c0}b + 0: \x{c0}b + + /^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aa + 1: a\x{c0} + + /^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aa + 1: a\x{c0} + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0}aa + 1: a\x{c0}a\x{c0} + + /^([a\x{c0}]*)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aaaa + 1: a\x{c0}aa + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0}aaa + 1: a\x{c0}a\x{c0}a + + /^([a\x{c0}]*)a\x{c0}/8 + a\x{c0}aaaa/ + 0: a\x{c0} + 1: + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0} + 1: a\x{c0} + + / End of testinput6 / +

ossp-pkg/pcre/pcreposix.3 -> 1.2

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,149 ---- + .TH PCRE 3 + .SH NAME + pcreposix - POSIX API for Perl-compatible regular expressions. + .SH SYNOPSIS + .B #include <pcreposix.h> + .PP + .SM + .br + .B int regcomp(regex_t *\fIpreg\fR, const char *\fIpattern\fR, + .ti +5n + .B int \fIcflags\fR); + .PP + .br + .B int regexec(regex_t *\fIpreg\fR, const char *\fIstring\fR, + .ti +5n + .B size_t \fInmatch\fR, regmatch_t \fIpmatch\fR[], int \fIeflags\fR); + .PP + .br + .B size_t regerror(int \fIerrcode\fR, const regex_t *\fIpreg\fR, + .ti +5n + .B char *\fIerrbuf\fR, size_t \fIerrbuf_size\fR); + .PP + .br + .B void regfree(regex_t *\fIpreg\fR); + + + .SH DESCRIPTION + This set of functions provides a POSIX-style API to the PCRE regular expression + package. See the \fBpcre\fR documentation for a description of the native API, + which contains additional functionality. + + The functions described here are just wrapper functions that ultimately call + the native API. Their prototypes are defined in the \fBpcreposix.h\fR header + file, and on Unix systems the library itself is called \fBpcreposix.a\fR, so + can be accessed by adding \fB-lpcreposix\fR to the command for linking an + application which uses them. Because the POSIX functions call the native ones, + it is also necessary to add \fR-lpcre\fR. + + As I am pretty ignorant about POSIX, these functions must be considered as + experimental. I have implemented only those option bits that can be reasonably + mapped to PCRE native options. Other POSIX options are not even defined. It may + be that it is useful to define, but ignore, other options. Feedback from more + knowledgeable folk may cause this kind of detail to change. + + When PCRE is called via these functions, it is only the API that is POSIX-like + in style. The syntax and semantics of the regular expressions themselves are + still those of Perl, subject to the setting of various PCRE options, as + described below. + + The header for these functions is supplied as \fBpcreposix.h\fR to avoid any + potential clash with other POSIX libraries. It can, of course, be renamed or + aliased as \fBregex.h\fR, which is the "correct" name. It provides two + structure types, \fIregex_t\fR for compiled internal forms, and + \fIregmatch_t\fR for returning captured substrings. It also defines some + constants whose names start with "REG_"; these are used for setting options and + identifying error codes. + + + .SH COMPILING A PATTERN + + The function \fBregcomp()\fR is called to compile a pattern into an + internal form. The pattern is a C string terminated by a binary zero, and + is passed in the argument \fIpattern\fR. The \fIpreg\fR argument is a pointer + to a regex_t structure which is used as a base for storing information about + the compiled expression. + + The argument \fIcflags\fR is either zero, or contains one or more of the bits + defined by the following macros: + + REG_ICASE + + The PCRE_CASELESS option is set when the expression is passed for compilation + to the native function. + + REG_NEWLINE + + The PCRE_MULTILINE option is set when the expression is passed for compilation + to the native function. + + In the absence of these flags, no options are passed to the native function. + This means the the regex is compiled with PCRE default semantics. In + particular, the way it handles newline characters in the subject string is the + Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only + \fIsome\fR of the effects specified for REG_NEWLINE. It does not affect the way + newlines are matched by . (they aren't) or a negative class such as [^a] (they + are). + + The yield of \fBregcomp()\fR is zero on success, and non-zero otherwise. The + \fIpreg\fR structure is filled in on success, and one member of the structure + is publicized: \fIre_nsub\fR contains the number of capturing subpatterns in + the regular expression. Various error codes are defined in the header file. + + + .SH MATCHING A PATTERN + The function \fBregexec()\fR is called to match a pre-compiled pattern + \fIpreg\fR against a given \fIstring\fR, which is terminated by a zero byte, + subject to the options in \fIeflags\fR. These can be: + + REG_NOTBOL + + The PCRE_NOTBOL option is set when calling the underlying PCRE matching + function. + + REG_NOTEOL + + The PCRE_NOTEOL option is set when calling the underlying PCRE matching + function. + + The portion of the string that was matched, and also any captured substrings, + are returned via the \fIpmatch\fR argument, which points to an array of + \fInmatch\fR structures of type \fIregmatch_t\fR, containing the members + \fIrm_so\fR and \fIrm_eo\fR. These contain the offset to the first character of + each substring and the offset to the first character after the end of each + substring, respectively. The 0th element of the vector relates to the entire + portion of \fIstring\fR that was matched; subsequent elements relate to the + capturing subpatterns of the regular expression. Unused entries in the array + have both structure members set to -1. + + A successful match yields a zero return; various error codes are defined in the + header file, of which REG_NOMATCH is the "expected" failure code. + + + .SH ERROR MESSAGES + The \fBregerror()\fR function maps a non-zero errorcode from either + \fBregcomp\fR or \fBregexec\fR to a printable message. If \fIpreg\fR is not + NULL, the error should have arisen from the use of that structure. A message + terminated by a binary zero is placed in \fIerrbuf\fR. The length of the + message, including the zero, is limited to \fIerrbuf_size\fR. The yield of the + function is the size of buffer needed to hold the whole message. + + + .SH STORAGE + Compiling a regular expression causes memory to be allocated and associated + with the \fIpreg\fR structure. The function \fBregfree()\fR frees all such + memory, after which \fIpreg\fR may no longer be used as a compiled expression. + + + .SH AUTHOR + Philip Hazel <ph10@cam.ac.uk> + .br + University Computing Service, + .br + New Museums Site, + .br + Cambridge CB2 3QG, England. + .br + Phone: +44 1223 334714 + + Copyright (c) 1997-2000 University of Cambridge.

ossp-pkg/pcre/pcreposix.c -> 1.3

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,279 ---- + /************************************************* + * Perl-Compatible Regular Expressions * + *************************************************/ + + /* + This is a library of functions to support regular expressions whose syntax + and semantics are as close as possible to those of the Perl 5 language. See + the file Tech.Notes for some information on the internals. + + This module is a wrapper that provides a POSIX API to the underlying PCRE + functions. + + Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-2000 University of Cambridge + + ----------------------------------------------------------------------------- + Permission is granted to anyone to use this software for any purpose on any + computer system, and to redistribute it freely, subject to the following + restrictions: + + 1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + 2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + + 3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + + 4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + ----------------------------------------------------------------------------- + */ + + #include "pcre_internal.h" + #include "pcreposix.h" + + #include <stdlib.h> + + + /* Corresponding tables of PCRE error messages and POSIX error codes. */ + + static const char *estring[] = { + ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR10, + ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20, + ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR29, ERR29, ERR30, + ERR31 }; + + static int eint[] = { + REG_EESCAPE, /* "\\ at end of pattern" */ + REG_EESCAPE, /* "\\c at end of pattern" */ + REG_EESCAPE, /* "unrecognized character follows \\" */ + REG_BADBR, /* "numbers out of order in {} quantifier" */ + REG_BADBR, /* "number too big in {} quantifier" */ + REG_EBRACK, /* "missing terminating ] for character class" */ + REG_ECTYPE, /* "invalid escape sequence in character class" */ + REG_ERANGE, /* "range out of order in character class" */ + REG_BADRPT, /* "nothing to repeat" */ + REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */ + REG_ASSERT, /* "internal error: unexpected repeat" */ + REG_BADPAT, /* "unrecognized character after (?" */ + REG_ESIZE, /* "too many capturing parenthesized sub-patterns" */ + REG_EPAREN, /* "missing )" */ + REG_ESUBREG, /* "back reference to non-existent subpattern" */ + REG_INVARG, /* "erroffset passed as NULL" */ + REG_INVARG, /* "unknown option bit(s) set" */ + REG_EPAREN, /* "missing ) after comment" */ + REG_ESIZE, /* "too many sets of parentheses" */ + REG_ESIZE, /* "regular expression too large" */ + REG_ESPACE, /* "failed to get memory" */ + REG_EPAREN, /* "unmatched brackets" */ + REG_ASSERT, /* "internal error: code overflow" */ + REG_BADPAT, /* "unrecognized character after (?<" */ + REG_BADPAT, /* "lookbehind assertion is not fixed length" */ + REG_BADPAT, /* "malformed number after (?(" */ + REG_BADPAT, /* "conditional group containe more than two branches" */ + REG_BADPAT, /* "assertion expected after (?(" */ + REG_BADPAT, /* "(?p must be followed by )" */ + REG_ECTYPE, /* "unknown POSIX class name" */ + REG_BADPAT, /* "POSIX collating elements are not supported" */ + REG_INVARG, /* "this version of PCRE is not compiled with PCRE_UTF8 support" */ + REG_BADPAT, /* "characters with values > 255 are not yet supported in classes" */ + REG_BADPAT /* "character value in \x{...} sequence is too large" */ + }; + + /* Table of texts corresponding to POSIX error codes */ + + static const char *pstring[] = { + "", /* Dummy for value 0 */ + "internal error", /* REG_ASSERT */ + "invalid repeat counts in {}", /* BADBR */ + "pattern error", /* BADPAT */ + "? * + invalid", /* BADRPT */ + "unbalanced {}", /* EBRACE */ + "unbalanced []", /* EBRACK */ + "collation error - not relevant", /* ECOLLATE */ + "bad class", /* ECTYPE */ + "bad escape sequence", /* EESCAPE */ + "empty expression", /* EMPTY */ + "unbalanced ()", /* EPAREN */ + "bad range inside []", /* ERANGE */ + "expression too big", /* ESIZE */ + "failed to get memory", /* ESPACE */ + "bad back reference", /* ESUBREG */ + "bad argument", /* INVARG */ + "match failed" /* NOMATCH */ + }; + + + + + /************************************************* + * Translate PCRE text code to int * + *************************************************/ + + /* PCRE compile-time errors are given as strings defined as macros. We can just + look them up in a table to turn them into POSIX-style error codes. */ + + static int + pcre_posix_error_code(const char *s) + { + size_t i; + for (i = 0; i < sizeof(estring)/sizeof(char *); i++) + if (strcmp(s, estring[i]) == 0) return eint[i]; + return REG_ASSERT; + } + + + + /************************************************* + * Translate error code to string * + *************************************************/ + + size_t + regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size) + { + const char *message, *addmessage; + size_t length, addlength; + + message = (errcode >= (int)(sizeof(pstring)/sizeof(char *)))? + "unknown error code" : pstring[errcode]; + length = strlen(message) + 1; + + addmessage = " at offset "; + addlength = (preg != NULL && (int)preg->re_erroffset != -1)? + strlen(addmessage) + 6 : 0; + + if (errbuf_size > 0) + { + if (addlength > 0 && errbuf_size >= length + addlength) + sprintf(errbuf, "%s%s%-6d", message, addmessage, (int)preg->re_erroffset); + else + { + strncpy(errbuf, message, errbuf_size - 1); + errbuf[errbuf_size-1] = 0; + } + } + + return length + addlength; + } + + + + + /************************************************* + * Free store held by a regex * + *************************************************/ + + void + regfree(regex_t *preg) + { + (pcre_free)(preg->re_pcre); + } + + + + + /************************************************* + * Compile a regular expression * + *************************************************/ + + /* + Arguments: + preg points to a structure for recording the compiled expression + pattern the pattern to compile + cflags compilation flags + + Returns: 0 on success + various non-zero codes on failure + */ + + int + regcomp(regex_t *preg, const char *pattern, int cflags) + { + const char *errorptr; + int erroffset; + int options = 0; + + if ((cflags & REG_ICASE) != 0) options |= PCRE_CASELESS; + if ((cflags & REG_NEWLINE) != 0) options |= PCRE_MULTILINE; + + preg->re_pcre = pcre_compile(pattern, options, &errorptr, &erroffset, NULL); + preg->re_erroffset = erroffset; + + if (preg->re_pcre == NULL) return pcre_posix_error_code(errorptr); + + preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL); + return 0; + } + + + + + /************************************************* + * Match a regular expression * + *************************************************/ + + /* Unfortunately, PCRE requires 3 ints of working space for each captured + substring, so we have to get and release working store instead of just using + the POSIX structures as was done in earlier releases when PCRE needed only 2 + ints. */ + + int + regexec(regex_t *preg, const char *string, size_t nmatch, + regmatch_t pmatch[], int eflags) + { + int rc; + int options = 0; + int *ovector = NULL; + + if ((eflags & REG_NOTBOL) != 0) options |= PCRE_NOTBOL; + if ((eflags & REG_NOTEOL) != 0) options |= PCRE_NOTEOL; + + preg->re_erroffset = (size_t)(-1); /* Only has meaning after compile */ + + if (nmatch > 0) + { + ovector = (int *)malloc(sizeof(int) * nmatch * 3); + if (ovector == NULL) return REG_ESPACE; + } + + rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), 0, options, + ovector, nmatch * 3); + + if (rc == 0) rc = nmatch; /* All captured slots were filled in */ + + if (rc >= 0) + { + size_t i; + for (i = 0; i < rc; i++) + { + pmatch[i].rm_so = ovector[i*2]; + pmatch[i].rm_eo = ovector[i*2+1]; + } + if (ovector != NULL) free(ovector); + for (; i < nmatch; i++) pmatch[i].rm_so = pmatch[i].rm_eo = -1; + return 0; + } + + else + { + if (ovector != NULL) free(ovector); + switch(rc) + { + case PCRE_ERROR_NOMATCH: return REG_NOMATCH; + case PCRE_ERROR_NULL: return REG_INVARG; + case PCRE_ERROR_BADOPTION: return REG_INVARG; + case PCRE_ERROR_BADMAGIC: return REG_INVARG; + case PCRE_ERROR_UNKNOWN_NODE: return REG_ASSERT; + case PCRE_ERROR_NOMEMORY: return REG_ESPACE; + default: return REG_ASSERT; + } + } + } + + /* End of pcreposix.c */

ossp-pkg/pcre/pgrep.1 -> 1.2

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:58 2026 *************** *** 0 **** --- 1,76 ---- + .TH PCREGREP 1 + .SH NAME + pcregrep - a grep with Perl-compatible regular expressions. + .SH SYNOPSIS + .B pcregrep [-Vchilnsvx] pattern [file] ... + + + .SH DESCRIPTION + \fBpcregrep\fR searches files for character patterns, in the same way as other + grep commands do, but it uses the PCRE regular expression library to support + patterns that are compatible with the regular expressions of Perl 5. See + \fBpcre(3)\fR for a full description of syntax and semantics. + + If no files are specified, \fBpcregrep\fR reads the standard input. By default, + each line that matches the pattern is copied to the standard output, and if + there is more than one file, the file name is printed before each line of + output. However, there are options that can change how \fBpcregrep\fR behaves. + + Lines are limited to BUFSIZ characters. BUFSIZ is defined in \fB<stdio.h>\fR. + The newline character is removed from the end of each line before it is matched + against the pattern. + + + .SH OPTIONS + .TP 10 + \fB-V\fR + Write the version number of the PCRE library being used to the standard error + stream. + .TP + \fB-c\fR + Do not print individual lines; instead just print a count of the number of + lines that would otherwise have been printed. If several files are given, a + count is printed for each of them. + .TP + \fB-h\fR + Suppress printing of filenames when searching multiple files. + .TP + \fB-i\fR + Ignore upper/lower case distinctions during comparisons. + .TP + \fB-l\fR + Instead of printing lines from the files, just print the names of the files + containing lines that would have been printed. Each file name is printed + once, on a separate line. + .TP + \fB-n\fR + Precede each line by its line number in the file. + .TP + \fB-s\fR + Work silently, that is, display nothing except error messages. + The exit status indicates whether any matches were found. + .TP + \fB-v\fR + Invert the sense of the match, so that lines which do \fInot\fR match the + pattern are now the ones that are found. + .TP + \fB-x\fR + Force the pattern to be anchored (it must start matching at the beginning of + the line) and in addition, require it to match the entire line. This is + equivalent to having ^ and $ characters at the start and end of each + alternative branch in the regular expression. + + + .SH SEE ALSO + \fBpcre(3)\fR, Perl 5 documentation + + + .SH DIAGNOSTICS + Exit status is 0 if any matches were found, 1 if no matches were found, and 2 + for syntax errors or inacessible files (even if matches were found). + + + .SH AUTHOR + Philip Hazel <ph10@cam.ac.uk> + .br + Copyright (c) 1997-2000 University of Cambridge.

ossp-pkg/pcre/pgrep.c -> 1.3

*** /dev/null Mon May 25 16:34:34 2026 --- - Mon May 25 16:34:59 2026 *************** *** 0 **** --- 1,228 ---- + /************************************************* + * pcregrep program * + *************************************************/ + + /* This is a grep program that uses the PCRE regular expression library to do + its pattern matching. */ + + #include <stdio.h> + #include <string.h> + #include <stdlib.h> + #include <errno.h> + #include "config.h" + #include "pcre.h" + + #define FALSE 0 + #define TRUE 1 + + typedef int BOOL; + + + + /************************************************* + * Global variables * + *************************************************/ + + static pcre *pattern; + static pcre_extra *hints; + + static BOOL count_only = FALSE; + static BOOL filenames_only = FALSE; + static BOOL invert = FALSE; + static BOOL number = FALSE; + static BOOL silent = FALSE; + static BOOL whole_lines = FALSE; + + + + #if ! HAVE_STRERROR + /************************************************* + * Provide strerror() for non-ANSI libraries * + *************************************************/ + + /* Some old-fashioned systems still around (e.g. SunOS4) don't have strerror() + in their libraries, but can provide the same facility by this simple + alternative function. */ + + extern int sys_nerr; + extern char *sys_errlist[]; + + char * + strerror(int n) + { + if (n < 0 || n >= sys_nerr) return "unknown error number"; + return sys_errlist[n]; + } + #endif /* HAVE_STRERROR */ + + + + /************************************************* + * Grep an individual file * + *************************************************/ + + static int + pcregrep(FILE *in, char *name) + { + int rc = 1; + int linenumber = 0; + int count = 0; + int offsets[99]; + char buffer[BUFSIZ]; + + while (fgets(buffer, sizeof(buffer), in) != NULL) + { + BOOL match; + int length = (int)strlen(buffer); + if (length > 0 && buffer[length-1] == '\n') buffer[--length] = 0; + linenumber++; + + match = pcre_exec(pattern, hints, buffer, length, 0, 0, offsets, 99) >= 0; + if (match && whole_lines && offsets[1] != length) match = FALSE; + + if (match != invert) + { + if (count_only) count++; + + else if (filenames_only) + { + fprintf(stdout, "%s\n", (name == NULL)? "<stdin>" : name); + return 0; + } + + else if (silent) return 0; + + else + { + if (name != NULL) fprintf(stdout, "%s:", name); + if (number) fprintf(stdout, "%d:", linenumber); + fprintf(stdout, "%s\n", buffer); + } + + rc = 0; + } + } + + if (count_only) + { + if (name != NULL) fprintf(stdout, "%s:", name); + fprintf(stdout, "%d\n", count); + } + + return rc; + } + + + + + /************************************************* + * Usage function * + *************************************************/ + + static int + usage(int rc) + { + fprintf(stderr, "Usage: pcregrep [-Vchilnsvx] pattern [file] ...\n"); + return rc; + } + + + + + /************************************************* + * Main program * + *************************************************/ + + int + main(int argc, char **argv) + { + int i; + int rc = 1; + int options = 0; + int errptr; + const char *error; + BOOL filenames = TRUE; + + /* Process the options */ + + for (i = 1; i < argc; i++) + { + char *s; + if (argv[i][0] != '-') break; + s = argv[i] + 1; + while (*s != 0) + { + switch (*s++) + { + case 'c': count_only = TRUE; break; + case 'h': filenames = FALSE; break; + case 'i': options |= PCRE_CASELESS; break; + case 'l': filenames_only = TRUE; + case 'n': number = TRUE; break; + case 's': silent = TRUE; break; + case 'v': invert = TRUE; break; + case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break; + + case 'V': + fprintf(stderr, "PCRE version %s\n", pcre_version()); + break; + + default: + fprintf(stderr, "pcregrep: unknown option %c\n", s[-1]); + return usage(2); + } + } + } + + /* There must be at least a regexp argument */ + + if (i >= argc) return usage(0); + + /* Compile the regular expression. */ + + pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL); + if (pattern == NULL) + { + fprintf(stderr, "pcregrep: error in regex at offset %d: %s\n", errptr, error); + return 2; + } + + /* Study the regular expression, as we will be running it may times */ + + hints = pcre_study(pattern, 0, &error); + if (error != NULL) + { + fprintf(stderr, "pcregrep: error while studing regex: %s\n", error); + return 2; + } + + /* If there are no further arguments, do the business on stdin and exit */ + + if (i >= argc) return pcregrep(stdin, NULL); + + /* Otherwise, work through the remaining arguments as files. If there is only + one, don't give its name on the output. */ + + if (i == argc - 1) filenames = FALSE; + if (filenames_only) filenames = TRUE; + + for (; i < argc; i++) + { + FILE *in = fopen(argv[i], "r"); + if (in == NULL) + { + fprintf(stderr, "%s: failed to open: %s\n", argv[i], strerror(errno)); + rc = 2; + } + else + { + int frc = pcregrep(in, filenames? argv[i] : NULL); + if (frc == 0 && rc == 1) rc = 0; + fclose(in); + } + } + + return rc; + } + + /* End */

ossp-pkg/pcre/shtool 1.5 -> 1.6

--- shtool 2000/07/07 06:18:50 1.5 +++ shtool 2000/08/02 09:46:06 1.6 @@ -6,7 +6,7 @@ ## See http://www.gnu.org/software/shtool/ for more information. ## See ftp://ftp.gnu.org/gnu/shtool/ for latest version. ## -## Version: 1.5.0 (01-Jul-2000) +## Version: 1.5.1 (29-Jul-2000) ## Contents: 6/17 available modules ## @@ -65,7 +65,7 @@ exit 1 fi if [ ".$1" = ".-h" -o ".$1" = ".--help" ]; then - echo "This is GNU shtool, version 1.5.0 (01-Jul-2000)" + echo "This is GNU shtool, version 1.5.1 (29-Jul-2000)" echo "Copyright (c) 1994-2000 Ralf S. Engelschall <rse@engelschall.com>" echo "Report bugs to <bug-shtool@gnu.org>" echo '' @@ -105,7 +105,7 @@ exit 0 fi if [ ".$1" = ".-v" -o ".$1" = ."--version" ]; then - echo "GNU shtool 1.5.0 (01-Jul-2000)" + echo "GNU shtool 1.5.1 (29-Jul-2000)" exit 0 fi if [ ".$1" = ".-r" -o ".$1" = ."--recreate" ]; then

OSSP CVS Repository