This is the command re2c that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
PROGRAM:
NAME
re2c - convert regular expressions to C/C++ code
SYNOPSIS
re2c [OPTIONS] FILE
DESCRIPTION
re2c is a lexer generator for C/C++. It finds regular expression specifications inside of
C/C++ comments and replaces them with a hard-coded DFA. The user must supply some
interface code in order to control and customize the generated DFA.
OPTIONS
-? -h --help
Invoke a short help.
-b --bit-vectors
Implies -s. Use bit vectors as well in the attempt to coax better code out of the
compiler. Most useful for specifications with more than a few keywords (e.g. for
most programming languages).
-c --conditions
Used to support (f)lex-like condition support.
-d --debug-output
Creates a parser that dumps information about the current position and in which
state the parser is while parsing the input. This is useful to debug parser issues
and states. If you use this switch you need to define a macro YYDEBUG that is
called like a function with two parameters: void YYDEBUG (int state, char current).
The first parameter receives the state or -1 and the second parameter receives the
input at the current cursor.
-D --emit-dot
Emit Graphviz dot data. It can then be processed with e.g. dot -Tpng input.dot >
output.png. Please note that scanners with many states may crash dot.
-e --ecb
Generate a parser that supports EBCDIC. The generated code can deal with any
character up to 0xFF. In this mode re2c assumes that input character size is 1
byte. This switch is incompatible with -w, -x, -u and -8.
-f --storable-state
Generate a scanner with support for storable state.
-F --flex-syntax
Partial support for flex syntax. When this flag is active then named definitions
must be surrounded by curly braces and can be defined without an equal sign and the
terminating semi colon. Instead names are treated as direct double quoted strings.
-g --computed-gotos
Generate a scanner that utilizes GCC's computed goto feature. That is re2c
generates jump tables whenever a decision is of a certain complexity (e.g. a lot of
if conditions are otherwise necessary). This is only useable with GCC and produces
output that cannot be compiled with any other compiler. Note that this implies -b
and that the complexity threshold can be configured using the inplace configuration
cgoto:threshold.
-i --no-debug-info
Do not output #line information. This is useful when you want use a CMS tool with
the re2c output which you might want if you do not require your users to have re2c
themselves when building from your source.
-o OUTPUT --output=OUTPUT
Specify the OUTPUT file.
-r --reusable
Allows reuse of scanner definitions with /*!use:re2c */ after /*!rules:re2c */. In
this mode no /*!re2c */ block and exactly one /*!rules:re2c */ must be present.
The rules are being saved and used by every /*!use:re2c */ block that follows.
These blocks can contain inplace configurations, especially re2c:flags:e,
re2c:flags:w, re2c:flags:x, re2c:flags:u and re2c:flags:8. That way it is possible
to create the same scanner multiple times for different character types, different
input mechanisms or different output mechanisms. The /*!use:re2c */ blocks can
also contain additional rules that will be appended to the set of rules in
/*!rules:re2c */.
-s --nested-ifs
Generate nested ifs for some switches. Many compilers need this assist to generate
better code.
-t HEADER --type-header=HEADER
Create a HEADER file that contains types for the (f)lex-like condition support.
This can only be activated when -c is in use.
-u --unicode
Generate a parser that supports UTF-32. The generated code can deal with any valid
Unicode character up to 0x10FFFF. In this mode re2c assumes that input character
size is 4 bytes. This switch is incompatible with -e, -w, -x and -8. This implies
-s.
-v --version
Show version information.
-V --vernum
Show the version as a number XXYYZZ.
-w --wide-chars
Generate a parser that supports UCS-2. The generated code can deal with any valid
Unicode character up to 0xFFFF. In this mode re2c assumes that input character
size is 2 bytes. This switch is incompatible with -e, -x, -u and -8. This implies
-s.
-x --utf-16
Generate a parser that supports UTF-16. The generated code can deal with any valid
Unicode character up to 0x10FFFF. In this mode re2c assumes that input character
size is 2 bytes. This switch is incompatible with -e, -w, -u and -8. This implies
-s.
-8 --utf-8
Generate a parser that supports UTF-8. The generated code can deal with any valid
Unicode character up to 0x10FFFF. In this mode re2c assumes that input character
size is 1 byte. This switch is incompatible with -e, -w, -x and -u.
--case-insensitive
All strings are case insensitive, so all "-expressions are treated in the same way
'-expressions are.
--case-inverted
Invert the meaning of single and double quoted strings. With this switch single
quotes are case sensitive and double quotes are case insensitive.
--no-generation-date
Suppress date output in the generated file.
--no-generation-date
Suppress version output in the generated file.
--encoding-policy POLICY
Specify how re2c must treat Unicode surrogates. POLICY can be one of the following:
fail (abort with error when surrogate encountered), substitute (silently substitute
surrogate with error code point 0xFFFD), ignore (treat surrogates as normal code
points). By default re2c ignores surrogates (for backward compatibility). Unicode
standard says that standalone surrogates are invalid code points, but different
libraries and programs treat them differently.
--input INPUT
Specify re2c input API. INPUT can be one of the following: default, custom.
-S --skeleton
Instead of embedding re2c-generated code into C/C++ source, generate a
self-contained program for the same DFA. Most useful for correctness and
performance testing.
--empty-class POLICY
What to do if user inputs empty character class. POLICY can be one of the
following: match-empty (match empty input: pretty illogical, but this is the
default for backwards compatibility reason), match-none (fail to match on any
input), error (compilation error). Note that there are various ways to construct
empty class, e.g: [], [^\x00-\xFF], [\x00-\xFF][\x00-\xFF].
--dfa-minimization <table | moore>
Internal algorithm used by re2c to minimize DFA (defaults to moore). Both table
filling and Moore's algorithms should produce identical DFA (up to states
relabelling). Table filling algorithm is much simpler and slower; it serves as a
reference implementation.
-1 --single-pass
Deprecated and does nothing (single pass is by default now).
-W Turn on all warnings.
-Werror
Turn warnings into errors. Note that this option along doesn't turn on any
warnings, it only affects those warnings that have been turned on so far or will be
turned on later.
-W<warning>
Turn on individual warning.
-Wno-<warning>
Turn off individual warning.
-Werror-<warning>
Turn on individual warning and treat it as error (this implies -W<warning>).
-Wno-error-<warning>
Don't treat this particular warning as error. This doesn't turn off the warning
itself.
-Wcondition-order
Warn if the generated program makes implicit assumptions about condition numbering.
One should use either -t, --type-header option or /*!types:re2c*/ directive to
generate mapping of condition names to numbers and use autogenerated condition
names.
-Wempty-character-class
Warn if regular expression contains empty character class. From the rational point
of view trying to match empty character class makes no sense: it should always
fail. However, for backwards compatibility reasons re2c allows empty character
class and treats it as empty string. Use --empty-class option to change default
behaviour.
-Wmatch-empty-string
Warn if regular expression in a rule is nullable (matches empty string). If DFA
runs in a loop and empty match is unintentional (input position in not advanced
manually), lexer may get stuck in eternal loop.
-Wswapped-range
Warn if range lower bound is greater that upper bound. Default re2c behaviour is to
silently swap range bounds.
-Wundefined-control-flow
Warn if some input strings cause undefined control flow in lexer (the faulty
patterns are reported). This is the most dangerous and common mistake. It can be
easily fixed by adding default rule * (this rule has the lowest priority, matches
any code unit and consumes exactly one code unit).
-Wuseless-escape
Warn if a symbol is escaped when it shouldn't be. By default re2c silently ignores
escape, but this may as well indicate a typo or an error in escape sequence.
INTERFACE CODE
The user must supply interface code either in the form of C/C++ code (macros, functions,
variables, etc.) or in the form of INPLACE CONFIGURATIONS. Which symbols must be defined
and which are optional depends on a particular use case.
YYCONDTYPE
In -c mode you can use -t to generate a file that contains the enumeration used as
conditions. Each of the values refers to a condition of a rule set.
YYCTXMARKER
l-value of type YYCTYPE *. The generated code saves trailing context backtracking
information in YYCTXMARKER. The user only needs to define this macro if a scanner
specification uses trailing context in one or more of its regular expressions.
YYCTYPE
Type used to hold an input symbol (code unit). Usually char or unsigned char for
ASCII, EBCDIC and UTF-8, unsigned short for UTF-16 or UCS-2 and unsigned int for
UTF-32.
YYCURSOR
l-value of type YYCTYPE * that points to the current input symbol. The generated
code advances YYCURSOR as symbols are matched. On entry, YYCURSOR is assumed to
point to the first character of the current token. On exit, YYCURSOR will point to
the first character of the following token.
YYDEBUG (state, current)
This is only needed if the -d flag was specified. It allows one to easily debug the
generated parser by calling a user defined function for every state. The function
should have the following signature: void YYDEBUG (int state, char current). The
first parameter receives the state or -1 and the second parameter receives the
input at the current cursor.
YYFILL (n)
The generated code "calls"" YYFILL (n) when the buffer needs (re)filling: at least
n additional characters should be provided. YYFILL (n) should adjust YYCURSOR,
YYLIMIT, YYMARKER and YYCTXMARKER as needed. Note that for typical programming
languages n will be the length of the longest keyword plus one. The user can place
a comment of the form /*!max:re2c*/ to insert YYMAXFILL definition that is set to
the maximum length value.
YYGETCONDITION ()
This define is used to get the condition prior to entering the scanner code when
using -c switch. The value must be initialized with a value from the enumeration
YYCONDTYPE type.
YYGETSTATE ()
The user only needs to define this macro if the -f flag was specified. In that
case, the generated code "calls" YYGETSTATE () at the very beginning of the scanner
in order to obtain the saved state. YYGETSTATE () must return a signed integer. The
value must be either -1, indicating that the scanner is entered for the first time,
or a value previously saved by YYSETSTATE (s). In the second case, the scanner will
resume operations right after where the last YYFILL (n) was called.
YYLIMIT
Expression of type YYCTYPE * that marks the end of the buffer YYLIMIT[-1] is the
last character in the buffer). The generated code repeatedly compares YYCURSOR to
YYLIMIT to determine when the buffer needs (re)filling.
YYMARKER
l-value of type YYCTYPE *. The generated code saves backtracking information in
YYMARKER. Some easy scanners might not use this.
YYMAXFILL
This will be automatically defined by /*!max:re2c*/ blocks as explained above.
YYSETCONDITION (c)
This define is used to set the condition in transition rules. This is only being
used when -c is active and transition rules are being used.
YYSETSTATE (s)
The user only needs to define this macro if the -f flag was specified. In that
case, the generated code "calls" YYSETSTATE just before calling YYFILL (n). The
parameter to YYSETSTATE is a signed integer that uniquely identifies the specific
instance of YYFILL (n) that is about to be called. Should the user wish to save the
state of the scanner and have YYFILL (n) return to the caller, all he has to do is
store that unique identifer in a variable. Later, when the scannered is called
again, it will call YYGETSTATE () and resume execution right where it left off. The
generated code will contain both YYSETSTATE (s) and YYGETSTATE even if YYFILL (n)
is being disabled.
SYNTAX
Code for re2c consists of a set of RULES, NAMED DEFINITIONS and INPLACE CONFIGURATIONS.
RULES
Rules consist of a regular expression (see REGULAR EXPRESSIONS) along with a block of
C/C++ code that is to be executed when the associated regular expression is matched. You
can either start the code with an opening curly brace or the sequence :=. When the code
with a curly brace then re2c counts the brace depth and stops looking for code
automatically. Otherwise curly braces are not allowed and re2c stops looking for code at
the first line that does not begin with whitespace. If two or more rules overlap, the
first rule is preferred.
regular-expression { C/C++ code }
regular-expression := C/C++ code
There is one special rule: default rule *
* { C/C++ code }
* := C/C++ code
Note that default rule * differs from [^]: default rule has the lowest priority, matches
any code unit (either valid or invalid) and always consumes one character; while [^]
matches any valid code point (not code unit) and can consume multiple code units. In fact,
when variable-length encoding is used, * is the only possible way to match invalid input
character (see ENCODINGS for details).
If -c is active then each regular expression is preceded by a list of comma separated
condition names. Besides normal naming rules there are two special cases: <*> (such rules
are merged to all conditions) and <> (such the rule cannot have an associated regular
expression, its code is merged to all actions). Non empty rules may further more specify
the new condition. In that case re2c will generate the necessary code to change the
condition automatically. Rules can use :=> as a shortcut to automatically generate code
that not only sets the new condition state but also continues execution with the new
state. A shortcut rule should not be used in a loop where there is code between the start
of the loop and the re2c block unless re2c:cond:goto is changed to continue. If code is
necessary before all rules (though not simple jumps) you can doso by using <!>
pseudo-rules.
<condition-list> regular-expression { C/C++ code }
<condition-list> regular-expression := C/C++ code
<condition-list> * { C/C++ code }
<condition-list> * := C/C++ code
<condition-list> regular-expression => condition { C/C++ code }
<condition-list> regular-expression => condition := C/C++ code
<condition-list> * => condition { C/C++ code }
<condition-list> * => condition := C/C++ code
<condition-list> regular-expression :=> condition
<*> regular-expression { C/C++ code }
<*> regular-expression := C/C++ code
<*> * { C/C++ code }
<*> * := C/C++ code
<*> regular-expression => condition { C/C++ code }
<*> regular-expression => condition := C/C++ code
<*> * => condition { C/C++ code }
<*> * => condition := C/C++ code
<*> regular-expression :=> condition
<> { C/C++ code }
<> := C/C++ code
<> => condition { C/C++ code }
<> => condition := C/C++ code
<> :=> condition
<> :=> condition
<! condition-list> { C/C++ code }
<! condition-list> := C/C++ code
<!> { C/C++ code }
<!> := C/C++ code
NAMED DEFINITIONS
Named definitions are of the form:
name = regular-expression;
If -F is active, then named definitions are also of the form:
name { regular-expression }
INPLACE CONFIGURATIONS
re2c:condprefix = yyc;
Allows one to specify the prefix used for condition labels. That is this text is
prepended to any condition label in the generated output file.
re2c:condenumprefix = yyc;
Allows one to specify the prefix used for condition values. That is this text is
prepended to any condition enum value in the generated output file.
re2c:cond:divider = /* *********************************** */ ;
Allows one to customize the devider for condition blocks. You can use @@ to put the
name of the condition or customize the placeholder using re2c:cond:divider@cond.
re2c:cond:divider@cond = @@;
Specifies the placeholder that will be replaced with the condition name in
re2c:cond:divider.
re2c:cond:goto = goto @@; ;
Allows one to customize the condition goto statements used with :=> style rules.
You can use @@ to put the name of the condition or ustomize the placeholder using
re2c:cond:goto@cond. You can also change this to continue;, which would allow you
to continue with the next loop cycle including any code between loop start and re2c
block.
re2c:cond:goto@cond = @@;
Spcifies the placeholder that will be replaced with the condition label in
re2c:cond:goto.
re2c:indent:top = 0;
Specifies the minimum number of indentation to use. Requires a numeric value
greater than or equal zero.
re2c:indent:string = \t ;
Specifies the string to use for indentation. Requires a string that should contain
only whitespace unless you need this for external tools. The easiest way to specify
spaces is to enclude them in single or double quotes. If you do not want any
indentation at all you can simply set this to "".
re2c:yych:conversion = 0;
When this setting is non zero, then re2c automatically generates conversion code
whenever yych gets read. In this case the type must be defined using
re2c:define:YYCTYPE.
re2c:yych:emit = 1;
Generation of yych can be suppressed by setting this to 0.
re2c:yybm:hex = 0;
If set to zero then a decimal table is being used else a hexadecimal table will be
generated.
re2c:yyfill:enable = 1;
Set this to zero to suppress generation of YYFILL (n). When using this be sure to
verify that the generated scanner does not read behind input. Allowing this
behavior might introduce sever security issues to you programs.
re2c:yyfill:check = 1;
This can be set 0 to suppress output of the pre condition using YYCURSOR and
YYLIMIT which becomes useful when YYLIMIT + YYMAXFILL is always accessible.
re2c:define:YYFILL = YYFILL ;
Substitution for YYFILL. Note that by default re2c generates argument in braces and
semicolon after YYFILL. If you need to make YYFILL an arbitrary statement rather
than a call, set re2c:define:YYFILL:naked to non-zero and use
re2c:define:YYFILL@len to denote formal parameter inside of YYFILL body.
re2c:define:YYFILL@len = @@ ;
Any occurrence of this text inside of YYFILL will be replaced with the actual
argument.
re2c:yyfill:parameter = 1;
Controls argument in braces after YYFILL. If zero, agrument is omitted. If
non-zero, argument is generated unless re2c:define:YYFILL:naked is set to non-zero.
re2c:define:YYFILL:naked = 0;
Controls argument in braces and semicolon after YYFILL. If zero, both agrument and
semicolon are omitted. If non-zero, argument is generated unless
re2c:yyfill:parameter is set to zero and semicolon is generated unconditionally.
re2c:startlabel = 0;
If set to a non zero integer then the start label of the next scanner blocks will
be generated even if not used by the scanner itself. Otherwise the normal yy0 like
start label is only being generated if needed. If set to a text value then a label
with that text will be generated regardless of whether the normal start label is
being used or not. This setting is being reset to 0 after a start label has been
generated.
re2c:labelprefix = yy ;
Allows one to change the prefix of numbered labels. The default is yy and can be
set any string that is a valid label.
re2c:state:abort = 0;
When not zero and switch -f is active then the YYGETSTATE block will contain a
default case that aborts and a -1 case is used for initialization.
re2c:state:nextlabel = 0;
Used when -f is active to control whether the YYGETSTATE block is followed by a
yyNext: label line. Instead of using yyNext you can usually also use configuration
startlabel to force a specific start label or default to yy0 as start label.
Instead of using a dedicated label it is often better to separate the YYGETSTATE
code from the actual scanner code by placing a /*!getstate:re2c*/ comment.
re2c:cgoto:threshold = 9;
When -g is active this value specifies the complexity threshold that triggers
generation of jump tables rather than using nested if's and decision bitfields. The
threshold is compared against a calculated estimation of if-s needed where every
used bitmap divides the threshold by 2.
re2c:yych:conversion = 0;
When the input uses signed characters and -s or -b switches are in effect re2c
allows one to automatically convert to the unsigned character type that is then
necessary for its internal single character. When this setting is zero or an empty
string the conversion is disabled. Using a non zero number the conversion is taken
from YYCTYPE. If that is given by an inplace configuration that value is being
used. Otherwise it will be (YYCTYPE) and changes to that configuration are no
longer possible. When this setting is a string the braces must be specified. Now
assuming your input is a char * buffer and you are using above mentioned switches
you can set YYCTYPE to unsigned char and this setting to either 1 or (unsigned
char).
re2c:define:YYCONDTYPE = YYCONDTYPE ;
Enumeration used for condition support with -c mode.
re2c:define:YYCTXMARKER = YYCTXMARKER ;
Allows one to overwrite the define YYCTXMARKER and thus avoiding it by setting the
value to the actual code needed.
re2c:define:YYCTYPE = YYCTYPE ;
Allows one to overwrite the define YYCTYPE and thus avoiding it by setting the
value to the actual code needed.
re2c:define:YYCURSOR = YYCURSOR ;
Allows one to overwrite the define YYCURSOR and thus avoiding it by setting the
value to the actual code needed.
re2c:define:YYDEBUG = YYDEBUG ;
Allows one to overwrite the define YYDEBUG and thus avoiding it by setting the
value to the actual code needed.
re2c:define:YYGETCONDITION = YYGETCONDITION ;
Substitution for YYGETCONDITION. Note that by default re2c generates braces after
YYGETCONDITION. Set re2c:define:YYGETCONDITION:naked to non-zero to omit braces.
re2c:define:YYGETCONDITION:naked = 0;
Controls braces after YYGETCONDITION. If zero, braces are omitted. If non-zero,
braces are generated.
re2c:define:YYSETCONDITION = YYSETCONDITION ;
Substitution for YYSETCONDITION. Note that by default re2c generates argument in
braces and semicolon after YYSETCONDITION. If you need to make YYSETCONDITION an
arbitrary statement rather than a call, set re2c:define:YYSETCONDITION:naked to
non-zero and use re2c:define:YYSETCONDITION@cond to denote formal parameter inside
of YYSETCONDITION body.
re2c:define:YYSETCONDITION@cond = @@ ;
Any occurrence of this text inside of YYSETCONDITION will be replaced with the
actual argument.
re2c:define:YYSETCONDITION:naked = 0;
Controls argument in braces and semicolon after YYSETCONDITION. If zero, both
agrument and semicolon are omitted. If non-zero, both argument and semicolon are
generated.
re2c:define:YYGETSTATE = YYGETSTATE ;
Substitution for YYGETSTATE. Note that by default re2c generates braces after
YYGETSTATE. Set re2c:define:YYGETSTATE:naked to non-zero to omit braces.
re2c:define:YYGETSTATE:naked = 0;
Controls braces after YYGETSTATE. If zero, braces are omitted. If non-zero, braces
are generated.
re2c:define:YYSETSTATE = YYSETSTATE ;
Substitution for YYSETSTATE. Note that by default re2c generates argument in braces
and semicolon after YYSETSTATE. If you need to make YYSETSTATE an arbitrary
statement rather than a call, set re2c:define:YYSETSTATE:naked to non-zero and use
re2c:define:YYSETSTATE@cond to denote formal parameter inside of YYSETSTATE body.
re2c:define:YYSETSTATE@state = @@ ;
Any occurrence of this text inside of YYSETSTATE will be replaced with the actual
argument.
re2c:define:YYSETSTATE:naked = 0;
Controls argument in braces and semicolon after YYSETSTATE. If zero, both agrument
and semicolon are omitted. If non-zero, both argument and semicolon are generated.
re2c:define:YYLIMIT = YYLIMIT ;
Allows one to overwrite the define YYLIMIT and thus avoiding it by setting the
value to the actual code needed.
re2c:define:YYMARKER = YYMARKER ;
Allows one to overwrite the define YYMARKER and thus avoiding it by setting the
value to the actual code needed.
re2c:label:yyFillLabel = yyFillLabel ;
Allows one to overwrite the name of the label yyFillLabel.
re2c:label:yyNext = yyNext ;
Allows one to overwrite the name of the label yyNext.
re2c:variable:yyaccept = yyaccept;
Allows one to overwrite the name of the variable yyaccept.
re2c:variable:yybm = yybm ;
Allows one to overwrite the name of the variable yybm.
re2c:variable:yych = yych ;
Allows one to overwrite the name of the variable yych.
re2c:variable:yyctable = yyctable ;
When both -c and -g are active then re2c uses this variable to generate a static
jump table for YYGETCONDITION.
re2c:variable:yystable = yystable ;
Deprecated.
re2c:variable:yytarget = yytarget ;
Allows one to overwrite the name of the variable yytarget.
REGULAR EXPRESSIONS
"foo" literal string "foo". ANSI-C escape sequences can be used.
'foo' literal string "foo" (characters [a-zA-Z] treated case-insensitive). ANSI-C escape
sequences can be used.
[xyz] character class; in this case, regular expression matches either x, y, or z.
[abj-oZ]
character class with a range in it; matches a, b, any letter from j through o or Z.
[^class]
inverted character class.
r s match any r which isn't s. r and s must be regular expressions which can be
expressed as character classes.
r* zero or more occurrences of r.
r+ one or more occurrences of r.
r? optional r.
(r) r; parentheses are used to override precedence.
r s r followed by s (concatenation).
r | s either r or s (alternative).
r / s r but only if it is followed by s. Note that s is not part of the matched text.
This type of regular expression is called "trailing context". Trailing context can
only be the end of a rule and not part of a named definition.
r{n} matches r exactly n times.
r{n,} matches r at least n times.
r{n,m} matches r at least n times, but not more than m times.
. match any character except newline.
name matches named definition as specified by name only if -F is off. If -F is active
then this behaves like it was enclosed in double quotes and matches the string
"name".
Character classes and string literals may contain octal or hexadecimal character
definitions and the following set of escape sequences: \a, \b, \f, \n, \r, \t, \v, \. An
octal character is defined by a backslash followed by its three octal digits (e.g. \377).
Hexadecimal characters from 0 to 0xFF are defined by backslash, a lower cased x and two
hexadecimal digits (e.g. \x12). Hexadecimal characters from 0x100 to 0xFFFF are defined by
backslash, a lower cased \u or an upper cased \X and four hexadecimal digits (e.g.
\u1234). Hexadecimal characters from 0x10000 to 0xFFFFffff are defined by backslash, an
upper cased \U and eight hexadecimal digits (e.g. \U12345678).
The only portable "any" rule is the default rule *.
SCANNER WITH STORABLE STATES
When the -f flag is specified, re2c generates a scanner that can store its current state,
return to the caller, and later resume operations exactly where it left off.
The default operation of re2c is a "pull" model, where the scanner asks for extra input
whenever it needs it. However, this mode of operation assumes that the scanner is the
"owner" the parsing loop, and that may not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the stream, or for that
matter any other procedural source of data, the scanner cannot "ask" for more data unless
both scanner and source live in a separate threads.
The -f flag is useful for just this situation: it lets users design scanners that work in
a "push" model, i.e. where data is fed to the scanner chunk by chunk. When the scanner
runs out of data to consume, it just stores its state, and return to the caller. When more
input data is fed to the scanner, it resumes operations exactly where it left off.
Changes needed compared to the "pull" model:
· User has to supply macros YYSETSTATE () and YYGETSTATE (state).
· The -f option inhibits declaration of yych and yyaccept. So the user has to declare
these. Also the user has to save and restore these. In the example
examples/push_model/push.re these are declared as fields of the (C++) class of which the
scanner is a method, so they do not need to be saved/restored explicitly. For C they
could e.g. be made macros that select fields from a structure passed in as parameter.
Alternatively, they could be declared as local variables, saved with YYFILL (n) when it
decides to return and restored at entry to the function. Also, it could be more
efficient to save the state from YYFILL (n) because YYSETSTATE (state) is called
unconditionally. YYFILL (n) however does not get state as parameter, so we would have
to store state in a local variable by YYSETSTATE (state).
· Modify YYFILL (n) to return (from the function calling it) if more input is needed.
· Modify caller to recognise if more input is needed and respond appropriately.
· The generated code will contain a switch block that is used to restores the last state
by jumping behind the corrspoding YYFILL (n) call. This code is automatically generated
in the epilog of the first /*!re2c */ block. It is possible to trigger generation of the
YYGETSTATE () block earlier by placing a /*!getstate:re2c*/ comment. This is especially
useful when the scanner code should be wrapped inside a loop.
Please see examples/push_model/push.re for "push" model scanner. The generated code can be
tweaked using inplace configurations state:abort and state:nextlabel.
SCANNER WITH CONDITION SUPPORT
You can preceed regular expressions with a list of condition names when using the -c
switch. In this case re2c generates scanner blocks for each conditon. Where each of the
generated blocks has its own precondition. The precondition is given by the interface
define YYGETCONDITON() and must be of type YYCONDTYPE.
There are two special rule types. First, the rules of the condition <*> are merged to all
conditions (note that they have lower priority than other rules of that condition). And
second the empty condition list allows one to provide a code block that does not have a
scanner part. Meaning it does not allow any regular expression. The condition value
referring to this special block is always the one with the enumeration value 0. This way
the code of this special rule can be used to initialize a scanner. It is in no way
necessary to have these rules: but sometimes it is helpful to have a dedicated
uninitialized condition state.
Non empty rules allow one to specify the new condition, which makes them transition rules.
Besides generating calls for the define YYSETCONDTITION no other special code is
generated.
There is another kind of special rules that allow one to prepend code to any code block of
all rules of a certain set of conditions or to all code blocks to all rules. This can be
helpful when some operation is common among rules. For instance this can be used to store
the length of the scanned string. These special setup rules start with an exclamation mark
followed by either a list of conditions <! condition, ... > or a star <!*>. When re2c
generates the code for a rule whose state does not have a setup rule and a star'd setup
rule is present, than that code will be used as setup code.
ENCODINGS
re2c supports the following encodings: ASCII (default), EBCDIC (-e), UCS-2 (-w), UTF-16
(-x), UTF-32 (-u) and UTF-8 (-8). See also inplace configuration re2c:flags.
The following concepts should be clarified when talking about encoding. Code point is an
abstract number, which represents single encoding symbol. Code unit is the smallest unit
of memory, which is used in the encoded text (it corresponds to one character in the input
stream). One or more code units can be needed to represent a single code point, depending
on the encoding. In fixed-length encoding, each code point is represented with equal
number of code units. In variable-length encoding, different code points can be
represented with different number of code units.
ASCII is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to
0xFF. One code point is represented with exactly one 1-byte code unit, which has
the same value as the code point. Size of YYCTYPE must be 1 byte.
EBCDIC is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to
0xFF. One code point is represented with exactly one 1-byte code unit, which has
the same value as the code point. Size of YYCTYPE must be 1 byte.
UCS-2 is a fixed-length encoding. Its code space includes 0x10000 code points, from 0 to
0xFFFF. One code point is represented with exactly one 2-byte code unit, which has
the same value as the code point. Size of YYCTYPE must be 2 bytes.
UTF-16 is a variable-length encoding. Its code space includes all Unicode code points,
from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with
one or two 2-byte code units. Size of YYCTYPE must be 2 bytes.
UTF-32 is a fixed-length encoding. Its code space includes all Unicode code points, from 0
to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with exactly
one 4-byte code unit. Size of YYCTYPE must be 4 bytes.
UTF-8 is a variable-length encoding. Its code space includes all Unicode code points,
from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with
sequence of one, two, three or four 1-byte code units. Size of YYCTYPE must be 1
byte.
In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not valid Unicode code
points, any encoded sequence of code units, that would map to Unicode code points in the
range 0xD800-0xDFFF, is ill-formed. The user can control how re2c treats such ill-formed
sequences with --encoding-policy <policy> flag (see OPTIONS for full explanation).
For some encodings, there are code units, that never occur in valid encoded stream (e.g.
0xFF byte in UTF-8). If the generated scanner must check for invalid input, the only true
way to do so is to use default rule *. Note, that full range rule [^] won't catch invalid
code units when variable-length encoding is used ([^] means "all valid code points", while
default rule * means "all possible code units").
GENERIC INPUT API
re2c usually operates on input using pointer-like primitives YYCURSOR, YYMARKER,
YYCTXMARKER and YYLIMIT.
Generic input API (enabled with --input custom switch) allows one to customize input
operations. In this mode, re2c will express all operations on input in terms of the
following primitives:
┌────────────────┬──────────────────────────────────┐
│YYPEEK () │ get current input character │
├────────────────┼──────────────────────────────────┤
│YYSKIP () │ advance to the next character │
├────────────────┼──────────────────────────────────┤
│YYBACKUP () │ backup current input position │
├────────────────┼──────────────────────────────────┤
│YYBACKUPCTX () │ backup current input position │
│ │ for trailing context │
├────────────────┼──────────────────────────────────┤
│YYRESTORE () │ restore current input position │
├────────────────┼──────────────────────────────────┤
│YYRESTORECTX () │ restore current input position │
│ │ for trailing context │
├────────────────┼──────────────────────────────────┤
│YYLESSTHAN (n) │ check if less than n input │
│ │ characters are left │
└────────────────┴──────────────────────────────────┘
A couple of useful links that provide some examples:
1. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-13-input_model.html
2. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-15-input_model_custom.html
Use re2c online using onworks.net services