BP character constants - The GNU Pascal Manual

12.2.2 BP character constants

Borland-style character constants of the form ^M need special care. For example, consider the following type declarations:

     type
       X = Integer;
       Y = ^X;        { pointer type }
       Z = ^X .. ^Y;  { subrange type }

One way to resolve this is to let the parser tell the lexer (via a global flag) whether a character constant or the symbol ^ (to create pointer types or to dereference pointer expressions) is suitable in the current context. Previous versions did this, but it had a number of disadvantages. First, any dependency of the lexer on the parser (see Lexical Tie-Ins (bison)) is problematic in itself, since it must be taken care of manually in each relevant parser rule. Furthermore, the parser's read-ahead must be taken into account, so the flag must usually be changed apparently one token too early. Using GLR (see GLR Parsers (bison)) makes this problem worse, since the parser may read many tokens while it is split before it can perform any semantic action (which is where the flag could be modified). Second, as the example above shows, there are contexts in which both meanings are acceptable: after the = of a type declaration, ^X may start a pointer type or a subrange of character constants. So further look-ahead (within the lexer) was needed to resolve the problem.

Therefore, we now use another approach. When seeing ^X, the lexer returns two tokens, a regular ^ and a special token LEX_CARET_LETTER with semantic value X. The parser accepts LEX_CARET_LETTER wherever an identifier is accepted, and turns it into the identifier X via the nonterminal caret_letter. Furthermore, it accepts the sequence ^, LEX_CARET_LETTER as a string constant (whose value is a one-character string). This works because the lexer produces LEX_CARET_LETTER only immediately after ^, with no white-space in between. In general, pasting tokens in the parser is not reliable because the parser does not see white-space: e.g., if := weren't a token by itself, the token sequence : and = could stand for :=, but also for : = with a space in between. With this trick, we can handle ^ followed by a single letter or underscore. The fact that this doesn't cause any conflict in the grammar tells us that this method works.
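The lexer side of this approach can be sketched roughly as follows. This is a simplified Python model, not GPC's actual lexer: the function name lex_caret and the (token, value) tuple layout are assumptions made for illustration; only the token name LEX_CARET_LETTER comes from the text above.

```python
import re

CARET = "^"
LEX_CARET_LETTER = "LEX_CARET_LETTER"

# One letter or underscore that is NOT followed by further identifier
# characters. Multi-character sequences such as ^cto are left to the
# regular lexer in this sketch.
_SINGLE_LETTER = re.compile(r"[A-Za-z_](?![A-Za-z0-9_])")


def lex_caret(source, pos):
    """Lex a '^' located at source[pos].

    Returns (tokens, new_pos), where tokens is a list of
    (token_name, semantic_value) pairs (hypothetical layout).
    """
    if source[pos] != "^":
        raise ValueError("expected '^' at position %d" % pos)
    m = _SINGLE_LETTER.match(source, pos + 1)
    if m:
        # No white-space can sit between the two tokens, so the parser
        # may safely paste them back together into a char constant.
        return [(CARET, None), (LEX_CARET_LETTER, m.group())], m.end()
    return [(CARET, None)], pos + 1
```

With this model, the input ^X yields both tokens, whereas ^ X (with a space) yields only the plain ^, so the parser never pastes tokens that were separated in the source.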

However, BP even allows any other character after ^ as a char constant. E.g., ^) could be a pointer dereference after an expression and followed by a closing parenthesis, or the character i (sic!).
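To see why ^) denotes the character i, here is a minimal sketch of how the value of such a constant can be computed. It assumes that the character after ^ is upper-cased if it is a lower-case ASCII letter and that bit 6 of its code is then flipped. This rule is an assumption of the sketch (the text above only states the ^) case), and the function name caret_char is hypothetical.

```python
def caret_char(c):
    """Return the character denoted by '^' followed by c.

    Assumed rule: upper-case ASCII letters, then flip bit 6 (xor 64).
    """
    code = ord(c)
    if ord("a") <= code <= ord("z"):
        code -= 32          # treat ^m like ^M (assumption)
    return chr(code ^ 0x40)
```

Under this rule, ^M gives code 13 (carriage return), and ^) gives code 41 xor 64 = 105, which is the letter i.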

Some characters are unproblematic because they can never occur after a ^ in its regular meaning, so the sequence can be lexed as a char constant directly. The first group consists of all characters that are not part of any Pascal token at all: all control characters except white-space, all non-ASCII characters, and the characters !, &, %, ?, \, `, |, ~ and } (the last one occurs at the end of comments, but within a comment this issue doesn't arise anyway). The second group consists of the characters that can only start constants, because a constant can never follow a ^ in Pascal: #, $, ', " and the digits.

For ^ followed by white-space, the lexer returns the token LEX_CARET_WHITE, which the parser accepts either as a string constant or as equivalent to ^ (because in the regular meaning, the white-space is meaningless).

If ^ is followed by one of the remaining characters (apart from {, see below), namely one of , . : ; ( ) [ ] + - * / < = > @ ^ the lexer just returns the tokens regularly, and the parser accepts these sequences as a char constant (in addition to the normal meaning of the tokens). Again, since white-space after ^ is already dealt with, this token pasting works here.
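The single-character cases described so far can be summarized as a small dispatch function. This is only a sketch: the function name, the returned strategy labels and the exact white-space set are assumptions for illustration. It looks at one character only, so the multi-character cases (such as ^cto, ^<= or ^(*) are outside its scope.

```python
WHITE_SPACE = set(" \t\n\r\f\v")             # assumed white-space set
NOT_IN_ANY_TOKEN = set("!&%?\\`|~}")         # never part of a Pascal token
ONLY_STARTS_CONSTANT = set("#$'\"0123456789")
PASTED_BY_PARSER = set(",.:;()[]+-*/<=>@^")


def classify_after_caret(c):
    """Say how '^' followed by the single character c is handled."""
    if c in WHITE_SPACE:
        return "LEX_CARET_WHITE"
    code = ord(c)
    if code < 32 or code >= 127:
        # Remaining control characters and all non-ASCII characters.
        return "char constant"
    if "A" <= c <= "Z" or "a" <= c <= "z" or c == "_":
        return "LEX_CARET_LETTER"
    if c in NOT_IN_ANY_TOKEN or c in ONLY_STARTS_CONSTANT:
        return "char constant"
    if c in PASTED_BY_PARSER:
        # Lexer emits '^' and the regular token; parser accepts both readings.
        return "regular tokens"
    if c == "{":
        return "unhandled"
    raise AssertionError("unclassified character: %r" % c)
```

Every printable ASCII character falls into exactly one branch, which is a quick way to check that the case analysis in the text is complete.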

But ^ can also be followed by a multi-character alphanumeric sequence such as ^cto, which might be read as ^ cto or ^c to (since BP also allows omitting white-space after constants), or by a multi-character token such as ^<=, which could be ^ <= or ^< =. Both could be solved with extra tokens, e.g. by lexing ^<= as ^, LEX_CARET_LESS, = and accepting ^, LEX_CARET_LESS in the parser as a string constant and LEX_CARET_LESS, = as equivalent to <=. This relies on the fact that the lexer doesn't produce LEX_CARET_LESS if there's white-space after the <, because then the simple ^, < will work, which justifies the token pasting once again. This has not been done yet; in the alphanumeric case, it might add a lot of special tokens because of keywords etc., and it's doubtful whether that's worth it.
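As an illustration of the extra-token idea for ^<=, here is a hypothetical sketch. Since this scheme has not been implemented, everything here apart from the token name LEX_CARET_LESS is an assumption, and a complete version would need similar treatment for other multi-character tokens starting with <.

```python
def lex_caret_less(source, pos):
    """Lex '^<' at source[pos] under the proposed (unimplemented) scheme.

    Returns (tokens, new_pos).
    """
    if not source.startswith("^<", pos):
        raise ValueError("expected '^<' at position %d" % pos)
    if source.startswith("=", pos + 2):
        # '=' follows '<' directly: emit the special token so the parser
        # can group either as ^ <= or as ^< =.
        return ["^", "LEX_CARET_LESS", "="], pos + 3
    # Anything else (including white-space) after '<': plain tokens suffice.
    return ["^", "<"], pos + 2


def parser_groupings(tokens):
    """List the two readings a parser could give the token sequence."""
    if tokens == ["^", "LEX_CARET_LESS", "="]:
        return [["^", "<="], ["char constant ^<", "="]]
    if tokens == ["^", "<"]:
        return [["^", "<"], ["char constant ^<"]]
    raise ValueError("unexpected token sequence")
```

The point of the special token is that the parser can tell ^<= apart from ^< = without ever seeing white-space itself.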

Finally, we have ^{ and ^(*. This is so incredibly stupid (e.g., think of the construct type c = Integer; foo = ^{ .. ^|; bar = {} c; which would become ambiguous then), that perhaps we should not attempt to handle this ...

(As a side-note, BP itself doesn't handle ^ character constants in many situations, including many that GPC does handle with the mechanisms described above, which is probably the clearest sign of a design bug. But if we support them at all, we might just as well do it better than BP ... :-)