PostgreSQL/doc/TODO.detail/yacc

From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
	for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$ Revision: 1.18 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10)  with ESMTP
	id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500
Sender: selkovjr@mcs.anl.gov
Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
Date: Sat, 25 Jul 1998 08:25:49 +0000
From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov>
Organization: MCS, Argonne Natl. Lab
X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
Subject: position-aware scanners
References: <199807250524.BAA07296@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: RO

Bruce,

I attached here (through the web links) a couple of examples, totally
irrelevant to postgres but good enough to discuss token locations. I
might as well try to patch the backend parser, though I am not sure how soon.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.

The first C parser I wrote,
http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
sophisticated, so token locations reported by yyerror() may be slightly
incorrect (+/- one position, depending on the existence and type of the
lookahead token). It is a filter used to typeset the units of measurement
with eqn. To use it, unpack the tar file and run make. The Makefile is
not too generic, but I built it on various systems including linux,
freebsd and sunos 4.3. The invocation can be something like this:

./check 0 parse "l**3/(mmoll*min)"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

l**3/(mmoll*min)
      ^^^^^

Now to the guts. As far as I can imagine, the only way to consistently
keep track of each character read by the scanner (regardless of the
length of expressions it will match) is to redefine its YY_INPUT like
this:

#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
{ \
	int c	= (int) buffer[pos++]; \
	result = (c == '\0') ?	YY_NULL	: (buf[0] = c, 1); \
}

Here, buffer is the pointer to the origin of the string being scanned
and pos is a global variable, similar in usage to a file pointer (you
can both read and manipulate it at will). The buffer and the pointer are
initialized by the function

void setString(char *s)
{
   buffer = s;
   pos = 0;
}

each time a new string is to be parsed. This (exportable) function is
part of the interface.
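
A minimal sketch of the same bookkeeping outside of flex, assuming a
hypothetical helper next_char() that plays the role of the YY_INPUT
redefinition: it hands out one character at a time and advances pos. The
names set_string, next_char and current_pos are illustrative and are not
taken from the tar file.

```c
#include <assert.h>
#include <stddef.h>

/* Scanner state: the string being scanned and the read position. */
static const char *buffer = NULL;
static int pos = 0;

/* Point the scanner at a new string and rewind the position. */
void set_string(const char *s)
{
    assert(s != NULL);
    buffer = s;
    pos = 0;
}

/*
 * Hypothetical stand-in for the YY_INPUT redefinition: return the next
 * character and advance pos, or return 0 at the end of the buffer.
 * Unlike the macro, pos is not advanced past the terminating '\0',
 * so repeated calls at end of input are safe.
 */
int next_char(void)
{
    int c = (unsigned char) buffer[pos];

    if (c != '\0')
        pos++;
    return c;
}

/* Current read position, i.e. the number of characters consumed. */
int current_pos(void)
{
    return pos;
}
```

Because every character passes through one function, pos always equals
the number of characters consumed, regardless of how long the matched
expressions are.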

In this simplistic design, yyerror() is part of the scanner module and
it uses the pos variable to report the location of unexpected tokens.
The downside of such an arrangement is that in an error condition you
can't easily tell whether the offending token is the current token or
the lookahead token; yyerror() just reports the position of the last
token read (be it $ (end of buffer) or something else):

./check 0 convert "mol/foo"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mol/foo
       ^^^

(should be at the beginning of "foo")

./check 0 convert "mmol//l"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mmol//l
    ^

(should be at the second '/')
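
For comparison, a sketch of how the marker line could be built once the
start column and length of the offending token are known, rather than
derived from pos at the time of the error. The function name caret_line
and the zero-based column convention are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write a marker line into out: `column` spaces followed by `length`
 * carets. column is the zero-based offset of the token's first
 * character. Returns 0 on success, -1 if out is too small.
 */
int caret_line(char *out, size_t out_size, int column, int length)
{
    assert(out != NULL && column >= 0 && length >= 0);

    size_t need = (size_t) column + (size_t) length + 1;

    if (need > out_size)
        return -1;
    memset(out, ' ', (size_t) column);
    memset(out + column, '^', (size_t) length);
    out[column + length] = '\0';
    return 0;
}
```

With "mol/foo", the token foo starts at column 4 and is 3 characters
long, so caret_line(buf, sizeof buf, 4, 3) yields four spaces and three
carets, which line up under "foo". With "mmol//l", the second '/' is at
column 5 with length 1.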


I believe this is why most simple parsers made with yacc report
parse errors as being "at or near" some token, which is fair enough if the
expression is not too complex.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. The second version of the same scanner,
http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this
problem by recording exact locations of the tokens in each instance of
the token semantic data structure. The global,

UNIT_YYSTYPE unit_yylval;

would normally be used to export the token semantics (including its
original or modified text and location data) to the parser.
Unfortunately, I cannot show you the parser part in C, because that's
about when I stopped writing parsers in C. Instead, I included a small
test program, test.c, that mimics the parser's expectations for the
scanner data pretty well. I am assuming here that you are not interested
in digging through someone else's ugly guts for a relatively small bit of
information; let me know if I am wrong and I will send you the complete
perl code (also generated with bison).

To run this example, unpack the tar file and run make. Then do

  gcc test.c scanner.o

and run a.out

Note the line

    yylval = unit_getyylval();

in test.c. You will not normally need it in a C parser. It is enough to
define yylval as an external variable and link it to yylval in yylex().

In the bison-generated parser, the token's location record (yylloc)
gets pushed onto a stack (pointed to by yylsp) each time a new token is
read. For each syntax rule, the bison macros @1, @2, ... are just
shortcuts to the locations in that stack belonging to the first, second,
... term of the rule. In the following code fragment, @3 refers to the
location info for the third term in the rule (INTEGER):

(sorry about perl, but I think you can do the same things in C without
significant changes to your existing parser)

term:           base    {
                        $$ = $1;
                        $$->{'order'} = 1;
                }
        |       base EXP INTEGER {
                        $$ = $1;
                        $$->{'order'} = @3->{'text'};
                        $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
                        if ( $$->{'order'} == 0 ) {
                                yyerror("Error: expecting a non-zero integer exponent");
                                YYERROR;
                        }
                }


which translates to:

  ($yyn == 10)    && do {
          $yyval = $yyvsa[-1];
          $yyval->{'order'} = 1;
          last SWITCH;
  };

  ($yyn == 11)    && do {
          $yyval = $yyvsa[-3];
          $yyval->{'order'} = $yylsa[-1]->{'text'};
          $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
          if ( $yyval->{'order'} == 0 ) {
                   yyerror("Error: expecting a non-zero integer exponent");
                   goto yyerrlab1 ;
          }
          last SWITCH;
  };

In C, you will have a bit more complicated pointer arithmetic to address
the stack, but the usage of objects will be the same. Note here that it
is convenient to keep all information about the token in its location
info (yylsa, yylsp, yylloc, @n), while everything relating to the value
of the expression, or to the parse tree, is better placed in the
semantic stack (yyvsa, yyvsp, yylval, $n). Also note that in some cases
you can do semantic checks inside rules and report useful messages
before or instead of invoking yyerror().
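
To make the stack addressing concrete, here is a sketch of how @n could
resolve for a rule with a given number of right-hand-side terms. The
struct location type, the fixed-size stack and the helper names
loc_push and loc_at are assumptions made for illustration; the code
bison generates uses its own types and macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical location record; the fields are illustrative. */
struct location {
    int first_column;
    int last_column;
};

#define STACK_MAX 64

static struct location loc_stack[STACK_MAX];
static int loc_top = -1;    /* index of the most recently pushed entry */

/* Push the location of a newly shifted token. */
void loc_push(struct location loc)
{
    assert(loc_top + 1 < STACK_MAX);
    loc_stack[++loc_top] = loc;
}

/*
 * Location of the n-th term (1-based) of a rule with rule_len terms
 * on its right-hand side, all of which sit on top of the stack.
 */
const struct location *loc_at(int n, int rule_len)
{
    assert(n >= 1 && n <= rule_len && rule_len <= loc_top + 1);
    return &loc_stack[loc_top - rule_len + n];
}
```

For the rule "base EXP INTEGER", rule_len is 3, so loc_at(3, 3) is the
top of the stack (the INTEGER) and loc_at(1, 3) is two entries below it.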

Finally, it is useful to make the following wrapper function around
external yylex() in order to maintain your own token stack. Unlike the
parser's internal stack which is only as deep as the rule being reduced,
this one can hold all tokens recognized during the current run, and that
can be extremely helpful for error reporting and any transformations you
may need. In this way, you can even scan (tokenize) the whole buffer
before handing it off to the parser (who knows, you may need a token
ahead of what is currently seen by the parser):


sub tokenize {
    undef @tokenTable;
    my ($tok, $text, $name, $unit, $first_line, $first_column,
        $last_line, $last_column);

    # this is where the c-coded yylex is called;
    # UnitLex is the perl extension encapsulating it
    while ( ($tok = &UnitLex::yylex()) > 0 ) {
       ( $text, $name, $unit, $first_line, $first_column, $last_line,
         $last_column ) = &UnitLex::getyylval;
       push(@tokenTable,
           Unit::yyltype->new (
              'token'         => $tok,
              'text'          => $text,
              'name'          => $name,
              'unit'          => $unit,
              'first_line'    => $first_line,
              'first_column'  => $first_column,
              'last_line'     => $last_line,
              'last_column'   => $last_column,
           )
       )
    }

}
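
The same idea in C, as a sketch: scan the whole buffer up front and keep
a table of tokens with their columns. The token classes, the struct
token layout and the scanning rules (letters form a unit name, digits
form an integer, "**" is exponentiation, any other character is a
one-character token) are simplifying assumptions, not the rules of the
actual flex scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token_kind { TOK_UNIT = 1, TOK_INTEGER, TOK_EXP, TOK_OTHER };

/* One table entry: token class plus zero-based, inclusive columns. */
struct token {
    enum token_kind kind;
    int first_column;
    int last_column;
};

/*
 * Split s into tokens, storing at most max of them in table.
 * Returns the number of tokens stored. Whitespace is skipped.
 */
size_t tokenize(const char *s, struct token *table, size_t max)
{
    assert(s != NULL && table != NULL);

    size_t count = 0;
    int pos = 0;

    while (s[pos] != '\0' && count < max) {
        unsigned char c = (unsigned char) s[pos];
        int start = pos;
        enum token_kind kind;

        if (isspace(c)) {
            pos++;
            continue;
        }
        if (isalpha(c)) {
            while (isalpha((unsigned char) s[pos]))
                pos++;
            kind = TOK_UNIT;
        } else if (isdigit(c)) {
            while (isdigit((unsigned char) s[pos]))
                pos++;
            kind = TOK_INTEGER;
        } else if (c == '*' && s[pos + 1] == '*') {
            pos += 2;
            kind = TOK_EXP;
        } else {
            pos++;
            kind = TOK_OTHER;
        }
        table[count].kind = kind;
        table[count].first_column = start;
        table[count].last_column = pos - 1;
        count++;
    }
    return count;
}
```

Since the start column is captured before the token is consumed, it does
not depend on whether the parser has already read a lookahead token.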


It is now a lot easier to handle various state-related problems, such as
backtracking and error reporting. The yylex() function as seen by the
parser might be constructed somewhat like this:

sub yylex {
    # $tokenNo is a global; now instead of a "file pointer", as in the
    # first example, we have a "token pointer"
    $yylloc = $tokenTable[$tokenNo];
    undef $yylval;


    # disregard this; name this block "computing semantic values"
    if ( $yylloc->{'token'} == UNIT) {
        $yylval = Unit::Operand->new(
        'unit'  => Unit::Dict::unit($yylloc->{'unit'}),
        'base'  => Unit::Dict::base($yylloc->{'unit'}),
        'scale' => Unit::Dict::scale($yylloc->{'unit'}),
        'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}),
        'loc'   => $yylloc,
       );
    }
    elsif ( ($yylloc->{'token'} == INTEGER ) ||
            ($yylloc->{'token'} == POSITIVE_NUMBER) ) {
        $yylval = Unit::Operand->new(
          'unit' => '1',
          'base' => '1',
          'scale' => 1,
          'scaleToBase' => 1,
          'loc'   => $yylloc,
        );
    }

    $tokenNo++;
    # This is all the parser needs to know about this token.
    # But we already made sure we saved everything we need to know.
    return($yylloc->{'token'});
}
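
A C sketch of the "token pointer" idea: the lexer seen by the parser only
steps through an already filled table. The struct token type and the
names set_tokens, next_token and current_token are hypothetical; a real
bison parser would also expect yylval and yylloc to be set at this
point.

```c
#include <assert.h>
#include <stddef.h>

struct token {
    int kind;           /* token code; 0 is reserved for end of input */
    int first_column;
    int last_column;
};

static const struct token *token_table = NULL;
static size_t token_count = 0;
static size_t token_no = 0;     /* the "token pointer" */

/* Install a token table and rewind the token pointer. */
void set_tokens(const struct token *table, size_t count)
{
    assert(table != NULL || count == 0);
    token_table = table;
    token_count = count;
    token_no = 0;
}

/* Return the next token code, or 0 at end of input. */
int next_token(void)
{
    if (token_no >= token_count)
        return 0;
    return token_table[token_no++].kind;
}

/* The token most recently returned by next_token(), or NULL if none. */
const struct token *current_token(void)
{
    if (token_no == 0)
        return NULL;
    return &token_table[token_no - 1];
}
```

An error routine can call current_token() to get the columns of the
token the parser saw last, which is the role played by
$tokenTable[$tokenNo-1] in the perl version.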


Now the most interesting part, the error reporting routine:


sub yyerror {
    my ($str) = @_;
    my ($message, $start, $end, $loc);

    # This is the same as to say,
    # "obtain the location info for the current token"
    $loc = $tokenTable[$tokenNo-1];

    # You may use this routine for your own purposes or let parser use it
    if( $str ne 'parse error' ) {
        $message = "$str instead of `" . $loc->{'name'} . "' <" .
            $loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\n";
    }
    else {
        $message = "unexpected token `" . $loc->{'name'} . "' <" .
            $loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\n";
    }

    # that's the original string that was used to set the parser buffer
    $message .= $parseBuffer . "\n";

    $message .= ( ' ' x ($loc->{'first_column'} + 1) ) .
        ( '^' x length($loc->{'text'}) ). "\n";
    if( $str ne 'parse error' ) {
        print STDERR "$str instead of `", $loc->{'name'}, "' {",
            $loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }
    else {
        print STDERR "unexpected token `", $loc->{'name'}, "' {",
            $loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }

    print STDERR "$parseBuffer\n";
    print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x
        length($loc->{'text'}), "\n";
}

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

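Scanners used in these examples assume there is a single line of text on
the input (the first_line and last_line elements of yylloc are simply
ignored). If you want to be able to parse multi-line buffers, just add a
lex rule for '\n' that will increment the line count and reset the
column count to zero. (Where pos doubles as the read offset into the
buffer, as in the first example, keep the column in a separate variable
instead of resetting pos itself.)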
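
A sketch of multi-line bookkeeping, assuming a hypothetical struct
cursor that keeps the line and column counters apart from the read
offset, so that resetting the column at a newline does not disturb where
the scanner reads from. None of these names come from the example
scanners.

```c
#include <assert.h>
#include <stddef.h>

struct cursor {
    size_t offset;   /* index into the buffer; never reset */
    int line;        /* 1-based line number */
    int column;      /* 0-based column within the current line */
};

/* Start at the beginning of the buffer: line 1, column 0. */
void cursor_init(struct cursor *cur)
{
    assert(cur != NULL);
    cur->offset = 0;
    cur->line = 1;
    cur->column = 0;
}

/* Account for one consumed character c. */
void cursor_advance(struct cursor *cur, char c)
{
    assert(cur != NULL);
    cur->offset++;
    if (c == '\n') {
        cur->line++;
        cur->column = 0;
    } else {
        cur->column++;
    }
}
```

A token's first_line and first_column can then be copied from the cursor
just before the token is consumed, and last_line and last_column just
after.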


Ugly as it may seem, I find this approach extremely liberating. If the
grammar becomes too complicated for a LALR(1) parser, I can cascade
multiple parsers. The token table can then be used to reassemble parts
of the original expression for subordinate parsers, preserving the
location info all the way down, so that subordinate parsers can report
their problems consistently. You probably don't need this, as SQL is
very well thought out and has a parsable grammar. But it may be of some
help for error reporting.


--Gene

From pgsql-patches-owner+M1499@postgresql.org Sat Aug  4 13:11:53 2001
Return-path: <pgsql-patches-owner+M1499@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f74HBrh11339
	for <pgman@candle.pha.pa.us>; Sat, 4 Aug 2001 13:11:53 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.4) with SMTP id f74H89655183;
	Sat, 4 Aug 2001 13:08:09 -0400 (EDT)
	(envelope-from pgsql-patches-owner+M1499@postgresql.org)
Received: from sss.pgh.pa.us ([192.204.191.242])
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f74Gxb653074
	for <pgsql-patches@postgresql.org>; Sat, 4 Aug 2001 12:59:37 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f74GtPC29183;
	Sat, 4 Aug 2001 12:55:25 -0400 (EDT)
To: Dave Page <dpage@vale-housing.co.uk>
cc: "'Fernando Nasser'" <fnasser@cygnus.com>,
   Bruce Momjian <pgman@candle.pha.pa.us>, Neil Padgett <npadgett@redhat.com>,
   pgsql-patches@postgresql.org
Subject: Re: [PATCHES] Patch for Improved Syntax Error Reporting
In-Reply-To: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk>
References: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk>
Comments: In-reply-to Dave Page <dpage@vale-housing.co.uk>
	message dated "Sat, 04 Aug 2001 12:37:23 +0100"
Date: Sat, 04 Aug 2001 12:55:24 -0400
Message-ID: <29180.996944124@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-patches-owner@postgresql.org
Status: OR

Dave Page <dpage@vale-housing.co.uk> writes:
> Oh, I quite agree. I'm not adverse to updating my code, I just want to avoid
> users getting misleading messages until I come up with those updates.

Hmm ... if they were actively misleading then I'd share your concern.

I guess what you're thinking is that the error offset reported by the
backend won't correspond directly to what the user typed, and if the
user tries to use the offset to manually count off characters, he may
arrive at the wrong place?  Good point.  I'm not sure whether a message
like

	ERROR:  parser: parse error at or near 'frum';
	POSITION: 42

would be likely to encourage people to try that.  Thoughts?  (I do think
this is a good argument for not embedding the position straight into the
main error message though...)

One possible compromise is to combine the straight character-offset
approach with a simplistic context display:

	ERROR:  parser: parse error at or near 'frum';
	POSITION: 42  ... oid,relname FRUM ...

The idea is to define the "POSITION" field as an integer offset possibly
followed by whitespace and noise words.  An updated client would grab
the offset, ignore the rest of the field, and do the right thing.  A
not-updated client would display the entire message, and with any luck
the user would read it correctly.
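
A sketch of what an updated client might do with such a field, assuming
the text after "POSITION:" has already been extracted from the message.
The function name parse_position and its return convention are
hypothetical; the proposal above does not specify a client API.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the leading integer offset from a POSITION field such as
 * "42  ... oid,relname FRUM ...". Anything after the integer is
 * ignored. Returns the offset, or -1 if the field does not start
 * with a non-negative integer that fits in an int.
 */
int parse_position(const char *field)
{
    assert(field != NULL);

    char *end;
    long value;

    errno = 0;
    value = strtol(field, &end, 10);
    if (end == field || errno == ERANGE || value < 0 || value > INT_MAX)
        return -1;
    return (int) value;
}
```

Because strtol() stops at the first character that cannot be part of the
number, the context display that follows the offset is skipped without
any further parsing.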

			regards, tom lane
