From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
	for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$ Revision: 1.18 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10)  with ESMTP
	id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500
Sender: selkovjr@mcs.anl.gov
Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
Date: Sat, 25 Jul 1998 08:25:49 +0000
From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov>
Organization: MCS, Argonne Natl. Lab
X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
Subject: position-aware scanners
References: <199807250524.BAA07296@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: RO

Bruce,

I attached here (through the web links) a couple of examples, totally
irrelevant to postgres but good enough to discuss token locations. I
might as well try to patch the backend parser, though I am not sure how
soon.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.

The first C parser I wrote,
http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
sophisticated, so token locations reported by yyerror() may be slightly
incorrect (+/- one position, depending on the existence and type of the
lookahead token). It is a filter used to typeset the units of measurement
with eqn. To use it, unpack the tar file and run make. The Makefile is
not too generic but I built it on various systems including linux,
freebsd and sunos 4.3. The invocation can be something like this:

./check 0 parse "l**3/(mmoll*min)"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

l**3/(mmoll*min)
      ^^^^^

Now to the guts. As far as I can imagine, the only way to consistently
keep track of each character read by the scanner (regardless of the
length of expressions it will match) is to redefine its YY_INPUT like
this:

#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
{ \
	int c	= (int) buffer[pos++]; \
	result = (c == '\0') ?	YY_NULL	: (buf[0] = c, 1); \
}

Here, buffer is the pointer to the origin of the string being scanned
and pos is a global variable, similar in usage to a file pointer (you
can both read and manipulate it at will). The buffer and the pointer are
initialized by the function

void setString(char *s)
{
   buffer = s;
   pos = 0;
}

each time a new string is to be parsed. This (exportable) function is
part of the interface.

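A minimal sketch of the same idea outside flex, for readers who want to try it without the tar file. It assumes a hand-written scanner in place of the generated one; nextChar, nextToken and TokenSpan are hypothetical names, not part of the original code. Unlike the macro above, nextChar does not advance pos past the terminating '\0'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same roles as in the text: the string being scanned and a read position. */
static const char *buffer = "";
static size_t pos = 0;

void setString(const char *s)
{
    assert(s != NULL);
    buffer = s;
    pos = 0;
}

/* Stand-in for YY_INPUT: hand out one character and advance pos.
 * Returns 0 at the end of the string without moving past the terminator. */
static int nextChar(void)
{
    char c = buffer[pos];
    if (c == '\0')
        return 0;
    pos++;
    return (unsigned char) c;
}

/* Hypothetical token record: half-open range [start, end) into buffer. */
typedef struct { size_t start; size_t end; } TokenSpan;

/* Reads one token: a run of letters, a run of digits, or a single other
 * character. Returns 1 and fills *span, or returns 0 at end of input. */
int nextToken(TokenSpan *span)
{
    size_t start = pos;
    int c = nextChar();
    if (c == 0)
        return 0;
    if (isalpha(c)) {
        while (isalpha((unsigned char) buffer[pos]))
            nextChar();
    } else if (isdigit(c)) {
        while (isdigit((unsigned char) buffer[pos]))
            nextChar();
    }
    span->start = start;
    span->end = pos;
    return 1;
}
```

Because every character goes through nextChar, the value of pos before and after a match gives the exact extent of each token, whatever its length.
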
In this simplistic design, yyerror() is part of the scanner module and
it uses the pos variable to report the location of unexpected tokens.
The downside of such an arrangement is that, when an error occurs, you
can't easily tell whether your context is the current or the lookahead
token; it just reports the position of the last token read (be it $ (end
of buffer) or something else):

./check 0 convert "mol/foo"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mol/foo
       ^^^

(should be at the beginning of "foo")

./check 0 convert "mmol//l"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mmol//l
    ^

(should be at the second '/')


I believe this is why most simple parsers made with yacc report parse
errors as being "at or near" some token, which is fair enough if the
expression is not too complex.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. The second version of the same scanner,
http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this
problem by recording exact locations of the tokens in each instance of
the token semantic data structure. The global,

UNIT_YYSTYPE unit_yylval;

would normally be used to export the token semantics (including its
original or modified text and location data) to the parser.
Unfortunately, I cannot show you the parser part in C, because that's
about when I stopped writing parsers in C. Instead, I included a small
test program, test.c, that mimics the parser's expectations for the
scanner data pretty well. I am assuming here that you are not interested
in digging through someone else's ugly guts for a relatively small bit
of information; let me know if I am wrong and I will send you the
complete perl code (also generated with bison).

To run this example, unpack the tar file and run make. Then do

  gcc test.c scanner.o

and run a.out

Note the line

    yylval = unit_getyylval();

in test.c. You will not normally need it in a C parser. It is enough to
define yylval as an external variable and link it to yylval in yylex().

In the bison-generated parser, yylval gets pushed into a stack (pointed
to by yylsp) each time a new token is read. For each syntax rule, the
bison macros @1, @2, ... are just shortcuts to locations in the stack 1,
2, ... levels deep. In the following code fragment, @3 refers to the
location info for the third term in the rule (INTEGER):

(sorry about perl, but I think you can do the same things in C without
significant changes to your existing parser)

term:           base    {
                        $$ = $1;
                        $$->{'order'} = 1;
                }
        |       base EXP INTEGER {
                        $$ = $1;
                        $$->{'order'} = @3->{'text'};
                        $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
                        if ( $$->{'order'} == 0 ) {
                                yyerror("Error: expecting a non-zero integer exponent");
                                YYERROR;
                        }
                }


which translates to:

  ($yyn == 10)    && do {
          $yyval = $yyvsa[-1];
          $yyval->{'order'} = 1;
          last SWITCH;
  };

  ($yyn == 11)    && do {
          $yyval = $yyvsa[-3];
          $yyval->{'order'} = $yylsa[-1]->{'text'};
          $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
          if ( $yyval->{'order'} == 0 ) {
                   yyerror("Error: expecting a non-zero integer exponent");
                   goto yyerrlab1 ;
          }
          last SWITCH;
  };

In C, you will have somewhat more complicated pointer arithmetic to
address the stack, but the usage of objects will be the same. Note here
that it is convenient to keep all information about the token in its
location info (yylsa, yylsp, yylval, @n), while everything relating to
the value of the expression, or to the parse tree, is better placed in
the semantic stack (yyvsa, yyvsp, yyval, $n). Also note that in some
cases you can do semantic checks inside rules and report useful messages
before or instead of invoking yyerror().

Finally, it is useful to make the following wrapper function around the
external yylex() in order to maintain your own token stack. Unlike the
parser's internal stack, which is only as deep as the rule being reduced,
this one can hold all tokens recognized during the current run, and that
can be extremely helpful for error reporting and any transformations you
may need. In this way, you can even scan (tokenize) the whole buffer
before handing it off to the parser (who knows, you may need a token
ahead of what is currently seen by the parser):


sub tokenize {
    undef @tokenTable;
    my ($tok, $text, $name, $unit, $first_line, $first_column,
        $last_line, $last_column);

    # this is where the c-coded yylex is called;
    # UnitLex is the perl extension encapsulating it
    while ( ($tok = &UnitLex::yylex()) > 0 ) {
       ( $text, $name, $unit, $first_line, $first_column, $last_line,
         $last_column ) = &UnitLex::getyylval;
       push(@tokenTable,
           Unit::yyltype->new (
              'token'         => $tok,
              'text'          => $text,
              'name'          => $name,
              'unit'          => $unit,
              'first_line'    => $first_line,
              'first_column'  => $first_column,
              'last_line'     => $last_line,
              'last_column'   => $last_column,
           )
       )
    }

}


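A rough C counterpart of the token table, as a sketch only. It assumes a trivial built-in classifier (words, numbers, single characters) in place of the real unit scanner; Token, TokenTable, tablePush and this tokenize are hypothetical names chosen for illustration, and the columns are zero-based.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token record; field names follow the first_column /
 * last_column convention used in the Perl code. */
typedef struct {
    int kind;           /* 'w' = word, 'n' = number, else the character itself */
    int first_column;   /* zero-based offset of the first character */
    int last_column;    /* zero-based offset of the last character */
} Token;

typedef struct {
    Token *items;
    size_t count;
    size_t capacity;
} TokenTable;

/* Appends one token, growing the array as needed. Returns 0 on failure. */
static int tablePush(TokenTable *t, Token tok)
{
    if (t->count == t->capacity) {
        size_t cap = t->capacity ? t->capacity * 2 : 16;
        Token *p = realloc(t->items, cap * sizeof *p);
        if (p == NULL)
            return 0;
        t->items = p;
        t->capacity = cap;
    }
    t->items[t->count++] = tok;
    return 1;
}

/* Scans all of s up front, skipping whitespace. The caller frees
 * table->items. Returns 1 on success, 0 on allocation failure. */
int tokenize(const char *s, TokenTable *table)
{
    size_t i = 0;
    assert(s != NULL && table != NULL);
    table->items = NULL;
    table->count = table->capacity = 0;
    while (s[i] != '\0') {
        size_t start = i;
        unsigned char c = (unsigned char) s[i++];
        Token tok;
        if (isspace(c))
            continue;
        if (isalpha(c)) {
            while (isalpha((unsigned char) s[i])) i++;
            tok.kind = 'w';
        } else if (isdigit(c)) {
            while (isdigit((unsigned char) s[i])) i++;
            tok.kind = 'n';
        } else {
            tok.kind = c;
        }
        tok.first_column = (int) start;
        tok.last_column = (int) i - 1;
        if (!tablePush(table, tok)) {
            free(table->items);
            table->items = NULL;
            table->count = table->capacity = 0;
            return 0;
        }
    }
    return 1;
}
```

With the whole input tokenized in advance, a parser-facing yylex() only has to step an index through table->items, and an error routine can look up the exact columns of any token it has already seen.
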
It is now a lot easier to handle various state-related problems, such as
backtracking and error reporting. The yylex() function as seen by the
parser might be constructed somewhat like this:

sub yylex {
    # $tokenNo is a global; now instead of a "file pointer",
    # as in the first example, we have a "token pointer"
    $yylloc = $tokenTable[$tokenNo];
    undef $yylval;


    # disregard this; name this block "computing semantic values"
    if ( $yylloc->{'token'} == UNIT) {
        $yylval = Unit::Operand->new(
        'unit'  => Unit::Dict::unit($yylloc->{'unit'}),
        'base'  => Unit::Dict::base($yylloc->{'unit'}),
        'scale' => Unit::Dict::scale($yylloc->{'unit'}),
        'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}),
        'loc'   => $yylloc,
       );
    }
    elsif ( ($yylloc->{'token'} == INTEGER ) ||
            ($yylloc->{'token'} == POSITIVE_NUMBER) ) {
        $yylval = Unit::Operand->new(
          'unit' => '1',
          'base' => '1',
          'scale' => 1,
          'scaleToBase' => 1,
          'loc'   => $yylloc,
        );
    }

    $tokenNo++;
    # This is all the parser needs to know about this token.
    # But we already made sure we saved everything we need to know.
    return($yylloc->{'token'});
}


Now the most interesting part, the error reporting routine:


sub yyerror {
    my ($str) = @_;
    my ($message, $start, $end, $loc);

    # This is the same as to say,
    # "obtain the location info for the current token"
    $loc = $tokenTable[$tokenNo-1];

    # You may use this routine for your own purposes or let the parser
    # use it
    if( $str ne 'parse error' ) {
        $message = "$str instead of `" . $loc->{'name'} . "' <" .
            $loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\n";
    }
    else {
        $message = "unexpected token `" . $loc->{'name'} . "' <" .
            $loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\n";
    }

    # that's the original string that was used to set the parser buffer
    $message .= $parseBuffer . "\n";

    $message .= ( ' ' x ($loc->{'first_column'} + 1) ) .
        ( '^' x length($loc->{'text'}) ). "\n";
    if( $str ne 'parse error' ) {
        print STDERR "$str instead of `", $loc->{'name'}, "' {",
            $loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }
    else {
        print STDERR "unexpected token `", $loc->{'name'}, "' {",
            $loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }

    print STDERR "$parseBuffer\n";
    print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x
        length($loc->{'text'}), "\n";
}

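The caret line at the end of the routine can be sketched in C as follows. This is an illustration under stated assumptions, not the code from the tar file: formatCaretLine is a hypothetical name, column is taken to be zero-based (so it pads with exactly column spaces, where the Perl above pads with first_column + 1), and a zero-width token such as end of buffer is rejected rather than drawn.

```c
#include <assert.h>
#include <string.h>

/* Writes "<source>\n<spaces><carets>\n" into out (capacity outSize).
 * column is the zero-based offset of the token and length its width in
 * characters. Returns 1 on success, 0 if out is too small or the range
 * does not fit inside source. */
int formatCaretLine(char *out, size_t outSize,
                    const char *source, size_t column, size_t length)
{
    size_t srcLen, needed, n;
    assert(out != NULL && source != NULL);
    srcLen = strlen(source);
    if (length == 0 || column > srcLen || length > srcLen - column)
        return 0;
    /* text, newline, padding, carets, newline, terminating NUL */
    needed = srcLen + 1 + column + length + 1 + 1;
    if (needed > outSize)
        return 0;
    memcpy(out, source, srcLen);
    n = srcLen;
    out[n++] = '\n';
    memset(out + n, ' ', column);
    n += column;
    memset(out + n, '^', length);
    n += length;
    out[n++] = '\n';
    out[n] = '\0';
    return 1;
}
```

Called with the source "mol/foo", column 4 and length 3, it places the carets under "foo", which is where the first scanner should have pointed.
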
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scanners used in these examples assume there is a single line of text on
the input (the first_line and last_line elements of yylloc are simply
ignored). If you want to be able to parse multi-line buffers, just add a
lex rule for '\n' that will increment the line count and reset the pos
variable to zero.


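One way the multi-line bookkeeping might look in C, as a sketch only. It assumes the column is kept in a counter separate from the buffer index, since in the first example pos also indexes the buffer and resetting it there would restart the scan; ScanPos, scanPosInit, scanPosAdvance and scanPosAt are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner position: offset keeps indexing the buffer, while
 * line and column are what gets reported to the user. */
typedef struct {
    size_t offset;  /* index into the buffer, never reset */
    int line;       /* 1-based line number */
    int column;     /* 0-based column within the current line */
} ScanPos;

void scanPosInit(ScanPos *p)
{
    assert(p != NULL);
    p->offset = 0;
    p->line = 1;
    p->column = 0;
}

/* Consumes one character; a newline starts a new line at column 0. */
void scanPosAdvance(ScanPos *p, char c)
{
    assert(p != NULL);
    p->offset++;
    if (c == '\n') {
        p->line++;
        p->column = 0;
    } else {
        p->column++;
    }
}

/* Convenience: position of the character at index target in text. */
ScanPos scanPosAt(const char *text, size_t target)
{
    ScanPos p;
    assert(text != NULL);
    scanPosInit(&p);
    while (p.offset < target && text[p.offset] != '\0')
        scanPosAdvance(&p, text[p.offset]);
    return p;
}
```

In a lex-generated scanner the body of scanPosAdvance would be split between the '\n' rule and the per-character input hook; here it is a plain function so it can be tried on its own.
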
Ugly as it may seem, I find this approach extremely liberating. If the
grammar becomes too complicated for a LALR(1) parser, I can cascade
multiple parsers. The token table can then be used to reassemble parts
of the original expression for subordinate parsers, preserving the
location info all the way down, so that subordinate parsers can report
their problems consistently. You probably don't need this, as SQL is
very well thought out and has a parsable grammar. But it may be of some
help for error reporting.


--Gene

From pgsql-patches-owner+M1499@postgresql.org Sat Aug  4 13:11:53 2001
Return-path: <pgsql-patches-owner+M1499@postgresql.org>
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f74HBrh11339
	for <pgman@candle.pha.pa.us>; Sat, 4 Aug 2001 13:11:53 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
	by postgresql.org (8.11.3/8.11.4) with SMTP id f74H89655183;
	Sat, 4 Aug 2001 13:08:09 -0400 (EDT)
	(envelope-from pgsql-patches-owner+M1499@postgresql.org)
Received: from sss.pgh.pa.us ([192.204.191.242])
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f74Gxb653074
	for <pgsql-patches@postgresql.org>; Sat, 4 Aug 2001 12:59:37 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f74GtPC29183;
	Sat, 4 Aug 2001 12:55:25 -0400 (EDT)
To: Dave Page <dpage@vale-housing.co.uk>
cc: "'Fernando Nasser'" <fnasser@cygnus.com>,
   Bruce Momjian <pgman@candle.pha.pa.us>, Neil Padgett <npadgett@redhat.com>,
   pgsql-patches@postgresql.org
Subject: Re: [PATCHES] Patch for Improved Syntax Error Reporting 
In-Reply-To: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk> 
References: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk>
Comments: In-reply-to Dave Page <dpage@vale-housing.co.uk>
	message dated "Sat, 04 Aug 2001 12:37:23 +0100"
Date: Sat, 04 Aug 2001 12:55:24 -0400
Message-ID: <29180.996944124@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-patches-owner@postgresql.org
Status: OR

Dave Page <dpage@vale-housing.co.uk> writes:
> Oh, I quite agree. I'm not adverse to updating my code, I just want to avoid
> users getting misleading messages until I come up with those updates.

Hmm ... if they were actively misleading then I'd share your concern.

I guess what you're thinking is that the error offset reported by the
backend won't correspond directly to what the user typed, and if the
user tries to use the offset to manually count off characters, he may
arrive at the wrong place?  Good point.  I'm not sure whether a message
like

	ERROR:  parser: parse error at or near 'frum';
	POSITION: 42

would be likely to encourage people to try that.  Thoughts?  (I do think
this is a good argument for not embedding the position straight into the
main error message though...)

One possible compromise is to combine the straight character-offset
approach with a simplistic context display:

	ERROR:  parser: parse error at or near 'frum';
	POSITION: 42  ... oid,relname FRUM ...

The idea is to define the "POSITION" field as an integer offset possibly
followed by whitespace and noise words.  An updated client would grab
the offset, ignore the rest of the field, and do the right thing.  A
not-updated client would display the entire message, and with any luck
the user would read it correctly.

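What "grab the offset, ignore the rest of the field" could look like on the client side, as a sketch in C. It assumes the client has already isolated the text after "POSITION:"; parsePositionField is a hypothetical helper, not a libpq function, and the field format is only the one proposed above.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parses the value of the proposed POSITION field, for example
 * "42  ... oid,relname FRUM ...". Stores the offset in *offset and
 * returns 1 on success; returns 0 if the field does not start with a
 * non-negative integer followed by whitespace or the end of the string. */
int parsePositionField(const char *field, long *offset)
{
    char *end;
    long value;
    assert(field != NULL && offset != NULL);
    while (isspace((unsigned char) *field))
        field++;
    if (!isdigit((unsigned char) *field))
        return 0;
    errno = 0;
    value = strtol(field, &end, 10);
    if (errno == ERANGE)
        return 0;
    /* Anything after the number must be separated by whitespace;
     * the noise words themselves are never inspected. */
    if (*end != '\0' && !isspace((unsigned char) *end))
        return 0;
    *offset = value;
    return 1;
}
```

A client that does not know about the field never calls this and simply prints the whole line, which is the fallback described above.
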
			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html