mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-04 00:02:52 -05:00 
			
		
		
		
	the newer XML stuff in core. (This should probably also be referred to in the release notes.)
		
			
				
	
	
		
			279 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			279 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
XML-handling functions for PostgreSQL
 | 
						|
=====================================
 | 
						|
 | 
						|
    DEPRECATION NOTICE:  From PostgreSQL 8.3 on, there is XML-related
 | 
						|
    functionality based on the SQL/XML standard in the core server.
 | 
						|
    That functionality covers XML syntax checking and XPath queries,
 | 
						|
    which is what this module does as well, and more, but the API is
 | 
						|
    not at all compatible.  It is planned that this module will be
 | 
						|
    removed in PostgreSQL 8.4 in favor of the newer standard API, so
 | 
						|
    you are encouraged to try converting your applications.  If you
 | 
						|
    find that some of the functionality of this module is not
 | 
						|
    available in an adequate form with the newer API, please explain
 | 
						|
    your issue to pgsql-hackers@postgresql.org so that the deficiency
 | 
						|
    can be addressed.
 | 
						|
                                      -- Peter Eisentraut, 2007-05-24
 | 
						|
 | 
						|
Development of this module was sponsored by Torchbox Ltd. (www.torchbox.com)
 | 
						|
It has the same BSD licence as PostgreSQL.
 | 
						|
 | 
						|
This version of the XML functions provides both XPath querying and
 | 
						|
XSLT functionality. There is also a new table function which allows
 | 
						|
the straightforward return of multiple XML results. Note that the current code
 | 
						|
doesn't take any particular care over character sets - this is
 | 
						|
something that should be fixed at some point!
 | 
						|
 | 
						|
Installation
 | 
						|
------------
 | 
						|
 | 
						|
The current build process will only work if the files are in
 | 
						|
contrib/xml2 in a PostgreSQL 7.3 or later source tree which has been
 | 
						|
configured and built (If you alter the subdir value in the Makefile
 | 
						|
you can place it in a different directory in a PostgreSQL tree).
 | 
						|
 | 
						|
Before you begin, just check the Makefile, and then just 'make' and
 | 
						|
'make install'.
 | 
						|
 | 
						|
By default, this module requires both libxml2 and libxslt to be installed
 | 
						|
on your system.  If you do not have libxslt or do not want to use XSLT
 | 
						|
functions, you must edit the Makefile to not build the XSLT functions,
 | 
						|
as directed in its comments; and edit pgxml.sql.in to remove the XSLT
 | 
						|
function declarations, as directed in its comments.
 | 
						|
 | 
						|
Description of functions
 | 
						|
------------------------
 | 
						|
 | 
						|
The first set of functions are straightforward XML parsing and XPath queries:
 | 
						|
 | 
						|
xml_is_well_formed(document) RETURNS bool
 | 
						|
 | 
						|
This parses the document text in its parameter and returns true if the
 | 
						|
document is well-formed XML.  (Note: before PostgreSQL 8.2, this function
 | 
						|
was called xml_valid().  That is the wrong name since validity and
 | 
						|
well-formedness have different meanings in XML.  The old name is still
 | 
						|
available, but is deprecated and will be removed in 8.3.)
 | 
						|
 | 
						|
xpath_string(document,query) RETURNS text
 | 
						|
xpath_number(document,query) RETURNS float4
 | 
						|
xpath_bool(document,query) RETURNS bool
 | 
						|
 | 
						|
These functions evaluate the XPath query on the supplied document, and
 | 
						|
cast the result to the specified type.
 | 
						|
 | 
						|
 | 
						|
xpath_nodeset(document,query,toptag,itemtag) RETURNS text
 | 
						|
 | 
						|
This evaluates query on document and wraps the result in XML tags. If
 | 
						|
the result is multivalued, the output will look like:
 | 
						|
 | 
						|
<toptag>
 | 
						|
<itemtag>Value 1 which could be an XML fragment</itemtag>
 | 
						|
<itemtag>Value 2....</itemtag>
 | 
						|
</toptag>
 | 
						|
 | 
						|
If either toptag or itemtag is an empty string, the relevant tag is omitted.
 | 
						|
There are also wrapper functions for this operation:
 | 
						|
 | 
						|
xpath_nodeset(document,query) RETURNS text omits both tags.
 | 
						|
xpath_nodeset(document,query,itemtag) RETURNS text omits toptag.
 | 
						|
 | 
						|
 | 
						|
xpath_list(document,query,seperator) RETURNS text
 | 
						|
 | 
						|
This function returns multiple values seperated by the specified
 | 
						|
seperator, e.g. Value 1,Value 2,Value 3 if seperator=','.
 | 
						|
 | 
						|
xpath_list(document,query) RETURNS text
 | 
						|
 | 
						|
This is a wrapper for the above function that uses ',' as the seperator.
 | 
						|
 | 
						|
 | 
						|
xpath_table
 | 
						|
-----------
 | 
						|
 | 
						|
This is a table function which evaluates a set of XPath queries on
 | 
						|
each of a set of documents and returns the results as a table. The
 | 
						|
primary key field from the original document table is returned as the
 | 
						|
first column of the result so that the resultset from xpath_table can
 | 
						|
be readily used in joins.
 | 
						|
 | 
						|
The function itself takes 5 arguments, all text.
 | 
						|
 | 
						|
xpath_table(key,document,relation,xpaths,criteria)
 | 
						|
 | 
						|
key - the name of the "key" field - this is just a field to be used as
 | 
						|
the first column of the output table i.e. it identifies the record from
 | 
						|
which each output row came (see note below about multiple values).
 | 
						|
 | 
						|
document - the name of the field containing the XML document
 | 
						|
 | 
						|
relation - the name of the table or view containing the documents
 | 
						|
 | 
						|
xpaths - multiple xpath expressions separated by |
 | 
						|
 | 
						|
criteria - The contents of the where clause. This needs to be specified,
 | 
						|
so use "true" or "1=1" here if you want to process all the rows in the
 | 
						|
relation.
 | 
						|
 | 
						|
NB These parameters (except the XPath strings) are just substituted
 | 
						|
into a plain SQL SELECT statement, so you have some flexibility - the
 | 
						|
statement is
 | 
						|
 | 
						|
SELECT <key>,<document> FROM <relation> WHERE <criteria>
 | 
						|
 | 
						|
so those parameters can be *anything* valid in those particular
 | 
						|
locations. The result from this SELECT needs to return exactly two
 | 
						|
columns (which it will unless you try to list multiple fields for key
 | 
						|
or document). Beware that this simplistic approach requires that you
 | 
						|
validate any user-supplied values to avoid SQL injection attacks.
 | 
						|
 | 
						|
Using the function
 | 
						|
 | 
						|
The function has to be used in a FROM expression. This gives the following
 | 
						|
form:
 | 
						|
 | 
						|
SELECT * FROM
 | 
						|
xpath_table('article_id', 
 | 
						|
	'article_xml',
 | 
						|
	'articles', 
 | 
						|
	'/article/author|/article/pages|/article/title',
 | 
						|
	'date_entered > ''2003-01-01'' ') 
 | 
						|
AS t(article_id integer, author text, page_count integer, title text);
 | 
						|
 | 
						|
The AS clause defines the names and types of the columns in the
 | 
						|
virtual table. If there are more XPath queries than result columns,
 | 
						|
the extra queries will be ignored. If there are more result columns
 | 
						|
than XPath queries, the extra columns will be NULL.
 | 
						|
 | 
						|
Note that I've said in this example that pages is an integer.  The
 | 
						|
function deals internally with string representations, so when you say
 | 
						|
you want an integer in the output, it will take the string
 | 
						|
representation of the XPath result and use PostgreSQL input functions
 | 
						|
to transform it into an integer (or whatever type the AS clause
 | 
						|
requests). An error will result if it can't do this - for example if
 | 
						|
the result is empty - so you may wish to just stick to 'text' as the
 | 
						|
column type if you think your data has any problems.
 | 
						|
 | 
						|
The select statement doesn't need to use * alone - it can reference the
 | 
						|
columns by name or join them to other tables. The function produces a
 | 
						|
virtual table with which you can perform any operation you wish (e.g.
 | 
						|
aggregation, joining, sorting etc). So we could also have:
 | 
						|
 | 
						|
SELECT t.title, p.fullname, p.email 
 | 
						|
FROM xpath_table('article_id','article_xml','articles',
 | 
						|
            '/article/title|/article/author/@id',
 | 
						|
            'xpath_string(article_xml,''/article/@date'') > ''2003-03-20'' ')
 | 
						|
            AS t(article_id integer, title text, author_id integer), 
 | 
						|
     tblPeopleInfo AS p 
 | 
						|
WHERE t.author_id = p.person_id;
 | 
						|
 | 
						|
as a more complicated example. Of course, you could wrap all
 | 
						|
of this in a view for convenience.
 | 
						|
 | 
						|
Multivalued results
 | 
						|
 | 
						|
The xpath_table function assumes that the results of each XPath query
 | 
						|
might be multi-valued, so the number of rows returned by the function
 | 
						|
may not be the same as the number of input documents. The first row
 | 
						|
returned contains the first result from each query, the second row the
 | 
						|
second result from each query. If one of the queries has fewer values
 | 
						|
than the others, NULLs will be returned instead.
 | 
						|
 | 
						|
In some cases, a user will know that a given XPath query will return
 | 
						|
only a single result (perhaps a unique document identifier) - if used
 | 
						|
alongside an XPath query returning multiple results, the single-valued
 | 
						|
result will appear only on the first row of the result. The solution
 | 
						|
to this is to use the key field as part of a join against a simpler
 | 
						|
XPath query. As an example:
 | 
						|
 | 
						|
 | 
						|
CREATE TABLE test
 | 
						|
(
 | 
						|
  id int4 NOT NULL,
 | 
						|
  xml text,
 | 
						|
  CONSTRAINT pk PRIMARY KEY (id)
 | 
						|
) 
 | 
						|
WITHOUT OIDS;
 | 
						|
 | 
						|
INSERT INTO test VALUES (1, '<doc num="C1">
 | 
						|
<line num="L1"><a>1</a><b>2</b><c>3</c></line>
 | 
						|
<line num="L2"><a>11</a><b>22</b><c>33</c></line>
 | 
						|
</doc>');
 | 
						|
 | 
						|
INSERT INTO test VALUES (2, '<doc num="C2">
 | 
						|
<line num="L1"><a>111</a><b>222</b><c>333</c></line>
 | 
						|
<line num="L2"><a>111</a><b>222</b><c>333</c></line>
 | 
						|
</doc>');
 | 
						|
 | 
						|
 | 
						|
The query:
 | 
						|
 | 
						|
SELECT * FROM  xpath_table('id','xml','test', 
 | 
						|
'/doc/@num|/doc/line/@num|/doc/line/a|/doc/line/b|/doc/line/c','1=1') 
 | 
						|
AS t(id int4, doc_num varchar(10), line_num varchar(10), val1 int4, 
 | 
						|
val2 int4, val3 int4)
 | 
						|
WHERE id = 1 ORDER BY doc_num, line_num
 | 
						|
 | 
						|
 | 
						|
Gives the result:
 | 
						|
 | 
						|
 id | doc_num | line_num | val1 | val2 | val3
 | 
						|
----+---------+----------+------+------+------
 | 
						|
  1 | C1      | L1       |    1 |    2 |    3
 | 
						|
  1 |         | L2       |   11 |   22 |   33
 | 
						|
 | 
						|
To get doc_num on every line, the solution is to use two invocations
 | 
						|
of xpath_table and join the results:
 | 
						|
 | 
						|
SELECT t.*,i.doc_num FROM 
 | 
						|
  xpath_table('id','xml','test',
 | 
						|
   '/doc/line/@num|/doc/line/a|/doc/line/b|/doc/line/c','1=1') 
 | 
						|
        AS t(id int4, line_num varchar(10), val1 int4, val2 int4, val3 int4),
 | 
						|
  xpath_table('id','xml','test','/doc/@num','1=1') 
 | 
						|
        AS i(id int4, doc_num varchar(10))
 | 
						|
WHERE i.id=t.id AND i.id=1
 | 
						|
ORDER BY doc_num, line_num;
 | 
						|
 | 
						|
which gives the desired result:
 | 
						|
 | 
						|
 id | line_num | val1 | val2 | val3 | doc_num
 | 
						|
----+----------+------+------+------+---------
 | 
						|
  1 | L1       |    1 |    2 |    3 | C1
 | 
						|
  1 | L2       |   11 |   22 |   33 | C1
 | 
						|
(2 rows)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
XSLT functions
 | 
						|
--------------
 | 
						|
 | 
						|
The following functions are available if libxslt is installed (this is
 | 
						|
not currently detected automatically, so you will have to amend the
 | 
						|
Makefile)
 | 
						|
 | 
						|
xslt_process(document,stylesheet,paramlist) RETURNS text
 | 
						|
 | 
						|
This function appplies the XSL stylesheet to the document and returns
 | 
						|
the transformed result. The paramlist is a list of parameter
 | 
						|
assignments to be used in the transformation, specified in the form
 | 
						|
'a=1,b=2'. Note that this is also proof-of-concept code and the
 | 
						|
parameter parsing is very simple-minded (e.g. parameter values cannot
 | 
						|
contain commas!)
 | 
						|
 | 
						|
Also note that if either the document or stylesheet values do not
 | 
						|
begin with a < then they will be treated as URLs and libxslt will
 | 
						|
fetch them. It thus follows that you can use xslt_process as a means
 | 
						|
to fetch the contents of URLs - you should be aware of the security
 | 
						|
implications of this.
 | 
						|
 | 
						|
There is also a two-parameter version of xslt_process which does not
 | 
						|
pass any parameters to the transformation.
 | 
						|
 | 
						|
 | 
						|
Feedback
 | 
						|
--------
 | 
						|
 | 
						|
If you have any comments or suggestions, please do contact me at
 | 
						|
jgray@azuli.co.uk. Unfortunately, this isn't my main job, so I can't
 | 
						|
guarantee a rapid response to your query!
 |