mirror of
https://github.com/postgres/postgres.git
synced 2025-06-01 00:01:20 -04:00
Add some real documentation about TOAST (finally). Combine this with
the old 'page' chapter and the recently added 'filelayout' chapter to make a coherent chapter about PostgreSQL's physical storage layout.
This commit is contained in:
parent
521e8888e9
commit
ab3bb9cf78
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.13 2004/12/28 19:08:58 tgl Exp $
|
$PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.14 2005/01/10 00:04:38 tgl Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<chapter id="diskusage">
|
<chapter id="diskusage">
|
||||||
@ -22,12 +22,12 @@ $PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.13 2004/12/28 19:08:58 tgl Ex
|
|||||||
stored. If the table has any columns with potentially-wide values,
|
stored. If the table has any columns with potentially-wide values,
|
||||||
there is also a <acronym>TOAST</> file associated with the table,
|
there is also a <acronym>TOAST</> file associated with the table,
|
||||||
which is used to store values too wide to fit comfortably in the main
|
which is used to store values too wide to fit comfortably in the main
|
||||||
table. There will be one index on the
|
table (see <xref linkend="storage-toast">). There will be one index on the
|
||||||
<acronym>TOAST</> table, if present. There may also be indexes associated
|
<acronym>TOAST</> table, if present. There may also be indexes associated
|
||||||
with the base table. Each table and index is stored in a separate disk
|
with the base table. Each table and index is stored in a separate disk
|
||||||
file — possibly more than one file, if the file would exceed one
|
file — possibly more than one file, if the file would exceed one
|
||||||
gigabyte. Naming conventions for these files are described in <xref
|
gigabyte. Naming conventions for these files are described in <xref
|
||||||
linkend="file-layout">.
|
linkend="storage-file-layout">.
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
|
@ -1,161 +0,0 @@
|
|||||||
<!--
|
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/filelayout.sgml,v 1.2 2004/11/16 15:00:36 tgl Exp $
|
|
||||||
-->
|
|
||||||
|
|
||||||
<chapter id="file-layout">
|
|
||||||
|
|
||||||
<title>Database File Layout</title>
|
|
||||||
|
|
||||||
<abstract>
|
|
||||||
<para>
|
|
||||||
A description of the database physical storage layout.
|
|
||||||
</para>
|
|
||||||
</abstract>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
This section provides an overview of the physical format used by
|
|
||||||
<productname>PostgreSQL</productname> databases.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
All the data needed for a database cluster is stored within the cluster's data
|
|
||||||
directory, commonly referred to as <varname>PGDATA</> (after the name of the
|
|
||||||
environment variable that can be used to define it). A common location for
|
|
||||||
<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
|
|
||||||
managed by different postmasters, can exist on the same machine.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
The <varname>PGDATA</> directory contains several subdirectories and control
|
|
||||||
files, as shown in <xref linkend="pgdata-contents-table">. In addition to
|
|
||||||
these required items, the cluster configuration files
|
|
||||||
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
|
|
||||||
<filename>pg_ident.conf</filename> are traditionally stored in
|
|
||||||
<varname>PGDATA</> (although beginning in
|
|
||||||
<productname>PostgreSQL</productname> 8.0 it is possible to keep them
|
|
||||||
elsewhere).
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<table tocentry="1" id="pgdata-contents-table">
|
|
||||||
<title>Contents of <varname>PGDATA</></title>
|
|
||||||
<tgroup cols="2">
|
|
||||||
<thead>
|
|
||||||
<row>
|
|
||||||
<entry>
|
|
||||||
Item
|
|
||||||
</entry>
|
|
||||||
<entry>Description</entry>
|
|
||||||
</row>
|
|
||||||
</thead>
|
|
||||||
|
|
||||||
<tbody>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>PG_VERSION</></entry>
|
|
||||||
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>base</></entry>
|
|
||||||
<entry>Subdirectory containing per-database subdirectories</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>global</></entry>
|
|
||||||
<entry>Subdirectory containing cluster-wide tables, such as
|
|
||||||
<structname>pg_database</></entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>pg_clog</></entry>
|
|
||||||
<entry>Subdirectory containing transaction commit status data</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>pg_subtrans</></entry>
|
|
||||||
<entry>Subdirectory containing subtransaction status data</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>pg_tblspc</></entry>
|
|
||||||
<entry>Subdirectory containing symbolic links to tablespaces</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>pg_xlog</></entry>
|
|
||||||
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>postmaster.opts</></entry>
|
|
||||||
<entry>A file recording the command-line options the postmaster was
|
|
||||||
last started with</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry><filename>postmaster.pid</></entry>
|
|
||||||
<entry>A lock file recording the current postmaster PID and shared memory
|
|
||||||
segment ID (not present after postmaster shutdown)</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
</tbody>
|
|
||||||
</tgroup>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
For each database in the cluster there is a subdirectory within
|
|
||||||
<varname>PGDATA</><filename>/base</>, named after the database's OID in
|
|
||||||
<structname>pg_database</>. This subdirectory is the default location
|
|
||||||
for the database's files; in particular, its system catalogs are stored
|
|
||||||
there.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
Each table and index is stored in a separate file, named after the table
|
|
||||||
or index's <firstterm>filenode</> number, which can be found in
|
|
||||||
<structname>pg_class</>.<structfield>relfilenode</>.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<caution>
|
|
||||||
<para>
|
|
||||||
Note that while a table's filenode often matches its OID, this is
|
|
||||||
<emphasis>not</> necessarily the case; some operations, like
|
|
||||||
<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
|
|
||||||
of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
|
|
||||||
Avoid assuming that filenode and table OID are the same.
|
|
||||||
</para>
|
|
||||||
</caution>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
When a table or index exceeds 1Gb, it is divided into gigabyte-sized
|
|
||||||
<firstterm>segments</>. The first segment's file name is the same as the
|
|
||||||
filenode; subsequent segments are named filenode.1, filenode.2, etc.
|
|
||||||
This arrangement avoids problems on platforms that have file size limitations.
|
|
||||||
The contents of tables and indexes are discussed further in
|
|
||||||
<xref linkend="page">.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
A table that has columns with potentially large entries will have an
|
|
||||||
associated <firstterm>TOAST</> table, which is used for out-of-line storage of
|
|
||||||
field values that are too large to keep in the table rows proper.
|
|
||||||
<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
|
|
||||||
its TOAST table, if any.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
Tablespaces make the scenario more complicated. Each user-defined tablespace
|
|
||||||
has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
|
|
||||||
directory, which points to the physical tablespace directory (as specified in
|
|
||||||
its <command>CREATE TABLESPACE</> command). The symbolic link is named after
|
|
||||||
the tablespace's OID. Inside the physical tablespace directory there is
|
|
||||||
a subdirectory for each database that has elements in the tablespace, named
|
|
||||||
after the database's OID. Tables within that directory follow the filenode
|
|
||||||
naming scheme. The <literal>pg_default</> tablespace is not accessed through
|
|
||||||
<filename>pg_tblspc</>, but corresponds to
|
|
||||||
<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
|
|
||||||
tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
|
|
||||||
<varname>PGDATA</><filename>/global</>.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
</chapter>
|
|
@ -1,4 +1,4 @@
|
|||||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.40 2004/12/03 05:50:18 momjian Exp $ -->
|
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.41 2005/01/10 00:04:38 tgl Exp $ -->
|
||||||
|
|
||||||
<!entity history SYSTEM "history.sgml">
|
<!entity history SYSTEM "history.sgml">
|
||||||
<!entity info SYSTEM "info.sgml">
|
<!entity info SYSTEM "info.sgml">
|
||||||
@ -75,15 +75,14 @@
|
|||||||
<!entity arch-dev SYSTEM "arch-dev.sgml">
|
<!entity arch-dev SYSTEM "arch-dev.sgml">
|
||||||
<!entity bki SYSTEM "bki.sgml">
|
<!entity bki SYSTEM "bki.sgml">
|
||||||
<!entity catalogs SYSTEM "catalogs.sgml">
|
<!entity catalogs SYSTEM "catalogs.sgml">
|
||||||
<!entity filelayout SYSTEM "filelayout.sgml">
|
|
||||||
<!entity geqo SYSTEM "geqo.sgml">
|
<!entity geqo SYSTEM "geqo.sgml">
|
||||||
<!entity gist SYSTEM "gist.sgml">
|
<!entity gist SYSTEM "gist.sgml">
|
||||||
<!entity indexcost SYSTEM "indexcost.sgml">
|
<!entity indexcost SYSTEM "indexcost.sgml">
|
||||||
<!entity nls SYSTEM "nls.sgml">
|
<!entity nls SYSTEM "nls.sgml">
|
||||||
<!entity page SYSTEM "page.sgml">
|
|
||||||
<!entity plhandler SYSTEM "plhandler.sgml">
|
<!entity plhandler SYSTEM "plhandler.sgml">
|
||||||
<!entity protocol SYSTEM "protocol.sgml">
|
<!entity protocol SYSTEM "protocol.sgml">
|
||||||
<!entity sources SYSTEM "sources.sgml">
|
<!entity sources SYSTEM "sources.sgml">
|
||||||
|
<!entity storage SYSTEM "storage.sgml">
|
||||||
|
|
||||||
<!-- appendixes -->
|
<!-- appendixes -->
|
||||||
<!entity contacts SYSTEM "contacts.sgml">
|
<!entity contacts SYSTEM "contacts.sgml">
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.35 2005/01/08 22:13:33 tgl Exp $
|
$PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.36 2005/01/10 00:04:38 tgl Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<chapter id="largeObjects">
|
<chapter id="largeObjects">
|
||||||
@ -51,9 +51,11 @@ $PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.35 2005/01/08 22:13:33 tgl Exp $
|
|||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
<indexterm><primary>TOAST</></>
|
<indexterm>
|
||||||
<indexterm><primary>sliced bread</><see>TOAST</></indexterm>
|
<primary>TOAST</primary>
|
||||||
<productname>PostgreSQL 7.1</productname> introduced a mechanism
|
<secondary>versus large objects</secondary>
|
||||||
|
</indexterm>
|
||||||
|
<productname>PostgreSQL</productname> 7.1 introduced a mechanism
|
||||||
(nicknamed <quote><acronym>TOAST</acronym></quote>) that allows
|
(nicknamed <quote><acronym>TOAST</acronym></quote>) that allows
|
||||||
data values to be much larger than single pages. This
|
data values to be much larger than single pages. This
|
||||||
makes the large object facility partially obsolete. One
|
makes the large object facility partially obsolete. One
|
||||||
|
@ -1,354 +0,0 @@
|
|||||||
<!--
|
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.19 2004/11/12 21:50:53 tgl Exp $
|
|
||||||
-->
|
|
||||||
|
|
||||||
<chapter id="page">
|
|
||||||
|
|
||||||
<title>Database Page Layout</title>
|
|
||||||
|
|
||||||
<abstract>
|
|
||||||
<para>
|
|
||||||
A description of the database file page format.
|
|
||||||
</para>
|
|
||||||
</abstract>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
This section provides an overview of the page format used by
|
|
||||||
<productname>PostgreSQL</productname> tables and indexes.<footnote>
|
|
||||||
<para>
|
|
||||||
Actually, index access methods need not use this page format.
|
|
||||||
All the existing index methods do use this basic format,
|
|
||||||
but the data kept on index metapages usually doesn't follow
|
|
||||||
the item layout rules.
|
|
||||||
</para>
|
|
||||||
</footnote>
|
|
||||||
TOAST tables and sequences are formatted just like a regular table.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
In the following explanation, a
|
|
||||||
<firstterm>byte</firstterm>
|
|
||||||
is assumed to contain 8 bits. In addition, the term
|
|
||||||
<firstterm>item</firstterm>
|
|
||||||
refers to an individual data value that is stored on a page. In a table,
|
|
||||||
an item is a row; in an index, an item is an index entry.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
Every table and index is stored as an array of <firstterm>pages</> of a
|
|
||||||
fixed size (usually 8K, although a different page size can be selected
|
|
||||||
when compiling the server). In a table, all the pages are logically
|
|
||||||
equivalent, so a particular item (row) can be stored in any page. In
|
|
||||||
indexes, the first page is generally reserved as a <firstterm>metapage</>
|
|
||||||
holding control information, and there may be different types of pages
|
|
||||||
within the index, depending on the index access method.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
<xref linkend="page-table"> shows the overall layout of a page.
|
|
||||||
There are five parts to each page.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<table tocentry="1" id="page-table">
|
|
||||||
<title>Overall Page Layout</title>
|
|
||||||
<titleabbrev>Page Layout</titleabbrev>
|
|
||||||
<tgroup cols="2">
|
|
||||||
<thead>
|
|
||||||
<row>
|
|
||||||
<entry>
|
|
||||||
Item
|
|
||||||
</entry>
|
|
||||||
<entry>Description</entry>
|
|
||||||
</row>
|
|
||||||
</thead>
|
|
||||||
|
|
||||||
<tbody>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry>PageHeaderData</entry>
|
|
||||||
<entry>20 bytes long. Contains general information about the page, including
|
|
||||||
free space pointers.</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry>ItemPointerData</entry>
|
|
||||||
<entry>Array of (offset,length) pairs pointing to the actual items.
|
|
||||||
4 bytes per item.</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry>Free space</entry>
|
|
||||||
<entry>The unallocated space. New item pointers are allocated from the start
|
|
||||||
of this area, new items from the end.</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry>Items</entry>
|
|
||||||
<entry>The actual items themselves.</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
<row>
|
|
||||||
<entry>Special space</entry>
|
|
||||||
<entry>Index access method specific data. Different methods store different
|
|
||||||
data. Empty in ordinary tables.</entry>
|
|
||||||
</row>
|
|
||||||
|
|
||||||
</tbody>
|
|
||||||
</tgroup>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
The first 20 bytes of each page consists of a page header
|
|
||||||
(PageHeaderData). Its format is detailed in <xref
|
|
||||||
linkend="pageheaderdata-table">. The first two fields track the most
|
|
||||||
recent WAL entry related to this page. They are followed by three 2-byte
|
|
||||||
integer fields
|
|
||||||
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
|
|
||||||
and <structfield>pd_special</structfield>). These contain byte offsets
|
|
||||||
from the page start to the start
|
|
||||||
of unallocated space, to the end of unallocated space, and to the start of
|
|
||||||
the special space.
|
|
||||||
The last 2 bytes of the page header,
|
|
||||||
<structfield>pd_pagesize_version</structfield>, store both the page size
|
|
||||||
and a version indicator. Beginning with
|
|
||||||
<productname>PostgreSQL</productname> 8.0 the version number is 2;
|
|
||||||
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
|
|
||||||
prior releases used version number 0.
|
|
||||||
(The basic page layout and header format has not changed in these versions,
|
|
||||||
but the layout of heap row headers has.) The page size
|
|
||||||
is basically only present as a cross-check; there is no support for having
|
|
||||||
more than one page size in an installation.
|
|
||||||
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<table tocentry="1" id="pageheaderdata-table">
|
|
||||||
<title>PageHeaderData Layout</title>
|
|
||||||
<titleabbrev>PageHeaderData Layout</titleabbrev>
|
|
||||||
<tgroup cols="4">
|
|
||||||
<thead>
|
|
||||||
<row>
|
|
||||||
<entry>Field</entry>
|
|
||||||
<entry>Type</entry>
|
|
||||||
<entry>Length</entry>
|
|
||||||
<entry>Description</entry>
|
|
||||||
</row>
|
|
||||||
</thead>
|
|
||||||
<tbody>
|
|
||||||
<row>
|
|
||||||
<entry>pd_lsn</entry>
|
|
||||||
<entry>XLogRecPtr</entry>
|
|
||||||
<entry>8 bytes</entry>
|
|
||||||
<entry>LSN: next byte after last byte of xlog record for last change
|
|
||||||
to this page</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>pd_tli</entry>
|
|
||||||
<entry>TimeLineID</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>TLI of last change</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>pd_lower</entry>
|
|
||||||
<entry>LocationIndex</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>Offset to start of free space</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>pd_upper</entry>
|
|
||||||
<entry>LocationIndex</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>Offset to end of free space</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>pd_special</entry>
|
|
||||||
<entry>LocationIndex</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>Offset to start of special space</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>pd_pagesize_version</entry>
|
|
||||||
<entry>uint16</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>Page size and layout version number information</entry>
|
|
||||||
</row>
|
|
||||||
</tbody>
|
|
||||||
</tgroup>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
All the details may be found in
|
|
||||||
<filename>src/include/storage/bufpage.h</filename>.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
Following the page header are item identifiers
|
|
||||||
(<type>ItemIdData</type>), each requiring four bytes.
|
|
||||||
An item identifier contains a byte-offset to
|
|
||||||
the start of an item, its length in bytes, and a few attribute bits
|
|
||||||
which affect its interpretation.
|
|
||||||
New item identifiers are allocated
|
|
||||||
as needed from the beginning of the unallocated space.
|
|
||||||
The number of item identifiers present can be determined by looking at
|
|
||||||
<structfield>pd_lower</>, which is increased to allocate a new identifier.
|
|
||||||
Because an item
|
|
||||||
identifier is never moved until it is freed, its index may be used on a
|
|
||||||
long-term basis to reference an item, even when the item itself is moved
|
|
||||||
around on the page to compact free space. In fact, every pointer to an
|
|
||||||
item (<type>ItemPointer</type>, also known as
|
|
||||||
<type>CTID</type>) created by
|
|
||||||
<productname>PostgreSQL</productname> consists of a page number and the
|
|
||||||
index of an item identifier.
|
|
||||||
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
The items themselves are stored in space allocated backwards from the end
|
|
||||||
of unallocated space. The exact structure varies depending on what the
|
|
||||||
table is to contain. Tables and sequences both use a structure named
|
|
||||||
<type>HeapTupleHeaderData</type>, described below.
|
|
||||||
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
The final section is the <quote>special section</quote> which may
|
|
||||||
contain anything the access method wishes to store. For example,
|
|
||||||
b-tree indexes store links to the page's left and right siblings,
|
|
||||||
as well as some other data relevant to the index structure.
|
|
||||||
Ordinary tables do not use a special section at all (indicated by setting
|
|
||||||
<structfield>pd_special</> to equal the page size).
|
|
||||||
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
All table rows are structured in the same way. There is a fixed-size
|
|
||||||
header (occupying 27 bytes on most machines), followed by an optional null
|
|
||||||
bitmap, an optional object ID field, and the user data. The header is
|
|
||||||
detailed
|
|
||||||
in <xref linkend="heaptupleheaderdata-table">. The actual user data
|
|
||||||
(columns of the row) begins at the offset indicated by
|
|
||||||
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
|
|
||||||
distance for the platform.
|
|
||||||
The null bitmap is
|
|
||||||
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
|
|
||||||
<structfield>t_infomask</structfield>. If it is present it begins just after
|
|
||||||
the fixed header and occupies enough bytes to have one bit per data column
|
|
||||||
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
|
|
||||||
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
|
|
||||||
present, all columns are assumed not-null.
|
|
||||||
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
|
|
||||||
is set in <structfield>t_infomask</structfield>. If present, it appears just
|
|
||||||
before the <structfield>t_hoff</> boundary. Any padding needed to make
|
|
||||||
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
|
|
||||||
bitmap and the object ID. (This in turn ensures that the object ID is
|
|
||||||
suitably aligned.)
|
|
||||||
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<table tocentry="1" id="heaptupleheaderdata-table">
|
|
||||||
<title>HeapTupleHeaderData Layout</title>
|
|
||||||
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
|
|
||||||
<tgroup cols="4">
|
|
||||||
<thead>
|
|
||||||
<row>
|
|
||||||
<entry>Field</entry>
|
|
||||||
<entry>Type</entry>
|
|
||||||
<entry>Length</entry>
|
|
||||||
<entry>Description</entry>
|
|
||||||
</row>
|
|
||||||
</thead>
|
|
||||||
<tbody>
|
|
||||||
<row>
|
|
||||||
<entry>t_xmin</entry>
|
|
||||||
<entry>TransactionId</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>insert XID stamp</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_cmin</entry>
|
|
||||||
<entry>CommandId</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>insert CID stamp</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_xmax</entry>
|
|
||||||
<entry>TransactionId</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>delete XID stamp</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_cmax</entry>
|
|
||||||
<entry>CommandId</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>delete CID stamp (overlays with t_xvac)</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_xvac</entry>
|
|
||||||
<entry>TransactionId</entry>
|
|
||||||
<entry>4 bytes</entry>
|
|
||||||
<entry>XID for VACUUM operation moving a row version</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_ctid</entry>
|
|
||||||
<entry>ItemPointerData</entry>
|
|
||||||
<entry>6 bytes</entry>
|
|
||||||
<entry>current TID of this or newer row version</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_natts</entry>
|
|
||||||
<entry>int16</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>number of attributes</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_infomask</entry>
|
|
||||||
<entry>uint16</entry>
|
|
||||||
<entry>2 bytes</entry>
|
|
||||||
<entry>various flag bits</entry>
|
|
||||||
</row>
|
|
||||||
<row>
|
|
||||||
<entry>t_hoff</entry>
|
|
||||||
<entry>uint8</entry>
|
|
||||||
<entry>1 byte</entry>
|
|
||||||
<entry>offset to user data</entry>
|
|
||||||
</row>
|
|
||||||
</tbody>
|
|
||||||
</tgroup>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
All the details may be found in
|
|
||||||
<filename>src/include/access/htup.h</filename>.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
|
|
||||||
Interpreting the actual data can only be done with information obtained
|
|
||||||
from other tables, mostly <structname>pg_attribute</structname>. The
|
|
||||||
key values needed to identify field locations are
|
|
||||||
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
|
|
||||||
There is no way to directly get a
|
|
||||||
particular attribute, except when there are only fixed width fields and no
|
|
||||||
NULLs. All this trickery is wrapped up in the functions
|
|
||||||
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
|
|
||||||
and <firstterm>heap_getsysattr</firstterm>.
|
|
||||||
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
|
|
||||||
To read the data you need to examine each attribute in turn. First check
|
|
||||||
whether the field is NULL according to the null bitmap. If it is, go to
|
|
||||||
the next. Then make sure you have the right alignment. If the field is a
|
|
||||||
fixed width field, then all the bytes are simply placed. If it's a
|
|
||||||
variable length field (attlen = -1) then it's a bit more complicated.
|
|
||||||
All variable-length datatypes share the common header structure
|
|
||||||
<type>varattrib</type>, which includes the total length of the stored
|
|
||||||
value and some flag bits. Depending on the flags, the data may be either
|
|
||||||
inline or in another table (TOAST); it might be compressed, too.
|
|
||||||
|
|
||||||
</para>
|
|
||||||
</chapter>
|
|
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.72 2004/12/30 03:13:56 tgl Exp $
|
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
|
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
|
||||||
@ -237,8 +237,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.72 2004/12/30 03:13:56 tgl Exp
|
|||||||
&geqo;
|
&geqo;
|
||||||
&indexcost;
|
&indexcost;
|
||||||
&gist;
|
&gist;
|
||||||
&filelayout;
|
&storage;
|
||||||
&page;
|
|
||||||
&bki;
|
&bki;
|
||||||
|
|
||||||
</part>
|
</part>
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_table.sgml,v 1.75 2005/01/04 00:39:53 tgl Exp $
|
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_table.sgml,v 1.76 2005/01/10 00:04:43 tgl Exp $
|
||||||
PostgreSQL documentation
|
PostgreSQL documentation
|
||||||
-->
|
-->
|
||||||
|
|
||||||
@ -153,10 +153,14 @@ where <replaceable class="PARAMETER">action</replaceable> is one of:
|
|||||||
inline, uncompressed. <literal>MAIN</literal> is for inline,
|
inline, uncompressed. <literal>MAIN</literal> is for inline,
|
||||||
compressible data. <literal>EXTERNAL</literal> is for external,
|
compressible data. <literal>EXTERNAL</literal> is for external,
|
||||||
uncompressed data, and <literal>EXTENDED</literal> is for external,
|
uncompressed data, and <literal>EXTENDED</literal> is for external,
|
||||||
compressed data. <literal>EXTENDED</literal> is the default for all
|
compressed data. <literal>EXTENDED</literal> is the default for most
|
||||||
data types that support it. Use of <literal>EXTERNAL</literal> will
|
data types that support non-<literal>PLAIN</literal> storage.
|
||||||
|
Use of <literal>EXTERNAL</literal> will
|
||||||
make substring operations on <type>text</type> and <type>bytea</type>
|
make substring operations on <type>text</type> and <type>bytea</type>
|
||||||
columns faster, at the penalty of increased storage space.
|
columns faster, at the penalty of increased storage space. Note that
|
||||||
|
<literal>SET STORAGE</> doesn't itself change anything in the table,
|
||||||
|
it just sets the strategy to be pursued during future table updates.
|
||||||
|
See <xref linkend="storage-toast"> for more information.
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
676
doc/src/sgml/storage.sgml
Normal file
676
doc/src/sgml/storage.sgml
Normal file
@ -0,0 +1,676 @@
|
|||||||
|
<!--
|
||||||
|
$PostgreSQL: pgsql/doc/src/sgml/storage.sgml,v 1.4 2005/01/10 00:04:38 tgl Exp $
|
||||||
|
-->
|
||||||
|
|
||||||
|
<chapter id="storage">
|
||||||
|
|
||||||
|
<title>Database Physical Storage</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
This chapter provides an overview of the physical storage format used by
|
||||||
|
<productname>PostgreSQL</productname> databases.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<sect1 id="storage-file-layout">
|
||||||
|
|
||||||
|
<title>Database File Layout</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
This section describes the storage format at the level of files and
|
||||||
|
directories.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
All the data needed for a database cluster is stored within the cluster's data
|
||||||
|
directory, commonly referred to as <varname>PGDATA</> (after the name of the
|
||||||
|
environment variable that can be used to define it). A common location for
|
||||||
|
<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
|
||||||
|
managed by different postmasters, can exist on the same machine.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <varname>PGDATA</> directory contains several subdirectories and control
|
||||||
|
files, as shown in <xref linkend="pgdata-contents-table">. In addition to
|
||||||
|
these required items, the cluster configuration files
|
||||||
|
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
|
||||||
|
<filename>pg_ident.conf</filename> are traditionally stored in
|
||||||
|
<varname>PGDATA</> (although beginning in
|
||||||
|
<productname>PostgreSQL</productname> 8.0 it is possible to keep them
|
||||||
|
elsewhere).
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<table tocentry="1" id="pgdata-contents-table">
|
||||||
|
<title>Contents of <varname>PGDATA</></title>
|
||||||
|
<tgroup cols="2">
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>
|
||||||
|
Item
|
||||||
|
</entry>
|
||||||
|
<entry>Description</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
|
||||||
|
<tbody>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>PG_VERSION</></entry>
|
||||||
|
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>base</></entry>
|
||||||
|
<entry>Subdirectory containing per-database subdirectories</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>global</></entry>
|
||||||
|
<entry>Subdirectory containing cluster-wide tables, such as
|
||||||
|
<structname>pg_database</></entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>pg_clog</></entry>
|
||||||
|
<entry>Subdirectory containing transaction commit status data</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>pg_subtrans</></entry>
|
||||||
|
<entry>Subdirectory containing subtransaction status data</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>pg_tblspc</></entry>
|
||||||
|
<entry>Subdirectory containing symbolic links to tablespaces</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>pg_xlog</></entry>
|
||||||
|
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>postmaster.opts</></entry>
|
||||||
|
<entry>A file recording the command-line options the postmaster was
|
||||||
|
last started with</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry><filename>postmaster.pid</></entry>
|
||||||
|
<entry>A lock file recording the current postmaster PID and shared memory
|
||||||
|
segment ID (not present after postmaster shutdown)</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
For each database in the cluster there is a subdirectory within
|
||||||
|
<varname>PGDATA</><filename>/base</>, named after the database's OID in
|
||||||
|
<structname>pg_database</>. This subdirectory is the default location
|
||||||
|
for the database's files; in particular, its system catalogs are stored
|
||||||
|
there.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Each table and index is stored in a separate file, named after the table
|
||||||
|
or index's <firstterm>filenode</> number, which can be found in
|
||||||
|
<structname>pg_class</>.<structfield>relfilenode</>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<caution>
|
||||||
|
<para>
|
||||||
|
Note that while a table's filenode often matches its OID, this is
|
||||||
|
<emphasis>not</> necessarily the case; some operations, like
|
||||||
|
<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
|
||||||
|
of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
|
||||||
|
Avoid assuming that filenode and table OID are the same.
|
||||||
|
</para>
|
||||||
|
</caution>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
When a table or index exceeds 1Gb, it is divided into gigabyte-sized
|
||||||
|
<firstterm>segments</>. The first segment's file name is the same as the
|
||||||
|
filenode; subsequent segments are named filenode.1, filenode.2, etc.
|
||||||
|
This arrangement avoids problems on platforms that have file size limitations.
|
||||||
|
The contents of tables and indexes are discussed further in
|
||||||
|
<xref linkend="storage-page-layout">.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
A table that has columns with potentially large entries will have an
|
||||||
|
associated <firstterm>TOAST</> table, which is used for out-of-line storage of
|
||||||
|
field values that are too large to keep in the table rows proper.
|
||||||
|
<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
|
||||||
|
its <acronym>TOAST</> table, if any.
|
||||||
|
See <xref linkend="storage-toast"> for more information.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Tablespaces make the scenario more complicated. Each user-defined tablespace
|
||||||
|
has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
|
||||||
|
directory, which points to the physical tablespace directory (as specified in
|
||||||
|
its <command>CREATE TABLESPACE</> command). The symbolic link is named after
|
||||||
|
the tablespace's OID. Inside the physical tablespace directory there is
|
||||||
|
a subdirectory for each database that has elements in the tablespace, named
|
||||||
|
after the database's OID. Tables within that directory follow the filenode
|
||||||
|
naming scheme. The <literal>pg_default</> tablespace is not accessed through
|
||||||
|
<filename>pg_tblspc</>, but corresponds to
|
||||||
|
<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
|
||||||
|
tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
|
||||||
|
<varname>PGDATA</><filename>/global</>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
</sect1>
|
||||||
|
|
||||||
|
<sect1 id="storage-toast">
|
||||||
|
|
||||||
|
<title>TOAST</title>
|
||||||
|
|
||||||
|
<indexterm>
|
||||||
|
<primary>TOAST</primary>
|
||||||
|
</indexterm>
|
||||||
|
<indexterm><primary>sliced bread</><see>TOAST</></indexterm>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
This section provides an overview of <acronym>TOAST</> (The
|
||||||
|
Oversized-Attribute Storage Technique).
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Since <productname>PostgreSQL</productname> uses a fixed page size (commonly
|
||||||
|
8Kb), and does not allow tuples to span multiple pages, it's not possible to
|
||||||
|
store very large field values directly. Before <productname>PostgreSQL</> 7.1
|
||||||
|
there was a hard limit of just under one page on the total amount of data that
|
||||||
|
could be put into a table row. In release 7.1 and later, this limit is
|
||||||
|
overcome by allowing large field values to be compressed and/or broken up into
|
||||||
|
multiple physical rows. This happens transparently to the user, with only
|
||||||
|
small impact on most of the backend code. The technique is affectionately
|
||||||
|
known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Only certain data types support <acronym>TOAST</> — there is no need to
|
||||||
|
impose the overhead on data types that cannot produce large field values.
|
||||||
|
To support <acronym>TOAST</>, a data type must have a variable-length
|
||||||
|
(<firstterm>varlena</>) representation, in which the first 32-bit word of any
|
||||||
|
stored value contains the total length of the value in bytes (including
|
||||||
|
itself). <acronym>TOAST</> does not constrain the rest of the representation.
|
||||||
|
All the C-level functions supporting a <acronym>TOAST</>-able data type must
|
||||||
|
be careful to handle <acronym>TOAST</>ed input values. (This is normally done
|
||||||
|
by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
|
||||||
|
value; but in some cases more efficient approaches are possible.)
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<acronym>TOAST</> usurps the high-order two bits of the varlena length word,
|
||||||
|
thereby limiting the logical size of any value of a <acronym>TOAST</>-able
|
||||||
|
data type to 1Gb (2<superscript>30</> - 1 bytes). When both bits are zero,
|
||||||
|
the value is an ordinary un-<acronym>TOAST</>ed value of the data type. One
|
||||||
|
of these bits, if set, indicates that the value has been compressed and must
|
||||||
|
be decompressed before use. The other bit, if set, indicates that the value
|
||||||
|
has been stored out-of-line. In this case the remainder of the value is
|
||||||
|
actually just a pointer, and the correct data has to be found elsewhere. When
|
||||||
|
both bits are set, the out-of-line data has been compressed too. In each case
|
||||||
|
the length in the low-order bits of the varlena word indicates the actual size
|
||||||
|
of the datum, not the size of the logical value that would be extracted by
|
||||||
|
decompression or fetching of the out-of-line data.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
If any of the columns of a table are <acronym>TOAST</>-able, the table will
|
||||||
|
have an associated <acronym>TOAST</> table, whose OID is stored in the table's
|
||||||
|
<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
|
||||||
|
<acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
|
||||||
|
described in more detail below.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The compression technique used is a fairly simple and very fast member
|
||||||
|
of the LZ family of compression techniques. See
|
||||||
|
<filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Out-of-line values are divided (after compression if used) into chunks of at
|
||||||
|
most <literal>TOAST_MAX_CHUNK_SIZE</> bytes (this value is a little less than
|
||||||
|
<literal>BLCKSZ/4</>, or about 2000 bytes by default). Each chunk is stored
|
||||||
|
as a separate row in the <acronym>TOAST</> table for the owning table. Every
|
||||||
|
<acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
|
||||||
|
identifying the particular <acronym>TOAST</>ed value),
|
||||||
|
<structfield>chunk_seq</> (a sequence number for the chunk within its value),
|
||||||
|
and <structfield>chunk_data</> (the actual data of the chunk). A unique index
|
||||||
|
on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
|
||||||
|
retrieval of the values. A pointer datum representing an out-of-line
|
||||||
|
<acronym>TOAST</>ed value therefore needs to store the OID of the
|
||||||
|
<acronym>TOAST</> table in which to look and the OID of the specific value
|
||||||
|
(its <structfield>chunk_id</>). For convenience, pointer datums also store the
|
||||||
|
logical datum size (original uncompressed data length) and actual stored size
|
||||||
|
(different if compression was applied). Allowing for the varlena header word,
|
||||||
|
the total size of a <acronym>TOAST</> pointer datum is therefore 20 bytes
|
||||||
|
regardless of the actual size of the represented value.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <acronym>TOAST</> code is triggered only
|
||||||
|
when a row value to be stored in a table is wider than <literal>BLCKSZ/4</>
|
||||||
|
bytes (normally 2Kb). The <acronym>TOAST</> code will compress and/or move
|
||||||
|
field values out-of-line until the row value is shorter than
|
||||||
|
<literal>BLCKSZ/4</> bytes or no more gains can be had. During an UPDATE
|
||||||
|
operation, values of unchanged fields are normally preserved as-is; so an
|
||||||
|
UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
|
||||||
|
none of the out-of-line values change.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <acronym>TOAST</> code recognizes four different strategies for storing
|
||||||
|
<acronym>TOAST</>-able columns:
|
||||||
|
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<literal>PLAIN</literal> prevents either compression or
|
||||||
|
out-of-line storage. This is the only possible strategy for
|
||||||
|
columns of non-<acronym>TOAST</>-able data types.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<literal>EXTENDED</literal> allows both compression and out-of-line
|
||||||
|
storage. This is the default for most <acronym>TOAST</>-able data types.
|
||||||
|
Compression will be attempted first, then out-of-line storage if
|
||||||
|
the row is still too big.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<literal>EXTERNAL</literal> allows out-of-line storage but not
|
||||||
|
compression. Use of <literal>EXTERNAL</literal> will
|
||||||
|
make substring operations on wide <type>text</type> and
|
||||||
|
<type>bytea</type> columns faster (at the penalty of increased storage
|
||||||
|
space) because these operations are optimized to fetch only the
|
||||||
|
required parts of the out-of-line value when it is not compressed.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<literal>MAIN</literal> allows compression but not out-of-line
|
||||||
|
storage. (Actually, out-of-line storage will still be performed
|
||||||
|
for such columns, but only as a last resort when there is no other
|
||||||
|
way to make the row small enough.)
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
Each <acronym>TOAST</>-able data type specifies a default strategy for columns
|
||||||
|
of that data type, but the strategy for a given table column can be altered
|
||||||
|
with <command>ALTER TABLE SET STORAGE</>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
This scheme has a number of advantages compared to a more straightforward
|
||||||
|
approach such as allowing row values to span pages. Assuming that queries are
|
||||||
|
usually qualified by comparisons against relatively small key values, most of
|
||||||
|
the work of the executor will be done using the main row entry. The big values
|
||||||
|
of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
|
||||||
|
at the time the result set is sent to the client. Thus, the main table is much
|
||||||
|
smaller and more of its rows fit in the shared buffer cache than would be the
|
||||||
|
case without any out-of-line storage. Sort sets shrink also, and sorts will
|
||||||
|
more often be done entirely in memory. A little test showed that a table
|
||||||
|
containing typical HTML pages and their URLs was stored in about half of the
|
||||||
|
raw data size including the <acronym>TOAST</> table, and that the main table
|
||||||
|
contained only about 10% of the entire data (the URLs and some small HTML
|
||||||
|
pages). There was no runtime difference compared to an un-<acronym>TOAST</>ed
|
||||||
|
comparison table, in which all the HTML pages were cut down to 7Kb to fit.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
</sect1>
|
||||||
|
|
||||||
|
<sect1 id="storage-page-layout">
|
||||||
|
|
||||||
|
<title>Database Page Layout</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
This section provides an overview of the page format used within
|
||||||
|
<productname>PostgreSQL</productname> tables and indexes.<footnote>
|
||||||
|
<para>
|
||||||
|
Actually, index access methods need not use this page format.
|
||||||
|
All the existing index methods do use this basic format,
|
||||||
|
but the data kept on index metapages usually doesn't follow
|
||||||
|
the item layout rules.
|
||||||
|
</para>
|
||||||
|
</footnote>
|
||||||
|
Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
In the following explanation, a
|
||||||
|
<firstterm>byte</firstterm>
|
||||||
|
is assumed to contain 8 bits. In addition, the term
|
||||||
|
<firstterm>item</firstterm>
|
||||||
|
refers to an individual data value that is stored on a page. In a table,
|
||||||
|
an item is a row; in an index, an item is an index entry.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Every table and index is stored as an array of <firstterm>pages</> of a
|
||||||
|
fixed size (usually 8Kb, although a different page size can be selected
|
||||||
|
when compiling the server). In a table, all the pages are logically
|
||||||
|
equivalent, so a particular item (row) can be stored in any page. In
|
||||||
|
indexes, the first page is generally reserved as a <firstterm>metapage</>
|
||||||
|
holding control information, and there may be different types of pages
|
||||||
|
within the index, depending on the index access method.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<xref linkend="page-table"> shows the overall layout of a page.
|
||||||
|
There are five parts to each page.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<table tocentry="1" id="page-table">
|
||||||
|
<title>Overall Page Layout</title>
|
||||||
|
<titleabbrev>Page Layout</titleabbrev>
|
||||||
|
<tgroup cols="2">
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>
|
||||||
|
Item
|
||||||
|
</entry>
|
||||||
|
<entry>Description</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
|
||||||
|
<tbody>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry>PageHeaderData</entry>
|
||||||
|
<entry>20 bytes long. Contains general information about the page, including
|
||||||
|
free space pointers.</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry>ItemPointerData</entry>
|
||||||
|
<entry>Array of (offset,length) pairs pointing to the actual items.
|
||||||
|
4 bytes per item.</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry>Free space</entry>
|
||||||
|
<entry>The unallocated space. New item pointers are allocated from the start
|
||||||
|
of this area, new items from the end.</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry>Items</entry>
|
||||||
|
<entry>The actual items themselves.</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
<row>
|
||||||
|
<entry>Special space</entry>
|
||||||
|
<entry>Index access method specific data. Different methods store different
|
||||||
|
data. Empty in ordinary tables.</entry>
|
||||||
|
</row>
|
||||||
|
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
The first 20 bytes of each page consists of a page header
|
||||||
|
(PageHeaderData). Its format is detailed in <xref
|
||||||
|
linkend="pageheaderdata-table">. The first two fields track the most
|
||||||
|
recent WAL entry related to this page. They are followed by three 2-byte
|
||||||
|
integer fields
|
||||||
|
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
|
||||||
|
and <structfield>pd_special</structfield>). These contain byte offsets
|
||||||
|
from the page start to the start
|
||||||
|
of unallocated space, to the end of unallocated space, and to the start of
|
||||||
|
the special space.
|
||||||
|
The last 2 bytes of the page header,
|
||||||
|
<structfield>pd_pagesize_version</structfield>, store both the page size
|
||||||
|
and a version indicator. Beginning with
|
||||||
|
<productname>PostgreSQL</productname> 8.0 the version number is 2;
|
||||||
|
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
|
||||||
|
prior releases used version number 0.
|
||||||
|
(The basic page layout and header format has not changed in these versions,
|
||||||
|
but the layout of heap row headers has.) The page size
|
||||||
|
is basically only present as a cross-check; there is no support for having
|
||||||
|
more than one page size in an installation.
|
||||||
|
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<table tocentry="1" id="pageheaderdata-table">
|
||||||
|
<title>PageHeaderData Layout</title>
|
||||||
|
<titleabbrev>PageHeaderData Layout</titleabbrev>
|
||||||
|
<tgroup cols="4">
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Field</entry>
|
||||||
|
<entry>Type</entry>
|
||||||
|
<entry>Length</entry>
|
||||||
|
<entry>Description</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry>pd_lsn</entry>
|
||||||
|
<entry>XLogRecPtr</entry>
|
||||||
|
<entry>8 bytes</entry>
|
||||||
|
<entry>LSN: next byte after last byte of xlog record for last change
|
||||||
|
to this page</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>pd_tli</entry>
|
||||||
|
<entry>TimeLineID</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>TLI of last change</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>pd_lower</entry>
|
||||||
|
<entry>LocationIndex</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>Offset to start of free space</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>pd_upper</entry>
|
||||||
|
<entry>LocationIndex</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>Offset to end of free space</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>pd_special</entry>
|
||||||
|
<entry>LocationIndex</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>Offset to start of special space</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>pd_pagesize_version</entry>
|
||||||
|
<entry>uint16</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>Page size and layout version number information</entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
All the details may be found in
|
||||||
|
<filename>src/include/storage/bufpage.h</filename>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
Following the page header are item identifiers
|
||||||
|
(<type>ItemIdData</type>), each requiring four bytes.
|
||||||
|
An item identifier contains a byte-offset to
|
||||||
|
the start of an item, its length in bytes, and a few attribute bits
|
||||||
|
which affect its interpretation.
|
||||||
|
New item identifiers are allocated
|
||||||
|
as needed from the beginning of the unallocated space.
|
||||||
|
The number of item identifiers present can be determined by looking at
|
||||||
|
<structfield>pd_lower</>, which is increased to allocate a new identifier.
|
||||||
|
Because an item
|
||||||
|
identifier is never moved until it is freed, its index may be used on a
|
||||||
|
long-term basis to reference an item, even when the item itself is moved
|
||||||
|
around on the page to compact free space. In fact, every pointer to an
|
||||||
|
item (<type>ItemPointer</type>, also known as
|
||||||
|
<type>CTID</type>) created by
|
||||||
|
<productname>PostgreSQL</productname> consists of a page number and the
|
||||||
|
index of an item identifier.
|
||||||
|
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
The items themselves are stored in space allocated backwards from the end
|
||||||
|
of unallocated space. The exact structure varies depending on what the
|
||||||
|
table is to contain. Tables and sequences both use a structure named
|
||||||
|
<type>HeapTupleHeaderData</type>, described below.
|
||||||
|
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
The final section is the <quote>special section</quote> which may
|
||||||
|
contain anything the access method wishes to store. For example,
|
||||||
|
b-tree indexes store links to the page's left and right siblings,
|
||||||
|
as well as some other data relevant to the index structure.
|
||||||
|
Ordinary tables do not use a special section at all (indicated by setting
|
||||||
|
<structfield>pd_special</> to equal the page size).
|
||||||
|
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
All table rows are structured in the same way. There is a fixed-size
|
||||||
|
header (occupying 27 bytes on most machines), followed by an optional null
|
||||||
|
bitmap, an optional object ID field, and the user data. The header is
|
||||||
|
detailed
|
||||||
|
in <xref linkend="heaptupleheaderdata-table">. The actual user data
|
||||||
|
(columns of the row) begins at the offset indicated by
|
||||||
|
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
|
||||||
|
distance for the platform.
|
||||||
|
The null bitmap is
|
||||||
|
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
|
||||||
|
<structfield>t_infomask</structfield>. If it is present it begins just after
|
||||||
|
the fixed header and occupies enough bytes to have one bit per data column
|
||||||
|
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
|
||||||
|
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
|
||||||
|
present, all columns are assumed not-null.
|
||||||
|
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
|
||||||
|
is set in <structfield>t_infomask</structfield>. If present, it appears just
|
||||||
|
before the <structfield>t_hoff</> boundary. Any padding needed to make
|
||||||
|
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
|
||||||
|
bitmap and the object ID. (This in turn ensures that the object ID is
|
||||||
|
suitably aligned.)
|
||||||
|
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<table tocentry="1" id="heaptupleheaderdata-table">
|
||||||
|
<title>HeapTupleHeaderData Layout</title>
|
||||||
|
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
|
||||||
|
<tgroup cols="4">
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Field</entry>
|
||||||
|
<entry>Type</entry>
|
||||||
|
<entry>Length</entry>
|
||||||
|
<entry>Description</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry>t_xmin</entry>
|
||||||
|
<entry>TransactionId</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>insert XID stamp</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_cmin</entry>
|
||||||
|
<entry>CommandId</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>insert CID stamp</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_xmax</entry>
|
||||||
|
<entry>TransactionId</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>delete XID stamp</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_cmax</entry>
|
||||||
|
<entry>CommandId</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>delete CID stamp (overlays with t_xvac)</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_xvac</entry>
|
||||||
|
<entry>TransactionId</entry>
|
||||||
|
<entry>4 bytes</entry>
|
||||||
|
<entry>XID for VACUUM operation moving a row version</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_ctid</entry>
|
||||||
|
<entry>ItemPointerData</entry>
|
||||||
|
<entry>6 bytes</entry>
|
||||||
|
<entry>current TID of this or newer row version</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_natts</entry>
|
||||||
|
<entry>int16</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>number of attributes</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_infomask</entry>
|
||||||
|
<entry>uint16</entry>
|
||||||
|
<entry>2 bytes</entry>
|
||||||
|
<entry>various flag bits</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>t_hoff</entry>
|
||||||
|
<entry>uint8</entry>
|
||||||
|
<entry>1 byte</entry>
|
||||||
|
<entry>offset to user data</entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
All the details may be found in
|
||||||
|
<filename>src/include/access/htup.h</filename>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
|
||||||
|
Interpreting the actual data can only be done with information obtained
|
||||||
|
from other tables, mostly <structname>pg_attribute</structname>. The
|
||||||
|
key values needed to identify field locations are
|
||||||
|
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
|
||||||
|
There is no way to directly get a
|
||||||
|
particular attribute, except when there are only fixed width fields and no
|
||||||
|
NULLs. All this trickery is wrapped up in the functions
|
||||||
|
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
|
||||||
|
and <firstterm>heap_getsysattr</firstterm>.
|
||||||
|
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
|
||||||
|
To read the data you need to examine each attribute in turn. First check
|
||||||
|
whether the field is NULL according to the null bitmap. If it is, go to
|
||||||
|
the next. Then make sure you have the right alignment. If the field is a
|
||||||
|
fixed width field, then all the bytes are simply placed. If it's a
|
||||||
|
variable length field (attlen = -1) then it's a bit more complicated.
|
||||||
|
All variable-length datatypes share the common header structure
|
||||||
|
<type>varattrib</type>, which includes the total length of the stored
|
||||||
|
value and some flag bits. Depending on the flags, the data may be either
|
||||||
|
inline or in a <acronym>TOAST</> table;
|
||||||
|
it might be compressed, too (see <xref linkend="storage-toast">).
|
||||||
|
|
||||||
|
</para>
|
||||||
|
</sect1>
|
||||||
|
|
||||||
|
</chapter>
|
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$PostgreSQL: pgsql/doc/src/sgml/xtypes.sgml,v 1.24 2005/01/08 22:13:38 tgl Exp $
|
$PostgreSQL: pgsql/doc/src/sgml/xtypes.sgml,v 1.25 2005/01/10 00:04:38 tgl Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<sect1 id="xtypes">
|
<sect1 id="xtypes">
|
||||||
@ -232,10 +232,14 @@ CREATE TYPE complex (
|
|||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
|
<indexterm>
|
||||||
|
<primary>TOAST</primary>
|
||||||
|
<secondary>and user-defined types</secondary>
|
||||||
|
</indexterm>
|
||||||
If the values of your data type might exceed a few hundred bytes in
|
If the values of your data type might exceed a few hundred bytes in
|
||||||
size (in internal form), you should make the data type
|
size (in internal form), you should make the data type
|
||||||
TOAST-able.<indexterm><primary>TOAST</primary><secondary>and
|
<acronym>TOAST</>-able (see <xref linkend="storage-toast">).
|
||||||
user-defined types</secondary></indexterm> To do this, the internal
|
To do this, the internal
|
||||||
representation must follow the standard layout for variable-length
|
representation must follow the standard layout for variable-length
|
||||||
data: the first four bytes must be an <type>int32</type> containing
|
data: the first four bytes must be an <type>int32</type> containing
|
||||||
the total length in bytes of the datum (including itself). The C
|
the total length in bytes of the datum (including itself). The C
|
||||||
|
Loading…
x
Reference in New Issue
Block a user