Heikki Linnakangas dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, i.e. it sometimes aborts transactions even
though there is no anomaly.
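
As a rough illustration (not code from this patch; all names below are
invented), the core rule from Cahill's algorithm is to watch for a
transaction that has rw-conflicts both in and out with concurrent
transactions -- the "pivot" of a dangerous structure -- and abort one of the
participants:

    #include <stdbool.h>

    typedef struct SerializableXactSketch
    {
        bool    hasConflictIn;      /* a concurrent xact read data we wrote */
        bool    hasConflictOut;     /* we read data a concurrent xact wrote */
    } SerializableXactSketch;

    /* Record an rw-conflict: "reader" read data that "writer" has written. */
    bool
    OnRWConflictSketch(SerializableXactSketch *reader,
                       SerializableXactSketch *writer)
    {
        reader->hasConflictOut = true;
        writer->hasConflictIn = true;

        /*
         * A transaction with conflicts both in and out is the pivot of a
         * dangerous structure; to be safe, one participant must be aborted.
         * The check is conservative, which is where the false positives
         * come from.
         */
        return (reader->hasConflictIn && reader->hasConflictOut) ||
               (writer->hasConflictIn && writer->hasConflictOut);
    }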

To track reads, we implement predicate locking; see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation-level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
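
A minimal sketch of that promotion policy (the threshold and all names here
are made up for illustration; the real limits live in
storage/lmgr/predicate.c):

    #include <stdbool.h>

    #define MAX_TUPLE_LOCKS_PER_PAGE 4  /* assumed threshold, illustration only */

    typedef enum PredLockLevelSketch
    {
        PRED_LOCK_TUPLE,
        PRED_LOCK_PAGE,
        PRED_LOCK_RELATION
    } PredLockLevelSketch;

    typedef struct PageLockCountSketch
    {
        int     nTupleLocks;    /* tuple-level predicate locks held on this page */
    } PageLockCountSketch;

    /* Choose the granularity for a new predicate lock covering one tuple. */
    PredLockLevelSketch
    ChoosePredicateLockLevel(PageLockCountSketch *page, bool outOfSharedMemory)
    {
        if (outOfSharedMemory)
            return PRED_LOCK_RELATION;  /* fall back to one coarse lock */
        if (page->nTupleLocks >= MAX_TUPLE_LOCKS_PER_PAGE)
            return PRED_LOCK_PAGE;      /* promote: one page lock replaces many */
        page->nTupleLocks++;
        return PRED_LOCK_TUPLE;
    }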

A predicate lock doesn't conflict with any regular locks or with other
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
other transactions.
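
That zero-overhead property can be pictured as an early-exit guard at every
entry point of the predicate lock manager; the sketch below uses a
hypothetical isolation flag rather than the real transaction-state machinery:

    #include <stdbool.h>

    /* Hypothetical stand-in for the backend's current isolation level. */
    typedef enum { ISO_READ_COMMITTED, ISO_REPEATABLE_READ, ISO_SERIALIZABLE } IsoLevelSketch;
    static IsoLevelSketch current_isolation = ISO_READ_COMMITTED;

    /* Each predicate-lock entry point bails out for non-serializable xacts. */
    void
    PredicateLockTupleSketch(void)
    {
        if (current_isolation != ISO_SERIALIZABLE)
            return;             /* other isolation levels pay nothing here */

        /* ... record the read in the shared predicate lock tables (omitted) ... */
    }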

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with them have completed. That means
we need to remember an unbounded number of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
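
A conceptual sketch of the retention rule only (the real bookkeeping in this
patch is considerably more involved, and every name below is invented):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct CommittedSxactSketch
    {
        uint32_t    finishedAt;     /* point at which this transaction committed */
        bool        summarized;     /* lock data pushed out to the SLRU summary? */
    } CommittedSxactSketch;

    /*
     * A committed transaction's predicate lock data may be discarded only
     * once every serializable transaction that overlapped it has finished,
     * i.e. once the oldest still-active serializable transaction began
     * after this one committed.
     */
    bool
    CanReleaseCommittedLocks(const CommittedSxactSketch *sxact,
                             uint32_t oldestActiveSerializableStart)
    {
        return oldestActiveSerializableStart >= sxact->finishedAt;
    }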

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
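
From an application's point of view, the practical consequence is that
SERIALIZABLE transactions can now be rolled back with a serialization
failure (SQLSTATE 40001) even without direct row contention, so callers
should be prepared to retry. A libpq sketch (the table and statement are
placeholders):

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /* Did this result fail with a serialization failure (SQLSTATE 40001)? */
    static bool
    is_serialization_failure(PGresult *res)
    {
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        return sqlstate != NULL && strcmp(sqlstate, "40001") == 0;
    }

    /* Run one statement under SERIALIZABLE, retrying when the detector aborts us. */
    void
    update_with_retry(PGconn *conn)
    {
        for (;;)
        {
            PGresult   *res;
            bool        retry;

            PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE"));
            res = PQexec(conn, "UPDATE accounts SET balance = balance - 100 WHERE id = 1");

            if (PQresultStatus(res) == PGRES_COMMAND_OK)
            {
                PQclear(res);
                res = PQexec(conn, "COMMIT");
            }

            if (PQresultStatus(res) == PGRES_COMMAND_OK)
            {
                PQclear(res);
                return;                     /* committed */
            }

            retry = is_serialization_failure(res);
            PQclear(res);
            PQclear(PQexec(conn, "ROLLBACK"));

            if (!retry)
                return;                     /* some other error; give up */
            /* else loop and retry the whole transaction from the top */
        }
    }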

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00

/*-------------------------------------------------------------------------
 *
 * relscan.h
 *    POSTGRES relation scan descriptor definitions.
 *
 *
 * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/access/relscan.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef RELSCAN_H
#define RELSCAN_H

#include "access/genam.h"
#include "access/heapam.h"

typedef struct HeapScanDescData
{
    /* scan parameters */
    Relation        rs_rd;              /* heap relation descriptor */
    Snapshot        rs_snapshot;        /* snapshot to see */
    int             rs_nkeys;           /* number of scan keys */
    ScanKey         rs_key;             /* array of scan key descriptors */
    bool            rs_bitmapscan;      /* true if this is really a bitmap scan */
    bool            rs_pageatatime;     /* verify visibility page-at-a-time? */
    bool            rs_allow_strat;     /* allow or disallow use of access strategy */
    bool            rs_allow_sync;      /* allow or disallow use of syncscan */

    /* state set up at initscan time */
    BlockNumber     rs_nblocks;         /* number of blocks to scan */
    BlockNumber     rs_startblock;      /* block # to start at */
    BufferAccessStrategy rs_strategy;   /* access strategy for reads */
    bool            rs_syncscan;        /* report location to syncscan logic? */
    bool            rs_relpredicatelocked;  /* predicate lock on relation exists */

    /* scan current state */
    bool            rs_inited;          /* false = scan not init'd yet */
    HeapTupleData   rs_ctup;            /* current tuple in scan, if any */
    BlockNumber     rs_cblock;          /* current block # in scan, if any */
    Buffer          rs_cbuf;            /* current buffer in scan, if any */
    /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
    ItemPointerData rs_mctid;           /* marked scan position, if any */

    /* these fields only used in page-at-a-time mode and for bitmap scans */
    int             rs_cindex;          /* current tuple's index in vistuples */
    int             rs_mindex;          /* marked tuple's saved index */
    int             rs_ntuples;         /* number of visible tuples on page */
    OffsetNumber    rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
/*
 * We use the same IndexScanDescData structure for both amgettuple-based
 * and amgetbitmap-based index scans. Some fields are only relevant in
 * amgettuple-based scans.
 */
typedef struct IndexScanDescData
{
    /* scan parameters */
    Relation        heapRelation;       /* heap relation descriptor, or NULL */
    Relation        indexRelation;      /* index relation descriptor */
    Snapshot        xs_snapshot;        /* snapshot to see */
    int             numberOfKeys;       /* number of index qualifier conditions */
    int             numberOfOrderBys;   /* number of ordering operators */
    ScanKey         keyData;            /* array of index qualifier descriptors */
    ScanKey         orderByData;        /* array of ordering op descriptors */

    /* signaling to index AM about killing index tuples */
    bool            kill_prior_tuple;       /* last-returned tuple is dead */
    bool            ignore_killed_tuples;   /* do not return killed entries */
    bool            xactStartedInRecovery;  /* prevents killing/seeing killed
                                             * tuples */

    /* index access method's private state */
    void           *opaque;             /* access-method-specific info */

    /* xs_ctup/xs_cbuf/xs_recheck are valid after a successful index_getnext */
    HeapTupleData   xs_ctup;            /* current heap tuple, if any */
    Buffer          xs_cbuf;            /* current heap buffer in scan, if any */
    /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
    bool            xs_recheck;         /* T means scan keys must be rechecked */

    /* state data for traversing HOT chains in index_getnext */
    bool            xs_hot_dead;        /* T if all members of HOT chain are dead */
    OffsetNumber    xs_next_hot;        /* next member of HOT chain, if any */
    TransactionId   xs_prev_xmax;       /* previous HOT chain member's XMAX, if any */
} IndexScanDescData;
/* Struct for heap-or-index scans of system tables */
typedef struct SysScanDescData
{
    Relation        heap_rel;           /* catalog being scanned */
    Relation        irel;               /* NULL if doing heap scan */
    HeapScanDesc    scan;               /* only valid in heap-scan case */
    IndexScanDesc   iscan;              /* only valid in index-scan case */
} SysScanDescData;

#endif   /* RELSCAN_H */
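
For context, a typical caller drives a HeapScanDescData through the heapam
entry points of this era roughly as follows; this is a sketch rather than
code taken from the tree, and the relation OID and lock level are
placeholders:

    #include "postgres.h"
    #include "access/heapam.h"
    #include "access/relscan.h"
    #include "utils/snapmgr.h"

    void
    scan_whole_relation(Oid relid)
    {
        Relation        rel = heap_open(relid, AccessShareLock);
        HeapScanDesc    scan = heap_beginscan(rel, GetActiveSnapshot(), 0, NULL);
        HeapTuple       tuple;

        /* heap_getnext walks rs_cblock/rs_ctup forward and returns NULL at the end */
        while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
        {
            /* ... examine tuple ... */
        }

        heap_endscan(scan);
        heap_close(rel, AccessShareLock);
    }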