368 Commits

Author SHA1 Message Date
Alvaro Herrera
9261557eb1 Revert "Use "transient" files for blind writes"
This reverts commit 54d9e8c6c19cbefa8fb42ed3442a0a5327590ed3, which
caused a failure on the buildfarm.  Not a good thing to have just before
a beta release.
2011-06-09 16:41:44 -04:00
Alvaro Herrera
54d9e8c6c1 Use "transient" files for blind writes
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away.  We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.

To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends.  We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)
2011-06-09 16:25:49 -04:00
Bruce Momjian
bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Bruce Momjian
5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas
53dbc27c62 Support unlogged tables.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Robert Haas
5f7b58fad8 Generalize concept of temporary relations to "relation persistence".
This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp.  It also removes removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.

For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previous depended on
rd_istemp.

This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.
2010-12-13 12:34:26 -05:00
Robert Haas
c2281ac87c Remove belt-and-suspenders guards against buffer pin leaks.
Forcibly releasing all leftover buffer pins should be unnecessary now
that we have a robust ResourceOwner mechanism, and it significantly
increases the cost of process shutdown.  Instead, in an assert-enabled
build, assert that no pins are held; in a non-assert-enabled build, do
nothing.
2010-11-25 00:06:46 -05:00
Magnus Hagander
9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Robert Haas
a481ff71af Remove the isLocalBuf argument from ReadBuffer_common.
Since an SMgrRelation now knows whether or not the underlying relation is
temporary, there's no point in also passing that information via an
additional argument.
2010-08-20 01:07:50 +00:00
Robert Haas
27f145a40e Further dtrace adjustments for the backend-IDs-in-relpath patch.
Update the documentation, and back out a few ill-considered changes
whose folly I failed to realize for failure to read the documentation.
2010-08-14 02:22:10 +00:00
Robert Haas
105d4c5ffe Fix assorted dtrace breakage caused by patch to include backend IDs
in temp relpaths.

Per buildfarm.
2010-08-13 22:54:17 +00:00
Robert Haas
debcec7dc3 Include the backend ID in the relpath of temporary relations.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.

Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1.  It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.

Review by Jaime Casanova and Tom Lane.
2010-08-13 20:10:54 +00:00
Bruce Momjian
65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Simon Riggs
959ac58c04 In HS, Startup process sets SIGALRM when waiting for buffer pin. If
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
2010-01-23 16:37:12 +00:00
Bruce Momjian
0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Robert Haas
cddca5ec13 Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics.
This patch also removes buffer-usage statistics from the track_counts
output, since this (or the global server statistics) is deemed to be a better
interface to this information.

Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.
2009-12-15 04:57:48 +00:00
Bruce Momjian
d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane
1b2bb33a54 Add a comment documenting the question of whether PrefetchBuffer should
try to protect an already-existing buffer from being evicted.  This was
left as an open issue when the posix_fadvise patch was committed.  I'm
not sure there's any evidence to justify more work in this area, but we
should have some record about it in the source code.
2009-04-03 18:17:43 +00:00
Tom Lane
948d6ec90f Modify the relcache to record the temp status of both local and nonlocal
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp.  Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations.  This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
2009-03-31 22:12:48 +00:00
Tom Lane
471913a6a5 More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISS
probes --- the BUFFER_READ_DONE probe provides the same information and more
besides.  Expand the LOCK_WAIT_START/DONE probe arguments so that there's
actually some chance of telling what is being waited for.  Update and
clean up the documentation.
2009-03-23 01:52:38 +00:00
Tom Lane
44023dc5f5 Add isExtend to the parameters of the buffer_read_start and buffer_read_done
DTrace probes, so that ordinary reads can be distinguished from relation
extension operations.  Move buffer_read_start probe to before the
smgrnblocks() call that's needed in the isExtend case, since really that step
should be charged as part of the time needed for the extension operation.
(This makes it slightly harder to match the read_start with the associated
read_done, since now you can't match them on blockNumber, but it should still
be possible since isExtend operations on the same relation can never be
interleaved.)  Per recent discussion.

In passing, add the page identity (forkNum/blockNum) to the parameters of the
buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking
the info.
2009-03-22 22:39:05 +00:00
Tom Lane
d287c9eff0 Restore previous ordering of BUFFER_FLUSH_START probe. I had wanted to
make it include the time for the possible smgropen() call, but that
results in a null pointer dereference :-(.

An alternative solution would be to fetch the buffer tag instead of
looking at *reln, but I'll just put it back as it was for the moment.

BTW, this indicates that DTrace probes evaluate their arguments even
when nominally inactive.  What was that about "zero cost", again?
2009-03-13 17:46:21 +00:00
Tom Lane
e04810e8c4 Code review for dtrace probes added (so far) to 8.4. Adjust placement of
some bufmgr probes, take out redundant and memory-leak-inducing path arguments
to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to
recalculate space used in sort__done, clean up formatting in places where
I'm not sure pgindent will do a nice job by itself.
2009-03-11 23:19:25 +00:00
Tom Lane
b7b8f0b609 Implement prefetching via posix_fadvise() for bitmap index scans. A new
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.

(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)

Greg Stark
2009-01-12 05:10:45 +00:00
Bruce Momjian
511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Bruce Momjian
5a90bc1fbe The attached patch contains a couple of fixes in the existing probes and
includes a few new ones.

- Fixed compilation errors on OS X for probes that use typedefs
- Fixed a number of probes to pass ForkNumber per the relation forks
patch
- The new probes are those that were taken out from the previous
submitted patch and required simple fixes. Will submit the other probes
that may require more discussion in a separate patch.

Robert Lor
2008-12-17 01:39:04 +00:00
Heikki Linnakangas
3396000684 Rethink the way FSM truncation works. Instead of WAL-logging FSM
truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
make that cleaner from modularity point of view, move the WAL-logging one
level up to RelationTruncate, and move RelationTruncate and all the
related WAL-logging to new src/backend/catalog/storage.c file. Introduce
new RelationCreateStorage and RelationDropStorage functions that are used
instead of calling smgrcreate/smgrscheduleunlink directly. Move the
pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
functions. This leaves smgr.c as a thin wrapper around md.c; all the
transactional stuff is now in storage.c.

This will make it easier to add new forks with similar truncation logic,
like the visibility map.
2008-11-19 10:34:52 +00:00
Heikki Linnakangas
7e8b0b9ab1 Change error messages to print the physical path, like
"base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1",
per Alvaro's suggestion. I didn't change the messages in the higher-level
index, heap and FSM routines, though, where the fork is implicit.
2008-11-11 13:19:16 +00:00
Heikki Linnakangas
19c8dc839b Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer
functions into one ReadBufferExtended function, that takes the strategy
and mode as argument. There's three modes, RBM_NORMAL which is the default
used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
without throwing an error. The FSM needs the new mode to recover from
corrupt pages, which could happend if we crash after extending an FSM file,
and the new page is "torn".

Add fork number to some error messages in bufmgr.c, that still lacked it.
2008-10-31 15:05:00 +00:00
Alvaro Herrera
089ae3bc9a Properly access a buffer's LSN using existing access macros instead of abusing
knowledge of page layout.

Stolen from Jonah Harris' CRC patch
2008-10-20 21:11:15 +00:00
Tom Lane
35c2a3c3cf Allow ShowBufferUsage() to report the number of reads/writes that have
occurred to temporary files.  This replaces the unused
NDirectFileRead/NDirectFileWrite counters.

Itagaki Takahiro
2008-09-17 13:15:55 +00:00
Heikki Linnakangas
3f0e808c4a Introduce the concept of relation forks. An smgr relation can now consist
of multiple forks, and each fork can be created and grown separately.

The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.

This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
2008-08-11 11:05:11 +00:00
Tom Lane
d8b04d5fac In ReadOrZeroBuffer (and related entry points), don't bother to call
PageHeaderIsValid when we zero the buffer instead of reading the page in.
The actual performance improvement is probably marginal since this function
isn't very heavily used, but a cycle saved is a cycle earned.

Zdenek Kotala
2008-08-05 15:09:04 +00:00
Alvaro Herrera
e36e6b1cab Add a few more DTrace probes to the backend.
Robert Lor
2008-08-01 13:16:09 +00:00
Tom Lane
9d035f4254 Clean up the use of some page-header-access macros: principally, use
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible.  Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
2008-07-13 20:45:47 +00:00
Alvaro Herrera
a3540b0f65 Improve our #include situation by moving pointer types away from the
corresponding struct definitions.  This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
2008-06-19 00:46:06 +00:00
Heikki Linnakangas
a213f1ee6c Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.

For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
2008-06-12 09:12:31 +00:00
Alvaro Herrera
cc87402d6e Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is
more logical that way, and also it reduces the amount of unnecessary includes
in bufpage.h, which is widely used.

Zdenek Kotala.

My previous patch to bufpage.h should also have credited him as author, but I
forgot (sorry about that).
2008-06-08 22:00:48 +00:00
Alvaro Herrera
9084399782 Put back bufmgr.h in bufpage.h -- it is needed by some macros.
Remove #include bufmgr.h from (most?) source files which already include
bufpage.h.
2008-05-12 16:06:10 +00:00
Alvaro Herrera
f8c4d7db60 Restructure some header files a bit, in particular heapam.h, by removing some
unnecessary #include lines in it.  Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.

For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.

While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
2008-05-12 00:00:54 +00:00
Bruce Momjian
9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Bruce Momjian
fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane
7a315a09dc Dept. of second thoughts: fix loop in BgBufferSync so that the exit when
bgwriter_lru_maxpages is exceeded leaves the loop variables in the
expected state.  In the original coding, we'd fail to advance
next_to_clean, causing that buffer to be probably-uselessly rechecked next
time, and also have an off-by-one idea of the number of buffers scanned.
2007-09-25 22:11:48 +00:00
Tom Lane
6f5c38dcd0 Just-in-time background writing strategy. This code avoids re-scanning
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers.  The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.

Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
2007-09-25 20:03:38 +00:00
Tom Lane
282d2a03dd HOT updates. When we update a tuple without changing any of its indexed
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version.  Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.

In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.

Pavan Deolasee, with help from a bunch of other people.
2007-09-20 17:56:33 +00:00
Tom Lane
9fc25c0511 Improve logging of checkpoints. Patch by Greg Smith, worked over
by Heikki and a little bit by me.
2007-06-30 19:12:02 +00:00
Tom Lane
867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
Tom Lane
de6a6383a7 Update obsolete comment: it's no longer the case that mdread() will allow
reads beyond EOF, except by special coercion.
2007-06-18 00:47:20 +00:00
Tom Lane
d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
Tom Lane
77947c51c0 Fix up pgstats counting of live and dead tuples to recognize that committed
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.

Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
2007-05-27 03:50:39 +00:00