PostgreSQL/doc/TODO.detail/pglog

From aoki@postgres.Berkeley.EDU Sun Jun 22 19:31:06 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id TAA19488
	for <maillist@candle.pha.pa.us>; Sun, 22 Jun 1997 19:31:03 -0400 (EDT)
Received: from faerie.CS.Berkeley.EDU (faerie.CS.Berkeley.EDU [128.32.37.53]) by renoir.op.net ($ Revision: 1.12 $) with SMTP id TAA18795 for <maillist@candle.pha.pa.us>; Sun, 22 Jun 1997 19:18:06 -0400 (EDT)
Received: from localhost.Berkeley.EDU (localhost.Berkeley.EDU [127.0.0.1]) by faerie.CS.Berkeley.EDU (8.6.10/8.6.3) with SMTP id QAA07816 for maillist@candle.pha.pa.us; Sun, 22 Jun 1997 16:16:44 -0700
Message-Id: <199706222316.QAA07816@faerie.CS.Berkeley.EDU>
X-Authentication-Warning: faerie.CS.Berkeley.EDU: Host localhost.Berkeley.EDU didn't use HELO protocol
From: aoki@CS.Berkeley.EDU (Paul M. Aoki)
To: Bruce Momjian <maillist@candle.pha.pa.us>
Subject: Re: PostgreSQL psort() function performance
Reply-To: aoki@CS.Berkeley.EDU (Paul M. Aoki)
In-reply-to: Your message of Sun, 22 Jun 1997 09:45:31 -0400 (EDT)
	     <199706221345.JAA11476@candle.pha.pa.us>
Date: Sun, 22 Jun 97 16:16:43 -0700
Sender: aoki@postgres.Berkeley.EDU
X-Mts: smtp
Status: OR

the mariposa distribution (http://mariposa.cs.berkeley.edu/) contains
some hacks to nodeSort.c and psort.c that
	- make psort read directly from the executor node below it
	(instead of an input relation)
	- makes the Sort node read directly from the last set of psort runs
	(instead of an output relation)
speeds things up quite a bit.  kind of ruins psort for other purposes,
though (which is why nbtsort.c exists).

i'd merge these in first and see how far that gets you.
--
  Paul M. Aoki         | University of California at Berkeley
  aoki@CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
                       | Berkeley, CA 94720-1776

From owner-pgsql-hackers@hub.org Mon Nov  3 09:31:04 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01676
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:31:02 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id JAA07345 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:13:20 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA13315; Mon, 3 Nov 1997 08:50:26 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:48:07 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA11722 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:48:02 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA11539 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 08:47:34 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id UAA19066; Mon, 3 Nov 1997 20:48:04 +0700 (KRS)
Message-ID: <345DD614.345BF651@sable.krasnoyarsk.su>
Date: Mon, 03 Nov 1997 20:48:04 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Marc Howard Zuckman <marc@fallon.classyad.com>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <Pine.LNX.3.95.971103090709.21917A-100000@fallon.classyad.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Marc Howard Zuckman wrote:
>
> On Mon, 3 Nov 1997, Bruce Momjian wrote:
>
> > With fsync off, I just did an insert of 1000 integers into a table
> > containing a single int4 column and no indexes, and it completed in 2.3
> > seconds.  This is on the new source tree..  That is 434 inserts/second.
> > Pretty major performance, or 2.3 ms/insert.  This is on a idle PP200
> > with UltraSCSI drives.
> >
> > With fsync on, the time goes to 51 seconds.  Wow, big difference.
>
> If better alternative error recovery methods were available, perhaps
> a facility to replay an interval transactions log from a prior dump,
> it would be reasonable to run the backend without fsync and
> take advantage of the performance gains.

???

>
> I don't know the answer, but I suspect that the commercial databases
> don't "fsync" the way pgsql does.

Could someone try 1000 int4 inserts using postgres and
some commercial database (on the same machine) ?

Vadim


From owner-pgsql-hackers@hub.org Mon Nov  3 09:01:02 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01183
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:01:00 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id IAA06632 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 08:51:58 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA05964; Mon, 3 Nov 1997 08:39:39 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:37:32 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA04729 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:37:26 -0500 (EST)
Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA04614 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 08:37:16 -0500 (EST)
Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id JAA22108; Mon, 3 Nov 1997 09:11:09 -0500
Date: Mon, 3 Nov 1997 09:11:09 -0500 (EST)
From: Marc Howard Zuckman <marc@fallon.classyad.com>
To: Bruce Momjian <maillist@candle.pha.pa.us>
cc: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
In-Reply-To: <199711030513.AAA23474@candle.pha.pa.us>
Message-ID: <Pine.LNX.3.95.971103090709.21917A-100000@fallon.classyad.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

On Mon, 3 Nov 1997, Bruce Momjian wrote:

> >
> > Removed...
> >
> > Also, ItemPointerData t_chain (6 bytes) removed from HeapTupleHeader.
> > CommandId is uint32 now (up to the 2^32 - 1 commands per transaction).
> > DOUBLEALIGN(Sizeof(HeapTupleHeader)) is 40 bytes now.
> >
> > 1000 inserts (into table with single int4 column, 1 insert per transaction)
> > takes 70 - 80 sec now (12.5 - 14 transactions/sec).
> > This is hardware/OS limitation:
> >
> >     fd = open ("t", O_RDWR);
> >     for (i = 1; i <= 1000; i++)
> >     {
> >         lseek(fd, 0, SEEK_END);
> >         write(fd, buf, 56);
> >         fsync(fd);
> >     }
> >     close (fd);
> >
> > takes 33 - 39 sec and so it's not possible to be faster
> > having 2 fsync-s per transaction.
> >
> > The same test on 6.2.1: 92 - 107 sec
>
> With fsync off, I just did an insert of 1000 integers into a table
> containing a single int4 column and no indexes, and it completed in 2.3
> seconds.  This is on the new source tree..  That is 434 inserts/second.
> Pretty major performance, or 2.3 ms/insert.  This is on a idle PP200
> with UltraSCSI drives.
>
> With fsync on, the time goes to 51 seconds.  Wow, big difference.

If better alternative error recovery methods were available, perhaps
a facility to replay an interval transactions log from a prior dump,
it would be reasonable to run the backend without fsync and
take advantage of the performance gains.

I don't know the answer, but I suspect that the commercial databases
don't "fsync" the way pgsql does.

Marc Zuckman
marc@fallon.classyad.com

_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
_     Visit The Home and Condo MarketPlace		      _
_          http://www.ClassyAd.com			      _
_							      _
_  FREE basic property listings/advertisements and searches.  _
_							      _
_  Try our premium, yet inexpensive services for a real	      _
_   selling or buying edge!				      _
_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_


From owner-pgsql-hackers@hub.org Mon Nov  3 11:31:03 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA04080
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 11:31:00 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA13680 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 11:21:30 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA07566; Mon, 3 Nov 1997 11:04:52 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:02:59 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA07372 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:02:52 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA07196 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 11:02:22 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id KAA02525;
	Mon, 3 Nov 1997 10:42:03 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711031542.KAA02525@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Mon, 3 Nov 1997 10:42:03 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <345DD614.345BF651@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 3, 97 08:48:04 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> > I don't know the answer, but I suspect that the commercial databases
> > don't "fsync" the way pgsql does.
>
> Could someone try 1000 int4 inserts using postgres and
> some commercial database (on the same machine) ?

I have been thinking about this since seeing the performance change
with/without fsync.

Commerical databases usually do a log write every 5 or 15 minutes, and
guarantee the logs will contain everything up to this time interval.

Couldn't we have some such mechanism?  Usually they have raw space, so
they can control when the data is hitting the disk.   Using a file
system, some of it may be getting to the disk without our knowing it.

What exactly is a scenario where lack of doing explicit fsync's will
cause data corruption, rather than just lost data from the past few
minutes?

I think Vadim has gotten fsync's down to fsync'ing the modified data
page, and pg_log.

Let's suppose we did not fsync.  There could be cases where pg_log was
fsync'ed by the OS, and some of the modified data pages are fyncs'ed by
the OS, but not others.  This would leave us with a partial transaction.

However, let's suppose we prevent pg_log from being fsync'ed somehow.
Then, because we have a no-overwrite database, we could keep control of
this, and write of some data pages, but not others would not cause us
problems because the pg_log would show all such transactions, which had
not had all their modified data pages fsync'ed, as non-committed.

Perhaps we can even set a flag in pg_log every five minutes to indicate
whether all buffers for the page have been flushed?  That way we could
not have to worry about preventing flushing of pg_log.

Comments?

--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Mon Nov  3 12:00:42 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA04456
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 12:00:40 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA26054; Mon, 3 Nov 1997 11:46:49 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:46:33 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA25932 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:46:30 -0500 (EST)
Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id LAA25750 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 11:45:53 -0500 (EST)
Received: by orion.SAPserv.Hamburg.dsh.de
	(Linux Smail3.1.29.1 #1)}
	id m0xSPfE-000BGZC; Mon, 3 Nov 97 17:47 MET
Message-Id: <m0xSPfE-000BGZC@orion.SAPserv.Hamburg.dsh.de>
From: wieck@sapserv.debis.de
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: maillist@candle.pha.pa.us (Bruce Momjian)
Date: Mon, 3 Nov 1997 17:47:43 +0100 (MET)
Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
        hackers@postgreSQL.org
Reply-To: wieck@sapserv.debis.de (Jan Wieck)
In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> > > I don't know the answer, but I suspect that the commercial databases
> > > don't "fsync" the way pgsql does.
> >
> > Could someone try 1000 int4 inserts using postgres and
> > some commercial database (on the same machine) ?
>
> I have been thinking about this since seeing the performance change
> with/without fsync.
>
> Commerical databases usually do a log write every 5 or 15 minutes, and
> guarantee the logs will contain everything up to this time interval.
>

    Without  fsync  PostgreSQL  would  only  loose data if the OS
    crashes between the last write operation of a backend and the
    next regular update sync. This is seldom but if it happens it
    really hurts.

    A database can omit fsync on data files (e.g. tablespaces) if
    it  writes  a  redo  log. With that redo log, a backup can be
    restored and than all transactions since the backup redone.

    PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
    after  the fsync of pg_log could corrupt the database without
    a chance to recover.

    Isn't it time to get an (optional) redo log. I don't  exactly
    know all the places where our datafiles can get modified, but
    I hope this is only done  in  the  heap  access  methods  and
    vacuum.  So these are the places from where the redo log data
    comes from (plus transaction commit/rollback).


Until later, Jan

--
#define OPINIONS "they are all mine - not those of debis or daimler-benz"

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================== wieck@sapserv.debis.de (Jan Wieck) #


From owner-pgsql-hackers@hub.org Mon Nov  3 14:01:06 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA06775
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 14:01:04 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA22235 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 13:43:15 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA11482; Mon, 3 Nov 1997 13:32:40 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 13:32:02 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA11204 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 13:31:58 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA11119 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 13:31:44 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id MAA05464;
	Mon, 3 Nov 1997 12:59:01 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711031759.MAA05464@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: wieck@sapserv.debis.de
Date: Mon, 3 Nov 1997 12:59:01 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
        hackers@postgreSQL.org
In-Reply-To: <m0xSPfE-000BGZC@orion.SAPserv.Hamburg.dsh.de> from "wieck@sapserv.debis.de" at Nov 3, 97 05:47:43 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> >
> > > > I don't know the answer, but I suspect that the commercial databases
> > > > don't "fsync" the way pgsql does.
> > >
> > > Could someone try 1000 int4 inserts using postgres and
> > > some commercial database (on the same machine) ?
> >
> > I have been thinking about this since seeing the performance change
> > with/without fsync.
> >
> > Commerical databases usually do a log write every 5 or 15 minutes, and
> > guarantee the logs will contain everything up to this time interval.
> >
>
>     Without  fsync  PostgreSQL  would  only  loose data if the OS
>     crashes between the last write operation of a backend and the
>     next regular update sync. This is seldom but if it happens it
>     really hurts.
>
>     A database can omit fsync on data files (e.g. tablespaces) if
>     it  writes  a  redo  log. With that redo log, a backup can be
>     restored and than all transactions since the backup redone.
>
>     PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
>     after  the fsync of pg_log could corrupt the database without
>     a chance to recover.
>
>     Isn't it time to get an (optional) redo log. I don't  exactly
>     know all the places where our datafiles can get modified, but
>     I hope this is only done  in  the  heap  access  methods  and
>     vacuum.  So these are the places from where the redo log data
>     comes from (plus transaction commit/rollback).
>

Yes, but because we are a non-over-write database, I don't see why we
can't just do this without a redo log.

Every five minutes, we fsync() all dirty pages, mark all completed
transactions as fsync'ed in pg_log, and fsync() pg_log.

On postmaster startup, any transaction marked as completed, but not
marked as fsync'ed gets marked as aborted.

Of course, all vacuum operations would have to be fsync'ed.

Comments?

--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Mon Nov  3 16:46:01 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id QAA10292
	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 16:45:59 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id QAA02040 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 16:42:40 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id QAA17422; Mon, 3 Nov 1997 16:34:28 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 16:34:10 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id QAA17210 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 16:34:06 -0500 (EST)
Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id QAA16690 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 16:33:27 -0500 (EST)
Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id RAA32498; Mon, 3 Nov 1997 17:33:42 -0500
Date: Mon, 3 Nov 1997 17:33:42 -0500 (EST)
From: Marc Howard Zuckman <marc@fallon.classyad.com>
To: Bruce Momjian <maillist@candle.pha.pa.us>
cc: wieck@sapserv.debis.de, vadim@sable.krasnoyarsk.su, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
In-Reply-To: <199711031759.MAA05464@candle.pha.pa.us>
Message-ID: <Pine.LNX.3.95.971103173129.32055B-100000@fallon.classyad.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

On Mon, 3 Nov 1997, Bruce Momjian wrote:

> >
> > >
> > > > > I don't know the answer, but I suspect that the commercial databases
> > > > > don't "fsync" the way pgsql does.
> > > >
> > > > Could someone try 1000 int4 inserts using postgres and
> > > > some commercial database (on the same machine) ?
> > >
> > > I have been thinking about this since seeing the performance change
> > > with/without fsync.
> > >
> > > Commerical databases usually do a log write every 5 or 15 minutes, and
> > > guarantee the logs will contain everything up to this time interval.
> > >
> >
> >     Without  fsync  PostgreSQL  would  only  loose data if the OS
> >     crashes between the last write operation of a backend and the
> >     next regular update sync. This is seldom but if it happens it
> >     really hurts.
> >
> >     A database can omit fsync on data files (e.g. tablespaces) if
> >     it  writes  a  redo  log. With that redo log, a backup can be
> >     restored and than all transactions since the backup redone.
> >
> >     PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
> >     after  the fsync of pg_log could corrupt the database without
> >     a chance to recover.
> >
> >     Isn't it time to get an (optional) redo log. I don't  exactly
> >     know all the places where our datafiles can get modified, but
> >     I hope this is only done  in  the  heap  access  methods  and
> >     vacuum.  So these are the places from where the redo log data
> >     comes from (plus transaction commit/rollback).
> >
>
> Yes, but because we are a non-over-write database, I don't see why we
> can't just do this without a redo log.

Because if the hard drive is the reason for the failure (instead of
power out, OS bites dust, etc), the database won't be of much help.

The redo log should be on a device different than the database.

Marc Zuckman
marc@fallon.classyad.com

_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
_     Visit The Home and Condo MarketPlace		      _
_          http://www.ClassyAd.com			      _
_							      _
_  FREE basic property listings/advertisements and searches.  _
_							      _
_  Try our premium, yet inexpensive services for a real	      _
_   selling or buying edge!				      _
_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_


From maillist Mon Nov  3 22:59:31 1997
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id WAA16264;
	Mon, 3 Nov 1997 22:59:31 -0500 (EST)
From: Bruce Momjian <maillist>
Message-Id: <199711040359.WAA16264@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: maillist@candle.pha.pa.us (Bruce Momjian)
Date: Mon, 3 Nov 1997 22:59:30 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
        hackers@postgreSQL.org
In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Status: OR

>
> > > I don't know the answer, but I suspect that the commercial databases
> > > don't "fsync" the way pgsql does.
> >
> > Could someone try 1000 int4 inserts using postgres and
> > some commercial database (on the same machine) ?
>
> I have been thinking about this since seeing the performance change
> with/without fsync.
>
> Commercial databases usually do a log write every 5 or 15 minutes, and
> guarantee the logs will contain everything up to this time interval.
>
> Couldn't we have some such mechanism?  Usually they have raw space, so
> they can control when the data is hitting the disk.   Using a file
> system, some of it may be getting to the disk without our knowing it.
>
> What exactly is a scenario where lack of doing explicit fsync's will
> cause data corruption, rather than just lost data from the past few
> minutes?
>
> I think Vadim has gotten fsync's down to fsync'ing the modified data
> page, and pg_log.
>
> Let's suppose we did not fsync.  There could be cases where pg_log was
> fsync'ed by the OS, and some of the modified data pages are fyncs'ed by
> the OS, but not others.  This would leave us with a partial transaction.
>
> However, let's suppose we prevent pg_log from being fsync'ed somehow.
> Then, because we have a no-overwrite database, we could keep control of
> this, and write of some data pages, but not others would not cause us
> problems because the pg_log would show all such transactions, which had
> not had all their modified data pages fsync'ed, as non-committed.
>
> Perhaps we can even set a flag in pg_log every five minutes to indicate
> whether all buffers for the page have been flushed?  That way we could
> not have to worry about preventing flushing of pg_log.
>
> Comments?

OK, here is a more formal description of what I am suggesting.  It will
give us commercial dbms reliability with no-fsync performance.
Commercial dbms's usually only give restore up to 5 minutes before the
crash, and this is what I am suggesting.  If we can do this, we can
remove the no-fsync option.

First, lets suppose there exists a shared queue that is visible to all
backends and the postmaster that allows transaction id's to be added to
the queue.  We also add a bit to the pg_log record called 'been_synced'
that is initially false.

OK, once a backend starts a transaction, it puts a transaction id in
pg_log.  Once the transaction is finished, it is marked as committed.
At the same time, we now put the transaction id on the shared queue.

Every five minutes, or as defined by the administrator, the postmaster
does a sync() call.  On my OS, anyone use can call sync, and I think
this is typical.  update/pagecleaner does this every 30 seconds anyway,
so it is no big deal for the postmaster to call it every 5 minutes.  The
nice thing about this is that the OS does the syncing of all the dirty
pages for us.  (An alarm() call can set up this 5 minute timing.)

The postmaster then locks the shared transaction id queue, makes a copy
of the entries in the queue, clears the queue, and unlocks the queue.
It does this so no one else modifies the queue while it is being
cleared.

The postmaster then goes through pg_log, and marks each transaction as
'been_synced'.

The postmaster also performs this on shutdown.

On postmaster startup, all transactions are checked and any transaction
that is marked as committed but not 'been_synced' is marked as not
committed.  In this way, we prevent non-synced or partially synced
transactions from being used.

Of course, vacuum would have to do normal fsyncs because it is removing
the transaction log.

We need the shared transaction id queue because there is no way to find
the newly committed transactions since the last sync.  A transaction
can last for hours.

--
Bruce Momjian
maillist@candle.pha.pa.us

From owner-pgsql-hackers@hub.org Tue Nov  4 02:13:08 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id CAA17544
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 02:13:06 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id CAA14126; Tue, 4 Nov 1997 02:07:55 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 02:04:59 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id CAA12859 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 02:04:51 -0500 (EST)
Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id CAA12625 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 02:04:12 -0500 (EST)
Received: by orion.SAPserv.Hamburg.dsh.de
	(Linux Smail3.1.29.1 #1)}
	id m0xSd44-000BFQC; Tue, 4 Nov 97 08:06 MET
Message-Id: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de>
From: wieck@sapserv.debis.de
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: maillist@candle.pha.pa.us (Bruce Momjian)
Date: Tue, 4 Nov 1997 08:06:16 +0100 (MET)
Cc: maillist@candle.pha.pa.us, vadim@sable.krasnoyarsk.su,
        marc@fallon.classyad.com, hackers@postgreSQL.org
Reply-To: wieck@sapserv.debis.de (Jan Wieck)
In-Reply-To: <199711040359.WAA16264@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:59:30 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> OK, here is a more formal description of what I am suggesting.  It will
> give us commercial dbms reliability with no-fsync performance.
> Commercial dbms's usually only give restore up to 5 minutes before the
> crash, and this is what I am suggesting.  If we can do this, we can
> remove the no-fsync option.

    I'm not 100% sure but as far as I know Oracle, it can recover
    up to the last committed transaction using  the  online  redo
    logs.   And even if commercial dbms's aren't able to do that,
    it should be our target.

> [description about transaction queue]

    This all  depends  on  the  fact  that  PostgreSQL  is  a  no
    overwrite  dbms.  Otherwise the space of deleted tuples might
    get overwritten by later transactions and the information  is
    finally lost.

    Another  issue:  All we up to now though of are crashes where
    the database files are still usable after restart.  But  take
    the  simple  case  of a write error. A new bad block or track
    will get remapped (in some way) but the data in it  is  lost.
    So  we  end  up  with  one or more totally corrupted database
    files. And I don't trust mirrored  disks  farer  than  I  can
    throw  them.  A  bug  in the OS or a memory failure (many new
    PeeCee boards don't support parity and even with parity a two
    bit  failure  is still the wrong data but with a valid parity
    bit) can also currupt the data.

    I still prefer redo logs. They should reside on  a  different
    disk  and the possibility of loosing the database files along
    with the redo log is very small.


Until later, Jan

--
#define OPINIONS "they are all mine - not those of debis or daimler-benz"

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================== wieck@sapserv.debis.de (Jan Wieck) #


From vadim@sable.krasnoyarsk.su Tue Nov  4 04:12:50 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18487
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:12:48 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id EAA03152 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:12:06 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su>
Date: Tue, 04 Nov 1997 16:14:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711040359.WAA16264@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> OK, here is a more formal description of what I am suggesting.  It will
> give us commercial dbms reliability with no-fsync performance.
> Commercial dbms's usually only give restore up to 5 minutes before the
                                      ^^^^^^^^^^^^^^^^^^^^^^^
I'm sure that this is not true!
If on-line redo_file is damaged then you have
single ability: restore your last backup.
In all other cases database will be recovered up to the last
committed transaction automatically!

DBMS-s using WAL have to fsync only redo file on commit
(and they do it!), non-overwriting systems have to
fsync data files and transaction log.

We could optimize fsync-s for multi-user environment: do not
fsync when we're ensured that our changes flushed to disk by
another backend.

> crash, and this is what I am suggesting.  If we can do this, we can
> remove the no-fsync option.
>
...
>
> On postmaster startup, all transactions are checked and any transaction
> that is marked as committed but not 'been_synced' is marked as not
> committed.  In this way, we prevent non-synced or partially synced
> transactions from being used.

And what should users (ensured that their transaction are
committed) do in this case ?

Vadim

From owner-pgsql-hackers@hub.org Tue Nov  4 04:21:04 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18536
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:21:01 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id EAA15551; Tue, 4 Nov 1997 04:15:15 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 04:14:23 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id EAA14464 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 04:14:18 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id EAA13437 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 04:13:33 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS)
Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su>
Date: Tue, 04 Nov 1997 16:14:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711040359.WAA16264@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Bruce Momjian wrote:
>
> OK, here is a more formal description of what I am suggesting.  It will
> give us commercial dbms reliability with no-fsync performance.
> Commercial dbms's usually only give restore up to 5 minutes before the
                                      ^^^^^^^^^^^^^^^^^^^^^^^
I'm sure that this is not true!
If on-line redo_file is damaged then you have
single ability: restore your last backup.
In all other cases database will be recovered up to the last
committed transaction automatically!

DBMS-s using WAL have to fsync only redo file on commit
(and they do it!), non-overwriting systems have to
fsync data files and transaction log.

We could optimize fsync-s for multi-user environment: do not
fsync when we're ensured that our changes flushed to disk by
another backend.

> crash, and this is what I am suggesting.  If we can do this, we can
> remove the no-fsync option.
>
...
>
> On postmaster startup, all transactions are checked and any transaction
> that is marked as committed but not 'been_synced' is marked as not
> committed.  In this way, we prevent non-synced or partially synced
> transactions from being used.

And what should users (ensured that their transaction are
committed) do in this case ?

Vadim


From owner-pgsql-hackers@hub.org Tue Nov  4 06:43:00 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id GAA19743
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 06:42:57 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id GAA10352; Tue, 4 Nov 1997 06:36:08 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 06:35:42 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id GAA10158 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 06:35:37 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id GAA10096 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 06:35:27 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id GAA19665;
	Tue, 4 Nov 1997 06:35:10 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711041135.GAA19665@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: wieck@sapserv.debis.de
Date: Tue, 4 Nov 1997 06:35:10 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
In-Reply-To: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de> from "wieck@sapserv.debis.de" at Nov 4, 97 08:06:16 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> > OK, here is a more formal description of what I am suggesting.  It will
> > give us commercial dbms reliability with no-fsync performance.
> > Commercial dbms's usually only give restore up to 5 minutes before the
> > crash, and this is what I am suggesting.  If we can do this, we can
> > remove the no-fsync option.
>
>     I'm not 100% sure but as far as I know Oracle, it can recover
>     up to the last committed transaction using  the  online  redo
>     logs.   And even if commercial dbms's aren't able to do that,
>     it should be our target.
>
> > [description about transaction queue]
>
>     This all  depends  on  the  fact  that  PostgreSQL  is  a  no
>     overwrite  dbms.  Otherwise the space of deleted tuples might
>     get overwritten by later transactions and the information  is
>     finally lost.
>
>     Another  issue:  All we up to now though of are crashes where
>     the database files are still usable after restart.  But  take
>     the  simple  case  of a write error. A new bad block or track
>     will get remapped (in some way) but the data in it  is  lost.
>     So  we  end  up  with  one or more totally corrupted database
>     files. And I don't trust mirrored  disks  farer  than  I  can
>     throw  them.  A  bug  in the OS or a memory failure (many new
>     PeeCee boards don't support parity and even with parity a two
>     bit  failure  is still the wrong data but with a valid parity
>     bit) can also currupt the data.
>
>     I still prefer redo logs. They should reside on  a  different
>     disk  and the possibility of loosing the database files along
>     with the redo log is very small.

I have been thinking about re-do logs, and I think it is a good idea.
It would not be hard to have the queries spit out to a separate file
configurable by the user.

--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Tue Nov  4 07:31:01 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22051
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:30:59 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07444 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:25:14 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08818; Tue, 4 Nov 1997 07:03:30 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:02:44 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08418 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:29 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08331 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 07:02:07 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id GAA21484;
	Tue, 4 Nov 1997 06:50:24 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711041150.GAA21484@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Tue, 4 Nov 1997 06:50:24 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> Bruce Momjian wrote:
> >
> > OK, here is a more formal description of what I am suggesting.  It will
> > give us commercial dbms reliability with no-fsync performance.
> > Commercial dbms's usually only give restore up to 5 minutes before the
>                                       ^^^^^^^^^^^^^^^^^^^^^^^
> I'm sure that this is not true!

You may be right.  This five minute figure is when you restore from your
previous backup, then restore from the log file.

Can't we do something like sync every 5 seconds, rather than after every
transaction?  It just seems like such overkill.

Actually, I found a problem with my description.  Because pg_log is not
fsync'ed, after a crash, pages with new transactions could have been
flushed to disk, but not the pg_log table that contains the transaction
ids.  The problem is that the new backend could assign a transaction id
that is already in use.

We could set a flag upon successful shutdown, and if it is not set on
reboot, either do a vacuum to find the max transaction id, and
invalidate all them not in pg_log as synced, or increase the next
transaction id to some huge number and invalidate all them in between.


> If on-line redo_file is damaged then you have
> single ability: restore your last backup.
> In all other cases database will be recovered up to the last
> committed transaction automatically!
>
> DBMS-s using WAL have to fsync only redo file on commit
> (and they do it!), non-overwriting systems have to
> fsync data files and transaction log.
>
> We could optimize fsync-s for multi-user environment: do not
> fsync when we're ensured that our changes flushed to disk by
> another backend.
>
> > crash, and this is what I am suggesting.  If we can do this, we can
> > remove the no-fsync option.
> >
> ...
> >
> > On postmaster startup, all transactions are checked and any transaction
> > that is marked as committed but not 'been_synced' is marked as not
> > committed.  In this way, we prevent non-synced or partially synced
> > transactions from being used.
>
> And what should users (ensured that their transaction are
> committed) do in this case ?
>
> Vadim
>
>


--
Bruce Momjian
maillist@candle.pha.pa.us


From wieck@sapserv.debis.de Tue Nov  4 07:01:00 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA21697
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:00:58 -0500 (EST)
From: wieck@sapserv.debis.de
Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by renoir.op.net (o1/$ Revision: 1.14 $) with SMTP id GAA06401 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 06:48:25 -0500 (EST)
Received: by orion.SAPserv.Hamburg.dsh.de
	(Linux Smail3.1.29.1 #1)}
	id m0xShVQ-000BGZC; Tue, 4 Nov 97 12:50 MET
Message-Id: <m0xShVQ-000BGZC@orion.SAPserv.Hamburg.dsh.de>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: maillist@candle.pha.pa.us (Bruce Momjian)
Date: Tue, 4 Nov 1997 12:50:45 +0100 (MET)
Cc: wieck@sapserv.debis.de, hackers@postgreSQL.org
Reply-To: wieck@sapserv.debis.de (Jan Wieck)
In-Reply-To: <199711041135.GAA19665@candle.pha.pa.us> from "Bruce Momjian" at Nov 4, 97 06:35:10 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Status: OR


Bruce Momjian wrote:
> I have been thinking about re-do logs, and I think it is a good idea.
> It would not be hard to have the queries spit out to a separate file
> configurable by the user.

    This  way the recovery process will be very complicated. When
    multiple  backends  run  concurrently,  there  are   multiple
    transactions  active  at  the  same time. And what tuples are
    affected by an update e.g. depends much on the timing.

    I had something different in mind. The redo log contains  the
    information  from  the  executor (e.g. the transactionId, the
    tupleId and the new tuple values when calling  ExecReplace())
    and  the information which transactions commit and which not.
    When recovering,  those  operations  where  the  transactions
    committed are again passed to the executors functions that do
    the real updates with the values from the logfile.


Until later, Jan

--
#define OPINIONS "they are all mine - not those of debis or daimler-benz"

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================== wieck@sapserv.debis.de (Jan Wieck) #


From owner-pgsql-hackers@hub.org Tue Nov  4 07:30:59 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22048
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:30:57 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07189 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:18:02 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08856; Tue, 4 Nov 1997 07:03:37 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:03:03 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08487 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:46 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08192 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 07:02:02 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id HAA21653;
	Tue, 4 Nov 1997 07:00:20 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711041200.HAA21653@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!u
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Tue, 4 Nov 1997 07:00:19 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> Bruce Momjian wrote:
> >
> > OK, here is a more formal description of what I am suggesting.  It will
> > give us commercial dbms reliability with no-fsync performance.
> > Commercial dbms's usually only give restore up to 5 minutes before the
>                                       ^^^^^^^^^^^^^^^^^^^^^^^
> I'm sure that this is not true!
> If on-line redo_file is damaged then you have
> single ability: restore your last backup.
> In all other cases database will be recovered up to the last
> committed transaction automatically!

I doubt commercial dbms's sync to disk after every transaction.  They
pick a time, maybe five seconds, and see all dirty pages get flushed by
then.

What they do do is to make certain that you are restored to a consistent
state, perhaps 15 seconds ago.

--
Bruce Momjian
maillist@candle.pha.pa.us


From vadim@sable.krasnoyarsk.su Tue Nov  4 07:32:45 1997
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22066
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:32:35 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20889; Tue, 4 Nov 1997 19:35:12 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <345F1680.60E33853@sable.krasnoyarsk.su>
Date: Tue, 04 Nov 1997 19:35:12 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Jan Wieck <wieck@sapserv.debis.de>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, marc@fallon.classyad.com,
        hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

wieck@sapserv.debis.de wrote:
>
>     I still prefer redo logs. They should reside on  a  different
>     disk  and the possibility of loosing the database files along
>     with the redo log is very small.

Agreed. This way we could don't fsync data files and
fsync both redo and pg_log. This is much faster.

Vadim

From vadim@sable.krasnoyarsk.su Tue Nov  4 08:00:58 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id IAA22371
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 08:00:56 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA08540 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:57:25 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20935; Tue, 4 Nov 1997 19:59:46 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <345F1C42.1F1A7590@sable.krasnoyarsk.su>
Date: Tue, 04 Nov 1997 19:59:46 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Jan Wieck <wieck@sapserv.debis.de>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <m0xShVQ-000BGZC@orion.SAPserv.Hamburg.dsh.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

wieck@sapserv.debis.de wrote:
>
> Bruce Momjian wrote:
> > I have been thinking about re-do logs, and I think it is a good idea.
> > It would not be hard to have the queries spit out to a separate file
> > configurable by the user.
>
>     This  way the recovery process will be very complicated. When
>     multiple  backends  run  concurrently,  there  are   multiple
>     transactions  active  at  the  same time. And what tuples are
>     affected by an update e.g. depends much on the timing.
>
>     I had something different in mind. The redo log contains  the
>     information  from  the  executor (e.g. the transactionId, the
>     tupleId and the new tuple values when calling  ExecReplace())
>     and  the information which transactions commit and which not.
>     When recovering,  those  operations  where  the  transactions
>     committed are again passed to the executors functions that do
>     the real updates with the values from the logfile.

It seems that this is what Oracle does, but Sybase writes queries
(with transaction ids, of 'course, and before execution) and
begin, commit/abort events <-- this is better for non-overwriting
system (shorter redo file), but, agreed, recovering is more complicated.

Vadim

From owner-pgsql-hackers@hub.org Tue Nov  4 22:35:45 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA05060
	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 22:35:43 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA26725 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 22:35:10 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id WAA27875; Tue, 4 Nov 1997 22:23:14 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 22:20:55 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id WAA24162 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 22:20:50 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id WAA22727 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 22:20:18 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id WAA04674;
	Tue, 4 Nov 1997 22:17:52 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711050317.WAA04674@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Tue, 4 Nov 1997 22:17:52 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <345F14E7.28CC1042@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 07:28:23 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> Bruce Momjian wrote:
> >
> > >
> > > Bruce Momjian wrote:
> > > >
> > > > OK, here is a more formal description of what I am suggesting.  It will
> > > > give us commercial dbms reliability with no-fsync performance.
> > > > Commercial dbms's usually only give restore up to 5 minutes before the
> > >                                       ^^^^^^^^^^^^^^^^^^^^^^^
> > > I'm sure that this is not true!
> >
> > You may be right.  This five minute figure is when you restore from your
> > previous backup, then restore from the log file.
> >
> > Can't we do something like sync every 5 seconds, rather than after every
> > transaction?  It just seems like such overkill.
>
> Isn't -F and sync in crontab the same ?

OK, let me again try to marshall some (any?) support for my suggestion.

Informix version 5/7 has three levels of logging:  unbuffered
logging(our normal fsync mode), buffered logging, and no logging(our no
fsync mode).

We don't have buffered logging.  Buffered logging guarantees you get put
back to a consistent state after an os/server crash, usually to within
30/90 seconds.  You do not have any partial transactions lying around,
but you do have some transactions that you thought were done, but are
not.

This is faster then non-buffered logging, but not as fast as no logging.
Guess what mode everyone uses?  The one we don't have, buffered logging!

Unbuffered logging performance is terrible.  Non-buffered logging is
used to load huge chunks of data during off-hours.

The problem we have is that we fsync every transaction, which causes a
9-times slowdown in performance on single-integer inserts.

That is a pretty heavy cost.  But the alternative we give people is
no-fsync mode, where we don't sync anything, and in a crash, you could
come back with partially committed data in your database, if pg_log was
sync'ed by the database, and only some of the data pages were sync'ed,
so if any data was changing within 30 seconds of the crash, you have to
restore your previous backup.

We really need a middle solution, that gives better data integrity, for
a smaller price.

>
> >
> > Actually, I found a problem with my description.  Because pg_log is not
> > fsync'ed, after a crash, pages with new transactions could have been
> > flushed to disk, but not the pg_log table that contains the transaction
> > ids.  The problem is that the new backend could assign a transaction id
> > that is already in use.
>
> Impossible. Backend flushes pg_variable after fetching nex 32 xids.

My suggestion is that we don't need to flush pg_variable or pg_log that
much.  My suggestion would speed up the test you do with 100 inserts
inside a single transaction vs. 100 separate inserts.

> >
> > We could set a flag upon successful shutdown, and if it is not set on
> > reboot, either do a vacuum to find the max transaction id, and
> > invalidate all them not in pg_log as synced, or increase the next
> > transaction id to some huge number and invalidate all them in between.
> >

I have a fix for the problem stated above, and it doesn't require a
vacuum.

We decide to fsync pg_variable and pg_log every 10,000 transactions or
oids.  Then if the database is brought up, and it was not brought down
cleanly, you increment oid and transaction_id by 10,000, because you
know you couldn't have gotten more than that.  All intermediate
transactions that are not marked committed/synced are marked aborted.

---------------------------------------------------------------------------

The problem we have with the current system is that we sync by action,
not by time interval.  If you are doing tons of inserts or updates, it
is syncing after every one.  What people really want is something that
will sync not after every action, but after every minute or five
minutes, so when the system is busy, the syncing every minutes is just a
small amount, and when the system is idle, no one cares if is syncs, and
no one has to wait for the sync to complete.


--
Bruce Momjian
maillist@candle.pha.pa.us


From matti@algonet.se Wed Nov  5 11:02:33 1997
Received: from smtp.algonet.se (tomei.algonet.se [194.213.74.114])
	by candle.pha.pa.us (8.8.5/8.8.5) with SMTP id LAA02099
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:02:28 -0500 (EST)
Received: (qmail 6685 invoked from network); 5 Nov 1997 17:01:06 +0100
Received: from du228-6.ppp.algonet.se (HELO gamma) (root@195.100.6.228)
  by tomei.algonet.se with SMTP; 5 Nov 1997 17:01:06 +0100
Sender: root
Message-ID: <34609871.27EED9D@algonet.se>
Date: Wed, 05 Nov 1997 17:02:16 +0100
From: Mattias Kregert <matti@algonet.se>
Organization: Algonet ISP
X-Mailer: Mozilla 3.0Gold (X11; I; Linux 2.0.29 i586)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711050317.WAA04674@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> We don't have buffered logging.  Buffered logging guarantees you get put
> back to a consistent state after an os/server crash, usually to within
> 30/90 seconds.  You do not have any partial transactions lying around,
> but you do have some transactions that you thought were done, but are
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> not.
  ^^^^
>
> This is faster then non-buffered logging, but not as fast as no logging.
> Guess what mode everyone uses?  The one we don't have, buffered logging!

Ouch! I would *not* like to use "buffered logging".
What's the point in having the wrong data in the database and not
knowing what updates, inserts or deletes to do to get the correct data?

That's irrecoverable loss of data. Not what *I* want. Do *you* want it?


> We really need a middle solution, that gives better data integrity, for
> a smaller price.

What I would like to have is this:

If a backend tells the frontend that a transaction has completed,
then that transaction should absolutely not get lost in case of a crash.

What is needed is a log of changes since the last backup. This
log would preferrably reside on a remote machine or at least
another disk. Then, if the power goes in the middle of a disk write,
the disk explodes and the computer goes up in flames, you can
install Postgresql on a new machine, restore the last backup and
re-run the change log.


> The problem we have with the current system is that we sync by action,
> not by time interval.  If you are doing tons of inserts or updates, it
> is syncing after every one.  What people really want is something that
> will sync not after every action, but after every minute or five
> minutes, so when the system is busy, the syncing every minutes is just a
> small amount, and when the system is idle, no one cares if is syncs, and
> no one has to wait for the sync to complete.

Yes, but this would only be the first step on the way to better
crash-recovery.

/* m */

From vadim@sable.krasnoyarsk.su Wed Nov  5 12:20:23 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05156
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:13 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24123 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:44:49 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id XAA23062; Wed, 5 Nov 1997 23:48:52 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <3460A374.41C67EA6@sable.krasnoyarsk.su>
Date: Wed, 05 Nov 1997 23:48:52 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711050317.WAA04674@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> OK, let me again try to marshall some (any?) support for my suggestion.
>
> Informix version 5/7 has three levels of logging:  unbuffered
> logging(our normal fsync mode), buffered logging, and no logging(our no
> fsync mode).
>
> We don't have buffered logging.  Buffered logging guarantees you get put
> back to a consistent state after an os/server crash, usually to within
> 30/90 seconds.  You do not have any partial transactions lying around,
> but you do have some transactions that you thought were done, but are
> not.
>
> This is faster then non-buffered logging, but not as fast as no logging.
> Guess what mode everyone uses?  The one we don't have, buffered logging!
>
> Unbuffered logging performance is terrible.  Non-buffered logging is
> used to load huge chunks of data during off-hours.
>
> The problem we have is that we fsync every transaction, which causes a
> 9-times slowdown in performance on single-integer inserts.
>
> That is a pretty heavy cost.  But the alternative we give people is
> no-fsync mode, where we don't sync anything, and in a crash, you could
> come back with partially committed data in your database, if pg_log was
> sync'ed by the database, and only some of the data pages were sync'ed,
> so if any data was changing within 30 seconds of the crash, you have to
> restore your previous backup.
>
> We really need a middle solution, that gives better data integrity, for
> a smaller price.

There is no fsync synchronization currently.
How could we be ensured that all modified data pages are flushed
when we decided to flush pg_log ?
If backend doesn't fsync data pages & pg_log at the commit time
then when he must flush them (data first) ?

This is what Oracle does:

it uses dedicated DBWR process for writing/flushing modified
data pages and LGWR process for writing/flushing redo log
(redo log is transaction log also). LGWR always flushes log pages
when committing, but durty data pages can be flushed _after_ transaction
commit when DBWR decides that it's time to do it (ala checkpoints interval).

Using redo log we could implement buffered logging quite easy.
We can even don't use dedicated processes (but flush redo before pg_log),
though having LGWR could simplify things.

Without redo log or without some fsync synchronization we can't implement
buffered logging. BTW, shared system cache could help with
fsync synchonization, but, imho, redo is better (and faster for
un-buffered logging too).

> > > Actually, I found a problem with my description.  Because pg_log is not
> > > fsync'ed, after a crash, pages with new transactions could have been
> > > flushed to disk, but not the pg_log table that contains the transaction
> > > ids.  The problem is that the new backend could assign a transaction id
> > > that is already in use.
> >
> > Impossible. Backend flushes pg_variable after fetching nex 32 xids.
>
> My suggestion is that we don't need to flush pg_variable or pg_log that
> much.  My suggestion would speed up the test you do with 100 inserts
> inside a single transaction vs. 100 separate inserts.
>
> > >
> > > We could set a flag upon successful shutdown, and if it is not set on
> > > reboot, either do a vacuum to find the max transaction id, and
> > > invalidate all them not in pg_log as synced, or increase the next
> > > transaction id to some huge number and invalidate all them in between.
> > >
>
> I have a fix for the problem stated above, and it doesn't require a
> vacuum.
>
> We decide to fsync pg_variable and pg_log every 10,000 transactions or
> oids.  Then if the database is brought up, and it was not brought down
> cleanly, you increment oid and transaction_id by 10,000, because you
> know you couldn't have gotten more than that.  All intermediate
> transactions that are not marked committed/synced are marked aborted.

This is what I suppose to do by placing next available oid/xid
in shmem: this allows pre-fetch much more than 32 ids at once
without losing them when session closed.

> The problem we have with the current system is that we sync by action,
> not by time interval.  If you are doing tons of inserts or updates, it
> is syncing after every one.  What people really want is something that
> will sync not after every action, but after every minute or five
> minutes, so when the system is busy, the syncing every minutes is just a
> small amount, and when the system is idle, no one cares if is syncs, and
> no one has to wait for the sync to complete.

When I'm really doing tons of inserts/updates/deletes I use
BEGIN/END. But it doesn't work for multi-user environment, of 'course.
As for about what people really want, I remember that recently someone
said in user list that if one want to have 10-20 inserts/sec then he
should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
when using one session, 32 inserts/sec with two sessions inserting
in two different tables and only 20 inserts/sec with two sessions
inserting in the same table. Imho, this difference between 20 and 32
is more important thing to fix, and these results are not so bad
in comparison with others.

(BTW, we shouldn't forget about using raw devices to speed up things).

Vadim

From vadim@sable.krasnoyarsk.su Wed Nov  5 12:20:08 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05150
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:07 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24889 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:59:27 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su>
Date: Thu, 06 Nov 1997 00:03:19 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Mattias Kregert <matti@algonet.se>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Mattias Kregert wrote:
>
> Bruce Momjian wrote:
> >
> > We don't have buffered logging.  Buffered logging guarantees you get put
> > back to a consistent state after an os/server crash, usually to within
> > 30/90 seconds.  You do not have any partial transactions lying around,
> > but you do have some transactions that you thought were done, but are
>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > not.
>   ^^^^
> >
> > This is faster then non-buffered logging, but not as fast as no logging.
> > Guess what mode everyone uses?  The one we don't have, buffered logging!
>
> Ouch! I would *not* like to use "buffered logging".

And I.

> What's the point in having the wrong data in the database and not
> knowing what updates, inserts or deletes to do to get the correct data?
>
> That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
>
> > We really need a middle solution, that gives better data integrity, for
> > a smaller price.
>
> What I would like to have is this:
>
> If a backend tells the frontend that a transaction has completed,
> then that transaction should absolutely not get lost in case of a crash.

Agreed.

>
> What is needed is a log of changes since the last backup. This
> log would preferrably reside on a remote machine or at least
> another disk. Then, if the power goes in the middle of a disk write,
> the disk explodes and the computer goes up in flames, you can
> install Postgresql on a new machine, restore the last backup and
> re-run the change log.

Yes. And as I already said - this will speed up things because
redo flushing is faster than flushing NNN tables which can be
unflushed for some interval.

Vadim

From owner-pgsql-hackers@hub.org Wed Nov  5 12:20:39 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05168
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:38 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id MAA25888 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:14:14 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id MAA02259; Wed, 5 Nov 1997 12:02:33 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 12:00:21 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id MAA00750 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 12:00:10 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA00598 for <pgsql-hackers@postgreSQL.org>; Wed, 5 Nov 1997 11:59:45 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS)
Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su>
Date: Thu, 06 Nov 1997 00:03:19 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Mattias Kregert <matti@algonet.se>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Mattias Kregert wrote:
>
> Bruce Momjian wrote:
> >
> > We don't have buffered logging.  Buffered logging guarantees you get put
> > back to a consistent state after an os/server crash, usually to within
> > 30/90 seconds.  You do not have any partial transactions lying around,
> > but you do have some transactions that you thought were done, but are
>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > not.
>   ^^^^
> >
> > This is faster then non-buffered logging, but not as fast as no logging.
> > Guess what mode everyone uses?  The one we don't have, buffered logging!
>
> Ouch! I would *not* like to use "buffered logging".

And I.

> What's the point in having the wrong data in the database and not
> knowing what updates, inserts or deletes to do to get the correct data?
>
> That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
>
> > We really need a middle solution, that gives better data integrity, for
> > a smaller price.
>
> What I would like to have is this:
>
> If a backend tells the frontend that a transaction has completed,
> then that transaction should absolutely not get lost in case of a crash.

Agreed.

>
> What is needed is a log of changes since the last backup. This
> log would preferrably reside on a remote machine or at least
> another disk. Then, if the power goes in the middle of a disk write,
> the disk explodes and the computer goes up in flames, you can
> install Postgresql on a new machine, restore the last backup and
> re-run the change log.

Yes. And as I already said - this will speed up things because
redo flushing is faster than flushing NNN tables which can be
unflushed for some interval.

Vadim


From owner-pgsql-hackers@hub.org Wed Nov  5 14:01:02 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07017
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:00:59 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA01759 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 13:52:36 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA03611; Wed, 5 Nov 1997 13:29:43 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 13:27:48 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA03291 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 13:27:41 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA02823 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 13:26:20 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id NAA05863;
	Wed, 5 Nov 1997 13:16:09 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711051816.NAA05863@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Wed, 5 Nov 1997 13:16:09 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> There is no fsync synchronization currently.
> How could we be ensured that all modified data pages are flushed
> when we decided to flush pg_log ?
> If backend doesn't fsync data pages & pg_log at the commit time
> then when he must flush them (data first) ?

My idea was to have the backend do a 'sync' that causes the OS to sync
all dirty pages, then mark all committed transactions on pg_log as
'synced'.  Then sync pg_log.  That way, there is a clear system where we
know everything is flushed to disk, and we mark the transactions as
synced.

The only time that synced flag is used, is when the database starts up,
and it sees that the previous shutdown was not clean.

What am I missing here?

>
> This is what Oracle does:
>
> it uses dedicated DBWR process for writing/flushing modified
> data pages and LGWR process for writing/flushing redo log
> (redo log is transaction log also). LGWR always flushes log pages
> when committing, but durty data pages can be flushed _after_ transaction
> commit when DBWR decides that it's time to do it (ala checkpoints interval).
>
> Using redo log we could implement buffered logging quite easy.
> We can even don't use dedicated processes (but flush redo before pg_log),
> though having LGWR could simplify things.
>
> Without redo log or without some fsync synchronization we can't implement
> buffered logging. BTW, shared system cache could help with
> fsync synchonization, but, imho, redo is better (and faster for
> un-buffered logging too).
>

I suggested my solution because it is clean, does flushing in one
central location(postmaster), and does quick restores.

> > > > Actually, I found a problem with my description.  Because pg_log is not
> > > > fsync'ed, after a crash, pages with new transactions could have been
> > > > flushed to disk, but not the pg_log table that contains the transaction
> > > > ids.  The problem is that the new backend could assign a transaction id
> > > > that is already in use.
> > >
> > > Impossible. Backend flushes pg_variable after fetching nex 32 xids.
> >
> > My suggestion is that we don't need to flush pg_variable or pg_log that
> > much.  My suggestion would speed up the test you do with 100 inserts
> > inside a single transaction vs. 100 separate inserts.
> >
> > > >
> > > > We could set a flag upon successful shutdown, and if it is not set on
> > > > reboot, either do a vacuum to find the max transaction id, and
> > > > invalidate all them not in pg_log as synced, or increase the next
> > > > transaction id to some huge number and invalidate all them in between.
> > > >
> >
> > I have a fix for the problem stated above, and it doesn't require a
> > vacuum.
> >
> > We decide to fsync pg_variable and pg_log every 10,000 transactions or
> > oids.  Then if the database is brought up, and it was not brought down
> > cleanly, you increment oid and transaction_id by 10,000, because you
> > know you couldn't have gotten more than that.  All intermediate
> > transactions that are not marked committed/synced are marked aborted.
>
> This is what I suppose to do by placing next available oid/xid
> in shmem: this allows pre-fetch much more than 32 ids at once
> without losing them when session closed.
>
> > The problem we have with the current system is that we sync by action,
> > not by time interval.  If you are doing tons of inserts or updates, it
> > is syncing after every one.  What people really want is something that
> > will sync not after every action, but after every minute or five
> > minutes, so when the system is busy, the syncing every minutes is just a
> > small amount, and when the system is idle, no one cares if is syncs, and
> > no one has to wait for the sync to complete.
>
> When I'm really doing tons of inserts/updates/deletes I use
> BEGIN/END. But it doesn't work for multi-user environment, of 'course.
> As for about what people really want, I remember that recently someone
> said in user list that if one want to have 10-20 inserts/sec then he
> should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
> when using one session, 32 inserts/sec with two sessions inserting
> in two different tables and only 20 inserts/sec with two sessions
> inserting in the same table. Imho, this difference between 20 and 32
> is more important thing to fix, and these results are not so bad
> in comparison with others.
>
> (BTW, we shouldn't forget about using raw devices to speed up things).
>
> Vadim
>


--
Bruce Momjian
maillist@candle.pha.pa.us


From james@blarg.net Wed Nov  5 13:26:46 1997
Received: from animal.blarg.net (mail@animal.blarg.net [206.114.144.1])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id NAA06130
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 13:26:26 -0500 (EST)
Received: from animal.blarg.net (james@animal.blarg.net [206.114.144.1])
          by animal.blarg.net (8.8.5/8.8.4) with SMTP
          id KAA09775; Wed, 5 Nov 1997 10:26:10 -0800
Date: Wed, 5 Nov 1997 10:26:10 -0800 (PST)
From: "James A. Hillyerd" <james@blarg.net>
To: Bruce Momjian <maillist@candle.pha.pa.us>
cc: Mattias Kregert <matti@algonet.se>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
In-Reply-To: <199711051615.LAA02260@candle.pha.pa.us>
Message-ID: <Pine.LNX.3.95.971105102332.6252E-100000@animal.blarg.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Wed, 5 Nov 1997, Bruce Momjian wrote:
>
> The strange thing I am hearing is that the people who use PostgreSQL are
> more worried about data recovery from a crash than million-dollar
> companies that use commercial databases.
>

If I may throw in my 2 cents, I'd prefer to see that database in a
consistent state, with the data being up to date as of 1 minute or
less before the crash.  I'd rather have higher performance than up to the
second data.

-james

[  James A. Hillyerd (JH2162) - james@blarg.net - Web Developer  ]
[  http://www.blarg.net/~james/   http://www.hyperglyphics.com/  ]
[ 1024/B11C3751 CA 1C B3 A9 07 2F 57 C9  91 F4 73 F2 19 A4 C5 88 ]


From vadim@sable.krasnoyarsk.su Wed Nov  5 14:24:03 1997
Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07830
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:24:02 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id OAA02778 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:13:45 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id CAA23376; Thu, 6 Nov 1997 02:17:51 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <3460C65E.446B9B3D@sable.krasnoyarsk.su>
Date: Thu, 06 Nov 1997 02:17:50 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711051816.NAA05863@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> > There is no fsync synchronization currently.
> > How could we be ensured that all modified data pages are flushed
> > when we decided to flush pg_log ?
> > If backend doesn't fsync data pages & pg_log at the commit time
> > then when he must flush them (data first) ?
>
> My idea was to have the backend do a 'sync' that causes the OS to sync
> all dirty pages, then mark all committed transactions on pg_log as
> 'synced'.  Then sync pg_log.  That way, there is a clear system where we
> know everything is flushed to disk, and we mark the transactions as
> synced.
>
> The only time that synced flag is used, is when the database starts up,
> and it sees that the previous shutdown was not clean.
>
> What am I missing here?

Ok, I see. But we can avoid 'synced' flag: we can make (just before
sync-ing data pages) in-memory copies of "on-line" durty pg_log pages
to being written/fsynced and perform write/fsync from these copies
without stopping new commits in "on-line" page(s) (nothing must go
to disk from "on-line" log pages).

Vadim

From owner-pgsql-hackers@hub.org Wed Nov  5 14:32:25 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA08101
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:32:21 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id OAA22970; Wed, 5 Nov 1997 14:26:47 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:24:59 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA22344 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:24:56 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA22319 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 14:24:38 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id OAA07661;
	Wed, 5 Nov 1997 14:22:46 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711051922.OAA07661@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Wed, 5 Nov 1997 14:22:45 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Just a clarification.  When I say the postmaster issues a sync, I mean
sync(2), not fsync(2).

The sync flushes all dirty pages on all file systems.  Ordinary users
can issue this, and update usually does this every 30 seconds anyway.

By using this, we let the kernel figure out which buffers are dirty.  We
don't have to figure this out in the postmaster.

Then we update the pg_log table to mark those transactions as synced.
On recovery from a crash, we mark the committed transactions as
uncommitted if they do not have the synced flag.

--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Wed Nov  5 15:11:07 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA08751
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 15:10:59 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA01986; Wed, 5 Nov 1997 15:01:24 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:59:32 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA01414 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:59:28 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA01403 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 14:59:14 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id OAA08283;
	Wed, 5 Nov 1997 14:53:55 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711051953.OAA08283@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Wed, 5 Nov 1997 14:53:54 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <3460C65E.446B9B3D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 6, 97 02:17:50 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> > The only time that synced flag is used, is when the database starts up,
> > and it sees that the previous shutdown was not clean.
> >
> > What am I missing here?
>
> Ok, I see. But we can avoid 'synced' flag: we can make (just before
> sync-ing data pages) in-memory copies of "on-line" durty pg_log pages
> to being written/fsynced and perform write/fsync from these copies
> without stopping new commits in "on-line" page(s) (nothing must go
> to disk from "on-line" log pages).

[Working late tonight?]

OK, now I am lost.  We need the sync'ed flag so when we start the
postmaster, and we see the database we not shut down properly, we use
the flag to clear the commit flag from comitted transactions that were
not sync'ed by the postmaster.

In my opinion, we don't need any extra copies of pg_log, we can set
those sync'ed flags while others are making changes, because before we
did our sync, we gathered a list of committed transaction ids from the
shared transaction id queue that I mentioned a while ago.

We need this queue so we can find the newly-committed transactions that
do not have a sync flag.  Another way we could do this would be to scan
pg_log before we sync, getting all the committed transaction ids without
sync flags.  No lock is needed on the table.  If we miss some new ones,
we will get them next time we scan.  The problem I saw is that there is
no way to see when to stop scanning the pg_log table for such
transactions, so I thought each backend would have to put its newly
committed transactions in a separate place.  Maybe I am wrong.

This syncing method just seems so natural since we have pg_log.  That is
why I keep bringing it up until people tell me I am stupid.

This transaction commit/sync stuff is complicated, and takes a while to
hash out in a group.

---------------------------------------------------------------------------

I just re-read your description, and I see what you are saying.  My idea
has pg_log commit flag be real commit flags while the system is running,
but on reboot after failure, we remove the commit flags on non-synced
stuff before we start up.

Your idea is to make pg_log commit flags only appear in in-memory copies
of pg_log, and write the commit flags to disk only after the sync is
done.

Either way will work.  The question is, "Which is easier?"  The OS is
going to sync pg_log on its own.  We would almost need a second copy of
pg_log, one copy to be used on postmaster startup, and a second to be
used by running backends, and the postmaster would make a copy of the
running backend pg_log, sync the disks, and copy it to the boot copy.

I don't see how the backend is going to figure out which pg_log pages
were modified and need to be sent to the boot copy of pg_log.

Now that I am thinking, here is a good idea.  Instead of a fancy
transaction queue, what if we just have the backend record the lowest
numbered transaction they commit in a shared memory area.  If the
current transaction id they commit is greater than the minimum, then
change nothing.  That way, the backend could copy all pg_log pages
containing that minimum pg_log transaction id up to the most recent
pg_log page, do the sync, and copy just those to the boot copy of
pg_log.

This eliminates the transaction id queue.

The nice thing about the sync-flag in pg_log is that there is no copying
by the backend.  But we would have to spin through the file to set those
sync bits.  Your method just copies whole pages to the boot copy.

---------------------------------------------------------------------------

I don't want to force this idea on anyone, or annoy anyone.  I just
think it needs to be considered.  The concepts are unusual, so once
people get the full idea, if they don't like it, we can trash it.  I
still think it holds promise.

--
Bruce Momjian
maillist@candle.pha.pa.us


From hotz@jpl.nasa.gov Wed Nov  5 15:30:18 1997
Received: from hotzsun.jpl.nasa.gov (hotzsun.jpl.nasa.gov [137.79.51.138])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA09500
	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 15:30:16 -0500 (EST)
Received: from [137.79.51.141] (hotzmac [137.79.51.141]) by hotzsun.jpl.nasa.gov (8.7.6/8.7.3) with SMTP id MAA10100; Wed, 5 Nov 1997 12:29:58 -0800 (PST)
X-Sender: hotzmail@hotzsun.jpl.nasa.gov
Message-Id: <v02140b02b0868294bbc1@[137.79.51.141]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 5 Nov 1997 12:29:58 -0800
To: Bruce Momjian <maillist@candle.pha.pa.us>,
        matti@algonet.se (Mattias Kregert)
From: hotz@jpl.nasa.gov (Henry B. Hotz)
Subject: Re: [HACKERS] My $.02, was: PERFORMANCE and Good Bye, Time Travel!
Cc: pgsql-hackers@postgreSQL.org
Status: OR

At 11:15 AM 11/5/97, Bruce Momjian wrote:
>The strange thing I am hearing is that the people who use PostgreSQL are
>more worried about data recovery from a crash than million-dollar
>companies that use commercial databases.
>
>I don't get it.

I would run PG to make sure that committed transactions were really written
to disk because that seems "correct" and I don't have the kind of
performance requirements that would push me to do otherwise.

That said, I can see a need for varying performance/crash-immunity
tradeoffs, and at least *one* option in between "correct" and "unprotected"
operation would seem desirable.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu


From owner-pgsql-hackers@hub.org Thu Nov  6 15:51:23 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA04634
	for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 15:51:08 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA24783; Thu, 6 Nov 1997 15:36:47 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 15:36:07 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA24514 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 15:36:02 -0500 (EST)
Received: from guevara.bildbasen.kiruna.se (guevara.bildbasen.kiruna.se [193.45.225.110]) by hub.org (8.8.5/8.7.5) with SMTP id PAA24319 for <pgsql-hackers@postgreSQL.org>; Thu, 6 Nov 1997 15:35:32 -0500 (EST)
Received: (qmail 9764 invoked by uid 129); 6 Nov 1997 20:34:35 -0000
Date: 6 Nov 1997 20:34:35 -0000
Message-ID: <19971106203435.9763.qmail@guevara.bildbasen.kiruna.se>
From: Goran Thyni <goran@bildbasen.se>
To: pgsql-hackers@postgreSQL.org
In-reply-to: <34619E9E.622F563@algonet.se> (message from Mattias Kregert on
	Thu, 06 Nov 1997 11:40:30 +0100)
Subject: [HACKERS] Re: Performance vs. Crash Recovery
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR


I am getting quiet bored by this discussion,
if someone has a strong opinion about how this
should be done go ahead and make a test implementation
then we have something to discuss.

In the mean time, if you want best possible data protection
mount you database disk sync:ed. This is safer than any scheme
we could come up with.
D*mned slow too, so everybody should be happy. :-)

And I see no point implement a periodic sync in postmaster.
All unices has cron, why not just use that.
Or even a stupid 1-liner (ba)sh-script like:

while true; do sleep 20; sync; done

	best regards,
--
---------------------------------------------
G<EFBFBD>ran Thyni, sysadm, JMS Bildbasen, Kiruna


From vadim@sable.krasnoyarsk.su Thu Nov  6 23:31:41 1997
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA04723
	for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:31:21 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25438; Fri, 7 Nov 1997 11:36:25 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <34629AC9.15FB7483@sable.krasnoyarsk.su>
Date: Fri, 07 Nov 1997 11:36:25 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711051953.OAA08283@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> > > The only time that synced flag is used, is when the database starts up,
> > > and it sees that the previous shutdown was not clean.
> > >
> > > What am I missing here?
> >
> > Ok, I see. But we can avoid 'synced' flag: we can make (just before
> > sync-ing data pages) in-memory copies of "on-line" durty pg_log pages
> > to being written/fsynced and perform write/fsync from these copies
> > without stopping new commits in "on-line" page(s) (nothing must go
> > to disk from "on-line" log pages).
>
> [Working late tonight?]

[Yes]

> I just re-read your description, and I see what you are saying.  My idea
> has pg_log commit flag be real commit flags while the system is running,
> but on reboot after failure, we remove the commit flags on non-synced
> stuff before we start up.
>
> Your idea is to make pg_log commit flags only appear in in-memory copies
> of pg_log, and write the commit flags to disk only after the sync is
> done.
>
> Either way will work.  The question is, "Which is easier?"  The OS is
> going to sync pg_log on its own.  We would almost need a second copy of
> pg_log, one copy to be used on postmaster startup, and a second to be
> used by running backends, and the postmaster would make a copy of the
> running backend pg_log, sync the disks, and copy it to the boot copy.
>
> I don't see how the backend is going to figure out which pg_log pages
> were modified and need to be sent to the boot copy of pg_log.
>
> Now that I am thinking, here is a good idea.  Instead of a fancy
> transaction queue, what if we just have the backend record the lowest
> numbered transaction they commit in a shared memory area.  If the
> current transaction id they commit is greater than the minimum, then
> change nothing.  That way, the backend could copy all pg_log pages
> containing that minimum pg_log transaction id up to the most recent
> pg_log page, do the sync, and copy just those to the boot copy of
> pg_log.
>
> This eliminates the transaction id queue.
>
> The nice thing about the sync-flag in pg_log is that there is no copying
> by the backend.  But we would have to spin through the file to set those
> sync bits.  Your method just copies whole pages to the boot copy.

   In my plans to re-design transaction system I supposed to keep in shmem
two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
to access them is not good idea. Also, we could use spinlock instead of
lock manager to synchronize access to these pages (as I see in spin.c
spinlock-s could be shared, but only exclusive ones are used) - spinlocks
are faster.
   These two last pg_log pages are "online" ones. Race condition: when one or
both of online pages becomes non-online ones, i.e. pg_log has to be expanded
when writing commit/abort of "big" xid. This is how we could handle this
in "buffered" logging (delayed fsync) mode:

   When backend want to write commit/abort status he acquires exclusive
OnLineLogLock. If xid belongs to online pages then backend writes status
and releases spin. If xid is less than least xid on 1st online page then
backend releases spin and does exactly the same what he does in normal mode:
flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
update xid status, WriteBuffer, release write lock, flush pg_log.
If xid is greater than max xid on 2nd online page then the simplest way is
just do sync(); sync() (two times), flush 1st or both online pages,
read new page(s) into online pages space, update xid status,
release OnLineLogLock spin. We could try other ways but pg_log expanding
is rare case (32K xids in one pg_log page)...
   All what postmaster will have to do is:
1. Get shared OnLineLogLock.
2. Copy 2 x 8K data to private place.
3. Release spinlock.
4. sync(); sync(); (two times!)
5. Flush online pages.

We could use -F DELAY_TIME to turn fsync delayed mode ON.

And, btw, having two bits for xact status we have only one unused
status value (0x11) currently - I would like to use this for
nested xactions and savepoints...

> I don't want to force this idea on anyone, or annoy anyone.  I just
> think it needs to be considered.  The concepts are unusual, so once
> people get the full idea, if they don't like it, we can trash it.  I
> still think it holds promise.

Agreed.

Vadim

From owner-pgsql-hackers@hub.org Fri Nov  7 01:32:49 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07651
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:32:47 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23328 for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:46:08 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA19565; Thu, 6 Nov 1997 23:38:55 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 23:36:53 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA18911 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 23:36:44 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA18779 for <pgsql-hackers@postgreSQL.org>; Thu, 6 Nov 1997 23:36:02 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25448; Fri, 7 Nov 1997 11:40:29 +0700 (KRS)
Message-ID: <34629BBD.59E2B600@sable.krasnoyarsk.su>
Date: Fri, 07 Nov 1997 11:40:29 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: Mattias Kregert <matti@algonet.se>, pgsql-hackers@postgreSQL.org
Subject: Re: Sync:ing data and log (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
References: <199711061810.NAA02118@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Bruce Momjian wrote:
>
> >
> > Never use sync(). Use fsync(). Other processes should take care of their
> > own syncing. If you use sync(), and you have a lot of disks, the sync
> > can
> > take half a minute if you are unlucky.
>
> We could use fsync() but then the postmaster has to know what tables
> have dirty buffers, and I don't think there is an easy way to do this.

There is one way - shared system cache...

Vadim


From vadim@sable.krasnoyarsk.su Fri Nov  7 01:31:24 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07639
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:31:22 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23094 for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:39:00 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25457; Fri, 7 Nov 1997 11:43:52 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <34629C87.3F54BC7E@sable.krasnoyarsk.su>
Date: Fri, 07 Nov 1997 11:43:51 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Mattias Kregert <matti@algonet.se>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
Subject: Re: Performance vs. Crash Recovery (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
References: <199711051615.LAA02260@candle.pha.pa.us> <34619E9E.622F563@algonet.se>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Mattias Kregert wrote:
>
> > The strange thing I am hearing is that the people who use PostgreSQL are
> > more worried about data recovery from a crash than million-dollar
> > companies that use commercial databases.
> >
> > I don't get it.
>
> Perhaps the million-dollar companies have more sophisticated hardware,
> like big expensive disk arrays, big UPS:es and parallell backup
> servers?
> If so, the risk of harware failure is much smaller for them.

More of that - Informix is more stable than postgres: elog(FATAL)
occures sometime and in fsync delayed mode this will cause
of losing xaction too, not onle hard/OS failure.

Vadim

From owner-pgsql-hackers@hub.org Fri Nov  7 01:31:26 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07642
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:31:24 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id AAA24358 for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 00:09:47 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA00167; Fri, 7 Nov 1997 00:03:17 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:01:26 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA29427 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:01:19 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA29364 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 00:01:02 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id XAA05565;
	Thu, 6 Nov 1997 23:54:33 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711070454.XAA05565@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Thu, 6 Nov 1997 23:54:33 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <34629AC9.15FB7483@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 11:36:25 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

I was worried when you didn't respond to my last list of ideas.  I
thought perhaps the idea was getting on your nerves.

I haven't dropped the idea because:

	1) it offers 2-9 times speedup in database modifications
	2) this is how the big commercial system handle it, and I think
	   we need to give users this option.
	3) in the way I had it designed, it wouldn't take much work to
	   do it.

Anything that promises that much speedup, if it can be done easy, I say
lets consider it, even if you loose 60 seconds of changes.


>    In my plans to re-design transaction system I supposed to keep in shmem
> two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
> to access them is not good idea. Also, we could use spinlock instead of
> lock manager to synchronize access to these pages (as I see in spin.c
> spinlock-s could be shared, but only exclusive ones are used) - spinlocks
> are faster.

Ah, so you already had the idea of having on-line pages in shared memory
as part of a transaction system overhaul?  Right now, does each backend
lock/read/write/unlock to get at pg_log?  Wow, that is bad.

Perhaps mmap() would be a good idea.  My system has msync() to flush
mmap()'ed pages to the underlying file.  You would still run fsync()
after that.  This may give us the best of both worlds: a shared-memory
area of variable size, and control of when it get flushed to disk.  Do
other OS's have this?  I have a feeling OS's with unified buffer caches
don't have this ability to determine when the underlying mmap'ed file
gets sent to the underlying file and disk.


>    These two last pg_log pages are "online" ones. Race condition: when one or
> both of online pages becomes non-online ones, i.e. pg_log has to be expanded
> when writing commit/abort of "big" xid. This is how we could handle this
> in "buffered" logging (delayed fsync) mode:
>
>    When backend want to write commit/abort status he acquires exclusive
> OnLineLogLock. If xid belongs to online pages then backend writes status
> and releases spin. If xid is less than least xid on 1st online page then
> backend releases spin and does exactly the same what he does in normal mode:
> flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
> update xid status, WriteBuffer, release write lock, flush pg_log.
> If xid is greater than max xid on 2nd online page then the simplest way is
> just do sync(); sync() (two times), flush 1st or both online pages,
> read new page(s) into online pages space, update xid status,
> release OnLineLogLock spin. We could try other ways but pg_log expanding
> is rare case (32K xids in one pg_log page)...
>    All what postmaster will have to do is:
> 1. Get shared OnLineLogLock.
> 2. Copy 2 x 8K data to private place.
> 3. Release spinlock.
> 4. sync(); sync(); (two times!)
> 5. Flush online pages.
>
> We could use -F DELAY_TIME to turn fsync delayed mode ON.
>
> And, btw, having two bits for xact status we have only one unused
> status value (0x11) currently - I would like to use this for
> nested xactions and savepoints...

I saw that.  By keeping two copies of pg_log, one in memory to be used
by all backend, and another that hits the disk, it certainly will work.

>
> > I don't want to force this idea on anyone, or annoy anyone.  I just
> > think it needs to be considered.  The concepts are unusual, so once
> > people get the full idea, if they don't like it, we can trash it.  I
> > still think it holds promise.
>
> Agreed.
>
> Vadim
>


--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Fri Nov  7 01:03:09 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07314
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:03:05 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA07879; Fri, 7 Nov 1997 00:57:42 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:55:52 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA03918 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:55:46 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA02961 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 00:55:18 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id MAA25567; Fri, 7 Nov 1997 12:59:29 +0700 (KRS)
Message-ID: <3462AE40.FF6D5DF@sable.krasnoyarsk.su>
Date: Fri, 07 Nov 1997 12:59:28 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711070454.XAA05565@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Bruce Momjian wrote:
>
> I was worried when you didn't respond to my last list of ideas.  I
> thought perhaps the idea was getting on your nerves.

No, I was (and, unfortunately, I still) busy...

>
> I haven't dropped the idea because:
>
>         1) it offers 2-9 times speedup in database modifications
>         2) this is how the big commercial system handle it, and I think
>            we need to give users this option.
>         3) in the way I had it designed, it wouldn't take much work to
>            do it.
>
> Anything that promises that much speedup, if it can be done easy, I say
> lets consider it, even if you loose 60 seconds of changes.

I agreed with your un-buffered logging idea. This would be excellent
feature for un-critical dbase usings (WWW, etc).

>
> >    In my plans to re-design transaction system I supposed to keep in shmem
> > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
> > to access them is not good idea. Also, we could use spinlock instead of
> > lock manager to synchronize access to these pages (as I see in spin.c
> > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
> > are faster.
>
> Ah, so you already had the idea of having on-line pages in shared memory
> as part of a transaction system overhaul?  Right now, does each backend

Yes. I hope to implement this in the next 1-2 weeks.

> lock/read/write/unlock to get at pg_log?  Wow, that is bad.

Yes, he does.

>
> Perhaps mmap() would be a good idea.  My system has msync() to flush
> mmap()'ed pages to the underlying file.  You would still run fsync()
> after that.  This may give us the best of both worlds: a shared-memory
                                                           ^^^^^^^^^^^^^
> area of variable size, and control of when it get flushed to disk.  Do
  ^^^^^^^^^^^^^^^^^^^^^
I like it. FreeBSD supports

MAP_ANON    Map anonymous memory not associated with any specific file.

It would be nice to use mmap to get more "shared" memory, but I don't see
reasons to mmap any particular file to memory. Having two last pg_log pages
in memory + xact commit/abort writeback optimization (updation of commit/abort
xmin/xmax status in tuples by any scan - we already have this) reduce access
to "old" pg_log pages to zero.

> other OS's have this?  I have a feeling OS's with unified buffer caches
> don't have this ability to determine when the underlying mmap'ed file
> gets sent to the underlying file and disk.
>
> >    These two last pg_log pages are "online" ones. Race condition: when one or
> > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
> > when writing commit/abort of "big" xid. This is how we could handle this
> > in "buffered" logging (delayed fsync) mode:
> >
> >    When backend want to write commit/abort status he acquires exclusive
> > OnLineLogLock. If xid belongs to online pages then backend writes status
> > and releases spin. If xid is less than least xid on 1st online page then
> > backend releases spin and does exactly the same what he does in normal mode:
> > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
> > update xid status, WriteBuffer, release write lock, flush pg_log.
> > If xid is greater than max xid on 2nd online page then the simplest way is
> > just do sync(); sync() (two times), flush 1st or both online pages,
> > read new page(s) into online pages space, update xid status,
> > release OnLineLogLock spin. We could try other ways but pg_log expanding
> > is rare case (32K xids in one pg_log page)...
> >    All what postmaster will have to do is:
> > 1. Get shared OnLineLogLock.
> > 2. Copy 2 x 8K data to private place.
> > 3. Release spinlock.
> > 4. sync(); sync(); (two times!)
> > 5. Flush online pages.
> >
> > We could use -F DELAY_TIME to turn fsync delayed mode ON.
> >
> > And, btw, having two bits for xact status we have only one unused
> > status value (0x11) currently - I would like to use this for
> > nested xactions and savepoints...
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
More about this: 0x11 could mean "this _child_ transaction is committed -
you have to lookup in pg_xact_child to get parent xid and use pg_log again
to get parent xact status". If parent committed then child xact status
will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this
we could get xact nesting and savepoints by starting new child xaction
inside running one...

>
> I saw that.  By keeping two copies of pg_log, one in memory to be used
                                        ^^^^^^
                             Just two pg_log pages...

> by all backend, and another that hits the disk, it certainly will work.

Vadim


From vadim@sable.krasnoyarsk.su Fri Nov  7 01:30:59 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07599
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:30:58 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id BAA26793 for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:12:33 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id NAA25592; Fri, 7 Nov 1997 13:16:39 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <3462B247.ABD322C@sable.krasnoyarsk.su>
Date: Fri, 07 Nov 1997 13:16:39 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Jan Wieck <wieck@sapserv.debis.de>
CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <m0xT9nq-000BFQC@orion.SAPserv.Hamburg.dsh.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

wieck@sapserv.debis.de wrote:
>
> Bruce wrote:
> >
> > > > It seems that this is what Oracle does, but Sybase writes queries
> > > > (with transaction ids, of 'course, and before execution) and
> > > > begin, commit/abort events <-- this is better for non-overwriting
> > > > system (shorter redo file), but, agreed, recovering is more complicated.
> > > >
> > > > Vadim
> > > >
> > >
> > >     Writing  only  the queries (and only those that really modify
> > >     data - no selects) would be much smarter and the  redo  files
> > >     will  be  shorter. But it wouldn't fit for PostgreSQL as long
> > >     as someone can submit a query like
> > >
> > >         DELETE FROM xxx WHERE oid = 59337;
> >
> > Interesting point.  Currently, an insert shows the OID as output in
> > psql.  Perhaps we could do a little oid-manipulating to set the oid of
> > the insert.
>
>     Only for simple inserts, not on
>
>         INSERT INTO xxx SELECT any_type_of_merge_join;

I don't know how but Sybase handle this and IDENTITY (case of OIDs) too.
But I don't object you, Jan, just because I havn't time to do
"log queries" redo implementation and so I would like to have "log changes"
redo at least. (Actually, "log changes" is good for my production dbase
with 1 - 2 thousand updations per day).
(BTW, "incrementing" backup could be implemented without redo - I have
some thoughts about this, - but having additional recovering is good
in any case).

Vadim

From owner-pgsql-hackers@hub.org Fri Nov  7 15:42:58 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA22341
	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 15:42:55 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA02769; Fri, 7 Nov 1997 15:28:54 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 15:24:00 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA01318 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 15:23:52 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id PAA00705 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 15:21:56 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id PAA20010;
	Fri, 7 Nov 1997 15:20:10 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711072020.PAA20010@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Fri, 7 Nov 1997 15:20:10 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <3462AE40.FF6D5DF@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 12:59:28 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> > Anything that promises that much speedup, if it can be done easy, I say
> > lets consider it, even if you loose 60 seconds of changes.
>
> I agreed with your un-buffered logging idea. This would be excellent
> feature for un-critical dbase usings (WWW, etc).

Actually, it is buffered logging.  We currently have unbuffered logging,
I think.

> > >    In my plans to re-design transaction system I supposed to keep in shmem
> > > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
> > > to access them is not good idea. Also, we could use spinlock instead of
> > > lock manager to synchronize access to these pages (as I see in spin.c
> > > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
> > > are faster.
> >
> > Ah, so you already had the idea of having on-line pages in shared memory
> > as part of a transaction system overhaul?  Right now, does each backend
>
> Yes. I hope to implement this in the next 1-2 weeks.
>
> > lock/read/write/unlock to get at pg_log?  Wow, that is bad.
>
> Yes, he does.
>
> >
> > Perhaps mmap() would be a good idea.  My system has msync() to flush
> > mmap()'ed pages to the underlying file.  You would still run fsync()
> > after that.  This may give us the best of both worlds: a shared-memory
>                                                            ^^^^^^^^^^^^^
> > area of variable size, and control of when it get flushed to disk.  Do
>   ^^^^^^^^^^^^^^^^^^^^^
> I like it. FreeBSD supports
>
> MAP_ANON    Map anonymous memory not associated with any specific file.
>
> It would be nice to use mmap to get more "shared" memory, but I don't see
> reasons to mmap any particular file to memory. Having two last pg_log pages
> in memory + xact commit/abort writeback optimization (updation of commit/abort
> xmin/xmax status in tuples by any scan - we already have this) reduce access
> to "old" pg_log pages to zero.

I totally agree.  There is no advantage to mmap() vs. shared memory for
us.  I thought if we could control when the mmap() gets flushed to disk,
we could let the OS handle the syncing, but I doubt this is going to be
portable.

Though, we could mmap() pg_log, and that way backends would not have to
read/write the blocks, and they could all see the same data.  But with
the new scheme, they have most transaction ids in shared memory.

Interesting you mention the scan updating the transaction status.  We
would have a problem here.  It is possible a backend will update the
commit status of a data page, and that data page will make it to disk,
but if there is a crash before the update pg_log gets sync'ed, there
would be a partial transaction in the system.

I don't know any way that a backend would know the transaction has hit
disk, and the data commit flag could be set.  You don't want to update
the commit flag of the data page until entire transaction has been
sync'ed.  The only way to do that would be to have a 'commit and synced'
flag, but you want to save that for nested transactions.

Another case this could come in handy is to allow reuse of superceeded
data rows.  If the transaction is committed and synced, the row space
could be reused by another transaction.

> > other OS's have this?  I have a feeling OS's with unified buffer caches
> > don't have this ability to determine when the underlying mmap'ed file
> > gets sent to the underlying file and disk.
> >
> > >    These two last pg_log pages are "online" ones. Race condition: when one or
> > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
> > > when writing commit/abort of "big" xid. This is how we could handle this
> > > in "buffered" logging (delayed fsync) mode:
> > >
> > >    When backend want to write commit/abort status he acquires exclusive
> > > OnLineLogLock. If xid belongs to online pages then backend writes status

This confuses me.  Why does a backend need to lock pg_log to update a
transaction status?

> > > and releases spin. If xid is less than least xid on 1st online page then
> > > backend releases spin and does exactly the same what he does in normal mode:
> > > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
> > > update xid status, WriteBuffer, release write lock, flush pg_log.
> > > If xid is greater than max xid on 2nd online page then the simplest way is
> > > just do sync(); sync() (two times), flush 1st or both online pages,
> > > read new page(s) into online pages space, update xid status,
> > > release OnLineLogLock spin. We could try other ways but pg_log expanding
> > > is rare case (32K xids in one pg_log page)...
> > >    All what postmaster will have to do is:
> > > 1. Get shared OnLineLogLock.
> > > 2. Copy 2 x 8K data to private place.
> > > 3. Release spinlock.
> > > 4. sync(); sync(); (two times!)
> > > 5. Flush online pages.

Great.

> > >
> > > We could use -F DELAY_TIME to turn fsync delayed mode ON.
> > >
> > > And, btw, having two bits for xact status we have only one unused
> > > status value (0x11) currently - I would like to use this for
> > > nested xactions and savepoints...
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> More about this: 0x11 could mean "this _child_ transaction is committed -
> you have to lookup in pg_xact_child to get parent xid and use pg_log again
> to get parent xact status". If parent committed then child xact status
> will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this
> we could get xact nesting and savepoints by starting new child xaction
> inside running one...

OK.

>
> >
> > I saw that.  By keeping two copies of pg_log, one in memory to be used
>                                         ^^^^^^
>                              Just two pg_log pages...

Got it.


--
Bruce Momjian
maillist@candle.pha.pa.us


From owner-pgsql-hackers@hub.org Sun Nov  9 22:07:36 1997
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA04655
	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 22:07:30 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id VAA07023; Sun, 9 Nov 1997 21:55:54 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 21:52:20 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id VAA06174 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 21:52:13 -0500 (EST)
Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id VAA06092 for <hackers@postgreSQL.org>; Sun, 9 Nov 1997 21:51:58 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id VAA04150;
	Sun, 9 Nov 1997 21:50:29 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711100250.VAA04150@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! (fwd)
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Sun, 9 Nov 1997 21:50:29 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Forwarded message:
> > > Perhaps mmap() would be a good idea.  My system has msync() to flush
> > > mmap()'ed pages to the underlying file.  You would still run fsync()
> > > after that.  This may give us the best of both worlds: a shared-memory
> >                                                            ^^^^^^^^^^^^^
> > > area of variable size, and control of when it get flushed to disk.  Do
> >   ^^^^^^^^^^^^^^^^^^^^^
> > I like it. FreeBSD supports
> >
> > MAP_ANON    Map anonymous memory not associated with any specific file.
> >
> > It would be nice to use mmap to get more "shared" memory, but I don't see
> > reasons to mmap any particular file to memory. Having two last pg_log pages
> > in memory + xact commit/abort writeback optimization (updation of commit/abort
> > xmin/xmax status in tuples by any scan - we already have this) reduce access
> > to "old" pg_log pages to zero.
>
> I totally agree.  There is no advantage to mmap() vs. shared memory for
> us.  I thought if we could control when the mmap() gets flushed to disk,
> we could let the OS handle the syncing, but I doubt this is going to be
> portable.
>
> Though, we could mmap() pg_log, and that way backends would not have to
> read/write the blocks, and they could all see the same data.  But with
> the new scheme, they have most transaction ids in shared memory.
>
> Interesting you mention the scan updating the transaction status.  We
> would have a problem here.  It is possible a backend will update the
> commit status of a data page, and that data page will make it to disk,
> but if there is a crash before the update pg_log gets sync'ed, there
> would be a partial transaction in the system.
>
> I don't know any way that a backend would know the transaction has hit
> disk, and the data commit flag could be set.  You don't want to update
> the commit flag of the data page until entire transaction has been
> sync'ed.  The only way to do that would be to have a 'commit and synced'
> flag, but you want to save that for nested transactions.
>
> Another case this could come in handy is to allow reuse of superceeded
> data rows.  If the transaction is committed and synced, the row space
> could be reused by another transaction.
>

I have been thinking about the mmap() issue, and it seems a natural for
pg_log.  You can have every backend mmap() pg_log.  It becomes a dynamic
shared memory area that is auto-initialized to the contents of pg_log,
and all changes can be made by all backends.  No locking needed.  We can
also flush the changes to the underlying file.  Under bsdi, you can also
have the mmap area follow you across exec() calls, so each backend
doesn't have to do anything.  I want to replace exec with fork also, so
the stuff would be auto-loaded in the address space of each backend.

This way, you don't have to have two on-line pages and move them around
as pg_log grows.

The only problem remains how to mark certain transactions as synced or
force only synced transactions to hit the pg_log file itself, and data
row commit status only should be updated for synced transactions.

--
Bruce Momjian
maillist@candle.pha.pa.us


From vadim@sable.krasnoyarsk.su Sun Nov  9 23:00:58 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA05394
	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:00:55 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA25139 for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 22:42:33 -0500 (EST)
Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id KAA01845; Mon, 10 Nov 1997 10:49:25 +0700 (KRS)
Sender: root@www.krasnet.ru
Message-ID: <34668444.237C228A@sable.krasnoyarsk.su>
Date: Mon, 10 Nov 1997 10:49:24 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: marc@fallon.classyad.com, hackers@postgreSQL.org
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
References: <199711072020.PAA20010@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> > > Anything that promises that much speedup, if it can be done easy, I say
> > > lets consider it, even if you loose 60 seconds of changes.
> >
> > I agreed with your un-buffered logging idea. This would be excellent
> > feature for un-critical dbase usings (WWW, etc).
>
> Actually, it is buffered logging.  We currently have unbuffered logging,
> I think.

Sorry - mistyping.

>
> Interesting you mention the scan updating the transaction status.  We
> would have a problem here.  It is possible a backend will update the
> commit status of a data page, and that data page will make it to disk,
> but if there is a crash before the update pg_log gets sync'ed, there
> would be a partial transaction in the system.

You're right! Currently, only system relations can be affected by this:
backend releases locks on user tables after syncing data and pg_log.
I'll keep this in mind...

> > > >    These two last pg_log pages are "online" ones. Race condition: when one or
> > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
> > > > when writing commit/abort of "big" xid. This is how we could handle this
> > > > in "buffered" logging (delayed fsync) mode:
> > > >
> > > >    When backend want to write commit/abort status he acquires exclusive
> > > > OnLineLogLock. If xid belongs to online pages then backend writes status
>
> This confuses me.  Why does a backend need to lock pg_log to update a
> transaction status?

What if two backends try to change xact statuses in the same byte ?

Vadim

From owner-pgsql-hackers@hub.org Sun Nov  9 23:59:50 1997
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA06523
	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:59:48 -0500 (EST)
Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA27105 for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:41:39 -0500 (EST)
Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA08860; Sun, 9 Nov 1997 23:35:42 -0500 (EST)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 23:31:50 -0500 (EST)
Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA07962 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 23:31:43 -0500 (EST)
Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA07875 for <hackers@postgreSQL.org>; Sun, 9 Nov 1997 23:31:28 -0500 (EST)
Received: (from maillist@localhost)
	by candle.pha.pa.us (8.8.5/8.8.5) id XAA05566;
	Sun, 9 Nov 1997 23:17:41 -0500 (EST)
From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711100417.XAA05566@candle.pha.pa.us>
Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Sun, 9 Nov 1997 23:17:41 -0500 (EST)
Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
In-Reply-To: <34668444.237C228A@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 10, 97 10:49:24 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

> > > > >    These two last pg_log pages are "online" ones. Race condition: when one or
> > > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
> > > > > when writing commit/abort of "big" xid. This is how we could handle this
> > > > > in "buffered" logging (delayed fsync) mode:
> > > > >
> > > > >    When backend want to write commit/abort status he acquires exclusive
> > > > > OnLineLogLock. If xid belongs to online pages then backend writes status
> >
> > This confuses me.  Why does a backend need to lock pg_log to update a
> > transaction status?
>
> What if two backends try to change xact statuses in the same byte ?

Ooo, you got me.  I so hoped to prevent locking.  It would be nice if:

	*x |= 3;

would be atomic, but I don't think it is.  Most RISC machines don't even
have an OR against a memory address, I think.

--
Bruce Momjian
maillist@candle.pha.pa.us