Add raw file discussion to performance TODO.detail.

parent 7e3f2449d8
commit e21e02ab12
@@ -345,7 +345,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
 	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
-Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
+Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
 Received: from localhost (majordom@localhost)
 	by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
 	Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
@@ -454,7 +454,7 @@ From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
 Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
 	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
-Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
+Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
 Received: from localhost (majordom@localhost)
 	by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
 	Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
@@ -1006,7 +1006,7 @@ From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
 Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
 	for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
-Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.13 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
+Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.14 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
 Received: from hub.org (majordom@localhost [127.0.0.1])
 	by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
 	Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
@@ -2239,3 +2239,796 @@ from 1 to "maybe" for nodes that get too dense.
 Hannu


From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002
Return-path: <pgsql-hackers-owner+M21991@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337
	for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT)
Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2])
	by postgresql.org (Postfix) with ESMTP id 3EE92474E4B
	for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT)
Received: from srascb.sra.co.jp (srascb [133.137.8.65])
	by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393;
	Thu, 25 Apr 2002 12:35:44 +0900 (JST)
Received: (from root@localhost)
	by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299;
	Thu, 25 Apr 2002 12:35:12 +0900 (JST)
	(envelope-from t-ishii@sra.co.jp)
Received: from sranhm.sra.co.jp (sranhm [133.137.170.62])
	by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291;
	Thu, 25 Apr 2002 12:35:11 +0900 (JST)
	(envelope-from t-ishii@sra.co.jp)
Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59])
	by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562;
	Thu, 25 Apr 2002 12:35:43 +0900
To: tgl@sss.pgh.pa.us
cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
	<12342.1019705420@sss.pgh.pa.us>
X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1
	=?iso-2022-jp?B?KBskQjAqGyhCKQ==?=
MIME-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <20020425123429E.t-ishii@sra.co.jp>
Date: Thu, 25 Apr 2002 12:34:29 +0900
From: Tatsuo Ishii <t-ishii@sra.co.jp>
X-Dispatcher: imput version 20000228(IM140)
Lines: 12
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
> 
> Proof?

A long time ago I tested with a 32k block size and got a 1.5-2x speedup
compared to the ordinary 8k block size in the sequential scan case.
FYI, in case this is relevant.
--
Tatsuo Ishii

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html
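
[Editorial note: PostgreSQL's block size is a compile-time constant, so a
test like Tatsuo's means rebuilding the server with a different BLCKSZ.
The header that defines it has moved between versions, so the path below
is an assumption:]

/* src/include/pg_config.h (location varies by version) */
#define BLCKSZ 32768	/* default is 8192 */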

From mloftis@wgops.com Thu Apr 25 01:43:14 2002
Return-path: <mloftis@wgops.com>
Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT)
Received: from wgops.com ([10.1.2.207])
	by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020;
	Wed, 24 Apr 2002 22:43:11 -0700 (PDT)
	(envelope-from mloftis@wgops.com)
Message-ID: <3CC7976F.7070407@wgops.com>
Date: Wed, 24 Apr 2002 22:43:11 -0700
From: Michael Loftis <mloftis@wgops.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2
X-Accept-Language: en-us
MIME-Version: 1.0
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Status: OR


Tom Lane wrote:

>Curt Sampson <cjs@cynic.net> writes:
>
>>Grabbing bigger chunks is always optimal, AFICT, if they're not
>>*too* big and you use the data. A single 64K read takes very little
>>longer than a single 8K read.
>>
>
>Proof?
>
I contest this statement.

It's optimal to a point.  I know that my system settles into its best
read speeds @ 32K or 64K chunks.  8K chunks are far below optimal for my
system.  Most systems I work on do far better at 16K than at 8K, and
most don't see any degradation when going to 32K chunks.  (This is
across numerous OSes and configs -- the results are interpretations of
bonnie disk I/O marks.)

Depending on what you're doing, it is more efficient to read bigger
blocks, up to a point.  If you're multi-threaded or reading in non-blocking
mode, take as big a chunk as you can handle or are ready to process in
quick order.  If you're picking up a bunch of little chunks here and
there and know you're not using them again, then choose a size that will
hopefully cause some of the reads to overlap; failing that, pick the
smallest usable read size.

The OS can never do that stuff for you.


From cjs@cynic.net Thu Apr 25 03:29:05 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Wed, 24 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

Well, there are various sorts of "proof" for this assertion. What
sort do you want?

Here are a few samples; if you're looking for something different to
satisfy you, let's discuss it.

1. Theoretical proof: two components of the delay in retrieving a
block from disk are the disk arm movement and the wait for the
right block to rotate under the head.

When retrieving, say, eight adjacent blocks, these will be spread
across no more than two cylinders (with luck, only one). The worst
case access time for a single block is the disk arm movement plus
the full rotational wait; this is the same as the worst case for
eight blocks if they're all on one cylinder. If they're not on one
cylinder, they're still on adjacent cylinders, requiring a very
short seek.
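
[For scale, with assumed drive numbers: a 7200 rpm disk rotates once
every 60/7200 s, about 8.3 ms, so the worst-case rotational wait alone
is ~8.3 ms and a typical seek is of the same order, while at a media
rate of, say, 20 MB/s each additional adjacent 8K block costs only
~0.4 ms of transfer time.]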

2. Proof by others using it: SQL Server uses 64K reads when doing
table scans, as they say that their research indicates that the
major limitation is usually the number of I/O requests, not the
I/O capacity of the disk. BSD's FFS explicitly separates the optimum
allocation size for storage (1K fragments) from the optimum read size
(8K blocks) because they found performance to be much better when
a larger block size was read. Most file system vendors, too, do
read-ahead for this very reason.

3. Proof by testing. I wrote a little ruby program to seek to a
random point in the first 2 GB of my raw disk partition and read
1-8 8K blocks of data. (This was done as one I/O request.) (Using
the raw disk partition I avoid any filesystem buffering.) Here are
typical results:

 125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
 250 reads of  8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
 500 reads of  4x8K blocks: 2.5 sec, 199 req/sec.   5.03 ms/req, 1.26 ms/block
1000 reads of  2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
2000 reads of  1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block

The ratios of data retrieval speed per read for groups of adjacent
8K blocks, assuming a single 8K block reads in 1 time unit, are:

    1 block	1.00
    2 blocks	1.18
    4 blocks	1.56
    8 blocks	2.34
    16 blocks	4.68

At less than 20% more expensive, certainly two-block read requests
could be considered to cost "very little more" than one-block read
requests. Even four-block read requests are only half again as
expensive. And if you know you're really going to be using the
data, read in 8-block chunks and your cost per block (in terms of
time) drops to less than a third of the cost of single-block reads.

Let me put paid to comments about multiple simultaneous readers
making this invalid. Here's a typical result I get with four
instances of the program running simultaneously:

125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block

Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and four
processes:

    1 block	1.00		310 : 440
    2 blocks	1.10		262 : 401
    4 blocks	1.28		199 : 346
    8 blocks	1.69		132 : 260
    16 blocks	3.89		 66 : 113

Note that here the relative increase in performance for increasing
sizes of reads is even *better* until we get past 64K chunks. The
overall throughput is better, of course, because with more requests
per second coming in, the disk seek ordering code has more to work
with, and the average time spent seeking vs. reading will be
reduced.

You know, this is not rocket science; I'm sure there must be papers
all over the place about this. If anybody still disagrees that it's
a good thing to read chunks of up to 64K or so when the blocks are
adjacent and you know you'll need the data, I'd like to see some
tangible evidence to support that.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
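
[A minimal C sketch of the test described above, for anyone who wants to
reproduce it. The original was a ruby program, so this is an editorial
illustration rather than Curt's code; the device path and the fixed
2000-block total are assumptions taken from the figures above. Raw
devices often demand page-aligned buffers, hence posix_memalign().]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLK         8192L      /* bytes per block, as in the test above */
#define SPAN_BLOCKS 262144L    /* 2 GB / 8K: stay in the first 2 GB     */

int main(int argc, char **argv)
{
    int   nblk = (argc > 2) ? atoi(argv[2]) : 1;  /* blocks per request */
    int   nreq = 2000 / nblk;         /* same total data for every run  */
    void *buf;
    int   fd = open(argv[1], O_RDONLY);  /* e.g. a raw partition node   */

    if (fd < 0 || posix_memalign(&buf, 8192, nblk * BLK) != 0)
        return 1;
    srandom((unsigned) getpid());

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nreq; i++)
    {
        /* one I/O request for nblk adjacent 8K blocks, 8K-aligned */
        off_t off = (random() % SPAN_BLOCKS) * (off_t) BLK;
        if (pread(fd, buf, nblk * BLK, off) < 0)
            perror("pread");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d reads of %dx8K blocks: %.1f sec, %.1f req/sec\n",
           nreq, nblk, sec, nreq / sec);
    return 0;
}

[Wall-clock time is what matters for an I/O test, hence CLOCK_MONOTONIC
rather than CPU time.]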


From cjs@cynic.net Thu Apr 25 03:55:59 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST)
Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us>
Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Thu, 25 Apr 2002, Bruce Momjian wrote:

> Well, we are guilty of trying to push as much as possible on to other
> software.  We do this for portability reasons, and because we think our
> time is best spent dealing with db issues, not issues that can be dealt
> with by other existing software, as long as the software is decent.

That's fine. I think that's a perfectly fair thing to do.

It was just the wording (i.e., "it's this other software's fault
that blah de blah") that got to me. To say, "We don't do readahead
because most OSes supply it, and we feel that other things would
help more to improve performance," is fine by me. Or even, "Well,
nobody feels like doing it. You want it, do it yourself," I have
no problem with.

> Sure, that is certainly true.  However, it is hard to know what the
> future will hold even if we had perfect knowledge of what was happening
> in the kernel.  We don't know who else is going to start doing I/O once
> our I/O starts.  We may have a better idea with kernel knowledge, but we
> still don't know 100% what will be cached.

Well, we do if we use raw devices and do our own caching, using
pages that are pinned in RAM. That was sort of what I was aiming
at for the long run.

> We have free-behind on our list.

Uh...can't do it, if you're relying on the OS to do the buffering.
How do you tell the OS that you're no longer going to use a page?
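
[Editorial aside: on systems that implement posix_fadvise(2) from
POSIX-2001 there is in fact a way to say exactly that, though whether
the kernel honors it is another matter. A minimal sketch, assuming the
call is available:]

#include <fcntl.h>

/* Hint that we are done with a byte range of the file, so the kernel
 * may drop those pages from its buffer cache.  Purely advisory. */
static void
done_with_range(int fd, off_t offset, off_t len)
{
    (void) posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}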

> I think LRU-K will do this quite well
> and be a nice general solution for more than just sequential scans.

LRU-K sounds like a great idea to me, as does putting pages read
for a table scan at the LRU end of the cache, rather than the MRU
(assuming we do something to ensure that they stay in cache until
read once, at any rate).

But again, great for your own cache, but doesn't work with the OS
cache. And I'm a bit scared to crank up too high the amount of
memory I give Postgres, lest the OS try to too aggressively buffer
all that I/O in what memory remains to it, and start blowing programs
(like maybe the backend binary itself) out of RAM. But maybe this
isn't typically a problem; I don't know.

> There may be validity in this.  It is easy to do (I think) and could be
> a win.

It didn't look too difficult to me, when I looked at the code, and
you can see what kind of win it is from the response I just made
to Tom.

> >     1. It is *not* true that you have no idea where data is when
> >     using a storage array or other similar system. While you
> >     certainly ought not worry about things such as head positions
> >     and so on, it's been a given for a long, long time that two
> >     blocks that have close index numbers are going to be close
> >     together in physical storage.
>
> SCSI drivers, for example, are pretty smart.  Not sure we can take
> advantage of that from user-land I/O.

Looking at the NetBSD ones, I don't see what they're doing that's
so smart. (Aside from some awfully clever workarounds for stupid
hardware limitations that would otherwise kill performance.) What
sorts of "smart" are you referring to?

> Yes, but we are seeing some db's moving away from raw I/O.

Such as whom? And are you certain that they're moving to using the
OS buffer cache, too? MS SQL Server, for example, uses the filesystem,
but turns off all buffering on those files.

> Our performance numbers beat most of the big db's already, so we must
> be doing something right.

Really? Do the performance numbers for simple, bulk operations
(imports, exports, table scans) beat the others handily? My intuition
says not, but I'll happily be convinced otherwise.

> Yes, but do we spend our time doing that?  Is the payoff worth it, vs.
> working on other features?  Sure it would be great to have all these
> fancy things, but is this where our time should be spent, considering
> other items on the TODO list?

I agree that these things need to be assessed.

> Jumping in and doing the I/O ourselves is a big undertaking, and looking
> at our TODO list, I am not sure if it is worth it right now.

Right. I'm not trying to say this is a critical priority, I'm just
trying to determine what we do right now, what we could do, and
the potential performance increase that would give us.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


From cjs@cynic.net Thu Apr 25 05:19:11 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST)
Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Thu, 25 Apr 2002, Curt Sampson wrote:

> Here's the ratio table again, with another column comparing the
> aggregate number of requests per second for one process and four
> processes:
>

Just for interest, I ran this again with 20 processes working
simultaneously. I did six runs at each blockread size and summed
the tps for each process to find the aggregate number of reads per
second during the test. I dropped the highest and the lowest ones,
and averaged the rest. Here's the new table:

		1 proc	4 procs	20 procs

    1 block	310	440	260
    2 blocks	262	401	481
    4 blocks	199	346	354
    8 blocks	132	260	250
    16 blocks	 66	113	116

I'm not sure at all why performance gets so much *worse* with a lot
of contention on the 1-block reads. This could have something to do
with NetBSD, or its buffer cache, or my laptop's crappy little disk
drive....

Or maybe I'm just running out of CPU.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002
Return-path: <tgl@sss.pgh.pa.us>
Received: from sss.pgh.pa.us (root@[192.204.191.242])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059;
	Thu, 25 Apr 2002 09:54:33 -0400 (EDT)
To: Curt Sampson <cjs@cynic.net>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> 
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
	message dated "Thu, 25 Apr 2002 16:28:51 +0900"
Date: Thu, 25 Apr 2002 09:54:32 -0400
Message-ID: <25056.1019742872@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: OR

Curt Sampson <cjs@cynic.net> writes:
> 1. Theoretical proof: two components of the delay in retrieving a
> block from disk are the disk arm movement and the wait for the
> right block to rotate under the head.

> When retrieving, say, eight adjacent blocks, these will be spread
> across no more than two cylinders (with luck, only one).

Weren't you contending earlier that with modern disk mechs you really
have no idea where the data is?  You're asserting as an article of
faith that the OS has been able to place the file's data blocks
optimally --- or at least well enough to avoid unnecessary seeks.
But just a few days ago I was getting told that random_page_cost
was BS because there could be no such placement.

I'm getting a tad tired of sweeping generalizations offered without
proof, especially when they conflict.

> 3. Proof by testing. I wrote a little ruby program to seek to a
> random point in the first 2 GB of my raw disk partition and read
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> the raw disk partition I avoid any filesystem buffering.)

And also ensure that you aren't testing the point at issue.
The point at issue is that *in the presence of kernel read-ahead*
it's quite unclear that there's any benefit to a larger request size.
Ideally the kernel will have the next block ready for you when you
ask, no matter what the request is.

There's been some talk of using the AIO interface (where available)
to "encourage" the kernel to do read-ahead.  I don't foresee us
writing our own substitute filesystem to make this happen, however.
Oracle may have the manpower for that sort of boondoggle, but we
don't...

			regards, tom lane
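
[For reference, the AIO idea mentioned above amounts to starting the
read of block n+1 before block n has been consumed. A minimal sketch
using POSIX aio(3) where available; the function name, fixed buffer,
and 8K request size are illustrative:]

#include <aio.h>
#include <string.h>

static struct aiocb readahead_cb;
static char         readahead_buf[8192];

/* Queue an asynchronous read of the next block.  aio_read() returns
 * immediately, so the kernel can fetch the block while the caller is
 * still processing the current one. */
static void
prefetch_block(int fd, off_t next_offset)
{
    memset(&readahead_cb, 0, sizeof(readahead_cb));
    readahead_cb.aio_fildes = fd;
    readahead_cb.aio_buf    = readahead_buf;
    readahead_cb.aio_nbytes = sizeof(readahead_buf);
    readahead_cb.aio_offset = next_offset;
    (void) aio_read(&readahead_cb);
}

[The caller would later block in aio_suspend() and collect the result
with aio_error()/aio_return() before touching the buffer.]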

From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002
Return-path: <pgsql-hackers-owner+M22053@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT)
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
	by postgresql.org (Postfix) with ESMTP id 257DC47591C
	for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT)
Received: (from kaf@localhost)
	by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397;
	Thu, 25 Apr 2002 17:40:53 -0700
From: Kyle <kaf@nwlink.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
Date: Thu, 25 Apr 2002 17:40:53 -0700
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
	<25056.1019742872@sss.pgh.pa.us>
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR

Tom Lane wrote:
> ...
> Curt Sampson <cjs@cynic.net> writes:
> > 3. Proof by testing. I wrote a little ruby program to seek to a
> > random point in the first 2 GB of my raw disk partition and read
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > the raw disk partition I avoid any filesystem buffering.)
> 
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
> Ideally the kernel will have the next block ready for you when you
> ask, no matter what the request is.
> ...

I have to agree with Tom.  I think the numbers below show that with
kernel read-ahead, block size isn't an issue.

The big_file1 file used below is 2.0 gig of random data, and the
machine has 512 mb of main memory.  This ensures that we're not
just getting cached data.

foreach i (4k 8k 16k 32k 64k 128k)
  echo $i
  time dd bs=$i if=big_file1 of=/dev/null
end

and the results:

bs    user    kernel   elapsed
4k:   0.260   7.740    1:27.25
8k:   0.210   8.060    1:30.48
16k:  0.090   7.790    1:30.88
32k:  0.060   8.090    1:32.75
64k:  0.030   8.190    1:29.11
128k: 0.070   9.830    1:28.74

So with kernel read-ahead, we have basically the same elapsed (wall)
time regardless of block size.  Sure, user time drops to a low at 64k
blocksize, but kernel time is increasing.


You could argue that this is a contrived example: no other I/O is
being done.  Well, I created a second 2.0g file (big_file2) and did two
simultaneous reads from the same disk.  Sure, performance went to hell,
but it shows blocksize is still irrelevant in a multi-I/O environment
with sequential read-ahead.

foreach i ( 4k 8k 16k 32k 64k 128k )
  echo $i
  time dd bs=$i if=big_file1 of=/dev/null &
  time dd bs=$i if=big_file2 of=/dev/null &
  wait
end

bs    user    kernel   elapsed
4k:   0.480   8.290    6:34.13  bigfile1
      0.320   8.730    6:34.33  bigfile2
8k:   0.250   7.580    6:31.75
      0.180   8.450    6:31.88
16k:  0.150   8.390    6:32.47
      0.100   7.900    6:32.55
32k:  0.190   8.460    6:24.72
      0.060   8.410    6:24.73
64k:  0.060   9.350    6:25.05
      0.150   9.240    6:25.13
128k: 0.090  10.610    6:33.14
      0.110  11.320    6:33.31


The differences in read times are basically in the mud.  Blocksize
just doesn't matter much with the kernel doing readahead.

-Kyle

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002
Return-path: <pgsql-hackers-owner+M22055@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT)
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
	by postgresql.org (Postfix) with ESMTP id 6741D474E71
	for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT)
Received: (from pgman@localhost)
	by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246;
	Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
To: Kyle <kaf@nwlink.com>
Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR


Nice test.  Would you test simultaneous 'dd' on the same file, perhaps
with a slight delay between the two so they don't read each other's
blocks?

seek() in the file will turn off read-ahead in most OS's.  I am not
saying this is a major issue for PostgreSQL, but the numbers would be
interesting.


---------------------------------------------------------------------------

Kyle wrote:
> Tom Lane wrote:
> > ...
> > Curt Sampson <cjs@cynic.net> writes:
> > > 3. Proof by testing. I wrote a little ruby program to seek to a
> > > random point in the first 2 GB of my raw disk partition and read
> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > > the raw disk partition I avoid any filesystem buffering.)
> > 
> > And also ensure that you aren't testing the point at issue.
> > The point at issue is that *in the presence of kernel read-ahead*
> > it's quite unclear that there's any benefit to a larger request size.
> > Ideally the kernel will have the next block ready for you when you
> > ask, no matter what the request is.
> > ...
> 
> I have to agree with Tom.  I think the numbers below show that with
> kernel read-ahead, block size isn't an issue.
> 
> The big_file1 file used below is 2.0 gig of random data, and the
> machine has 512 mb of main memory.  This ensures that we're not
> just getting cached data.
> 
> foreach i (4k 8k 16k 32k 64k 128k)
>   echo $i
>   time dd bs=$i if=big_file1 of=/dev/null
> end
> 
> and the results:
> 
> bs    user    kernel   elapsed
> 4k:   0.260   7.740    1:27.25
> 8k:   0.210   8.060    1:30.48
> 16k:  0.090   7.790    1:30.88
> 32k:  0.060   8.090    1:32.75
> 64k:  0.030   8.190    1:29.11
> 128k: 0.070   9.830    1:28.74
> 
> So with kernel read-ahead, we have basically the same elapsed (wall)
> time regardless of block size.  Sure, user time drops to a low at 64k
> blocksize, but kernel time is increasing.
> 
> 
> You could argue that this is a contrived example: no other I/O is
> being done.  Well, I created a second 2.0g file (big_file2) and did two
> simultaneous reads from the same disk.  Sure, performance went to hell,
> but it shows blocksize is still irrelevant in a multi-I/O environment
> with sequential read-ahead.
> 
> foreach i ( 4k 8k 16k 32k 64k 128k )
>   echo $i
>   time dd bs=$i if=big_file1 of=/dev/null &
>   time dd bs=$i if=big_file2 of=/dev/null &
>   wait
> end
> 
> bs    user    kernel   elapsed
> 4k:   0.480   8.290    6:34.13  bigfile1
>       0.320   8.730    6:34.33  bigfile2
> 8k:   0.250   7.580    6:31.75
>       0.180   8.450    6:31.88
> 16k:  0.150   8.390    6:32.47
>       0.100   7.900    6:32.55
> 32k:  0.190   8.460    6:24.72
>       0.060   8.410    6:24.73
> 64k:  0.060   9.350    6:25.05
>       0.150   9.240    6:25.13
> 128k: 0.090  10.610    6:33.14
>       0.110  11.320    6:33.31
> 
> 
> The differences in read times are basically in the mud.  Blocksize
> just doesn't matter much with the kernel doing readahead.
> 
> -Kyle
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

From cjs@cynic.net Thu Apr 25 22:27:23 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST)
Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Thu, 25 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > 1. Theoretical proof: two components of the delay in retrieving a
> > block from disk are the disk arm movement and the wait for the
> > right block to rotate under the head.
>
> > When retrieving, say, eight adjacent blocks, these will be spread
> > across no more than two cylinders (with luck, only one).
>
> Weren't you contending earlier that with modern disk mechs you really
> have no idea where the data is?

No, that was someone else. I contend that with pretty much any
large-scale storage mechanism (i.e., anything beyond ramdisks),
you will find that accessing two adjacent blocks is almost always
1) close to as fast as accessing just the one, and 2) much, much
faster than accessing two blocks that are relatively far apart.

There will be the odd case where the two adjacent blocks are
physically far apart, but this is rare.

If this idea doesn't hold true, the whole idea that sequential
reads are faster than random reads falls apart, and the optimizer
shouldn't even have the option to make random reads cost more, much
less have it set to four rather than one (or whatever it's set to).

> You're asserting as an article of
> faith that the OS has been able to place the file's data blocks
> optimally --- or at least well enough to avoid unnecessary seeks.

So are you, in the optimizer. But that's all right; the OS often
can and does do this placement; the FFS filesystem is explicitly
designed to do this sort of thing. If the filesystem isn't empty
and the files grow a lot they'll be split into large fragments,
but the fragments will be contiguous.

> But just a few days ago I was getting told that random_page_cost
> was BS because there could be no such placement.

I've been arguing against that point as well.

> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.

I will test this.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC