Re: Sunfire X4500 recommendations

From: Matt Smiley
Subject: Re: Sunfire X4500 recommendations
Date:
Msg-id: 46037498020000280001F330@rtk_gwim1.rentrak.com
In response to: Sunfire X4500 recommendations  ("Matt Smiley" <mss@rentrak.com>)
Responses: Re: Sunfire X4500 recommendations  (Dimitri <dimitrik.fr@gmail.com>)
List: pgsql-performance
Thanks Dimitri!  That was very educational material!  I'm going to think out loud here, so please correct me if you see any errors.

The section on tuning for OLTP transactions was interesting, although my OLAP workload will be predominantly bulk I/O over large datasets of mostly-sequential blocks.

The NFS+ZFS section talked about the zil_disable control for making zfs ignore commits/fsyncs.  Given that Postgres' executor does single-threaded synchronous I/O like the tar example, it seems like it might benefit significantly from setting zil_disable=1, at least in the case of frequently flushed/committed writes.  However, zil_disable=1 sounds unsafe for the datafiles' filesystem, and would probably only be acceptable for the xlogs if they're stored on a separate filesystem and you're willing to lose recently committed transactions.  This sounds pretty similar to just setting fsync=off in postgresql.conf, which is easier to change later, so I'll skip the zil_disable control.
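
For reference, here's a minimal sketch of both knobs, assuming Solaris 10 of that era.  Note that zil_disable was a global kernel tunable at the time, so it applied to every zfs filesystem on the box, not just one:

  * in /etc/system -- disables the ZIL for ALL zfs filesystems (takes effect at boot)
  set zfs:zil_disable = 1

  # or flip it on a live system with mdb (same global scope)
  echo zil_disable/W0t1 | mdb -kw

  # the roughly equivalent Postgres-side knob, in postgresql.conf
  fsync = off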

The RAID-Z section was a little surprising.  It made RAID-Z sound just like RAID 50, in that you can customize the trade-off between iops versus usable diskspace and fault-tolerance by adjusting the number/size of parity-protected disk groups.  The only difference I noticed was that RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be as close as possible to the filesystem's block size, to maximize the number of disks involved in concurrently fetching each block.  Does that sound about right?
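
To make the RAID-50 analogy concrete, here's a hypothetical pool built from two single-parity raidz groups (device names are placeholders); zfs stripes across the two groups much like RAID-0 over two RAID-5 sets:

  # two 4-disk single-parity groups; the pool stripes across them
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 \
                    raidz c0t4d0 c0t5d0 c0t6d0 c0t7d0

  # or the double-parity variant (a RAID-60 analogue): each group survives 2 failures
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
                    raidz2 c0t5d0 c0t6d0 c0t7d0 c1t0d0 c1t1d0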

So now I'm wondering what RAID-Z offers that RAID-50 doesn't.  I came up with 2 things: an alleged affinity for full-stripe writes and (under RAID-Z2) the added fault-tolerance of RAID-6's second parity block (allowing 2 disks to fail per raidz2 group).  It wasn't mentioned in this blog, but I've heard that under certain circumstances, RAID-Z will magically decide to mirror a block instead of calculating parity on it.  I'm not sure how this would happen, and I don't know the circumstances that would trigger this behavior, but I think the goal (if it really happens) is to avoid the performance penalty of having to read the rest of the stripe required to calculate parity.  As far as I know, this is only an issue affecting small writes (e.g. single-row updates in an OLTP workload), but not large writes (compared to the RAID's stripe size).  Anyway, when I saw the filesystem's intent log mentioned, I thought maybe the small writes are converted to full-stripe writes by deferring their commit until a full stripe's worth of data had been accumulated.  Does that sound plausible?

Are there any other noteworthy perks to RAID-Z over RAID-50?  If not, I'm inclined to go with your suggestion, Dimitri, and use zfs like RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs.  Even though many of our queries do mostly sequential I/O, getting higher seeks/second is more important to us than the sacrificed diskspace.
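
For completeness, that RAID-10 style layout would look something like this (device names are again placeholders, pairing disks across controllers):

  # stripe over 2-disk mirrors; add more mirror pairs to scale seeks/second
  zpool create pgdata mirror c0t0d0 c1t0d0 \
                      mirror c0t1d0 c1t1d0 \
                      mirror c0t2d0 c1t2d0

Since reads can be serviced by either half of a mirror, random-read throughput should scale with the total number of spindles rather than the number of mirror pairs.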

For the record, those blogs also included a link to a very helpful ZFS Best Practices Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To sum up, so far the short list of tuning suggestions for ZFS includes (a combined sketch follows this list):
 - Use a separate zpool and filesystem for xlogs if your apps write often.
 - Consider setting zil_disable=1 on the xlogs' dedicated filesystem.  ZIL is the intent log, and it sounds like disabling it may be like disabling journaling.  Previous message threads in the Postgres archives debate whether this is safe for the xlogs, but it didn't seem like a conclusive answer was reached.
 - Make the filesystem block size (zfs recordsize) match the Postgres block size.
 - Manually adjust vdev_cache.  I think this sets the read-ahead size.  It defaults to 64 KB.  For an OLTP workload, reduce it; for DW/OLAP, maybe increase it.
 - Test various settings for vq_max_pending (until zfs can auto-tune it).  See http://blogs.sun.com/erickustarz/entry/vq_max_pending
 - A zpool of mirrored disks should support more seeks/second than RAID-Z, just like RAID 10 vs. RAID 50.  However, no single Postgres backend will see better than a single disk's seek rate, because the executor currently dispatches only 1 logical I/O request at a time.
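
Here's a combined sketch of those suggestions, assuming hypothetical device names and the tunable names given in the ZFS Evil Tuning Guide of that era (they may differ between Solaris builds):

  # separate pools: one for data, one for the xlogs
  zpool create pgdata mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0
  zpool create pglog  mirror c2t0d0 c3t0d0

  # match the zfs recordsize to Postgres' 8 KB block size;
  # set it before loading data, since it only affects newly written files
  zfs create pgdata/db
  zfs set recordsize=8k pgdata/db

  * in /etc/system (reboot required):
  * lower the concurrent I/O queue depth per vdev (default 35)
  set zfs:zfs_vdev_max_pending = 10
  * shrink the vdev cache read-ahead from 64 KB (2^16) to 8 KB (2^13)
  set zfs:zfs_vdev_cache_bshift = 13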


>>> Dimitri <dimitrik.fr@gmail.com> 03/23/07 2:28 AM >>>
On Friday 23 March 2007 03:20, Matt Smiley wrote:
> My company is purchasing a Sunfire x4500 to run our most I/O-bound
> databases, and I'd like to get some advice on configuration and tuning.
> We're currently looking at: - Solaris 10 + zfs + RAID Z
>  - CentOS 4 + xfs + RAID 10
>  - CentOS 4 + ext3 + RAID 10
> but we're open to other suggestions.
>

Matt,

for Solaris + ZFS you may find answers to all your questions here:

  http://blogs.sun.com/roch/category/ZFS
  http://blogs.sun.com/realneel/entry/zfs_and_databases

Think about measuring log (WAL) activity and use a separate pool for logs if needed.
Also, RAID-Z is more security-oriented than performance-oriented; RAID-10 should be
a better choice...

Rgds,
-Dimitri


