Обсуждение: Free Space Map rewrite

Поиск
Список
Период
Сортировка

Free Space Map rewrite

От
"Heikki Linnakangas"
Дата:
Hi,

Here's a new snapshot of the FSM rewrite I've been working on. The "map
fork" stuff hasn't been changed since last patch (I have some work to do
there based on Tom's recent comments), but the FSM implementation itself
is now starting to get in shape. So the thing to look at in this patch
is freespace.c. It's unreadable in diff format because the whole file
has basically been rewritten, you'll have to apply the patch. I've also
attached a README, which is also part of the patch.

Still a lot of work to be done, like ironing out race conditions between
updates and searches, the approach I'm planning to take there is
explained in the README, and WAL-logging, but I'm fairly happy with
what's there now.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
FSM page structure
------------------

Within an FSM page, we use a binary tree structure where leaf nodes store the
amount of free space on heap pages (or lower level FSM pages, see Higher-level
Structure below), with one leaf node per heap page. Intermediate nodes store
the Max amount of free space on any of its children.

For example:

    4
 4     2
3 4   0 2    <- This level represents heap pages

There's two basic operations: search and update.

To search for a page with X amount of free space, traverse down the tree along
a path where n >= X, until you hit the bottom. If both children of a node
satisfy the condition, you can pick either one arbitrarily.

To update the amount of free space on a page to X, first update the leaf node
corresponding the heap page, and "bubble up" the change to upper nodes, until
you hit a node where n is already >= X.

This data structure has a couple of nice properties:
- to determine that there is no page with X bytes of free space, you only
  need to look at the root node
- by varying which child to traverse to in the search algorithm, when you have
  a choice, we can implement various strategies, like preferring pages closer
  to a given page, or spreading the load across the table.


Higher-level structure
----------------------

To scale up the data structure described above beyond a single page, we
maintain a similar tree-structure across pages. Leaf nodes in higher level
pages correspond to lower level FSM pages. The root node within each page
has the same value as the corresponding leaf node on the parent page.

Root page is always stored at physical block 0.

For example, assuming each FSM page can hold information about 4 pages (in
reality, that's (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ),
we get a disk layout like this:

 0     <-- page 0 at level 2 (root page)
  0     <-- page 0 at level 1
   0     <-- page 0 at level 0
   1     <-- page 1 at level 0
   2     <-- ...
   3
  1     <-- page 1 at level 1
   4
   5
   6
   7
  2
   8
   9
   10
   11
  3
   12
   13
   14
   15

where the numbers are page numbers *at that level*, starting from 0.

To find the physical block # corresponding leaf page n, we need to calculate

(number of preceding leaf pages) + (number of preceding upper level pages).
This turns out to be

y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1

where F is the fanout (4 in the above example).

From that, you can figure out the formulas for finding a given child page of
upper-level page, or the parent of a page (XXX: explain)


To keep things simple, the tree is always constant height. To cover the max.
relation size of 2^31 blocks, three levels is enough with the default BLCKSZ
(4000^3 >= 2^31).


Locking
-------
When traversing down, lock only one page at a time. Release lock on parent
page before locking child page. That means that you will need to start from
scratch if the node is concurrently updated and the free space that was
supposed to be on a page is no longer there.

When bubbling up, lock and update parent page before releaseing lock on child.
This ensures that the

TODO
----

- fastroot to avoid traversing upper nodes with just 1 child
- use a different system for tables that fit into one FSM page, with a
  mechanism to switch to the real thing as it grows.


Вложения

Re: Free Space Map rewrite

От
"Heikki Linnakangas"
Дата:
Here's an updated version of the FSM rewrite. Major changes since last
snapshot:
- WAL-logging
- Locking of operations involving multiple pages has been fixed
- contrib/pg_freespacemap has been rewritten to be useful with the new
FSM implementation

The interesting part of this patch is the FSM implementation,
freespace.c; I'm now working on the relation fork and xlogutils.c
rewrite stuff as separate patches.

I've been trying to do some basic performance testing of this, but since
FSM overhead is so small both in the old and new implementation, I
haven't been able to tease apart any meaningful difference yet. Perhaps
that just means that it's not something to worry about, though we have
to watch out for degenerate worst-case scenarios.

One behavioral difference between the old and the new FSM is that in the
old implementation, GetPageWithFreeSpace immediately subtracts the
amount that was asked for from the FSM, but not in the new one. Updating
the new FSM is significantly more expensive, because updates are
WAL-logged, so I don't think it's worthwhile to do that anymore. Mind
you that we only did that for the first insert to a page anyway;
subsequent inserts to the target page didn't update the FSM in the old
scheme either. Can anyone think of a use case where that would make a
difference?

Does anyone want to suggest performance test cases for FSM?

There's one noteworthy detail in the current implementation: the FSM
always takes at least 3 pages. That seems wasteful for small tables that
only need 1-2 pages for the data itself. I have a solution sketched up:
for small tables that fit on a single FSM (smaller than ~4000 pages, or
~32 MB), use just the one FSM page, and when the table grows, switch to
the full-blown multi-page implementation. That's pretty straightforward
in principle, but does add a fair amount of code to cope with both
formats, and to handle the switchover. Does anyone care? I'm thinking of
not implementing this in the first phase, but adding it later after the
code is committed, if there's demand.

Another related detail is that currently all searches have to traverse
three pages, starting from the root, even if the table is small so that
the root page contains just one child. This has already been solved in
our b-tree implementation, with the "fast root" system, and I'm planning
to do the same for the FSM. But again, unless we can actually come up
with a test case that shows the overhead, do we care? In any case, this
is again something I'd like to implement as an add-on patch after the
first version is committed.

Heikki Linnakangas wrote:
> Hi,
>
> Here's a new snapshot of the FSM rewrite I've been working on. The "map
> fork" stuff hasn't been changed since last patch (I have some work to do
> there based on Tom's recent comments), but the FSM implementation itself
> is now starting to get in shape. So the thing to look at in this patch
> is freespace.c. It's unreadable in diff format because the whole file
> has basically been rewritten, you'll have to apply the patch. I've also
> attached a README, which is also part of the patch.
>
> Still a lot of work to be done, like ironing out race conditions between
> updates and searches, the approach I'm planning to take there is
> explained in the README, and WAL-logging, but I'm fairly happy with
> what's there now.
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Вложения