RFC: Improve CPU cache locality of syscache searches

From: John Naylor
Subject: RFC: Improve CPU cache locality of syscache searches
Date:
Msg-id: CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
Responses: Re: RFC: Improve CPU cache locality of syscache searches  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
CPU caches have multiple levels, so I had an idea to use that concept in the syscaches.

Imagine if catcache buckets were cacheline-sized. In that case, we would store the most recently accessed hash values and the corresponding CatCTup pointers, in addition to the dlist_head:

typedef struct cc_bucket
{
  uint32 hashes[4];
  CatCTup *ct[4];
  dlist_head list;
} cc_bucket;

You can think of the bucket here as a 4-way set associative L1 and the list walk as an inclusive L2. There might be an existing term for this scheme, but I don't know what it is offhand.

Creating entries:

Instead of calling dlist_push_head(), we call dlist_push_tail() and then stash the entry in the L1 so it's still quickly available if it's accessed in the near future. This way, older entries evicted from L1 will tend to remain toward the front of the list. Walking the list will always go from oldest to newest, which might have better prefetch behavior (not sure).
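
To make that concrete, here is a minimal sketch of the insertion path, assuming the cc_bucket layout above and PostgreSQL's existing ilist.h/catcache.h definitions; cc_insert_entry() and cc_stash_l1() are hypothetical names, with cc_stash_l1() sketched under "L1 Eviction" below:

#include "postgres.h"
#include "lib/ilist.h"           /* dlist_push_tail() and friends */
#include "utils/catcache.h"      /* CatCTup */

/* hypothetical helper, sketched under "L1 Eviction" below */
static void cc_stash_l1(cc_bucket *bucket, CatCTup *ct, uint32 hashValue);

/*
 * Link a newly built entry into its bucket: the list gets it at the tail,
 * so older entries drift toward the front, and the in-bucket "L1" gets it
 * as well so a near-future lookup stays on a single cacheline.
 */
static void
cc_insert_entry(cc_bucket *bucket, CatCTup *ct, uint32 hashValue)
{
  dlist_push_tail(&bucket->list, &ct->cache_elem);
  cc_stash_l1(bucket, ct, hashValue);
}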

Search:

Buckets with <= 4 entries would only need to access a single cacheline to get the catctup pointer with the matching hash value. If we have to walk the list to find a match, we stash it in the L1, which is faster than calling dlist_move_head().
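
A sketch of the two-level probe, with the same hypothetical names (the caller would still compare the actual keys, since equal hash values don't guarantee a match):

static CatCTup *
cc_search_bucket(cc_bucket *bucket, uint32 hashValue)
{
  dlist_iter iter;

  /* "L1": probe the four slots held in the bucket's own cacheline */
  for (int i = 0; i < 4; i++)
  {
    if (bucket->ct[i] != NULL && bucket->hashes[i] == hashValue)
      return bucket->ct[i];
  }

  /* "L2": fall back to walking the list, oldest to newest */
  dlist_foreach(iter, &bucket->list)
  {
    CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur);

    if (ct->hash_value == hashValue)
    {
      /* promote into the L1 rather than calling dlist_move_head() */
      cc_stash_l1(bucket, ct, hashValue);
      return ct;
    }
  }

  return NULL;
}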

L1 Eviction:

When putting an entry here, we memmove() everything down in each array and place the pointer and the hash value in the front, evicting the fourth (possibly NULL) entry.
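
The shift-down could look something like this (the same hypothetical cc_stash_l1() used in the sketches above):

static void
cc_stash_l1(cc_bucket *bucket, CatCTup *ct, uint32 hashValue)
{
  /* shift slots 0..2 into 1..3, dropping whatever was in slot 3 */
  memmove(&bucket->hashes[1], &bucket->hashes[0], 3 * sizeof(uint32));
  memmove(&bucket->ct[1], &bucket->ct[0], 3 * sizeof(CatCTup *));

  /* newest entry goes in front */
  bucket->hashes[0] = hashValue;
  bucket->ct[0] = ct;
}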

The buckets would now be four times larger on 64-bit machines, but that might not be a problem if we also divide the initial number of buckets by four. If the minimum nbuckets is 2, then the smaller caches would waste a bit of space, but it doesn't seem too bad. As for when to rehash the bucket array, the fill factor would logically jump from 2 to 8. The worst-case list walk would be longer, but with a better overall memory access pattern, it might be fine.
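
For reference, the size arithmetic on a typical LP64 layout (an assumption; exact padding is up to the compiler) works out to exactly one 64-byte cacheline, versus the 16-byte dlist_head we have per bucket today:

/*
 *   4 * sizeof(uint32)     = 16 bytes  (hashes)
 * + 4 * sizeof(CatCTup *)  = 32 bytes  (pointers)
 * + sizeof(dlist_head)     = 16 bytes  (two pointers)
 *   ---------------------------------------------------
 *   sizeof(cc_bucket)      = 64 bytes  (4x today's 16)
 */
StaticAssertDecl(sizeof(cc_bucket) == 64,
                 "cc_bucket expected to fill one cacheline on 64-bit");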

I think it would be straightforward to code this up -- the difficulty is testing and accounting for random effects due to binary layout changes. Thoughts?

--
