Thread: Inefficient barriers on solaris with sun cc

Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
Hi,

Binaries compiled on solaris using sun studio cc currently don't have
compiler and memory barriers implemented. That means we fall back to
relatively slow generic implementations for those. Especially compiler,
read, write barriers will be much slower than necessary (since they all
just need to prevent compiler reordering as both sparc and x86 are run
in TSO mode under solaris).

Since my estimate is that we'll use more and more barriers, that's going
to hurt more and more.

I do *not* plan to do anything about it atm, I just thought it might be
helpful to have this stated somewhere searchable.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Thu, Sep 25, 2014 at 9:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Binaries compiled on solaris using sun studio cc currently don't have
> compiler and memory barriers implemented. That means we fall back to
> relatively slow generic implementations for those. Especially compiler,
> read, write barriers will be much slower than necessary (since they all
> just need to prevent compiler reordering as both sparc and x86 are run
> in TSO mode under solaris).
>
> Since my estimate is that we'll use more and more barriers, that's going
> to hurt more and more.
>
> I do *not* plan to do anything about it atm, I just thought it might be
> helpful to have this stated somewhere searchable.

To put that another way:

If there are any Sun Studio users out there who care about performance
on big iron, please send a patch to fix this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Oskari Saarenmaa
Date:
25.09.2014, 16:34, Andres Freund wrote:
> Binaries compiled on solaris using sun studio cc currently don't have
> compiler and memory barriers implemented. That means we fall back to
> relatively slow generic implementations for those. Especially compiler,
> read, write barriers will be much slower than necessary (since they all
> just need to prevent compiler reordering as both sparc and x86 are run
> in TSO mode under solaris).

Attached patch implements compiler and memory barriers for Solaris
Studio based on documentation at
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html

I defined read and write barriers as acquire and release barriers
instead of pure read and write ones as that's what other platforms
appear to do.
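
A minimal sketch of what such a mapping could look like, assuming the `<mbarrier.h>` intrinsics described on the Oracle page above. This is not the attached patch: the `my_*` macro names, the two helper functions, and the fallback branch for GCC-compatible compilers are hypothetical and only there so the sketch builds elsewhere.

```c
#if defined(__SUNPRO_C)
#include <mbarrier.h>
/* Sun Studio: intrinsics documented for <mbarrier.h> */
#define my_compiler_barrier()  __compiler_barrier()
#define my_memory_barrier()    __machine_rw_barrier()
#define my_read_barrier()      __machine_acq_barrier()  /* stronger than a pure read barrier */
#define my_write_barrier()     __machine_rel_barrier()  /* stronger than a pure write barrier */
#else
/* Hypothetical fallback for GCC-compatible compilers */
#define my_compiler_barrier()  __asm__ __volatile__("" ::: "memory")
#define my_memory_barrier()    __sync_synchronize()
#define my_read_barrier()      __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define my_write_barrier()     __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

/* Writer side: the store to *a is ordered before the store to *b. */
static void store_ordered(volatile int *a, volatile int *b, int va, int vb)
{
    *a = va;
    my_write_barrier();
    *b = vb;
}

/* Reader side: loads in the opposite order, so a reader that sees the
 * new *b is also meant to see the new *a. */
static void load_ordered(volatile int *a, volatile int *b, int *va, int *vb)
{
    *vb = *b;
    my_read_barrier();
    *va = *a;
}
```

The reader loads in the reverse order of the writer's stores, which is the usual way a read barrier is paired with a write barrier.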

/ Oskari

Attachments

Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> 25.09.2014, 16:34, Andres Freund wrote:
>> Binaries compiled on solaris using sun studio cc currently don't have
>> compiler and memory barriers implemented. That means we fall back to
>> relatively slow generic implementations for those. Especially compiler,
>> read, write barriers will be much slower than necessary (since they all
>> just need to prevent compiler reordering as both sparc and x86 are run
>> in TSO mode under solaris).
>
> Attached patch implements compiler and memory barriers for Solaris Studio
> based on documentation at
> http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
>
> I defined read and write barriers as acquire and release barriers instead of
> pure read and write ones as that's what other platforms appear to do.

So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier?  That would be
surprising.  It's certainly not true in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Oskari Saarenmaa
Date:
26.09.2014, 15:39, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>> 25.09.2014, 16:34, Andres Freund wrote:
>>> Binaries compiled on solaris using sun studio cc currently don't have
>>> compiler and memory barriers implemented. That means we fall back to
>>> relatively slow generic implementations for those. Especially compiler,
>>> read, write barriers will be much slower than necessary (since they all
>>> just need to prevent compiler reordering as both sparc and x86 are run
>>> in TSO mode under solaris).
>>
>> Attached patch implements compiler and memory barriers for Solaris Studio
>> based on documentation at
>> http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
>>
>> I defined read and write barriers as acquire and release barriers instead of
>> pure read and write ones as that's what other platforms appear to do.
>
> So you think a read barrier is the same thing as an acquire barrier
> and a write barrier is the same as a release barrier?  That would be
> surprising.  It's certainly not true in general.

The above doc describes the difference: read barrier requires loads 
before the barrier to be completed before loads after the barrier - an 
acquire barrier is the same, but it also requires loads to be complete 
before stores after the barrier.

Similarly write barrier requires stores before the barrier to be 
completed before stores after the barrier - a release barrier is the 
same, but it also requires loads before the barrier to be completed 
before stores after the barrier.

So acquire is read + loads-before-stores and release is write + 
loads-before-stores.

The generic gcc atomics also define read barrier to __ATOMIC_ACQUIRE and 
write barrier to __ATOMIC_RELEASE.
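
The relationship described above can be written down as sets of ordering constraints. A minimal sketch, assuming the four-way load/store classification used in the Oracle document; the enum and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* The four basic ordering constraints a barrier can enforce. */
enum {
    LOAD_LOAD   = 1 << 0,  /* earlier loads complete before later loads   */
    LOAD_STORE  = 1 << 1,  /* earlier loads complete before later stores  */
    STORE_STORE = 1 << 2,  /* earlier stores complete before later stores */
    STORE_LOAD  = 1 << 3   /* earlier stores complete before later loads  */
};

/* Barrier kinds expressed as sets of constraints. */
enum {
    BARRIER_READ    = LOAD_LOAD,
    BARRIER_WRITE   = STORE_STORE,
    BARRIER_ACQUIRE = LOAD_LOAD | LOAD_STORE,
    BARRIER_RELEASE = STORE_STORE | LOAD_STORE,
    BARRIER_FULL    = LOAD_LOAD | LOAD_STORE | STORE_STORE | STORE_LOAD
};

/* True if barrier `strong` enforces every constraint that `weak` does. */
static bool barrier_implies(unsigned strong, unsigned weak)
{
    return (strong & weak) == weak;
}
```

With these definitions `barrier_implies(BARRIER_ACQUIRE, BARRIER_READ)` holds while the reverse does not, and the same goes for release versus write.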

/ Oskari



Re: Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
On 2014-09-26 08:39:38 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> > 25.09.2014, 16:34, Andres Freund wrote:
> >> Binaries compiled on solaris using sun studio cc currently don't have
> >> compiler and memory barriers implemented. That means we fall back to
> >> relatively slow generic implementations for those. Especially compiler,
> >> read, write barriers will be much slower than necessary (since they all
> >> just need to prevent compiler reordering as both sparc and x86 are run
> >> in TSO mode under solaris).
> >
> > Attached patch implements compiler and memory barriers for Solaris Studio
> > based on documentation at
> > http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
> >
> > I defined read and write barriers as acquire and release barriers instead of
> > pure read and write ones as that's what other platforms appear to do.
> 
> So you think a read barrier is the same thing as an acquire barrier
> and a write barrier is the same as a release barrier?  That would be
> surprising.  It's certainly not true in general.

It's generally true that a read barrier is implied by an acquire
barrier, no? Same for write barriers being implied by release
barriers. Neither is true the other way round, but that's fine.

Given how postgres uses memory barriers we actually could declare
read/write barriers to be compiler barriers when on solaris. Both
supported architectures (x86, sparc) are run in TSO mode. As the
existing barrier code for x86 says:

 * Both 32 and 64 bit x86 do not allow loads to be reordered with other loads,
 * or stores to be reordered with other stores, but a load can be performed
 * before a subsequent store.
 *
 * Technically, some x86-ish chips support uncached memory access and/or
 * special instructions that are weakly ordered.  In those cases we'd need
 * the read and write barriers to be lfence and sfence.  But since we don't
 * do those things, a compiler barrier should be enough.
 *
 * "lock; addl" has worked for longer than "mfence". It's also rumored to be
 * faster in many scenarios

Unless I miss something the same is true for sparc *in solaris
userland*. But I'd be perfectly happy to go with something like Oskari's
version because it's still much better than the current code.
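
A minimal sketch of the compiler-barrier-only variant, assuming TSO hardware where load/load, load/store and store/store order is preserved by the CPU. The `tso_*` macros, the fallback for GCC-compatible compilers, and the single-producer/single-consumer ring are hypothetical illustration, not PostgreSQL code, and the ring would not be safe on weakly ordered hardware.

```c
#if defined(__SUNPRO_C)
#include <mbarrier.h>
#define tso_compiler_barrier()  __compiler_barrier()
#else
/* Hypothetical fallback so the sketch builds with GCC-compatible compilers */
#define tso_compiler_barrier()  __asm__ __volatile__("" ::: "memory")
#endif

/* Under TSO only the compiler has to be restrained for these two. */
#define tso_read_barrier()   tso_compiler_barrier()
#define tso_write_barrier()  tso_compiler_barrier()

#define RING_SIZE 4

static volatile int ring[RING_SIZE];
static volatile unsigned head;   /* written by the producer only */
static volatile unsigned tail;   /* written by the consumer only */

/* Producer: returns 0 if the ring is full. */
static int ring_push(int v)
{
    if (head - tail == RING_SIZE)
        return 0;
    ring[head % RING_SIZE] = v;
    tso_write_barrier();          /* slot contents before the new head */
    head = head + 1;
    return 1;
}

/* Consumer: returns 0 if the ring is empty. */
static int ring_pop(int *out)
{
    if (tail == head)
        return 0;
    tso_read_barrier();           /* head before slot contents */
    *out = ring[tail % RING_SIZE];
    tso_compiler_barrier();       /* slot read before the slot is released */
    tail = tail + 1;
    return 1;
}
```

A full memory barrier is deliberately left out: TSO still allows a store to be reordered after a later load, so that case needs a real fence instruction.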

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>> So you think a read barrier is the same thing as an acquire barrier
>> and a write barrier is the same as a release barrier?  That would be
>> surprising.  It's certainly not true in general.
>
> The above doc describes the difference: read barrier requires loads before
> the barrier to be completed before loads after the barrier - an acquire
> barrier is the same, but it also requires loads to be complete before stores
> after the barrier.
>
> Similarly write barrier requires stores before the barrier to be completed
> before stores after the barrier - a release barrier is the same, but it also
> requires loads before the barrier to be completed before stores after the
> barrier.
>
> So acquire is read + loads-before-stores and release is write +
> loads-before-stores.

Hmm.  My impression was that an acquire barrier means that loads and
stores can migrate forward across the barrier but not backward; and
that a release barrier means that loads and stores can migrate
backward across the barrier but not forward.  I'm actually not really
sure what this means unless the barrier also does something in and of
itself.  For example, consider this:

some stuff
CAS(&lock, 0, 1) // i am an acquire barrier
more stuff
lock = 0 // i am a release barrier
even more stuff

If the CAS() and lock = 0 instructions were FULL barriers, then we'd
be saying that the stuff that happens in the critical section needs to
be exactly "more stuff".  But if they are acquire and release
barriers, respectively, then the CPU is allowed to move "some stuff"
or "even more stuff" into the critical section; but what it can't do
is move "more stuff" out.
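
The lock example above can be sketched with C11 atomics, assuming a compiler that provides `<stdatomic.h>`. The `demo_lock` type and the function names are hypothetical; the point is only that the CAS carries acquire ordering and the unlocking store carries release ordering.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} demo_lock;

static int counter;      /* the "more stuff" protected by the lock */

/* CAS(&lock, 0, 1) with acquire ordering: nothing from the critical
 * section may move up past it. */
static void demo_lock_acquire(demo_lock *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;    /* a failed CAS stores the observed value here */
}

/* lock = 0 with release ordering: nothing from the critical section
 * may move down past it. */
static void demo_lock_release(demo_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Increment the counter inside the critical section. */
static int increment_under_lock(demo_lock *l)
{
    int v;
    demo_lock_acquire(l);
    v = ++counter;
    demo_lock_release(l);
    return v;
}
```

Code before the acquire or after the release may still be moved into the critical section, which is the one-way behaviour described above.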

Now if you just have a naked acquire barrier that is not doing
anything itself, I don't really know what the semantics of that should
be.  Say I want to appear to only change things while flag is 1, so I
write this code:

flag = 1
acquire barrier
things++
release barrier
flag = 0

With the definition you (and Oracle) propose, this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1.  What good is that?  Apparently, I don't
have any idea!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Oskari Saarenmaa
Date:
26.09.2014, 17:28, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
>>> So you think a read barrier is the same thing as an acquire barrier
>>> and a write barrier is the same as a release barrier?  That would be
>>> surprising.  It's certainly not true in general.
>>
>> The above doc describes the difference: read barrier requires loads before
>> the barrier to be completed before loads after the barrier - an acquire
>> barrier is the same, but it also requires loads to be complete before stores
>> after the barrier.
>>
>> Similarly write barrier requires stores before the barrier to be completed
>> before stores after the barrier - a release barrier is the same, but it also
>> requires loads before the barrier to be completed before stores after the
>> barrier.
>>
>> So acquire is read + loads-before-stores and release is write +
>> loads-before-stores.
>
> Hmm.  My impression was that an acquire barrier means that loads and
> stores can migrate forward across the barrier but not backward; and
> that a release barrier means that loads and stores can migrate
> backward across the barrier but not forward.  I'm actually not really
> sure what this means unless the barrier also does something in and of
> itself.  For example, consider this:

[...]

> With the definition you (and Oracle) propose, this won't work, because
> there's nothing to keep the modification of things from being
> reordered before flag = 1.  What good is that?  Apparently, I don't
> have any idea!

I'm not proposing any definition for acquire or release barriers, I was 
just proposing to use the things Solaris Studio defines as acquire and 
release barriers to implement read and write barriers in PostgreSQL 
because similar barrier names are used with gcc and on Solaris Studio 
acquire is a stronger read barrier and release is a stronger write 
barrier.  atomics.h's definition of pg_(read|write)_barrier doesn't have 
any requirements for loads before stores, though, so we could use 
__machine_r_barrier and __machine_w_barrier instead.

But as Andres pointed out all this is probably unnecessary and we could 
define read and write barrier as __compiler_barrier with Solaris Studio 
cc.  It's only available for Solaris (x86 and Sparc) and Linux (x86).

/ Oskari



Re: Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
On 2014-09-26 10:28:21 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> >> So you think a read barrier is the same thing as an acquire barrier
> >> and a write barrier is the same as a release barrier?  That would be
> >> surprising.  It's certainly not true in general.
> >
> > The above doc describes the difference: read barrier requires loads before
> > the barrier to be completed before loads after the barrier - an acquire
> > barrier is the same, but it also requires loads to be complete before stores
> > after the barrier.
> >
> > Similarly write barrier requires stores before the barrier to be completed
> > before stores after the barrier - a release barrier is the same, but it also
> > requires loads before the barrier to be completed before stores after the
> > barrier.
> >
> > So acquire is read + loads-before-stores and release is write +
> > loads-before-stores.
> 
> Hmm.  My impression was that an acquire barrier means that loads and
> stores can migrate forward across the barrier but not backward; and
> that a release barrier means that loads and stores can migrate
> backward across the barrier but not forward.

It's actually more complex than that :(

Simple things first:

Oracle's definition seems pretty iron clad:
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
__machine_acq_barrier is a clear superset of __machine_r_barrier and
__machine_rel_barrier is a clear superset of __machine_w_barrier

And that's what we're essentially discussing, no? That said, there seems
to be no reason to avoid using __machine_r/w_barrier().


But for the reason why I defined pg_read_barrier/write_barrier to
__atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):

The C11/C++11 definition it's made for is hellishly hard to
understand. There's very subtle differences between acquire/release
operation and acquire/release fences. 29.8.2/7.17.4 seems to be the relevant
parts of the standards. I think it essentially guarantees the mapping
we're talking about, but it's not entirely clear.

The way acquire/release fences are defined is that they form a
'synchronizes-with' relationship with each other. Which would, I think,
be sufficient given that without a release like operation on the other
thread a read/write barrier isn't worth much. But there's a rub in that
it requires an atomic operation involved somewhere to give that guarantee.

I *did* check that the emitted code on relevant architectures is sane,
but that doesn't guarantee anything for the future.

Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
definitely guaranteeing what we need, even if superfluously heavy on some
platforms. It still is significantly more efficient than
__sync_synchronize() which is what was used before. I.e. it generates no
code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
otherwise, although I don't know why) and similar on ia64.

As a reference, relevant standard sections are:
C11: 5.1.2.4 5); 7.17.4
C++11: 29.3; 1.10
Not that we can rely on those, but I think it's a good thing to orient on.

> I'm actually not really sure what this means unless the barrier also
> does something in and of itself.

> For example, consider this:
> 
> some stuff
> CAS(&lock, 0, 1) // i am an acquire barrier
> more stuff
> lock = 0 // i am a release barrier
> even more stuff
> 
> If the CAS() and lock = 0 instructions were FULL barriers, then we'd
> be saying that the stuff that happens in the critical section needs to
> be exactly "more stuff".  But if they are acquire and release
> barriers, respectively, then the CPU is allowed to move "some stuff"
> or "even more stuff" into the critical section; but what it can't do
> is move "more stuff" out.

> Now if you just have a naked acquire barrier that is not doing
> anything itself, I don't really know what the semantics of that should
> be.

Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

> Say I want to appear to only change things while flag is 1, so I
> write this code:
> 
> flag = 1
> acquire barrier
> things++
> release barrier
> flag = 0
> 
> With the definition you (and Oracle) propose

As written above, I don't think that applies to oracle's definition?

> this won't work, because
> there's nothing to keep the modification of things from being
> reordered before flag = 1.  What good is that?  Apparently, I don't
> have any idea!

I hope it's a bit clearer now?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Thu, Oct 2, 2014 at 10:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It's actually more complex than that :(
>
> Simple things first:
>
> Oracle's definition seems pretty iron clad:
> http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
> __machine_acq_barrier is a clear superset of __machine_r_barrier and
> __machine_rel_barrier is a clear superset of __machine_w_barrier
>
> And that's what we're essentially discussing, no? That said, there seems
> to be no reason to avoid using __machine_r/w_barrier().

So let's use those, then.

> But for the reason why I defined pg_read_barrier/write_barrier to
> __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):
>
> The C11/C++11 definition it's made for is hellishly hard to
> understand. There's very subtle differences between acquire/release
> operation and acquire/release fences. 29.8.2/7.17.4 seems to be the relevant
> parts of the standards. I think it essentially guarantees the mapping
> we're talking about, but it's not entirely clear.
>
> The way acquire/release fences are defined is that they form a
> 'synchronizes-with' relationship with each other. Which would, I think,
> be sufficient given that without a release like operation on the other
> thread a read/write barrier isn't worth much. But there's a rub in that
> it requires an atomic operation involved somewhere to give that guarantee.
>
> I *did* check that the emitted code on relevant architectures is sane,
> but that doesn't guarantee anything for the future.
>
> Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
> definitely guaranteeing what we need, even if superfluously heavy on some
> platforms. It still is significantly more efficient than
> __sync_synchronize() which is what was used before. I.e. it generates no
> code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
> otherwise, although I don't know why) and similar on ia64.

A full barrier on x86 should be an mfence, right?  With only a
compiler barrier, you have loads ordered with respect to loads and
stores ordered with respect to stores, but the load/store ordering
isn't fully defined.

> Which is why these acquire/release fences, in contrast to
> acquire/release operations, have more guarantees... You put your finger
> right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

>> Say I want to appear to only change things while flag is 1, so I
>> write this code:
>>
>> flag = 1
>> acquire barrier
>> things++
>> release barrier
>> flag = 0
>>
>> With the definition you (and Oracle) propose
>> this won't work, because
>> there's nothing to keep the modification of things from being
>> reordered before flag = 1.  What good is that?  Apparently, I don't
>> have any idea!
>
> As written above, I don't think that applies to oracle's definition?

Oracle's definition doesn't look sufficient there.  The acquire
barrier guarantees that the load operations before the barrier will be
completed before the load and store operations after the barrier, but
the only operation before the barrier is a store, not a load, so it
guarantees nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
On 2014-10-02 10:55:06 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > It's actually more complex than that :(
> >
> > Simple things first:
> >
> > Oracle's definition seems pretty iron clad:
> > http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
> > __machine_acq_barrier is a clear superset of __machine_r_barrier and
> > __machine_rel_barrier is a clear superset of __machine_w_barrier
> >
> > And that's what we're essentially discussing, no? That said, there seems
> > to be no reason to avoid using __machine_r/w_barrier().
> 
> So let's use those, then.

Right, I've never contended that.

> > But for the reason why I defined pg_read_barrier/write_barrier to
> > __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):
> >
> > The C11/C++11 definition it's made for is hellishly hard to
> > understand. There's very subtle differences between acquire/release
> > operation and acquire/release fences. 29.8.2/7.17.4 seems to be the relevant
> > parts of the standards. I think it essentially guarantees the mapping
> > we're talking about, but it's not entirely clear.
> >
> > The way acquire/release fences are defined is that they form a
> > 'synchronizes-with' relationship with each other. Which would, I think,
> > be sufficient given that without a release like operation on the other
> > thread a read/write barrier isn't worth much. But there's a rub in that
> > it requires an atomic operation involved somewhere to give that guarantee.
> >
> > I *did* check that the emitted code on relevant architectures is sane,
> > but that doesn't guarantee anything for the future.
> >
> > Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
> > definitely guaranteeing what we need, even if superfluously heavy on some
> > platforms. It still is significantly more efficient than
> > __sync_synchronize() which is what was used before. I.e. it generates no
> > code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
> > otherwise, although I don't know why) and similar on ia64.
> 
> A full barrier on x86 should be an mfence, right?

Right. I've not talked about changing full barrier semantics. What I was
referring to is that until the atomics patch we always redefined
read/write barriers to be full barriers when using gcc intrinsics.

> With only a compiler barrier, you have loads ordered with respect to
> loads and stores ordered with respect to stores, but the load/store
> ordering isn't fully defined.

Yes.

> > Which is why these acquire/release fences, in contrast to
> > acquire/release operations, have more guarantees... You put your finger
> > right onto the spot.
> 
> But, uh, we still don't seem to know what those guarantees actually ARE.

Paired together they form a synchronized-with relationship. Problem #1
is that the standard's language isn't, to me at least, clear if there's
not some case where that's not the case. Problem #2 is that our current
README.barrier definition doesn't actually require barriers to be
paired. Which imo is bad, but still a fact.

The definition of ACQ_REL is pretty clearly sufficient imo: "Full
barrier in both directions and synchronizes with acquire loads and
release stores in another thread.".

> >> Say I want to appear to only change things while flag is 1, so I
> >> write this code:
> >>
> >> flag = 1
> >> acquire barrier
> >> things++
> >> release barrier
> >> flag = 0
> >>
> >> With the definition you (and Oracle) propose
> >> this won't work, because
> >> there's nothing to keep the modification of things from being
> >> reordered before flag = 1.  What good is that?  Apparently, I don't
> >> have any idea!
> >
> > As written above, I don't think that applies to oracle's definition?
> 
> Oracle's definition doesn't look sufficient there.

Perhaps I'm just not understanding what you want to show with this
example. This started as a discussion of comparing acquire/release with
read/write barriers, right? Or are you generally wondering about the
point of acquire/release barriers?

> The acquire
> barrier guarantees that the load operations before the barrier will be
> completed before the load and store operations after the barrier, but
> the only operation before the barrier is a store, not a load, so it
> guarantees nothing.

Well, 'acquire' operations always have to be related to a load. That's why
standalone 'acquire fences' or 'acquire barriers' are more heavyweight
than just an acquiring read.

And realistically, in the above example, you'd have to read flag to see
that it's not already 1, right?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Thu, Oct 2, 2014 at 11:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> So let's use those, then.
>
> Right, I've never contended that.

OK, cool.

>> A full barrier on x86 should be an mfence, right?
>
> Right. I've not talked about changing full barrier semantics. What I was
> referring to is that until the atomics patch we always redefined
> read/write barriers to be full barriers when using gcc intrinsics.

OK, got it.  If there's a cheaper way to tell gcc "loads before loads"
or "stores before stores", I'm fine with doing that for those cases.

>> > Which is why these acquire/release fences, in contrast to
>> > acquire/release operations, have more guarantees... You put your finger
>> > right onto the spot.
>>
>> But, uh, we still don't seem to know what those guarantees actually ARE.
>
> Paired together they form a synchronized-with relationship. Problem #1
> is that the standard's language isn't, to me at least, clear if there's
> not some case where that's not the case. Problem #2 is that our current
> README.barrier definition doesn't actually require barriers to be
> paired. Which imo is bad, but still a fact.

I don't know what a "synchronized-with relationship" means.

Also, I pretty much designed those definitions to match what Linux
does.  And it doesn't require that either, though it says that in most
cases it will work out that way.

> The definition of ACQ_REL is pretty clearly sufficient imo: "Full
> barrier in both directions and synchronizes with acquire loads and
> release stores in another thread.".

I dunno.  What's an acquire load?  What's a release store?  I know
what loads and stores are; I don't know what the adjectives mean.

>> The acquire
>> barrier guarantees that the load operations before the barrier will be
>> completed before the load and store operations after the barrier, but
>> the only operation before the barrier is a store, not a load, so it
>> guarantees nothing.
>
> Well, 'acquire' operations always have to be related to a load. That's why
> standalone 'acquire fences' or 'acquire barriers' are more heavyweight
> than just an acquiring read.

Again, I can't judge any of this, because you haven't defined the
terms anywhere.

> And realistically, in the above example, you'd have to read flag to see
> that it's not already 1, right?

Not necessarily.  You could be the only writer.  Think about the way
the backend entries in the stats system work.  The point of setting
the flag may be for other people to know whether the data is in the
middle of being modified.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
On 2014-10-02 11:35:32 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 11:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > Which is why these acquire/release fences, in contrast to
> >> > acquire/release operations, have more guarantees... You put your finger
> >> > right onto the spot.
> >>
> >> But, uh, we still don't seem to know what those guarantees actually ARE.
> >
> > Paired together they form a synchronized-with relationship. Problem #1
> > is that the standard's language isn't, to me at least, clear if there's
> > not some case where that's not the case. Problem #2 is that our current
> > README.barrier definition doesn't actually require barriers to be
> > paired. Which imo is bad, but still a fact.
> 
> I don't know what a "synchronized-with relationship" means.

I'm using the standard's language here, given that I'm trying to reason
about its behaviour...

What it means is that if you have a matching pair of acquire/release
operations or barriers/fences everything that happened *before* the last
release fence will be visible *after* executing the next acquire
operation in a different thread-of-execution. And 'after' is defined in
the way that is true if the 'acquiring' thread can see the result of the
'releasing' operation.
I.e. no loads after the acquire can see values from before the release.

My problem with the definition in the standard is that it's not
particularly clear how acquire fences *without* an underlying explicit
atomic operation are defined in the standard.

I checked gcc's current code and it's fine in that regard. Also other
popular concurrent open source stuff like
http://git.qemu.org/?p=qemu.git;a=blob;f=include/qemu/atomic.h;hb=HEAD
does precisely what I'm talking about:

#ifndef smp_wmb
#ifdef __ATOMIC_RELEASE
#define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
#else
#define smp_wmb()   __sync_synchronize()
#endif
#endif

#ifndef smp_rmb
#ifdef __ATOMIC_ACQUIRE
#define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
#else
#define smp_rmb()   __sync_synchronize()
#endif
#endif

The commit that added it
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=5444e768ee1abe6e021bece19a9a932351f88c88
was written by one gcc guy and reviewed by another one...

So I think we can be pretty sure that gcc's __atomic_thread_fence()
behaves like we want. We probably have to be a bit more careful about
extending that definition (by including atomic.h and doing
atomic_thread_fence(memory_order_acquire)) to use general C11. Which is
probably a couple years away anyway.

> Also, I pretty much designed those definitions to match what Linux
> does.  And it doesn't require that either, though it says that in most
> cases it will work out that way.

My point is that read barriers aren't particularly meaningful
without a defined store order from another thread/process. Without any
form of pairing you don't have that. The writing side could just have
reordered the writes in a way you didn't want them.  And the kernel docs
do say "A lack of appropriate pairing is almost certainly an error". But
since read barriers also pair with lock release operations, that's
normally not a big problem.

> > The definition of ACQ_REL is pretty clearly sufficient imo: "Full
> > barrier in both directions and synchronizes with acquire loads and
> > release stores in another thread.".
> 
> I dunno.  What's an acquire load?  What's a release store?  I know
> what loads and stores are; I don't know what the adjectives mean.

An acquire load is either an explicit atomic load (tas, cmpxchg, etc
also count) or a normal load combined with an acquire barrier. The symmetric
definition is true for release store.

(so, on x86 every load/store that prevents compiler reordering is
essentially an acquire load/release store)
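
In C11 terms the two forms of an acquire load, and the symmetric two forms of a release store, could be sketched as follows. The function names are hypothetical, and C11 requires the "normal" load or store to be spelled as a relaxed atomic access for the fence form to be well defined. The fence variants are, if anything, somewhat stronger than the single-operation variants.

```c
#include <stdatomic.h>

/* Acquire load, form 1: an explicit atomic load with acquire ordering. */
int example_load_acquire(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}

/* Acquire load, form 2: a relaxed load followed by an acquire barrier. */
int example_load_then_fence(atomic_int *p)
{
    int v = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
}

/* Release store, form 1: an explicit atomic store with release ordering. */
void example_store_release(atomic_int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_release);
}

/* Release store, form 2: a release barrier followed by a relaxed store. */
void example_fence_then_store(atomic_int *p, int v)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(p, v, memory_order_relaxed);
}
```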

> > And realistically, in the above example, you'd have to read flag to see
> > that it's not already 1, right?
> 
> Not necessarily.  You could be the only writer.  Think about the way
> the backend entries in the stats system work.  The point of setting
> the flag may be for other people to know whether the data is in the
> middle of being modified.

So you're thinking about something seqlock-like... Isn't the problem
then that you actually don't want acquire semantics, but release or
write barrier semantics on that store? The acquire/read barrier part
would be on the reader side, no?
I'm still unsure what you want to show with that example?
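
For reference, a seqlock-style entry with a single writer could be sketched like this in C11. The example_entry type and its functions are hypothetical and are not the stats system's actual code. The point of the sketch is where the barriers sit: release (write) ordering around the writer's counter updates, acquire (read) ordering on the reader side.

```c
#include <stdatomic.h>

/* Hypothetical entry: seq is odd while an update is in progress. */
typedef struct {
    atomic_uint seq;
    atomic_int  a;
    atomic_int  b;
} example_entry;

/* Single writer only. The counter goes odd before the data changes
 * and back to even afterwards; the release ordering keeps the data
 * stores between the two counter stores. */
void example_entry_write(example_entry *e, int a, int b)
{
    unsigned s = atomic_load_explicit(&e->seq, memory_order_relaxed);

    atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);          /* write barrier */
    atomic_store_explicit(&e->a, a, memory_order_relaxed);
    atomic_store_explicit(&e->b, b, memory_order_relaxed);
    atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

/* One read attempt. Returns 1 and fills *a and *b if a consistent
 * snapshot was seen, 0 if the caller should retry. */
int example_entry_try_read(example_entry *e, int *a, int *b)
{
    unsigned s1 = atomic_load_explicit(&e->seq, memory_order_acquire);
    int ta = atomic_load_explicit(&e->a, memory_order_relaxed);
    int tb = atomic_load_explicit(&e->b, memory_order_relaxed);

    atomic_thread_fence(memory_order_acquire);          /* read barrier */
    unsigned s2 = atomic_load_explicit(&e->seq, memory_order_relaxed);

    if ((s1 & 1u) || s1 != s2)
        return 0;
    *a = ta;
    *b = tb;
    return 1;
}
```

A reader would normally call example_entry_try_read() in a loop until it returns 1.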

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Robert Haas
Date:
On Thu, Oct 2, 2014 at 2:06 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Also, I pretty much designed those definitions to match what Linux
>> does.  And it doesn't require that either, though it says that in most
>> cases it will work out that way.
>
> My point is that read barriers aren't particularly meaningful
> without a defined store order from another thread/process. Without any
> form of pairing you don't have that. The writing side could just have
> reordered the writes in a way you didn't want them.  And the kernel docs
> do say "A lack of appropriate pairing is almost certainly an error". But
> since read barriers also pair with lock release operations, that's
> normally not a big problem.

Agreed, but it's possible to have a read-fence where an atomic
operation provides the ordering on the other side, or something like
that.

> I'm still unsure what you want to show with that example?

Me, too.  I think we've drifted off in the weeds.  Do we know what we
need to know to fix $SUBJECT?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inefficient barriers on solaris with sun cc

From
Andres Freund
Date:
On 2014-10-06 11:38:47 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 2:06 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Also, I pretty much designed those definitions to match what Linux
> >> does.  And it doesn't require that either, though it says that in most
> >> cases it will work out that way.
> >
> > My point is that read barriers aren't particularly meaningful
> > without a defined store order from another thread/process. Without any
> > form of pairing you don't have that. The writing side could just have
> > reordered the writes in a way you didn't want them.  And the kernel docs
> > do say "A lack of appropriate pairing is almost certainly an error". But
> > since read barriers also pair with lock release operations, that's
> > normally not a big problem.
> 
> Agreed, but it's possible to have a read-fence where an atomic
> operation provides the ordering on the other side, or something like
> that.

Sure, that's one of the possible pairings. Most atomics have barrier
semantics...

> > I'm still unsure what you want to show with that example?
> 
> Me, too.  I think we've drifted off in the weeds.  Do we know what we
> need to know to fix $SUBJECT?

I think we can pretty much apply Oskari's patch after replacing
acquire/release with read/write intrinsics.
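
For readers without the patch at hand, the general idea could be sketched as below. This is not the patch itself: the example_* macro and function names are hypothetical, and the sketch assumes the barrier intrinsics that Sun Studio declares in <mbarrier.h> (__compiler_barrier(), __machine_r_barrier(), __machine_w_barrier(), __machine_rw_barrier()). A GCC-style fallback is included only so the sketch builds elsewhere.

```c
/*
 * Sketch only, not the patch from this thread. The example_* names
 * are hypothetical.
 */
#if defined(__SUNPRO_C)
#include <mbarrier.h>
#define example_compiler_barrier() __compiler_barrier()
#define example_memory_barrier()   __machine_rw_barrier()
#define example_read_barrier()     __machine_r_barrier()
#define example_write_barrier()    __machine_w_barrier()
#elif defined(__GNUC__)
/* Fallback for GCC-compatible compilers: full barriers everywhere. */
#define example_compiler_barrier() __asm__ __volatile__("" ::: "memory")
#define example_memory_barrier()   __sync_synchronize()
#define example_read_barrier()     __sync_synchronize()
#define example_write_barrier()    __sync_synchronize()
#else
#error "no barrier implementation sketched for this compiler"
#endif

/* Usage: store the data, then the flag, with a write barrier between. */
void example_publish_flag(int *data, int *flag, int value)
{
    *data = value;
    example_write_barrier();
    *flag = 1;
}
```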

I'm opening a bug with the gcc folks about clarifying the docs on their
intrinsics.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Inefficient barriers on solaris with sun cc

From
Oskari Saarenmaa
Date:
06.10.2014, 17:42, Andres Freund wrote:
> I think we can pretty much apply Oskari's patch after replacing
> acquire/release with read/write intrinsics.

Attached a patch rebased to current master using read & write barriers.

/ Oskari

Attachments