
[Libatomic,Darwin] Initial libatomic port for *darwin*.

Message ID 6228E2F0-3E10-4F77-92DE-9412E5325284@codesourcery.com
State New

Commit Message

Iain Sandoe Nov. 13, 2014, 8:34 p.m. UTC
Hello Richard, Joseph,

Thanks for your reviews,

On 13 Nov 2014, at 07:40, Richard Henderson wrote:

> On 11/12/2014 10:18 PM, Iain Sandoe wrote:
> 
>> #  ifndef USE_ATOMIC
>> #    define USE_ATOMIC 1
>> #  endif
> 
> Why would USE_ATOMIC be defined previously?

This was left over from a mode where I allowed the user to force the implementation to OSSpinLocks to test the performance.

I apologise [weak excuse follows]: with the turbulence of Darwin on trunk, my trunk version of the patch had fallen behind my 4.9 one (most of the work has been hammered out there while we try to get bootstrap restored).

Re-synced and retested with a patched trunk that bootstraps with some band-aid.

>> inline static void LockUnlock(uint32_t *l) {
>>   __atomic_store_4((_Atomic(uint32_t)*)l, 0, __ATOMIC_RELEASE);
>> }
> 
> Gnu coding style, please.  All through the file here.

Fixed.

> #  define LOCK_SIZE sizeof(uint32_t)
>> #  define NLOCKS (PAGE_SIZE / LOCK_SIZE)
>> static uint32_t locks[NLOCKS];
> 
> Um, surely not LOCK_SIZE, but CACHELINE_SIZE.  It's the granularity of the
> target region that's at issue, not the size of the lock itself.

The algorithm I've used is intentionally different from the pthreads-based posix one; here's the rationale as I see it:

/* Algorithm motivations.

Layout Assumptions:
  o Darwin has a number of sub-targets with common atomic types that have no
    'native' in-line handling, but are smaller than a cache-line.
    E.g. PPC32 needs locking for >= 8-byte quantities, x86/m32 for >= 16-byte ones.

  o The _Atomic alignment of a "natural type" is no greater than the type size.

  o There are no special guarantees about the alignment of _Atomic aggregates
    other than those determined by the psABI.

  o There are no guarantees that placement of an entity won't cause it to
    straddle a cache-line boundary.

  o Realistic User code will likely place several _Atomic qualified types in
    close proximity (such that they fall within the same cache-line).
    Similarly, arrays of _Atomic qualified items.

Performance Assumptions:
  o Collisions of address hashes for items (which make up the lock keys)
    constitute the largest performance issue.

  o We want to avoid unnecessary flushing of [lock table] cache-line(s) when
    items are accessed.

Implementation:
  We maintain a table of locks, each lock being 4 bytes (at present). 
  This choice of lock size gives some measure of equality in hash-collision
  statistics between the 'atomic' and 'OSSpinLock' implementations, since the
  lock size is fixed at 4 bytes for the latter.

  The table occupies one physical page, and we attempt to align it to a
  page boundary, appropriately.

  For entities that need a lock, with sizes < one cache line:
    Each entity that requires a lock, chooses the lock to use from the table on
  the basis of a hash determined by its size and address.  The lower log2(size)
  address bits are discarded on the assumption that the alignment of entities
  will not be smaller than their size.
  CHECKME: this is not verified for aggregates; it might be something that
  could/should be enforced from the front ends (since _Atomic types are
  allowed to have increased alignment c.f. 'normal').

  For entities that need a lock, with sizes >= one cacheline_size:
    We assume that the entity alignment >= log2(cacheline_size) and discard
  log2(cacheline_size) bits from the address.
  We then apply size/cacheline_size locks to cover the entity.

  The idea is that this will typically result in distinct hash keys for items
  placed close together.  The keys are mangled further such that the size is
  included in the hash.

  Finally, to attempt to make it such that the lock table entries are accessed
  in a scattered manner, to avoid repeated cacheline flushes, the hash is
  rearranged to attempt to maximise the noise in the upper bits.
*/

NOTE that the CHECKME above doesn't put us in any worse position than the pthreads implementation (likely slightly better since we have a smaller granularity with the current scheme).
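To make that concrete, here's a small standalone sketch (not part of the patch) that mirrors the patch's addr_hash(); the constants match the defaults in lock.c and the addresses are made up purely for illustration.  Two adjacent 8-byte _Atomic objects land on well-separated slots in the lock table:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE     4096
#define CACHLINE_SIZE 64
#define LOCK_SIZE     4
#define NLOCKS        (PAGE_SIZE / LOCK_SIZE)

static uintptr_t
hash_sketch (uintptr_t addr, size_t n)
{
  /* Discard the low log2(size) address bits for sub-cacheline entities,
     or 7 bits for larger ones, as described above.  */
  unsigned shift = (n > 0 && n <= CACHLINE_SIZE)
                   ? sizeof (unsigned) * 8 - __builtin_clz ((unsigned) n) - 1
                   : 7;
  uint16_t x = addr >> shift;
  x ^= shift;                                    /* fold (log2 of) the size into the key.  */
  x = ((x >> 8) & 0xff) | ((x << 8) & 0xff00);   /* push the noise into the upper bits.  */
  return x % NLOCKS;
}

int
main (void)
{
  /* Two adjacent 8-byte objects: lock indices 770 and 514, far apart in the table.  */
  printf ("%u %u\n",
          (unsigned) hash_sketch (0x1000, 8),
          (unsigned) hash_sketch (0x1008, 8));
  return 0;
}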

>> #if USE_ATOMIC
>>   LockLock (&locks[addr_hash (ptr, 1)]);
>> #else
>>   OSSpinLockLock(&locks[addr_hash (ptr, 1)]);
>> #endif
> 
> Better to #define LockLock  OSSpinLockLock within the area above, so as to
> avoid the ifdefs here.
Done.

Thoughts on the rationale - or OK now?

thanks
Iain

I'm not aware of any other PRs that relate, but will do a final scan through and ask around among the Darwin folks.

libatomic:

	PR target/59305
	* config/darwin/host-config.h: New file.
	* config/darwin/lock.c: New file.
	* configure.tgt (DEFAULT_X86_CPU): New. (target): New entry for darwin.

Comments

Richard Henderson Nov. 14, 2014, 8:01 a.m. UTC | #1
On 11/13/2014 09:34 PM, Iain Sandoe wrote:
>> Um, surely not LOCK_SIZE, but CACHELINE_SIZE.  It's the granularity of the
>> target region that's at issue, not the size of the lock itself.
> 
> The algorithm I've used is intentionally different from the pthreads-based posix one...

All that would be fine if ...

> +/* The granularity at which locks are applied when n > CACHLINE_SIZE.
> +   We follow the posix pthreads implementation here.  */
> +#ifndef WATCH_SIZE
> +#  define WATCH_SIZE CACHLINE_SIZE
> +#endif

... you hadn't just said right here that the granularity at which you
want to lock is WATCH_SIZE,

> +#define LOCK_SIZE sizeof(LOCK_TYPE)
> +#define NLOCKS (PAGE_SIZE / LOCK_SIZE)
> +/* An array of locks, that should fill one physical page.  */
> +static LOCK_TYPE locks[NLOCKS] __attribute__((aligned(PAGE_SIZE)));

... but then go and use LOCK_SIZE instead.


r~
Iain Sandoe Nov. 14, 2014, 8:12 a.m. UTC | #2
Hello Richard,

On 14 Nov 2014, at 08:01, Richard Henderson wrote:

> On 11/13/2014 09:34 PM, Iain Sandoe wrote:
>>> Um, surely not LOCK_SIZE, but CACHELINE_SIZE.  It's the granularity of the
>>> target region that's at issue, not the size of the lock itself.
>> 
>> The algorithm I've used is intentionally different from the pthreads-based posix one...
> 
> All that would be fine if ...
> 
>> +/* The granularity at which locks are applied when n > CACHLINE_SIZE.
>> +   We follow the posix pthreads implementation here.  */
>> +#ifndef WATCH_SIZE
>> +#  define WATCH_SIZE CACHLINE_SIZE
>> +#endif
> 
> ... you hadn't just said right here that the granularity at which you
> want to lock is WATCH_SIZE,

That granularity *is* applied to items >= one cache line in size.

>> +#define LOCK_SIZE sizeof(LOCK_TYPE)
>> +#define NLOCKS (PAGE_SIZE / LOCK_SIZE)
>> +/* An array of locks, that should fill one physical page.  */
>> +static LOCK_TYPE locks[NLOCKS] __attribute__((aligned(PAGE_SIZE)));
> 
> ... but then go and use LOCK_SIZE instead.

my locks are only 4 bytes [whereas they are rounded-up-to-n-cachelines(sizeof(pthreads mutex)) for the posix implementation].
The items that they are locking are of arbitrary size (at least up to one page).

hmmm .. there's something I'm not following about what you are seeing as a problem here.

In the posix implementation the granularity calculation is also used to round up the space allocated in the locks table for each pthreads mutex (i.e. it has two uses, AFAICT).

thanks
Iain
Richard Henderson Nov. 14, 2014, 8:25 a.m. UTC | #3
On 11/14/2014 09:12 AM, Iain Sandoe wrote:
> my locks are only 4 bytes [whereas they are rounded-up-to-n-cachelines(sizeof(pthreads mutex)) for the posix implementation].
> The items that they are locking are of arbitrary size (at least up to one page).
> 
> hmmm .. there's something I'm not following about what you are seeing as a problem here.
> 
> In the posix implementation the granularity calculation is also used to
> round up the space allocated in the locks table for each pthreads mutex
> (i.e. it has two uses, AFAICT).

No, there's only one use: How large an area is *protected* by the lock.

Since we need to protect one page of these areas, we need NLOCKS = PAGE_SIZE /
WATCH_SIZE locks, which are then allocated in an array.  We do not care how
large that array is.

So if you'd like to differ from the posix implementation in protecting
4 bytes at a time, rather than one cacheline at a time, then just change
WATCH_SIZE to 4.  The fact that WATCH_SIZE happens to equal the lock size is
simply a coincidence.
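
Expressed as a sketch with the names from the posted patch (protecting 4 bytes at a time), that would be:

/* One lock protects WATCH_SIZE bytes of target data; the size of the lock
   object itself never enters the calculation.  */
#define WATCH_SIZE 4                    /* rather than CACHLINE_SIZE */
#define NLOCKS     (PAGE_SIZE / WATCH_SIZE)
static LOCK_TYPE locks[NLOCKS];         /* however large this array turns out to be */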


r~
Iain Sandoe Nov. 14, 2014, 8:44 a.m. UTC | #4
On 14 Nov 2014, at 08:25, Richard Henderson wrote:

> On 11/14/2014 09:12 AM, Iain Sandoe wrote:
>> my locks are only 4 bytes [whereas they are rounded-up-to-n-cachelines(sizeof(pthreads mutex)) for the posix implementation].
>> The items that they are locking are of arbitrary size (at least up to one page).
>> 
>> hmmm .. there's something I'm not following about what you are seeing as a problem here.
>> 
>> In the posix implementation the granularity calculation is also used to
>> round up the space allocated in the locks table for each pthreads mutex
>> (i.e. it has two uses, AFAICT).
> 
> No, there's only one use: How large an area is *protected* by the lock.
> 
> Since we need to protect one page of these areas, we need NLOCKS = PAGE_SIZE /
> WATCH_SIZE locks, which are then allocated in an array.  We do not care how
> large that array is.
> 
> So if you'd like to differ from the posix implementation in protecting
> 4 bytes at a time, rather than one cacheline at a time, then just change
> WATCH_SIZE to 4.  The fact that WATCH_SIZE happens to equal the lock size is
> simply a coincidence.

Indeed (the use to round up the space allocated for the mutex also happens to be another coincidence).

However, my intention is to have a variable-sized area protected by each lock.
The number of locks allocated exceeds (page-size / watch-size) [unless watch-size were reduced to 4 bytes, of course].

Only when the size of the area to be protected exceeds one cache-line do I split it up into cache-line-sized chunks.

I happened to allocate one page's worth of locks (a somewhat arbitrary choice in the absence of metrics to guide otherwise), which is another source of coincidence.

Perhaps some re-naming of things would help, or do you think that a scheme to lock variable-sized chunks cannot work?

Iain
Richard Henderson Nov. 14, 2014, 10:22 a.m. UTC | #5
On 11/14/2014 09:44 AM, Iain Sandoe wrote:
> or do you think that a scheme to lock variable-sized chunks cannot work?

It certainly can't work while

> +void
> +libat_lock_1 (void *ptr)
> +{
> +  LockLock (&locks[addr_hash (ptr, 1)]);
> +}

doesn't have the true size.
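
For instance (a sketch using the posted addr_hash(), with a made-up 16-byte object at ptr):

  libat_lock_1 (ptr);       /* hashes as addr_hash (ptr, 1)  */
  libat_lock_n (ptr, 16);   /* hashes as addr_hash (ptr, 16) */

The two paths discard different low address bits and fold different sizes into the key, so in general they select different locks for the same object.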


r~
Jack Howarth Dec. 12, 2014, 3:46 p.m. UTC | #6
Iain,
    What is the status of this patch?
            Jack

Jack Howarth Jan. 8, 2015, 9:56 p.m. UTC | #7
Iain,
     Were you planning to try to get this committed before stage4 or
will it have to wait for the next major gcc release?
                  Jack

Iain Sandoe Jan. 8, 2015, 10:02 p.m. UTC | #8
Jack,

I haven't had time to look at the outstanding issues, and won't realistically before stage3 ends - so I don't suppose it will be eligible for GCC5 :(

Iain

On 8 Jan 2015, at 21:56, Jack Howarth wrote:

> Iain,
>     Were you planning to try to get this committed before stage4 or
> will it have to wait for the next major gcc release?
>                  Jack

Patch

diff --git a/libatomic/config/darwin/host-config.h b/libatomic/config/darwin/host-config.h
new file mode 100644
index 0000000..db55d34
--- /dev/null
+++ b/libatomic/config/darwin/host-config.h
@@ -0,0 +1,55 @@ 
+/* Copyright (C) 2012-2014 Free Software Foundation, Inc.
+   Contributed by Richard Henderson <rth@redhat.com>.
+
+   This file is part of the GNU Atomic Library (libatomic).
+
+   Libatomic is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   Libatomic is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Included after all more target-specific host-config.h.  */
+
+
+#ifndef protect_start_end
+# ifdef HAVE_ATTRIBUTE_VISIBILITY
+#  pragma GCC visibility push(hidden)
+# endif
+
+void libat_lock_1 (void *ptr);
+void libat_unlock_1 (void *ptr);
+
+static inline UWORD
+protect_start (void *ptr)
+{
+  libat_lock_1 (ptr);
+  return 0;
+}
+
+static inline void
+protect_end (void *ptr, UWORD dummy UNUSED)
+{
+  libat_unlock_1 (ptr);
+}
+
+# define protect_start_end 1
+# ifdef HAVE_ATTRIBUTE_VISIBILITY
+#  pragma GCC visibility pop
+# endif
+#endif /* protect_start_end */
+
+#include_next <host-config.h>
diff --git a/libatomic/config/darwin/lock.c b/libatomic/config/darwin/lock.c
new file mode 100644
index 0000000..f0fdc7c
--- /dev/null
+++ b/libatomic/config/darwin/lock.c
@@ -0,0 +1,248 @@ 
+/* Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Iain Sandoe <iain@codesourcery.com>.
+
+   This file is part of the GNU Atomic Library (libatomic).
+
+   Libatomic is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   Libatomic is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+#include "libatomic_i.h"
+
+#include <mach/mach_traps.h>
+#include <mach/thread_switch.h>
+
+/* For items that must be guarded by a lock, we use the following strategy:
+   If atomic support is available for a unit32_t we use that.
+   If not, we use the Darwin OSSpinLock implementation. */
+
+/* Algorithm motivations.
+
+Layout Assumptions:
+  o Darwin has a number of sub-targets with common atomic types that have no
+    'native' in-line handling, but are smaller than a cache-line.
+    E.g. PPC32 needs locking for >= 8-byte quantities, x86/m32 for >= 16-byte ones.
+
+  o The _Atomic alignment of a "natural type" is no greater than the type size.
+
+  o There are no special guarantees about the alignment of _Atomic aggregates
+    other than those determined by the psABI.
+
+  o There are no guarantees that placement of an entity won't cause it to
+    straddle a cache-line boundary.
+
+  o Realistic User code will likely place several _Atomic qualified types in
+    close proximity (such that they fall within the same cache-line).
+    Similarly, arrays of _Atomic qualified items.
+
+Performance Assumptions:
+  o Collisions of address hashes for items (which make up the lock keys)
+    constitute the largest performance issue.
+
+  o We want to avoid unnecessary flushing of [lock table] cache-line(s) when
+    items are accessed.
+
+Implementation:
+  We maintain a table of locks, each lock being 4 bytes (at present). 
+  This choice of lock size gives some measure of equality in hash-collision
+  statistics between the 'atomic' and 'OSSpinLock' implementations, since the
+  lock size is fixed at 4 bytes for the latter.
+
+  The table occupies one physical page, and we attempt to align it to a
+  page boundary, appropriately.
+
+  For entities that need a lock, with sizes < one cache line:
+    Each entity that requires a lock, chooses the lock to use from the table on
+  the basis of a hash determined by its size and address.  The lower log2(size)
+  address bits are discarded on the assumption that the alignment of entities
+  will not be smaller than their size.
+  CHECKME: this is not verified for aggregates; it might be something that
+  could/should be enforced from the front ends (since _Atomic types are
+  allowed to have increased alignment c.f. 'normal').
+
+  For entities that need a lock, with sizes >= one cacheline_size:
+    We assume that the entity alignment >= log2(cacheline_size) and discard
+  log2(cacheline_size) bits from the address.
+  We then apply size/cacheline_size locks to cover the entity.
+
+  The idea is that this will typically result in distinct hash keys for items
+  placed close together.  The keys are mangled further such that the size is
+  included in the hash.
+
+  Finally, to attempt to make it such that the lock table entries are accessed
+  in a scattered manner, to avoid repeated cacheline flushes, the hash is
+  rearranged to attempt to maximise the noise in the upper bits.
+*/
+
+/* The target page size.  Must be no larger than the runtime page size,
+   lest locking fail with virtual address aliasing (i.e. a page mmaped
+   at two locations).  */
+#ifndef PAGE_SIZE
+#  define PAGE_SIZE 4096
+#endif
+
+/* The target cacheline size.  */
+#ifndef CACHLINE_SIZE
+#  define CACHLINE_SIZE 64
+#endif
+
+/* The granularity at which locks are applied when n > CACHLINE_SIZE.
+   We follow the posix pthreads implementation here.  */
+#ifndef WATCH_SIZE
+#  define WATCH_SIZE CACHLINE_SIZE
+#endif
+
+/* This is the number of tries we will make to acquire the lock
+   before giving up our time-slice (on the basis that we are guarding small
+   sections of code here and, therefore, if we don't acquire the lock quickly,
+   that implies that the current holder is not active).  */
+#define NSPINS 4
+
+#if HAVE_ATOMIC_EXCHANGE_4 && HAVE_ATOMIC_LDST_4
+
+#  define LOCK_TYPE volatile uint32_t
+
+/* So that we work with gcc-4.8 we don't try to use _Atomic.
+   If _Atomic(uint32_t) ever gets greater alignment than 4 we'll need to
+   revise this.  */
+
+inline static void LockUnlock(LOCK_TYPE *l)
+{
+  __atomic_store_4((LOCK_TYPE *)l, 0, __ATOMIC_RELEASE);
+}
+
+inline static void LockLock(LOCK_TYPE *l)
+{
+  uint32_t old = 0;
+  unsigned n = NSPINS;
+  while (!__atomic_compare_exchange_4((LOCK_TYPE *)l, &old,
+         1, true, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+    {
+      old = 0;
+      if (--n == 0)
+	{
+	  /* Give up this time-slice, no hint to the scheduler about what to
+	     pick.
+	     TODO: maybe see if it's worth preserving some info about presence
+	     of waiting processes - to allow a similar "give up" time-slice
+	     scheme on the unlock end.  */
+	  thread_switch((mach_port_name_t)0, SWITCH_OPTION_NONE,
+			MACH_MSG_TIMEOUT_NONE);
+	  n = NSPINS;
+	}
+    }
+}
+
+#else
+
+#  include <libkern/OSAtomic.h>
+#  define LOCK_TYPE OSSpinLock
+
+#define LockUnlock(L) OSSpinLockUnlock(L)
+
+inline static void LockLock(LOCK_TYPE *l)
+{
+  unsigned n = NSPINS;
+  while (!OSSpinLockTry(l))
+    {
+      if (--n == 0)
+	{
+	  /* Give up this time-slice, no hint to the scheduler about what to
+	     pick. (see comment above) */
+	  thread_switch((mach_port_name_t)0, SWITCH_OPTION_NONE,
+			MACH_MSG_TIMEOUT_NONE);
+	  n = NSPINS;
+	}
+    }
+}
+
+#endif
+
+#define LOCK_SIZE sizeof(LOCK_TYPE)
+#define NLOCKS (PAGE_SIZE / LOCK_SIZE)
+/* An array of locks that should fill one physical page.  */
+static LOCK_TYPE locks[NLOCKS] __attribute__((aligned(PAGE_SIZE)));
+
+/* A hash function that assumes that entities of a given size are at least
+   aligned to that size, and tries to minimise the probability that adjacent
+   objects will end up using the same cache line in the locks.  */
+static inline uintptr_t
+addr_hash (void *ptr, size_t n)
+{
+  if (n <= CACHLINE_SIZE && n > 0)
+    n = sizeof(unsigned int) * 8 - __builtin_clz((unsigned int) n) - 1;
+  else
+    n = 7;
+
+  uint16_t x = (((uintptr_t)ptr) >> n);
+  x ^= n;
+  x = ((x >> 8) & 0xff) | ((x << 8) & 0xff00);
+  return x % NLOCKS;
+}
+
+void
+libat_lock_1 (void *ptr)
+{
+  LockLock (&locks[addr_hash (ptr, 1)]);
+}
+
+void
+libat_unlock_1 (void *ptr)
+{
+  LockUnlock (&locks[addr_hash (ptr, 1)]);
+}
+
+void
+libat_lock_n (void *ptr, size_t n)
+{
+  uintptr_t h = addr_hash (ptr, n);
+
+  /* Don't lock more than all the locks we have.  */
+  if (n > PAGE_SIZE)
+    n = PAGE_SIZE;
+
+  size_t i = 0;
+  do
+    {
+      LockLock (&locks[h]);
+      if (++h == NLOCKS)
+	h = 0;
+      i += WATCH_SIZE;
+    }
+  while (i < n);
+}
+
+void
+libat_unlock_n (void *ptr, size_t n)
+{
+  uintptr_t h = addr_hash (ptr, n);
+
+  if (n > PAGE_SIZE)
+    n = PAGE_SIZE;
+
+  size_t i = 0;
+  do
+    {
+      LockUnlock (&locks[h]);
+      if (++h == NLOCKS)
+	h = 0;
+      i += WATCH_SIZE;
+    }
+  while (i < n);
+}
diff --git a/libatomic/configure.tgt b/libatomic/configure.tgt
index b0344d5..8283012 100644
--- a/libatomic/configure.tgt
+++ b/libatomic/configure.tgt
@@ -26,6 +26,16 @@ 
 # Map the target cpu to an ARCH sub-directory.  At the same time,
 # work out any special compilation flags as necessary.
 
+case "${target}" in
+  *-*-darwin*)
+    # Use the same default as GCC.
+    DEFAULT_X86_CPU=core2
+    ;;
+  *)
+    DEFAULT_X86_CPU=i486
+    ;;
+esac
+
 case "${target_cpu}" in
   alpha*)
 	# fenv.c needs this option to generate inexact exceptions.
@@ -67,7 +77,7 @@  case "${target_cpu}" in
 	    ;;
 	  *)
 	    if test -z "$with_arch"; then
-	      XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
+	      XCFLAGS="${XCFLAGS} -march=$DEFAULT_X86_CPU -mtune=${target_cpu}"
 	      XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
 	    fi
 	esac
@@ -78,7 +88,7 @@  case "${target_cpu}" in
   x86_64)
 	case " ${CC} ${CFLAGS} " in
 	  *" -m32 "*)
-	    XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
+	    XCFLAGS="${XCFLAGS} -march=$DEFAULT_X86_CPU -mtune=generic"
 	    XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
 	    ;;
 	  *)
@@ -107,11 +117,16 @@  case "${target}" in
   *-*-linux* | *-*-gnu* | *-*-k*bsd*-gnu \
   | *-*-netbsd* | *-*-freebsd* | *-*-openbsd* \
   | *-*-solaris2* | *-*-sysv4* | *-*-irix6* | *-*-osf* | *-*-hpux11* \
-  | *-*-darwin* | *-*-aix* | *-*-cygwin*)
+  | *-*-aix* | *-*-cygwin*)
 	# POSIX system.  The OS is supported.
 	config_path="${config_path} posix"
 	;;
 
+  *-*-darwin*)
+	# Darwin system.  The OS is supported.
+	config_path="${config_path} darwin"
+	;;
+
   *-*-mingw*)
 	# OS support for atomic primitives.
         case ${target_thread_file} in