diff mbox

[store,merging,RFA] Re-implement merging code

Message ID 57FB4BC6.6080403@foss.arm.com
State New
Headers show

Commit Message

Kyrill Tkachov Oct. 10, 2016, 8:05 a.m. UTC
Hi Richard,

As I mentioned, here is the patch applying to the main store merging patch to re-implement encode_tree_to_bitpos
to operate on the bytes directly.

This works fine on little-endian but breaks on big-endian, even for merging bitfields within a single byte.
Consider the code snippet from gcc.dg/store_merging_6.c:

struct bar {
   int a : 3;
   unsigned char b : 4;
   unsigned char c : 1;
   char d;
   char e;
   char f;
   char g;
};

void
foo1 (struct bar *p)
{
   p->b = 3;
   p->a = 2;
   p->c = 1;
   p->d = 4;
   p->e = 5;
}

The correct GIMPLE for these merged stores on big-endian is:
   MEM[(voidD.49 *)p_2(D)] = 18180;
   MEM[(charD.8 *)p_2(D) + 2B] = 5;

whereas with this patch we emit:
   MEM[(voidD.49 *)p_2(D)] = 39428;
   MEM[(charD.8 *)p_2(D) + 2B] = 5;

The dump for merging the individual stores without this patch (using the correct but costly wide_int approach in the base patch) is:
After writing 3 of size 4 at position 3 the merged region contains:
6 0 0 0 0 0
After writing 2 of size 3 at position 0 the merged region contains:
46 0 0 0 0 0
After writing 1 of size 1 at position 7 the merged region contains:
47 0 0 0 0 0
After writing 4 of size 8 at position 8 the merged region contains:
47 4 0 0 0 0
After writing 5 of size 8 at position 16 the merged region contains:
47 4 5 0 0 0


And with this patch it is:
After writing 3 of size 4 at position 3 the merged region contains:
18 0 0 0 0 0
After writing 2 of size 3 at position 0 the merged region contains:
1a 0 0 0 0 0
After writing 1 of size 1 at position 7 the merged region contains:
9a 0 0 0 0 0
After writing 4 of size 8 at position 8 the merged region contains:
9a 4 0 0 0 0
After writing 5 of size 8 at position 16 the merged region contains:
9a 4 5 0 0 0

(Note the dump just dumps the byte array from index 0 to <len> so the first thing printed is the lowest numbered byte.
Also, each byte is dumped in hex.)

The code as included here doesn't do any byte swapping for big-endian but as seen from the dump even writing a sub-byte
bitfield goes wrong so it would be nice to resolve that before going forward.
Any help with debugging this is hugely appreciated. I've included an ASCII diagram of the steps in the algorithm
in the patch itself.

Thanks,
Kyrill

Comments

Richard Biener Oct. 10, 2016, 10:22 a.m. UTC | #1
On Mon, 10 Oct 2016, Kyrill Tkachov wrote:

> Hi Richard,
> 
> As I mentioned, here is the patch applying to the main store merging patch to
> re-implement encode_tree_to_bitpos
> to operate on the bytes directly.
> 
> This works fine on little-endian but breaks on big-endian, even for merging
> bitfields within a single byte.
> Consider the code snippet from gcc.dg/store_merging_6.c:
> 
> struct bar {
>   int a : 3;
>   unsigned char b : 4;
>   unsigned char c : 1;
>   char d;
>   char e;
>   char f;
>   char g;
> };
> 
> void
> foo1 (struct bar *p)
> {
>   p->b = 3;
>   p->a = 2;
>   p->c = 1;
>   p->d = 4;
>   p->e = 5;
> }
> 
> The correct GIMPLE for these merged stores on big-endian is:
>   MEM[(voidD.49 *)p_2(D)] = 18180;
>   MEM[(charD.8 *)p_2(D) + 2B] = 5;
> 
> whereas with this patch we emit:
>   MEM[(voidD.49 *)p_2(D)] = 39428;
>   MEM[(charD.8 *)p_2(D) + 2B] = 5;
> 
> The dump for merging the individual stores without this patch (using the
> correct but costly wide_int approach in the base patch) is:
> After writing 3 of size 4 at position 3 the merged region contains:
> 6 0 0 0 0 0
> After writing 2 of size 3 at position 0 the merged region contains:
> 46 0 0 0 0 0
> After writing 1 of size 1 at position 7 the merged region contains:
> 47 0 0 0 0 0
> After writing 4 of size 8 at position 8 the merged region contains:
> 47 4 0 0 0 0
> After writing 5 of size 8 at position 16 the merged region contains:
> 47 4 5 0 0 0
> 
> 
> And with this patch it is:
> After writing 3 of size 4 at position 3 the merged region contains:
> 18 0 0 0 0 0
> After writing 2 of size 3 at position 0 the merged region contains:
> 1a 0 0 0 0 0
> After writing 1 of size 1 at position 7 the merged region contains:
> 9a 0 0 0 0 0
> After writing 4 of size 8 at position 8 the merged region contains:
> 9a 4 0 0 0 0
> After writing 5 of size 8 at position 16 the merged region contains:
> 9a 4 5 0 0 0
> 
> (Note the dump just dumps the byte array from index 0 to <len> so the first
> thing printed is the lowest numbered byte.
> Also, each byte is dumped in hex.)
> 
> The code as included here doesn't do any byte swapping for big-endian but as
> seen from the dump even writing a sub-byte
> bitfield goes wrong so it would be nice to resolve that before going forward.
> Any help with debugging this is hugely appreciated. I've included an ASCII
> diagram of the steps in the algorithm
> in the patch itself.

Ah, I think you need to account for BITS_BIG_ENDIAN in 
shift_bytes_in_array.  You have to shift towards MSB which means changing
left to right shifts for BITS_BIG_ENDIAN.

You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
Independently of BYTES_BIG_ENDIAN it would be

  ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
...

(so best use a single load / store and operate on a temporary).

Richard.

> Thanks,
> Kyrill
>
Richard Biener Oct. 10, 2016, 10:24 a.m. UTC | #2
On Mon, 10 Oct 2016, Richard Biener wrote:

> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
> 
> > Hi Richard,
> > 
> > As I mentioned, here is the patch applying to the main store merging patch to
> > re-implement encode_tree_to_bitpos
> > to operate on the bytes directly.
> > 
> > This works fine on little-endian but breaks on big-endian, even for merging
> > bitfields within a single byte.
> > Consider the code snippet from gcc.dg/store_merging_6.c:
> > 
> > struct bar {
> >   int a : 3;
> >   unsigned char b : 4;
> >   unsigned char c : 1;
> >   char d;
> >   char e;
> >   char f;
> >   char g;
> > };
> > 
> > void
> > foo1 (struct bar *p)
> > {
> >   p->b = 3;
> >   p->a = 2;
> >   p->c = 1;
> >   p->d = 4;
> >   p->e = 5;
> > }
> > 
> > The correct GIMPLE for these merged stores on big-endian is:
> >   MEM[(voidD.49 *)p_2(D)] = 18180;
> >   MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > 
> > whereas with this patch we emit:
> >   MEM[(voidD.49 *)p_2(D)] = 39428;
> >   MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > 
> > The dump for merging the individual stores without this patch (using the
> > correct but costly wide_int approach in the base patch) is:
> > After writing 3 of size 4 at position 3 the merged region contains:
> > 6 0 0 0 0 0
> > After writing 2 of size 3 at position 0 the merged region contains:
> > 46 0 0 0 0 0
> > After writing 1 of size 1 at position 7 the merged region contains:
> > 47 0 0 0 0 0
> > After writing 4 of size 8 at position 8 the merged region contains:
> > 47 4 0 0 0 0
> > After writing 5 of size 8 at position 16 the merged region contains:
> > 47 4 5 0 0 0
> > 
> > 
> > And with this patch it is:
> > After writing 3 of size 4 at position 3 the merged region contains:
> > 18 0 0 0 0 0
> > After writing 2 of size 3 at position 0 the merged region contains:
> > 1a 0 0 0 0 0
> > After writing 1 of size 1 at position 7 the merged region contains:
> > 9a 0 0 0 0 0
> > After writing 4 of size 8 at position 8 the merged region contains:
> > 9a 4 0 0 0 0
> > After writing 5 of size 8 at position 16 the merged region contains:
> > 9a 4 5 0 0 0
> > 
> > (Note the dump just dumps the byte array from index 0 to <len> so the first
> > thing printed is the lowest numbered byte.
> > Also, each byte is dumped in hex.)
> > 
> > The code as included here doesn't do any byte swapping for big-endian but as
> > seen from the dump even writing a sub-byte
> > bitfield goes wrong so it would be nice to resolve that before going forward.
> > Any help with debugging this is hugely appreciated. I've included an ASCII
> > diagram of the steps in the algorithm
> > in the patch itself.
> 
> Ah, I think you need to account for BITS_BIG_ENDIAN in 
> shift_bytes_in_array.  You have to shift towards MSB which means changing
> left to right shifts for BITS_BIG_ENDIAN.
> 
> You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
> Independently of BYTES_BIG_ENDIAN it would be

Ok, that would matter only if you'd merge shift_bytes_in_array,
clear_bit_region and the |-ring of that into the final buffer
(which should be possible).

Richard.
Kyrill Tkachov Oct. 10, 2016, 11:06 a.m. UTC | #3
On 10/10/16 11:22, Richard Biener wrote:
> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
>
>> Hi Richard,
>>
>> As I mentioned, here is the patch applying to the main store merging patch to
>> re-implement encode_tree_to_bitpos
>> to operate on the bytes directly.
>>
>> This works fine on little-endian but breaks on big-endian, even for merging
>> bitfields within a single byte.
>> Consider the code snippet from gcc.dg/store_merging_6.c:
>>
>> struct bar {
>>    int a : 3;
>>    unsigned char b : 4;
>>    unsigned char c : 1;
>>    char d;
>>    char e;
>>    char f;
>>    char g;
>> };
>>
>> void
>> foo1 (struct bar *p)
>> {
>>    p->b = 3;
>>    p->a = 2;
>>    p->c = 1;
>>    p->d = 4;
>>    p->e = 5;
>> }
>>
>> The correct GIMPLE for these merged stores on big-endian is:
>>    MEM[(voidD.49 *)p_2(D)] = 18180;
>>    MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>
>> whereas with this patch we emit:
>>    MEM[(voidD.49 *)p_2(D)] = 39428;
>>    MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>
>> The dump for merging the individual stores without this patch (using the
>> correct but costly wide_int approach in the base patch) is:
>> After writing 3 of size 4 at position 3 the merged region contains:
>> 6 0 0 0 0 0
>> After writing 2 of size 3 at position 0 the merged region contains:
>> 46 0 0 0 0 0
>> After writing 1 of size 1 at position 7 the merged region contains:
>> 47 0 0 0 0 0
>> After writing 4 of size 8 at position 8 the merged region contains:
>> 47 4 0 0 0 0
>> After writing 5 of size 8 at position 16 the merged region contains:
>> 47 4 5 0 0 0
>>
>>
>> And with this patch it is:
>> After writing 3 of size 4 at position 3 the merged region contains:
>> 18 0 0 0 0 0
>> After writing 2 of size 3 at position 0 the merged region contains:
>> 1a 0 0 0 0 0
>> After writing 1 of size 1 at position 7 the merged region contains:
>> 9a 0 0 0 0 0
>> After writing 4 of size 8 at position 8 the merged region contains:
>> 9a 4 0 0 0 0
>> After writing 5 of size 8 at position 16 the merged region contains:
>> 9a 4 5 0 0 0
>>
>> (Note the dump just dumps the byte array from index 0 to <len> so the first
>> thing printed is the lowest numbered byte.
>> Also, each byte is dumped in hex.)
>>
>> The code as included here doesn't do any byte swapping for big-endian but as
>> seen from the dump even writing a sub-byte
>> bitfield goes wrong so it would be nice to resolve that before going forward.
>> Any help with debugging this is hugely appreciated. I've included an ASCII
>> diagram of the steps in the algorithm
>> in the patch itself.
> Ah, I think you need to account for BITS_BIG_ENDIAN in
> shift_bytes_in_array.  You have to shift towards MSB which means changing
> left to right shifts for BITS_BIG_ENDIAN.

Thanks, I'll try it out. But this is on aarch64 where
BITS_BIG_ENDIAN is 0 even when BYTES_BIG_ENDIAN is 1
so there's something else bad here.

> You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
> Independently of BYTES_BIG_ENDIAN it would be
>
>    ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
> ...

doh, yes. I'll fix that.

> (so best use a single load / store and operate on a temporary).

Thanks,
Kyrill

> Richard.
>
>> Thanks,
>> Kyrill
>>
Kyrill Tkachov Oct. 10, 2016, 11:08 a.m. UTC | #4
On 10/10/16 12:06, Kyrill Tkachov wrote:
>
> On 10/10/16 11:22, Richard Biener wrote:
>> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
>>
>>> Hi Richard,
>>>
>>> As I mentioned, here is the patch applying to the main store merging patch to
>>> re-implement encode_tree_to_bitpos
>>> to operate on the bytes directly.
>>>
>>> This works fine on little-endian but breaks on big-endian, even for merging
>>> bitfields within a single byte.
>>> Consider the code snippet from gcc.dg/store_merging_6.c:
>>>
>>> struct bar {
>>>    int a : 3;
>>>    unsigned char b : 4;
>>>    unsigned char c : 1;
>>>    char d;
>>>    char e;
>>>    char f;
>>>    char g;
>>> };
>>>
>>> void
>>> foo1 (struct bar *p)
>>> {
>>>    p->b = 3;
>>>    p->a = 2;
>>>    p->c = 1;
>>>    p->d = 4;
>>>    p->e = 5;
>>> }
>>>
>>> The correct GIMPLE for these merged stores on big-endian is:
>>>    MEM[(voidD.49 *)p_2(D)] = 18180;
>>>    MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>>
>>> whereas with this patch we emit:
>>>    MEM[(voidD.49 *)p_2(D)] = 39428;
>>>    MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>>
>>> The dump for merging the individual stores without this patch (using the
>>> correct but costly wide_int approach in the base patch) is:
>>> After writing 3 of size 4 at position 3 the merged region contains:
>>> 6 0 0 0 0 0
>>> After writing 2 of size 3 at position 0 the merged region contains:
>>> 46 0 0 0 0 0
>>> After writing 1 of size 1 at position 7 the merged region contains:
>>> 47 0 0 0 0 0
>>> After writing 4 of size 8 at position 8 the merged region contains:
>>> 47 4 0 0 0 0
>>> After writing 5 of size 8 at position 16 the merged region contains:
>>> 47 4 5 0 0 0
>>>
>>>
>>> And with this patch it is:
>>> After writing 3 of size 4 at position 3 the merged region contains:
>>> 18 0 0 0 0 0
>>> After writing 2 of size 3 at position 0 the merged region contains:
>>> 1a 0 0 0 0 0
>>> After writing 1 of size 1 at position 7 the merged region contains:
>>> 9a 0 0 0 0 0
>>> After writing 4 of size 8 at position 8 the merged region contains:
>>> 9a 4 0 0 0 0
>>> After writing 5 of size 8 at position 16 the merged region contains:
>>> 9a 4 5 0 0 0
>>>
>>> (Note the dump just dumps the byte array from index 0 to <len> so the first
>>> thing printed is the lowest numbered byte.
>>> Also, each byte is dumped in hex.)
>>>
>>> The code as included here doesn't do any byte swapping for big-endian but as
>>> seen from the dump even writing a sub-byte
>>> bitfield goes wrong so it would be nice to resolve that before going forward.
>>> Any help with debugging this is hugely appreciated. I've included an ASCII
>>> diagram of the steps in the algorithm
>>> in the patch itself.
>> Ah, I think you need to account for BITS_BIG_ENDIAN in
>> shift_bytes_in_array.  You have to shift towards MSB which means changing
>> left to right shifts for BITS_BIG_ENDIAN.
>
> Thanks, I'll try it out. But this is on aarch64 where
> BITS_BIG_ENDIAN is 0 even when BYTES_BIG_ENDIAN is 1
> so there's something else bad here.
>
>> You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
>> Independently of BYTES_BIG_ENDIAN it would be
>>
>>    ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
>> ...
>
> doh, yes. I'll fix that.
>

Scratch that, just read your other reply.
The precondition for that function is that the shift amount is less than BITS_PER_UNIT.
I'll clarify that in the comment.

Kyril

>> (so best use a single load / store and operate on a temporary).
>
> Thanks,
> Kyrill
>
>> Richard.
>>
>>> Thanks,
>>> Kyrill
>>>
>
Richard Biener Oct. 10, 2016, 11:11 a.m. UTC | #5
On Mon, 10 Oct 2016, Kyrill Tkachov wrote:

> 
> On 10/10/16 11:22, Richard Biener wrote:
> > On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
> > 
> > > Hi Richard,
> > > 
> > > As I mentioned, here is the patch applying to the main store merging patch
> > > to
> > > re-implement encode_tree_to_bitpos
> > > to operate on the bytes directly.
> > > 
> > > This works fine on little-endian but breaks on big-endian, even for
> > > merging
> > > bitfields within a single byte.
> > > Consider the code snippet from gcc.dg/store_merging_6.c:
> > > 
> > > struct bar {
> > >    int a : 3;
> > >    unsigned char b : 4;
> > >    unsigned char c : 1;
> > >    char d;
> > >    char e;
> > >    char f;
> > >    char g;
> > > };
> > > 
> > > void
> > > foo1 (struct bar *p)
> > > {
> > >    p->b = 3;
> > >    p->a = 2;
> > >    p->c = 1;
> > >    p->d = 4;
> > >    p->e = 5;
> > > }
> > > 
> > > The correct GIMPLE for these merged stores on big-endian is:
> > >    MEM[(voidD.49 *)p_2(D)] = 18180;
> > >    MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > > 
> > > whereas with this patch we emit:
> > >    MEM[(voidD.49 *)p_2(D)] = 39428;
> > >    MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > > 
> > > The dump for merging the individual stores without this patch (using the
> > > correct but costly wide_int approach in the base patch) is:
> > > After writing 3 of size 4 at position 3 the merged region contains:
> > > 6 0 0 0 0 0
> > > After writing 2 of size 3 at position 0 the merged region contains:
> > > 46 0 0 0 0 0
> > > After writing 1 of size 1 at position 7 the merged region contains:
> > > 47 0 0 0 0 0
> > > After writing 4 of size 8 at position 8 the merged region contains:
> > > 47 4 0 0 0 0
> > > After writing 5 of size 8 at position 16 the merged region contains:
> > > 47 4 5 0 0 0
> > > 
> > > 
> > > And with this patch it is:
> > > After writing 3 of size 4 at position 3 the merged region contains:
> > > 18 0 0 0 0 0
> > > After writing 2 of size 3 at position 0 the merged region contains:
> > > 1a 0 0 0 0 0
> > > After writing 1 of size 1 at position 7 the merged region contains:
> > > 9a 0 0 0 0 0
> > > After writing 4 of size 8 at position 8 the merged region contains:
> > > 9a 4 0 0 0 0
> > > After writing 5 of size 8 at position 16 the merged region contains:
> > > 9a 4 5 0 0 0
> > > 
> > > (Note the dump just dumps the byte array from index 0 to <len> so the
> > > first
> > > thing printed is the lowest numbered byte.
> > > Also, each byte is dumped in hex.)
> > > 
> > > The code as included here doesn't do any byte swapping for big-endian but
> > > as
> > > seen from the dump even writing a sub-byte
> > > bitfield goes wrong so it would be nice to resolve that before going
> > > forward.
> > > Any help with debugging this is hugely appreciated. I've included an ASCII
> > > diagram of the steps in the algorithm
> > > in the patch itself.
> > Ah, I think you need to account for BITS_BIG_ENDIAN in
> > shift_bytes_in_array.  You have to shift towards MSB which means changing
> > left to right shifts for BITS_BIG_ENDIAN.
> 
> Thanks, I'll try it out. But this is on aarch64 where
> BITS_BIG_ENDIAN is 0 even when BYTES_BIG_ENDIAN is 1
> so there's something else bad here.

Maybe I'm confusing all the macros, so maybe it's BYTES_BIG_ENDIAN
(vs. WORDS_BIG_ENDIAN -- in theory this approach should work for
pdp11 as well).

Richard.

> > You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
> > Independently of BYTES_BIG_ENDIAN it would be
> > 
> >    ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
> > ...
> 
> doh, yes. I'll fix that.
> 
> > (so best use a single load / store and operate on a temporary).
> 
> Thanks,
> Kyrill
> 
> > Richard.
> > 
> > > Thanks,
> > > Kyrill
> > > 
> 
>
Richard Biener Oct. 10, 2016, 11:15 a.m. UTC | #6
On Mon, 10 Oct 2016, Richard Biener wrote:

> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
> 
> > 
> > On 10/10/16 11:22, Richard Biener wrote:
> > > On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
> > > 
> > > > Hi Richard,
> > > > 
> > > > As I mentioned, here is the patch applying to the main store merging patch
> > > > to
> > > > re-implement encode_tree_to_bitpos
> > > > to operate on the bytes directly.
> > > > 
> > > > This works fine on little-endian but breaks on big-endian, even for
> > > > merging
> > > > bitfields within a single byte.
> > > > Consider the code snippet from gcc.dg/store_merging_6.c:
> > > > 
> > > > struct bar {
> > > >    int a : 3;
> > > >    unsigned char b : 4;
> > > >    unsigned char c : 1;
> > > >    char d;
> > > >    char e;
> > > >    char f;
> > > >    char g;
> > > > };
> > > > 
> > > > void
> > > > foo1 (struct bar *p)
> > > > {
> > > >    p->b = 3;
> > > >    p->a = 2;
> > > >    p->c = 1;
> > > >    p->d = 4;
> > > >    p->e = 5;
> > > > }
> > > > 
> > > > The correct GIMPLE for these merged stores on big-endian is:
> > > >    MEM[(voidD.49 *)p_2(D)] = 18180;
> > > >    MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > > > 
> > > > whereas with this patch we emit:
> > > >    MEM[(voidD.49 *)p_2(D)] = 39428;
> > > >    MEM[(charD.8 *)p_2(D) + 2B] = 5;
> > > > 
> > > > The dump for merging the individual stores without this patch (using the
> > > > correct but costly wide_int approach in the base patch) is:
> > > > After writing 3 of size 4 at position 3 the merged region contains:
> > > > 6 0 0 0 0 0
> > > > After writing 2 of size 3 at position 0 the merged region contains:
> > > > 46 0 0 0 0 0
> > > > After writing 1 of size 1 at position 7 the merged region contains:
> > > > 47 0 0 0 0 0
> > > > After writing 4 of size 8 at position 8 the merged region contains:
> > > > 47 4 0 0 0 0
> > > > After writing 5 of size 8 at position 16 the merged region contains:
> > > > 47 4 5 0 0 0
> > > > 
> > > > 
> > > > And with this patch it is:
> > > > After writing 3 of size 4 at position 3 the merged region contains:
> > > > 18 0 0 0 0 0
> > > > After writing 2 of size 3 at position 0 the merged region contains:
> > > > 1a 0 0 0 0 0
> > > > After writing 1 of size 1 at position 7 the merged region contains:
> > > > 9a 0 0 0 0 0
> > > > After writing 4 of size 8 at position 8 the merged region contains:
> > > > 9a 4 0 0 0 0
> > > > After writing 5 of size 8 at position 16 the merged region contains:
> > > > 9a 4 5 0 0 0
> > > > 
> > > > (Note the dump just dumps the byte array from index 0 to <len> so the
> > > > first
> > > > thing printed is the lowest numbered byte.
> > > > Also, each byte is dumped in hex.)
> > > > 
> > > > The code as included here doesn't do any byte swapping for big-endian but
> > > > as
> > > > seen from the dump even writing a sub-byte
> > > > bitfield goes wrong so it would be nice to resolve that before going
> > > > forward.
> > > > Any help with debugging this is hugely appreciated. I've included an ASCII
> > > > diagram of the steps in the algorithm
> > > > in the patch itself.
> > > Ah, I think you need to account for BITS_BIG_ENDIAN in
> > > shift_bytes_in_array.  You have to shift towards MSB which means changing
> > > left to right shifts for BITS_BIG_ENDIAN.
> > 
> > Thanks, I'll try it out. But this is on aarch64 where
> > BITS_BIG_ENDIAN is 0 even when BYTES_BIG_ENDIAN is 1
> > so there's something else bad here.
> 
> Maybe I'm confusing all the macros, so maybe it's BYTES_BIG_ENDIAN
> (vs. WORDS_BIG_ENDIAN -- in theory this approach should work for
> pdp11 as well).

Or maybe I'm confusing how get_inner_reference numbers "bits" when
it returns bitpos... (and how a multi-byte value in target memory
representation has to be "shifted" by bitpos).

I really thought BITS_BIG_ENDIAN is the only thing that matters...

Btw, I reproduced on ppc64-linux (which has BITS_BIG_ENDIAN).

Richard.

> Richard.
> 
> > > You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
> > > Independently of BYTES_BIG_ENDIAN it would be
> > > 
> > >    ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
> > > ...
> > 
> > doh, yes. I'll fix that.
> > 
> > > (so best use a single load / store and operate on a temporary).
> > 
> > Thanks,
> > Kyrill
> > 
> > > Richard.
> > > 
> > > > Thanks,
> > > > Kyrill
> > > > 
> > 
> > 
> 
>
Kyrill Tkachov Oct. 12, 2016, 3:47 p.m. UTC | #7
On 10/10/16 12:15, Richard Biener wrote:
> On Mon, 10 Oct 2016, Richard Biener wrote:
>
>> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
>>
>>> On 10/10/16 11:22, Richard Biener wrote:
>>>> On Mon, 10 Oct 2016, Kyrill Tkachov wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> As I mentioned, here is the patch applying to the main store merging patch
>>>>> to
>>>>> re-implement encode_tree_to_bitpos
>>>>> to operate on the bytes directly.
>>>>>
>>>>> This works fine on little-endian but breaks on big-endian, even for
>>>>> merging
>>>>> bitfields within a single byte.
>>>>> Consider the code snippet from gcc.dg/store_merging_6.c:
>>>>>
>>>>> struct bar {
>>>>>     int a : 3;
>>>>>     unsigned char b : 4;
>>>>>     unsigned char c : 1;
>>>>>     char d;
>>>>>     char e;
>>>>>     char f;
>>>>>     char g;
>>>>> };
>>>>>
>>>>> void
>>>>> foo1 (struct bar *p)
>>>>> {
>>>>>     p->b = 3;
>>>>>     p->a = 2;
>>>>>     p->c = 1;
>>>>>     p->d = 4;
>>>>>     p->e = 5;
>>>>> }
>>>>>
>>>>> The correct GIMPLE for these merged stores on big-endian is:
>>>>>     MEM[(voidD.49 *)p_2(D)] = 18180;
>>>>>     MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>>>>
>>>>> whereas with this patch we emit:
>>>>>     MEM[(voidD.49 *)p_2(D)] = 39428;
>>>>>     MEM[(charD.8 *)p_2(D) + 2B] = 5;
>>>>>
>>>>> The dump for merging the individual stores without this patch (using the
>>>>> correct but costly wide_int approach in the base patch) is:
>>>>> After writing 3 of size 4 at position 3 the merged region contains:
>>>>> 6 0 0 0 0 0
>>>>> After writing 2 of size 3 at position 0 the merged region contains:
>>>>> 46 0 0 0 0 0
>>>>> After writing 1 of size 1 at position 7 the merged region contains:
>>>>> 47 0 0 0 0 0
>>>>> After writing 4 of size 8 at position 8 the merged region contains:
>>>>> 47 4 0 0 0 0
>>>>> After writing 5 of size 8 at position 16 the merged region contains:
>>>>> 47 4 5 0 0 0
>>>>>
>>>>>
>>>>> And with this patch it is:
>>>>> After writing 3 of size 4 at position 3 the merged region contains:
>>>>> 18 0 0 0 0 0
>>>>> After writing 2 of size 3 at position 0 the merged region contains:
>>>>> 1a 0 0 0 0 0
>>>>> After writing 1 of size 1 at position 7 the merged region contains:
>>>>> 9a 0 0 0 0 0
>>>>> After writing 4 of size 8 at position 8 the merged region contains:
>>>>> 9a 4 0 0 0 0
>>>>> After writing 5 of size 8 at position 16 the merged region contains:
>>>>> 9a 4 5 0 0 0
>>>>>
>>>>> (Note the dump just dumps the byte array from index 0 to <len> so the
>>>>> first
>>>>> thing printed is the lowest numbered byte.
>>>>> Also, each byte is dumped in hex.)
>>>>>
>>>>> The code as included here doesn't do any byte swapping for big-endian but
>>>>> as
>>>>> seen from the dump even writing a sub-byte
>>>>> bitfield goes wrong so it would be nice to resolve that before going
>>>>> forward.
>>>>> Any help with debugging this is hugely appreciated. I've included an ASCII
>>>>> diagram of the steps in the algorithm
>>>>> in the patch itself.
>>>> Ah, I think you need to account for BITS_BIG_ENDIAN in
>>>> shift_bytes_in_array.  You have to shift towards MSB which means changing
>>>> left to right shifts for BITS_BIG_ENDIAN.
>>> Thanks, I'll try it out. But this is on aarch64 where
>>> BITS_BIG_ENDIAN is 0 even when BYTES_BIG_ENDIAN is 1
>>> so there's something else bad here.
>> Maybe I'm confusing all the macros, so maybe it's BYTES_BIG_ENDIAN
>> (vs. WORDS_BIG_ENDIAN -- in theory this approach should work for
>> pdp11 as well).
> Or maybe I'm confusing how get_inner_reference numbers "bits" when
> it returns bitpos... (and how a multi-byte value in target memory
> representation has to be "shifted" by bitpos).
>
> I really thought BITS_BIG_ENDIAN is the only thing that matters...
>
> Btw, I reproduced on ppc64-linux (which has BITS_BIG_ENDIAN).

having looked around the documentation and codebase it looks like BITS_BIG_ENDIAN
is just used to define how bitfield instructions on a target operate and so only
apply to the RTL level, so I think we don't have to worry about that at GIMPLE.
As a hack, adjusting bitpos for BYTES_BIG_ENDIAN to:
bitpos = byte_size * BITS_PER_UNIT - bitlen - (bitpos % BITS_PER_UNIT)
"fixes" the dg.exp=store_merging* testcases but is still wrong for multi-byte
testcases (gcc.c-torture/execute/20040629-1.c is a good one that exercises all that).

I'm trying to wrap my head around the byte layout and what should be shifted where...

Thanks,
Kyrill

> Richard.
>
>> Richard.
>>
>>>> You also seem to miss to account for amnt / BITS_PER_UNIT != 0.
>>>> Independently of BYTES_BIG_ENDIAN it would be
>>>>
>>>>     ptr[i + (amnt / BITS_PER_UNIT)] = ptr[i] << amnt;
>>>> ...
>>> doh, yes. I'll fix that.
>>>
>>>> (so best use a single load / store and operate on a temporary).
>>> Thanks,
>>> Kyrill
>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> Kyrill
>>>>>
>>>
>>
diff mbox

Patch

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index 45dc615482780bcb4c9e0d261f25c334d83dd878..c6c81596528a065e6dec023522c1430997fb4593 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -199,6 +199,87 @@  dump_char_array (FILE *fd, unsigned char *ptr, unsigned int len)
   fprintf (fd, "\n");
 }
 
+/* Fill a byte array PTR of SZ elements with zeroes.  This is to be used by
+   encode_tree_to_bitpos to zero-initialize most likely small arrays but
+   with a non-compile-time-constant size.  */
+
+static inline void
+zero_char_buf (unsigned char *ptr, unsigned int sz)
+{
+  for (unsigned int i = 0; i < sz; i++)
+    ptr[i] = 0;
+}
+
+/* Shift the bytes in PTR of SZ elements by AMNT bits, carrying over the bits
+   between adjacent elements.  */
+
+static void
+shift_bytes_in_array (unsigned char *ptr, unsigned int sz, unsigned int amnt)
+{
+  unsigned char carry_over = 0U;
+  unsigned char carry_mask = (~0U) << ((unsigned char)(BITS_PER_UNIT - amnt));
+  unsigned char clear_mask = (~0U) << amnt;
+
+  for (unsigned int i = 0; i < sz; i++)
+    {
+      unsigned prev_carry_over = carry_over;
+      carry_over
+	= (ptr[i] & carry_mask) >> (BITS_PER_UNIT - amnt);
+
+      ptr[i] <<= amnt;
+      if (i != 0)
+	{
+	  ptr[i] &= clear_mask;
+	  ptr[i] |= prev_carry_over;
+	}
+    }
+}
+
+/* In the byte array PTR clear the bit region starting at bit
+   START and is LEN bits wide.  START should be within [0, BITS_PER_UNIT).
+   For regions spanning multiple bytes do this recursively until we reach
+   zero LEN or a region contained within a single byte.  */
+
+static void
+clear_bit_region (unsigned char *ptr, unsigned int start,
+		  unsigned int len)
+{
+  /* Degenerate base case.  */
+  if (len == 0)
+    return;
+
+  /* Second base case.  */
+  if ((start + len) <= BITS_PER_UNIT)
+    {
+      unsigned char mask = (~0U) << ((unsigned char)(BITS_PER_UNIT - len));
+      mask >>= BITS_PER_UNIT - (start + len);
+
+      ptr[0] &= ~mask;
+
+      return;
+    }
+  /* Clear most significant bits in a byte and proceed with the next byte.  */
+  else if (start != 0)
+    {
+      clear_bit_region (ptr, start, BITS_PER_UNIT - start);
+      clear_bit_region (ptr + 1, 0, len - (BITS_PER_UNIT - start) + 1);
+    }
+  /* Whole bytes need to be cleared.  */
+  else if (start == 0 && len > BITS_PER_UNIT)
+    {
+      unsigned int nbytes = len / BITS_PER_UNIT;
+      /* We could recurse on each byte but do the loop here to to avoid
+	 recuring too deep.  */
+      for (unsigned int i = 0; i < nbytes; i++)
+	ptr[i] = 0U;
+      /* Clear the remaining sub-byte region if there is one.  */
+      if (len % BITS_PER_UNIT != 0)
+	clear_bit_region (ptr + nbytes, 0, len % BITS_PER_UNIT);
+    }
+  else
+    gcc_unreachable ();
+}
+
 /* Write BITLEN bits of EXPR to the byte array PTR at
    bit position BITPOS.  PTR should contain TOTAL_BYTES elements.
    Return true if the operation succeeded.  */
@@ -207,77 +288,50 @@  static bool
 encode_tree_to_bitpos (tree expr, unsigned char *ptr, int bitlen, int bitpos,
 			unsigned int total_bytes)
 {
-  unsigned int last_byte = (bitpos + bitlen - 1) / BITS_PER_UNIT;
   unsigned int first_byte = bitpos / BITS_PER_UNIT;
   tree tmp_int = expr;
   bool sub_byte_op_p = (bitlen % BITS_PER_UNIT) || (bitpos % BITS_PER_UNIT);
-  /* If the expression is not an integer encode it into the temporary buffer
-     and read it back as an integer.  */
-  if (TREE_CODE (expr) != INTEGER_CST && sub_byte_op_p)
-    {
-      unsigned char *tmpbuf = XALLOCAVEC (unsigned char, total_bytes);
-      memset (tmpbuf, 0, total_bytes);
-      native_encode_expr (expr, tmpbuf, total_bytes, 0);
-      tree read_back_type
-	= build_nonstandard_integer_type (tree_to_shwi (
-		TYPE_SIZE (TREE_TYPE (expr))), UNSIGNED);
-      tmp_int
-	= native_interpret_expr (read_back_type, tmpbuf, total_bytes);
-    }
 
-  gcc_assert (tmp_int);
-
-  /* If we're inserting a non-bytesized width or not at a byte boundary
-     use an intermediate wide_int to perform the bit-insertion correctly.  */
-  if (sub_byte_op_p
-      || (TREE_CODE (expr) == INTEGER_CST
-	  && mode_for_size (bitlen, MODE_INT, 0) == BLKmode))
-    {
-      unsigned int byte_size = last_byte - first_byte + 1;
-
-      /* The functions native_encode_expr/native_interpret_expr use the
-	 TYPE_MODE of the type to determine the number of bytes to write/read
-	 so if we want to process a number of bytes that does not have a
-	 TYPE_MODE of equal size we need to use a type that has a valid mode
-	 for it.  */
-
-      machine_mode mode
-	= smallest_mode_for_size (byte_size * BITS_PER_UNIT, MODE_INT);
-      tree dest_int_type
-	= build_nonstandard_integer_type (GET_MODE_BITSIZE (mode), UNSIGNED);
-      byte_size = GET_MODE_SIZE (mode);
-      /* The region from the byte array that we're inserting into.  */
-      tree ptr_wide_int
-	= native_interpret_expr (dest_int_type, ptr + first_byte,
-				 total_bytes);
-
-      gcc_assert (ptr_wide_int);
-      wide_int dest_wide_int
-	= wi::to_wide (ptr_wide_int, TYPE_PRECISION (dest_int_type));
-
-      wide_int expr_wide_int
-	= wi::to_wide (tmp_int, byte_size * BITS_PER_UNIT);
-      if (BYTES_BIG_ENDIAN)
-	{
-	  unsigned int insert_pos
-	    = byte_size * BITS_PER_UNIT - bitlen - (bitpos % BITS_PER_UNIT);
-	  dest_wide_int
-	    = wi::insert (dest_wide_int, expr_wide_int, insert_pos, bitlen);
-	}
-      else
-	dest_wide_int = wi::insert (dest_wide_int, expr_wide_int,
-				    bitpos % BITS_PER_UNIT, bitlen);
-
-      tree res = wide_int_to_tree (dest_int_type, dest_wide_int);
-      if (!native_encode_expr (res, ptr + first_byte, total_bytes, 0))
-	return false;
-
-    }
-  /* If we're inserting a "well-behaved" value at a normal position just
-     call native_encode_expr directly.  */
-  else if (!native_encode_expr (tmp_int, ptr + first_byte,
-				 total_bytes, 0))
-    return false;
+  if (!sub_byte_op_p)
+    return native_encode_expr (tmp_int, ptr + first_byte, total_bytes, 0)
+	   != 0;
+
+  /* We are writing a non byte-sized quantity or at a position that is not
+     at a byte boundary.
+     |--------||--------||--------| ptr + first_byte
+           ^                ^
+           xxx  xxxxxxxx  xxx<-bp->
+           |______EXPR______|
+
+     First native_encode_expr EPXR into a temporary buffer and shift each
+     byte in the buffer by bitpos (carrying the bits over as necessary).
+     |00000000||00xxxxxx||xxxxxxxx| << bp = |000xxxxx||xxxxxxxx||xxx00000|
+                                                <------bitlen------><-bp->
+    Then we clear the destination bits:
+    |---00000||00000000||000-----| ptr + first_byte
+        <-------bitlen-----><-bp->
+
+    Finally we ORR the bytes of the shifted EXPR into the cleared region:
+    |---xxxxx||xxxxxxxx||xxx-----| ptr + first_byte.  */
+
+  unsigned int byte_size = GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (expr))) + 1;
+  unsigned char *tmpbuf = XALLOCAVEC (unsigned char, byte_size);
+  zero_char_buf (tmpbuf, byte_size);
+  /* The store detection code should only have allowed constants that are
+     accepted by native_encode_expr.  */
+  if (native_encode_expr (expr, tmpbuf, byte_size, 0) == 0)
+    gcc_unreachable ();
+
+  /* Create the shifted version of EXPR.  */
+  shift_bytes_in_array (tmpbuf, byte_size, bitpos % BITS_PER_UNIT);
+
+  /* Clear the bit region in PTR where the bits from TMPBUF will be
+     inerted into.  */
+  clear_bit_region (ptr + first_byte, bitpos % BITS_PER_UNIT, bitlen);
+
+  /* Insert the bits from TMPBUF.  */
+  for (unsigned int i = 0; i < byte_size; i++)
+    ptr[first_byte + i] |= tmpbuf[i];
 
   return true;
 }