Patchwork sparc64: simple microoptimizations for atomic functions

login
register
mail settings
Submitter Mikulas Patocka
Date Aug. 18, 2010, 6:03 p.m.
Message ID <Pine.LNX.4.64.1008181402450.9287@hs20-bc2-1.build.redhat.com>
Download mbox | patch
Permalink /patch/62073/
State Accepted
Delegated to: David Miller
Headers show

Comments

Mikulas Patocka - Aug. 18, 2010, 6:03 p.m.
Simple microoptimizations for sparc64 atomic functions:
Save one instruction by using a delay slot.
Use %g1 instead of %g7, because %g1 is written earlier.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/sparc/lib/atomic_64.S |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 18, 2010, 8:22 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Aug 2010 14:03:37 -0400 (EDT)

> Simple microoptimizations for sparc64 atomic functions:
> Save one instruction by using a delay slot.
> Use %g1 instead of %g7, because %g1 is written earlier.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Does it really make things faster?  What timings have you
done on these code sequences?
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Aug. 18, 2010, 10:08 p.m.
On Wed, 18 Aug 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Wed, 18 Aug 2010 14:03:37 -0400 (EDT)
> 
> > Simple microoptimizations for sparc64 atomic functions:
> > Save one instruction by using a delay slot.
> > Use %g1 instead of %g7, because %g1 is written earlier.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> Does it really make things faster?  What timings have you
> done on these code sequences?

I don't think such microoptimizations can be measured. It may save an 
I-cacheline --- but who knows if exactly this cacheline makes some effect 
or not? The point is that making microoptimizations is *much* easier than 
measuring them, so I usually do them and never ask if they help. If you do 
100 microoptimizations, it may have some effect in code size reduction, 
but you never pinpoint it to a single optimization.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 18, 2010, 11:03 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Aug 2010 18:08:49 -0400 (EDT)

> I don't think such microoptimizations can be measured. It may save an 
> I-cacheline --- but who knows if exactly this cacheline makes some effect 
> or not?

These routines, when contention backoff is disabled, have
intentionally been coded to be perfectly 8 instructions, which is
exactly 32 bytes, which is exactly 1 I-cache line.  You'll find that
much of the by-hand sparc64 assembler routines have been written to be
a multiple of 8 instructions.

Because if you don't start a function on an I-cache line you get a
partial fetch when it's called, therefore making it impossible to fill
the pipeline even if the instructions could be executed in parallel.

So actually you're changes are likely to hurt performance from a cache
line and pipelining viewpoint.

Furthermore, talking about saving one cycle (which I don't even think
you'll get) when the CAS instruction itself is going to stall the chip
for ~50 cycles is not all that worthwhile either.

The UltraSPARC-I,II,III et al. programming manuals are pretty clear
about code generation guidelines, I've been reading them for 10+
years, and that is what I've used to guide the writing of the
assembler code.  I've also run the code through simulators (when
possible) and done cycle analysis (both hot and cold cache cases) on
real hardware for these routines.

So I basically expect the same kind of considerations from you if you
want to "optimize" this code :-)

I value your contribution but seriously I think the code is fine and
optimal as-is.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Aug. 18, 2010, 11:44 p.m.
On Wed, 18 Aug 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Wed, 18 Aug 2010 18:08:49 -0400 (EDT)
> 
> > I don't think such microoptimizations can be measured. It may save an 
> > I-cacheline --- but who knows if exactly this cacheline makes some effect 
> > or not?
> 
> These routines, when contention backoff is disabled, have
> intentionally been coded to be perfectly 8 instructions, which is
> exactly 32 bytes, which is exactly 1 I-cache line.  You'll find that
> much of the by-hand sparc64 assembler routines have been written to be
> a multiple of 8 instructions.

They are not:
0000000000000000 <atomic_add>:
0000000000000028 <atomic_sub>:
0000000000000050 <atomic_add_ret>:
000000000000007c <atomic_sub_ret>:
00000000000000a8 <atomic64_add>:
00000000000000d0 <atomic64_sub>:
00000000000000f8 <atomic64_add_ret>:
0000000000000124 <atomic64_sub_ret>:

(on UP compile without backoff). That dummy backoff code produces jump 
forward and backward.

> Because if you don't start a function on an I-cache line you get a
> partial fetch when it's called, therefore making it impossible to fill
> the pipeline even if the instructions could be executed in parallel.

So add .align there.

> So actually you're changes are likely to hurt performance from a cache
> line and pipelining viewpoint.
> 
> Furthermore, talking about saving one cycle (which I don't even think
> you'll get) when the CAS instruction itself is going to stall the chip
> for ~50 cycles is not all that worthwhile either.

If it's like x86 --- i.e. flush the whole pipe and execute microcode, that 
it doesn't make much sense to optimize ticks in the pipeline. Optimizing 
for cache pollution could make some sense.

Mikulas

> The UltraSPARC-I,II,III et al. programming manuals are pretty clear
> about code generation guidelines, I've been reading them for 10+
> years, and that is what I've used to guide the writing of the
> assembler code.  I've also run the code through simulators (when
> possible) and done cycle analysis (both hot and cold cache cases) on
> real hardware for these routines.
> 
> So I basically expect the same kind of considerations from you if you
> want to "optimize" this code :-)
> 
> I value your contribution but seriously I think the code is fine and
> optimal as-is.
> 
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 19, 2010, 12:16 a.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Aug 2010 19:44:05 -0400 (EDT)

> 
> 
> On Wed, 18 Aug 2010, David Miller wrote:
> 
>> These routines, when contention backoff is disabled, have
>> intentionally been coded to be perfectly 8 instructions, which is
>> exactly 32 bytes, which is exactly 1 I-cache line.  You'll find that
>> much of the by-hand sparc64 assembler routines have been written to be
>> a multiple of 8 instructions.
> 
> They are not:
> 0000000000000000 <atomic_add>:
> 0000000000000028 <atomic_sub>:
> 0000000000000050 <atomic_add_ret>:
> 000000000000007c <atomic_sub_ret>:
> 00000000000000a8 <atomic64_add>:
> 00000000000000d0 <atomic64_sub>:
> 00000000000000f8 <atomic64_add_ret>:
> 0000000000000124 <atomic64_sub_ret>:
> 
> (on UP compile without backoff). That dummy backoff code produces jump 
> forward and backward.

That's a bug, it should be just an empty macro expansion.

>> So actually you're changes are likely to hurt performance from a cache
>> line and pipelining viewpoint.
>> 
>> Furthermore, talking about saving one cycle (which I don't even think
>> you'll get) when the CAS instruction itself is going to stall the chip
>> for ~50 cycles is not all that worthwhile either.
> 
> If it's like x86 --- i.e. flush the whole pipe and execute microcode, that 
> it doesn't make much sense to optimize ticks in the pipeline.

Yes, CAS bascially puts a bunch of micro-ops into the pipeline to
do the load/compare/conditional-store atomically.

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 19, 2010, 5:52 a.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Aug 2010 14:03:37 -0400 (EDT)

> Simple microoptimizations for sparc64 atomic functions:
> Save one instruction by using a delay slot.
> Use %g1 instead of %g7, because %g1 is written earlier.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Actually after some more consideration and studying of this code I've
decided to apply your patch, thanks a lot Mikulas.

I'll then add the BACKOFF macro fixup for the UP case.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Aug. 19, 2010, 1:15 p.m.
BTW. two years ago, I reported a bug that userspace programs on sparc64 
would randomly crash in a few hours with preempt enabled 
( http://marc.info/?l=linux-sparc&m=121222607414906&w=4 )

Now I tried the same workload with 2.6.35 and it is stable with preempt.

Are you aware that the bug was fixed? Or if not, it may be still lurking 
there and some unrelated change hid the race condition. Should I bisect 
the kernel to find out which patch actually fixed (or hid) the bug?

Mikulas

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 19, 2010, 6:32 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Thu, 19 Aug 2010 09:15:16 -0400 (EDT)

> BTW. two years ago, I reported a bug that userspace programs on sparc64 
> would randomly crash in a few hours with preempt enabled 
> ( http://marc.info/?l=linux-sparc&m=121222607414906&w=4 )
> 
> Now I tried the same workload with 2.6.35 and it is stable with preempt.
> 
> Are you aware that the bug was fixed? Or if not, it may be still lurking 
> there and some unrelated change hid the race condition. Should I bisect 
> the kernel to find out which patch actually fixed (or hid) the bug?

I can't think of anything we did explicitly to fix that.

I'm sure you can find more useful things to do than to try and
bisect that thing down, really :-)
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Sept. 10, 2010, 6:31 p.m.
On Thu, 19 Aug 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Thu, 19 Aug 2010 09:15:16 -0400 (EDT)
> 
> > BTW. two years ago, I reported a bug that userspace programs on sparc64 
> > would randomly crash in a few hours with preempt enabled 
> > ( http://marc.info/?l=linux-sparc&m=121222607414906&w=4 )
> > 
> > Now I tried the same workload with 2.6.35 and it is stable with preempt.
> > 
> > Are you aware that the bug was fixed? Or if not, it may be still lurking 
> > there and some unrelated change hid the race condition. Should I bisect 
> > the kernel to find out which patch actually fixed (or hid) the bug?
> 
> I can't think of anything we did explicitly to fix that.
> 
> I'm sure you can find more useful things to do than to try and
> bisect that thing down, really :-)

From my experience with development --- no race should be silently 
ignored. Because if you ignore a race, there is a possibility that the 
race is still lurking there, only hitting under different conditions.

I did the bisect anyway, the race was fixed between 2.6.28 and 2.6.29-rc1.
Here "bad" really means "doesn't have a bug" and "good" means "does have 
the bug", because bisect tool was written to search for bugs that were 
created, not fixed. "X" means "doesn't work, I guessed".

bad:    47676   590cf28580c999c8ba70dc39b40bab09d69e2630
bad:    61357   0191b625ca5a46206d2fb862bb08f36f2fcb3b31
bad:    61391   54a696bd07c14d3b1192d03ce7269bc59b45209a
bad:    61578   bb26c6c29b7cc9f39e491b074b09f3c284738d36
good:   65849   a65056205cdf7efb96fb2558e4f1ec6bae2582ed
good:   61638   cb10ea549fdc0ab2dd8988adab5bf40b4fa642f3
good:   113707  745ca2475a6ac596e3d8d37c2759c0fbe2586227 - X
good:   66223   d8a5e2e9f4e70ade136c67ce8242f0db4c2cddc7
bad:    86684   94d6a5f7341ebaff53d4e41cc81fab37f0d9fbed
good:   112956  e50a906e0200084f04f8f3b7c3a14b0442d1347f
bad:    102852  6ded6ab9be4f6164aef1c527407c1b94f0929799
bad:    109140  7596b27dbd8de7bcfa7a80b2756114b49bd5c018
bad:    111150  f3a5c547012a09f38f7c27b17a8e3150b69cd259

I very much don't understand the bisect tool, I thought that it will do a 
binary search between the entries in the ChangeLog file, but it skips very 
randomly (the second columnt is the line in the ChangeLog file).

How does it really work? Do you know how to interpret it and which patch 
really fixed the bug?

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Sept. 10, 2010, 6:36 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Fri, 10 Sep 2010 14:31:27 -0400 (EDT)

> I very much don't understand the bisect tool, I thought that it will do a 
> binary search between the entries in the ChangeLog file, but it skips very 
> randomly (the second columnt is the line in the ChangeLog file).
> 
> How does it really work? Do you know how to interpret it and which patch 
> really fixed the bug?

When there are multiple lines of development, it has to hop in and out
of the merged branches.

Sometimes that aspect can make it go "out of order".

The best you can do is do the full bisect, trust what it comes up
with, and start your analysis with the commit it comes up with after
the whole bisect is complete.

What commit did it print out when you completed the bisect run?

Also, there is another problematic case, sometimes bisect gives you
a merge commit.  And this means that in the tree in which the
development of the branch was done, the bisect condition doesn't
trigger, but once it gets integrated into what happened upstream
meanwhile, it does trigger.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Sept. 10, 2010, 6:46 p.m.
On Fri, 10 Sep 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Fri, 10 Sep 2010 14:31:27 -0400 (EDT)
> 
> > I very much don't understand the bisect tool, I thought that it will do a 
> > binary search between the entries in the ChangeLog file, but it skips very 
> > randomly (the second columnt is the line in the ChangeLog file).
> > 
> > How does it really work? Do you know how to interpret it and which patch 
> > really fixed the bug?
> 
> When there are multiple lines of development, it has to hop in and out
> of the merged branches.
> 
> Sometimes that aspect can make it go "out of order".

Bisecting over a graph, that is totally confusing.

> The best you can do is do the full bisect, trust what it comes up
> with, and start your analysis with the commit it comes up with after
> the whole bisect is complete.
> 
> What commit did it print out when you completed the bisect run?
> 
> Also, there is another problematic case, sometimes bisect gives you
> a merge commit.  And this means that in the tree in which the
> development of the branch was done, the bisect condition doesn't
> trigger, but once it gets integrated into what happened upstream
> meanwhile, it does trigger.

good:   112956  e50a906e0200084f04f8f3b7c3a14b0442d1347f
bad:    102852  6ded6ab9be4f6164aef1c527407c1b94f0929799
bad:    109140  7596b27dbd8de7bcfa7a80b2756114b49bd5c018
bad:    111150  f3a5c547012a09f38f7c27b17a8e3150b69cd259

I don't know what came up. The last "good" one and the last "bad" one are 
more than 1000 lines apart. After these tries it refused to bisect 
further and pretended that the bug is found.

If you can't interpret it --- there happened sparc+sparc64 -> sparc merge 
between this and it looks like a probable cause. Could you give me your 
git tree with linear commits that I could try?

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Sept. 10, 2010, 7:07 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Fri, 10 Sep 2010 14:46:39 -0400 (EDT)

> If you can't interpret it --- there happened sparc+sparc64 -> sparc merge 
> between this and it looks like a probable cause. Could you give me your 
> git tree with linear commits that I could try?

I rebase my tree every time the merge window completes, so
I can't give that to you, sorry.

But you don't need my tree from back then, you can git bisect between
the commits on my line of development branch when Linus pulled my tree
in.

And even my original tree was linear, I would pull Linus's tree in now
and again to deal with merge conflicts and similar.

Anyways, all the information is there, you just need to give GIT the
properl commit ranges.

BTW, if you're so disappointed about how it walks the graph during git
bisect, feel free to suggest alternatives to the GIT developers.  I
think they considered long and hard about this, and therefore the
current behavior isn't arbitrary :-)
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka - Sept. 10, 2010, 7:33 p.m.
On Fri, 10 Sep 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Fri, 10 Sep 2010 14:46:39 -0400 (EDT)
> 
> > If you can't interpret it --- there happened sparc+sparc64 -> sparc merge 
> > between this and it looks like a probable cause. Could you give me your 
> > git tree with linear commits that I could try?
> 
> I rebase my tree every time the merge window completes, so
> I can't give that to you, sorry.
> 
> But you don't need my tree from back then, you can git bisect between
> the commits on my line of development branch when Linus pulled my tree
> in.
> 
> And even my original tree was linear, I would pull Linus's tree in now
> and again to deal with merge conflicts and similar.

I assume that there was some commit ID in your tree when you started to do 
sparc+sparc64 unification and there was some commit ID when you finished 
it. So give me these two and I hope git will go linearly between them.

> Anyways, all the information is there, you just need to give GIT the
> properl commit ranges.
> 
> BTW, if you're so disappointed about how it walks the graph during git
> bisect, feel free to suggest alternatives to the GIT developers.  I
> think they considered long and hard about this, and therefore the
> current behavior isn't arbitrary :-)

If you understand git better than me, please explain to me how to 
interpret bisect result --- i.e. I sent a list of commit IDs that are 
result of bisecting, so tell me which fixed the bug and how did you find 
it.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Sept. 10, 2010, 7:45 p.m.
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Fri, 10 Sep 2010 15:33:08 -0400 (EDT)

> I assume that there was some commit ID in your tree when you started to do 
> sparc+sparc64 unification and there was some commit ID when you finished 
> it. So give me these two and I hope git will go linearly between them.

Maybe not this specific portion, but some parts of the unification
work was not done linearly, it was partly done in a GIT tree
maintained by Sam Ravnborg and partly done in mine.

Non-linear development is just how things are much of the time.

And regardless, you could figure out the commits you're looking for by
looking in GITK at the merge commit when Linus brought the my tree in
during the merge window, walk up the line that represents my branch going
back in time to the point where the line connects back up to the mainline.

Start with something like "gitk v2.6.28..v2.6.29-rc1 -- arch/sparc arch/spar64"
it's pretty clear what the sequence of commits are in that branch when you
look at that window.

The boundaries (inclusive) you come up with will be something like:

4696b64d234b84b5b70ffd49a76833aa5c49cb61
86821d18f9d8fd42fa3180ad3703d491aecc52d9

Btw, "git bisect visualize" uses gitk to show you where the good and
bad points are in the graph during bisection.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

Index: linux-2.6.35-preempt/arch/sparc/lib/atomic_64.S
===================================================================
--- linux-2.6.35-preempt.orig/arch/sparc/lib/atomic_64.S	2010-08-18 15:10:47.000000000 +0200
+++ linux-2.6.35-preempt/arch/sparc/lib/atomic_64.S	2010-08-18 19:40:25.000000000 +0200
@@ -52,10 +52,9 @@  atomic_add_ret: /* %o0 = increment, %o1 
 	cas	[%o1], %g1, %g7
 	cmp	%g1, %g7
 	bne,pn	%icc, 2f
-	 add	%g7, %o0, %g7
-	sra	%g7, 0, %o0
+	 add	%g1, %o0, %g1
 	retl
-	 nop
+	 sra	%g1, 0, %o0
 2:	BACKOFF_SPIN(%o2, %o3, 1b)
 	.size	atomic_add_ret, .-atomic_add_ret
 
@@ -68,10 +67,9 @@  atomic_sub_ret: /* %o0 = decrement, %o1 
 	cas	[%o1], %g1, %g7
 	cmp	%g1, %g7
 	bne,pn	%icc, 2f
-	 sub	%g7, %o0, %g7
-	sra	%g7, 0, %o0
+	 sub	%g1, %o0, %g1
 	retl
-	 nop
+	 sra	%g1, 0, %o0
 2:	BACKOFF_SPIN(%o2, %o3, 1b)
 	.size	atomic_sub_ret, .-atomic_sub_ret
 
@@ -114,10 +112,9 @@  atomic64_add_ret: /* %o0 = increment, %o
 	casx	[%o1], %g1, %g7
 	cmp	%g1, %g7
 	bne,pn	%xcc, 2f
-	 add	%g7, %o0, %g7
-	mov	%g7, %o0
-	retl
 	 nop
+	retl
+	 add	%g1, %o0, %o0
 2:	BACKOFF_SPIN(%o2, %o3, 1b)
 	.size	atomic64_add_ret, .-atomic64_add_ret
 
@@ -130,9 +127,8 @@  atomic64_sub_ret: /* %o0 = decrement, %o
 	casx	[%o1], %g1, %g7
 	cmp	%g1, %g7
 	bne,pn	%xcc, 2f
-	 sub	%g7, %o0, %g7
-	mov	%g7, %o0
-	retl
 	 nop
+	retl
+	 sub	%g1, %o0, %o0
 2:	BACKOFF_SPIN(%o2, %o3, 1b)
 	.size	atomic64_sub_ret, .-atomic64_sub_ret