Message ID | 20130812173155.GI25995@mudshark.cambridge.arm.com |
---|---|
State | New |
Headers | show |
2013/8/12 Will Deacon <will.deacon@arm.com>: > Hello again, > > Please pull the following user-cacheflush updates for 3.12. This series both > improves performance of cacheflush-heavy workloads (i.e. browser benchmarks) > and also addresses a DoS issue on non-preemptible systems. > > Will > > --->8 > > The following changes since commit 3b2f64d00c46e1e4e9bd0bb9bb12619adac27a4b: > > Linux 3.11-rc2 (2013-07-21 12:05:29 -0700) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git for-rmk/cacheflush > > for you to fetch changes up to d3065541b5a95e5d0ef37e0dc8ada80fd1fba963: > > ARM: cacheflush: don't bother rounding to nearest vma (2013-07-22 10:49:00 +0100) > > ---------------------------------------------------------------- > Will Deacon (4): > ARM: entry: allow ARM-private syscalls to be restarted > ARM: cacheflush: split user cache-flushing into interruptible chunks > ARM: cacheflush: don't round address range up to nearest page > ARM: cacheflush: don't bother rounding to nearest vma > Hi all. I spend the last day running a bisect and I think I have found a problem :) I have a simple automated test case running, which looks like this: imx6d based device running X, chromium and x11vnc <----> windows pc connected via VNC to the device. With this patchset applyed the browser tab crashed after about 5 minutes hitting the F5/refresh button every 1-3 seconds. The last good kernel is 3.11.10 and the first bad one is 3.12. This is the result of my bisect: root@arm:~/kernel/linux-3.12.2# cat ~/bisect.log Bisecting: a merge base must be tested [6e4664525b1db28f8c4e1130957f70a94c19213e] Linux 3.11 Bisecting: 5700 revisions left to test after this (roughly 13 steps) [cc998ff8811530be521f6b316f37ab7676a07938] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Bisecting: 2850 revisions left to test after this (roughly 12 steps) [f9121153fdfbfaa930bf65077a5597e20d3ac608] mm/hwpoison: don't need to hold compound lock for hugetlbfs page Bisecting: 1428 revisions left to test after this (roughly 11 steps) [327fff3e1391a27dcc89de6e0481689a865361c9] Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Bisecting: 775 revisions left to test after this (roughly 10 steps) [8e73e367f7dc50f1d1bc22a63e5764bb4eea9b48] Merge tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Bisecting: 247 revisions left to test after this (roughly 8 steps) [39eda2aba6be642b71f2e0ad623dcb09fd9d79cf] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Bisecting: 205 revisions left to test after this (roughly 8 steps) [22e04f6b4b04a8afe9af9239224591d06ba3b24d] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Bisecting: 110 revisions left to test after this (roughly 7 steps) [2e032852245b3dcfe5461d7353e34eb6da095ccf] Merge branch 'for-linus' of git://git.linaro.org/people/rmk/linux-arm Bisecting: 55 revisions left to test after this (roughly 5 steps) [5cc91e0460889c8461620904968e193dddb1beb3] Merge branch 'for-rmk/cacheflush-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable Bisecting: 11 revisions left to test after this (roughly 4 steps) [6af396a6b6c698eb3834184518fc9a59bc22c817] ARM: cacheflush: use -ishst dsb variant for ensuring flush completion Bisecting: 5 revisions left to test after this (roughly 3 steps) [b4f656eea63376da79b0b5a17660c4ce14b71b74] Pull branch 'for-rmk' of git://git.linaro.org/people/ardbiesheuvel/linux-arm into devel-stable Bisecting: 2 revisions left to test after this (roughly 2 steps) [d9524dc32cab52714dee0c8e59c7437ee33a239a] ARM: cacheflush: don't round address range up to nearest page Bisecting: 0 revisions left to test after this (roughly 1 step) [28256d612726a28a8b9d3c49f2b74198c4423d6a] ARM: cacheflush: split user cache-flushing into interruptible chunks Bisecting: 0 revisions left to test after this (roughly 0 steps) [377747c40657eb35ad98a56439606d96a928425a] ARM: entry: allow ARM-private syscalls to be restarted 28256d612726a28a8b9d3c49f2b74198c4423d6a is the first bad commit commit 28256d612726a28a8b9d3c49f2b74198c4423d6a Author: Will Deacon <will.deacon@arm.com> Date: Mon May 13 15:21:49 2013 +0100 ARM: cacheflush: split user cache-flushing into interruptible chunks Flushing a large, non-faulting VMA from userspace can potentially result in a long time spent flushing the cache line-by-line without preemption occurring (in the case of CONFIG_PREEMPT=n). Whilst this doesn't affect the stability of the system, it can certainly affect the responsiveness and CPU availability for other tasks. This patch splits up the user cacheflush code so that it flushes in chunks of a page. After each chunk has been flushed, we may reschedule if appropriate and, before processing the next chunk, we allow any pending signals to be handled before resuming from where we left off. Signed-off-by: Will Deacon <will.deacon@arm.com> :040000 040000 33ebf747dde46884ce4e7d4ce922fef3cd5b580e 22cdb8a0bc6dc72cb92d93c13ed1a45081269f77 M arch If I revert 28256d612726a28a8b9d3c49f2b74198c4423d6a and 97c72d89ce0ec8c73f19d5e35ec1f90f7a14bed7 my "test" runs hours. What debug options should I enable to get meaningful output from the kernel? greets -- Christian Gmeiner, MSc
On Wed, Dec 04, 2013 at 03:37:36PM +0000, Christian Gmeiner wrote: > 2013/8/12 Will Deacon <will.deacon@arm.com>: > > Please pull the following user-cacheflush updates for 3.12. This series both > > improves performance of cacheflush-heavy workloads (i.e. browser benchmarks) > > and also addresses a DoS issue on non-preemptible systems. [...] > Hi all. Hello, > I spend the last day running a bisect and I think I have found a problem :) > > I have a simple automated test case running, which looks like this: > > imx6d based device running X, chromium and x11vnc <----> windows pc connected > via VNC to the device. With this patchset applyed the browser tab > crashed after about > 5 minutes hitting the F5/refresh button every 1-3 seconds. Hmm... it would be great if we had a simpler way to reproduce this, but ok. How many cores do you have on your IMX6? Also, how does the browser tab crash? Does it receive a SIGILL? > 28256d612726a28a8b9d3c49f2b74198c4423d6a is the first bad commit > commit 28256d612726a28a8b9d3c49f2b74198c4423d6a > Author: Will Deacon <will.deacon@arm.com> > Date: Mon May 13 15:21:49 2013 +0100 > > ARM: cacheflush: split user cache-flushing into interruptible chunks > > Flushing a large, non-faulting VMA from userspace can potentially result > in a long time spent flushing the cache line-by-line without preemption > occurring (in the case of CONFIG_PREEMPT=n). > > Whilst this doesn't affect the stability of the system, it can certainly > affect the responsiveness and CPU availability for other tasks. > > This patch splits up the user cacheflush code so that it flushes in > chunks of a page. After each chunk has been flushed, we may reschedule > if appropriate and, before processing the next chunk, we allow any > pending signals to be handled before resuming from where we left off. > > Signed-off-by: Will Deacon <will.deacon@arm.com> I took another look at that patch and can't see anything obviously wrong with it. It may, however, be exposing bugs in userspace that you would struggle to hit before. > :040000 040000 33ebf747dde46884ce4e7d4ce922fef3cd5b580e > 22cdb8a0bc6dc72cb92d93c13ed1a45081269f77 M arch > > > If I revert 28256d612726a28a8b9d3c49f2b74198c4423d6a and > 97c72d89ce0ec8c73f19d5e35ec1f90f7a14bed7 my "test" runs hours. > > > What debug options should I enable to get meaningful output from the kernel? An strace log of the failing case would be good. Another thing you could try is commenting out the cond_resched in __do_cache_op and see if that helps. Will
2013/12/5 Jon Medhurst (Tixy) <tixy@linaro.org>: > On Thu, 2013-12-05 at 14:23 +0000, Jon Medhurst (Tixy) wrote: >> On Wed, 2013-12-04 at 16:13 +0000, Will Deacon wrote: >> > took another look at that patch and can't see anything obviously wrong >> > with it. >> >> If the memory region isn't guaranteed to be page aligned then doesn't it >> flush up to PAGE_SIZE-1 more bytes than requested and so exceed the >> bounds check in do_cache_op? Fixing this as below _appears_ to stop the >> Browser crashes I'm seeing (still doing some more testing)... > > Actually, a smaller patch and one which avoids the possibility of > arithmetic owerflow is... > > From: Jon Medhurst <tixy@linaro.org> > Date: Thu, 5 Dec 2013 14:29:24 +0000 > Subject: [PATCH] ARM: cacheflush: correctly limit range of memory region > being flushed > > The __do_cache_op function operates with a 'chunk' size of one page > but fails to limit the size of the final chunk so as to not exceed > the specified memory region. Fix this. > > Signed-off-by: Jon Medhurst <tixy@linaro.org> > --- > arch/arm/kernel/traps.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c > index dbf0923..7940241 100644 > --- a/arch/arm/kernel/traps.c > +++ b/arch/arm/kernel/traps.c > @@ -509,9 +509,10 @@ static inline int > __do_cache_op(unsigned long start, unsigned long end) > { > int ret; > - unsigned long chunk = PAGE_SIZE; > > do { > + unsigned long chunk = min(PAGE_SIZE, end - start); > + > if (signal_pending(current)) { > struct thread_info *ti = current_thread_info(); > > -- > 1.7.10.4 > > I am running my test overnight with this patch... stay tuned. greets -- Christian Gmeiner, MSc
2013/12/5 Christian Gmeiner <christian.gmeiner@gmail.com>: > 2013/12/5 Jon Medhurst (Tixy) <tixy@linaro.org>: >> On Thu, 2013-12-05 at 14:23 +0000, Jon Medhurst (Tixy) wrote: >>> On Wed, 2013-12-04 at 16:13 +0000, Will Deacon wrote: >>> > took another look at that patch and can't see anything obviously wrong >>> > with it. >>> >>> If the memory region isn't guaranteed to be page aligned then doesn't it >>> flush up to PAGE_SIZE-1 more bytes than requested and so exceed the >>> bounds check in do_cache_op? Fixing this as below _appears_ to stop the >>> Browser crashes I'm seeing (still doing some more testing)... >> >> Actually, a smaller patch and one which avoids the possibility of >> arithmetic owerflow is... >> >> From: Jon Medhurst <tixy@linaro.org> >> Date: Thu, 5 Dec 2013 14:29:24 +0000 >> Subject: [PATCH] ARM: cacheflush: correctly limit range of memory region >> being flushed >> >> The __do_cache_op function operates with a 'chunk' size of one page >> but fails to limit the size of the final chunk so as to not exceed >> the specified memory region. Fix this. >> >> Signed-off-by: Jon Medhurst <tixy@linaro.org> >> --- >> arch/arm/kernel/traps.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c >> index dbf0923..7940241 100644 >> --- a/arch/arm/kernel/traps.c >> +++ b/arch/arm/kernel/traps.c >> @@ -509,9 +509,10 @@ static inline int >> __do_cache_op(unsigned long start, unsigned long end) >> { >> int ret; >> - unsigned long chunk = PAGE_SIZE; >> >> do { >> + unsigned long chunk = min(PAGE_SIZE, end - start); >> + >> if (signal_pending(current)) { >> struct thread_info *ti = current_thread_info(); >> >> -- >> 1.7.10.4 >> >> > > I am running my test overnight with this patch... stay tuned. > My test is still running, which is good. Tested-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Hi guys, I was out-of-office yesterday, so sorry for the radio silence. On Thu, Dec 05, 2013 at 03:14:31PM +0000, Jon Medhurst (Tixy) wrote: > On Thu, 2013-12-05 at 14:23 +0000, Jon Medhurst (Tixy) wrote: > > On Wed, 2013-12-04 at 16:13 +0000, Will Deacon wrote: > > > took another look at that patch and can't see anything obviously wrong > > > with it. > > > > If the memory region isn't guaranteed to be page aligned then doesn't it > > flush up to PAGE_SIZE-1 more bytes than requested and so exceed the > > bounds check in do_cache_op? Fixing this as below _appears_ to stop the > > Browser crashes I'm seeing (still doing some more testing)... Well spotted! We used to round down to a page, but I fixed that in the same series and didn't notice I'd introduce a bug here. Thanks for digging. Interesting that we've not heard more reports of this... > Actually, a smaller patch and one which avoids the possibility of > arithmetic owerflow is... > > From: Jon Medhurst <tixy@linaro.org> > Date: Thu, 5 Dec 2013 14:29:24 +0000 > Subject: [PATCH] ARM: cacheflush: correctly limit range of memory region > being flushed > > The __do_cache_op function operates with a 'chunk' size of one page > but fails to limit the size of the final chunk so as to not exceed > the specified memory region. Fix this. > > Signed-off-by: Jon Medhurst <tixy@linaro.org> > --- > arch/arm/kernel/traps.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c > index dbf0923..7940241 100644 > --- a/arch/arm/kernel/traps.c > +++ b/arch/arm/kernel/traps.c > @@ -509,9 +509,10 @@ static inline int > __do_cache_op(unsigned long start, unsigned long end) > { > int ret; > - unsigned long chunk = PAGE_SIZE; > > do { > + unsigned long chunk = min(PAGE_SIZE, end - start); > + > if (signal_pending(current)) { > struct thread_info *ti = current_thread_info(); Acked-by: Will Deacon <will.deacon@arm.com> Please can you put this into the patch system, along with a CC stable? Cheers, Will