fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t

Message ID 20190820170818.oldsdoumzashhcgh@linutronix.de
State New
Series fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t

Commit Message

Sebastian Andrzej Siewior Aug. 20, 2019, 5:08 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
disable preemption, which is undesired for latency reasons and breaks when
regular spinlocks are taken within the bit_spinlock locked region because
regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
replaces the bit spinlocks with regular spinlocks to avoid this problem.
Bit spinlocks are also not covered by lock debugging, e.g. lockdep.
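
For context, this is roughly what the bit spinlock primitives do on a
mainline (non-RT) SMP kernel; a simplified sketch of the helpers in
include/linux/bit_spinlock.h, not the exact code:

/* Simplified sketch: the lock is a single bit in an unsigned long and is
 * held with preemption disabled. */
static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit_lock(bitnum, addr)) {
		preempt_enable();
		do {
			cpu_relax();	/* busy-wait until the bit clears */
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
}

static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
{
	clear_bit_unlock(bitnum, addr);
	preempt_enable();
}

Because the lock state is just a bit in a word (here ->b_state), there is
no lock object for lockdep to track, and the whole critical section runs
with preemption disabled, which is what breaks on PREEMPT_RT as soon as a
sleeping spinlock is taken inside it.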

Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: remove the wrapper and use always spinlock_t]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/buffer.c                 | 19 +++++++------------
 fs/ext4/page-io.c           |  8 +++-----
 fs/ntfs/aops.c              |  9 +++------
 include/linux/buffer_head.h |  6 +++---
 4 files changed, 16 insertions(+), 26 deletions(-)

Comments

Matthew Wilcox (Oracle) Aug. 20, 2019, 5:17 p.m. UTC | #1
On Tue, Aug 20, 2019 at 07:08:18PM +0200, Sebastian Siewior wrote:
> Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
> disable preemption, which is undesired for latency reasons and breaks when
> regular spinlocks are taken within the bit_spinlock locked region because
> regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
> replaces the bit spinlocks with regular spinlocks to avoid this problem.
> Bit spinlocks are also not covered by lock debugging, e.g. lockdep.
> 
> Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> [bigeasy: remove the wrapper and use always spinlock_t]

Uhh ... always grow the buffer_head, even for non-PREEMPT_RT?  Why?
Thomas Gleixner Aug. 20, 2019, 6:01 p.m. UTC | #2
On Tue, 20 Aug 2019, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 07:08:18PM +0200, Sebastian Siewior wrote:
> > Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
> > disable preemption, which is undesired for latency reasons and breaks when
> > regular spinlocks are taken within the bit_spinlock locked region because
> > regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
> > replaces the bit spinlocks with regular spinlocks to avoid this problem.
> > Bit spinlocks are also not covered by lock debugging, e.g. lockdep.
> > 
> > Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > [bigeasy: remove the wrapper and use always spinlock_t]
> 
> Uhh ... always grow the buffer_head, even for non-PREEMPT_RT?  Why?

Christoph requested that:

  https://lkml.kernel.org/r/20190802075612.GA20962@infradead.org

Thanks,

	tglx
Sebastian Andrzej Siewior Oct. 11, 2019, 11:25 a.m. UTC | #3
On 2019-08-20 20:01:14 [+0200], Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Matthew Wilcox wrote:
> > On Tue, Aug 20, 2019 at 07:08:18PM +0200, Sebastian Siewior wrote:
> > > Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
> > > disable preemption, which is undesired for latency reasons and breaks when
> > > regular spinlocks are taken within the bit_spinlock locked region because
> > > regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
> > > replaces the bit spinlocks with regular spinlocks to avoid this problem.
> > > Bit spinlocks are also not covered by lock debugging, e.g. lockdep.
> > > 
> > > Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock.
> > > 
> > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > > [bigeasy: remove the wrapper and use always spinlock_t]
> > 
> > Uhh ... always grow the buffer_head, even for non-PREEMPT_RT?  Why?
> 
> Christoph requested that:
> 
>   https://lkml.kernel.org/r/20190802075612.GA20962@infradead.org

What do we do about this one?

> Thanks,
> 
> 	tglx

Sebastian
Jan Kara Nov. 15, 2019, 2:56 p.m. UTC | #4
On Fri 11-10-19 13:25:25, Sebastian Siewior wrote:
> On 2019-08-20 20:01:14 [+0200], Thomas Gleixner wrote:
> > On Tue, 20 Aug 2019, Matthew Wilcox wrote:
> > > On Tue, Aug 20, 2019 at 07:08:18PM +0200, Sebastian Siewior wrote:
> > > > Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
> > > > disable preemption, which is undesired for latency reasons and breaks when
> > > > regular spinlocks are taken within the bit_spinlock locked region because
> > > > regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
> > > > replaces the bit spinlocks with regular spinlocks to avoid this problem.
> > > > Bit spinlocks are also not covered by lock debugging, e.g. lockdep.
> > > > 
> > > > Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock.
> > > > 
> > > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > > > [bigeasy: remove the wrapper and use always spinlock_t]
> > > 
> > > Uhh ... always grow the buffer_head, even for non-PREEMPT_RT?  Why?
> > 
> > Christoph requested that:
> > 
> >   https://lkml.kernel.org/r/20190802075612.GA20962@infradead.org
> 
> What do we do about this one?

I was thinking about this for quite some time. In the end I think the patch
is almost fine but I'd name the lock b_update_lock and put it just after
the b_size element in struct buffer_head to use the hole there. That way we
don't grow struct buffer_head.
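
For reference, with CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_LOCK_ALLOC
disabled a spinlock_t is only 4 bytes on x86_64, the same size as the
atomic_t b_count, so the new lock can sit in an existing 4-byte alignment
hole instead of growing the structure. A toy, compile-checkable
illustration of that point (stand-in types and member names, not the real
struct buffer_head; assumes an LP64 target):

/* Twelve 8-byte members followed by a 4-byte counter leave a 4-byte tail
 * hole, so adding one more 4-byte member (the lock) keeps sizeof() the
 * same. */
struct bh_without_lock {
	void *eight_byte_members[12];	/* pointers, sector_t, size_t, ... */
	int b_count;			/* atomic_t: 4 bytes */
};					/* 100 bytes, padded to 104 */

struct bh_with_lock {
	void *eight_byte_members[12];
	int b_count;
	int b_update_lock;		/* 4-byte spinlock_t fills the hole */
};					/* still 104 bytes */

_Static_assert(sizeof(struct bh_without_lock) == 104, "104 on LP64");
_Static_assert(sizeof(struct bh_without_lock) == sizeof(struct bh_with_lock),
	       "adding the 4-byte lock does not grow the struct");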

With some effort, we could even shrink struct buffer_head from 104 bytes
(on x86_64) to 96 bytes but I don't think that effort is worth it (I'd find
it better use of time to actually work on getting rid of buffer heads
completely).

								Honza
Theodore Ts'o Nov. 15, 2019, 5:36 p.m. UTC | #5
On Fri, Nov 15, 2019 at 03:56:38PM +0100, Jan Kara wrote:
> With some effort, we could even shrink struct buffer_head from 104 bytes
> (on x86_64) to 96 bytes but I don't think that effort is worth it (I'd find
> it better use of time to actually work on getting rid of buffer heads
> completely).

Is that really realistic?  All aside from the very large number of
file systems which use buffer_heads that would have to be reworked,
the concept of buffer heads is pretty fundamental to how jbd2 is
architected.

					- Ted
Jan Kara Nov. 18, 2019, 9:35 a.m. UTC | #6
On Fri 15-11-19 12:36:34, Theodore Y. Ts'o wrote:
> On Fri, Nov 15, 2019 at 03:56:38PM +0100, Jan Kara wrote:
> > With some effort, we could even shrink struct buffer_head from 104 bytes
> > (on x86_64) to 96 bytes but I don't think that effort is worth it (I'd find
> > it better use of time to actually work on getting rid of buffer heads
> > completely).
> 
> Is that really realistic?  All aside from the very large number of
> file systems which use buffer_heads that would have to be reworked,
> the concept of buffer heads is pretty fundamental to how jbd2 is
> architected.

I think it is reasonably possible to remove buffer_heads from the data
path (including the direct IO path) of all filesystems. That way the
memory consumption of buffer_heads becomes mostly irrelevant and we can
have a look at how much of the current bh framework still makes sense...

								Honza

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 131d39ec7d316..eab37fbaa439f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -275,8 +275,7 @@  static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
 	 * decide that the page is now completely done.
 	 */
 	first = page_buffers(page);
-	local_irq_save(flags);
-	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
+	spin_lock_irqsave(&first->uptodate_lock, flags);
 	clear_buffer_async_read(bh);
 	unlock_buffer(bh);
 	tmp = bh;
@@ -289,8 +288,7 @@  static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
 		}
 		tmp = tmp->b_this_page;
 	} while (tmp != bh);
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 
 	/*
 	 * If none of the buffers had errors and they are all
@@ -302,8 +300,7 @@  static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
 	return;
 
 still_busy:
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 	return;
 }
 
@@ -331,8 +328,7 @@  void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 	}
 
 	first = page_buffers(page);
-	local_irq_save(flags);
-	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
+	spin_lock_irqsave(&first->uptodate_lock, flags);
 
 	clear_buffer_async_write(bh);
 	unlock_buffer(bh);
@@ -344,14 +340,12 @@  void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 		}
 		tmp = tmp->b_this_page;
 	}
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 	end_page_writeback(page);
 	return;
 
 still_busy:
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 	return;
 }
 EXPORT_SYMBOL(end_buffer_async_write);
@@ -3420,6 +3414,7 @@  struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 	struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
 	if (ret) {
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
+		spin_lock_init(&ret->uptodate_lock);
 		preempt_disable();
 		__this_cpu_inc(bh_accounting.nr);
 		recalc_bh_state();
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 12ceadef32c5a..7745ed23c6ad9 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -87,11 +87,10 @@  static void ext4_finish_bio(struct bio *bio)
 		}
 		bh = head = page_buffers(page);
 		/*
-		 * We check all buffers in the page under BH_Uptodate_Lock
+		 * We check all buffers in the page under uptodate_lock
 		 * to avoid races with other end io clearing async_write flags
 		 */
-		local_irq_save(flags);
-		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+		spin_lock_irqsave(&head->uptodate_lock, flags);
 		do {
 			if (bh_offset(bh) < bio_start ||
 			    bh_offset(bh) + bh->b_size > bio_end) {
@@ -103,8 +102,7 @@  static void ext4_finish_bio(struct bio *bio)
 			if (bio->bi_status)
 				buffer_io_error(bh);
 		} while ((bh = bh->b_this_page) != head);
-		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&head->uptodate_lock, flags);
 		if (!under_io) {
 			fscrypt_free_bounce_page(bounce_page);
 			end_page_writeback(page);
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index 7202a1e39d70c..14ca433b3a9e4 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -92,8 +92,7 @@  static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate)
 				"0x%llx.", (unsigned long long)bh->b_blocknr);
 	}
 	first = page_buffers(page);
-	local_irq_save(flags);
-	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
+	spin_lock_irqsave(&first->uptodate_lock, flags);
 	clear_buffer_async_read(bh);
 	unlock_buffer(bh);
 	tmp = bh;
@@ -108,8 +107,7 @@  static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate)
 		}
 		tmp = tmp->b_this_page;
 	} while (tmp != bh);
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 	/*
 	 * If none of the buffers had errors then we can set the page uptodate,
 	 * but we first have to perform the post read mst fixups, if the
@@ -142,8 +140,7 @@  static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate)
 	unlock_page(page);
 	return;
 still_busy:
-	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&first->uptodate_lock, flags);
 	return;
 }
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7f902d4..c8f2a3076ce00 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -22,9 +22,6 @@  enum bh_state_bits {
 	BH_Dirty,	/* Is dirty */
 	BH_Lock,	/* Is locked */
 	BH_Req,		/* Has been submitted for I/O */
-	BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise
-			  * IO completion of other buffers in the page
-			  */
 
 	BH_Mapped,	/* Has a disk mapping */
 	BH_New,		/* Disk mapping was newly created by get_block */
@@ -62,6 +59,9 @@  typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
  */
 struct buffer_head {
 	unsigned long b_state;		/* buffer state bitmap (see above) */
+	spinlock_t	uptodate_lock;	/* Used by the first bh in a page, to
+					 * serialise IO completion of other
+					 * buffers in the page */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */