diff mbox

[07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way

Message ID 200812090057.33890.yur@emcraft.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Yuri Tikhonov Dec. 8, 2008, 9:57 p.m. UTC
Rewrite handle_stripe_dirtying6 function to work asynchronously.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
---
 drivers/md/raid5.c |  113 ++++++++++++++--------------------------------------
 1 files changed, 30 insertions(+), 83 deletions(-)

Comments

Dan Williams Jan. 15, 2009, 9:51 p.m. UTC | #1
On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov <yur@emcraft.com> wrote:
> Rewrite handle_stripe_dirtying6 function to work asynchronously.
>
> Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
> Signed-off-by: Ilya Yanok <yanok@emcraft.com>
> ---
>  drivers/md/raid5.c |  113 ++++++++++++++--------------------------------------
>  1 files changed, 30 insertions(+), 83 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e08ed4f..f0b47bd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2485,99 +2485,46 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
>                struct stripe_head *sh, struct stripe_head_state *s,
>                struct r6_state *r6s, int disks)
>  {
> -       int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
> +       int rcw = 0, pd_idx = sh->pd_idx, i;
>        int qd_idx = r6s->qd_idx;
> +
> +       set_bit(STRIPE_HANDLE, &sh->state);
>        for (i = disks; i--; ) {
>                struct r5dev *dev = &sh->dev[i];
> -               /* Would I have to read this buffer for reconstruct_write */
> -               if (!test_bit(R5_OVERWRITE, &dev->flags)
> -                   && i != pd_idx && i != qd_idx
> -                   && (!test_bit(R5_LOCKED, &dev->flags)
> -                           ) &&
> -                   !test_bit(R5_UPTODATE, &dev->flags)) {
> -                       if (test_bit(R5_Insync, &dev->flags)) rcw++;
> -                       else {
> -                               pr_debug("raid6: must_compute: "
> -                                       "disk %d flags=%#lx\n", i, dev->flags);
> -                               must_compute++;
> +               /* check if we haven't enough data */
> +               if (!test_bit(R5_OVERWRITE, &dev->flags) &&
> +                   i != pd_idx && i != qd_idx &&
> +                   !test_bit(R5_LOCKED, &dev->flags) &&
> +                   !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                     test_bit(R5_Wantcompute, &dev->flags))) {
> +                       rcw++;
> +                       if (!test_bit(R5_Insync, &dev->flags))
> +                               continue; /* it's a failed drive */
> +
> +                       if (
> +                         test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> +                               pr_debug("Read_old stripe %llu "
> +                                       "block %d for Reconstruct\n",
> +                                    (unsigned long long)sh->sector, i);
> +                               set_bit(R5_LOCKED, &dev->flags);
> +                               set_bit(R5_Wantread, &dev->flags);
> +                               s->locked++;
> +                       } else {
> +                               pr_debug("Request delayed stripe %llu "
> +                                       "block %d for Reconstruct\n",
> +                                    (unsigned long long)sh->sector, i);
> +                               set_bit(STRIPE_DELAYED, &sh->state);
> +                               set_bit(STRIPE_HANDLE, &sh->state);

What's the reasoning behind changing the logic here, i.e. removing
must_compute and such?  I'd feel more comfortable seeing copy and
paste where possible with cleanups separated out into their own patch.

--
Dan
Dan Williams Jan. 15, 2009, 10:21 p.m. UTC | #2
On Thu, Jan 15, 2009 at 2:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov <yur@emcraft.com> wrote:
> What's the reasoning behind changing the logic here, i.e. removing
> must_compute and such?  I'd feel more comfortable seeing copy and
> paste where possible with cleanups separated out into their own patch.
>

Ok, I now see why this change was made.  Please make this changelog
more descriptive than "Rewrite handle_stripe_dirtying6 function to
work asynchronously."

Regards,
Dan
Denis ChengRq Jan. 16, 2009, 1:07 a.m. UTC | #3
On Tue, Dec 9, 2008 at 5:57 AM, Yuri Tikhonov <yur@emcraft.com> wrote:
> Rewrite handle_stripe_dirtying6 function to work asynchronously.


On Fri, Jan 16, 2009 at 6:21 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Jan 15, 2009 at 2:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov <yur@emcraft.com> wrote:
>> What's the reasoning behind changing the logic here, i.e. removing
>> must_compute and such?  I'd feel more comfortable seeing copy and
>> paste where possible with cleanups separated out into their own patch.

>
> Ok, I now see why this change was made.  Please make this changelog
> more descriptive than "Rewrite handle_stripe_dirtying6 function to
> work asynchronously."

Ack, could you please make the changelog more descriptive?
and or add some of your benchmark results?

Thanks,

--
Cheng Renquan (程任全), Shenzhen, China
Yuri Tikhonov Jan. 16, 2009, 2:24 p.m. UTC | #4
On Friday, January 16, 2009 you wrote:

> On Thu, Jan 15, 2009 at 2:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov <yur@emcraft.com> wrote:
>> What's the reasoning behind changing the logic here, i.e. removing
>> must_compute and such?  I'd feel more comfortable seeing copy and
>> paste where possible with cleanups separated out into their own patch.
>>

> Ok, I now see why this change was made.  Please make this changelog
> more descriptive than "Rewrite handle_stripe_dirtying6 function to
> work asynchronously."

 Sure, how about the following:

"

 md: rewrite handle_stripe_dirtying6 in asynchronous way

 Processing stripe dirtying in asynchronous way requires some changes 
to the handle_stripe_dirtying6() algorithm.

 In the synchronous implementation of the stripe dirtying we processed 
dirtying of a degraded stripe (with partially changed strip(s) located 
on the failed drive(s)) inside one handle_stripe_dirtying6() call:
- we computed the missed strips from the old parities, and thus got 
the fully up-to-date stripe, then
- we did reconstruction using the new data to write.

 In the asynchronous case of handle_stripe_dirtying6() we don't 
process anything right inside this function (since we under the lock), 
but only schedule the necessary operations with flags. Thus, if 
handle_stripe_dirtying6() is performed on the top of a degraded array 
we should schedule the reconstruction operation when the failed strips 
are marked (by previously called fetch_block6()) as to be computed 
(with the R5_Wantcompute flag), and all the other strips of the stripe 
are UPTODATE. The schedule_reconstruction() function will set the 
STRIPE_OP_POSTXOR flag [for new parity calculation], which is then 
handled in raid_run_ops() after the STRIPE_OP_COMPUTE_BLK one [which 
causes computing of the data missed].

"

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com
Yuri Tikhonov Jan. 16, 2009, 2:46 p.m. UTC | #5
Hello Cheng,

On Friday, January 16, 2009 you wrote:

> Ack, could you please make the changelog more descriptive?
> and or add some of your benchmark results?

 Of course. We did benchmarking using the Xdd tool like follows:

# xdd -op write -kbytes $kbytes -reqsize $reqsize -dio-passes 2 –verbose -target $target_device

 where

$kbytes = data disks * size of disk
$reqsize= data disks * chunk size
$target_device = /dev/md0

 This way we did write of full array size, and thus achieved the 
maximum performance.

 The test cases were RAID-6 built on the top of 14 S-ATA drives 
connected to 2 LSI cards (7+7) inserted into the 800 MHz Katmai board 
(based on ppc440spe) equipped with 4GB of 800 MHz DRAM .

 Here are the results (Psw - write throughput with s/w RAID-6; Phw - 
write throughput with the h/w accelerated RAID-6):

 PAGE_SIZE=4KB, chunk=64/128/256 KB
        Psw = 71/72/74 MBps
        Phw = 128/136/139 MBps

 PAGE_SIZE=16KB, chunk=256/512/1024 KB
        Psw = 81/81/82 MBps
        Phw = 205/244/239 MBps

 PAGE_SIZE=64KB, chunk=1024/2048/4096 KB
        Psw = 84/84/85 MBps
        Phw = 258/253/258 MBps

 PAGE_SIZE=256KB, chunk=4096/8192/16384 KB
        Psw = 81/83/83 MBps
        Phw = 288/275/274 MBps

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com
Dan Williams Jan. 16, 2009, 6:39 p.m. UTC | #6
On Fri, Jan 16, 2009 at 7:24 AM, Yuri Tikhonov <yur@emcraft.com> wrote:
>> Ok, I now see why this change was made.  Please make this changelog
>> more descriptive than "Rewrite handle_stripe_dirtying6 function to
>> work asynchronously."
>
>  Sure, how about the following:
>
> "
>
>  md: rewrite handle_stripe_dirtying6 in asynchronous way
>
>  Processing stripe dirtying in asynchronous way requires some changes
> to the handle_stripe_dirtying6() algorithm.
>
>  In the synchronous implementation of the stripe dirtying we processed
> dirtying of a degraded stripe (with partially changed strip(s) located
> on the failed drive(s)) inside one handle_stripe_dirtying6() call:
> - we computed the missed strips from the old parities, and thus got
> the fully up-to-date stripe, then
> - we did reconstruction using the new data to write.
>
>  In the asynchronous case of handle_stripe_dirtying6() we don't
> process anything right inside this function (since we under the lock),
> but only schedule the necessary operations with flags. Thus, if
> handle_stripe_dirtying6() is performed on the top of a degraded array
> we should schedule the reconstruction operation when the failed strips
> are marked (by previously called fetch_block6()) as to be computed
> (with the R5_Wantcompute flag), and all the other strips of the stripe
> are UPTODATE. The schedule_reconstruction() function will set the
> STRIPE_OP_POSTXOR flag [for new parity calculation], which is then
> handled in raid_run_ops() after the STRIPE_OP_COMPUTE_BLK one [which
> causes computing of the data missed].
>
> "

Excellent!

Thanks,
Dan
diff mbox

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e08ed4f..f0b47bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2485,99 +2485,46 @@  static void handle_stripe_dirtying6(raid5_conf_t *conf,
 		struct stripe_head *sh,	struct stripe_head_state *s,
 		struct r6_state *r6s, int disks)
 {
-	int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
+	int rcw = 0, pd_idx = sh->pd_idx, i;
 	int qd_idx = r6s->qd_idx;
+
+	set_bit(STRIPE_HANDLE, &sh->state);
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
-		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags)
-		    && i != pd_idx && i != qd_idx
-		    && (!test_bit(R5_LOCKED, &dev->flags)
-			    ) &&
-		    !test_bit(R5_UPTODATE, &dev->flags)) {
-			if (test_bit(R5_Insync, &dev->flags)) rcw++;
-			else {
-				pr_debug("raid6: must_compute: "
-					"disk %d flags=%#lx\n", i, dev->flags);
-				must_compute++;
+		/* check if we haven't enough data */
+		if (!test_bit(R5_OVERWRITE, &dev->flags) &&
+		    i != pd_idx && i != qd_idx &&
+		    !test_bit(R5_LOCKED, &dev->flags) &&
+		    !(test_bit(R5_UPTODATE, &dev->flags) ||
+		      test_bit(R5_Wantcompute, &dev->flags))) {
+			rcw++;
+			if (!test_bit(R5_Insync, &dev->flags))
+				continue; /* it's a failed drive */
+
+			if (
+			  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+				pr_debug("Read_old stripe %llu "
+					"block %d for Reconstruct\n",
+				     (unsigned long long)sh->sector, i);
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantread, &dev->flags);
+				s->locked++;
+			} else {
+				pr_debug("Request delayed stripe %llu "
+					"block %d for Reconstruct\n",
+				     (unsigned long long)sh->sector, i);
+				set_bit(STRIPE_DELAYED, &sh->state);
+				set_bit(STRIPE_HANDLE, &sh->state);
 			}
 		}
 	}
-	pr_debug("for sector %llu, rcw=%d, must_compute=%d\n",
-	       (unsigned long long)sh->sector, rcw, must_compute);
-	set_bit(STRIPE_HANDLE, &sh->state);
-
-	if (rcw > 0)
-		/* want reconstruct write, but need to get some data */
-		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if (!test_bit(R5_OVERWRITE, &dev->flags)
-			    && !(s->failed == 0 && (i == pd_idx || i == qd_idx))
-			    && !test_bit(R5_LOCKED, &dev->flags) &&
-			    !test_bit(R5_UPTODATE, &dev->flags) &&
-			    test_bit(R5_Insync, &dev->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					s->locked++;
-				} else {
-					pr_debug("Request delayed stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
-			}
-		}
 	/* now if nothing is locked, and if we have enough data, we can start a
 	 * write request
 	 */
-	if (s->locked == 0 && rcw == 0 &&
+	if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
+	    s->locked == 0 && rcw == 0 &&
 	    !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
-		if (must_compute > 0) {
-			/* We have failed blocks and need to compute them */
-			switch (s->failed) {
-			case 0:
-				BUG();
-			case 1:
-				compute_block_1(sh, r6s->failed_num[0], 0);
-				break;
-			case 2:
-				compute_block_2(sh, r6s->failed_num[0],
-						r6s->failed_num[1]);
-				break;
-			default: /* This request should have been failed? */
-				BUG();
-			}
-		}
-
-		pr_debug("Computing parity for stripe %llu\n",
-			(unsigned long long)sh->sector);
-		compute_parity6(sh, RECONSTRUCT_WRITE);
-		/* now every locked buffer is ready to be written */
-		for (i = disks; i--; )
-			if (test_bit(R5_LOCKED, &sh->dev[i].flags)) {
-				pr_debug("Writing stripe %llu block %d\n",
-				       (unsigned long long)sh->sector, i);
-				s->locked++;
-				set_bit(R5_Wantwrite, &sh->dev[i].flags);
-			}
-		if (s->locked == disks)
-			if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
-				atomic_inc(&conf->pending_full_writes);
-		/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
-		set_bit(STRIPE_INSYNC, &sh->state);
-
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-			    IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
+		schedule_reconstruction(sh, s, 1, 0);
 	}
 }