diff mbox

[SMS] Support new loop pattern

Message ID CAJnFk2crgO980_ho6Ba8z3rDKqWhxb-nYap_JS=Xa=jgkc43ag@mail.gmail.com
State New
Headers show

Commit Message

Roman Zhuykov Dec. 7, 2011, 2:41 p.m. UTC
Apologies for the messed up previous e-mail.

2011/10/12 Ayal Zaks <ayal.zaks@gmail.com>:
>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>
> you'd probably want to bump to next instruction if falling through,
> e.g., pc=(regF!=0)?label:pc+4
>

It is considered that program counter is increased automatically on
hardware level.
Otherwise we should add something like "pc=pc+4" in parallel to each
instruction in RTL.

>>>  flag register;
>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>  step is a constant.
>>
>>> When doloop is succesfully scheduled by SMS, its number of
>>> iterations of loop kernel should be decreased by the number of stages in a
>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>> In new supported loops such approach can't be used, because some
>>> instructions can use count register (regC).  Instead of this,
>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>> is changed to another value Y respective to the stage this instruction
>>> is scheduled (Y = X - stage * step).
>
> making sure this does not underflow; i.e., that the number of
> iterations is no less than stage (you've addressed this towards the
> end below).
>

Yes, this situation is processed correctly.

>>
>> The main difference from doloop case is that regC can be used by some
>> instructions in loop body.
>> That's why we are unable to simply adjust regC initial value, but have
>> to keep it's value correct on each particular iteration.
>> So, we change comparison instruction accordingly.
>>
>> An example:
>> int a[100];
>> int main()
>> {
>>  int i;
>>  for (i = 85; i > 12; i -= 5)
>>      a[i] = i * i;
>>  return a[15]-225;
>> }
>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> .L2:
>>        mul     r1, r3, r3
>>        sub     r3, r3, #5
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20
>>        bne     .L2
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>
>> Loop body is executed 15 times.
>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>> and following times:
>> Stage  Time       Insn
>> 0          5      mul     r1, r3, r3
>> 1         10     sub     r3, r3, #5
>> 1         11     cmp     r3, #10
>> 1         11     str     r1, [r2, #340]
>> 1         13     bne     .L2
>> 2         16     sub     r2, r2, #20
>>
>
> branch is not scheduled last?
>

Yes, branch schedule time is smaller then sub's one.
This mean that "sub r2, r2, $20" instruction from original iteration
number K will be executed after
"bne .L2" from original iteration number K.
But certainly bne remains to be the last instuction in new loop body.
Below you can see how it looks after SMS.

>> To make new schedule correct the loop body
>> should be executed 14 times and we change compare instruction:
>
> the loop itself should execute 13 times.

with i =
85, 80, 75, 70, 65
60, 55, 50, 45, 40
35, 30, 25, 20, 15
this gives total 15 iterations (15 stores to memory).
And new loop body will be executed 13 times (one store goes to
epilogue and one - to prologue).

>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>> In our example regC is r3, X is 10, step = -5, compare instruction
>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>
>
> right. In general, if the compare is on stage s (starting from 0), it
> will be executed s times in the epilog, so it should exit the loop
> upon reaching Y = X - s * step.
>
>> So, after SMS it looks like:
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> ;;prologue
>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>> ;;body
>> .L2:
>>        sub     r3, r3, #5
>>        sub     r2, r2, #20
>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3
>>        bne     .L2
>> ;;epilogue
>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>

Here in comments I mention why insn was copied to prolog and epilog.
Only branch is not copied at all.

>>> Testing of this appoach reveals two bugs, which do not appear while SMS was
>>> used only for doloop loops.  Both these bugs happen due to the nature of the
>>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
> This should ideally be solved by a dedicated (separate) patch.
> ...
> This too should be solved by a dedicated (separate) patch, for easier digestion.

As Ayal asks, I'll continue discussion of these two bugs in two
separate e-mails, answering on this letter.

>>>
>>> One more thing to point out is number of loop iterations. When number of
>>> iterations of a loop is not known at compile time, SMS has to create two loop
>>> versions (original and scheduled), and execute scheduled one only when real
>>> number of iterations is bigger than number of stages.  In doloop case the
>>> number of iterations simply equals to the count register value before the loop.
>>> So SMS finds its constant initialization or makes two loop versions.  In new
>>> supported loops number of iterations value is more complex.  It even can't be
>>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>>> this:
>>>
>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>
>>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>>> function which gives a structure with loop characteristics, some of them helps
>>> to find iteration number:
>>>
>>> rtx niter_expr - The number of iterations of the loop;
>>> bool const_iter - True if the loop iterates the constant number of times;
>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>
>>> But we can use these expressions only after looking through some other fields
>>> of returned structure:
>>>
>>> bool simple_p - True if we are able to say anything about number of iterations
>>> of the loop;
>>> rtx assumptions - Assumptions under that the rest of the information is valid;
>>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>>> the latch;
>>> rtx infinite - Condition under which the loop is infinite.
>>>
>>> I decide to allow SMS scheduling only when simple_p is true and other three
>>> fields are NULL_RTX, or when simple_p is true and
>>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>>> condition, and the next separate patch is an attempt to process it.
>>>
>
> ok, still need to go over this rather lengthy and orthogonal (although
> it exposes the bugs above) piece.
>
> Ayal.
>
>

New version is attached, it suits current trunk.
Without fixing both bugs mentioned above, this patch brokes bootstrap on x86-64.

Together with DDG fixes the patch was succesfully regtested on ARM,
and "regstrapped" on x86-64 and IA64.

--
Roman Zhuykov
zhroma@ispras.ru

Comments

Roman Zhuykov Dec. 29, 2011, 2:14 p.m. UTC | #1
Ping.
Ayal, could you review this patch and these two patches too.
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html

Happy holidays.

2011/12/7 Roman Zhuykov <zhroma@ispras.ru>:
> Apologies for the messed up previous e-mail.
>
> 2011/10/12 Ayal Zaks <ayal.zaks@gmail.com>:
>>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>>
>> you'd probably want to bump to next instruction if falling through,
>> e.g., pc=(regF!=0)?label:pc+4
>>
>
> It is considered that program counter is increased automatically on
> hardware level.
> Otherwise we should add something like "pc=pc+4" in parallel to each
> instruction in RTL.
>
>>>>  flag register;
>>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>>  step is a constant.
>>>
>>>> When doloop is succesfully scheduled by SMS, its number of
>>>> iterations of loop kernel should be decreased by the number of stages in a
>>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>>> In new supported loops such approach can't be used, because some
>>>> instructions can use count register (regC).  Instead of this,
>>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>>> is changed to another value Y respective to the stage this instruction
>>>> is scheduled (Y = X - stage * step).
>>
>> making sure this does not underflow; i.e., that the number of
>> iterations is no less than stage (you've addressed this towards the
>> end below).
>>
>
> Yes, this situation is processed correctly.
>
>>>
>>> The main difference from doloop case is that regC can be used by some
>>> instructions in loop body.
>>> That's why we are unable to simply adjust regC initial value, but have
>>> to keep it's value correct on each particular iteration.
>>> So, we change comparison instruction accordingly.
>>>
>>> An example:
>>> int a[100];
>>> int main()
>>> {
>>>  int i;
>>>  for (i = 85; i > 12; i -= 5)
>>>      a[i] = i * i;
>>>  return a[15]-225;
>>> }
>>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> .L2:
>>>        mul     r1, r3, r3
>>>        sub     r3, r3, #5
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20
>>>        bne     .L2
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>>> Loop body is executed 15 times.
>>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>>> and following times:
>>> Stage  Time       Insn
>>> 0          5      mul     r1, r3, r3
>>> 1         10     sub     r3, r3, #5
>>> 1         11     cmp     r3, #10
>>> 1         11     str     r1, [r2, #340]
>>> 1         13     bne     .L2
>>> 2         16     sub     r2, r2, #20
>>>
>>
>> branch is not scheduled last?
>>
>
> Yes, branch schedule time is smaller then sub's one.
> This mean that "sub r2, r2, $20" instruction from original iteration
> number K will be executed after
> "bne .L2" from original iteration number K.
> But certainly bne remains to be the last instuction in new loop body.
> Below you can see how it looks after SMS.
>
>>> To make new schedule correct the loop body
>>> should be executed 14 times and we change compare instruction:
>>
>> the loop itself should execute 13 times.
>
> with i =
> 85, 80, 75, 70, 65
> 60, 55, 50, 45, 40
> 35, 30, 25, 20, 15
> this gives total 15 iterations (15 stores to memory).
> And new loop body will be executed 13 times (one store goes to
> epilogue and one - to prologue).
>
>>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>>> In our example regC is r3, X is 10, step = -5, compare instruction
>>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>>
>>
>> right. In general, if the compare is on stage s (starting from 0), it
>> will be executed s times in the epilog, so it should exit the loop
>> upon reaching Y = X - s * step.
>>
>>> So, after SMS it looks like:
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> ;;prologue
>>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>>> ;;body
>>> .L2:
>>>        sub     r3, r3, #5
>>>        sub     r2, r2, #20
>>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3
>>>        bne     .L2
>>> ;;epilogue
>>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>>
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>
> Here in comments I mention why insn was copied to prolog and epilog.
> Only branch is not copied at all.
>
>>>> Testing of this appoach reveals two bugs, which do not appear while SMS was
>>>> used only for doloop loops.  Both these bugs happen due to the nature of the
>>>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
>> This should ideally be solved by a dedicated (separate) patch.
>> ...
>> This too should be solved by a dedicated (separate) patch, for easier digestion.
>
> As Ayal asks, I'll continue discussion of these two bugs in two
> separate e-mails, answering on this letter.
>
>>>>
>>>> One more thing to point out is number of loop iterations. When number of
>>>> iterations of a loop is not known at compile time, SMS has to create two loop
>>>> versions (original and scheduled), and execute scheduled one only when real
>>>> number of iterations is bigger than number of stages.  In doloop case the
>>>> number of iterations simply equals to the count register value before the loop.
>>>> So SMS finds its constant initialization or makes two loop versions.  In new
>>>> supported loops number of iterations value is more complex.  It even can't be
>>>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>>>> this:
>>>>
>>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>>
>>>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>>>> function which gives a structure with loop characteristics, some of them helps
>>>> to find iteration number:
>>>>
>>>> rtx niter_expr - The number of iterations of the loop;
>>>> bool const_iter - True if the loop iterates the constant number of times;
>>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>>
>>>> But we can use these expressions only after looking through some other fields
>>>> of returned structure:
>>>>
>>>> bool simple_p - True if we are able to say anything about number of iterations
>>>> of the loop;
>>>> rtx assumptions - Assumptions under that the rest of the information is valid;
>>>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>>>> the latch;
>>>> rtx infinite - Condition under which the loop is infinite.
>>>>
>>>> I decide to allow SMS scheduling only when simple_p is true and other three
>>>> fields are NULL_RTX, or when simple_p is true and
>>>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>>>> condition, and the next separate patch is an attempt to process it.
>>>>
>>
>> ok, still need to go over this rather lengthy and orthogonal (although
>> it exposes the bugs above) piece.
>>
>> Ayal.
>>
>>
>
> New version is attached, it suits current trunk.
> Without fixing both bugs mentioned above, this patch brokes bootstrap on x86-64.
>
> Together with DDG fixes the patch was succesfully regtested on ARM,
> and "regstrapped" on x86-64 and IA64.
>
> --
> Roman Zhuykov
> zhroma@ispras.ru
Roman Zhuykov Feb. 10, 2012, 12:15 p.m. UTC | #2
Ping.
Ayal, please review this patch and these three patches too:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html

--
Roman Zhuykov
zhroma@ispras.ru
Andrey Belevantsev March 29, 2012, 12:37 p.m. UTC | #3
Hello,

I'd like to ping again those SMS patches once we're back to Stage 1.

Ayal, maybe it would remove some burden for you if you'd review the general 
SMS functionality of those patches, and we'd ask RTL folks to look at the 
pieces related to RTL pattern matching and generation?

Yours,
Andrey

On 10.02.2012 16:15, Roman Zhuykov wrote:
> Ping.
> Ayal, please review this patch and these three patches too:
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>
> --
> Roman Zhuykov
> zhroma@ispras.ru
Ayal Zaks March 30, 2012, 11:20 p.m. UTC | #4
Roman, Andrey,

Sorry for the delayed response.

It would indeed be good to have SMS apply to more loop patterns, still
within the realm of *countable* loops. SMS was originally designed to
handle doloops, with a specific pattern controlling the loop, easily
identified and separable from the loop's body. The newly proposed
change to support new loop patterns is pretty invasive and sizable,
taking place entirely within modulo-sched.c. The main issue I've been
considering, is whether it would be possible instead to transform the
new loop patterns we want SMS to handle, into doloops (potentially
introducing additional induction variables to feed other uses), and
then feed the resulting loop into SMS as is? In other words, could you
fold it into doloop.c? And if so, will doing so introduce significant
overheads?

2012/3/29 Andrey Belevantsev <abel@ispras.ru>:
> Hello,
>
> I'd like to ping again those SMS patches once we're back to Stage 1.
>
> Ayal, maybe it would remove some burden for you if you'd review the general
> SMS functionality of those patches, and we'd ask RTL folks to look at the
> pieces related to RTL pattern matching and generation?
>

It definitely would ... especially in light of the above issue.
Thanks (for your patches, patience, pings..),
Ayal.



> Yours,
> Andrey
>
>
> On 10.02.2012 16:15, Roman Zhuykov wrote:
>>
>> Ping.
>> Ayal, please review this patch and these three patches too:
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>>
>> --
>> Roman Zhuykov
>> zhroma@ispras.ru
Andrey Belevantsev April 10, 2012, 8:23 a.m. UTC | #5
Hello Ayal,

First of all, thanks for your feedback.  Now to your questions:

On 31.03.2012 3:20, Ayal Zaks wrote:
> Roman, Andrey,
>
> Sorry for the delayed response.
>
> It would indeed be good to have SMS apply to more loop patterns, still
> within the realm of *countable* loops. SMS was originally designed to
> handle doloops, with a specific pattern controlling the loop, easily
> identified and separable from the loop's body. The newly proposed
> change to support new loop patterns is pretty invasive and sizable,
> taking place entirely within modulo-sched.c. The main issue I've been
> considering, is whether it would be possible instead to transform the
> new loop patterns we want SMS to handle, into doloops (potentially
> introducing additional induction variables to feed other uses), and
> then feed the resulting loop into SMS as is? In other words, could you
> fold it into doloop.c? And if so, will doing so introduce significant
> overheads?

Let me perhaps explain better.  The patch itself is one core patch (this 
thread) adding the new functionality on detecting more complex loop 
patterns and the three fixes to SMS found while working on the main patch 
(the fixes are in the mails pinged at the very end of this message).  The 
three fixes are worthwhile to commit separately anyways, they are splitted 
up from the main patch for this purpose, so I would suggest to consider 
them in any case.

For the main patch, its core is as small as we could get.  It stays with 
the countable loops as for the cases where we could get overflow behavior 
or infinite loops we bail out early.  We handle only a case of simple 
same-step affine counters.  The main reason why we add support to SMS and 
not to the doloop pass are is when we do not pipeline a loop newly 
transformed to the doloop form, this loop actually slows down on the 
platforms not having a true doloop pattern.  One has to undo the doloop 
form and to get back to the original loop form to avoid this, which seems 
rather strange.  Also, the separate decrement insn that changes the 
induction variable is better be scheduled to get more precise schedule. 
And yes, I believe that making an extra induction variable just to have the 
control part without uses in the loop core will be unnecessary overhead.

Thus, I believe that if we do want SMS to handle more complex loop, then it 
is inevitable that SMS itself would be somewhat more complex.  I would 
welcome your suggestions to make the patch more clear.  One way I see is 
that the function for getting the condition of the new loop form can be 
moved to the generic RTL loop code given the agreement of other RTL 
maintainers.  Also, some new helpers can be introduced for handling this 
specific loop forms.  But it seems that the distinction between 
doloop/non-doloop loops has to stay in the code.

Yours,
Andrey
>
> 2012/3/29 Andrey Belevantsev<abel@ispras.ru>:
>> Hello,
>>
>> I'd like to ping again those SMS patches once we're back to Stage 1.
>>
>> Ayal, maybe it would remove some burden for you if you'd review the general
>> SMS functionality of those patches, and we'd ask RTL folks to look at the
>> pieces related to RTL pattern matching and generation?
>>
>
> It definitely would ... especially in light of the above issue.
> Thanks (for your patches, patience, pings..),
> Ayal.
>
>
>
>> Yours,
>> Andrey
>>
>>
>> On 10.02.2012 16:15, Roman Zhuykov wrote:
>>>
>>> Ping.
>>> Ayal, please review this patch and these three patches too:
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>>>
>>> --
>>> Roman Zhuykov
>>> zhroma@ispras.ru
diff mbox

Patch

2011-12-07  Roman Zhuykov  <zhroma@ispras.ru>
	* modulo-sched.c (nondoloop_register_get): New function.
	(const_iteration_count): Rename to ...
	(search_const_init): ...this.  Add new parameter (is_const).  Always
	return register initialization rtx and set is_const to true
	only when it is constant.
	(duplicate_insns_of_cycles): Add new parameter (doloop_p).  Do not
	duplicate instructions with count_reg only when doloop_p is set.
	Update all callers.
	(generate_prolog_epilog): Add new parameters.  Correctly generate loop
	prologue for new loop pattern.
	(sms_schedule): Support new loop pattern.
---

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 0ea9a4d..e62aca7 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -220,7 +220,8 @@  static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    rtx, bool, bool, rtx, HOST_WIDEST_INT,
+                                    bool, HOST_WIDEST_INT, rtx *);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -255,7 +256,7 @@  typedef struct node_sched_params node_sched_params;
 DEF_VEC_O (node_sched_params);
 DEF_VEC_ALLOC_O (node_sched_params, heap);
 
-/* The following three functions are copied from the current scheduler
+/* The following two functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
    They are used when initializing the sched_info structure.  */
 static const char *
@@ -398,37 +399,164 @@  doloop_register_get (rtx head ATTRIBUTE_UNUSED, rtx tail ATTRIBUTE_UNUSED)
 #endif
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
+/* Same as previous for loop with always-the-same-step counter.  */
 static rtx
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       HOST_WIDEST_INT * count)
+nondoloop_register_get (rtx head, rtx tail, int cmp_side,
+			rtx *addsub_output, rtx *cmp_output)
+{
+  rtx insn, reg, flagreg, addsub, cmp, end;
+
+  /* Check jump instruction form */
+  insn = single_set (tail);
+  if (insn == NULL_RTX
+      || SET_DEST (insn) != pc_rtx
+      || GET_CODE (SET_SRC (insn)) != IF_THEN_ELSE
+      || GET_CODE (XEXP (SET_SRC (insn), 1)) != LABEL_REF
+      || XEXP (SET_SRC (insn), 2) != pc_rtx)
+    return NULL_RTX;
+
+  /* Check loop exit condition */
+  insn = XEXP (SET_SRC (insn), 0);
+  if (GET_CODE (insn) != NE || XEXP (insn, 1) != const0_rtx)
+    return NULL_RTX;
+
+  /* Flags register */
+  flagreg = XEXP (insn, 0);
+
+  /* Searching comparison instruction */
+  cmp = PREV_INSN (tail);
+  while (cmp != PREV_INSN (head))
+    {
+      if (INSN_P (cmp) && reg_set_p (flagreg, cmp))
+        break;
+      cmp = PREV_INSN (cmp);
+    }
+  if (cmp == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Check comparison */
+  insn = single_set (cmp);
+  if (insn == NULL_RTX
+      || ! rtx_equal_p (flagreg, SET_DEST (insn))
+      || GET_CODE (SET_SRC (insn)) != COMPARE)
+    return NULL_RTX;
+
+  /* Loop register */
+  gcc_assert (0 <= cmp_side && cmp_side <= 1);
+  reg = XEXP (SET_SRC (insn), cmp_side);
+  if (! REG_P (reg))
+    return NULL_RTX;
+
+  /* End value */
+  end = XEXP (SET_SRC (insn), 1 - cmp_side);
+  if (! REG_P (end) && ! CONST_INT_P (end))
+    return NULL_RTX;
+
+  /* Searching register add\sub instruction */
+  addsub = PREV_INSN (cmp);
+  while (addsub != PREV_INSN (head))
+    {
+      if (INSN_P (addsub) && reg_set_p (reg, addsub))
+        break;
+      addsub = PREV_INSN (addsub);
+    }
+  if (addsub == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Checking register change instruction */
+  insn = single_set (addsub);
+  if (insn == NULL_RTX || ! rtx_equal_p (reg, SET_DEST (insn)))
+    return NULL_RTX;
+  insn = SET_SRC (insn);
+  if ((GET_CODE (insn) != PLUS && GET_CODE (insn) != MINUS)
+      || ! rtx_equal_p (reg, XEXP (insn, 0))
+      || ! (CONST_INT_P (XEXP (insn, 1))))
+    return NULL_RTX;
+
+  /* No other REG and END (if reg) modifications allowed */
+  for (insn = head; insn != tail; insn = NEXT_INSN (insn))
+    {
+      if (REG_P(end) && reg_set_p (end, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS end register found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+	  return NULL_RTX;
+	}
+      if (insn != addsub && reg_set_p (reg, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS count_reg found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+          return NULL_RTX;
+        }
+    }
+
+  *addsub_output = addsub;
+  *cmp_output = cmp;
+  return reg;
+}
+
+/* Check if REG is set to a constant in the PRE_HEADER block.
+   If possible to find, return the rtx that sets REG.
+   If REG is set to a constant (probably not directly),
+   set IS_CONST to true and VALUE to that constant value.  */
+static rtx
+search_const_init (basic_block pre_header, rtx reg, bool *is_const,
+		   HOST_WIDEST_INT *value)
 {
   rtx insn;
   rtx head, tail;
 
-  if (! pre_header)
-    return NULL_RTX;
+  if (!pre_header)
+    {
+      *is_const = false;
+      return NULL_RTX;
+    }
 
   get_ebb_head_tail (pre_header, pre_header, &head, &tail);
 
   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
     if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
+	rtx_equal_p (reg, SET_DEST (single_set (insn))))
       {
-	rtx pat = single_set (insn);
+	rtx src, pat = single_set (insn);
+	src = SET_SRC (pat);
 
-	if (CONST_INT_P (SET_SRC (pat)))
+	if (CONST_INT_P (src))
 	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
+	    *is_const = true;
+	    *value = INTVAL (src);
+	  }
+	else if (REG_P (src))
+	  { /* Check if previous insn sets SRC = constant.  */
+	    pat = single_set (PREV_INSN (insn));
+	    if (pat != NULL_RTX && rtx_equal_p (src, SET_DEST (pat))
+		&& CONST_INT_P (SET_SRC (pat)))
+	      {
+		*is_const = true;
+		*value = INTVAL (SET_SRC (pat));
+	      }
+	    else
+		*is_const = false;
 	  }
+	else
+	  *is_const = false;
 
-	return NULL_RTX;
+	return insn;
       }
+    else if (reg_set_p (reg, insn))
+      break;
 
+  *is_const = false;
   return NULL_RTX;
 }
 
@@ -1103,7 +1231,7 @@  clear:
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, rtx count_reg)
+			   int to_stage, rtx count_reg, bool doloop_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1115,14 +1243,14 @@  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int first_u, last_u;
 	rtx u_insn;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
+        /* In doloop case do not duplicate any insn which refers
+	   to count_reg as it belongs to the control part.
            The closing branch is scheduled as well and thus should
            be ignored.
            TODO: This should be done by analyzing the control part of
            the loop.  */
 	u_insn = ps_rtl_insn (ps, u);
-        if (reg_mentioned_p (count_reg, u_insn)
+        if ((doloop_p && reg_mentioned_p (count_reg, u_insn))
             || JUMP_P (u_insn))
           continue;
 
@@ -1142,7 +1270,10 @@  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        rtx count_reg, bool doloop_p, bool count_init_isconst,
+			rtx fin_reg, HOST_WIDEST_INT fin_nonconst_adjust,
+			bool create_reg, HOST_WIDEST_INT reg_val,
+			rtx *created_reg)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
@@ -1151,12 +1282,12 @@  generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
+  if (doloop_p && !count_init_isconst)
     {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
+      /* In doloop we generate instructions at the beginning of the prolog to
+         adjust the initial value of doloop counter by STAGE_COUNT.
+	 If loop count is constant, this adjustment is done outside this
+         function, simply correcting the source of initialization insn.  */
       rtx sub_reg = NULL_RTX;
 
       sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS,
@@ -1167,8 +1298,40 @@  generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
         emit_move_insn (count_reg, sub_reg);
     }
 
+  if (!doloop_p)
+    {
+      /* In non-doloop we generate instructions at the beginning of
+         the prolog to adjust the final value (with this value loop count
+	 register is compared to check whether the loop should stop).  */
+      if (fin_nonconst_adjust != 0)
+	{
+	  /* If the final value is in a register - create another register
+	     to store a shifted value.  */
+	  rtx new_reg, reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (fin_reg));
+	  new_reg = expand_simple_binop (GET_MODE (fin_reg), MINUS, fin_reg,
+					 GEN_INT (fin_nonconst_adjust),
+					 reg, 0, OPTAB_DIRECT);
+	  gcc_assert (REG_P (new_reg));
+	  if (REGNO (new_reg) != REGNO (reg))
+	    emit_move_insn (reg, new_reg);
+	  *created_reg = new_reg;
+	}
+      else if (create_reg)
+	{
+	  /* If old final value is an immediate, and the new one can't be
+	     an immediate, we create a register to store it.  If both values
+	     are immediate the adjustment is done outside this fuction,
+	     just correcting the constant value in compare intruction.  */
+	  rtx reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (count_reg));
+	  emit_move_insn (reg, GEN_INT (reg_val));
+	  *created_reg = reg;
+	}
+    }
+
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg, doloop_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1182,7 +1345,7 @@  generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg, doloop_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1460,13 +1623,30 @@  sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
+      /* Is this a doloop?  */
+      if ((count_reg = doloop_register_get (head, tail)))
+        {
+	  if (dump_file)
+	    fprintf (dump_file, "SMS doloop\n");
+        }
+      else if ((count_reg = nondoloop_register_get (head, tail, 0,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop\n");
+	}
+      else if ((count_reg = nondoloop_register_get (head, tail, 1,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop with transposed cmp\n");
+	}
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS imcompatible loop\n");
+	  continue;
+	}
 
       /* Don't handle BBs with calls or barriers
 	 or !single_set with the exception of instructions that include
@@ -1516,7 +1696,6 @@  sms_schedule (void)
 	    fprintf (dump_file, "SMS create_ddg failed\n");
 	  continue;
         }
-
       g_arr[loop->num] = g;
       if (dump_file)
         fprintf (dump_file, "...OK\n");
@@ -1528,14 +1707,28 @@  sms_schedule (void)
     fprintf (dump_file, "=========================\n\n");
   }
 
+  df_clear_flags (DF_LR_RUN_DCE);
+
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (li, loop, 0)
     {
+      bool doloop_p, count_fin_isconst, count_init_isconst;
+      bool was_immediate = false;
+      bool prolog_create_reg = false;
+      int prolog_fin_nonconst_adjust = 0;
+      bool nonsimple_loop = false;
       rtx head, tail;
-      rtx count_reg, count_init;
-      int mii, rec_mii, stage_count, min_cycle;
-      HOST_WIDEST_INT loop_count = 0;
+      int min_cycle;
       bool opt_sc_p;
+      rtx count_reg, count_fin_reg, new_comp_reg = NULL_RTX;
+      rtx count_init_insn, count_fin_init_insn;
+      rtx add, cmp;
+      int mii, rec_mii, cmp_side = -1, cmp_stage = -1;
+      int stage_count = 0;
+      HOST_WIDEST_INT count_init_val = 0, count_fin_val = 0;
+      HOST_WIDEST_INT count_step = 0, loop_count = -1;
+      HOST_WIDEST_INT count_fin_newval = 0;
+      struct niter_desc *desc = NULL;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1573,32 +1766,159 @@  sms_schedule (void)
 	               (HOST_WIDEST_INT) profile_info->sum_max);
 	      fprintf (dump_file, "\n");
 	    }
-	  fprintf (dump_file, "SMS doloop\n");
 	  fprintf (dump_file, "SMS built-ddg %d\n", g->num_nodes);
           fprintf (dump_file, "SMS num-loads %d\n", g->num_loads);
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
 
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL_RTX;
-      if ((count_reg = doloop_register_get (head, tail)))
+      /* Extract count register and determine loop type.  */
+      add = NULL_RTX;
+      cmp = NULL_RTX;
+      if ((count_reg = doloop_register_get (head, tail))
+	  || (count_reg = nondoloop_register_get (head, tail, 0, &add, &cmp))
+	  || (count_reg = nondoloop_register_get (head, tail, 1, &add, &cmp)))
 	{
-	  basic_block pre_header;
+	  basic_block pre_header = loop_preheader_edge (loop)->src;
+
+	  doloop_p = (cmp == NULL_RTX);
+	  if (doloop_p)
+	    {
+	      /* Doloop finish parameters are always the same.  */
+	      count_step = -1;
+	      count_fin_isconst = true;
+	      count_fin_val = 0;
+	      count_fin_reg = NULL_RTX;
+	      count_fin_init_insn = NULL_RTX;
+	    }
+	  else
+	    {
+	      /* In other loop we need to determine counter step
+	         and finish parameters.  */
+	      rtx step, end;
+
+	      gcc_assert (single_set (add) && single_set (cmp));
+
+	      /* Extract the step.  */
+	      step = XEXP (SET_SRC (single_set (add)), 1);
+	      gcc_assert (CONST_INT_P (step));
+
+	      if (GET_CODE (SET_SRC (single_set (add))) == MINUS)
+	        count_step = - INTVAL (step);
+	      else if (GET_CODE (SET_SRC (single_set (add))) == PLUS)
+	        count_step = INTVAL (step);
+	      else
+		gcc_unreachable ();
+
+	      gcc_assert(count_step != 0);
+
+	      /* Check what operand of compare insn is a counter register.  */
+	      if (count_reg == XEXP (SET_SRC (single_set (cmp)), 0))
+		cmp_side = 0;
+	      else if (count_reg == XEXP (SET_SRC (single_set (cmp)), 1))
+		cmp_side = 1;
+	      else
+		gcc_unreachable ();
+
+	      /* Extract finish border for counter reg.  */
+	      end = XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side);
 
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
+	      if (CONST_INT_P (end))
+		{
+		  /* Constant finish border.  loop until (reg != const).  */
+		  count_fin_isconst = true;
+		  count_fin_val = INTVAL (end);
+		  count_fin_reg = NULL_RTX;
+		  count_fin_init_insn = NULL_RTX;
+		}
+	      else if (REG_P (end))
+		{
+		  /* Register is a border.  Loop until (reg != fin_reg).  */
+		  count_fin_reg = end;
+		  count_fin_isconst = false;
+		  /* Try to find constant initinalization of fin_reg
+		   * in preheader.  */
+		  count_fin_init_insn = search_const_init (pre_header,
+							   count_fin_reg,
+							   &count_fin_isconst,
+							   &count_fin_val);
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  /* Try to find a constant initalization of count_reg in preheader.  */
+	  count_init_insn = search_const_init (pre_header,
+					       count_reg,
+					       &count_init_isconst,
+					       &count_init_val);
+	}
+      else /* Loop is incompatible now, but it was OK on while analyzing!  */
+	gcc_assert (count_reg);
+
+
+      desc = get_simple_loop_desc (loop);
+      gcc_assert (desc);
+      /* nonsimple_loop means it's impossible to analyze the loop
+         or there are some assumptions to make the analyzis results right
+         or there is a condition of non-infinite number of iterations.
+        We want doloops to be scheduled even if analyzis shows they are
+	 nonsimple (backward compatibility).  */
+      nonsimple_loop = !desc->simple_p;
+      /* We allow scheduling loop with some assumptions or infinite condition
+	 only when unsafe_loop_optimizations flag is enabled.  */
+      if (flag_unsafe_loop_optimizations)
+	 {
+	   desc->infinite = NULL_RTX;
+	   desc->assumptions = NULL_RTX;
+	   desc->noloop_assumptions = NULL_RTX;
+	 }
+      nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
+			|| (desc->noloop_assumptions != NULL_RTX)
+			|| (desc->infinite != NULL_RTX);
+      /* Only doloops can be nonsimple_loops for SMS.  */
+      if (nonsimple_loop && !doloop_p)
+	{
+	  free_ddg (g);
+	  continue;
+	}
+      /* Manually set some description fields in non-simple doloop.  */
+      if (nonsimple_loop)
+	{
+	  gcc_assert(doloop_p);
+	  desc->const_iter = false;
+	  desc->infinite = NULL_RTX;
 	}
-      gcc_assert (count_reg);
 
-      if (dump_file && count_init)
+      if (desc->const_iter)
+	{
+	  gcc_assert (!desc->infinite);
+	  loop_count = desc->niter;
+	  if (dump_file)
+	    fprintf (dump_file, "SMS const loop iterations = "
+		     HOST_WIDEST_INT_PRINT_DEC "\n", loop_count);
+	}
+      if (count_init_isconst && count_fin_isconst)
         {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC,
-		     loop_count);
-          fprintf (dump_file, "\n");
+	  gcc_assert (doloop_p || desc->const_iter);
+	  if (doloop_p)
+	    {
+	      if (nonsimple_loop)
+		{
+	          loop_count = count_init_val;
+		  desc->const_iter = true;
+		}
+              gcc_assert (desc->const_iter && loop_count == count_init_val);
+	    }
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "SMS const-%s ",
+		       doloop_p ? "doloop" : "loop");
+	      fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC " to "
+		       HOST_WIDEST_INT_PRINT_DEC " step "
+		       HOST_WIDEST_INT_PRINT_DEC,
+		       count_init_val, count_fin_val, count_step);
+	      fprintf (dump_file, "\n");
+	    }
         }
 
       node_order = XNEWVEC (int, g->num_nodes);
@@ -1649,7 +1969,7 @@  sms_schedule (void)
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
 	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
-	      || (count_init && (loop_count <= stage_count))
+	      || (desc->const_iter && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
 	      if (dump_file)
@@ -1709,23 +2029,24 @@  sms_schedule (void)
 	      print_partial_schedule (ps, dump_file);
 	    }
  
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
-	  				     GEN_INT(stage_count));
-	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
-			       * REG_BR_PROB_BASE) / 100;
-
-	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
-			    true);
-	     }
+	  if (!desc->const_iter)
+	    {
+	      /* Loop versioning if the number of iterations is unknown.  */
+	      unsigned prob;
+	      rtx vers_cond;
+	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					  count_reg : desc->niter_expr,
+					  GEN_INT (stage_count));
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "\nLoop versioning condition:\n");
+		  print_rtl_single (dump_file, vers_cond);
+		}
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	      prob = (PROB_SMS_ENOUGH_ITERATIONS * REG_BR_PROB_BASE) / 100;
+	      loop_version (loop, vers_cond, &condition_bb, prob,
+			    prob, REG_BR_PROB_BASE - prob, true);
+	    }
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1741,8 +2062,121 @@  sms_schedule (void)
 	  apply_reg_moves (ps);
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, ps);
-	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+
+	  if (doloop_p && count_init_isconst)
+	    {
+	      /* Change counter reg initialization constant. In more complex
+	         cases this adjustment is done with adding some insns
+		 to loop prologue in generate_prolog_epilog function.  */
+	      gcc_assert (single_set (count_init_insn) != NULL_RTX);
+	      SET_SRC (single_set (count_init_insn))
+		    = GEN_INT (count_init_val - stage_count + 1);
+	      df_insn_rescan (count_init_insn);
+	    }
+
+	  if (!doloop_p)
+	    {
+	      /* Calculation of the compare insn stage in schedule.  */
+	      ps_insn_ptr crr_insn;
+	      int row, stage;
+	      cmp_stage = -1;
+	      for (row = 0; row < ps->ii; row++)
+		for (crr_insn = ps->rows[row];
+		     crr_insn;
+		     crr_insn = crr_insn->next_in_row)
+		  {
+		    stage = SCHED_STAGE (crr_insn->id);
+		    gcc_assert (0 <= stage && stage < stage_count);
+		    if (rtx_equal_p (ps_rtl_insn (ps, crr_insn->id), cmp))
+		      {
+			gcc_assert (cmp_stage == -1);
+		        cmp_stage = stage;
+		      }
+		  }
+              if (dump_file)
+		fprintf (dump_file, "cmp_stage=%d\n", cmp_stage);
+	      gcc_assert (cmp_stage >= 0);
+	    }
+
+	  /* When compare insn stage is non-zero we are to shift the final
+	     counter reg value (which counter is compared to exit loop).
+	     Final value can be an immediate or can be a register, which
+	     constant initialization we find in preheader.  */
+	  was_immediate = false;
+	  if (!doloop_p && count_fin_isconst && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      /* New finish value.  */
+	      count_fin_newval = count_fin_val - count_step * cmp_stage;
+	      was_immediate = CONST_INT_P (XEXP (SET_SRC (single_set (cmp)),
+							  1 - cmp_side));
+	      if (was_immediate)
+		{
+		  /* Check whether new value also can be an immediate.
+		     For exapmle, on ARM not all values can be encoded as
+		     an immediate, so we have to load it to a register once
+		     before the loop starts.  */
+		  rtx to = GEN_INT (count_fin_newval);
+		  prolog_create_reg = rtx_cost (to, GET_CODE (to), 0, false)
+			    > rtx_cost (GEN_INT(1), CONST_INT, 0, false);
+	        }
+	      else
+		{
+		  /* A value is already in a register and we easily change
+		     initialization instruction in preheader.  */
+		  gcc_assert (count_fin_init_insn);
+		  SET_SRC (single_set (count_fin_init_insn))
+			= GEN_INT (count_fin_newval);
+		  df_insn_rescan (count_fin_init_insn);
+		}
+	    }
+
+	  /* The adjustment of finish register value.
+	     Zero means no adjustment needed or adjusment is done
+	     without additional insn in prologue.  */
+	  if (!doloop_p && !count_fin_isconst)
+	    prolog_fin_nonconst_adjust = count_step * cmp_stage;
+
+	  /* Ready to generate prolog and epilog.  */
+	  generate_prolog_epilog (ps, loop, count_reg, doloop_p,
+			          count_init_isconst, count_fin_reg,
+				  prolog_fin_nonconst_adjust,
+				  prolog_create_reg, count_fin_newval,
+				  &new_comp_reg);
+
+	  /* And only after generating prolog and epilog it is possible
+	     to modify the compare instruction (to prevent copying wrong insn
+	     form to first and last stages).  */
+	  if (!doloop_p && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      if (was_immediate && !prolog_create_reg)
+		{
+		/* Easy case - just modify a constant.  */
+		  gcc_assert (new_comp_reg == NULL_RTX);
+		  XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			= GEN_INT (count_fin_newval);
+		}
+	      else
+		{
+		  if (count_fin_isconst && !was_immediate)
+		    /* Value is in a register and we already changed
+		       initialization instruction in preheader.  */
+		    gcc_assert (new_comp_reg == NULL_RTX);
+		  else
+		    {
+		      /* Another case - use created by generate_prolog_epilog
+		         register, which value is initialized in prologue.  */
+		      gcc_assert (new_comp_reg != NULL_RTX);
+		      XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			      = new_comp_reg;
+		    }
+		}
+	      df_insn_rescan (cmp);
+	    }
+	  else
+	    gcc_assert (new_comp_reg == NULL_RTX);
+
 	  break;
 	}
 
@@ -1752,7 +2186,9 @@  sms_schedule (void)
       free_ddg (g);
     }
 
+  df_set_flags (DF_LR_RUN_DCE);
   free (g_arr);
+  iv_analysis_done ();
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();