
[v2,5/7] sm501: Replace hand written implementation with pixman where possible

Message ID 58666389b6cae256e4e972a32c05cf8aa51bffc0.1590089984.git.balaton@eik.bme.hu
State New
Series Misc display/sm501 clean ups and fixes

Commit Message

BALATON Zoltan May 21, 2020, 7:39 p.m. UTC
Besides being faster, this should also prevent malicious guests from
abusing the 2D engine to overwrite data or cause a crash.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
 hw/display/sm501.c | 207 ++++++++++++++++++++++++++-------------------
 1 file changed, 119 insertions(+), 88 deletions(-)

Comments

Gerd Hoffmann May 26, 2020, 10:43 a.m. UTC | #1
On Thu, May 21, 2020 at 09:39:44PM +0200, BALATON Zoltan wrote:
> Besides being faster this should also prevent malicious guests to
> abuse 2D engine to overwrite data or cause a crash.

>          uint32_t src_base = s->twoD_source_base & 0x03FFFFFF;
> -        uint8_t *src = s->local_mem + src_base;

> -                    val = *(_pixel_type *)&src[index_s];                      \

Well, the advantage of *not* using pixman is that you can easily switch
the code to use offsets instead of pointers, then apply the mask to the
*final* offset to avoid oob data access:

    val = *(_pixel_type*)(&s->local_mem[(s->twoD_source_base + index_s) & 0x03FFFFFF]);
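
To illustrate the offsets-plus-mask idea in isolation, here is a minimal
sketch. It assumes the memory size is a power of two and uses a hypothetical
helper name, read_pixel32_masked(), that does not exist in sm501.c. It also
aligns the offset so that a 4-byte read cannot run past the end of the
buffer, which the one-line sketch above does not do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of sm501.c: read one 32-bit pixel at
 * base + index, wrapping the final byte offset into the buffer.
 * mem_size must be a power of two and at least 4 bytes.
 */
static uint32_t read_pixel32_masked(const uint8_t *mem, uint32_t mem_size,
                                    uint32_t base, uint32_t index)
{
    assert(mem_size >= 4 && (mem_size & (mem_size - 1)) == 0);

    /* Mask the *final* offset, then align it so off + 3 stays in bounds. */
    uint32_t off = (base + index) & (mem_size - 1) & ~(uint32_t)3;
    uint32_t val;

    memcpy(&val, mem + off, sizeof(val)); /* no unaligned pointer cast */
    return val;
}
```

Whatever base and index the guest programs, the access lands inside mem; an
out-of-range request silently wraps instead of being rejected.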

> +        if ((rop_mode && rop == 0x5) || (!rop_mode && rop == 0x55)) {
> +            /* Invert dest, is there a way to do this with pixman? */

PIXMAN_OP_XOR maybe?

> +            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
> +                /* regions may overlap: copy via temporary */

The usual way for a hardware blitter is to have a direction bit, i.e.
the guest os can ask to blit in top->bottom or bottom->top scanline
ordering.  The guest can use that to make sure the blit does not
overwrite things.  But note the guest can also intentionally use
overlapping regions, i.e. memset(0) the first scanline, then use a blit
with overlap to clear the whole screen.  The latter will surely break if
you blit via temporary image ...
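
The difference can be shown with a small standalone sketch. Both function
names are hypothetical and neither uses pixman; they only model a blit whose
destination starts one scanline below its source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy `rows` scanlines of `pitch` bytes, top to bottom, in place. */
static void blit_direct(uint8_t *mem, size_t src, size_t dst,
                        size_t pitch, size_t rows)
{
    for (size_t y = 0; y < rows; y++) {
        memmove(mem + dst + y * pitch, mem + src + y * pitch, pitch);
    }
}

/* Same blit, but snapshot the whole source area before writing. */
static void blit_via_temp(uint8_t *mem, size_t src, size_t dst,
                          size_t pitch, size_t rows)
{
    uint8_t *tmp = malloc(pitch * rows);

    assert(tmp != NULL);
    memcpy(tmp, mem + src, pitch * rows);
    memcpy(mem + dst, tmp, pitch * rows);
    free(tmp);
}
```

With the first scanline cleared and the destination one line below the
source, blit_direct() propagates the cleared line down through every row,
while blit_via_temp() only moves each original line down by one.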

> +                pixman_blt((uint32_t *)&s->local_mem[src_base],
> +                           (uint32_t *)&s->local_mem[dst_base],
> +                           src_pitch * (1 << format) / sizeof(uint32_t),
> +                           dst_pitch * (1 << format) / sizeof(uint32_t),
> +                           8 * (1 << format), 8 * (1 << format),
> +                           src_x, src_y, dst_x, dst_y, width, height);

See above, I'm not convinced pixman is the best way here.
When using pixman I'd suggest:

  (1) src = pixman_image_create_bits_no_clear(...);
  (2) dst = pixman_image_create_bits_no_clear(...);
  (3) pixman_image_composite(PIXMAN_OP_SRC, src, NULL, dst, ...);
  (4) pixman_image_unref(src);
  (5) pixman_image_unref(dst);

pixman_blt() is probably doing basically the same.  The advantage of not
using pixman_blt() is that

  (a) you can also use pixman ops other than PIXMAN_OP_SRC, and
  (b) you can have a helper function for (1)+(2) which very carefully
      applies sanity checks to make sure the pixman image created stays
      completely inside s->local_mem.
  (c) you have the option to completely rearrange the code flow, for
      example update the src pixman image whenever the guest touches
      src_base or src_pitch or format instead of having a
      create/op/unref cycle on every blitter op.

> +        pixman_fill((uint32_t *)&s->local_mem[dst_base],
> +                    dst_pitch * (1 << format) / sizeof(uint32_t),
> +                    8 * (1 << format), dst_x, dst_y, width, height, color);

  (1) src = pixman_image_create_solid(...), otherwise same as above ;)

take care,
  Gerd
BALATON Zoltan May 26, 2020, 1:35 p.m. UTC | #2
On Tue, 26 May 2020, Gerd Hoffmann wrote:
> On Thu, May 21, 2020 at 09:39:44PM +0200, BALATON Zoltan wrote:
>> Besides being faster this should also prevent malicious guests to
>> abuse 2D engine to overwrite data or cause a crash.
>
>>          uint32_t src_base = s->twoD_source_base & 0x03FFFFFF;
>> -        uint8_t *src = s->local_mem + src_base;
>
>> -                    val = *(_pixel_type *)&src[index_s];                      \
>
> Well, the advantage of *not* using pixman is that you can easily switch
> the code to use offsets instead of pointers, then apply the mask to the
> *final* offset to avoid oob data access:

The mask applied to src_base is not there to prevent overflow but to
implement register limits. Only these bits are valid, if I remember
correctly, so even if I use offsets I need to check for overflow. This patch
basically does that by changing parameters to unsigned so they cannot be
negative, checking that the values we multiply by are not zero, and then
calculating the first and last offsets and checking that they are within
vram. (Unless of course I've made a mistake somewhere.) This should prevent
overflow with one check and does not need a mask applied at every step.
The vram size can also differ, so it's not a fixed mask anyway.
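
A single up-front check of this kind could look roughly like the sketch
below. It is not the code from the patch: rect_in_vram() is a hypothetical
helper, the register fields are assumed to fit in 16 bits, and the arithmetic
is done in 64 bits so that the multiplication itself cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a width x height rectangle at (x, y), on a
 * surface that starts at byte offset `base` and has `pitch` pixels per
 * line, lies entirely inside `vram_size` bytes.
 */
static bool rect_in_vram(uint64_t vram_size, uint32_t base,
                         uint16_t x, uint16_t y,
                         uint16_t width, uint16_t height,
                         uint16_t pitch, uint8_t bytes_pp)
{
    assert(bytes_pp == 1 || bytes_pp == 2 || bytes_pp == 4);

    if (!width || !height || !pitch) {
        return false; /* reject degenerate operations up front */
    }

    /* Byte offset just past the last pixel of the last line. */
    uint64_t last_row = (uint64_t)y + height - 1;
    uint64_t end = base + (last_row * pitch + x + width) * bytes_pp;

    return end <= vram_size;
}
```

A caller would run this once for the destination and once for the source
before touching memory, and bail out with a LOG_GUEST_ERROR message if either
call returns false.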

If not using pixman, I'd need to reimplement optimised 2D ops that will
likely never be as good as pixman's, and there is no point in doing that
separately for every device model, so I'd rather use pixman where possible
unless a better library is available.

>    val = *(_pixel_type*)(&s->local_mem[(s->twoD_source_base + index_s) & 0x03FFFFFF]);
>
>> +        if ((rop_mode && rop == 0x5) || (!rop_mode && rop == 0x55)) {
>> +            /* Invert dest, is there a way to do this with pixman? */
>
> PIXMAN_OP_XOR maybe?

Maybe, but looking at the pixman source I couldn't decide if

UN8x4_MUL_UN8_ADD_UN8x4_MUL_UN8 (s, dest_ia, d, src_ia);

seen here:
https://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine32.c#n396
is really the same as s ^ d.
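
As far as I understand it, PIXMAN_OP_XOR is the Porter-Duff XOR compositing
operator, where each channel becomes s * (1 - dest alpha) + d * (1 - src
alpha), rather than a bitwise operation. The standalone sketch below models
that arithmetic for a8r8g8b8 pixels. The helper names are hypothetical and
the rounding in mul_un8() is an assumption modelled on pixman's 8-bit
multiply, not a copy of the library code.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;

    return (uint8_t)(((t >> 8) + t) >> 8);
}

/* Porter-Duff XOR per channel: s * (1 - da) + d * (1 - sa), saturated. */
static uint32_t pd_xor(uint32_t s, uint32_t d)
{
    uint8_t src_ia = (uint8_t)(255 - (s >> 24));
    uint8_t dst_ia = (uint8_t)(255 - (d >> 24));
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = mul_un8((s >> shift) & 0xff, dst_ia) +
                     mul_un8((d >> shift) & 0xff, src_ia);
        out |= (c > 255 ? 255 : c) << shift;
    }
    return out;
}

/* Usage: two opaque pixels cancel out, unlike a bitwise XOR or NOT. */
static void demo(void)
{
    assert(pd_xor(0xff336699u, 0xffffffffu) == 0);
    assert((0xff336699u ^ 0xffffffffu) == 0x00cc9966u);
}
```

If this model is right, the two operations agree only in special cases, so
the operator would not give a bitwise inversion of the destination.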

>> +            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
>> +                /* regions may overlap: copy via temporary */
>
> The usual way for a hardware blitter is to have a direction bit, i.e.
> the guest os can ask to blit in top->bottom or bottom->top scanline
> ordering.  The guest can use that to make sure the blit does not

Yes, this is the rtl above (right to left). AmigaOS sets it most of the
time, so relying on it alone to detect overlaps is not efficient.

> overwrite things.  But note the guest can also intentionally use
> overlapping regions, i.e. memset(0) the first scanline, then use a blit
> with overlap to clear the whole screen.  The later will surely break if
> you blit via temporary image ...

Fortunately no guest code seems to do that, so unless we find one that needs
it I won't worry much about such rare cases. It would be best if pixman
supported this, but while I've found that patches were submitted, they have
not been merged so far, so using a temporary seems to be the simplest way
that works well enough for now.

>> +                pixman_blt((uint32_t *)&s->local_mem[src_base],
>> +                           (uint32_t *)&s->local_mem[dst_base],
>> +                           src_pitch * (1 << format) / sizeof(uint32_t),
>> +                           dst_pitch * (1 << format) / sizeof(uint32_t),
>> +                           8 * (1 << format), 8 * (1 << format),
>> +                           src_x, src_y, dst_x, dst_y, width, height);
>
> See above, i'm not convinced pixman is the best way here.
> When using pixman I'd suggest:
>
>  (1) src = pixman_image_create_bits_no_clear(...);
>  (2) dst = pixman_image_create_bits_no_clear(...);
>  (3) pixman_image_composite(PIXMAN_OP_SRC, src, NULL, dst, ...);
>  (4) pixman_image_unref(src);
>  (5) pixman_image_unref(dst);
>
> pixman_blt() is probably doing basically the same.

Actually it's not the same: pixman_blt is faster because it operates
directly on pointers, while pixman_image_composite needs all the
pixman_image overhead. The blitter is used for a lot of small ops (I've seen
AmigaOS even call it with 1 pixel regions), so going through pixman_image
every time does not seem efficient. Implementing more complex ops may need
it, so I may try to figure that out later, but I'd need some test cases to
check that the results are correct. The current patches do the same as
before (except for some rare overlapping cases, as you noted above, but we
haven't observed any yet) and fix the overflows, so this was the best I
could do in the time I had. Maybe I'll try to improve this later, but I
don't plan to rewrite it now.

>  The advantage of not
> using pixman_blt() is that
>
>  (a) you can also use pixman ops other than PIXMAN_OP_SRC, and
>  (b) you can have a helper function for (1)+(2) which very carefully
>      applies sanity checks to make sure the pixman image created stays
>      completely inside s->local_mem.
>  (c) you have the option to completely rearrange the code flow, for
>      example update the src pixman image whenever the guest touches
>      src_base or src_pitch or format instead of having a
>      create/op/unref cycle on every blitter op.

From traces I think most guests write the blitter-related regs on every op,
so it's probably not worth the hassle to update regions on register access;
we could do it on every op instead, possibly optimising 1 pixel blits and
small regions via some special cases. Even then, a simple image copy is
probably the most common op and may be worth doing via pixman_blt: it's
expected to be used frequently, so the less overhead the better. Therefore
I'd only use image_composite for more complex ops, but that's too much
effort for a relatively unused device model. Maybe for ati-vga I'll try to
make it better, but first I should fix the microengine for that so drivers
can talk to it. I'd rather spend my limited free time on that than on
further improving sm501, unless some bugs show up.

>> +        pixman_fill((uint32_t *)&s->local_mem[dst_base],
>> +                    dst_pitch * (1 << format) / sizeof(uint32_t),
>> +                    8 * (1 << format), dst_x, dst_y, width, height, color);
>
>  (1) src = pixman_image_create_solid(...), otherwise same as above ;)

Same argument as for image_composite, and for fill we don't even gain
anything: for composite, implementing other ops is a reason not to use
pixman_blt, but I see no reason not to go the fastest way for fill.
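
One detail of the pixman_blt()/pixman_fill() calls quoted above is that the
stride argument is counted in 32-bit words, while the pitch is in pixels,
hence the division by sizeof(uint32_t). The sketch below uses a hypothetical
helper, pitch_to_word_stride(), to show that conversion and to flag the case
where the line length in bytes is not a multiple of four, so the integer
division would round down. Whether real guests ever program such a pitch is
an open assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a pitch in pixels to a stride in 32-bit
 * words. Returns false when the line length in bytes is not a multiple
 * of four, i.e. when no word stride describes the surface exactly.
 */
static bool pitch_to_word_stride(unsigned int pitch, unsigned int format,
                                 int *stride)
{
    assert(format <= 2); /* 0, 1, 2 select 8, 16, 32 bits per pixel */

    unsigned int bytes = pitch * (1u << format);

    *stride = (int)(bytes / sizeof(uint32_t));
    return bytes % sizeof(uint32_t) == 0;
}
```

For example, a pitch of 1024 pixels at 8 bpp converts exactly to 256 words,
whereas 1023 pixels at 8 bpp truncates to 255 words and the helper reports
the mismatch.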

Regards,
BALATON Zoltan
Gerd Hoffmann May 27, 2020, 9:15 a.m. UTC | #3
Hi,

> > Well, the advantage of *not* using pixman is that you can easily switch
> > the code to use offsets instead of pointers, then apply the mask to the
> > *final* offset to avoid oob data access:
> 
> The mask applied to src_base is not to prevent overflow but to implement
> register limits.

Yea, that was just a quick sketch to outline the idea without checking
all details.

> This patch basically does
> that by changing parameters to unsigned to prevent them being negative,
> checking values we multiply by to prevent them to be zero and then
> calculating first and last offset and check if they are within vram.

Well.  With cirrus this proved to be fragile.  The checks missed corner
cases and we've got a series of CVEs in the blitter code.  Switching to
offsets + masking every vram access (see commit ffaf85777828) stopped
that.

> (Unless
> of course I've made a mistake somewhere.)

Exactly ...

> This should prevent overflow with
> one check and does not need to apply a mask at every step. The vram size can
> also be different so it's not a fixed mask anyway.
> 
> If not using pixman then I'd need to reimplement optimised 2D ops that will
> likely never be as good as pixman and no point in doing it several times for
> every device model so I'd rather try to use pixman where possible unless a
> better library is available.

Yes, performance-wise pixman is clearly the better choice.  At the end
of the day it is a security vs performance trade off.

> > > +            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
> > > +                /* regions may overlap: copy via temporary */
> > 
> > The usual way for a hardware blitter is to have a direction bit, i.e.
> > the guest os can ask to blit in top->bottom or bottom->top scanline
> > ordering.  The guest can use that to make sure the blit does not
> 
> Yes, this is the rtl above (right to left) and AmigaOS sets this most of the
> time so only relying on that to detect overlaps is not efficient.

Hmm, checking rtl like that doesn't look correct to me then.  When using
the blitter to move a window you have to set/clear rtl depending on
whether you move the window up or down on the screen, and src+dst
regions can overlap in both cases ...

> > overwrite things.  But note the guest can also intentionally use
> > overlapping regions, i.e. memset(0) the first scanline, then use a blit
> > with overlap to clear the whole screen.  The later will surely break if
> > you blit via temporary image ...
> 
> Fortunately no guest code seems to do that so unless we find one needing it
> I don't worry much about such rare cases.

Ok.

> > > +                pixman_blt((uint32_t *)&s->local_mem[src_base],
> > > +                           (uint32_t *)&s->local_mem[dst_base],
> > > +                           src_pitch * (1 << format) / sizeof(uint32_t),
> > > +                           dst_pitch * (1 << format) / sizeof(uint32_t),
> > > +                           8 * (1 << format), 8 * (1 << format),
> > > +                           src_x, src_y, dst_x, dst_y, width, height);
> > 
> > See above, i'm not convinced pixman is the best way here.
> > When using pixman I'd suggest:
> > 
> >  (1) src = pixman_image_create_bits_no_clear(...);
> >  (2) dst = pixman_image_create_bits_no_clear(...);
> >  (3) pixman_image_composite(PIXMAN_OP_SRC, src, NULL, dst, ...);
> >  (4) pixman_image_unref(src);
> >  (5) pixman_image_unref(dst);
> > 
> > pixman_blt() is probably doing basically the same.
> 
> Actually not the same, pixman_blt is faster operating directly on pointers
> while we need all the pixman_image overhead to use pixman_image_composite.

Ok (I didn't check the pixman code).

Given the use case (run a computer museum ;) I think we can live with
the flaws of the pixman approach.  Security shouldn't be that much of an
issue here.  The behavior and blitter use pattern of the guests is known
too and unlikely to change.

take care,
  Gerd
BALATON Zoltan May 27, 2020, 11:05 a.m. UTC | #4
Hello,

On Wed, 27 May 2020, Gerd Hoffmann wrote:
>>> Well, the advantage of *not* using pixman is that you can easily switch
>>> the code to use offsets instead of pointers, then apply the mask to the
>>> *final* offset to avoid oob data access:
>>
>> The mask applied to src_base is not to prevent overflow but to implement
>> register limits.
>
> Yea, that was just a quick sketch to outline the idea without checking
> all details.
>
>> This patch basically does
>> that by changing parameters to unsigned to prevent them being negative,
>> checking values we multiply by to prevent them to be zero and then
>> calculating first and last offset and check if they are within vram.
>
> Well.  With cirrus this proved to be fragile.  The checks missed corner
> cases and we've got a series of CVEs in the blitter code.  Switching to
> offsets + masking every vram access (see commit ffaf85777828) stopped
> that.
>
>> (Unless
>> of course I've made a mistake somewhere.)
>
> Exactly ...

Hopefully we can make the checks correct eventually. I think for sm501 it
should already be OK. I'll need to check ati-vga again because I think
there may still be a mistake in that. (It does not help that every device
encodes these values differently in registers.)

>> This should prevent overflow with
>> one check and does not need to apply a mask at every step. The vram size can
>> also be different so it's not a fixed mask anyway.
>>
>> If not using pixman then I'd need to reimplement optimised 2D ops that will
>> likely never be as good as pixman and no point in doing it several times for
>> every device model so I'd rather try to use pixman where possible unless a
>> better library is available.
>
> Yes, performance-wise pixman is clearly the better choice.  At the end
> of the day it is a security vs performance trade off.

I prefer performance here: if security can be achieved without loss of
performance by using correct checks, I'd rather fix the checks until they
are correct than do additional work in a loop.

>>>> +            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
>>>> +                /* regions may overlap: copy via temporary */
>>>
>>> The usual way for a hardware blitter is to have a direction bit, i.e.
>>> the guest os can ask to blit in top->bottom or bottom->top scanline
>>> ordering.  The guest can use that to make sure the blit does not
>>
>> Yes, this is the rtl above (right to left) and AmigaOS sets this most of the
>> time so only relying on that to detect overlaps is not efficient.
>
> Hmm, checking rtl like that doesn't look correct to me then.  When using
> the blitter to move a window you have to set/clear rtl depending on
> whenever you move the window up or down on the screen, and src+dst
> regions can overlap in both cases ...

Pixman copies left to right, top to bottom, so we don't need special
handling for such blits; they will work even for overlapping areas.
Non-overlapping blits should also work in whatever direction (but AmigaOS
seems to use rtl as the default even for non-overlapping blits; maybe the
hardware prefers that or it was somehow easier to code). The only case where
pixman does not work is overlapping areas in the reverse direction, which is
what is checked here. Because of the different strides and offsets it's hard
to check this exactly, so we only do a crude check to see whether the memory
areas overlap at all. This should catch all bad cases and maybe some good
ones, but filtering those out is probably as expensive as doing the blit.
As you said, this may not work in some cases, but until we come across such
cases I'd go with this simpler solution, because otherwise we would likely
need to implement our own optimised blit routine.
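
For comparison, here is a sketch of the crude range test next to the
standard closed-interval intersection test. The function names are
hypothetical; the first body copies the condition from the patch and the
second is the textbook form. If I read the patch condition correctly, it
asks whether either end of the destination range falls inside the source
range, which would not fire when the destination range strictly encloses the
source. That case seems possible when the two pitches differ, but this is an
assumption I have not verified against the hardware or any guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition as written in the patch: is either end of dst inside src? */
static bool endpoints_inside(uint32_t sb, uint32_t se,
                             uint32_t db, uint32_t de)
{
    return (db >= sb && db <= se) || (de >= sb && de <= se);
}

/* Standard test: two closed ranges intersect unless one ends first. */
static bool ranges_overlap(uint32_t sb, uint32_t se,
                           uint32_t db, uint32_t de)
{
    assert(sb <= se && db <= de);
    return sb <= de && db <= se;
}
```

Both functions agree when the ranges partially overlap or are disjoint; they
differ only when dst begins before src and ends after it.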

Unlike ati-vga, sm501 does not have independent direction bits, so rtl
seems to mean both right to left and bottom to top. Ati-vga has a separate
bit for bottom to top, so blits combining that with left to right could
still use pixman_blt by calling it in a reverse counting loop for every
line, but I have not gone for that optimisation yet. For sm501 there's no
such option. A possible further optimisation would be handling 1 pixel and
small regions directly, where the overhead of calling pixman may be bigger
than the gain from its optimised routines, but I would need to measure that
and I have no time for it.
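
The reverse counting loop mentioned for ati-vga could be sketched as below.
To keep the example self-contained it uses memmove() for each line where the
device model would call pixman_blt(), and blit_bottom_up() is a hypothetical
name, so treat it as an outline of the loop order only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy `rows` lines of `len` bytes between two
 * surfaces in the same buffer, visiting the lines bottom to top. Each
 * line is still copied left to right.
 */
static void blit_bottom_up(uint8_t *mem, size_t src, size_t src_stride,
                           size_t dst, size_t dst_stride,
                           size_t len, size_t rows)
{
    assert(len <= src_stride && len <= dst_stride);

    for (size_t y = rows; y-- > 0; ) {
        memmove(mem + dst + y * dst_stride,
                mem + src + y * src_stride, len);
    }
}
```

Moving a block down by one line this way reads every source line before it
is overwritten, so no temporary copy of the whole area is needed.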

>>> overwrite things.  But note the guest can also intentionally use
>>> overlapping regions, i.e. memset(0) the first scanline, then use a blit
>>> with overlap to clear the whole screen.  The later will surely break if
>>> you blit via temporary image ...
>>
>> Fortunately no guest code seems to do that so unless we find one needing it
>> I don't worry much about such rare cases.
>
> Ok.
>
>>>> +                pixman_blt((uint32_t *)&s->local_mem[src_base],
>>>> +                           (uint32_t *)&s->local_mem[dst_base],
>>>> +                           src_pitch * (1 << format) / sizeof(uint32_t),
>>>> +                           dst_pitch * (1 << format) / sizeof(uint32_t),
>>>> +                           8 * (1 << format), 8 * (1 << format),
>>>> +                           src_x, src_y, dst_x, dst_y, width, height);
>>>
>>> See above, i'm not convinced pixman is the best way here.
>>> When using pixman I'd suggest:
>>>
>>>  (1) src = pixman_image_create_bits_no_clear(...);
>>>  (2) dst = pixman_image_create_bits_no_clear(...);
>>>  (3) pixman_image_composite(PIXMAN_OP_SRC, src, NULL, dst, ...);
>>>  (4) pixman_image_unref(src);
>>>  (5) pixman_image_unref(dst);
>>>
>>> pixman_blt() is probably doing basically the same.
>>
>> Actually not the same, pixman_blt is faster operating directly on pointers
>> while we need all the pixman_image overhead to use pixman_image_composite.
>
> Ok (I didn't check the pixman code).

You should. It's seriously undocumented, and using it seems to require
digging through the code. Or maybe I've missed all the wonderful
documentation?

> Given the use case (run a computer museum ;) I think we can live with
> the flaws of the pixman approach.  Security shouldn't be that much of an
> issue here.  The behavior and blitter use pattern of the guests is known
> too and unlikely to change.

To my knowledge the sm501 is only used on an SH4 machine, on the sam460ex,
and to run MorphOS on mac99 (but ati-vga is already better for the latter),
so these are not security critical in my opinion.

Regards,
BALATON Zoltan

Patch

diff --git a/hw/display/sm501.c b/hw/display/sm501.c
index 5ed57703d8..8bf4d111f4 100644
--- a/hw/display/sm501.c
+++ b/hw/display/sm501.c
@@ -706,13 +706,12 @@  static void sm501_2d_operation(SM501State *s)
     /* 1 if rop2 source is the pattern, otherwise the source is the bitmap */
     int rop2_source_is_pattern = (s->twoD_control >> 14) & 0x1;
     int rop = s->twoD_control & 0xFF;
-    int dst_x = (s->twoD_destination >> 16) & 0x01FFF;
-    int dst_y = s->twoD_destination & 0xFFFF;
-    int width = (s->twoD_dimension >> 16) & 0x1FFF;
-    int height = s->twoD_dimension & 0xFFFF;
+    unsigned int dst_x = (s->twoD_destination >> 16) & 0x01FFF;
+    unsigned int dst_y = s->twoD_destination & 0xFFFF;
+    unsigned int width = (s->twoD_dimension >> 16) & 0x1FFF;
+    unsigned int height = s->twoD_dimension & 0xFFFF;
     uint32_t dst_base = s->twoD_destination_base & 0x03FFFFFF;
-    uint8_t *dst = s->local_mem + dst_base;
-    int dst_pitch = (s->twoD_pitch >> 16) & 0x1FFF;
+    unsigned int dst_pitch = (s->twoD_pitch >> 16) & 0x1FFF;
     int crt = (s->dc_crt_control & SM501_DC_CRT_CONTROL_SEL) ? 1 : 0;
     int fb_len = get_width(s, crt) * get_height(s, crt) * get_bpp(s, crt);
 
@@ -721,104 +720,136 @@  static void sm501_2d_operation(SM501State *s)
         return;
     }
 
-    if (rop_mode == 0) {
-        if (rop != 0xcc) {
-            /* Anything other than plain copies are not supported */
-            qemu_log_mask(LOG_UNIMP, "sm501: rop3 mode with rop %x is not "
-                          "supported.\n", rop);
-        }
-    } else {
-        if (rop2_source_is_pattern && rop != 0x5) {
-            /* For pattern source, we support only inverse dest */
-            qemu_log_mask(LOG_UNIMP, "sm501: rop2 source being the pattern and "
-                          "rop %x is not supported.\n", rop);
-        } else {
-            if (rop != 0x5 && rop != 0xc) {
-                /* Anything other than plain copies or inverse dest is not
-                 * supported */
-                qemu_log_mask(LOG_UNIMP, "sm501: rop mode %x is not "
-                              "supported.\n", rop);
-            }
-        }
-    }
-
     if (s->twoD_source_base & BIT(27) || s->twoD_destination_base & BIT(27)) {
         qemu_log_mask(LOG_UNIMP, "sm501: only local memory is supported.\n");
         return;
     }
 
+    if (!dst_pitch) {
+        qemu_log_mask(LOG_GUEST_ERROR, "sm501: Zero dest pitch.\n");
+        return;
+    }
+
+    if (!width || !height) {
+        qemu_log_mask(LOG_GUEST_ERROR, "sm501: Zero size 2D op.\n");
+        return;
+    }
+
+    if (rtl) {
+        dst_x -= width - 1;
+        dst_y -= height - 1;
+    }
+
+    if (dst_base >= get_local_mem_size(s) || dst_base +
+        (dst_x + width + (dst_y + height) * (dst_pitch + width)) *
+        (1 << format) >= get_local_mem_size(s)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "sm501: 2D op dest is outside vram.\n");
+        return;
+    }
+
     switch (cmd) {
-    case 0x00: /* copy area */
+    case 0: /* BitBlt */
     {
-        int src_x = (s->twoD_source >> 16) & 0x01FFF;
-        int src_y = s->twoD_source & 0xFFFF;
+        unsigned int src_x = (s->twoD_source >> 16) & 0x01FFF;
+        unsigned int src_y = s->twoD_source & 0xFFFF;
         uint32_t src_base = s->twoD_source_base & 0x03FFFFFF;
-        uint8_t *src = s->local_mem + src_base;
-        int src_pitch = s->twoD_pitch & 0x1FFF;
-
-#define COPY_AREA(_bpp, _pixel_type, rtl) {                                   \
-        int y, x, index_d, index_s;                                           \
-        for (y = 0; y < height; y++) {                              \
-            for (x = 0; x < width; x++) {                           \
-                _pixel_type val;                                              \
-                                                                              \
-                if (rtl) {                                                    \
-                    index_s = ((src_y - y) * src_pitch + src_x - x) * _bpp;   \
-                    index_d = ((dst_y - y) * dst_pitch + dst_x - x) * _bpp;   \
-                } else {                                                      \
-                    index_s = ((src_y + y) * src_pitch + src_x + x) * _bpp;   \
-                    index_d = ((dst_y + y) * dst_pitch + dst_x + x) * _bpp;   \
-                }                                                             \
-                if (rop_mode == 1 && rop == 5) {                              \
-                    /* Invert dest */                                         \
-                    val = ~*(_pixel_type *)&dst[index_d];                     \
-                } else {                                                      \
-                    val = *(_pixel_type *)&src[index_s];                      \
-                }                                                             \
-                *(_pixel_type *)&dst[index_d] = val;                          \
-            }                                                                 \
-        }                                                                     \
-    }
-        switch (format) {
-        case 0:
-            COPY_AREA(1, uint8_t, rtl);
-            break;
-        case 1:
-            COPY_AREA(2, uint16_t, rtl);
-            break;
-        case 2:
-            COPY_AREA(4, uint32_t, rtl);
-            break;
+        unsigned int src_pitch = s->twoD_pitch & 0x1FFF;
+
+        if (!src_pitch) {
+            qemu_log_mask(LOG_GUEST_ERROR, "sm501: Zero src pitch.\n");
+            return;
+        }
+
+        if (rtl) {
+            src_x -= width - 1;
+            src_y -= height - 1;
+        }
+
+        if (src_base >= get_local_mem_size(s) || src_base +
+            (src_x + width + (src_y + height) * (src_pitch + width)) *
+            (1 << format) >= get_local_mem_size(s)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "sm501: 2D op src is outside vram.\n");
+            return;
+        }
+
+        if ((rop_mode && rop == 0x5) || (!rop_mode && rop == 0x55)) {
+            /* Invert dest, is there a way to do this with pixman? */
+            unsigned int x, y, i;
+            uint8_t *d = s->local_mem + dst_base;
+
+            for (y = 0; y < height; y++) {
+                i = (dst_x + (dst_y + y) * dst_pitch) * (1 << format);
+                for (x = 0; x < width; x++, i += (1 << format)) {
+                    switch (format) {
+                    case 0:
+                        d[i] = ~d[i];
+                        break;
+                    case 1:
+                        *(uint16_t *)&d[i] = ~*(uint16_t *)&d[i];
+                        break;
+                    case 2:
+                        *(uint32_t *)&d[i] = ~*(uint32_t *)&d[i];
+                        break;
+                    }
+                }
+            }
+        } else {
+            /* Do copy src for unimplemented ops, better than unpainted area */
+            if ((rop_mode && (rop != 0xc || rop2_source_is_pattern)) ||
+                (!rop_mode && rop != 0xcc)) {
+                qemu_log_mask(LOG_UNIMP,
+                              "sm501: rop%d op %x%s not implemented\n",
+                              (rop_mode ? 2 : 3), rop,
+                              (rop2_source_is_pattern ?
+                                  " with pattern source" : ""));
+            }
+            /* Check for overlaps, this could be made more exact */
+            uint32_t sb, se, db, de;
+            sb = src_base + src_x + src_y * (width + src_pitch);
+            se = sb + width + height * (width + src_pitch);
+            db = dst_base + dst_x + dst_y * (width + dst_pitch);
+            de = db + width + height * (width + dst_pitch);
+            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
+                /* regions may overlap: copy via temporary */
+                int llb = width * (1 << format);
+                int tmp_stride = DIV_ROUND_UP(llb, sizeof(uint32_t));
+                uint32_t *tmp = g_malloc(tmp_stride * sizeof(uint32_t) *
+                                         height);
+                pixman_blt((uint32_t *)&s->local_mem[src_base], tmp,
+                           src_pitch * (1 << format) / sizeof(uint32_t),
+                           tmp_stride, 8 * (1 << format), 8 * (1 << format),
+                           src_x, src_y, 0, 0, width, height);
+                pixman_blt(tmp, (uint32_t *)&s->local_mem[dst_base],
+                           tmp_stride,
+                           dst_pitch * (1 << format) / sizeof(uint32_t),
+                           8 * (1 << format), 8 * (1 << format),
+                           0, 0, dst_x, dst_y, width, height);
+                g_free(tmp);
+            } else {
+                pixman_blt((uint32_t *)&s->local_mem[src_base],
+                           (uint32_t *)&s->local_mem[dst_base],
+                           src_pitch * (1 << format) / sizeof(uint32_t),
+                           dst_pitch * (1 << format) / sizeof(uint32_t),
+                           8 * (1 << format), 8 * (1 << format),
+                           src_x, src_y, dst_x, dst_y, width, height);
+            }
         }
         break;
     }
-    case 0x01: /* fill rectangle */
+    case 1: /* Rectangle Fill */
     {
         uint32_t color = s->twoD_foreground;
 
-#define FILL_RECT(_bpp, _pixel_type) {                                      \
-        int y, x;                                                           \
-        for (y = 0; y < height; y++) {                            \
-            for (x = 0; x < width; x++) {                         \
-                int index = ((dst_y + y) * dst_pitch + dst_x + x) * _bpp;   \
-                *(_pixel_type *)&dst[index] = (_pixel_type)color;           \
-            }                                                               \
-        }                                                                   \
-    }
-
-        switch (format) {
-        case 0:
-            FILL_RECT(1, uint8_t);
-            break;
-        case 1:
-            color = cpu_to_le16(color);
-            FILL_RECT(2, uint16_t);
-            break;
-        case 2:
+        if (format == 2) {
             color = cpu_to_le32(color);
-            FILL_RECT(4, uint32_t);
-            break;
+        } else if (format == 1) {
+            color = cpu_to_le16(color);
         }
+
+        pixman_fill((uint32_t *)&s->local_mem[dst_base],
+                    dst_pitch * (1 << format) / sizeof(uint32_t),
+                    8 * (1 << format), dst_x, dst_y, width, height, color);
         break;
     }
     default: