Patchwork [2/6] qemu-kvm: Modify and introduce wrapper functions to access phys_ram_dirty.

login
register
mail settings
Submitter Yoshiaki Tamura
Date March 16, 2010, 10:53 a.m.
Message ID <1268736839-27371-3-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
Download mbox | patch
Permalink /patch/47843/
State New
Headers show

Comments

Yoshiaki Tamura - March 16, 2010, 10:53 a.m.
Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
bit-based phys_ram_dirty bitmap, and adds more wrapper functions to prevent
direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 cpu-all.h |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 90 insertions(+), 4 deletions(-)
Avi Kivity - March 16, 2010, 12:45 p.m.
On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to prevent
> direct access to the phys_ram_dirty bitmap.
>    

> +
> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
> +{
> +    unsigned long mask;
> +    int index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +    int ret = 0;
> +
> +    mask = 1UL<<  offset;
> +    if (phys_ram_vga_dirty[index]&  mask)
> +        ret |= VGA_DIRTY_FLAG;
> +    if (phys_ram_code_dirty[index]&  mask)
> +        ret |=  CODE_DIRTY_FLAG;
> +    if (phys_ram_migration_dirty[index]&  mask)
> +        ret |= MIGRATION_DIRTY_FLAG;
> +
> +    return ret;
>   }
>
>   static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>                                                   int dirty_flags)
>   {
> -    return phys_ram_dirty[addr>>  TARGET_PAGE_BITS]&  dirty_flags;
> +    return cpu_physical_memory_get_dirty_flags(addr)&  dirty_flags;
>   }
>    

This turns one cacheline access into three.  If the dirty bitmaps were 
in an array, you could do

   return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS + 
BITS_IN_LONG)] & mask;

with one cacheline access.

>
>   static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>   {
> -    phys_ram_dirty[addr>>  TARGET_PAGE_BITS] = 0xff;
> +    unsigned long mask;
> +    int index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +
> +    mask = 1UL<<  offset;
> +    phys_ram_vga_dirty[index] |= mask;
> +    phys_ram_code_dirty[index] |= mask;
> +    phys_ram_migration_dirty[index] |= mask;
> +}
>    

This is also three cacheline accesses.  I think we should have a master 
bitmap which is updated by set_dirty(), and which is or'ed into the 
other bitmaps when they are accessed.  At least the vga and migration 
bitmaps are only read periodically, not randomly, so this would be very 
fast.  In a way, this is similar to how the qemu bitmap is updated from 
the kvm bitmap today.

I am not sure about the code bitmap though.
Yoshiaki Tamura - March 16, 2010, 1:17 p.m.
Avi Kivity wrote:
> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>> prevent
>> direct access to the phys_ram_dirty bitmap.
>
>> +
>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>> +{
>> + unsigned long mask;
>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> + int ret = 0;
>> +
>> + mask = 1UL<< offset;
>> + if (phys_ram_vga_dirty[index]& mask)
>> + ret |= VGA_DIRTY_FLAG;
>> + if (phys_ram_code_dirty[index]& mask)
>> + ret |= CODE_DIRTY_FLAG;
>> + if (phys_ram_migration_dirty[index]& mask)
>> + ret |= MIGRATION_DIRTY_FLAG;
>> +
>> + return ret;
>> }
>>
>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>> int dirty_flags)
>> {
>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>> }
>
> This turns one cacheline access into three. If the dirty bitmaps were in
> an array, you could do
>
> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
> BITS_IN_LONG)] & mask;
>
> with one cacheline access.

If I'm understanding the existing code correctly,
int dirty_flags can be combined, like VGA + MIGRATION.
If we only have to worry about a single dirty flag, I agree with your idea.

On the other hand, qemu seems to require getting combined dirty flags.
If we introduce dirty bitmaps for each type, we need to access each bitmap to 
get combined flags.  I wasn't sure how to make this more efficient...

>> static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>> {
>> - phys_ram_dirty[addr>> TARGET_PAGE_BITS] = 0xff;
>> + unsigned long mask;
>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> +
>> + mask = 1UL<< offset;
>> + phys_ram_vga_dirty[index] |= mask;
>> + phys_ram_code_dirty[index] |= mask;
>> + phys_ram_migration_dirty[index] |= mask;
>> +}
>
> This is also three cacheline accesses. I think we should have a master
> bitmap which is updated by set_dirty(), and which is or'ed into the
> other bitmaps when they are accessed. At least the vga and migration
> bitmaps are only read periodically, not randomly, so this would be very
> fast. In a way, this is similar to how the qemu bitmap is updated from
> the kvm bitmap today.

Sounds good to me.
So we're going to introduce 4 (VGA, CODE, MIGRATION, master) bit-based bitmaps 
in total.
Avi Kivity - March 16, 2010, 1:29 p.m.
On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
> Avi Kivity wrote:
>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>> prevent
>>> direct access to the phys_ram_dirty bitmap.
>>
>>> +
>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>>> +{
>>> + unsigned long mask;
>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>> + int ret = 0;
>>> +
>>> + mask = 1UL<< offset;
>>> + if (phys_ram_vga_dirty[index]& mask)
>>> + ret |= VGA_DIRTY_FLAG;
>>> + if (phys_ram_code_dirty[index]& mask)
>>> + ret |= CODE_DIRTY_FLAG;
>>> + if (phys_ram_migration_dirty[index]& mask)
>>> + ret |= MIGRATION_DIRTY_FLAG;
>>> +
>>> + return ret;
>>> }
>>>
>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>> int dirty_flags)
>>> {
>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>> }
>>
>> This turns one cacheline access into three. If the dirty bitmaps were in
>> an array, you could do
>>
>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>> BITS_IN_LONG)] & mask;
>>
>> with one cacheline access.
>
> If I'm understanding the existing code correctly,
> int dirty_flags can be combined, like VGA + MIGRATION.
> If we only have to worry about a single dirty flag, I agree with your 
> idea.

 From a quick grep it seems flags are not combined, except for something 
strange with CODE_DIRTY_FLAG:

> static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
>                                 uint32_t val)
> {
>     int dirty_flags;
>     dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
> #if !defined(CONFIG_USER_ONLY)
>         tb_invalidate_phys_page_fast(ram_addr, 4);
>         dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
> #endif
>     }
>     stl_p(qemu_get_ram_ptr(ram_addr), val);
>     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
>     phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
>     /* we remove the notdirty callback only if the code has been
>        flushed */
>     if (dirty_flags == 0xff)
>         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
> }

I can't say I understand what it does.

>
> On the other hand, qemu seems to require getting combined dirty flags.
> If we introduce dirty bitmaps for each type, we need to access each 
> bitmap to get combined flags.  I wasn't sure how to make this more 
> efficient...
>
>>> static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>>> {
>>> - phys_ram_dirty[addr>> TARGET_PAGE_BITS] = 0xff;
>>> + unsigned long mask;
>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>> +
>>> + mask = 1UL<< offset;
>>> + phys_ram_vga_dirty[index] |= mask;
>>> + phys_ram_code_dirty[index] |= mask;
>>> + phys_ram_migration_dirty[index] |= mask;
>>> +}
>>
>> This is also three cacheline accesses. I think we should have a master
>> bitmap which is updated by set_dirty(), and which is or'ed into the
>> other bitmaps when they are accessed. At least the vga and migration
>> bitmaps are only read periodically, not randomly, so this would be very
>> fast. In a way, this is similar to how the qemu bitmap is updated from
>> the kvm bitmap today.
>
> Sounds good to me.
> So we're going to introduce 4 (VGA, CODE, MIGRATION, master) bit-based 
> bitmaps in total.
>

Yeah, except CODE doesn't behave like the others.  Would be best to 
understand what it's requirements are before making the change.  Maybe 
CODE will need separate handling (so master will only feed VGA and 
MIGRATION).
Anthony Liguori - March 16, 2010, 1:35 p.m.
On 03/16/2010 07:45 AM, Avi Kivity wrote:
> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to 
>> prevent
>> direct access to the phys_ram_dirty bitmap.
>
>> +
>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>> +{
>> +    unsigned long mask;
>> +    int index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
>> +    int ret = 0;
>> +
>> +    mask = 1UL<<  offset;
>> +    if (phys_ram_vga_dirty[index]&  mask)
>> +        ret |= VGA_DIRTY_FLAG;
>> +    if (phys_ram_code_dirty[index]&  mask)
>> +        ret |=  CODE_DIRTY_FLAG;
>> +    if (phys_ram_migration_dirty[index]&  mask)
>> +        ret |= MIGRATION_DIRTY_FLAG;
>> +
>> +    return ret;
>>   }
>>
>>   static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>                                                   int dirty_flags)
>>   {
>> -    return phys_ram_dirty[addr>>  TARGET_PAGE_BITS]&  dirty_flags;
>> +    return cpu_physical_memory_get_dirty_flags(addr)&  dirty_flags;
>>   }
>
> This turns one cacheline access into three.  If the dirty bitmaps were 
> in an array, you could do
>
>   return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS + 
> BITS_IN_LONG)] & mask;
>
> with one cacheline access.

As far as I can tell, we only ever call with a single flag so your 
suggestion makes sense.

I'd suggest introducing these functions before splitting the bitmap up.  
It makes review a bit easier.

>>
>>   static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>>   {
>> -    phys_ram_dirty[addr>>  TARGET_PAGE_BITS] = 0xff;
>> +    unsigned long mask;
>> +    int index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
>> +
>> +    mask = 1UL<<  offset;
>> +    phys_ram_vga_dirty[index] |= mask;
>> +    phys_ram_code_dirty[index] |= mask;
>> +    phys_ram_migration_dirty[index] |= mask;
>> +}
>
> This is also three cacheline accesses.  I think we should have a 
> master bitmap which is updated by set_dirty(), and which is or'ed into 
> the other bitmaps when they are accessed.  At least the vga and 
> migration bitmaps are only read periodically, not randomly, so this 
> would be very fast.  In a way, this is similar to how the qemu bitmap 
> is updated from the kvm bitmap today.
>
> I am not sure about the code bitmap though.

I think your suggestion makes sense and would also work for the code bitmap.

Regards,

Anthony Liguori
Yoshiaki Tamura - March 16, 2010, 1:49 p.m.
Avi Kivity wrote:
> On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
>> Avi Kivity wrote:
>>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>>> prevent
>>>> direct access to the phys_ram_dirty bitmap.
>>>
>>>> +
>>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>>>> +{
>>>> + unsigned long mask;
>>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>>> + int ret = 0;
>>>> +
>>>> + mask = 1UL<< offset;
>>>> + if (phys_ram_vga_dirty[index]& mask)
>>>> + ret |= VGA_DIRTY_FLAG;
>>>> + if (phys_ram_code_dirty[index]& mask)
>>>> + ret |= CODE_DIRTY_FLAG;
>>>> + if (phys_ram_migration_dirty[index]& mask)
>>>> + ret |= MIGRATION_DIRTY_FLAG;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>>> int dirty_flags)
>>>> {
>>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>>> }
>>>
>>> This turns one cacheline access into three. If the dirty bitmaps were in
>>> an array, you could do
>>>
>>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>>> BITS_IN_LONG)] & mask;
>>>
>>> with one cacheline access.
>>
>> If I'm understanding the existing code correctly,
>> int dirty_flags can be combined, like VGA + MIGRATION.
>> If we only have to worry about a single dirty flag, I agree with your
>> idea.
>
>  From a quick grep it seems flags are not combined, except for something
> strange with CODE_DIRTY_FLAG:

Thanks for checking out.
But the CODE_DIRTY_FLAG makes me really nervous...

>> static void notdirty_mem_writel(void *opaque, target_phys_addr_t
>> ram_addr,
>> uint32_t val)
>> {
>> int dirty_flags;
>> dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>> if (!(dirty_flags & CODE_DIRTY_FLAG)) {
>> #if !defined(CONFIG_USER_ONLY)
>> tb_invalidate_phys_page_fast(ram_addr, 4);
>> dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>> #endif
>> }
>> stl_p(qemu_get_ram_ptr(ram_addr), val);
>> dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
>> phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
>> /* we remove the notdirty callback only if the code has been
>> flushed */
>> if (dirty_flags == 0xff)
>> tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
>> }
>
> I can't say I understand what it does.

Me neither.
This the reason I had to take naive approach...

>> On the other hand, qemu seems to require getting combined dirty flags.
>> If we introduce dirty bitmaps for each type, we need to access each
>> bitmap to get combined flags. I wasn't sure how to make this more
>> efficient...
>>
>>>> static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>>>> {
>>>> - phys_ram_dirty[addr>> TARGET_PAGE_BITS] = 0xff;
>>>> + unsigned long mask;
>>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>>> +
>>>> + mask = 1UL<< offset;
>>>> + phys_ram_vga_dirty[index] |= mask;
>>>> + phys_ram_code_dirty[index] |= mask;
>>>> + phys_ram_migration_dirty[index] |= mask;
>>>> +}
>>>
>>> This is also three cacheline accesses. I think we should have a master
>>> bitmap which is updated by set_dirty(), and which is or'ed into the
>>> other bitmaps when they are accessed. At least the vga and migration
>>> bitmaps are only read periodically, not randomly, so this would be very
>>> fast. In a way, this is similar to how the qemu bitmap is updated from
>>> the kvm bitmap today.
>>
>> Sounds good to me.
>> So we're going to introduce 4 (VGA, CODE, MIGRATION, master) bit-based
>> bitmaps in total.
>>
>
> Yeah, except CODE doesn't behave like the others. Would be best to
> understand what it's requirements are before making the change. Maybe
> CODE will need separate handling (so master will only feed VGA and
> MIGRATION).

After implementing this patch set, I thought separating the wrapper functions 
for each dirty flag type might be an option.  Unifying everything makes 
inefficient here.  But anyway, do you know somebody who has a strong insight on 
this CODE_DIRTY_FLAG?
Anthony Liguori - March 16, 2010, 1:51 p.m.
On 03/16/2010 08:29 AM, Avi Kivity wrote:
> On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
>> Avi Kivity wrote:
>>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>>> prevent
>>>> direct access to the phys_ram_dirty bitmap.
>>>
>>>> +
>>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t 
>>>> addr)
>>>> +{
>>>> + unsigned long mask;
>>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>>> + int ret = 0;
>>>> +
>>>> + mask = 1UL<< offset;
>>>> + if (phys_ram_vga_dirty[index]& mask)
>>>> + ret |= VGA_DIRTY_FLAG;
>>>> + if (phys_ram_code_dirty[index]& mask)
>>>> + ret |= CODE_DIRTY_FLAG;
>>>> + if (phys_ram_migration_dirty[index]& mask)
>>>> + ret |= MIGRATION_DIRTY_FLAG;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>>> int dirty_flags)
>>>> {
>>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>>> }
>>>
>>> This turns one cacheline access into three. If the dirty bitmaps 
>>> were in
>>> an array, you could do
>>>
>>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>>> BITS_IN_LONG)] & mask;
>>>
>>> with one cacheline access.
>>
>> If I'm understanding the existing code correctly,
>> int dirty_flags can be combined, like VGA + MIGRATION.
>> If we only have to worry about a single dirty flag, I agree with your 
>> idea.
>
> From a quick grep it seems flags are not combined, except for 
> something strange with CODE_DIRTY_FLAG:
>
>> static void notdirty_mem_writel(void *opaque, target_phys_addr_t 
>> ram_addr,
>>                                 uint32_t val)
>> {
>>     int dirty_flags;
>>     dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>>     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
>> #if !defined(CONFIG_USER_ONLY)
>>         tb_invalidate_phys_page_fast(ram_addr, 4);
>>         dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>> #endif
>>     }
>>     stl_p(qemu_get_ram_ptr(ram_addr), val);
>>     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
>>     phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
>>     /* we remove the notdirty callback only if the code has been
>>        flushed */
>>     if (dirty_flags == 0xff)
>>         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
>> }
>
> I can't say I understand what it does.

The semantics of CODE_DIRTY_FLAG are a little counter intuitive.  
CODE_DIRTY_FLAG means that we know that something isn't code so writes 
do not need checking for self modifying code.

notdirty_mem_write() is called for any ram that is in the virtual TLB 
that has not been updated yet and once a write has occurred, we can 
switch to faster access functions (provided we've invalidated any 
translation blocks).

That's why the check is if (!(dirty_flags & CODE_DIRTY_FLAG)), if it 
hasn't been set yet, we have to assume that it could be a TB so we need 
to invalidate it.  tb_invalidate_phys_page_fast() will set the 
CODE_DIRTY_FLAG if no code is present in that memory area which is why 
we fetch dirty_flags again.

We do the store, and then set the dirty bits to mark that the page is 
now dirty taking care to not change the CODE_DIRTY_FLAG bit.

At the very end, we check to see if CODE_DIRTY_FLAG which indicates that 
we no longer need to trap writes.  If so, we call tlb_set_dirty() which 
will ultimately remove the notdirty callback in favor of a faster access 
mechanism.

With respect patch series, there should be no problem having a separate 
code bitmap that gets updated along with a main bitmap provided that the 
semantics of CODE_DIRTY_FLAG are preserved.

>> Sounds good to me.
>> So we're going to introduce 4 (VGA, CODE, MIGRATION, master) 
>> bit-based bitmaps in total.
>>
>
> Yeah, except CODE doesn't behave like the others.  Would be best to 
> understand what it's requirements are before making the change.  Maybe 
> CODE will need separate handling (so master will only feed VGA and 
> MIGRATION).

Generally speaking, cpu_physical_memory_set_dirty() is called by the 
device model.  Any writes by the device model that results in 
self-modifying code are not going to have predictable semantics which is 
why it can set CODE_DIRTY_FLAG.

CODE_DIRTY_FLAG doesn't need to get updated from a master bitmap.  It 
should be treated as a separate bitmap that is strictly dealt with by 
the virtual TLB.

Regards,

Anthony Liguori
Avi Kivity - March 16, 2010, 1:57 p.m.
On 03/16/2010 03:51 PM, Anthony Liguori wrote:
> On 03/16/2010 08:29 AM, Avi Kivity wrote:
>> On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
>>> Avi Kivity wrote:
>>>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>>>> prevent
>>>>> direct access to the phys_ram_dirty bitmap.
>>>>
>>>>> +
>>>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t 
>>>>> addr)
>>>>> +{
>>>>> + unsigned long mask;
>>>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>>>> + int ret = 0;
>>>>> +
>>>>> + mask = 1UL<< offset;
>>>>> + if (phys_ram_vga_dirty[index]& mask)
>>>>> + ret |= VGA_DIRTY_FLAG;
>>>>> + if (phys_ram_code_dirty[index]& mask)
>>>>> + ret |= CODE_DIRTY_FLAG;
>>>>> + if (phys_ram_migration_dirty[index]& mask)
>>>>> + ret |= MIGRATION_DIRTY_FLAG;
>>>>> +
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>>>> int dirty_flags)
>>>>> {
>>>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>>>> }
>>>>
>>>> This turns one cacheline access into three. If the dirty bitmaps 
>>>> were in
>>>> an array, you could do
>>>>
>>>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>>>> BITS_IN_LONG)] & mask;
>>>>
>>>> with one cacheline access.
>>>
>>> If I'm understanding the existing code correctly,
>>> int dirty_flags can be combined, like VGA + MIGRATION.
>>> If we only have to worry about a single dirty flag, I agree with 
>>> your idea.
>>
>> From a quick grep it seems flags are not combined, except for 
>> something strange with CODE_DIRTY_FLAG:
>>
>>> static void notdirty_mem_writel(void *opaque, target_phys_addr_t 
>>> ram_addr,
>>>                                 uint32_t val)
>>> {
>>>     int dirty_flags;
>>>     dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>>>     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
>>> #if !defined(CONFIG_USER_ONLY)
>>>         tb_invalidate_phys_page_fast(ram_addr, 4);
>>>         dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>>> #endif
>>>     }
>>>     stl_p(qemu_get_ram_ptr(ram_addr), val);
>>>     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
>>>     phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
>>>     /* we remove the notdirty callback only if the code has been
>>>        flushed */
>>>     if (dirty_flags == 0xff)
>>>         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
>>> }
>>
>> I can't say I understand what it does.
>
> The semantics of CODE_DIRTY_FLAG are a little counter intuitive.  
> CODE_DIRTY_FLAG means that we know that something isn't code so writes 
> do not need checking for self modifying code.

So the hardware equivalent is, when the Instruction TLB loads a page 
address, clear CODE_DIRTY_FLAG?

>
> notdirty_mem_write() is called for any ram that is in the virtual TLB 
> that has not been updated yet and once a write has occurred, we can 
> switch to faster access functions (provided we've invalidated any 
> translation blocks).
>
> That's why the check is if (!(dirty_flags & CODE_DIRTY_FLAG)), if it 
> hasn't been set yet, we have to assume that it could be a TB so we 
> need to invalidate it.  tb_invalidate_phys_page_fast() will set the 
> CODE_DIRTY_FLAG if no code is present in that memory area which is why 
> we fetch dirty_flags again.

Ok.

>
> We do the store, and then set the dirty bits to mark that the page is 
> now dirty taking care to not change the CODE_DIRTY_FLAG bit.
>
> At the very end, we check to see if CODE_DIRTY_FLAG which indicates 
> that we no longer need to trap writes.  If so, we call tlb_set_dirty() 
> which will ultimately remove the notdirty callback in favor of a 
> faster access mechanism.
>
> With respect patch series, there should be no problem having a 
> separate code bitmap that gets updated along with a main bitmap 
> provided that the semantics of CODE_DIRTY_FLAG are preserved.
>
>>> Sounds good to me.
>>> So we're going to introduce 4 (VGA, CODE, MIGRATION, master) 
>>> bit-based bitmaps in total.
>>>
>>
>> Yeah, except CODE doesn't behave like the others.  Would be best to 
>> understand what it's requirements are before making the change.  
>> Maybe CODE will need separate handling (so master will only feed VGA 
>> and MIGRATION).
>
> Generally speaking, cpu_physical_memory_set_dirty() is called by the 
> device model.  Any writes by the device model that results in 
> self-modifying code are not going to have predictable semantics which 
> is why it can set CODE_DIRTY_FLAG.
>
> CODE_DIRTY_FLAG doesn't need to get updated from a master bitmap.  It 
> should be treated as a separate bitmap that is strictly dealt with by 
> the virtual TLB.

Thanks.
Anthony Liguori - March 16, 2010, 2:50 p.m.
On 03/16/2010 08:57 AM, Avi Kivity wrote:
> On 03/16/2010 03:51 PM, Anthony Liguori wrote:
>> On 03/16/2010 08:29 AM, Avi Kivity wrote:
>>> On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
>>>> Avi Kivity wrote:
>>>>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>>>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>>>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>>>>> prevent
>>>>>> direct access to the phys_ram_dirty bitmap.
>>>>>
>>>>>> +
>>>>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t 
>>>>>> addr)
>>>>>> +{
>>>>>> + unsigned long mask;
>>>>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>>>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>>>>> + int ret = 0;
>>>>>> +
>>>>>> + mask = 1UL<< offset;
>>>>>> + if (phys_ram_vga_dirty[index]& mask)
>>>>>> + ret |= VGA_DIRTY_FLAG;
>>>>>> + if (phys_ram_code_dirty[index]& mask)
>>>>>> + ret |= CODE_DIRTY_FLAG;
>>>>>> + if (phys_ram_migration_dirty[index]& mask)
>>>>>> + ret |= MIGRATION_DIRTY_FLAG;
>>>>>> +
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>>>>> int dirty_flags)
>>>>>> {
>>>>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>>>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>>>>> }
>>>>>
>>>>> This turns one cacheline access into three. If the dirty bitmaps 
>>>>> were in
>>>>> an array, you could do
>>>>>
>>>>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>>>>> BITS_IN_LONG)] & mask;
>>>>>
>>>>> with one cacheline access.
>>>>
>>>> If I'm understanding the existing code correctly,
>>>> int dirty_flags can be combined, like VGA + MIGRATION.
>>>> If we only have to worry about a single dirty flag, I agree with 
>>>> your idea.
>>>
>>> From a quick grep it seems flags are not combined, except for 
>>> something strange with CODE_DIRTY_FLAG:
>>>
>>>> static void notdirty_mem_writel(void *opaque, target_phys_addr_t 
>>>> ram_addr,
>>>>                                 uint32_t val)
>>>> {
>>>>     int dirty_flags;
>>>>     dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>>>>     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
>>>> #if !defined(CONFIG_USER_ONLY)
>>>>         tb_invalidate_phys_page_fast(ram_addr, 4);
>>>>         dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
>>>> #endif
>>>>     }
>>>>     stl_p(qemu_get_ram_ptr(ram_addr), val);
>>>>     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
>>>>     phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
>>>>     /* we remove the notdirty callback only if the code has been
>>>>        flushed */
>>>>     if (dirty_flags == 0xff)
>>>>         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
>>>> }
>>>
>>> I can't say I understand what it does.
>>
>> The semantics of CODE_DIRTY_FLAG are a little counter intuitive.  
>> CODE_DIRTY_FLAG means that we know that something isn't code so 
>> writes do not need checking for self modifying code.
>
> So the hardware equivalent is, when the Instruction TLB loads a page 
> address, clear CODE_DIRTY_FLAG?

Yes, and is what tlb_protect_code() does and it's called from 
tb_alloc_page() which is what's code when a TB is created.

Regards,

Anthony Liguori
Blue Swirl - March 16, 2010, 8:10 p.m.
On 3/16/10, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 03/16/2010 08:57 AM, Avi Kivity wrote:
>
> > On 03/16/2010 03:51 PM, Anthony Liguori wrote:
> >
> > > On 03/16/2010 08:29 AM, Avi Kivity wrote:
> > >
> > > > On 03/16/2010 03:17 PM, Yoshiaki Tamura wrote:
> > > >
> > > > > Avi Kivity wrote:
> > > > >
> > > > > > On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
> > > > > >
> > > > > > > Modifies wrapper functions for byte-based phys_ram_dirty bitmap
> to
> > > > > > > bit-based phys_ram_dirty bitmap, and adds more wrapper functions
> to
> > > > > > > prevent
> > > > > > > direct access to the phys_ram_dirty bitmap.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > > +
> > > > > > > +static inline int
> cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
> > > > > > > +{
> > > > > > > + unsigned long mask;
> > > > > > > + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
> > > > > > > + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
> > > > > > > + int ret = 0;
> > > > > > > +
> > > > > > > + mask = 1UL<< offset;
> > > > > > > + if (phys_ram_vga_dirty[index]& mask)
> > > > > > > + ret |= VGA_DIRTY_FLAG;
> > > > > > > + if (phys_ram_code_dirty[index]& mask)
> > > > > > > + ret |= CODE_DIRTY_FLAG;
> > > > > > > + if (phys_ram_migration_dirty[index]& mask)
> > > > > > > + ret |= MIGRATION_DIRTY_FLAG;
> > > > > > > +
> > > > > > > + return ret;
> > > > > > > }
> > > > > > >
> > > > > > > static inline int
> cpu_physical_memory_get_dirty(ram_addr_t addr,
> > > > > > > int dirty_flags)
> > > > > > > {
> > > > > > > - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
> > > > > > > + return
> cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > > This turns one cacheline access into three. If the dirty bitmaps
> were in
> > > > > > an array, you could do
> > > > > >
> > > > > > return dirty_bitmaps[dirty_index][addr >>
> (TARGET_PAGE_BITS +
> > > > > > BITS_IN_LONG)] & mask;
> > > > > >
> > > > > > with one cacheline access.
> > > > > >
> > > > >
> > > > > If I'm understanding the existing code correctly,
> > > > > int dirty_flags can be combined, like VGA + MIGRATION.
> > > > > If we only have to worry about a single dirty flag, I agree with
> your idea.
> > > > >
> > > >
> > > > From a quick grep it seems flags are not combined, except for
> something strange with CODE_DIRTY_FLAG:
> > > >
> > > >
> > > > >  static void notdirty_mem_writel(void *opaque,
> target_phys_addr_t ram_addr,
> > > > >                                uint32_t val)
> > > > > {
> > > > >    int dirty_flags;
> > > > >    dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
> > > > >    if (!(dirty_flags & CODE_DIRTY_FLAG)) {
> > > > > #if !defined(CONFIG_USER_ONLY)
> > > > >        tb_invalidate_phys_page_fast(ram_addr, 4);
> > > > >        dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
> > > > > #endif
> > > > >    }
> > > > >    stl_p(qemu_get_ram_ptr(ram_addr), val);
> > > > >    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
> > > > >    phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
> > > > >    /* we remove the notdirty callback only if the code has been
> > > > >       flushed */
> > > > >    if (dirty_flags == 0xff)
> > > > >        tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
> > > > > }
> > > > >
> > > >
> > > > I can't say I understand what it does.
> > > >
> > >
> > > The semantics of CODE_DIRTY_FLAG are a little counter intuitive.
> CODE_DIRTY_FLAG means that we know that something isn't code so writes do
> not need checking for self modifying code.
> > >
> >
> > So the hardware equivalent is, when the Instruction TLB loads a page
> address, clear CODE_DIRTY_FLAG?
> >
>
>  Yes, and is what tlb_protect_code() does and it's called from
> tb_alloc_page() which is what's code when a TB is created.

Just a tangential note: a long time ago, I tried to disable self
modifying code detection for Sparc. On most RISC architectures, SMC
needs explicit flushing so in theory we need not track code memory
writes. However, during exceptions the translator needs to access the
original unmodified code that was used to generate the TB. But maybe
there are other ways to avoid SMC tracking, on x86 it's still needed
but I suppose SMC is pretty rare.
Richard Henderson - March 16, 2010, 10:31 p.m.
On 03/16/2010 01:10 PM, Blue Swirl wrote:
> Just a tangential note: a long time ago, I tried to disable self
> modifying code detection for Sparc. On most RISC architectures, SMC
> needs explicit flushing so in theory we need not track code memory
> writes. However, during exceptions the translator needs to access the
> original unmodified code that was used to generate the TB. But maybe
> there are other ways to avoid SMC tracking, on x86 it's still needed
> but I suppose SMC is pretty rare.

True SMC is fairly rare, but the SMC checker triggers fairly often
on the PLT update during dynamic linking.  Nearly all cpus (x86 being
the only exception I recall) needed to re-design their PLT format to
avoid this code update in order to support SELinux.

Where does the translator need access to this original code?  I was
just thinking about this problem today, wondering how much overhead
there is with this SMC page protection thing.


r~
Yoshiaki Tamura - March 16, 2010, 10:50 p.m.
Anthony Liguori wrote:
> On 03/16/2010 07:45 AM, Avi Kivity wrote:
>> On 03/16/2010 12:53 PM, Yoshiaki Tamura wrote:
>>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>>> bit-based phys_ram_dirty bitmap, and adds more wrapper functions to
>>> prevent
>>> direct access to the phys_ram_dirty bitmap.
>>
>>> +
>>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>>> +{
>>> + unsigned long mask;
>>> + int index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>>> + int ret = 0;
>>> +
>>> + mask = 1UL<< offset;
>>> + if (phys_ram_vga_dirty[index]& mask)
>>> + ret |= VGA_DIRTY_FLAG;
>>> + if (phys_ram_code_dirty[index]& mask)
>>> + ret |= CODE_DIRTY_FLAG;
>>> + if (phys_ram_migration_dirty[index]& mask)
>>> + ret |= MIGRATION_DIRTY_FLAG;
>>> +
>>> + return ret;
>>> }
>>>
>>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>> int dirty_flags)
>>> {
>>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>>> + return cpu_physical_memory_get_dirty_flags(addr)& dirty_flags;
>>> }
>>
>> This turns one cacheline access into three. If the dirty bitmaps were
>> in an array, you could do
>>
>> return dirty_bitmaps[dirty_index][addr >> (TARGET_PAGE_BITS +
>> BITS_IN_LONG)] & mask;
>>
>> with one cacheline access.
>
> As far as I can tell, we only ever call with a single flag so your
> suggestion makes sense.
>
> I'd suggest introducing these functions before splitting the bitmap up.
> It makes review a bit easier.

Thanks for your advise.
I'll post the wrapper functions for existing byte-based phys_ram_dirty first.

Yoshi
Paul Brook - March 17, 2010, 12:05 a.m.
> Where does the translator need access to this original code?  I was
> just thinking about this problem today, wondering how much overhead
> there is with this SMC page protection thing.

When an MMU fault occurs qemu re-translates the TB with additional annotations 
to determine which guest instruction caused the fault.
See translate-all.c:cpu_restore_state().

Paul
Avi Kivity - March 17, 2010, 4:07 a.m.
On 03/16/2010 10:10 PM, Blue Swirl wrote:
>
>>   Yes, and is what tlb_protect_code() does and it's called from
>> tb_alloc_page() which is what's code when a TB is created.
>>      
> Just a tangential note: a long time ago, I tried to disable self
> modifying code detection for Sparc. On most RISC architectures, SMC
> needs explicit flushing so in theory we need not track code memory
> writes. However, during exceptions the translator needs to access the
> original unmodified code that was used to generate the TB. But maybe
> there are other ways to avoid SMC tracking, on x86 it's still needed
>    

On x86 you're supposed to execute a serializing instruction (one of 
INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control 
register, with the exception of MOV CR8), MOV (to debug register), 
WBINVD, WRMSR, CPUID, IRET, and RSM) before running modified code.

> but I suppose SMC is pretty rare.
>    

Every time you demand load a code page from disk, you're running self 
modifying code (though it usually doesn't exist in the tlb, so there's 
no previous version that can cause trouble).
Paul Brook - March 17, 2010, 4:06 p.m.
> On 03/16/2010 10:10 PM, Blue Swirl wrote:
> >>   Yes, and is what tlb_protect_code() does and it's called from
> >> tb_alloc_page() which is what's code when a TB is created.
> >
> > Just a tangential note: a long time ago, I tried to disable self
> > modifying code detection for Sparc. On most RISC architectures, SMC
> > needs explicit flushing so in theory we need not track code memory
> > writes. However, during exceptions the translator needs to access the
> > original unmodified code that was used to generate the TB. But maybe
> > there are other ways to avoid SMC tracking, on x86 it's still needed
> 
> On x86 you're supposed to execute a serializing instruction (one of
> INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control
> register, with the exception of MOV CR8), MOV (to debug register),
> WBINVD, WRMSR, CPUID, IRET, and RSM) before running modified code.

Last time I checked, a jump instruction was sufficient to ensure coherency 
withing a core.  Serializing instructions are only required for coherency 
between cores on SMP systems.

QEMU effectively has a very large physically tagged icache[1] with very 
expensive cache loads.  AFAIK The only practical way to maintain that cache on 
x86 targets is to do write snooping via dirty bits. On targets that mandate 
explicit icache invalidation we might be able to get away with this, however I 
doubt it actually gains you anything - a correctly written guest is going to 
invalidate at least as much as we get from dirty tracking, and we still need 
to provide correct behaviour when executing with cache disabled.

> > but I suppose SMC is pretty rare.
> 
> Every time you demand load a code page from disk, you're running self
> modifying code (though it usually doesn't exist in the tlb, so there's
> no previous version that can cause trouble).

I think you're confusing TLB flushes with TB flushes.

Paul

[1] Even modern x86 only have relatively small icache. The large L2/L3 caches 
aren't relevant as they are unified I/D caches.
Avi Kivity - March 17, 2010, 4:28 p.m.
On 03/17/2010 06:06 PM, Paul Brook wrote:
>> On 03/16/2010 10:10 PM, Blue Swirl wrote:
>>      
>>>>    Yes, and is what tlb_protect_code() does and it's called from
>>>> tb_alloc_page() which is what's code when a TB is created.
>>>>          
>>> Just a tangential note: a long time ago, I tried to disable self
>>> modifying code detection for Sparc. On most RISC architectures, SMC
>>> needs explicit flushing so in theory we need not track code memory
>>> writes. However, during exceptions the translator needs to access the
>>> original unmodified code that was used to generate the TB. But maybe
>>> there are other ways to avoid SMC tracking, on x86 it's still needed
>>>        
>> On x86 you're supposed to execute a serializing instruction (one of
>> INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control
>> register, with the exception of MOV CR8), MOV (to debug register),
>> WBINVD, WRMSR, CPUID, IRET, and RSM) before running modified code.
>>      
> Last time I checked, a jump instruction was sufficient to ensure coherency
> withing a core.  Serializing instructions are only required for coherency
> between cores on SMP systems.
>    

Yeah, the docs say either a jump or a serializing instruction is needed.

> QEMU effectively has a very large physically tagged icache[1] with very
> expensive cache loads.  AFAIK The only practical way to maintain that cache on
> x86 targets is to do write snooping via dirty bits. On targets that mandate
> explicit icache invalidation we might be able to get away with this, however I
> doubt it actually gains you anything - a correctly written guest is going to
> invalidate at least as much as we get from dirty tracking, and we still need
> to provide correct behaviour when executing with cache disabled.
>    

Agreed.

>    
>>> but I suppose SMC is pretty rare.
>>>        
>> Every time you demand load a code page from disk, you're running self
>> modifying code (though it usually doesn't exist in the tlb, so there's
>> no previous version that can cause trouble).
>>      
> I think you're confusing TLB flushes with TB flushes.
>    

No - my thinking was page fault, load page, invlpg, continue.  But the 
invlpg is unneeded, and "continue" has to include a jump anyway.

Patch

diff --git a/cpu-all.h b/cpu-all.h
index 9bc01b9..91ec3e5 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -843,7 +843,9 @@  int cpu_str_to_log_mask(const char *str);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
+extern unsigned long *phys_ram_vga_dirty;
+extern unsigned long *phys_ram_code_dirty;
+extern unsigned long *phys_ram_migration_dirty;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 extern uint8_t *bios_mem;
@@ -879,20 +881,104 @@  int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    unsigned long mask;
+    int index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    return (phys_ram_vga_dirty[index]  &
+            phys_ram_code_dirty[index] &
+            phys_ram_migration_dirty[index] & mask) == mask;
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+    unsigned long mask;
+    int index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    int ret = 0;
+
+    mask = 1UL << offset;
+    if (phys_ram_vga_dirty[index] & mask)
+        ret |= VGA_DIRTY_FLAG;
+    if (phys_ram_code_dirty[index] & mask)
+        ret |=  CODE_DIRTY_FLAG;
+    if (phys_ram_migration_dirty[index] & mask)
+        ret |= MIGRATION_DIRTY_FLAG;
+
+    return ret;
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return cpu_physical_memory_get_dirty_flags(addr) & dirty_flags;
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    unsigned long mask;
+    int index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    phys_ram_vga_dirty[index] |= mask;
+    phys_ram_code_dirty[index] |= mask;
+    phys_ram_migration_dirty[index] |= mask;
+}
+
+static inline void cpu_physical_memory_set_dirty_range(ram_addr_t addr,
+                                                       unsigned long mask)
+{
+    int index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+
+    phys_ram_vga_dirty[index] |= mask;
+    phys_ram_code_dirty[index] |= mask;
+    phys_ram_migration_dirty[index] |= mask;
 }
 
+static inline void cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
+                                                       int dirty_flags)
+{
+    unsigned long mask;
+    int index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    if (dirty_flags & VGA_DIRTY_FLAG)
+        phys_ram_vga_dirty[index] |= mask;
+    if (dirty_flags & CODE_DIRTY_FLAG)
+        phys_ram_code_dirty[index] |= mask;
+    if (dirty_flags & MIGRATION_DIRTY_FLAG)
+        phys_ram_migration_dirty[index] |= mask;
+}
+
+static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
+                                                        int length,
+                                                        int dirty_flags)
+{
+    ram_addr_t addr = start;
+    unsigned long mask;
+    int index, offset, i;
+
+    for (i = 0;  i < length; i += TARGET_PAGE_SIZE) {
+        index = ((addr + i) >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+        offset = ((addr + i) >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+        mask = ~(1UL << offset);
+
+        if (dirty_flags & VGA_DIRTY_FLAG)
+            phys_ram_vga_dirty[index] &= mask;
+        if (dirty_flags & CODE_DIRTY_FLAG)
+            phys_ram_code_dirty[index] &= mask;
+        if (dirty_flags & MIGRATION_DIRTY_FLAG)
+            phys_ram_migration_dirty[index] &= mask;
+     }
+}
+
+int cpu_physical_memory_get_dirty_range(ram_addr_t start, ram_addr_t end, 
+                                        ram_addr_t *dirty_rams, int length,
+                                        int dirty_flags);
+
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags);
 void cpu_tlb_update_dirty(CPUState *env);