Message ID: 5edff645-12e8-d3e0-1849-302b6986c232@ozlabs.ru
State: New
On 23/02/2017 05:20, Alexey Kardashevskiy wrote:
> First, Paolo is right and ram_device_mem_ops::endianness should be
> host-endian which happens to be little in our test case (ppc64le)

So you tested a ppc64 BE guest and it works?

> Keep things where they are in the VFIO department and just fix
> ram_device_mem_ops::endianness?

I would fix the ram_device_mem_ops. Either by introducing
DEVICE_HOST_ENDIAN(*) or with Yongji's patch.

(*) DEVICE_NATIVE_ENDIAN is special cased all over the place because
the same device (in a file that's compiled just once) can be either
little- or big-endian. DEVICE_HOST_ENDIAN can be a simple #define to
either DEVICE_LITTLE_ENDIAN or DEVICE_BIG_ENDIAN, because host
endianness is the same for all QEMU binaries. It's literally half a
dozen lines of code.

Paolo
On 23 February 2017 at 08:35, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I would fix the ram_device_mem_ops. Either by introducing
> DEVICE_HOST_ENDIAN(*) or with Yongji's patch.
>
> (*) DEVICE_NATIVE_ENDIAN is special cased all over the place
> because the same device (in a file that's compiled just once)
> can be either little- or big-endian. DEVICE_HOST_ENDIAN can
> be a simple #define to either DEVICE_LITTLE_ENDIAN or
> DEVICE_BIG_ENDIAN, because host endianness is the same for
> all QEMU binaries. It's literally half a dozen lines of code.

I'm really not convinced we need DEVICE_HOST_ENDIAN. RAM areas should
be target-endian (you can probably define "target endianness" as "the
endianness that RAM areas have".)

-- PMM
On 23/02/2017 11:02, Peter Maydell wrote:
> I'm really not convinced we need DEVICE_HOST_ENDIAN. RAM
> areas should be target-endian (you can probably define
> "target endianness" as "the endianness that RAM areas have".)

This is not RAM. This is MMIO, backed by a MMIO area in the host. The
MemoryRegionOps read from the MMIO area (so the data has host
endianness) and do not do any further swap:

    data = *(uint16_t *)(mr->ram_block->host + addr);

Here, the dereference is basically the same as ldl_he_p.

If you wanted to make the MemoryRegion use DEVICE_NATIVE_ENDIAN, you'd
need to tswap around the access. Or you can use ldl_le_p and
DEVICE_LITTLE_ENDIAN (this is what Yongji's patch open codes), or
ldl_be_p and DEVICE_BIG_ENDIAN. They are all the same in the end.

Paolo
On 23 February 2017 at 10:10, Paolo Bonzini <pbonzini@redhat.com> wrote:
> This is not RAM. This is MMIO, backed by a MMIO area in the host.

Hmm, I see...the naming is a bit unfortunate if it's not RAM.

> The MemoryRegionOps read from the MMIO area (so the data has host
> endianness) and do not do any further swap:
>
>     data = *(uint16_t *)(mr->ram_block->host + addr);
>
> Here, the dereference is basically the same as ldl_he_p.
>
> If you wanted to make the MemoryRegion use DEVICE_NATIVE_ENDIAN, you'd
> need to tswap around the access. Or you can use ldl_le_p and
> DEVICE_LITTLE_ENDIAN (this is what Yongji's patch open codes), or
> ldl_be_p and DEVICE_BIG_ENDIAN. They are all the same in the end.

Using stl_p &c in a DEVICE_NATIVE_ENDIAN MR would work too, right?
(This is how all the NATIVE_ENDIAN MRs in exec.c work.)

thanks
-- PMM
On 23/02/2017 11:23, Peter Maydell wrote:
> On 23 February 2017 at 10:10, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> This is not RAM. This is MMIO, backed by a MMIO area in the host.
>
> Hmm, I see...the naming is a bit unfortunate if it's not RAM.

Yeah, it's called like that because it is backed by a RAMBlock but it
returns false for memory_access_is_direct.

> Using stl_p &c in a DEVICE_NATIVE_ENDIAN MR would work too, right?
> (This is how all the NATIVE_ENDIAN MRs in exec.c work.)

Yes, it should, as long as the memcpy(...) of {ld,st}*_he_p is compiled
to a single access, which should be the case.

Paolo
On 23/02/17 19:35, Paolo Bonzini wrote:
> On 23/02/2017 05:20, Alexey Kardashevskiy wrote:
>> First, Paolo is right and ram_device_mem_ops::endianness should be
>> host-endian which happens to be little in our test case (ppc64le)
>
> So you tested a ppc64 BE guest and it works?

Yes. Setting ram_device_mem_ops.endianness = DEVICE_LITTLE_ENDIAN does
the job for both LE and BE guests on the LE host.
On 23 February 2017 at 10:33, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Yeah, it's called like that because it is backed by a RAMBlock but it
> returns false for memory_access_is_direct.

We should probably update the doc comment to note that the pointer is
to host-endianness memory (and that this is not like normal RAM which
is target-endian)...

> Yes, it should, as long as the memcpy(...) of {ld,st}*_he_p is compiled
> to a single access, which should be the case.

...and whichever of these approaches we take, we should have a comment
which notes that we are converting from the host endianness memory to
the endianness specified by the MemoryRegion endianness attribute.

thanks
-- PMM
On 23/02/2017 12:34, Peter Maydell wrote:
> We should probably update the doc comment to note that the
> pointer is to host-endianness memory (and that this is not
> like normal RAM which is target-endian)...

I wouldn't call it host-endianness memory, and I disagree that normal
RAM is target-endian---in both cases it's just a bunch of bytes.

However, the access done by the MemoryRegionOps callbacks needs to
match the endianness declared by the MemoryRegionOps themselves.

Paolo
On 23 February 2017 at 11:43, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I wouldn't call it host-endianness memory, and I disagree that normal
> RAM is target-endian---in both cases it's just a bunch of bytes.
>
> However, the access done by the MemoryRegionOps callbacks needs to match
> the endianness declared by the MemoryRegionOps themselves.

Well, if the guest stores a bunch of integers to the memory, which way
round do you see them when you look at the bunch of bytes? The answer
is different for the two things, so something is different about their
endianness...

thanks
-- PMM
On 23/02/2017 13:26, Peter Maydell wrote:
> Well, if the guest stores a bunch of integers to the memory, which
> way round do you see them when you look at the bunch of bytes?

You see them in whatever endianness the guest used.

So let's say a little-endian guest is writing 0x12345678 to a register.
With DEVICE_HOST_ENDIAN, some examples are:

- little-endian host, little-endian device. The guest will write
  0x12345678, QEMU will write 0x12345678 which becomes 0x78 0x56 0x34
  0x12. The little-endian device is happy because it sees it as
  0x12345678.

- little-endian host, big-endian device. The guest will write
  0x78563412 to account for the different endianness of the device.
  QEMU will also write 0x78563412 which becomes 0x12 0x34 0x56 0x78.
  The big-endian device is happy because it sees it as 0x12345678.

- big-endian host, little-endian device. The guest will write
  0x12345678, memory.c will flip it and QEMU will write 0x78563412. In
  bytes this becomes 0x78 0x56 0x34 0x12. The little-endian device is
  happy because it sees it as 0x12345678.

- big-endian host, big-endian device. The guest will write 0x78563412
  to account for the different endianness of the device. memory.c will
  flip it and QEMU will write 0x12345678, which becomes 0x12 0x34 0x56
  0x78. The big-endian device is happy because it sees it as
  0x12345678.

With DEVICE_LITTLE_ENDIAN you add two flips on big-endian hosts; with
DEVICE_BIG_ENDIAN you add two flips on little-endian hosts.

Paolo
On 23 February 2017 at 12:53, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 23/02/2017 13:26, Peter Maydell wrote:
>> Well, if the guest stores a bunch of integers to the memory, which
>> way round do you see them when you look at the bunch of bytes?
>
> You see them in whatever endianness the guest used.

I'm confused. I said "normal RAM and this ramdevice memory are
different", and you seem to be saying they're the same. I don't think
they are (in particular I think with a BE guest on an LE host they'll
look different).

thanks
-- PMM
On 23/02/2017 15:35, Peter Maydell wrote:
> I'm confused. I said "normal RAM and this ramdevice memory are
> different", and you seem to be saying they're the same. I don't
> think they are (in particular I think with a BE guest on an
> LE host they'll look different).

No, they look entirely the same. The only difference is that they go
through MemoryRegionOps instead of memcpy.

Paolo
On 23 February 2017 at 15:21, Paolo Bonzini <pbonzini@redhat.com> wrote:
> No, they look entirely the same. The only difference is that they go
> through MemoryRegionOps instead of memcpy.

Then we have a different problem, because the thing this patch is
claiming to fix is that the memory the device is backed by (from vfio)
is little-endian and we're not accessing it right.

RAM of the usual sort is target-endian (by which I mean "when the guest
does a write of 32-bits 0x12345678, and you look at the memory byte by
byte then the order of bytes is either 0x78 0x56 0x34 0x12 if
TARGET_LITTLE_ENDIAN or 0x12 0x34 0x56 0x78 if TARGET_BIG_ENDIAN").

AIUI what we want for this VFIO case is "when the guest does a 32-bit
write of 0x12345678 then the bytes are 0x78 0x56 0x34 0x12 regardless
of whether TARGET_BIG_ENDIAN or not".

thanks
-- PMM
On Thu, 23 Feb 2017 16:21:47 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:
> No, they look entirely the same. The only difference is that they go
> through MemoryRegionOps instead of memcpy.

Is this true for vfio use case? If we use memcpy we're talking directly
to the device with no endian conversions. If we use read/write then
there is an endian conversion in the host kernel.
On 23/02/2017 16:39, Alex Williamson wrote:
> Is this true for vfio use case? If we use memcpy we're talking directly
> to the device with no endian conversions. If we use read/write then
> there is an endian conversion in the host kernel.

But ramd MemoryRegionOps do not use file read/write, they use memory
read/write, so they talk directly to the device.

Paolo
On 23/02/2017 16:29, Peter Maydell wrote:
> Then we have a different problem, because the thing this patch
> is claiming to fix is that the memory the device is backed by
> (from vfio) is little-endian and we're not accessing it right.
>
> RAM of the usual sort is target-endian (by which I mean "when the guest
> does a write of 32-bits 0x12345678, and you look at the memory byte
> by byte then the order of bytes is either 0x78 0x56 0x34 0x12 if
> TARGET_LITTLE_ENDIAN or 0x12 0x34 0x56 0x78 if TARGET_BIG_ENDIAN").

Okay, I think I see what you mean.

MMIO regions do not expect their datum in any endianness; they just get
a uint64_t representing an 8/16/32/64-bit value. The datum undergoes a
conversion if necessary, according to target and MemoryRegionOps
endiannesses, before the callback is invoked.

In the ramd case, the ops convert between
bunch-of-bytes-stored-into-RAM-by-host and uint64_t (to uint64_t for
read, from uint64_t for write). The conversion from/to target
endianness has been done already by the memory.c core code. The ops'
task is to ensure that this bunch of bytes has the exact order that the
guest desired. There's more than one way to do it---all that's needed
is that the order used by the ops matches the one specified by the ops'
endianness, resulting in an even number of swaps.

However, DEVICE_NATIVE_ENDIAN would have to be paired with tswap, which
the current code does not do, hence the bug. To have no swap at all,
you'd need DEVICE_HOST_ENDIAN.

> AIUI what we want for this VFIO case is "when the guest does a 32-bit
> write of 0x12345678 then the bytes are 0x78 0x56 0x34 0x12 regardless
> of whether TARGET_BIG_ENDIAN or not".

No, I don't think so. This is not specific to VFIO. You can do it with
any device, albeit VFIO is currently the only one using ramd regions.

Paolo
On 23 February 2017 at 15:58, Paolo Bonzini <pbonzini@redhat.com> wrote:
> MMIO regions do not expect their datum in any endianness; they just get
> a uint64_t representing an 8/16/32/64-bit value. The datum undergoes a
> conversion if necessary, according to target and MemoryRegionOps
> endiannesses, before the callback is invoked.

Yep. However, how they implement 8/16/32/64 bit transfers implies an
endianness model (assuming it is consistent and the device provides
enough of the operations to determine one).

> However, DEVICE_NATIVE_ENDIAN would have to be paired with tswap, which
> the current code does not do, hence the bug. To have no swap at all,
> you'd need DEVICE_HOST_ENDIAN.

Yes, I agree that the current ramdevice code has this bug (and that we
can fix it by any of the various options).

> No, I don't think so. This is not specific to VFIO. You can do it with
> any device, albeit VFIO is currently the only one using ramd regions.

The commit message in the patch that started this thread off says
specifically that "VFIO PCI device is little endian". Is that wrong?

thanks
-- PMM
On Thu, 23 Feb 2017 16:47:23 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Is this true for vfio use case? If we use memcpy we're talking directly
> > to the device with no endian conversions. If we use read/write then
> > there is an endian conversion in the host kernel.
>
> But ramd MemoryRegionOps do not use file read/write, they use memory
> read/write, so they talk directly to the device.

Ah ha! Ding! ;) Sorry, I forgot that.
On 23/02/2017 17:08, Peter Maydell wrote:
> Yes, I agree that the current ramdevice code has this bug (and
> that we can fix it by any of the various options).

Good. :)

> The commit message in the patch that started this thread off
> says specifically that "VFIO PCI device is little endian".
> Is that wrong?

Yes, I think it's a red herring. Hence my initial confusion, when I
asked "would Yongji's patch just work if it used DEVICE_BIG_ENDIAN and
beNN_to_cpu/cpu_to_beNN".

Paolo
On Thu, Feb 23, 2017 at 03:29:53PM +0000, Peter Maydell wrote:
> Then we have a different problem, because the thing this patch
> is claiming to fix is that the memory the device is backed by
> (from vfio) is little-endian and we're not accessing it right.
>
> RAM of the usual sort is target-endian (by which I mean "when the guest
> does a write of 32-bits 0x12345678, and you look at the memory byte
> by byte then the order of bytes is either 0x78 0x56 0x34 0x12 if
> TARGET_LITTLE_ENDIAN or 0x12 0x34 0x56 0x78 if TARGET_BIG_ENDIAN").
>
> AIUI what we want for this VFIO case is "when the guest does a 32-bit
> write of 0x12345678 then the bytes are 0x78 0x56 0x34 0x12 regardless
> of whether TARGET_BIG_ENDIAN or not".

At least in the case of KVM and MMIO emulation (which is the case
here), run->mmio.data should be considered as a byte stream without
endianness, and what is needed is for QEMU to transfer data between
run->mmio.data and the device (or whatever is backing the MMIO region)
without any net byte swap. So if QEMU is doing a 32-bit host-endian
load from run->mmio.data (for an MMIO store), then it needs to do a
32-bit host-endian store into the device memory. In other words, the
RAM memory region needs to be considered host endian to match
run->mmio.data being considered host endian.

Paul.
On Thu, Feb 23, 2017 at 12:43:37PM +0100, Paolo Bonzini wrote: > > > On 23/02/2017 12:34, Peter Maydell wrote: > > On 23 February 2017 at 10:33, Paolo Bonzini <pbonzini@redhat.com> wrote: > >> > >> > >> On 23/02/2017 11:23, Peter Maydell wrote: > >>> On 23 February 2017 at 10:10, Paolo Bonzini <pbonzini@redhat.com> wrote: > >>>> On 23/02/2017 11:02, Peter Maydell wrote: > >>>>> I'm really not convinced we need DEVICE_HOST_ENDIAN. RAM > >>>>> areas should be target-endian (you can probably define > >>>>> "target endianness" as "the endianness that RAM areas have".) > >>>> > >>>> This is not RAM. This is MMIO, backed by a MMIO area in the host. > >>> > >>> Hmm, I see...the naming is a bit unfortunate if it's not RAM. > >> > >> Yeah, it's called like that because it is backed by a RAMBlock but it > >> returns false for memory_access_is_direct. > > > > We should probably update the doc comment to note that the > > pointer is to host-endianness memory (and that this is not > > like normal RAM which is target-endian)... > > I wouldn't call it host-endianness memory, and I disagree that normal > RAM is target-endian---in both cases it's just a bunch of bytes. Right. Really, endianness is not a property of the device - even less so of RAM. It's a property of an individual multibyte value and how it is interpreted relative to individual byte addresses. Really the declarations in the MRs are saying: assuming the guest (software + hardware) does (big|little|target native) accesses on this device, do the right swaps so we get host native multi-byte values which match the original multibyte values the guest had in mind. What we want for memory (both RAM and VFIO MMIO) is: don't even try to preserve multi-byte values between host and guest views, just preserve byte address order. Tastes may vary as to whether you call that "host endian" or "no endian" or "bag-o'-bytesian" or whatever. 
> However, the access done by the MemoryRegionOps callbacks needs to match
> the endianness declared by the MemoryRegionOps themselves.
> 
> Paolo
> 
> >>>> The
> >>>> MemoryRegionOps read from the MMIO area (so the data has host
> >>>> endianness) and do not do any further swap:
> >>>>
> >>>>     data = *(uint16_t *)(mr->ram_block->host + addr);
> >>>>
> >>>> Here, the dereference is basically the same as ldl_he_p.
> >>>>
> >>>> If you wanted to make the MemoryRegion use DEVICE_NATIVE_ENDIAN, you'd
> >>>> need to tswap around the access.  Or you can use ldl_le_p and
> >>>> DEVICE_LITTLE_ENDIAN (this is what Yongji's patch open codes), or
> >>>> ldl_be_p and DEVICE_BIG_ENDIAN.  They are all the same in the end.
> >>>
> >>> Using stl_p &c in a DEVICE_NATIVE_ENDIAN MR would work too, right?
> >>> (This is how all the NATIVE_ENDIAN MRs in exec.c work.)
> >>
> >> Yes, it should, as long as the memcpy(...) of {ld,st}*_he_p is compiled
> >> to a single access, which should be the case.
> >
> > ...and whichever of these approaches we take, we should have
> > a comment which notes that we are converting from the host
> > endianness memory to the endianness specified by the MemoryRegion
> > endianness attribute.
> >
> > thanks
> > -- PMM
Quoting Alexey Kardashevskiy (2017-02-22 22:20:25) > On 21/02/17 17:46, Yongji Xie wrote: > > At the moment ram device's memory regions are NATIVE_ENDIAN. This does > > not work on PPC64 because VFIO PCI device is little endian but PPC64 > > always defines static macro TARGET_WORDS_BIGENDIAN. > > > > This fixes endianness for ram device the same way as it is done > > for VFIO region in commit 6758008e2c4e79fb6bd04fe8e7a41665fa583965. > > > > Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> > > --- > > memory.c | 14 +++++++------- > > 1 files changed, 7 insertions(+), 7 deletions(-) > > > > diff --git a/memory.c b/memory.c > > index 6c58373..1ccb99f 100644 > > --- a/memory.c > > +++ b/memory.c > > @@ -1139,13 +1139,13 @@ static uint64_t memory_region_ram_device_read(void *opaque, > > data = *(uint8_t *)(mr->ram_block->host + addr); > > break; > > case 2: > > - data = *(uint16_t *)(mr->ram_block->host + addr); > > + data = le16_to_cpu(*(uint16_t *)(mr->ram_block->host + addr)); > > break; > > case 4: > > - data = *(uint32_t *)(mr->ram_block->host + addr); > > + data = le32_to_cpu(*(uint32_t *)(mr->ram_block->host + addr)); > > break; > > case 8: > > - data = *(uint64_t *)(mr->ram_block->host + addr); > > + data = le64_to_cpu(*(uint64_t *)(mr->ram_block->host + addr)); > > break; > > } > > > > @@ -1166,13 +1166,13 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, > > *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data; > > break; > > case 2: > > - *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data; > > + *(uint16_t *)(mr->ram_block->host + addr) = cpu_to_le16((uint16_t)data); > > break; > > case 4: > > - *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data; > > + *(uint32_t *)(mr->ram_block->host + addr) = cpu_to_le32((uint32_t)data); > > break; > > case 8: > > - *(uint64_t *)(mr->ram_block->host + addr) = data; > > + *(uint64_t *)(mr->ram_block->host + addr) = cpu_to_le64(data); > > break; > > } > > } > > @@ -1180,7 
+1180,7 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, > > static const MemoryRegionOps ram_device_mem_ops = { > > .read = memory_region_ram_device_read, > > .write = memory_region_ram_device_write, > > - .endianness = DEVICE_NATIVE_ENDIAN, > > + .endianness = DEVICE_LITTLE_ENDIAN, > > .valid = { > > .min_access_size = 1, > > .max_access_size = 8, > > > > > I did some debugging today. > > First, Paolo is right and ram_device_mem_ops::endianness should be > host-endian which happens to be little in our test case (ppc64le) so > changes to .read/.write are actually no-op (I believe so, have not checked). > > > But I was wondering why this gets executed at all. > > The test case is: > > qemu-system-ppc64 ... > -device "vfio-pci,id=vfio0001_03_00_0,host=0001:03:00.0 > -drive id=DRIVE0,if=none,file=./test.qcow2,format=qcow2 > -device virtio-blk-pci,id=vblk0,drive=DRIVE0 > > The host kernel is v4.10, ppc64le (little endian), 64K system page size, > QEMU is v2.8.0. > > When this boots, lspci shows: > > 00:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single > Port Adapter > Subsystem: IBM Device 038c > Flags: bus master, fast devsel, latency 0, IRQ 18 > Memory at 210000004000 (64-bit, non-prefetchable) [size=4K] > Memory at 210000800000 (64-bit, non-prefetchable) [size=8M] > Memory at 210001000000 (64-bit, non-prefetchable) [size=4K] > Expansion ROM at 2000c0080000 [disabled] [size=512K] > Capabilities: [40] Power Management version 3 > Capabilities: [48] MSI: Enable- Count=1/32 Maskable- 64bit+ > Capabilities: [58] Express Endpoint, MSI 00 > Capabilities: [94] Vital Product Data > Capabilities: [9c] MSI-X: Enable+ Count=32 Masked- > Kernel driver in use: cxgb3 > > 00:01.0 SCSI storage controller: Red Hat, Inc Virtio block device > Subsystem: Red Hat, Inc Device 0002 > Physical Slot: C16 > Flags: bus master, fast devsel, latency 0, IRQ 17 > I/O ports at 0040 [size=64] > Memory at 2000c0000000 (32-bit, non-prefetchable) [size=4K] 
> Memory at 210000000000 (64-bit, prefetchable) [size=16K] > Capabilities: [98] MSI-X: Enable+ Count=2 Masked- > Capabilities: [84] Vendor Specific Information: Len=14 <?> > Capabilities: [70] Vendor Specific Information: Len=14 <?> > Capabilities: [60] Vendor Specific Information: Len=10 <?> > Capabilities: [50] Vendor Specific Information: Len=10 <?> > Capabilities: [40] Vendor Specific Information: Len=10 <?> > Kernel driver in use: virtio-pci > > > As we can see, BAR0 of Chelsio is 210000004000 - not aligned (it should > have been aligned but it is not - this is another bug, in QEMU). Normally I think SLOF is the culprit in this case. The patch below enforces 64k alignment for mmio/mem regions at boot-time. I'm not sure if this should be done for all devices, or limited specifically to VFIO though (controlled perhaps via a per-device property? or at the machine-level at least?): https://github.com/mdroth/SLOF/commit/6ff6827740aba39d8db9bebcc3ae069bdf934d06 I've only sniff-tested it with virtio so far. Does this fix things for Chelsio? aligned (enabled): Bus 0, device 2, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0040 [0x005f]. BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. BAR4: 64 bit prefetchable memory at 0x210000010000 [0x210000013fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. id "" Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0020 [0x003f]. BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. id "" aligned (disabled): Bus 0, device 2, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0040 [0x005f]. BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. BAR4: 64 bit prefetchable memory at 0x210000004000 [0x210000007fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. 
id "" Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0020 [0x003f]. BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. id "" upstream: Bus 0, device 2, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0040 [0x005f]. BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. BAR4: 64 bit prefetchable memory at 0x210000004000 [0x210000007fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. id "" Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 0. BAR0: I/O at 0x0020 [0x003f]. BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. id "" > such a BAR would be emulated by VFIO. However, since 95251725e335 "vfio: > Add support for mmapping sub-page MMIO BARs" we mmap such BARs to QEMU in a > hope they will be registered later via KVM_SET_USER_MEMORY_REGION - and > this fails as the guest address is not host page size aligned. 
> > So we end up having the following memory tree: > > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, RW): pci@800000020000000.mmio > 00000000c0000000-00000000c0000fff (prio 1, RW): virtio-blk-pci-msix > 00000000c0000000-00000000c000001f (prio 0, RW): msix-table > 00000000c0000800-00000000c0000807 (prio 0, RW): msix-pba > 0000210000000000-0000210000003fff (prio 1, RW): virtio-pci > 0000210000000000-0000210000000fff (prio 0, RW): virtio-pci-common > 0000210000001000-0000210000001fff (prio 0, RW): virtio-pci-isr > 0000210000002000-0000210000002fff (prio 0, RW): virtio-pci-device > 0000210000003000-0000210000003fff (prio 0, RW): virtio-pci-notify > 0000210000004000-0000210000004fff (prio 1, RW): 0001:03:00.0 BAR 0 > 0000210000004000-0000210000004fff (prio 0, RW): 0001:03:00.0 BAR 0 > mmaps[0] > 0000210000800000-0000210000ffffff (prio 1, RW): 0001:03:00.0 BAR 2 > 0000210000800000-0000210000ffffff (prio 0, RW): 0001:03:00.0 BAR 2 > mmaps[0] > 0000210001000000-0000210001000fff (prio 1, RW): 0001:03:00.0 BAR 4 > 0000210001000000-00002100010001ff (prio 0, RW): msix-table > 0000210001000800-0000210001000807 (prio 0, RW): msix-pba [disabled] > > The problem region which this patch is fixing is "0001:03:00.0 BAR 0 > mmaps[0]". It is mmaped to QEMU and not registered in KVM. So when the > guest accesses this BAR, we trap into KVM, KVM passes it to QEMU and QEMU > read/writes this memory region. > > A simple hack like below fixes it - it basically removes mmap'd memory > region from the tree and MMIO starts being handled by the parent MR - > "0001:03:00.0 BAR 0" (which is "little endian" - vfio_region_ops). > > > I am wondering now what would be a correct approach here? > > Add/remove mmap'd MRs once we detect aligned/unaligned BARs? > > Keep things where they are in the VFIO department and just fix > ram_device_mem_ops::endianness? 
VFIO tries to report similar scenarios in realize():

vfio_bar_setup:
    if (vfio_region_mmap(&bar->region)) {
        error_report("Failed to mmap %s BAR %d. Performance may be slow",
                     vdev->vbasedev.name, nr);
    }

maybe we should at least be reporting it in this case as well? In our
case we'll see it every reset/hotplug though, so maybe a trace makes
more sense. Even with the patch for SLOF this would continue to be an
issue for hotplug until BAR assignment is moved to QEMU for pseries,
so would be good to have a simple way to check for it. Can send a patch
if that makes sense.

> 
> Thanks.
> 
> 
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d7dbe0e3e0..0657a27623 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1109,7 +1109,10 @@ static void
> vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
>      memory_region_transaction_begin();
> 
>      memory_region_set_size(mr, size);
> -    memory_region_set_size(mmap_mr, size);
> +    if (bar_addr & qemu_real_host_page_mask)
> +        memory_region_del_subregion(mr, mmap_mr);
> +    else
> +        memory_region_set_size(mmap_mr, size);
>      if (size != region->size && memory_region_is_mapped(mr)) {
>          memory_region_del_subregion(r->address_space, mr);
>          memory_region_add_subregion_overlap(r->address_space,
> 
> 
> 
> -- 
> Alexey
> 
On 27/02/17 13:25, Michael Roth wrote: > Quoting Alexey Kardashevskiy (2017-02-22 22:20:25) >> On 21/02/17 17:46, Yongji Xie wrote: >>> At the moment ram device's memory regions are NATIVE_ENDIAN. This does >>> not work on PPC64 because VFIO PCI device is little endian but PPC64 >>> always defines static macro TARGET_WORDS_BIGENDIAN. >>> >>> This fixes endianness for ram device the same way as it is done >>> for VFIO region in commit 6758008e2c4e79fb6bd04fe8e7a41665fa583965. >>> >>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> >>> --- >>> memory.c | 14 +++++++------- >>> 1 files changed, 7 insertions(+), 7 deletions(-) >>> >>> diff --git a/memory.c b/memory.c >>> index 6c58373..1ccb99f 100644 >>> --- a/memory.c >>> +++ b/memory.c >>> @@ -1139,13 +1139,13 @@ static uint64_t memory_region_ram_device_read(void *opaque, >>> data = *(uint8_t *)(mr->ram_block->host + addr); >>> break; >>> case 2: >>> - data = *(uint16_t *)(mr->ram_block->host + addr); >>> + data = le16_to_cpu(*(uint16_t *)(mr->ram_block->host + addr)); >>> break; >>> case 4: >>> - data = *(uint32_t *)(mr->ram_block->host + addr); >>> + data = le32_to_cpu(*(uint32_t *)(mr->ram_block->host + addr)); >>> break; >>> case 8: >>> - data = *(uint64_t *)(mr->ram_block->host + addr); >>> + data = le64_to_cpu(*(uint64_t *)(mr->ram_block->host + addr)); >>> break; >>> } >>> >>> @@ -1166,13 +1166,13 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, >>> *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data; >>> break; >>> case 2: >>> - *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data; >>> + *(uint16_t *)(mr->ram_block->host + addr) = cpu_to_le16((uint16_t)data); >>> break; >>> case 4: >>> - *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data; >>> + *(uint32_t *)(mr->ram_block->host + addr) = cpu_to_le32((uint32_t)data); >>> break; >>> case 8: >>> - *(uint64_t *)(mr->ram_block->host + addr) = data; >>> + *(uint64_t *)(mr->ram_block->host + addr) = cpu_to_le64(data); 
>>> break; >>> } >>> } >>> @@ -1180,7 +1180,7 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, >>> static const MemoryRegionOps ram_device_mem_ops = { >>> .read = memory_region_ram_device_read, >>> .write = memory_region_ram_device_write, >>> - .endianness = DEVICE_NATIVE_ENDIAN, >>> + .endianness = DEVICE_LITTLE_ENDIAN, >>> .valid = { >>> .min_access_size = 1, >>> .max_access_size = 8, >>> >> >> >> I did some debugging today. >> >> First, Paolo is right and ram_device_mem_ops::endianness should be >> host-endian which happens to be little in our test case (ppc64le) so >> changes to .read/.write are actually no-op (I believe so, have not checked). >> >> >> But I was wondering why this gets executed at all. >> >> The test case is: >> >> qemu-system-ppc64 ... >> -device "vfio-pci,id=vfio0001_03_00_0,host=0001:03:00.0 >> -drive id=DRIVE0,if=none,file=./test.qcow2,format=qcow2 >> -device virtio-blk-pci,id=vblk0,drive=DRIVE0 >> >> The host kernel is v4.10, ppc64le (little endian), 64K system page size, >> QEMU is v2.8.0. 
>> >> When this boots, lspci shows: >> >> 00:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single >> Port Adapter >> Subsystem: IBM Device 038c >> Flags: bus master, fast devsel, latency 0, IRQ 18 >> Memory at 210000004000 (64-bit, non-prefetchable) [size=4K] >> Memory at 210000800000 (64-bit, non-prefetchable) [size=8M] >> Memory at 210001000000 (64-bit, non-prefetchable) [size=4K] >> Expansion ROM at 2000c0080000 [disabled] [size=512K] >> Capabilities: [40] Power Management version 3 >> Capabilities: [48] MSI: Enable- Count=1/32 Maskable- 64bit+ >> Capabilities: [58] Express Endpoint, MSI 00 >> Capabilities: [94] Vital Product Data >> Capabilities: [9c] MSI-X: Enable+ Count=32 Masked- >> Kernel driver in use: cxgb3 >> >> 00:01.0 SCSI storage controller: Red Hat, Inc Virtio block device >> Subsystem: Red Hat, Inc Device 0002 >> Physical Slot: C16 >> Flags: bus master, fast devsel, latency 0, IRQ 17 >> I/O ports at 0040 [size=64] >> Memory at 2000c0000000 (32-bit, non-prefetchable) [size=4K] >> Memory at 210000000000 (64-bit, prefetchable) [size=16K] >> Capabilities: [98] MSI-X: Enable+ Count=2 Masked- >> Capabilities: [84] Vendor Specific Information: Len=14 <?> >> Capabilities: [70] Vendor Specific Information: Len=14 <?> >> Capabilities: [60] Vendor Specific Information: Len=10 <?> >> Capabilities: [50] Vendor Specific Information: Len=10 <?> >> Capabilities: [40] Vendor Specific Information: Len=10 <?> >> Kernel driver in use: virtio-pci >> >> >> As we can see, BAR0 of Chelsio is 210000004000 - not aligned (it should >> have been aligned but it is not - this is another bug, in QEMU). Normally > > I think SLOF is the culprit in this case. The patch below enforces > 64k alignment for mmio/mem regions at boot-time. I'm not sure if this > should be done for all devices, or limited specifically to VFIO though > (controlled perhaps via a per-device property? 
or at the machine-level > at least?): I was sure we have this in SLOF and now I see that we don't. Hm :) > https://github.com/mdroth/SLOF/commit/6ff6827740aba39d8db9bebcc3ae069bdf934d06 > > I've only sniff-tested it with virtio so far. Does this fix things > for Chelsio? > > aligned (enabled): > Bus 0, device 2, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0040 [0x005f]. > BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. > BAR4: 64 bit prefetchable memory at 0x210000010000 [0x210000013fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > Bus 0, device 3, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0020 [0x003f]. > BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. > BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > > aligned (disabled): > Bus 0, device 2, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0040 [0x005f]. > BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. > BAR4: 64 bit prefetchable memory at 0x210000004000 [0x210000007fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > Bus 0, device 3, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0020 [0x003f]. > BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. > BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > > upstream: > Bus 0, device 2, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0040 [0x005f]. > BAR1: 32 bit memory at 0xc0080000 [0xc0080fff]. > BAR4: 64 bit prefetchable memory at 0x210000004000 [0x210000007fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > Bus 0, device 3, function 0: > Ethernet controller: PCI device 1af4:1000 > IRQ 0. > BAR0: I/O at 0x0020 [0x003f]. 
> BAR1: 32 bit memory at 0xc0000000 [0xc0000fff]. > BAR4: 64 bit prefetchable memory at 0x210000000000 [0x210000003fff]. > BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe]. > id "" > > >> such a BAR would be emulated by VFIO. However, since 95251725e335 "vfio: >> Add support for mmapping sub-page MMIO BARs" we mmap such BARs to QEMU in a >> hope they will be registered later via KVM_SET_USER_MEMORY_REGION - and >> this fails as the guest address is not host page size aligned. >> >> So we end up having the following memory tree: >> >> memory-region: pci@800000020000000.mmio >> 0000000000000000-ffffffffffffffff (prio 0, RW): pci@800000020000000.mmio >> 00000000c0000000-00000000c0000fff (prio 1, RW): virtio-blk-pci-msix >> 00000000c0000000-00000000c000001f (prio 0, RW): msix-table >> 00000000c0000800-00000000c0000807 (prio 0, RW): msix-pba >> 0000210000000000-0000210000003fff (prio 1, RW): virtio-pci >> 0000210000000000-0000210000000fff (prio 0, RW): virtio-pci-common >> 0000210000001000-0000210000001fff (prio 0, RW): virtio-pci-isr >> 0000210000002000-0000210000002fff (prio 0, RW): virtio-pci-device >> 0000210000003000-0000210000003fff (prio 0, RW): virtio-pci-notify >> 0000210000004000-0000210000004fff (prio 1, RW): 0001:03:00.0 BAR 0 >> 0000210000004000-0000210000004fff (prio 0, RW): 0001:03:00.0 BAR 0 >> mmaps[0] >> 0000210000800000-0000210000ffffff (prio 1, RW): 0001:03:00.0 BAR 2 >> 0000210000800000-0000210000ffffff (prio 0, RW): 0001:03:00.0 BAR 2 >> mmaps[0] >> 0000210001000000-0000210001000fff (prio 1, RW): 0001:03:00.0 BAR 4 >> 0000210001000000-00002100010001ff (prio 0, RW): msix-table >> 0000210001000800-0000210001000807 (prio 0, RW): msix-pba [disabled] >> >> The problem region which this patch is fixing is "0001:03:00.0 BAR 0 >> mmaps[0]". It is mmaped to QEMU and not registered in KVM. So when the >> guest accesses this BAR, we trap into KVM, KVM passes it to QEMU and QEMU >> read/writes this memory region. 
>> >> A simple hack like below fixes it - it basically removes mmap'd memory >> region from the tree and MMIO starts being handled by the parent MR - >> "0001:03:00.0 BAR 0" (which is "little endian" - vfio_region_ops). >> >> >> I am wondering now what would be a correct approach here? >> >> Add/remove mmap'd MRs once we detect aligned/unaligned BARs? >> >> Keep things where they are in the VFIO department and just fix >> ram_device_mem_ops::endianness? > > VFIO tries to report similar scenarios in realize(): > > vfio_bar_setup: > if (vfio_region_mmap(&bar->region)) { > error_report("Failed to mmap %s BAR %d. Performance may be slow", > vdev->vbasedev.name, nr); > } > > maybe we should at least be reporting it in this case as well? In our > case we'll see it every reset/hotplug though, so maybe a trace makes > more sense. Tracepoint should be enough imho. > Even with the patch for SLOF this would continue to be an > issue for hotplug until BAR assignment is moved to QEMU for pseries, > so would be good to have a simple way to check for it. Can send a patch > if that makes sense. Can BAR allocation be done by QEMU when hotplugging? > >> >> Thanks. >> >> >> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >> index d7dbe0e3e0..0657a27623 100644 >> --- a/hw/vfio/pci.c >> +++ b/hw/vfio/pci.c >> @@ -1109,7 +1109,10 @@ static void >> vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar) >> memory_region_transaction_begin(); >> >> memory_region_set_size(mr, size); >> - memory_region_set_size(mmap_mr, size); >> + if (bar_addr & qemu_real_host_page_mask) >> + memory_region_del_subregion(mr, mmap_mr); >> + else >> + memory_region_set_size(mmap_mr, size); >> if (size != region->size && memory_region_is_mapped(mr)) { >> memory_region_del_subregion(r->address_space, mr); >> memory_region_add_subregion_overlap(r->address_space, >> >> >> >> -- >> Alexey >> >
on 2017/2/27 11:25, Alexey Kardashevskiy wrote: > On 27/02/17 13:25, Michael Roth wrote: >> Quoting Alexey Kardashevskiy (2017-02-22 22:20:25) >>> On 21/02/17 17:46, Yongji Xie wrote: >>>> At the moment ram device's memory regions are NATIVE_ENDIAN. This does >>>> not work on PPC64 because VFIO PCI device is little endian but PPC64 >>>> always defines static macro TARGET_WORDS_BIGENDIAN. >>>> >>>> This fixes endianness for ram device the same way as it is done >>>> for VFIO region in commit 6758008e2c4e79fb6bd04fe8e7a41665fa583965. >>>> >>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> >>>> --- >>>> memory.c | 14 +++++++------- >>>> 1 files changed, 7 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/memory.c b/memory.c >>>> index 6c58373..1ccb99f 100644 >>>> --- a/memory.c >>>> +++ b/memory.c >>>> @@ -1139,13 +1139,13 @@ static uint64_t memory_region_ram_device_read(void *opaque, >>>> data = *(uint8_t *)(mr->ram_block->host + addr); >>>> break; >>>> case 2: >>>> - data = *(uint16_t *)(mr->ram_block->host + addr); >>>> + data = le16_to_cpu(*(uint16_t *)(mr->ram_block->host + addr)); >>>> break; >>>> case 4: >>>> - data = *(uint32_t *)(mr->ram_block->host + addr); >>>> + data = le32_to_cpu(*(uint32_t *)(mr->ram_block->host + addr)); >>>> break; >>>> case 8: >>>> - data = *(uint64_t *)(mr->ram_block->host + addr); >>>> + data = le64_to_cpu(*(uint64_t *)(mr->ram_block->host + addr)); >>>> break; >>>> } >>>> >>>> @@ -1166,13 +1166,13 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, >>>> *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data; >>>> break; >>>> case 2: >>>> - *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data; >>>> + *(uint16_t *)(mr->ram_block->host + addr) = cpu_to_le16((uint16_t)data); >>>> break; >>>> case 4: >>>> - *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data; >>>> + *(uint32_t *)(mr->ram_block->host + addr) = cpu_to_le32((uint32_t)data); >>>> break; >>>> case 8: >>>> - *(uint64_t 
*)(mr->ram_block->host + addr) = data; >>>> + *(uint64_t *)(mr->ram_block->host + addr) = cpu_to_le64(data); >>>> break; >>>> } >>>> } >>>> @@ -1180,7 +1180,7 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr, >>>> static const MemoryRegionOps ram_device_mem_ops = { >>>> .read = memory_region_ram_device_read, >>>> .write = memory_region_ram_device_write, >>>> - .endianness = DEVICE_NATIVE_ENDIAN, >>>> + .endianness = DEVICE_LITTLE_ENDIAN, >>>> .valid = { >>>> .min_access_size = 1, >>>> .max_access_size = 8, >>>> >>> >>> >>> I did some debugging today. >>> >>> First, Paolo is right and ram_device_mem_ops::endianness should be >>> host-endian which happens to be little in our test case (ppc64le) so >>> changes to .read/.write are actually no-op (I believe so, have not checked). >>> >>> >>> But I was wondering why this gets executed at all. >>> >>> The test case is: >>> >>> qemu-system-ppc64 ... >>> -device "vfio-pci,id=vfio0001_03_00_0,host=0001:03:00.0 >>> -drive id=DRIVE0,if=none,file=./test.qcow2,format=qcow2 >>> -device virtio-blk-pci,id=vblk0,drive=DRIVE0 >>> >>> The host kernel is v4.10, ppc64le (little endian), 64K system page size, >>> QEMU is v2.8.0. 
>>> >>> When this boots, lspci shows: >>> >>> 00:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single >>> Port Adapter >>> Subsystem: IBM Device 038c >>> Flags: bus master, fast devsel, latency 0, IRQ 18 >>> Memory at 210000004000 (64-bit, non-prefetchable) [size=4K] >>> Memory at 210000800000 (64-bit, non-prefetchable) [size=8M] >>> Memory at 210001000000 (64-bit, non-prefetchable) [size=4K] >>> Expansion ROM at 2000c0080000 [disabled] [size=512K] >>> Capabilities: [40] Power Management version 3 >>> Capabilities: [48] MSI: Enable- Count=1/32 Maskable- 64bit+ >>> Capabilities: [58] Express Endpoint, MSI 00 >>> Capabilities: [94] Vital Product Data >>> Capabilities: [9c] MSI-X: Enable+ Count=32 Masked- >>> Kernel driver in use: cxgb3 >>> >>> 00:01.0 SCSI storage controller: Red Hat, Inc Virtio block device >>> Subsystem: Red Hat, Inc Device 0002 >>> Physical Slot: C16 >>> Flags: bus master, fast devsel, latency 0, IRQ 17 >>> I/O ports at 0040 [size=64] >>> Memory at 2000c0000000 (32-bit, non-prefetchable) [size=4K] >>> Memory at 210000000000 (64-bit, prefetchable) [size=16K] >>> Capabilities: [98] MSI-X: Enable+ Count=2 Masked- >>> Capabilities: [84] Vendor Specific Information: Len=14 <?> >>> Capabilities: [70] Vendor Specific Information: Len=14 <?> >>> Capabilities: [60] Vendor Specific Information: Len=10 <?> >>> Capabilities: [50] Vendor Specific Information: Len=10 <?> >>> Capabilities: [40] Vendor Specific Information: Len=10 <?> >>> Kernel driver in use: virtio-pci >>> >>> >>> As we can see, BAR0 of Chelsio is 210000004000 - not aligned (it should >>> have been aligned but it is not - this is another bug, in QEMU). Normally >> I think SLOF is the culprit in this case. The patch below enforces >> 64k alignment for mmio/mem regions at boot-time. I'm not sure if this >> should be done for all devices, or limited specifically to VFIO though >> (controlled perhaps via a per-device property? 
>> or at the machine-level
>> at least?):
> 
> I was sure we have this in SLOF and now I see that we don't. Hm :)
> 

I did a quick check. It seems the SLOF patch was only in the pkvm3.1.1
SLOF tree...

Thanks,
Yongji
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7dbe0e3e0..0657a27623 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1109,7 +1109,10 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
     memory_region_transaction_begin();
 
     memory_region_set_size(mr, size);
-    memory_region_set_size(mmap_mr, size);
+    if (bar_addr & qemu_real_host_page_mask)
+        memory_region_del_subregion(mr, mmap_mr);
+    else
+        memory_region_set_size(mmap_mr, size);
     if (size != region->size && memory_region_is_mapped(mr)) {
         memory_region_del_subregion(r->address_space, mr);