Message ID | 20131130185639.GA13039@pegasus.dumpdata.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On Sat, 2013-11-30 at 13:56 -0500, Konrad Rzeszutek Wilk wrote:
> My theory is that the SWIOTLB is not full - it is just that the request
> is for a compound page that is more than 512kB. Please note that
> SWIOTLB highest "chunk" of buffer it can deal with is 512kB.
>
> And then of course the question comes up - why would it try to
> bounce buffer it. In Xen the answer is simple - the sg chunks cross page
> boundaries which means that they are not physically contiguous - so we
> have to use the bounce buffer. It would be better if the sg list
> provided a large list of 4KB pages instead of compound pages as that
> could help in avoiding the bounce buffer.
>
> But I digress - this is a theory - I don't know whether the SCSI layer
> does any coalescing of the sg list - and if so, whether there is any
> easy knob to tell it to not do it.

Well, SCSI doesn't, but block does. It's actually an efficiency thing
since most firmware descriptor formats cope with multiple pages and the
more descriptors you have for a transaction, the more work the on-board
processor on the HBA has to do. If you have an emulated HBA, like
virtio, you could turn off physical coalescing by setting the
use_clustering flag to DISABLE_CLUSTERING. But you can't do that for a
real card. I assume the problem here is that the host is passing the
card directly to the guest and the guest clusters based on its idea of
guest pages which don't map to contiguous physical pages?

The way you tell how many physically contiguous pages block is willing
to merge is by looking at /sys/block/<dev>/queue/max_segment_size: if
that's 4k then it won't merge; if it's greater than 4k, then it will.

I'm not quite sure what to do ... you can't turn off clustering globally
in the guest because the virtio drivers use it to reduce ring descriptor
pressure. What you probably want is some way to flag a pass through
device.

James

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
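The max_segment_size check described in the message above can be scripted. What follows is a minimal user-space sketch in C, not kernel code. The helper names (`parse_max_segment_size`, `block_may_merge`, `read_max_segment_size`) and the fixed 4096-byte page size are assumptions made for illustration; only the sysfs attribute and the "greater than 4k means merging" rule of thumb come from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed page size for the comparison; the thread talks about 4k pages. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Parse the text of a max_segment_size attribute (a decimal byte count,
 * optionally followed by a newline).
 * Returns 0 and stores the value in *out on success, -1 on malformed input.
 */
int parse_max_segment_size(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(text != NULL && out != NULL);
	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	if (*end != '\0' && *end != '\n')
		return -1;
	*out = val;
	return 0;
}

/* Rule of thumb from the thread: a limit above one page allows merging. */
int block_may_merge(unsigned long max_segment_size)
{
	return max_segment_size > SKETCH_PAGE_SIZE;
}

/* Read and parse the attribute file at 'path'. Returns 0 on success, -1 on error. */
int read_max_segment_size(const char *path, unsigned long *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret;

	if (f == NULL)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_max_segment_size(buf, out) : -1;
	fclose(f);
	return ret;
}
```

A caller would pass a path such as /sys/block/sda/queue/max_segment_size (where "sda" is only an example device name) to `read_max_segment_size` and then feed the result to `block_may_merge`.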
On Sat, 30 Nov 2013, James Bottomley wrote:
> On Sat, 2013-11-30 at 13:56 -0500, Konrad Rzeszutek Wilk wrote:
> > My theory is that the SWIOTLB is not full - it is just that the request
> > is for a compound page that is more than 512kB. Please note that
> > SWIOTLB highest "chunk" of buffer it can deal with is 512kB.
> >
> > And then of course the question comes up - why would it try to
> > bounce buffer it. In Xen the answer is simple - the sg chunks cross page
> > boundaries which means that they are not physically contiguous - so we
> > have to use the bounce buffer. It would be better if the sg list
> > provided a large list of 4KB pages instead of compound pages as that
> > could help in avoiding the bounce buffer.
> >
> > But I digress - this is a theory - I don't know whether the SCSI layer
> > does any coalescing of the sg list - and if so, whether there is any
> > easy knob to tell it to not do it.
>
> Well, SCSI doesn't, but block does. It's actually an efficiency thing
> since most firmware descriptor formats cope with multiple pages and the
> more descriptors you have for a transaction, the more work the on-board
> processor on the HBA has to do. If you have an emulated HBA, like
> virtio, you could turn off physical coalescing by setting the
> use_clustering flag to DISABLE_CLUSTERING. But you can't do that for a
> real card. I assume the problem here is that the host is passing the
> card directly to the guest and the guest clusters based on its idea of
> guest pages which don't map to contiguous physical pages?
>
> The way you tell how many physically contiguous pages block is willing
> to merge is by looking at /sys/block/<dev>/queue/max_segment_size: if
> that's 4k then it won't merge; if it's greater than 4k, then it will.
>
> I'm not quite sure what to do ... you can't turn off clustering globally
> in the guest because the virtio drivers use it to reduce ring descriptor
> pressure. What you probably want is some way to flag a pass through
> device.

Given that we don't use virtio on Xen, we could actually turn off
clustering globally (if we are running on Xen).
In fact, BIOVEC_PHYS_MERGEABLE, for example, is defined as:

+#define BIOVEC_PHYS_MERGEABLE(vec1, vec2)				\
+	(__BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&				\
+	 (!xen_domain() || xen_biovec_phys_mergeable(vec1, vec2)))

so that we can disable merging if the two bv_page are not actually
physically contiguous.
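To make the macro's logic concrete, here is a user-space model in C. It is only a sketch: the toy `p2m` table, `struct vec`, and every `_sketch` name are hypothetical stand-ins, and the body of `xen_mergeable_sketch` is an assumption, since the real `xen_biovec_phys_mergeable` is not shown in this thread. Only the overall shape (guest-physical adjacency AND, when running on Xen, an extra machine-frame check) comes from the macro quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for a bio_vec: guest pseudo-physical address plus length. */
struct vec {
	unsigned long phys;
	unsigned long len;
};

/*
 * Toy pfn -> mfn table. Guest frames 0-1 and 2-3 are machine-contiguous
 * pairs, but the machine address jumps between guest frames 1 and 2.
 */
static const unsigned long p2m[] = { 100, 101, 500, 501 };
#define P2M_ENTRIES (sizeof(p2m) / sizeof(p2m[0]))

static unsigned long pfn_to_mfn_sketch(unsigned long pfn)
{
	assert(pfn < P2M_ENTRIES);
	return p2m[pfn];
}

/* Generic test: v2 starts exactly where v1 ends in guest-physical space. */
bool phys_mergeable(const struct vec *v1, const struct vec *v2)
{
	return v1->phys + v1->len == v2->phys;
}

/* Assumed Xen test: the backing machine frames are the same or consecutive. */
bool xen_mergeable_sketch(const struct vec *v1, const struct vec *v2)
{
	unsigned long mfn1 = pfn_to_mfn_sketch(v1->phys >> SKETCH_PAGE_SHIFT);
	unsigned long mfn2 = pfn_to_mfn_sketch(v2->phys >> SKETCH_PAGE_SHIFT);

	return mfn1 == mfn2 || mfn1 + 1 == mfn2;
}

/* Mirrors the shape of BIOVEC_PHYS_MERGEABLE from the message above. */
bool biovec_mergeable_sketch(bool on_xen, const struct vec *v1,
			     const struct vec *v2)
{
	return phys_mergeable(v1, v2) &&
	       (!on_xen || xen_mergeable_sketch(v1, v2));
}
```

With this toy table, two vectors covering guest frames 1 and 2 merge when `on_xen` is false but not when it is true, which is the behaviour the macro is meant to provide.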
On Sat, Nov 30, 2013 at 03:48:44PM -0500, James Bottomley wrote:
> On Sat, 2013-11-30 at 13:56 -0500, Konrad Rzeszutek Wilk wrote:
> > My theory is that the SWIOTLB is not full - it is just that the request
> > is for a compound page that is more than 512kB. Please note that
> > SWIOTLB highest "chunk" of buffer it can deal with is 512kB.
> >
> > And then of course the question comes up - why would it try to
> > bounce buffer it. In Xen the answer is simple - the sg chunks cross page
> > boundaries which means that they are not physically contiguous - so we
> > have to use the bounce buffer. It would be better if the sg list
> > provided a large list of 4KB pages instead of compound pages as that
> > could help in avoiding the bounce buffer.
> >
> > But I digress - this is a theory - I don't know whether the SCSI layer
> > does any coalescing of the sg list - and if so, whether there is any
> > easy knob to tell it to not do it.
>
> Well, SCSI doesn't, but block does. It's actually an efficiency thing
> since most firmware descriptor formats cope with multiple pages and the
> more descriptors you have for a transaction, the more work the on-board
> processor on the HBA has to do. If you have an emulated HBA, like
> virtio, you could turn off physical coalescing by setting the
> use_clustering flag to DISABLE_CLUSTERING. But you can't do that for a
> real card. I assume the problem here is that the host is passing the
> card directly to the guest and the guest clusters based on its idea of
> guest pages which don't map to contiguous physical pages?

Kind of. Except that in this case the guest does know that it can't map
them contiguously - and resorts to using the bounce buffer so that it
can provide a nice chunk of contiguous area. This is detected by
the SWIOTLB layer and also the block layer to discourage coalescing
there.

But since SCSI is all about sg list I think it gets tangled up here:

    for_each_sg(sgl, sg, nelems, i) {
        phys_addr_t paddr = sg_phys(sg);
        dma_addr_t dev_addr = xen_phys_to_bus(paddr);

        if (swiotlb_force ||
            !dma_capable(hwdev, dev_addr, sg->length) ||
            range_straddles_page_boundary(paddr, sg->length)) {
            phys_addr_t map = swiotlb_tbl_map_single(hwdev,
                                                     start_dma_addr,
                                                     sg_phys(sg),
                                                     sg->length,
                                                     dir);

So it is either not capable of reaching that physical address (so DMA
mask, but I doubt it - this is LSI which can do 64bit). Or the pages
straddle. They can straddle it by, well, being offset at odd locations, or
by being compound pages.

But why would they in the first place - and so many of them - considering
the flow of those printks Ian is seeing.

James,
The SCSI layer wouldn't do any funny business here, right - no reordering
of bios? That is all left to the block layer, right?

>
> The way you tell how many physically contiguous pages block is willing
> to merge is by looking at /sys/block/<dev>/queue/max_segment_size: if
> that's 4k then it won't merge; if it's greater than 4k, then it will.

Ah, good idea. Ian, anything there?
>
> I'm not quite sure what to do ... you can't turn off clustering globally
> in the guest because the virtio drivers use it to reduce ring descriptor
> pressure. What you probably want is some way to flag a pass through
> device.
>
> James
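The third condition in that `if` is the one under suspicion in this message. Below is a rough user-space model of such a check, assuming it walks the pages a buffer covers and tests whether the machine frames behind them are consecutive. The toy `p2m` table and the `range_straddles_sketch` name are hypothetical, and the real helper in the Xen SWIOTLB code may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Toy pfn -> mfn table: machine frames jump between guest frames 1 and 2. */
static const unsigned long p2m[] = { 100, 101, 500, 501 };
#define P2M_ENTRIES (sizeof(p2m) / sizeof(p2m[0]))

/*
 * Returns true when [paddr, paddr + size) crosses a page boundary AND the
 * machine frames behind it are not consecutive, i.e. the case in which the
 * buffer cannot be handed to the device as one DMA range and would have to
 * go through the bounce buffer.
 */
bool range_straddles_sketch(unsigned long paddr, size_t size)
{
	unsigned long pfn = paddr >> SKETCH_PAGE_SHIFT;
	unsigned long offset = paddr & (SKETCH_PAGE_SIZE - 1);
	unsigned long nr_pages, i;

	assert(size > 0);
	if (offset + size <= SKETCH_PAGE_SIZE)
		return false;	/* fits inside a single page */

	nr_pages = (offset + size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	assert(pfn + nr_pages <= P2M_ENTRIES);

	for (i = 1; i < nr_pages; i++) {
		if (p2m[pfn + i] != p2m[pfn] + i)
			return true;	/* hole in machine address space */
	}
	return false;	/* multi-page but machine-contiguous */
}
```

With the toy table, an 8 KiB buffer starting at guest frame 0 is machine-contiguous and passes, while the same buffer starting at guest frame 1 straddles the jump and would be bounced. A merged multi-page segment or a compound page is exactly the kind of buffer that can reach the loop.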
On Tue, 2013-12-03 at 12:33 -0500, Konrad Rzeszutek Wilk wrote:
> On Sat, Nov 30, 2013 at 03:48:44PM -0500, James Bottomley wrote:
> > On Sat, 2013-11-30 at 13:56 -0500, Konrad Rzeszutek Wilk wrote:
> > > My theory is that the SWIOTLB is not full - it is just that the request
> > > is for a compound page that is more than 512kB. Please note that
> > > SWIOTLB highest "chunk" of buffer it can deal with is 512kB.
> > >
> > > And then of course the question comes up - why would it try to
> > > bounce buffer it. In Xen the answer is simple - the sg chunks cross page
> > > boundaries which means that they are not physically contiguous - so we
> > > have to use the bounce buffer. It would be better if the sg list
> > > provided a large list of 4KB pages instead of compound pages as that
> > > could help in avoiding the bounce buffer.
> > >
> > > But I digress - this is a theory - I don't know whether the SCSI layer
> > > does any coalescing of the sg list - and if so, whether there is any
> > > easy knob to tell it to not do it.
> >
> > Well, SCSI doesn't, but block does. It's actually an efficiency thing
> > since most firmware descriptor formats cope with multiple pages and the
> > more descriptors you have for a transaction, the more work the on-board
> > processor on the HBA has to do. If you have an emulated HBA, like
> > virtio, you could turn off physical coalescing by setting the
> > use_clustering flag to DISABLE_CLUSTERING. But you can't do that for a
> > real card. I assume the problem here is that the host is passing the
> > card directly to the guest and the guest clusters based on its idea of
> > guest pages which don't map to contiguous physical pages?
>
> Kind of. Except that in this case the guest does know that it can't map
> them contiguously - and resorts to using the bounce buffer so that it
> can provide a nice chunk of contiguous area. This is detected by
> the SWIOTLB layer and also the block layer to discourage coalescing
> there.
>
> But since SCSI is all about sg list I think it gets tangled up here:
>
>     for_each_sg(sgl, sg, nelems, i) {
>         phys_addr_t paddr = sg_phys(sg);
>         dma_addr_t dev_addr = xen_phys_to_bus(paddr);
>
>         if (swiotlb_force ||
>             !dma_capable(hwdev, dev_addr, sg->length) ||
>             range_straddles_page_boundary(paddr, sg->length)) {
>             phys_addr_t map = swiotlb_tbl_map_single(hwdev,
>                                                      start_dma_addr,
>                                                      sg_phys(sg),
>                                                      sg->length,
>                                                      dir);
>
> So it is either not capable of reaching that physical address (so DMA
> mask, but I doubt it - this is LSI which can do 64bit).

Right, so no bouncing.

> Or the pages
> straddle. They can straddle it by, well, being offset at odd locations, or
> by being compound pages.

All modern filesystems have 4k+ block sizes, so no offsets at all. For
DIO you can get offsets at the beginning and end of the transfer, but
they will be offsets within the page, so the problem can only be
clustering (physical merging).

> But why would they in the first place - and so many of them - considering
> the flow of those printks Ian is seeing.

Probably because compaction and our allocators are designed to give out
physically contiguous pages, which work their way back into the block
layer in order. On a lot of I/O workloads, we see 30%+ physical
merging.

> James,
> The SCSI layer wouldn't do any funny business here, right - no reordering
> of bios? That is all left to the block layer, right?

We don't see bios ... they're top of block. SCSI sees requests, but the
block layer does all our request and sg list manipulation for us.

James
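The physical merging discussed in this reply can be illustrated with a small model: pages arriving in order are folded into one segment whenever the next page is physically adjacent and the segment stays within the size limit. This is a sketch under stated assumptions, not the block layer's actual algorithm; `count_segments_sketch`, its parameters, and the fixed 4096-byte page size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/*
 * Count the scatter-gather segments needed for 'n' whole pages, given their
 * page frame numbers in submission order. Adjacent frames are merged into
 * one segment as long as it does not exceed max_segment_size bytes.
 */
size_t count_segments_sketch(const unsigned long *pfns, size_t n,
			     unsigned long max_segment_size)
{
	size_t segments = 0;
	unsigned long seg_len = 0;
	size_t i;

	assert(max_segment_size >= SKETCH_PAGE_SIZE);

	for (i = 0; i < n; i++) {
		int adjacent = i > 0 && pfns[i] == pfns[i - 1] + 1;

		if (adjacent && seg_len + SKETCH_PAGE_SIZE <= max_segment_size) {
			seg_len += SKETCH_PAGE_SIZE;	/* extend current segment */
		} else {
			segments++;			/* start a new segment */
			seg_len = SKETCH_PAGE_SIZE;
		}
	}
	return segments;
}
```

With a limit of one page every page becomes its own segment, which corresponds to the "4k means it won't merge" case; with a larger limit, runs of adjacent frames collapse into multi-page segments, and those are the buffers that can cross page boundaries.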
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index e4399fa..d4c95d0 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -505,7 +505,12 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 
 not_found:
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
-	dev_warn(hwdev, "swiotlb buffer is full\n");
+	if (printk_ratelimit()) {
+		dev_warn(hwdev, "swiotlb buffer is full for %lx (%d bytes) %s\n", orig_addr, size,
+			 dir == DMA_BIDIRECTIONAL ? "BIDIRECTIONAL" :
+			 (dir == DMA_TO_DEVICE ? "TO_DEVICE" : "FROM_DEVICE" ));
+		dump_stack();
+	}
 	return SWIOTLB_MAP_ERROR;
 found:
 	spin_unlock_irqrestore(&io_tlb_lock, flags);