From patchwork Thu May 5 12:39:49 2022
X-Patchwork-Submitter: Tim Gardner
X-Patchwork-Id: 1626945
From: Tim Gardner
To: kernel-team@lists.ubuntu.com
Subject: [PATCH 4/4] UBUNTU: SAUCE: swiotlb: Split up single swiotlb lock
Date: Thu, 5 May 2022 06:39:49 -0600
Message-Id: <20220505123949.12030-5-tim.gardner@canonical.com>
X-Mailer: git-send-email 2.36.0
In-Reply-To: <20220505123949.12030-1-tim.gardner@canonical.com>
References: <20220505123949.12030-1-tim.gardner@canonical.com>

From: Andi Kleen

BugLink: https://bugs.launchpad.net/bugs/1971701

Traditionally, swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX confidential guests,
all IO has to go through swiotlb.

Currently swiotlb has only a single lock. Under high IO load with
multiple CPUs this can lead to significant contention on the swiotlb
lock. We have seen 20+% of CPU time spent in locks in some extreme
cases.

This patch splits the swiotlb into individual areas, each with its own
lock. Each CPU first tries to allocate in its own area. Only if that
fails does it search the other areas. On freeing, the allocation is
returned to the area from which the memory was originally allocated.

To avoid a full modulo operation in the main path, the number of swiotlb
areas is always rounded up to the next power of two. That is probably no
longer needed on modern CPUs (which have fast enough dividers), but it
is still a good idea on older parts.

The number of areas can be set using the swiotlb kernel option. To avoid
requiring every user to set this option, the default is the number of
available CPUs.
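The power-of-two bookkeeping described above can be sketched as a small standalone model. This is illustrative code written for this note, not the patch itself; the helpers are hypothetical stand-ins for the kernel's roundup_pow_of_two() and __fls(), but the shift/mask arithmetic mirrors what the patch computes (area_index_shift, area_bits).

```c
#include <assert.h>

/* Round n up to the next power of two, e.g. 6 -> 8 (stand-in for
 * the kernel's roundup_pow_of_two()). */
static unsigned long round_up_pow2(unsigned long n)
{
        unsigned long p = 1;

        while (p < n)
                p <<= 1;
        return p;
}

/* log2 of a power of two (what __fls() returns for such values). */
static unsigned int ilog2_pow2(unsigned long n)
{
        unsigned int b = 0;

        while (n >>= 1)
                b++;
        return b;
}

/*
 * area_index_shift = log2(nslabs) - log2(num_areas): each area covers
 * a contiguous power-of-two range of slots, so finding a slot's area
 * is a shift, and a CPU's home area is a mask -- no modulo needed in
 * the hot path.
 */
static unsigned int area_of_slot(unsigned long slot,
                                 unsigned int area_index_shift)
{
        return slot >> area_index_shift;
}

static unsigned int home_area_of_cpu(unsigned int cpu,
                                     unsigned int area_bits)
{
        return cpu & ((1U << area_bits) - 1);
}
```

With 1024 slabs split into 4 areas, the shift is 10 - 2 = 8, so slot 600 belongs to area 600 >> 8 = 2, and CPU 6 starts its search in area 6 & 3 = 2.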
Unfortunately, on x86 swiotlb is initialized before num_possible_cpus()
is available; that is why it uses a custom hook called from the early
ACPI code.

Signed-off-by: Andi Kleen
(cherry picked from commit 4529b5784c141782c72ec9bd9a92df2b68cb7d45
https://github.com/intel/tdx.git)
Signed-off-by: Tim Gardner
---
 .../admin-guide/kernel-parameters.txt |   4 +-
 arch/x86/kernel/acpi/boot.c           |   4 +
 include/linux/swiotlb.h               |  27 ++-
 kernel/dma/swiotlb.c                  | 177 +++++++++++++++---
 4 files changed, 183 insertions(+), 29 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 369d7f3f63cb..9ef92b9049c2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5578,8 +5578,10 @@
 			it if 0 is given
 			(See Documentation/admin-guide/cgroup-v1/memory.rst)

 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
-			Format: { <int> | force | noforce }
+			Format: { <int> [,<int>] | force | noforce }
 			<int> -- Number of I/O TLB slabs
+			<int> -- Second integer after comma. Number of swiotlb
+				 areas with their own lock. Must be power of 2.
 			force -- force using of bounce buffers even if they
 				 wouldn't be automatically used by the kernel
 			noforce -- Never use bounce buffers (for debugging)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 14bcd59bcdee..346d4f8ab4c1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/swiotlb.h>
 #include
 #include
@@ -1063,6 +1064,9 @@ static int __init acpi_parse_madt_lapic_entries(void)
 		return count;
 	}

+	/* This does not take overrides into consideration */
+	swiotlb_hint_cpus(max(count, x2count));
+
 	x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
 					acpi_parse_x2apic_nmi, 0);
 	count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index aabca568bd15..9be6aba5a2a4 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -38,6 +38,7 @@ enum swiotlb_force {

 extern void swiotlb_init(int verbose);
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
+void swiotlb_hint_cpus(int cpus);
 unsigned long swiotlb_size_or_default(void);
 void swiotlb_set_alloc_from_low_pages(bool low);
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
@@ -71,7 +72,25 @@ struct io_tlb_slot {
 };

 /**
- * struct io_tlb_mem - IO TLB Memory Pool Descriptor
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used: The number of used IO TLB blocks.
+ * @free_slots: The free list describing the number of free entries
+ *	available from each index.
+ * @lock: The lock to protect the above data structures in the map and
+ *	unmap calls.
+ */
+
+struct io_tlb_area {
+	unsigned long used;
+	struct list_head free_slots;
+	spinlock_t lock;
+};
+
+/**
+ * struct io_tlb_mem - IO TLB memory pool descriptor
  *
  * @start: The start address of the swiotlb memory pool. Used to do a quick
  *	range check to see if the memory was in fact allocated by this
@@ -89,8 +108,6 @@ struct io_tlb_slot {
  * @index: The index to start searching in the next round.
  * @orig_addr: The original address corresponding to a mapped entry.
  * @alloc_size: Size of the allocated buffer.
- * @lock: The lock to protect the above data structures in the map and
- *	unmap calls.
  * @debugfs: The dentry to debugfs.
  * @late_alloc: %true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
@@ -103,13 +120,11 @@ struct io_tlb_mem {
 	phys_addr_t end;
 	void *vaddr;
 	unsigned long nslabs;
-	unsigned long used;
-	struct list_head free_slots;
-	spinlock_t lock;
 	struct dentry *debugfs;
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	struct io_tlb_area *areas;
 	struct io_tlb_slot *slots;
 	unsigned long *bitmap;
 };
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 180e02f5fd26..4b7e1ef59899 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -69,6 +69,8 @@

 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)

+#define NUM_AREAS_DEFAULT 1
+
 enum swiotlb_force swiotlb_force;

 struct io_tlb_mem io_tlb_default_mem;
@@ -85,14 +87,60 @@
 static unsigned int max_segment;

 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;

+static __read_mostly unsigned int area_index_shift;
+static __read_mostly unsigned int area_bits;
+static __read_mostly int num_areas = NUM_AREAS_DEFAULT;
+
+static __init int setup_areas(int n)
+{
+	unsigned long nslabs;
+
+	if (n < 1 || !is_power_of_2(n)) {
+		pr_err("swiotlb: Invalid areas parameter %d\n", n);
+		return -EINVAL;
+	}
+
+	/*
+	 * Round up number of slabs to the next power of 2.
+	 * The last area is going to be smaller than the rest if
+	 * default_nslabs is not power of two.
+	 */
+	nslabs = roundup_pow_of_two(default_nslabs);
+
+	pr_info("swiotlb: Using %d areas\n", n);
+	num_areas = n;
+	area_index_shift = __fls(nslabs) - __fls(num_areas);
+	area_bits = __fls(n);
+	return 0;
+}
+
+/*
+ * Can be called from architecture specific code when swiotlb is set up
+ * before possible cpus are set up.
+ */
+void __init swiotlb_hint_cpus(int cpus)
+{
+	if (num_areas == NUM_AREAS_DEFAULT && cpus > 1) {
+		if (!is_power_of_2(cpus))
+			cpus = 1U << (__fls(cpus) + 1);
+		setup_areas(cpus);
+	}
+}
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
+	int ret = 0;
+
 	if (isdigit(*str)) {
 		/* avoid tail segment of size < IO_TLB_SEGSIZE */
 		default_nslabs =
 			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
 	}
+	if (*str == ',')
+		++str;
+	if (isdigit(*str))
+		ret = setup_areas(simple_strtoul(str, &str, 0));
 	if (*str == ',')
 		++str;
 	if (!strcmp(str, "force"))
@@ -100,7 +148,7 @@ setup_io_tlb_npages(char *str)
 	else if (!strcmp(str, "noforce"))
 		swiotlb_force = SWIOTLB_NO_FORCE;

-	return 0;
+	return ret;
 }
 early_param("swiotlb", setup_io_tlb_npages);
@@ -139,6 +187,7 @@ void __init swiotlb_adjust_size(unsigned long size)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
 	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	setup_areas(num_areas);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
@@ -227,18 +276,22 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->nslabs = nslabs;
 	mem->start = start;
 	mem->end = mem->start + bytes;
-	INIT_LIST_HEAD(&mem->free_slots);
 	mem->late_alloc = late_alloc;

 	if (swiotlb_force == SWIOTLB_FORCE)
 		mem->force_bounce = true;

-	spin_lock_init(&mem->lock);
+	for (i = 0; i < num_areas; i++) {
+		INIT_LIST_HEAD(&mem->areas[i].free_slots);
+		spin_lock_init(&mem->areas[i].lock);
+	}
+
 	for (i = 0; i < mem->nslabs; i++) {
+		int aindex = area_index_shift ? i >> area_index_shift : 0;
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add_tail(&mem->slots[i].node, &mem->free_slots);
+		list_add_tail(&mem->slots[i].node,
+			      &mem->areas[aindex].free_slots);
 	}

 	/*
@@ -270,6 +323,10 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 	if (!mem->slots)
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
+	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * num_areas,
+				    SMP_CACHE_BYTES);
+	if (!mem->areas)
+		panic("Cannot allocate io_tlb_areas");
 	mem->bitmap = memblock_alloc(DIV_ROUND_UP(nslabs, BITS_PER_BYTE),
 				     SMP_CACHE_BYTES);
 	if (!mem->bitmap)
@@ -371,6 +428,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long bytes = nslabs << IO_TLB_SHIFT;
+	int order;

 	if (swiotlb_force == SWIOTLB_NO_FORCE)
 		return 0;
@@ -380,14 +438,22 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 		return -ENOMEM;

 	mem->bitmap = kzalloc(DIV_ROUND_UP(nslabs, BITS_PER_BYTE), GFP_KERNEL);
-	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
-		get_order(array_size(sizeof(*mem->slots), nslabs)));
+	order = get_order(array_size(sizeof(*mem->slots), nslabs));
+	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!mem->slots || !mem->bitmap) {
 		kfree(mem->bitmap);
 		kfree(mem->slots);
 		return -ENOMEM;
 	}

+	mem->areas = (struct io_tlb_area *)kcalloc(num_areas,
+						   sizeof(struct io_tlb_area),
+						   GFP_KERNEL);
+	if (!mem->areas) {
+		free_pages((unsigned long)mem->slots, order);
+		return -ENOMEM;
+	}
+
 	set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
@@ -412,9 +478,12 @@ void __init swiotlb_exit(void)
 	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
 	if (mem->late_alloc) {
+		kfree(mem->areas);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
+		memblock_free_late(__pa(mem->areas),
+				   num_areas * sizeof(struct io_tlb_area));
 		memblock_free_late(mem->start, tbl_size);
 		memblock_free_late(__pa(mem->slots), slots_size);
 	}
@@ -517,14 +586,32 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }

+static inline unsigned long area_nslabs(struct io_tlb_mem *mem)
+{
+	return area_index_shift ? (1UL << area_index_shift) + 1 :
+		mem->nslabs;
+}
+
+static inline unsigned int area_start(int aindex)
+{
+	return aindex << area_index_shift;
+}
+
+static inline unsigned int area_end(struct io_tlb_mem *mem, int aindex)
+{
+	return area_start(aindex) + area_nslabs(mem);
+}
+
 /*
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct io_tlb_area *area,
+				 int area_index,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	struct io_tlb_slot *slot, *tmp;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
@@ -538,12 +625,13 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	unsigned long flags;

 	BUG_ON(!nslots);
+	BUG_ON(area_index >= num_areas);

-	spin_lock_irqsave(&mem->lock, flags);
-	if (unlikely(nslots > mem->nslabs - mem->used))
+	spin_lock_irqsave(&area->lock, flags);
+	if (unlikely(nslots > area_nslabs(mem) - area->used))
 		goto not_found;

-	list_for_each_entry_safe(slot, tmp, &mem->free_slots, node) {
+	list_for_each_entry_safe(slot, tmp, &area->free_slots, node) {
 		index = slot - mem->slots;
 		if (orig_addr &&
 		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
@@ -576,7 +664,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	}

 not_found:
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return -1;

 found:
@@ -587,12 +675,42 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		list_del(&mem->slots[i].node);
 	}

-	mem->used += nslots;
+	area->used += nslots;

-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);

 	return index;
 }

+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int start = raw_smp_processor_id() & ((1U << area_bits) - 1);
+	int i, index;
+
+	i = start;
+	do {
+		index = swiotlb_do_find_slots(mem, mem->areas + i, i,
+					      dev, orig_addr, alloc_size);
+		if (index >= 0)
+			return index;
+		if (++i >= num_areas)
+			i = 0;
+	} while (i != start);
+	return -1;
+}
+
+/* Somewhat racy estimate */
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	for (i = 0; i < num_areas; i++)
+		used += mem->areas[i].used;
+	return used;
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		enum dma_data_direction dir, unsigned long attrs)
@@ -620,7 +738,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				 alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
@@ -650,9 +768,14 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
+	int aindex = area_index_shift ? index >> area_index_shift : 0;
+	struct io_tlb_area *area = mem->areas + aindex;
 	int i;

-	spin_lock_irqsave(&mem->lock, flags);
+	BUG_ON(aindex >= num_areas);
+
+	spin_lock_irqsave(&area->lock, flags);
+
 	/*
 	 * Return the slots to swiotlb, updating bitmap to indicate
 	 * corresponding entries are free.
@@ -661,11 +784,11 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add(&mem->slots[i].node, &mem->free_slots);
+		list_add(&mem->slots[i].node, &area->free_slots);
 	}

-	mem->used -= nslots;
-	spin_unlock_irqrestore(&mem->lock, flags);
+	area->used -= nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
 }

 /*
@@ -749,17 +872,27 @@ bool is_swiotlb_active(struct device *dev)
 {
 	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

-	return mem && mem->nslabs;
+	return mem && mem->start != mem->end;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);

 #ifdef CONFIG_DEBUG_FS
+
+static int used_get(void *data, u64 *val)
+{
+	struct io_tlb_mem *mem = (struct io_tlb_mem *)data;
+	*val = mem_used(mem);
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(used_fops, used_get, NULL, "%llu\n");
+
 static struct dentry *debugfs_dir;

 static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem)
 {
 	debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
-	debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
+	debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem, &used_fops);
 }

 static int __init swiotlb_create_default_debugfs(void)
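The allocation fallback in the patch's swiotlb_find_slots() — start in the CPU's home area, then walk the remaining areas, wrapping around, and give up only after a full cycle — can be modeled in isolation. This is a hypothetical standalone sketch written for this note, not kernel code: try_area() and the free_slots counters stand in for swiotlb_do_find_slots() and the per-area free lists, but the loop structure mirrors the patch.

```c
#include <assert.h>

#define NUM_AREAS 4     /* must be a power of two, as in the patch */

/* Free-slot count per area; stands in for the per-area free lists. */
static int free_slots[NUM_AREAS];

/* Stand-in for swiotlb_do_find_slots(): succeed if the area has room. */
static int try_area(int area, int nslots)
{
        if (free_slots[area] >= nslots) {
                free_slots[area] -= nslots;
                return area;    /* pretend this is the found slot index */
        }
        return -1;
}

/*
 * Mirror of the patch's swiotlb_find_slots() loop: begin at the CPU's
 * home area (cpu masked by the power-of-two area count), walk forward
 * wrapping at NUM_AREAS, and stop after one full cycle.
 */
static int find_slots(int cpu, int nslots)
{
        int start = cpu & (NUM_AREAS - 1);
        int i = start;

        do {
                int index = try_area(i, nslots);

                if (index >= 0)
                        return index;
                if (++i >= NUM_AREAS)
                        i = 0;
        } while (i != start);
        return -1;      /* every area was too full */
}
```

Under low contention every CPU stays in its own area and never touches another area's lock; the wrap-around search only kicks in when the home area is exhausted, which preserves correctness at the cost of occasional cross-area traffic.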