Patchwork [03/10] x86: add initialization code for DMA-API debugging

Submitter Joerg Roedel
Date Nov. 21, 2008, 4:26 p.m.
Message ID <1227284770-19215-4-git-send-email-joerg.roedel@amd.com>
Permalink /patch/10050/
State Not Applicable
Delegated to: David Miller

Comments

Joerg Roedel - Nov. 21, 2008, 4:26 p.m.
Impact: creates necessary data structures for DMA-API debugging

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/include/asm/dma-mapping.h |    1 +
 arch/x86/include/asm/dma_debug.h   |   14 +++++
 arch/x86/kernel/Makefile           |    2 +
 arch/x86/kernel/pci-dma-debug.c    |  111 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/pci-dma.c          |    2 +
 5 files changed, 130 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/pci-dma-debug.c
Ingo Molnar - Nov. 21, 2008, 4:56 p.m.
* Joerg Roedel <joerg.roedel@amd.com> wrote:

> +extern
> +void dma_debug_init(void);

this can be on a single line.

> +
> +#else /* CONFIG_DMA_API_DEBUG */
> +
> +static inline
> +void dma_debug_init(void)

this too. (when something fits on a single line, we prefer it that way)

> +#include <linux/types.h>
> +#include <linux/scatterlist.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/module.h>
> +#include <linux/hardirq.h>
> +#include <linux/dma-mapping.h>
> +#include <asm/bug.h>
> +#include <asm/dma-mapping.h>
> +#include <asm/dma_debug.h>

to reduce the chances of commit conflicts in the future, we 
generally sort include lines in x86 files the following way:

> +#include <linux/scatterlist.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/spinlock.h>
> +#include <linux/hardirq.h>
> +#include <linux/module.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
>
> +#include <asm/bug.h>
> +#include <asm/dma-mapping.h>
> +#include <asm/dma_debug.h>

[ note the extra newline too between the linux/ and asm/ portions. ]

> +#define HASH_SIZE 256
> +#define HASH_FN_SHIFT 20
> +#define HASH_FN_MASK 0xffULL

please align the values vertically.

> +/* Hash list to save the allocated dma addresses */
> +static struct list_head dma_entry_hash[HASH_SIZE];

Should be cacheline-aligned i guess - if this feature is enabled this 
is a hot area.

> +/* A slab cache to allocate dma_map_entries fast */
> +static struct kmem_cache *dma_entry_cache;

__read_mostly - to isolate it from the above hot area.

> +/* lock to protect the data structures */
> +static DEFINE_SPINLOCK(dma_lock);

should have a separate cacheline too i guess.

> +static int hash_fn(struct dma_debug_entry *entry)
> +{
> +	/*
> +	 * Hash function is based on the dma address.
> +	 * We use bits 20-27 here as the index into the hash
> +	 */
> +	BUG_ON(entry->dev_addr == bad_dma_address);

please use WARN_ON_ONCE() instead of crashing the box in possibly
hard-to-debug spots.

> +	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
> +}
> +
> +static struct dma_debug_entry *dma_entry_alloc(void)
> +{
> +	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> +
> +	if (in_atomic())
> +		gfp |= GFP_ATOMIC;
> +
> +	return kmem_cache_alloc(dma_entry_cache, gfp);

hm. that slab allocation in the middle of DMA mapping ops makes me a 
bit nervous. the DMA mapping ops are generally rather atomic.

and the in_atomic() check there is a bug on !PREEMPT kernels, so it won't 
fly.

We don't have a gfp flag passed in, as all the DMA mapping APIs really 
expect all allocations to have been done in advance already.

Any chance to attach the debug entry to the iotlb entries themselves - 
either during build time (for swiotlb) or during iommu init time (for 
the hw iommu's) ?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joerg Roedel - Nov. 21, 2008, 5:10 p.m.
On Fri, Nov 21, 2008 at 05:56:28PM +0100, Ingo Molnar wrote:
> > +	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
> > +}
> > +
> > +static struct dma_debug_entry *dma_entry_alloc(void)
> > +{
> > +	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> > +
> > +	if (in_atomic())
> > +		gfp |= GFP_ATOMIC;
> > +
> > +	return kmem_cache_alloc(dma_entry_cache, gfp);
> 
> hm. that slab allocation in the middle of DMA mapping ops makes me a 
> bit nervous. the DMA mapping ops are generally rather atomic.
> 
> and the in_atomic() check there is a bug on !PREEMPT kernels, so it won't 
> fly.

I am not sure I understand this correctly. You say the check for
in_atomic() will break on !PREEMPT kernels?

> We don't have a gfp flag passed in, as all the DMA mapping APIs really 
> expect all allocations to have been done in advance already.

Hmm, I can change the code to pre-allocate a certain (configurable?)
number of these entries and disable the checking if we run out of them.

> Any chance to attach the debug entry to the iotlb entries themselves - 
> either during build time (for swiotlb) or during iommu init time (for 
> the hw iommu's) ?

Hm, I want to avoid that. This approach has the advantage that it
works independently of any dma_ops implementation. So it can be used to
find out if a DMA bug originates from the device driver or (in my case)
from the IOMMU driver.

Joerg
Ingo Molnar - Nov. 21, 2008, 5:19 p.m.
* Joerg Roedel <joerg.roedel@amd.com> wrote:

> On Fri, Nov 21, 2008 at 05:56:28PM +0100, Ingo Molnar wrote:
> > > +	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
> > > +}
> > > +
> > > +static struct dma_debug_entry *dma_entry_alloc(void)
> > > +{
> > > +	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> > > +
> > > +	if (in_atomic())
> > > +		gfp |= GFP_ATOMIC;
> > > +
> > > +	return kmem_cache_alloc(dma_entry_cache, gfp);
> > 
> > hm. that slab allocation in the middle of DMA mapping ops makes me a 
> > bit nervous. the DMA mapping ops are generally rather atomic.
> > 
> > and the in_atomic() check there is a bug on !PREEMPT kernels, so it won't 
> > fly.
> 
> I am not sure I understand this correctly. You say the check for 
> in_atomic() will break on !PREEMPT kernels?

Correct. On !PREEMPT kernels the preempt count is not maintained, so 
there is no check to tell whether we can schedule or not. I.e. on 
!PREEMPT your patches will crash and burn.

> > We don't have a gfp flag passed in, as all the DMA mapping APIs 
> > really expect all allocations to have been done in advance already.
> 
> Hmm, I can change the code to pre-allocate a certain (configurable?) 
> number of these entries and disable the checking if we run out of them.

yeah, that's a good approach too - that's what lockdep does. Perhaps 
make the max entries nr a Kconfig entity - so it can be tuned up/down 
depending on what type of iommu scheme is enabled.

	Ingo
Ingo Molnar - Nov. 21, 2008, 5:27 p.m.
* Ingo Molnar <mingo@elte.hu> wrote:

> > > We don't have a gfp flag passed in, as all the DMA mapping APIs 
> > > really expect all allocations to have been done in advance 
> > > already.
> > 
> > Hmm, I can change the code to pre-allocate a certain 
> > (configurable?) number of these entries and disable the checking if 
> > we run out of them.
> 
> yeah, that's a good approach too - that's what lockdep does. Perhaps 
> make the max entries nr a Kconfig entity - so it can be tuned 
> up/down depending on what type of iommu scheme is enabled.

there's another reason why we want to do that: this scheme covers all 
DMA mappings - not just those which need to go via an iommu and for 
which there's an IOTLB entry present. So the pool should probably be 
sized based on RAM size, to be on the safe side.

	Ingo
Ingo Molnar - Nov. 21, 2008, 5:43 p.m.
* Joerg Roedel <joerg.roedel@amd.com> wrote:

> +static struct list_head dma_entry_hash[HASH_SIZE];
> +
> +/* A slab cache to allocate dma_map_entries fast */
> +static struct kmem_cache *dma_entry_cache;
> +
> +/* lock to protect the data structures */
> +static DEFINE_SPINLOCK(dma_lock);

some more generic comments about the data structure: its main purpose 
is to provide a mapping based on (dev,addr). There's little if any 
cross-entry interaction - same-address+same-dev DMA is checked.

1)

the hash:

+ 	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;

should mix in entry->dev as well - that way we get not just per 
address but per device hash space separation as well.

2)

HASH_FN_SHIFT is 1MB chunks right now - that's probably fine in 
practice albeit perhaps a bit too small. There's seldom any coherency 
between the physical addresses of DMA - we rarely have any real 
(performance-relevant) physical co-location of DMA addresses beyond 4K 
granularity. So using 1MB chunking here will discard a good deal of 
random low bits we should be hashing on.

3)

And the most scalable locking would be per hash bucket locking - no 
global lock is needed. The bucket hash heads should probably be 
cacheline sized - so we'd get one lock per bucket.

This way if there's irq+DMA traffic on one CPU from one device into 
one range of memory, and irq+DMA traffic on another CPU to another 
device, they will map to two different hash buckets.

4)

Plus it might be an option to make hash lookup lockless as well: 
depending on the DMA flux we can get a lot of lookups, and taking the 
bucket lock can be avoided if you use RCU-safe list ops and drive the 
refilling of the free entries pool from RCU.

	Ingo
FUJITA Tomonori - Nov. 22, 2008, 3:27 a.m.
On Fri, 21 Nov 2008 17:26:03 +0100
Joerg Roedel <joerg.roedel@amd.com> wrote:

> +static int hash_fn(struct dma_debug_entry *entry)
> +{
> +	/*
> +	 * Hash function is based on the dma address.
> +	 * We use bits 20-27 here as the index into the hash
> +	 */
> +	BUG_ON(entry->dev_addr == bad_dma_address);

'bad_dma_address' is x86 specific. You already found it though.
Joerg Roedel - Nov. 22, 2008, 9:40 a.m.
On Sat, Nov 22, 2008 at 12:27:41PM +0900, FUJITA Tomonori wrote:
> On Fri, 21 Nov 2008 17:26:03 +0100
> Joerg Roedel <joerg.roedel@amd.com> wrote:
> > +static int hash_fn(struct dma_debug_entry *entry)
> > +{
> > +	/*
> > +	 * Hash function is based on the dma address.
> > +	 * We use bits 20-27 here as the index into the hash
> > +	 */
> > +	BUG_ON(entry->dev_addr == bad_dma_address);
> 
> 'bad_dma_address' is x86 specific. You already found it though.

Interesting. Is there another value for dma_addr_t which drivers can
check for to find out if a dma-api operation failed?

Joerg
Joerg Roedel - Nov. 22, 2008, 9:48 a.m.
On Fri, Nov 21, 2008 at 06:43:48PM +0100, Ingo Molnar wrote:
> 
> * Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > +static struct list_head dma_entry_hash[HASH_SIZE];
> > +
> > +/* A slab cache to allocate dma_map_entries fast */
> > +static struct kmem_cache *dma_entry_cache;
> > +
> > +/* lock to protect the data structures */
> > +static DEFINE_SPINLOCK(dma_lock);
> 
> some more generic comments about the data structure: its main purpose 
> is to provide a mapping based on (dev,addr). There's little if any 
> cross-entry interaction - same-address+same-dev DMA is checked.
> 
> 1)
> 
> the hash:
> 
> + 	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
> 
> should mix in entry->dev as well - that way we get not just per 
> address but per device hash space separation as well.
> 
> 2)
> 
> HASH_FN_SHIFT is 1MB chunks right now - that's probably fine in 
> practice albeit perhaps a bit too small. There's seldom any coherency 
> between the physical addresses of DMA - we rarely have any real 
> (performance-relevant) physical co-location of DMA addresses beyond 4K 
> granularity. So using 1MB chunking here will discard a good deal of 
> random low bits we should be hashing on.
> 
> 3)
> 
> And the most scalable locking would be per hash bucket locking - no 
> global lock is needed. The bucket hash heads should probably be 
> cacheline sized - so we'd get one lock per bucket.

Hmm, I just had the idea of saving this data in struct device. How about
that? The locking should scale too and we can extend it more easily. For
example, it simplifies a per-device disable function for the checking. Or
another future feature might be leak tracing.

> This way if there's irq+DMA traffic on one CPU from one device into 
> one range of memory, and irq+DMA traffic on another CPU to another 
> device, they will map to two different hash buckets.
> 
> 4)
> 
> Plus it might be an option to make hash lookup lockless as well: 
> depending on the DMA flux we can get a lot of lookups, and taking the 
> bucket lock can be avoided if you use RCU-safe list ops and drive the 
> refilling of the free entries pool from RCU.

Joerg
FUJITA Tomonori - Nov. 22, 2008, 10:16 a.m.
On Sat, 22 Nov 2008 10:40:32 +0100
Joerg Roedel <joro@8bytes.org> wrote:

> > > +static int hash_fn(struct dma_debug_entry *entry)
> > > +{
> > > +	/*
> > > +	 * Hash function is based on the dma address.
> > > +	 * We use bits 20-27 here as the index into the hash
> > > +	 */
> > > +	BUG_ON(entry->dev_addr == bad_dma_address);
> > 
> > 'bad_dma_address' is x86 specific. You already found it though.
> 
> Interesting. Is there another value for dma_addr_t which drivers can
> check for to find out if a dma-api operation failed?

They are architecture dependent. But only X86 uses a variable because
of GART and swiotlb, I think.
FUJITA Tomonori - Nov. 22, 2008, 12:28 p.m.
On Sat, 22 Nov 2008 19:16:05 +0900
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote:

> On Sat, 22 Nov 2008 10:40:32 +0100
> Joerg Roedel <joro@8bytes.org> wrote:
> 
> > > > +static int hash_fn(struct dma_debug_entry *entry)
> > > > +{
> > > > +	/*
> > > > +	 * Hash function is based on the dma address.
> > > > +	 * We use bits 20-27 here as the index into the hash
> > > > +	 */
> > > > +	BUG_ON(entry->dev_addr == bad_dma_address);
> > > 
> > > 'bad_dma_address' is x86 specific. You already found it though.
> > 
> > Interesting. Is there another value for dma_addr_t which drivers can
> > check for to find out if a dma-api operation failed?
> 
> They are architecture dependent. But only X86 uses a variable because
> of GART and swiotlb, I think.

BTW, this code doesn't work even on x86 (swiotlb). dma_mapping_error()
should be used, which is an architecture-independent way to test a dma
address.
Ingo Molnar - Nov. 23, 2008, 11:28 a.m.
* Joerg Roedel <joro@8bytes.org> wrote:

> On Fri, Nov 21, 2008 at 06:43:48PM +0100, Ingo Molnar wrote:
> > 
> > 
> > 3)
> > 
> > And the most scalable locking would be per hash bucket locking - no 
> > global lock is needed. The bucket hash heads should probably be 
> > cacheline sized - so we'd get one lock per bucket.
> 
> Hmm, I just had the idea of saving this data in struct device. How 
> about that? The locking should scale too and we can extend it more 
> easily. For example, it simplifies a per-device disable function for 
> the checking. Or another future feature might be leak tracing.

that will help with spreading the hash across devices, but brings in 
lifetime issues: you must be absolutely sure all DMA has drained at 
the point a device is deinitialized.

Dunno ... i think it's still better to have a central hash for all DMA 
ops that is aware of per device details.

The moment we spread this out to struct device we've lost the ability 
to _potentially_ do inter-device checking. (for example to make sure 
no other device is DMA-ing on the same address - which is a possible 
bug pattern as well although right now Linux does not really avoid it 
actively)

Hm?

Btw., also have a look at lib/debugobjects.c: i think we should also 
consider extending that facility and then just hook it up to the DMA 
ops. The DMA checking is kind of a similar "op lifetime" scenario - 
with a few extras to extend lib/debugobjects.c with. Note how it 
already has pooling, a good hash, etc.

	Ingo
Ingo Molnar - Nov. 23, 2008, 11:35 a.m.
* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Joerg Roedel <joro@8bytes.org> wrote:
> 
> > 
> > Hmm, I just had the idea of saving this data in struct device. How 
> > about that? The locking should scale too and we can extend it more 
> > easily. For example, it simplifies a per-device disable function for 
> > the checking. Or another future feature might be leak tracing.
> 
> that will help with spreading the hash across devices, but brings in 
> lifetime issues: you must be absolutely sure all DMA has drained at 
> the point a device is deinitialized.

Note that obviously proper DMA quiescence is a must-have during device 
deinit anyway, but still, it's an extra complication to init/deinit the 
hashes, etc.

	Ingo
Ingo Molnar - Nov. 23, 2008, 1:04 p.m.
Another generic suggestion:

Why not just replace dma_ops with a debug version? That way it could 
be a runtime feature based off the already existing DMA op callbacks, 
without any extra overhead.

I'd prefer such a solution much more with an x86 architecture 
maintainer hat on, because this way distributions could enable this 
debug feature (with the facility being off by default) without 
worrying about the wrapping overhead.

This would basically be a special variant of stacked DMA ops support.

	Ingo
Andi Kleen - Nov. 23, 2008, 7:36 p.m.
Joerg Roedel <joerg.roedel@amd.com> writes:
> +/* Hash list to save the allocated dma addresses */
> +static struct list_head dma_entry_hash[HASH_SIZE];

Hash tables should use hlists.

> +static int hash_fn(struct dma_debug_entry *entry)
> +{
> +	/*
> +	 * Hash function is based on the dma address.
> +	 * We use bits 20-27 here as the index into the hash
> +	 */
> +	BUG_ON(entry->dev_addr == bad_dma_address);
> +
> +	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;

It would probably be safer to use a stronger hash like FNV.
There are a couple to reuse in include/.

> +}
> +
> +static struct dma_debug_entry *dma_entry_alloc(void)
> +{
> +	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> +
> +	if (in_atomic())
> +		gfp |= GFP_ATOMIC;
> +
> +	return kmem_cache_alloc(dma_entry_cache, gfp);
> +}

While the basic idea is reasonable, this function is unfortunately
broken. It's not always safe to allocate memory (e.g. in the block
write-out path, which uses map_sg). You would need to use
a mempool or something.

Besides, the other problem with using GFP_ATOMIC is that it can 
fail under high load and you don't handle this case very well 
(you would report a bug incorrectly). And stress tests tend to 
trigger that; reporting false positives in such a case is a very, 
very bad thing, because it leads to QA people putting these messages 
on their blacklists.

-Andi

Patch

diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
index 7f225a4..83d7b7d 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -9,6 +9,7 @@ 
 #include <linux/scatterlist.h>
 #include <asm/io.h>
 #include <asm/swiotlb.h>
+#include <asm/dma_debug.h>
 #include <asm-generic/dma-coherent.h>
 
 extern dma_addr_t bad_dma_address;
diff --git a/arch/x86/include/asm/dma_debug.h b/arch/x86/include/asm/dma_debug.h
index d79f024..f2c3d53 100644
--- a/arch/x86/include/asm/dma_debug.h
+++ b/arch/x86/include/asm/dma_debug.h
@@ -38,4 +38,18 @@  struct dma_debug_entry {
 	int direction;
 };
 
+#ifdef CONFIG_DMA_API_DEBUG
+
+extern
+void dma_debug_init(void);
+
+#else /* CONFIG_DMA_API_DEBUG */
+
+static inline
+void dma_debug_init(void)
+{
+}
+
+#endif /* CONFIG_DMA_API_DEBUG */
+
 #endif /* __ASM_X86_DMA_DEBUG */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e489ff9..6271cd2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -105,6 +105,8 @@  microcode-$(CONFIG_MICROCODE_INTEL)	+= microcode_intel.o
 microcode-$(CONFIG_MICROCODE_AMD)	+= microcode_amd.o
 obj-$(CONFIG_MICROCODE)			+= microcode.o
 
+obj-$(CONFIG_DMA_API_DEBUG)             += pci-dma-debug.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/pci-dma-debug.c b/arch/x86/kernel/pci-dma-debug.c
new file mode 100644
index 0000000..c2d3408
--- /dev/null
+++ b/arch/x86/kernel/pci-dma-debug.c
@@ -0,0 +1,111 @@ 
+/*
+ * Copyright (C) 2008 Advanced Micro Devices, Inc.
+ *
+ * Author: Joerg Roedel <joerg.roedel@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include <linux/types.h>
+#include <linux/scatterlist.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/hardirq.h>
+#include <linux/dma-mapping.h>
+#include <asm/bug.h>
+#include <asm/dma-mapping.h>
+#include <asm/dma_debug.h>
+
+#define HASH_SIZE 256
+#define HASH_FN_SHIFT 20
+#define HASH_FN_MASK 0xffULL
+
+/* Hash list to save the allocated dma addresses */
+static struct list_head dma_entry_hash[HASH_SIZE];
+
+/* A slab cache to allocate dma_map_entries fast */
+static struct kmem_cache *dma_entry_cache;
+
+/* lock to protect the data structures */
+static DEFINE_SPINLOCK(dma_lock);
+
+static int hash_fn(struct dma_debug_entry *entry)
+{
+	/*
+	 * Hash function is based on the dma address.
+	 * We use bits 20-27 here as the index into the hash
+	 */
+	BUG_ON(entry->dev_addr == bad_dma_address);
+
+	return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
+}
+
+static struct dma_debug_entry *dma_entry_alloc(void)
+{
+	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
+
+	if (in_atomic())
+		gfp |= GFP_ATOMIC;
+
+	return kmem_cache_alloc(dma_entry_cache, gfp);
+}
+
+static void dma_entry_free(struct dma_debug_entry *entry)
+{
+	kmem_cache_free(dma_entry_cache, entry);
+}
+
+static struct dma_debug_entry *
+find_dma_entry(struct dma_debug_entry *ref)
+{
+	int idx = hash_fn(ref);
+	struct dma_debug_entry *entry;
+
+	list_for_each_entry(entry, &dma_entry_hash[idx], list) {
+		if ((entry->dev_addr == ref->dev_addr) &&
+				(entry->dev == ref->dev))
+			return entry;
+	}
+
+	return NULL;
+}
+
+static void add_dma_entry(struct dma_debug_entry *entry)
+{
+	int idx = hash_fn(entry);
+
+	list_add_tail(&entry->list, &dma_entry_hash[idx]);
+}
+
+static void remove_dma_entry(struct dma_debug_entry *entry)
+{
+	list_del(&entry->list);
+}
+
+void dma_debug_init(void)
+{
+	int i;
+
+	for (i = 0; i < HASH_SIZE; ++i)
+		INIT_LIST_HEAD(&dma_entry_hash[i]);
+
+	dma_entry_cache = kmem_cache_create("dma_debug_entry_cache",
+			sizeof(struct dma_debug_entry),
+			0, SLAB_PANIC, NULL);
+
+	printk(KERN_INFO "PCI-DMA: DMA API debugging enabled by kernel config\n");
+}
+
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 1926248..94096b8 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -275,6 +275,8 @@  EXPORT_SYMBOL(dma_supported);
 
 static int __init pci_iommu_init(void)
 {
+	dma_debug_init();
+
 	calgary_iommu_init();
 
 	intel_iommu_init();