mbox series

[v3,0/7] Marvell NAND controller rework with ->exec_op()

Message ID 20180109103637.23798-1-miquel.raynal@free-electrons.com
Headers show
Series Marvell NAND controller rework with ->exec_op() | expand

Message

Miquel Raynal Jan. 9, 2018, 10:36 a.m. UTC
Hi,

After the addition of the NAND framework ->exec_op() interface (see [1]
for the series preparing it and [2] for the last version of the
core-side implementation of ->exec_op() itself), this series replaces
the current Marvell NAND controller driver pxa3xx_nand.c with a rework
called marvell_nand.c.

Aside the fact that it drops the big state machine, improves the overall
speed and implements raw accesses, it is the first driver-side
implementation of the ->exec_op() interface and may be used as reference
for latter reworks of the same type.

One may find more detail about why a completely new driver is needed in
the commit log of:

    "mtd: nand: add reworked Marvell NAND controller driver"

Device tree NAND node definition for all platforms referring to the
Marvell driver using the new bindings have already been accepted by the
MVEBU DT maintainers and will be merged after the driver. They are more
hierarchical and fit the real organization of the hardware, by having
NAND partitions that are part of NAND chip nodes, themselves part of the
NAND controller node.

These changes have been tested on:
   - PXA3xx platform with a CM-X300 board (2kiB page NAND, 1b/512B
     strength, Hamming ECC engine) [32 bits]
   - Armada 385 DB AP (4kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
   - Armada 398 DB (4kiB page NAND, 8b/512B, BCH ECC engine using a layout
     with a last chunk different than the others) [32 bits]
   - Armada 7040 DB and Armada 8040 DB (4kiB page NAND, 4b/512B, BCH ECC
     engine) [64 bits]
   - Triax dvb-tc board (2kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]

This version is known not to be stable yet with a Zylonite based setup but
otherwise looks good for Marvell EBU platforms.

For people who would like to test it easily, a branch ready to be tested
is available at [3]. It is based on nand/next and has all the changes
brought by the previously mentionned series as well as this one.

Thank you,
Miquèl


[1] https://www.spinics.net/lists/arm-kernel/msg619633.html
[2] http://lists.infradead.org/pipermail/linux-mtd/2017-December/077965.html
[3] https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc


Changes since v2:
  - Added a patch to create the nand-rb property in the Documentation
  - Rewording in the Documentation according to Rob's comments
  - Moved from marvell,rb to nand-rb property in the code
  - Disociated using this driver with Marvell EBU platforms than using
    it with PXA ones
  - Fixed the handling of 16-bit buses
  - Fixed SPDX comment style
  - Reorganized registers offsets/bit fields definitions as requested
  - Moved to Kernel doc
  - Changed the logic in ->select_chip() to use a prepared value of NDCR
    only instead of recalculating it
  - Fixed the presence of the SPARE_EN bit, reworked a bit the
    hmg_do_read/write() helpers
  - Fixed the OOB layouts that were unusable (all spare data first, then
    all ECC bytes)
  - Additional check on mtd->writesize when using NFCv1 (all sizes not
    supported)
  - Various typos/rewording

Changes since v1:
  - Rewording
  - Fixed BCH ->read/write_page() hooks for 2kiB pages NAND chips
  - Removed license text, used SPDX tag instead
  - Removed read_page_data()
  - Enhanced the DT bindings document with the label property and the
    deprecated bindings.
  - Simplified the read_chunk() helper (OOB always read).
  - Simplified the ->bch_read_page() hook by removing the addition raw
    read to get ECC bytes.
  - Fixed the ->correct() function that did not check for bitflips in
    ECC bytes in erased pages.


Miquel Raynal (7):
  dt-bindings: mtd: document new nand-rb property
  dt-bindings: mtd: add Marvell NAND controller documentation
  mtd: nand: add reworked Marvell NAND controller driver
  mtd: nand: use reworked NAND controller driver with Marvell EBU SoCs
  mtd: nand: use Marvell reworked NAND controller driver with all
    platforms
  dt-bindings: mtd: remove pxa3xx NAND controller documentation
  mtd: nand: remove useless fields from pxa3xx NAND platform data

 .../devicetree/bindings/mtd/marvell-nand.txt       |  123 +
 Documentation/devicetree/bindings/mtd/nand.txt     |    1 +
 .../devicetree/bindings/mtd/pxa3xx-nand.txt        |   50 -
 arch/arm/configs/cm_x300_defconfig                 |    2 +-
 arch/arm/configs/mvebu_v7_defconfig                |    2 +-
 arch/arm/configs/pxa3xx_defconfig                  |    3 +-
 arch/arm/configs/pxa_defconfig                     |    2 +-
 arch/arm/configs/raumfeld_defconfig                |    2 +-
 arch/arm/mach-mmp/ttc_dkb.c                        |    4 +-
 arch/arm/mach-pxa/cm-x300.c                        |    8 +-
 arch/arm/mach-pxa/colibri-pxa3xx.c                 |    8 +-
 arch/arm/mach-pxa/colibri.h                        |    2 +-
 arch/arm/mach-pxa/littleton.c                      |   10 +-
 arch/arm/mach-pxa/mxm8x10.c                        |   10 +-
 arch/arm/mach-pxa/raumfeld.c                       |    6 +-
 arch/arm/mach-pxa/zylonite.c                       |   10 +-
 arch/arm64/configs/defconfig                       |    2 +-
 drivers/mtd/nand/Kconfig                           |   18 +-
 drivers/mtd/nand/Makefile                          |    2 +-
 drivers/mtd/nand/marvell_nand.c                    | 2896 ++++++++++++++++++++
 drivers/mtd/nand/pxa3xx_nand.c                     | 2104 --------------
 include/linux/platform_data/mtd-nand-pxa3xx.h      |   43 +-
 22 files changed, 3072 insertions(+), 2236 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/mtd/marvell-nand.txt
 delete mode 100644 Documentation/devicetree/bindings/mtd/pxa3xx-nand.txt
 create mode 100644 drivers/mtd/nand/marvell_nand.c
 delete mode 100644 drivers/mtd/nand/pxa3xx_nand.c

Comments

Boris Brezillon Jan. 11, 2018, 11:27 a.m. UTC | #1
On Tue,  9 Jan 2018 11:36:30 +0100
Miquel Raynal <miquel.raynal@free-electrons.com> wrote:

> Hi,
> 
> After the addition of the NAND framework ->exec_op() interface (see [1]
> for the series preparing it and [2] for the last version of the
> core-side implementation of ->exec_op() itself), this series replaces
> the current Marvell NAND controller driver pxa3xx_nand.c with a rework
> called marvell_nand.c.
> 
> Aside the fact that it drops the big state machine, improves the overall
> speed and implements raw accesses, it is the first driver-side
> implementation of the ->exec_op() interface and may be used as reference
> for latter reworks of the same type.
> 
> One may find more detail about why a completely new driver is needed in
> the commit log of:
> 
>     "mtd: nand: add reworked Marvell NAND controller driver"
> 
> Device tree NAND node definition for all platforms referring to the
> Marvell driver using the new bindings have already been accepted by the
> MVEBU DT maintainers and will be merged after the driver. They are more
> hierarchical and fit the real organization of the hardware, by having
> NAND partitions that are part of NAND chip nodes, themselves part of the
> NAND controller node.
> 
> These changes have been tested on:
>    - PXA3xx platform with a CM-X300 board (2kiB page NAND, 1b/512B
>      strength, Hamming ECC engine) [32 bits]
>    - Armada 385 DB AP (4kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
>    - Armada 398 DB (4kiB page NAND, 8b/512B, BCH ECC engine using a layout
>      with a last chunk different than the others) [32 bits]
>    - Armada 7040 DB and Armada 8040 DB (4kiB page NAND, 4b/512B, BCH ECC
>      engine) [64 bits]
>    - Triax dvb-tc board (2kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
> 
> This version is known not to be stable yet with a Zylonite based setup but
> otherwise looks good for Marvell EBU platforms.

So, here is the plan: since the driver has been tested on various mvebu
platforms and is known to work fine on these platforms, I'd like to
queue the driver and the patch modifying mvebu defconfigs (patches 1 to
4) for 4.16.
I'll leave other patches for 4.17, which means I'd like remaining bugs
to be fixed during the 4.16 release cycle so that we can eventually get
rid of the old driver. That's really important to me that we don't keep
both drivers around for too long, because my previous experience showed
that, when you have 2 drivers for the same HW, people don't switch to
the new one until they're forced to do it.

Robert, are you fine with this approach? What about the tests you were
doing? Did you make any progress? Did you find other issues?

> 
> For people who would like to test it easily, a branch ready to be tested
> is available at [3]. It is based on nand/next and has all the changes
> brought by the previously mentionned series as well as this one.
> 
> Thank you,
> Miquèl
> 
> 
> [1] https://www.spinics.net/lists/arm-kernel/msg619633.html
> [2] http://lists.infradead.org/pipermail/linux-mtd/2017-December/077965.html
> [3] https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc
> 
> 
> Changes since v2:
>   - Added a patch to create the nand-rb property in the Documentation
>   - Rewording in the Documentation according to Rob's comments
>   - Moved from marvell,rb to nand-rb property in the code
>   - Disociated using this driver with Marvell EBU platforms than using
>     it with PXA ones
>   - Fixed the handling of 16-bit buses
>   - Fixed SPDX comment style
>   - Reorganized registers offsets/bit fields definitions as requested
>   - Moved to Kernel doc
>   - Changed the logic in ->select_chip() to use a prepared value of NDCR
>     only instead of recalculating it
>   - Fixed the presence of the SPARE_EN bit, reworked a bit the
>     hmg_do_read/write() helpers
>   - Fixed the OOB layouts that were unusable (all spare data first, then
>     all ECC bytes)
>   - Additional check on mtd->writesize when using NFCv1 (all sizes not
>     supported)
>   - Various typos/rewording
> 
> Changes since v1:
>   - Rewording
>   - Fixed BCH ->read/write_page() hooks for 2kiB pages NAND chips
>   - Removed license text, used SPDX tag instead
>   - Removed read_page_data()
>   - Enhanced the DT bindings document with the label property and the
>     deprecated bindings.
>   - Simplified the read_chunk() helper (OOB always read).
>   - Simplified the ->bch_read_page() hook by removing the addition raw
>     read to get ECC bytes.
>   - Fixed the ->correct() function that did not check for bitflips in
>     ECC bytes in erased pages.
> 
> 
> Miquel Raynal (7):
>   dt-bindings: mtd: document new nand-rb property
>   dt-bindings: mtd: add Marvell NAND controller documentation
>   mtd: nand: add reworked Marvell NAND controller driver
>   mtd: nand: use reworked NAND controller driver with Marvell EBU SoCs
>   mtd: nand: use Marvell reworked NAND controller driver with all
>     platforms
>   dt-bindings: mtd: remove pxa3xx NAND controller documentation
>   mtd: nand: remove useless fields from pxa3xx NAND platform data
> 
>  .../devicetree/bindings/mtd/marvell-nand.txt       |  123 +
>  Documentation/devicetree/bindings/mtd/nand.txt     |    1 +
>  .../devicetree/bindings/mtd/pxa3xx-nand.txt        |   50 -
>  arch/arm/configs/cm_x300_defconfig                 |    2 +-
>  arch/arm/configs/mvebu_v7_defconfig                |    2 +-
>  arch/arm/configs/pxa3xx_defconfig                  |    3 +-
>  arch/arm/configs/pxa_defconfig                     |    2 +-
>  arch/arm/configs/raumfeld_defconfig                |    2 +-
>  arch/arm/mach-mmp/ttc_dkb.c                        |    4 +-
>  arch/arm/mach-pxa/cm-x300.c                        |    8 +-
>  arch/arm/mach-pxa/colibri-pxa3xx.c                 |    8 +-
>  arch/arm/mach-pxa/colibri.h                        |    2 +-
>  arch/arm/mach-pxa/littleton.c                      |   10 +-
>  arch/arm/mach-pxa/mxm8x10.c                        |   10 +-
>  arch/arm/mach-pxa/raumfeld.c                       |    6 +-
>  arch/arm/mach-pxa/zylonite.c                       |   10 +-
>  arch/arm64/configs/defconfig                       |    2 +-
>  drivers/mtd/nand/Kconfig                           |   18 +-
>  drivers/mtd/nand/Makefile                          |    2 +-
>  drivers/mtd/nand/marvell_nand.c                    | 2896 ++++++++++++++++++++
>  drivers/mtd/nand/pxa3xx_nand.c                     | 2104 --------------
>  include/linux/platform_data/mtd-nand-pxa3xx.h      |   43 +-
>  22 files changed, 3072 insertions(+), 2236 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/mtd/marvell-nand.txt
>  delete mode 100644 Documentation/devicetree/bindings/mtd/pxa3xx-nand.txt
>  create mode 100644 drivers/mtd/nand/marvell_nand.c
>  delete mode 100644 drivers/mtd/nand/pxa3xx_nand.c
>
Robert Jarzmik Jan. 11, 2018, 5:42 p.m. UTC | #2
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

Hi Boris and Miquel,

> So, here is the plan: since the driver has been tested on various mvebu
> platforms and is known to work fine on these platforms, I'd like to
> queue the driver and the patch modifying mvebu defconfigs (patches 1 to
> 4) for 4.16.
That's all right.

> I'll leave other patches for 4.17, which means I'd like remaining bugs
> to be fixed during the 4.16 release cycle so that we can eventually get
> rid of the old driver. That's really important to me that we don't keep
> both drivers around for too long, because my previous experience showed
> that, when you have 2 drivers for the same HW, people don't switch to
> the new one until they're forced to do it.
>
> Robert, are you fine with this approach? What about the tests you were
> doing? Did you make any progress? Did you find other issues?
So far, with the latest branch from Miquel of tip commit 12b9e62c851c ("ARM64:
dts: marvell: use reworked NAND controller driver on Armada 8K"), the bad blocks
issue is still there, ie :
 - the old pxa3xx driver doesn't see any bad block and mounts the ext2/ubifs
   correctly
 - barebox doesn't see any bad block
 - marvell_nand sees all (or most all) blocks as bad with "flash_bbt=0" in
   platform data, which is very surprising

I'm really surprised that in your tests on the cm_x300, in a platform_data setup
(ie. not device-tree setup), you're not seeing these errors ...

As if I'm fine with this approach, I agree with step 1 (patches 1-4). As for
step 2, I'll agree if the current situation is solved and my boards recognize
correctly my ext2 over ubifs on the NAND.

Cheers.

--
Robert

[1] The dmesg extract (here with flash_bbt = 0)
Loading ARM Linux zImage '/mnt/tftp/zImage_jenkins'
commandline: ram=64M console=ttyS0,115200 ip=dhcp root=/dev/nfs nfsroot=/home/none/nfsroot/zylonite,v3,tcp earlycon mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root)
arch_number: 1233
Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.15.0-rc1-00044-g11cc68b (jenkins@belgarath) (gcc version 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29)) #879 PREEMPT Thu Jan 11 09:16:09 CET 2018
[    0.000000] CPU: XScale-V3 based processor [69056891] revision 1 (ARMv5TE), cr=0000397f
[    0.000000] CPU: VIVT data cache, VIVT instruction cache
[    0.000000] Machine: PXA3xx Platform Development Kit (aka Zylonite)
[    0.000000] Ignoring tag cmdline (using the default kernel command line)
[    0.000000] Memory policy: Data cache writeback
[    0.000000] RO Mode clock: 0.00MHz
[    0.000000] Run Mode clock: 0.00MHz
[    0.000000] Turbo Mode clock: 0.00MHz
[    0.000000] System bus clock: 0.00MHz
[    0.000000] On node 0 totalpages: 16384
[    0.000000]   Normal zone: 128 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 16384 pages, LIFO batch:3
[    0.000000] random: fast init done
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 16256
[    0.000000] Kernel command line: root=/dev/ram0 ip=192.168.1.232:192.168.1.5::255.255.255.0::eth0:on console=ttyS0,115200 mem=64M mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root) ubi.mtd=5 earlycon=pxa,io,0xf6200000,115200n8 debug no_console_suspend
[    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Memory: 56856K/65536K available (4225K kernel code, 202K rwdata, 972K rodata, 2396K init, 102K bss, 8680K reserved, 0K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xc4800000 - 0xff800000   ( 944 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xc4000000   (  64 MB)
[    0.000000]     modules : 0xbf000000 - 0xc0000000   (  16 MB)
[    0.000000]       .text : 0xc0008000 - 0xc04289e8   (4227 kB)
[    0.000000]       .init : 0xc053f000 - 0xc0796000   (2396 kB)
[    0.000000]       .data : 0xc0796000 - 0xc07c8bec   ( 203 kB)
[    0.000000]        .bss : 0xc07c8bec - 0xc07e25fc   ( 103 kB)
[    0.000000] Preemptible hierarchical RCU implementation.
[    0.000000] 	Tasks RCU enabled.
[    0.000000] NR_IRQS: 16, nr_irqs: 336, preallocated irqs: 336
[    0.000000] RJK: parent_rate=13000000, xl=8, xn=1
[    0.000068] sched_clock: 32 bits at 3250kHz, resolution 307ns, wraps every 660764198758ns
[    0.000267] clocksource: oscr0: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 588080137591 ns
[    0.002139] Console: colour dummy device 80x30
[    0.002298] Calibrating delay loop... 103.83 BogoMIPS (lpj=519168)
[    0.081019] pid_max: default: 32768 minimum: 301
[    0.081858] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.081958] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.085170] CPU: Testing write buffer coherency: ok
[    0.088980] Setting up static identity map for 0x80008200 - 0x80008260
[    0.089936] Hierarchical SRCU implementation.
[    0.102958] devtmpfs: initialized
[    0.113835] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.113974] futex hash table entries: 256 (order: -1, 3072 bytes)
[    0.116311] NET: Registered protocol family 16
[    0.119126] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.385492] Advanced Linux Sound Architecture Driver Initialized.
[    0.398200] clocksource: Switched to clocksource oscr0
[    0.551166] NET: Registered protocol family 2
[    0.556911] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[    0.557135] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    0.557316] TCP: Hash tables configured (established 1024 bind 1024)
[    0.557858] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.558038] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.559740] NET: Registered protocol family 1
[    0.561795] RPC: Registered named UNIX socket transport module.
[    0.561891] RPC: Registered udp transport module.
[    0.561946] RPC: Registered tcp transport module.
[    0.562003] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    2.497308] Initialise system trusted keyrings
[    2.499900] workingset: timestamp_bits=30 max_order=14 bucket_order=0
[    2.503798] NFS: Registering the id_resolver key type
[    2.504004] Key type id_resolver registered
[    2.504066] Key type id_legacy registered
[    2.510210] Key type asymmetric registered
[    2.510311] Asymmetric key parser 'x509' registered
[    2.510469] io scheduler noop registered
[    2.510534] io scheduler deadline registered
[    2.510928] io scheduler cfq registered (default)
[    2.510999] io scheduler mq-deadline registered
[    2.511060] io scheduler kyber registered
[    2.571677] pxa-dma pxa-dma.0: initialized 32 channels on 100 requestors
[    2.575099] pxa2xx-uart.0: ttyS0 at MMIO 0x40100000 (irq = 38, base_baud = 928571) is a UART1
[    3.050644] console [ttyS0] enabled
[    3.056987] pxa2xx-uart.1: ttyS1 at MMIO 0x40200000 (irq = 37, base_baud = 928571) is a UART2
[    3.069918] pxa2xx-uart.2: ttyS2 at MMIO 0x40700000 (irq = 36, base_baud = 928571) is a UART3
[    3.085482] nand: executing subop:
[    3.091647] nand:     ->CMD      [0xff]
[    3.095546] nand:     ->WAITRDY  [max 250 ms]
[    3.100705] marvell-nfc pxa3xx-nand: 
[    3.100705] NDCR:  0x90079fff
[    3.100705] NDCB0: 0x00a000ff
[    3.100705] NDCB1: 0x00000000
[    3.100705] NDCB2: 0x00000000
[    3.100705] NDCB3: 0x00000000
[    3.119968] nand: executing subop:
[    3.123439] nand:     ->CMD      [0x90]
[    3.127315] nand:     ->ADDR     [1 cyc: 00]
[    3.131983] nand:     ->DATA_IN  [2 B, force 8-bit]
[    3.136985] marvell-nfc pxa3xx-nand: 
[    3.136985] NDCR:  0x90079fff
[    3.136985] NDCB0: 0x00610090
[    3.136985] NDCB1: 0x00000000
[    3.136985] NDCB2: 0x00000000
[    3.136985] NDCB3: 0x00000000
[    3.155819] nand: executing subop:
[    3.159462] nand:     ->CMD      [0x90]
[    3.163343] nand:     ->ADDR     [1 cyc: 00]
[    3.167635] nand:     ->DATA_IN  [8 B, force 8-bit]
[    3.172747] marvell-nfc pxa3xx-nand: 
[    3.172747] NDCR:  0x90079fff
[    3.172747] NDCB0: 0x00610090
[    3.172747] NDCB1: 0x00000000
[    3.172747] NDCB2: 0x00000000
[    3.172747] NDCB3: 0x00000000
[    3.191422] nand: executing subop:
[    3.194876] nand:     ->CMD      [0x90]
[    3.198897] nand:     ->ADDR     [1 cyc: 20]
[    3.203210] nand:     ->DATA_IN  [4 B, force 8-bit]
[    3.208314] marvell-nfc pxa3xx-nand: 
[    3.208314] NDCR:  0x90079fff
[    3.208314] NDCB0: 0x00610090
[    3.208314] NDCB1: 0x00000020
[    3.208314] NDCB2: 0x00000000
[    3.208314] NDCB3: 0x00000000
[    3.226920] nand: executing subop:
[    3.230511] nand:     ->CMD      [0x90]
[    3.234394] nand:     ->ADDR     [1 cyc: 40]
[    3.238830] nand:     ->DATA_IN  [5 B, force 8-bit]
[    3.243808] marvell-nfc pxa3xx-nand: 
[    3.243808] NDCR:  0x90079fff
[    3.243808] NDCB0: 0x00610090
[    3.243808] NDCB1: 0x00000040
[    3.243808] NDCB2: 0x00000000
[    3.243808] NDCB3: 0x00000000
[    3.262444] nand: device found, Manufacturer ID: 0x20, Chip ID: 0xba
[    3.268947] nand: ST Micro NAND 256MiB 1,8V 16-bit
[    3.273791] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    3.281519] marvell-nfc pxa3xx-nand: No minimum ECC strength, using 1b/512B
[    3.288669] Scanning device for bad blocks
[    3.292845] nand: nand_do_read_oob: from = 0x00000000, len = 64
[    3.299007] marvell-nfc pxa3xx-nand: 
[    3.299007] NDCR:  0x9d079fff
[    3.299007] NDCB0: 0x000d3000
[    3.299007] NDCB1: 0x00000000
[    3.299007] NDCB2: 0x00000000
[    3.299007] NDCB3: 0x00000000
[    3.318075] Bad eraseblock 0 at 0x000000000000
[    3.322773] nand: nand_do_read_oob: from = 0x00020000, len = 64
[    3.328941] marvell-nfc pxa3xx-nand: 
[    3.328941] NDCR:  0x9d079fff
[    3.328941] NDCB0: 0x000d3000
[    3.328941] NDCB1: 0x00400000
[    3.328941] NDCB2: 0x00000000
[    3.328941] NDCB3: 0x00000000
[    3.347848] nand: nand_do_read_oob: from = 0x00040000, len = 64
[    3.354049] marvell-nfc pxa3xx-nand: 
[    3.354049] NDCR:  0x9d079fff
[    3.354049] NDCB0: 0x000d3000
[    3.354049] NDCB1: 0x00800000
[    3.354049] NDCB2: 0x00000000
[    3.354049] NDCB3: 0x00000000
[    3.372925] Bad eraseblock 2 at 0x000000040000
[    3.377451] nand: nand_do_read_oob: from = 0x00060000, len = 64
[    3.383633] marvell-nfc pxa3xx-nand: 
[    3.383633] NDCR:  0x9d079fff
[    3.383633] NDCB0: 0x000d3000
[    3.383633] NDCB1: 0x00c00000
[    3.383633] NDCB2: 0x00000000
[    3.383633] NDCB3: 0x00000000
[    3.402509] Bad eraseblock 3 at 0x000000060000
... and so on ...
[   60.154031] marvell-nfc pxa3xx-nand: 
[   60.154031] NDCR:  0x9d079fff
[   60.154031] NDCB0: 0x000d3000
[   60.154031] NDCB1: 0x58410000
[   60.154031] NDCB2: 0x00000000
[   60.154031] NDCB3: 0x00000000
[   60.173335] ubi0: scanning is finished
[   60.177297] ubi0 error: ubi_read_volume_table: the layout volume was not found
[   60.184878] ubi0 error: ubi_attach_mtd_dev: failed to attach mtd5, error -22
[   60.192397] UBI error: cannot attach mtd5
[   60.197652] pxa-rtc pxa-rtc: setting system clock to 2000-01-01 00:01:25 UTC (946684885)
[   60.282788] smc91x smc91x.0 eth0: link down
[   62.152546] smc91x smc91x.0 eth0: link up, 100Mbps, full-duplex, lpa 0xCDE1
[   62.198419] IP-Config: Complete:
[   62.201766]      device=eth0, hwaddr=00:0e:0c:a7:26:f7, ipaddr=192.168.1.232, mask=255.255.255.0, gw=255.255.255.255
[   62.212438]      host=192.168.1.232, domain=, nis-domain=(none)
[   62.218511]      bootserver=192.168.1.5, rootserver=192.168.1.5, rootpath=
[   62.226759] ALSA device list:
[   62.230209]   #0: Zylonite
[   62.255735] Freeing unused kernel memory: 2396K
[   62.260675] This architecture does not have kernel memory protection.
Starting logging: OK
Miquel Raynal Jan. 11, 2018, 10:24 p.m. UTC | #3
Hi Robert,

On Thu, 11 Jan 2018 18:42:56 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> Hi Boris and Miquel,
> 
> > So, here is the plan: since the driver has been tested on various
> > mvebu platforms and is known to work fine on these platforms, I'd
> > like to queue the driver and the patch modifying mvebu defconfigs
> > (patches 1 to 4) for 4.16.
> That's all right.
> 
> > I'll leave other patches for 4.17, which means I'd like remaining
> > bugs to be fixed during the 4.16 release cycle so that we can
> > eventually get rid of the old driver. That's really important to me
> > that we don't keep both drivers around for too long, because my
> > previous experience showed that, when you have 2 drivers for the
> > same HW, people don't switch to the new one until they're forced to
> > do it.
> >
> > Robert, are you fine with this approach? What about the tests you
> > were doing? Did you make any progress? Did you find other issues?
> So far, with the latest branch from Miquel of tip commit 12b9e62c851c
> ("ARM64: dts: marvell: use reworked NAND controller driver on Armada
> 8K"), the bad blocks issue is still there, ie :
>  - the old pxa3xx driver doesn't see any bad block and mounts the
> ext2/ubifs correctly
>  - barebox doesn't see any bad block
>  - marvell_nand sees all (or most all) blocks as bad with
> "flash_bbt=0" in platform data, which is very surprising
> 
> I'm really surprised that in your tests on the cm_x300, in a
> platform_data setup (ie. not device-tree setup), you're not seeing
> these errors ...

I have no problems with the cm_x300 board (using platform data) but
there is one big difference: the bootloader. You are using Barebox
while I am using U-Boot.

Please pull this branch which is for testing purpose [1].

There are two "HACK"s:
1/ Dump the timing registers: this is to see how Barebox does
initialize these registers. I will put these values back into my setup
and see how the board reacts.
2/ Dump the OOB area while reading. This is to see why the driver
declares all blocks as bad.

Can you please run this branch first?

Then, can you please:
- boot the old driver
- dump both NDTR[0|1] registers that should be well initialized
- boot the new driver with the values previously retrieved (you can
  assign these values where exactly HACK 1/ adds the printk's).


Thank you,
Miquèl

[1]
https://github.com/miquelraynal/linux/commits/marvell/nand-next/nfc-pxa-bug
Willy Tarreau Jan. 11, 2018, 10:28 p.m. UTC | #4
Hi Miquel,

On Tue, Jan 09, 2018 at 11:36:30AM +0100, Miquel Raynal wrote:
> These changes have been tested on:
>    - PXA3xx platform with a CM-X300 board (2kiB page NAND, 1b/512B
>      strength, Hamming ECC engine) [32 bits]
>    - Armada 385 DB AP (4kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
>    - Armada 398 DB (4kiB page NAND, 8b/512B, BCH ECC engine using a layout
>      with a last chunk different than the others) [32 bits]
>    - Armada 7040 DB and Armada 8040 DB (4kiB page NAND, 4b/512B, BCH ECC
>      engine) [64 bits]
>    - Triax dvb-tc board (2kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]

If you're interested, I have a mirabox with an armada 370 which uses the
same driver. I've not powered it up for a while but if that can be useful
I can try to find some time for this.

Just let me know
Willy
Robert Jarzmik Jan. 12, 2018, 8:09 a.m. UTC | #5
Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:

I begun all your test procedure (on my zylonite board).
The timing registers are the same in both pxa3xx_nand and marvell_nand, ie :
[    3.085539] Timing registers from Bootloader:
[    3.089971] -  NDTR0: 0x00161c1c
[    3.095979] -  NDTR1: 0x0f3c00a2

I can attach the dmesg of the first run (dump of OOB). Yet I think you're
missing the point as to where the bug lies.

In the zylonite setup, the BBT is _not_ in the OOB of each block. Instead, it
lies at the end of the NAND, in the last blocks (see struct
nand_bbt_descr). Reading each block and declaring it as bad as is done in
marvell_nand (at least that is my understanding of your traces), but it is not
what should be done if a match is found for the bbt_pattern. Instead, the BBT
should be loaded from the last 8 blocks of the NAND, ie. page 130944 and page
131008 in my setup.

Therefore, I would rather think that marvell-nand is not using the BBT at the
end of the nand rather than misconfiguring the timing registers.

Cheers.

--
Robert

PS: You really should expunge the mailing recipients a bit ...

[1] DMesg extract
netconsole: port not set
netconsole: registered as netconsole-1
smc91c111 smc91c1110: chip is revision= 9, version= 2
mdio_bus: miibus0: probed
eth0: got preset MAC address: 00:0e:0c:a7:26:f7
nand: NAND device: Manufacturer ID: 0x20, Chip ID: 0xba (ST Micro NAND 256MiB 1,8V 16-bit), 256MiB, page size: 2048, OOB size: 64
mrvl_nand mrvl_nand0: ECC strength 1, ECC step size 512
Bad block table found at page 131008, version 0x04
Bad block table found at page 130944, version 0x04
malloc space: 0x83700000 -> 0x83efffff (size 8 MiB)
running /env/bin/init...
magicvar: No such file or directory
magicvar: No such file or directory
magicvar: No such file or directory

Hit any key to stop autoboot:  3 2 1 0
booting net
netconsole: netconsole initialized with 255.255.255.255:6662
eth0: 100Mbps full duplex link detected
DHCP client bound to address 192.168.1.232
netconsole: netconsole initialized with 255.255.255.255:6662
could not open /mnt/tftp/none-linux-zylonite: No such file or directory
Booting net failed: No such file or directory
booting net failed: No such file or directory
boot: No such file or directory
.[1;32mbarebox@.[1;36mZylonite:/.[0m global linux.bootargs.debug=earlycon
.[1;32mbarebox@.[1;36mZylonite:/.[0m bootm /mnt/tftp/zImage_jenkins

Loading ARM Linux zImage '/mnt/tftp/zImage_jenkins'
commandline: ram=64M console=ttyS0,115200 ip=dhcp root=/dev/nfs nfsroot=/home/none/nfsroot/zylonite,v3,tcp earlycon mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root)
arch_number: 1233
Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.15.0-rc1-00047-g3085f79 (jenkins@belgarath) (gcc version 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29)) #889 PREEMPT Fri Jan 12 08:26:17 CET 2018
[    0.000000] CPU: XScale-V3 based processor [69056891] revision 1 (ARMv5TE), cr=0000397f
[    0.000000] CPU: VIVT data cache, VIVT instruction cache
[    0.000000] Machine: PXA3xx Platform Development Kit (aka Zylonite)
[    0.000000] Ignoring tag cmdline (using the default kernel command line)
[    0.000000] Memory policy: Data cache writeback
[    0.000000] RO Mode clock: 0.00MHz
[    0.000000] Run Mode clock: 0.00MHz
[    0.000000] Turbo Mode clock: 0.00MHz
[    0.000000] System bus clock: 0.00MHz
[    0.000000] On node 0 totalpages: 16384
[    0.000000]   Normal zone: 128 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 16384 pages, LIFO batch:3
[    0.000000] random: fast init done
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 16256
[    0.000000] Kernel command line: root=/dev/ram0 ip=192.168.1.232:192.168.1.5::255.255.255.0::eth0:on console=ttyS0,115200 mem=64M mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root) ubi.mtd=5 earlycon=pxa,io,0xf6200000,115200n8 debug no_console_suspend
[    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Memory: 56856K/65536K available (4226K kernel code, 202K rwdata, 972K rodata, 2396K init, 102K bss, 8680K reserved, 0K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xc4800000 - 0xff800000   ( 944 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xc4000000   (  64 MB)
[    0.000000]     modules : 0xbf000000 - 0xc0000000   (  16 MB)
[    0.000000]       .text : 0xc0008000 - 0xc0428a48   (4227 kB)
[    0.000000]       .init : 0xc053f000 - 0xc0796000   (2396 kB)
[    0.000000]       .data : 0xc0796000 - 0xc07c8bec   ( 203 kB)
[    0.000000]        .bss : 0xc07c8bec - 0xc07e25fc   ( 103 kB)
[    0.000000] Preemptible hierarchical RCU implementation.
[    0.000000] 	Tasks RCU enabled.
[    0.000000] NR_IRQS: 16, nr_irqs: 336, preallocated irqs: 336
[    0.000000] RJK: parent_rate=13000000, xl=8, xn=1
[    0.000070] sched_clock: 32 bits at 3250kHz, resolution 307ns, wraps every 660764198758ns
[    0.000266] clocksource: oscr0: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 588080137591 ns
[    0.002134] Console: colour dummy device 80x30
[    0.002294] Calibrating delay loop... 103.83 BogoMIPS (lpj=519168)
[    0.081019] pid_max: default: 32768 minimum: 301
[    0.081862] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.081960] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.085156] CPU: Testing write buffer coherency: ok
[    0.088957] Setting up static identity map for 0x80008200 - 0x80008260
[    0.089916] Hierarchical SRCU implementation.
[    0.102924] devtmpfs: initialized
[    0.113807] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.113948] futex hash table entries: 256 (order: -1, 3072 bytes)
[    0.116278] NET: Registered protocol family 16
[    0.119090] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.389530] Advanced Linux Sound Architecture Driver Initialized.
[    0.400350] clocksource: Switched to clocksource oscr0
[    0.553012] NET: Registered protocol family 2
[    0.558410] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[    0.558634] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    0.558815] TCP: Hash tables configured (established 1024 bind 1024)
[    0.559367] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.559547] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.561234] NET: Registered protocol family 1
[    0.563263] RPC: Registered named UNIX socket transport module.
[    0.563363] RPC: Registered udp transport module.
[    0.563418] RPC: Registered tcp transport module.
[    0.563475] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    2.498791] Initialise system trusted keyrings
[    2.501003] workingset: timestamp_bits=30 max_order=14 bucket_order=0
[    2.504911] NFS: Registering the id_resolver key type
[    2.505141] Key type id_resolver registered
[    2.505204] Key type id_legacy registered
[    2.511301] Key type asymmetric registered
[    2.511404] Asymmetric key parser 'x509' registered
[    2.511567] io scheduler noop registered
[    2.511632] io scheduler deadline registered
[    2.512023] io scheduler cfq registered (default)
[    2.512092] io scheduler mq-deadline registered
[    2.512155] io scheduler kyber registered
[    2.569944] pxa-dma pxa-dma.0: initialized 32 channels on 100 requestors
[    2.575695] pxa2xx-uart.0: ttyS0 at MMIO 0x40100000 (irq = 38, base_baud = 928571) is a UART1
[    3.050762] console [ttyS0] enabled
[    3.057118] pxa2xx-uart.1: ttyS1 at MMIO 0x40200000 (irq = 37, base_baud = 928571) is a UART2
[    3.069632] pxa2xx-uart.2: ttyS2 at MMIO 0x40700000 (irq = 36, base_baud = 928571) is a UART3
[    3.085539] Timing registers from Bootloader:
[    3.089971] -  NDTR0: 0x00161c1c
[    3.095979] -  NDTR1: 0x0f3c00a2
[    3.099319] nand: executing subop:
[    3.103209] nand:     ->CMD      [0xff]
[    3.107105] nand:     ->WAITRDY  [max 250 ms]
[    3.111915] marvell-nfc pxa3xx-nand: 
[    3.111915] NDCR:  0x90071fff
[    3.111915] NDCB0: 0x00a000ff
[    3.111915] NDCB1: 0x00000000
[    3.111915] NDCB2: 0x00000000
[    3.111915] NDCB3: 0x00000000
[    3.131170] nand: executing subop:
[    3.134638] nand:     ->CMD      [0x90]
[    3.138513] nand:     ->ADDR     [1 cyc: 00]
[    3.143174] nand:     ->DATA_IN  [2 B, force 8-bit]
[    3.148182] marvell-nfc pxa3xx-nand: 
[    3.148182] NDCR:  0x90071fff
[    3.148182] NDCB0: 0x00610090
[    3.148182] NDCB1: 0x00000000
[    3.148182] NDCB2: 0x00000000
[    3.148182] NDCB3: 0x00000000
[    3.167016] nand: executing subop:
[    3.170656] nand:     ->CMD      [0x90]
[    3.174543] nand:     ->ADDR     [1 cyc: 00]
[    3.178842] nand:     ->DATA_IN  [8 B, force 8-bit]
[    3.183955] marvell-nfc pxa3xx-nand: 
[    3.183955] NDCR:  0x90071fff
[    3.183955] NDCB0: 0x00610090
[    3.183955] NDCB1: 0x00000000
[    3.183955] NDCB2: 0x00000000
[    3.183955] NDCB3: 0x00000000
[    3.202627] nand: executing subop:
[    3.206083] nand:     ->CMD      [0x90]
[    3.209958] nand:     ->ADDR     [1 cyc: 20]
[    3.214403] nand:     ->DATA_IN  [4 B, force 8-bit]
[    3.219383] marvell-nfc pxa3xx-nand: 
[    3.219383] NDCR:  0x90071fff
[    3.219383] NDCB0: 0x00610090
[    3.219383] NDCB1: 0x00000020
[    3.219383] NDCB2: 0x00000000
[    3.219383] NDCB3: 0x00000000
[    3.238001] nand: executing subop:
[    3.241598] nand:     ->CMD      [0x90]
[    3.245491] nand:     ->ADDR     [1 cyc: 40]
[    3.249790] nand:     ->DATA_IN  [5 B, force 8-bit]
[    3.254897] marvell-nfc pxa3xx-nand: 
[    3.254897] NDCR:  0x90071fff
[    3.254897] NDCB0: 0x00610090
[    3.254897] NDCB1: 0x00000040
[    3.254897] NDCB2: 0x00000000
[    3.254897] NDCB3: 0x00000000
[    3.273547] nand: device found, Manufacturer ID: 0x20, Chip ID: 0xba
[    3.279923] nand: ST Micro NAND 256MiB 1,8V 16-bit
[    3.284893] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    3.292633] marvell-nfc pxa3xx-nand: No minimum ECC strength, using 1b/512B
[    3.299666] Scanning device for bad blocks
[    3.303978] nand: nand_do_read_oob: from = 0x00000000, len = 64
[    3.310000] marvell-nfc pxa3xx-nand: 
[    3.310000] NDCR:  0x9d079fff
[    3.310000] NDCB0: 0x000d3000
[    3.310000] NDCB1: 0x00000000
[    3.310000] NDCB2: 0x00000000
[    3.310000] NDCB3: 0x00000000
[    3.329036] OOB from page 0:
[    3.332122] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[    3.342950] 01: 00 00 00 00 00 00 00 00 d4 eb 0b f5 fa 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[    3.353817] Bad eraseblock 0 at 0x000000000000
[    3.358338] nand: nand_do_read_oob: from = 0x00020000, len = 64
[    3.364494] marvell-nfc pxa3xx-nand: 
[    3.364494] NDCR:  0x9d079fff
[    3.364494] NDCB0: 0x000d3000
[    3.364494] NDCB1: 0x00400000
[    3.364494] NDCB2: 0x00000000
[    3.364494] NDCB3: 0x00000000
[    3.383344] OOB from page 64:
[    3.386353] 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
[    3.397213] 01: ff ff ff ff ff ff ff ff dc 1e ef 48 38 04 e6 95 40 86 da 0f b6 fd 95 6d 7f 05 2d cf fd 61 d9 05 
[    3.408131] nand: nand_do_read_oob: from = 0x00040000, len = 64
[    3.414298] marvell-nfc pxa3xx-nand: 
[    3.414298] NDCR:  0x9d079fff
[    3.414298] NDCB0: 0x000d3000
[    3.414298] NDCB1: 0x00800000
[    3.414298] NDCB2: 0x00000000
[    3.414298] NDCB3: 0x00000000
[    3.433140] OOB from page 128:
[    3.436237] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[    3.447080] 01: 00 00 00 00 00 00 00 00 48 5b 01 d2 56 00 a2 ec 23 82 51 02 ef af 9d ae 3e 02 34 82 6c d8 75 0e 
[    3.457961] Bad eraseblock 2 at 0x000000040000
[    3.462604] nand: nand_do_read_oob: from = 0x00060000, len = 64
[    3.468627] marvell-nfc pxa3xx-nand: 
[    3.468627] NDCR:  0x9d079fff
[    3.468627] NDCB0: 0x000d3000
[    3.468627] NDCB1: 0x00c00000
[    3.468627] NDCB2: 0x00000000
[    3.468627] NDCB3: 0x00000000
[    3.487466] OOB from page 192:
[    3.490706] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[    3.501537] 01: 00 00 00 00 00 00 00 00 a7 23 9c bc 5d 02 5e 55 3b fd 7f 04 ed 35 c0 d1 a7 0a c3 94 09 cf 9a 0d 
[    3.512409] Bad eraseblock 3 at 0x000000060000
Boris Brezillon Jan. 12, 2018, 8:45 a.m. UTC | #6
On Fri, 12 Jan 2018 09:09:17 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:
> 
> I begun all your test procedure (on my zylonite board).
> The timing registers are the same in both pxa3xx_nand and marvell_nand, ie :
> [    3.085539] Timing registers from Bootloader:
> [    3.089971] -  NDTR0: 0x00161c1c
> [    3.095979] -  NDTR1: 0x0f3c00a2
> 
> I can attach the dmesg of the first run (dump of OOB). Yet I think you're
> missing the point as to where the bug lies.

We definitely don't know where the bug lies, otherwise we wouldn't do
the remote debug session we're doing here.

> 
> In the zylonite setup, the BBT is _not_ in the OOB of each block. Instead, it
> lies at the end of the NAND, in the last blocks (see struct
> nand_bbt_descr). Reading each block and declaring it as bad as is done in
> marvell_nand (at least that is my understanding of your traces), but it is not
> what should be done if a match is found for the bbt_pattern. Instead, the BBT
> should be loaded from the last 8 blocks of the NAND, ie. page 130944 and page
> 131008 in my setup.

The driver is not searching for a BBT because it's explicitly disabled
in your pdata (if it was enabled we would see something like "Bad block
table not found ..." or "Bad block table found ..." in the logs). And
that's anyway not the bug we're trying to fix here. In your setup (2k
pages with Hamming ECC), the bad block markers, i.e. the markers
present in each block and used to mark a block good or bad (0xffff =>
good, != 0xffff => bad), should be preserved. So, the symptoms we're
seeing here, where almost all blocks are reported as bad when scanning
BBMs, is not expected, and that's what we're trying to debug/fix.

> 
> Therefore, I would rather think that marvell-nand is not using the BBT at the
> end of the nand rather than misconfiguring the timing registers.

Timing mis-configuration was just a lead we had to follow. It seems
that it's not the problem here, but we had to test it. Now, the missing
BBT scan is clearly caused by an explicit config telling the driver to
ignore the BBT. You can try to enable it if you want to test BBT
handling (pdata->flash_bbt = 1), but even if that works, we'd like to
understand why the regular BBM scanning does not work.

Honestly, it's hard to be sure what you're testing, because we don't
know whether you're testing the branch Miquel provided or manually
apply some changes locally. Can you push your local changes somewhere
(if any)?

> 
> Cheers.
> 
> --
> Robert
> 
> PS: You really should expunge the mailing recipients a bit ...

Indeed!

> 
> [1] DMesg extract
> netconsole: port not set
> netconsole: registered as netconsole-1
> smc91c111 smc91c1110: chip is revision= 9, version= 2
> mdio_bus: miibus0: probed
> eth0: got preset MAC address: 00:0e:0c:a7:26:f7
> nand: NAND device: Manufacturer ID: 0x20, Chip ID: 0xba (ST Micro NAND 256MiB 1,8V 16-bit), 256MiB, page size: 2048, OOB size: 64
> mrvl_nand mrvl_nand0: ECC strength 1, ECC step size 512
> Bad block table found at page 131008, version 0x04
> Bad block table found at page 130944, version 0x04
> malloc space: 0x83700000 -> 0x83efffff (size 8 MiB)
> running /env/bin/init...
> magicvar: No such file or directory
> magicvar: No such file or directory
> magicvar: No such file or directory
> 
> Hit any key to stop autoboot:  3 2 1 0
> booting net
> netconsole: netconsole initialized with 255.255.255.255:6662
> eth0: 100Mbps full duplex link detected
> DHCP client bound to address 192.168.1.232
> netconsole: netconsole initialized with 255.255.255.255:6662
> could not open /mnt/tftp/none-linux-zylonite: No such file or directory
> Booting net failed: No such file or directory
> booting net failed: No such file or directory
> boot: No such file or directory
> .[1;32mbarebox@.[1;36mZylonite:/.[0m global linux.bootargs.debug=earlycon
> .[1;32mbarebox@.[1;36mZylonite:/.[0m bootm /mnt/tftp/zImage_jenkins
> 
> Loading ARM Linux zImage '/mnt/tftp/zImage_jenkins'
> commandline: ram=64M console=ttyS0,115200 ip=dhcp root=/dev/nfs nfsroot=/home/none/nfsroot/zylonite,v3,tcp earlycon mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root)
> arch_number: 1233
> Uncompressing Linux... done, booting the kernel.
> [    0.000000] Booting Linux on physical CPU 0x0
> [    0.000000] Linux version 4.15.0-rc1-00047-g3085f79 (jenkins@belgarath) (gcc version 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29)) #889 PREEMPT Fri Jan 12 08:26:17 CET 2018
> [    0.000000] CPU: XScale-V3 based processor [69056891] revision 1 (ARMv5TE), cr=0000397f
> [    0.000000] CPU: VIVT data cache, VIVT instruction cache
> [    0.000000] Machine: PXA3xx Platform Development Kit (aka Zylonite)
> [    0.000000] Ignoring tag cmdline (using the default kernel command line)
> [    0.000000] Memory policy: Data cache writeback
> [    0.000000] RO Mode clock: 0.00MHz
> [    0.000000] Run Mode clock: 0.00MHz
> [    0.000000] Turbo Mode clock: 0.00MHz
> [    0.000000] System bus clock: 0.00MHz
> [    0.000000] On node 0 totalpages: 16384
> [    0.000000]   Normal zone: 128 pages used for memmap
> [    0.000000]   Normal zone: 0 pages reserved
> [    0.000000]   Normal zone: 16384 pages, LIFO batch:3
> [    0.000000] random: fast init done
> [    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
> [    0.000000] pcpu-alloc: [0] 0 
> [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 16256
> [    0.000000] Kernel command line: root=/dev/ram0 ip=192.168.1.232:192.168.1.5::255.255.255.0::eth0:on console=ttyS0,115200 mem=64M mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root) ubi.mtd=5 earlycon=pxa,io,0xf6200000,115200n8 debug no_console_suspend
> [    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
> [    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
> [    0.000000] Memory: 56856K/65536K available (4226K kernel code, 202K rwdata, 972K rodata, 2396K init, 102K bss, 8680K reserved, 0K cma-reserved)
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
> [    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
> [    0.000000]     vmalloc : 0xc4800000 - 0xff800000   ( 944 MB)
> [    0.000000]     lowmem  : 0xc0000000 - 0xc4000000   (  64 MB)
> [    0.000000]     modules : 0xbf000000 - 0xc0000000   (  16 MB)
> [    0.000000]       .text : 0xc0008000 - 0xc0428a48   (4227 kB)
> [    0.000000]       .init : 0xc053f000 - 0xc0796000   (2396 kB)
> [    0.000000]       .data : 0xc0796000 - 0xc07c8bec   ( 203 kB)
> [    0.000000]        .bss : 0xc07c8bec - 0xc07e25fc   ( 103 kB)
> [    0.000000] Preemptible hierarchical RCU implementation.
> [    0.000000] 	Tasks RCU enabled.
> [    0.000000] NR_IRQS: 16, nr_irqs: 336, preallocated irqs: 336
> [    0.000000] RJK: parent_rate=13000000, xl=8, xn=1
> [    0.000070] sched_clock: 32 bits at 3250kHz, resolution 307ns, wraps every 660764198758ns
> [    0.000266] clocksource: oscr0: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 588080137591 ns
> [    0.002134] Console: colour dummy device 80x30
> [    0.002294] Calibrating delay loop... 103.83 BogoMIPS (lpj=519168)
> [    0.081019] pid_max: default: 32768 minimum: 301
> [    0.081862] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
> [    0.081960] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
> [    0.085156] CPU: Testing write buffer coherency: ok
> [    0.088957] Setting up static identity map for 0x80008200 - 0x80008260
> [    0.089916] Hierarchical SRCU implementation.
> [    0.102924] devtmpfs: initialized
> [    0.113807] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
> [    0.113948] futex hash table entries: 256 (order: -1, 3072 bytes)
> [    0.116278] NET: Registered protocol family 16
> [    0.119090] DMA: preallocated 256 KiB pool for atomic coherent allocations
> [    0.389530] Advanced Linux Sound Architecture Driver Initialized.
> [    0.400350] clocksource: Switched to clocksource oscr0
> [    0.553012] NET: Registered protocol family 2
> [    0.558410] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
> [    0.558634] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
> [    0.558815] TCP: Hash tables configured (established 1024 bind 1024)
> [    0.559367] UDP hash table entries: 256 (order: 0, 4096 bytes)
> [    0.559547] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
> [    0.561234] NET: Registered protocol family 1
> [    0.563263] RPC: Registered named UNIX socket transport module.
> [    0.563363] RPC: Registered udp transport module.
> [    0.563418] RPC: Registered tcp transport module.
> [    0.563475] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [    2.498791] Initialise system trusted keyrings
> [    2.501003] workingset: timestamp_bits=30 max_order=14 bucket_order=0
> [    2.504911] NFS: Registering the id_resolver key type
> [    2.505141] Key type id_resolver registered
> [    2.505204] Key type id_legacy registered
> [    2.511301] Key type asymmetric registered
> [    2.511404] Asymmetric key parser 'x509' registered
> [    2.511567] io scheduler noop registered
> [    2.511632] io scheduler deadline registered
> [    2.512023] io scheduler cfq registered (default)
> [    2.512092] io scheduler mq-deadline registered
> [    2.512155] io scheduler kyber registered
> [    2.569944] pxa-dma pxa-dma.0: initialized 32 channels on 100 requestors
> [    2.575695] pxa2xx-uart.0: ttyS0 at MMIO 0x40100000 (irq = 38, base_baud = 928571) is a UART1
> [    3.050762] console [ttyS0] enabled
> [    3.057118] pxa2xx-uart.1: ttyS1 at MMIO 0x40200000 (irq = 37, base_baud = 928571) is a UART2
> [    3.069632] pxa2xx-uart.2: ttyS2 at MMIO 0x40700000 (irq = 36, base_baud = 928571) is a UART3
> [    3.085539] Timing registers from Bootloader:
> [    3.089971] -  NDTR0: 0x00161c1c
> [    3.095979] -  NDTR1: 0x0f3c00a2
> [    3.099319] nand: executing subop:
> [    3.103209] nand:     ->CMD      [0xff]
> [    3.107105] nand:     ->WAITRDY  [max 250 ms]
> [    3.111915] marvell-nfc pxa3xx-nand: 
> [    3.111915] NDCR:  0x90071fff
> [    3.111915] NDCB0: 0x00a000ff
> [    3.111915] NDCB1: 0x00000000
> [    3.111915] NDCB2: 0x00000000
> [    3.111915] NDCB3: 0x00000000
> [    3.131170] nand: executing subop:
> [    3.134638] nand:     ->CMD      [0x90]
> [    3.138513] nand:     ->ADDR     [1 cyc: 00]
> [    3.143174] nand:     ->DATA_IN  [2 B, force 8-bit]
> [    3.148182] marvell-nfc pxa3xx-nand: 
> [    3.148182] NDCR:  0x90071fff
> [    3.148182] NDCB0: 0x00610090
> [    3.148182] NDCB1: 0x00000000
> [    3.148182] NDCB2: 0x00000000
> [    3.148182] NDCB3: 0x00000000
> [    3.167016] nand: executing subop:
> [    3.170656] nand:     ->CMD      [0x90]
> [    3.174543] nand:     ->ADDR     [1 cyc: 00]
> [    3.178842] nand:     ->DATA_IN  [8 B, force 8-bit]
> [    3.183955] marvell-nfc pxa3xx-nand: 
> [    3.183955] NDCR:  0x90071fff
> [    3.183955] NDCB0: 0x00610090
> [    3.183955] NDCB1: 0x00000000
> [    3.183955] NDCB2: 0x00000000
> [    3.183955] NDCB3: 0x00000000
> [    3.202627] nand: executing subop:
> [    3.206083] nand:     ->CMD      [0x90]
> [    3.209958] nand:     ->ADDR     [1 cyc: 20]
> [    3.214403] nand:     ->DATA_IN  [4 B, force 8-bit]
> [    3.219383] marvell-nfc pxa3xx-nand: 
> [    3.219383] NDCR:  0x90071fff
> [    3.219383] NDCB0: 0x00610090
> [    3.219383] NDCB1: 0x00000020
> [    3.219383] NDCB2: 0x00000000
> [    3.219383] NDCB3: 0x00000000
> [    3.238001] nand: executing subop:
> [    3.241598] nand:     ->CMD      [0x90]
> [    3.245491] nand:     ->ADDR     [1 cyc: 40]
> [    3.249790] nand:     ->DATA_IN  [5 B, force 8-bit]
> [    3.254897] marvell-nfc pxa3xx-nand: 
> [    3.254897] NDCR:  0x90071fff
> [    3.254897] NDCB0: 0x00610090
> [    3.254897] NDCB1: 0x00000040
> [    3.254897] NDCB2: 0x00000000
> [    3.254897] NDCB3: 0x00000000
> [    3.273547] nand: device found, Manufacturer ID: 0x20, Chip ID: 0xba
> [    3.279923] nand: ST Micro NAND 256MiB 1,8V 16-bit
> [    3.284893] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> [    3.292633] marvell-nfc pxa3xx-nand: No minimum ECC strength, using 1b/512B
> [    3.299666] Scanning device for bad blocks
> [    3.303978] nand: nand_do_read_oob: from = 0x00000000, len = 64
> [    3.310000] marvell-nfc pxa3xx-nand: 
> [    3.310000] NDCR:  0x9d079fff
> [    3.310000] NDCB0: 0x000d3000
> [    3.310000] NDCB1: 0x00000000
> [    3.310000] NDCB2: 0x00000000
> [    3.310000] NDCB3: 0x00000000
> [    3.329036] OOB from page 0:
> [    3.332122] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> [    3.342950] 01: 00 00 00 00 00 00 00 00 d4 eb 0b f5 fa 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [    3.353817] Bad eraseblock 0 at 0x000000000000
> [    3.358338] nand: nand_do_read_oob: from = 0x00020000, len = 64
> [    3.364494] marvell-nfc pxa3xx-nand: 
> [    3.364494] NDCR:  0x9d079fff
> [    3.364494] NDCB0: 0x000d3000
> [    3.364494] NDCB1: 0x00400000
> [    3.364494] NDCB2: 0x00000000
> [    3.364494] NDCB3: 0x00000000
> [    3.383344] OOB from page 64:
> [    3.386353] 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> [    3.397213] 01: ff ff ff ff ff ff ff ff dc 1e ef 48 38 04 e6 95 40 86 da 0f b6 fd 95 6d 7f 05 2d cf fd 61 d9 05 
> [    3.408131] nand: nand_do_read_oob: from = 0x00040000, len = 64
> [    3.414298] marvell-nfc pxa3xx-nand: 
> [    3.414298] NDCR:  0x9d079fff
> [    3.414298] NDCB0: 0x000d3000
> [    3.414298] NDCB1: 0x00800000
> [    3.414298] NDCB2: 0x00000000
> [    3.414298] NDCB3: 0x00000000
> [    3.433140] OOB from page 128:
> [    3.436237] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> [    3.447080] 01: 00 00 00 00 00 00 00 00 48 5b 01 d2 56 00 a2 ec 23 82 51 02 ef af 9d ae 3e 02 34 82 6c d8 75 0e 

All bytes set to 0. Looks like someone explicitly wrote 0 in the OOB
area :-/. Do you know which component wrote this block (barebox or
Linux)?

> [    3.457961] Bad eraseblock 2 at 0x000000040000
> [    3.462604] nand: nand_do_read_oob: from = 0x00060000, len = 64
> [    3.468627] marvell-nfc pxa3xx-nand: 
> [    3.468627] NDCR:  0x9d079fff
> [    3.468627] NDCB0: 0x000d3000
> [    3.468627] NDCB1: 0x00c00000
> [    3.468627] NDCB2: 0x00000000
> [    3.468627] NDCB3: 0x00000000
> [    3.487466] OOB from page 192:
> [    3.490706] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> [    3.501537] 01: 00 00 00 00 00 00 00 00 a7 23 9c bc 5d 02 5e 55 3b fd 7f 04 ed 35 c0 d1 a7 0a c3 94 09 cf 9a 0d 
> [    3.512409] Bad eraseblock 3 at 0x000000060000
Miquel Raynal Jan. 12, 2018, 9:01 a.m. UTC | #7
Hi Robert,

On Fri, 12 Jan 2018 09:45:01 +0100
Boris Brezillon <boris.brezillon@free-electrons.com> wrote:

> On Fri, 12 Jan 2018 09:09:17 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> 
> > Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:
> > 
> > I begun all your test procedure (on my zylonite board).
> > The timing registers are the same in both pxa3xx_nand and
> > marvell_nand, ie : [    3.085539] Timing registers from Bootloader:
> > [    3.089971] -  NDTR0: 0x00161c1c
> > [    3.095979] -  NDTR1: 0x0f3c00a2
> > 
> > I can attach the dmesg of the first run (dump of OOB). Yet I think
> > you're missing the point as to where the bug lies.  
> 
> We definitely don't know where the bug lies, otherwise we wouldn't do
> the remote debug session we're doing here.
> 
> > 
> > In the zylonite setup, the BBT is _not_ in the OOB of each block.
> > Instead, it lies at the end of the NAND, in the last blocks (see
> > struct nand_bbt_descr). Reading each block and declaring it as bad
> > as is done in marvell_nand (at least that is my understanding of
> > your traces), but it is not what should be done if a match is found
> > for the bbt_pattern. Instead, the BBT should be loaded from the
> > last 8 blocks of the NAND, ie. page 130944 and page 131008 in my
> > setup.  
> 
> The driver is not searching for a BBT because it's explicitly disabled
> in your pdata (if it was enabled we would see something like "Bad
> block table not found ..." or "Bad block table found ..." in the
> logs). And that's anyway not the bug we're trying to fix here. In
> your setup (2k pages with Hamming ECC), the bad block markers, i.e.
> the markers present in each block and used to mark a block good or
> bad (0xffff => good, != 0xffff => bad), should be preserved. So, the
> symptoms we're seeing here, where almost all blocks are reported as
> bad when scanning BBMs, is not expected, and that's what we're trying
> to debug/fix.

Also, the first blocks are (presumably) taken by the Bootloader, having
data there in the OOB is not impossible. Could you please show the same
dumps within the partition where you UBI/EXT setup relies?

> 
> > 
> > Therefore, I would rather think that marvell-nand is not using the
> > BBT at the end of the nand rather than misconfiguring the timing
> > registers.  
> 
> Timing mis-configuration was just a lead we had to follow. It seems
> that it's not the problem here, but we had to test it. Now, the
> missing BBT scan is clearly caused by an explicit config telling the
> driver to ignore the BBT. You can try to enable it if you want to
> test BBT handling (pdata->flash_bbt = 1), but even if that works,
> we'd like to understand why the regular BBM scanning does not work.
> 
> Honestly, it's hard to be sure what you're testing, because we don't
> know whether you're testing the branch Miquel provided or manually
> apply some changes locally. Can you push your local changes somewhere
> (if any)?

Yes, please, can you push a branch with your patches?


As far as I remember, you removed the flash_bbt from the pdata to avoid
smashing the BBT again. Can you test this works with the old driver ?

If the old driver fails without flash_bbt, then you should go back to
the original setup and test again the last modifications.

Thanks,
Miquèl
Robert Jarzmik Jan. 12, 2018, 9:34 a.m. UTC | #8
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

> On Fri, 12 Jan 2018 09:09:17 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>
>> Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:
>> 
>> I begun all your test procedure (on my zylonite board).
>> The timing registers are the same in both pxa3xx_nand and marvell_nand, ie :
>> [    3.085539] Timing registers from Bootloader:
>> [    3.089971] -  NDTR0: 0x00161c1c
>> [    3.095979] -  NDTR1: 0x0f3c00a2
>> 
>> I can attach the dmesg of the first run (dump of OOB). Yet I think you're
>> missing the point as to where the bug lies.
>
> We definitely don't know where the bug lies, otherwise we wouldn't do
> the remote debug session we're doing here.
Fair enough.
> The driver is not searching for a BBT because it's explicitly disabled
> in your pdata (if it was enabled we would see something like "Bad block
> table not found ..." or "Bad block table found ..." in the logs).
You're right, and that's because I was told to remove the "flash_bbt=1" from my
platform data by Miquel in order to not destroy it again.

> And that's anyway not the bug we're trying to fix here. In your setup (2k
> pages with Hamming ECC), the bad block markers, i.e. the markers present in
> each block and used to mark a block good or bad (0xffff => good, != 0xffff =>
> bad), should be preserved.
I think we're still not aligned here. There are _no_ bad block markers in the
OOB on my flash, because there is a BBT at the end.

> So, the symptoms we're seeing here, where almost all blocks are reported as
> bad when scanning BBMs, is not expected, and that's what we're trying to
> debug/fix.
Well, I still think this is not something to fix ... I still think that OOB data
is not relevant as to the state of bad blocks in my flash ...

> Timing mis-configuration was just a lead we had to follow. It seems
> that it's not the problem here, but we had to test it. Now, the missing
> BBT scan is clearly caused by an explicit config telling the driver to
> ignore the BBT.
We agree on that.

> You can try to enable it if you want to test BBT
> handling (pdata->flash_bbt = 1), but even if that works, we'd like to
> understand why the regular BBM scanning does not work.
As you wish. I can make other tests, as long as my BBT is not broken again. If I
re-enable "flash_bbt=1", I'd like another "hack" to prevent BBT breakage, as
disabling it was adviced by Miquel to protect my NAND.

> Honestly, it's hard to be sure what you're testing, because we don't
> know whether you're testing the branch Miquel provided or manually
> apply some changes locally. Can you push your local changes somewhere
> (if any)?
git fetch https://github.com/rjarzmik/linux marvell-nand-bug
make zylonite_defconfig

>> mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root)
>> [    3.414298] marvell-nfc pxa3xx-nand: 
>> [    3.414298] NDCR:  0x9d079fff
>> [    3.414298] NDCB0: 0x000d3000
>> [    3.414298] NDCB1: 0x00800000
>> [    3.414298] NDCB2: 0x00000000
>> [    3.414298] NDCB3: 0x00000000
>> [    3.433140] OOB from page 128:
>> [    3.436237] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
>> [    3.447080] 01: 00 00 00 00 00 00 00 00 48 5b 01 d2 56 00 a2 ec 23 82 51 02 ef af 9d ae 3e 02 34 82 6c d8 75 0e 
>
> All bytes set to 0. Looks like someone explicitly wrote 0 in the OOB
> area :-/. Do you know which component wrote this block (barebox or
> Linux)?
In this specific case, you're in "TIMH" partition, which a specific partition
for the IPL (ROM part of the PXA3xx reads and loads it), and follows other rules
AFAIK.

The really anoying and relevant part are the bad blocks at 13568 KBytes offset
(ie. root partition), which contains the ext2/ubifs.

Cheers.

--
Robert
Boris Brezillon Jan. 12, 2018, 9:52 a.m. UTC | #9
On Fri, 12 Jan 2018 10:34:13 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > On Fri, 12 Jan 2018 09:09:17 +0100
> > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >  
> >> Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:
> >> 
> >> I begun all your test procedure (on my zylonite board).
> >> The timing registers are the same in both pxa3xx_nand and marvell_nand, ie :
> >> [    3.085539] Timing registers from Bootloader:
> >> [    3.089971] -  NDTR0: 0x00161c1c
> >> [    3.095979] -  NDTR1: 0x0f3c00a2
> >> 
> >> I can attach the dmesg of the first run (dump of OOB). Yet I think you're
> >> missing the point as to where the bug lies.  
> >
> > We definitely don't know where the bug lies, otherwise we wouldn't do
> > the remote debug session we're doing here.  
> Fair enough.
> > The driver is not searching for a BBT because it's explicitly disabled
> > in your pdata (if it was enabled we would see something like "Bad block
> > table not found ..." or "Bad block table found ..." in the logs).  
> You're right, and that's because I was told to remove the "flash_bbt=1" from my
> platform data by Miquel in order to not destroy it again.

Because we though scanning of BBMs was working with the old pxa driver
(which should be the case for your setup, BTW), and we thought the new
driver was introducing a regression here.

BTW, did you ever try with the old driver and ->flash_bbt = false? If
you did not, can you test?

> 
> > And that's anyway not the bug we're trying to fix here. In your setup (2k
> > pages with Hamming ECC), the bad block markers, i.e. the markers present in
> > each block and used to mark a block good or bad (0xffff => good, != 0xffff =>
> > bad), should be preserved.  
> I think we're still not aligned here. There are _no_ bad block markers in the
> OOB on my flash, because there is a BBT at the end.

That's not how it works. The BBT is a way to get information about bad
blocks within a single read access, but, if you can preserve BBMs and
keep them updated (which is the case here), you should do it, just in
case you lose the BBT.

> 
> > So, the symptoms we're seeing here, where almost all blocks are reported as
> > bad when scanning BBMs, is not expected, and that's what we're trying to
> > debug/fix.  
> Well, I still think this is not something to fix ... I still think that OOB data
> is not relevant as to the state of bad blocks in my flash ...

Hm, I disagree. What if, for any reason, the BBT is lost? Don't you
want the full scan to work?

> 
> > Timing mis-configuration was just a lead we had to follow. It seems
> > that it's not the problem here, but we had to test it. Now, the missing
> > BBT scan is clearly caused by an explicit config telling the driver to
> > ignore the BBT.  
> We agree on that.
> 
> > You can try to enable it if you want to test BBT
> > handling (pdata->flash_bbt = 1), but even if that works, we'd like to
> > understand why the regular BBM scanning does not work.  
> As you wish. I can make other tests, as long as my BBT is not broken again. If I
> re-enable "flash_bbt=1", I'd like another "hack" to prevent BBT breakage, as
> disabling it was adviced by Miquel to protect my NAND.

Okay, so I have another solution for that: drop the NAND_BBT_CREATE and
NAND_BBT_WRITE here [1] and here [2]. That should let you read the
existing BBT without updating it or creating a new one if it's not
detected.

> 
> > Honestly, it's hard to be sure what you're testing, because we don't
> > know whether you're testing the branch Miquel provided or manually
> > apply some changes locally. Can you push your local changes somewhere
> > (if any)?  
> git fetch https://github.com/rjarzmik/linux marvell-nand-bug
> make zylonite_defconfig

Thanks for sharing this branch.

> 
> >> mtdparts=pxa3xx_nand-0:128k@0(TIMH)ro,128k@128k(OBMI)ro,768k@256k(barebox),256k@1024k(barebox-env),12M@1280k(kernel),38016k@13568k(root)
> >> [    3.414298] marvell-nfc pxa3xx-nand: 
> >> [    3.414298] NDCR:  0x9d079fff
> >> [    3.414298] NDCB0: 0x000d3000
> >> [    3.414298] NDCB1: 0x00800000
> >> [    3.414298] NDCB2: 0x00000000
> >> [    3.414298] NDCB3: 0x00000000
> >> [    3.433140] OOB from page 128:
> >> [    3.436237] 00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> >> [    3.447080] 01: 00 00 00 00 00 00 00 00 48 5b 01 d2 56 00 a2 ec 23 82 51 02 ef af 9d ae 3e 02 34 82 6c d8 75 0e   
> >
> > All bytes set to 0. Looks like someone explicitly wrote 0 in the OOB
> > area :-/. Do you know which component wrote this block (barebox or
> > Linux)?  
> In this specific case, you're in "TIMH" partition, which a specific partition
> for the IPL (ROM part of the PXA3xx reads and loads it), and follows other rules
> AFAIK.
> 
> The really anoying and relevant part are the bad blocks at 13568 KBytes offset
> (ie. root partition), which contains the ext2/ubifs.

Can you provide the OOB dump of this block?

Thanks,

Boris

[1]https://github.com/rjarzmik/linux/blob/marvell-nand-bug/drivers/mtd/nand/marvell_nand.c#L2181
[2]https://github.com/rjarzmik/linux/blob/marvell-nand-bug/drivers/mtd/nand/marvell_nand.c#L2191
Boris Brezillon Jan. 12, 2018, 10:21 a.m. UTC | #10
On Fri, 12 Jan 2018 10:34:13 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > On Fri, 12 Jan 2018 09:09:17 +0100
> > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >  
> >> Miquel RAYNAL <miquel.raynal@free-electrons.com> writes:
> >> 
> >> I begun all your test procedure (on my zylonite board).
> >> The timing registers are the same in both pxa3xx_nand and marvell_nand, ie :
> >> [    3.085539] Timing registers from Bootloader:
> >> [    3.089971] -  NDTR0: 0x00161c1c
> >> [    3.095979] -  NDTR1: 0x0f3c00a2
> >> 
> >> I can attach the dmesg of the first run (dump of OOB). Yet I think you're
> >> missing the point as to where the bug lies.  
> >
> > We definitely don't know where the bug lies, otherwise we wouldn't do
> > the remote debug session we're doing here.  
> Fair enough.
> > The driver is not searching for a BBT because it's explicitly disabled
> > in your pdata (if it was enabled we would see something like "Bad block
> > table not found ..." or "Bad block table found ..." in the logs).  
> You're right, and that's because I was told to remove the "flash_bbt=1" from my
> platform data by Miquel in order to not destroy it again.
> 
> > And that's anyway not the bug we're trying to fix here. In your setup (2k
> > pages with Hamming ECC), the bad block markers, i.e. the markers present in
> > each block and used to mark a block good or bad (0xffff => good, != 0xffff =>
> > bad), should be preserved.  
> I think we're still not aligned here. There are _no_ bad block markers in the
> OOB on my flash, because there is a BBT at the end.

I think I'm still missing something. If I look at the branch you just
pushed, I see that ->flash_bbt was not set to 1 before this commit [1],
which means the pxa3xx driver was no setting the NAND_BBT_USE_FLASH
flag, which in turn means you were not using the on-flash-bbt.
When you test the old (pxa3xx) driver, are you sure you're testing
things with a mainline kernel? If you have extra commits on top of
mainline, can you push them somewhere?

[1]https://github.com/rjarzmik/linux/commit/5d391176c89a777f1c7d082cd50c1dc87e54b355#diff-6e551812174323b126d3f20eff6eec83
Boris Brezillon Jan. 12, 2018, 3:55 p.m. UTC | #11
On Tue,  9 Jan 2018 11:36:30 +0100
Miquel Raynal <miquel.raynal@free-electrons.com> wrote:

> Hi,
> 
> After the addition of the NAND framework ->exec_op() interface (see [1]
> for the series preparing it and [2] for the last version of the
> core-side implementation of ->exec_op() itself), this series replaces
> the current Marvell NAND controller driver pxa3xx_nand.c with a rework
> called marvell_nand.c.
> 
> Aside the fact that it drops the big state machine, improves the overall
> speed and implements raw accesses, it is the first driver-side
> implementation of the ->exec_op() interface and may be used as reference
> for latter reworks of the same type.
> 
> One may find more detail about why a completely new driver is needed in
> the commit log of:
> 
>     "mtd: nand: add reworked Marvell NAND controller driver"
> 
> Device tree NAND node definition for all platforms referring to the
> Marvell driver using the new bindings have already been accepted by the
> MVEBU DT maintainers and will be merged after the driver. They are more
> hierarchical and fit the real organization of the hardware, by having
> NAND partitions that are part of NAND chip nodes, themselves part of the
> NAND controller node.
> 
> These changes have been tested on:
>    - PXA3xx platform with a CM-X300 board (2kiB page NAND, 1b/512B
>      strength, Hamming ECC engine) [32 bits]
>    - Armada 385 DB AP (4kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
>    - Armada 398 DB (4kiB page NAND, 8b/512B, BCH ECC engine using a layout
>      with a last chunk different than the others) [32 bits]
>    - Armada 7040 DB and Armada 8040 DB (4kiB page NAND, 4b/512B, BCH ECC
>      engine) [64 bits]
>    - Triax dvb-tc board (2kiB page NAND, 4b/512B, BCH ECC engine) [32 bits]
> 
> This version is known not to be stable yet with a Zylonite based setup but
> otherwise looks good for Marvell EBU platforms.
> 
> For people who would like to test it easily, a branch ready to be tested
> is available at [3]. It is based on nand/next and has all the changes
> brought by the previously mentionned series as well as this one.
> 
> Thank you,
> Miquèl
> 
> 
> [1] https://www.spinics.net/lists/arm-kernel/msg619633.html
> [2] http://lists.infradead.org/pipermail/linux-mtd/2017-December/077965.html
> [3] https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc
> 
> 
> Changes since v2:
>   - Added a patch to create the nand-rb property in the Documentation
>   - Rewording in the Documentation according to Rob's comments
>   - Moved from marvell,rb to nand-rb property in the code
>   - Disociated using this driver with Marvell EBU platforms than using
>     it with PXA ones
>   - Fixed the handling of 16-bit buses
>   - Fixed SPDX comment style
>   - Reorganized registers offsets/bit fields definitions as requested
>   - Moved to Kernel doc
>   - Changed the logic in ->select_chip() to use a prepared value of NDCR
>     only instead of recalculating it
>   - Fixed the presence of the SPARE_EN bit, reworked a bit the
>     hmg_do_read/write() helpers
>   - Fixed the OOB layouts that were unusable (all spare data first, then
>     all ECC bytes)
>   - Additional check on mtd->writesize when using NFCv1 (all sizes not
>     supported)
>   - Various typos/rewording
> 
> Changes since v1:
>   - Rewording
>   - Fixed BCH ->read/write_page() hooks for 2kiB pages NAND chips
>   - Removed license text, used SPDX tag instead
>   - Removed read_page_data()
>   - Enhanced the DT bindings document with the label property and the
>     deprecated bindings.
>   - Simplified the read_chunk() helper (OOB always read).
>   - Simplified the ->bch_read_page() hook by removing the addition raw
>     read to get ECC bytes.
>   - Fixed the ->correct() function that did not check for bitflips in
>     ECC bytes in erased pages.
> 
> 
> Miquel Raynal (7):
>   dt-bindings: mtd: document new nand-rb property
>   dt-bindings: mtd: add Marvell NAND controller documentation
>   mtd: nand: add reworked Marvell NAND controller driver
>   mtd: nand: use reworked NAND controller driver with Marvell EBU SoCs

Applied patches 1 to 4.

Can you please send a patch to add a new entry in MAINTAINERS for this
driver?

Thanks,

Boris

>   mtd: nand: use Marvell reworked NAND controller driver with all
>     platforms
>   dt-bindings: mtd: remove pxa3xx NAND controller documentation
>   mtd: nand: remove useless fields from pxa3xx NAND platform data
> 
>  .../devicetree/bindings/mtd/marvell-nand.txt       |  123 +
>  Documentation/devicetree/bindings/mtd/nand.txt     |    1 +
>  .../devicetree/bindings/mtd/pxa3xx-nand.txt        |   50 -
>  arch/arm/configs/cm_x300_defconfig                 |    2 +-
>  arch/arm/configs/mvebu_v7_defconfig                |    2 +-
>  arch/arm/configs/pxa3xx_defconfig                  |    3 +-
>  arch/arm/configs/pxa_defconfig                     |    2 +-
>  arch/arm/configs/raumfeld_defconfig                |    2 +-
>  arch/arm/mach-mmp/ttc_dkb.c                        |    4 +-
>  arch/arm/mach-pxa/cm-x300.c                        |    8 +-
>  arch/arm/mach-pxa/colibri-pxa3xx.c                 |    8 +-
>  arch/arm/mach-pxa/colibri.h                        |    2 +-
>  arch/arm/mach-pxa/littleton.c                      |   10 +-
>  arch/arm/mach-pxa/mxm8x10.c                        |   10 +-
>  arch/arm/mach-pxa/raumfeld.c                       |    6 +-
>  arch/arm/mach-pxa/zylonite.c                       |   10 +-
>  arch/arm64/configs/defconfig                       |    2 +-
>  drivers/mtd/nand/Kconfig                           |   18 +-
>  drivers/mtd/nand/Makefile                          |    2 +-
>  drivers/mtd/nand/marvell_nand.c                    | 2896 ++++++++++++++++++++
>  drivers/mtd/nand/pxa3xx_nand.c                     | 2104 --------------
>  include/linux/platform_data/mtd-nand-pxa3xx.h      |   43 +-
>  22 files changed, 3072 insertions(+), 2236 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/mtd/marvell-nand.txt
>  delete mode 100644 Documentation/devicetree/bindings/mtd/pxa3xx-nand.txt
>  create mode 100644 drivers/mtd/nand/marvell_nand.c
>  delete mode 100644 drivers/mtd/nand/pxa3xx_nand.c
>
Robert Jarzmik Jan. 12, 2018, 4:43 p.m. UTC | #12
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

> I think I'm still missing something. If I look at the branch you just
> pushed, I see that ->flash_bbt was not set to 1 before this commit [1],
> which means the pxa3xx driver was no setting the NAND_BBT_USE_FLASH
> flag, which in turn means you were not using the on-flash-bbt.
> When you test the old (pxa3xx) driver, are you sure you're testing
> things with a mainline kernel? If you have extra commits on top of
> mainline, can you push them somewhere?

Here is the branch I'm using with my pxa3xx driver :
git fetch https://github.com/rjarzmik/linux pxa3xx-test

Cheers.
Miquel Raynal Jan. 12, 2018, 6:21 p.m. UTC | #13
Hello Willy,

On Thu, 11 Jan 2018 23:28:33 +0100
Willy Tarreau <w@1wt.eu> wrote:

> Hi Miquel,
> 
> On Tue, Jan 09, 2018 at 11:36:30AM +0100, Miquel Raynal wrote:
> > These changes have been tested on:
> >    - PXA3xx platform with a CM-X300 board (2kiB page NAND, 1b/512B
> >      strength, Hamming ECC engine) [32 bits]
> >    - Armada 385 DB AP (4kiB page NAND, 4b/512B, BCH ECC engine) [32
> > bits]
> >    - Armada 398 DB (4kiB page NAND, 8b/512B, BCH ECC engine using a
> > layout with a last chunk different than the others) [32 bits]
> >    - Armada 7040 DB and Armada 8040 DB (4kiB page NAND, 4b/512B,
> > BCH ECC engine) [64 bits]
> >    - Triax dvb-tc board (2kiB page NAND, 4b/512B, BCH ECC engine)
> > [32 bits]  
> 
> If you're interested, I have a mirabox with an armada 370 which uses
> the same driver. I've not powered it up for a while but if that can
> be useful I can try to find some time for this.

Thank you very much for the proposal!

For the moment I don't think there is a need as this type of platform
has already been tested, but I keep it in mind.

Thanks again!
Miquèl
Robert Jarzmik Jan. 12, 2018, 8:44 p.m. UTC | #14
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

> On Fri, 12 Jan 2018 10:34:13 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>
>> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> Because we though scanning of BBMs was working with the old pxa driver
> (which should be the case for your setup, BTW), and we thought the new
> driver was introducing a regression here.
That's what happens :
 - flash_bbt=1 with old driver => everything works fine
 - flash_bbt=1 with marvell_nand => BBT is damaged (or so I believe from
   Miquel's analysis)

> BTW, did you ever try with the old driver and ->flash_bbt = false? If
> you did not, can you test?
Sure, just did, same behavior as with marvell_nand :
 - bad erase blocks (almost) everywhere
 - ubifs error

>> I think we're still not aligned here. There are _no_ bad block markers in the
>> OOB on my flash, because there is a BBT at the end.
>
> That's not how it works. The BBT is a way to get information about bad
> blocks within a single read access, but, if you can preserve BBMs and
> keep them updated (which is the case here), you should do it, just in
> case you lose the BBT.
You're probably right today. But this assertion is probably wrong for system
created in early 2000s ... :)

>> > So, the symptoms we're seeing here, where almost all blocks are reported as
>> > bad when scanning BBMs, is not expected, and that's what we're trying to
>> > debug/fix.  
>> Well, I still think this is not something to fix ... I still think that OOB data
>> is not relevant as to the state of bad blocks in my flash ...
>
> Hm, I disagree. What if, for any reason, the BBT is lost? Don't you
> want the full scan to work?
If the BBT is lost, you have the mirror BBT, it's its purpose.

> Okay, so I have another solution for that: drop the NAND_BBT_CREATE and
> NAND_BBT_WRITE here [1] and here [2]. That should let you read the
> existing BBT without updating it or creating a new one if it's not
> detected.
Okay, let's try the marvell-nand-bug branch with this included.
It works :
[   18.302123] ubi0: attached mtd5 (name "root", size 37 MiB)
[   18.307691] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[   18.315003] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[   18.322155] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[   18.329167] ubi0: good PEBs: 297, bad PEBs: 0, corrupted PEBs: 0
[   18.335789] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
[   18.343409] ubi0: max/mean erase counter: 6/4, WL threshold: 4096, image sequence number: 30621
[   18.352460] ubi0: available PEBs: 0, total reserved PEBs: 297, PEBs reserved for bad PEB handling: 40
[   18.361937] ubi0: background thread "ubi_bgt0d" started, PID 411

That means the BBT reading is the issue don't you think ?

Now if I keep NAND_BBT_CREATE but remove NAND_BBT_WRITE same thing, it works as
well. That leaves only the re-enabling of the BBT write, which I'll do as soon
as you tell me my NAND won't be damaged.

Cheers.
Boris Brezillon Jan. 13, 2018, 8:38 a.m. UTC | #15
Hello Robert,

On Fri, 12 Jan 2018 21:44:27 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > On Fri, 12 Jan 2018 10:34:13 +0100
> > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >  
> >> Boris Brezillon <boris.brezillon@free-electrons.com> writes:  
> > Because we though scanning of BBMs was working with the old pxa driver
> > (which should be the case for your setup, BTW), and we thought the new
> > driver was introducing a regression here.  
> That's what happens :
>  - flash_bbt=1 with old driver => everything works fine
>  - flash_bbt=1 with marvell_nand => BBT is damaged (or so I believe from
>    Miquel's analysis)

It shouldn't be damaged anymore. The bug has been fixed just before we
asked you to scrub the BBT area.

> 
> > BTW, did you ever try with the old driver and ->flash_bbt = false? If
> > you did not, can you test?  
> Sure, just did, same behavior as with marvell_nand :
>  - bad erase blocks (almost) everywhere
>  - ubifs error

That's a relief!

> 
> >> I think we're still not aligned here. There are _no_ bad block markers in the
> >> OOB on my flash, because there is a BBT at the end.  
> >
> > That's not how it works. The BBT is a way to get information about bad
> > blocks within a single read access, but, if you can preserve BBMs and
> > keep them updated (which is the case here), you should do it, just in
> > case you lose the BBT.  
> You're probably right today. But this assertion is probably wrong for system
> created in early 2000s ... :)

I can't say, but I recommend patching the component that screw up BBMs
in your setup anyway. It's probably not the kernel since Miquel tested
the transition from the old to the new driver without activating the
on-flash-bbt on his pxa boards, and all BBMs were preserved.

So, it's either barebox or another component you use to program things.

> 
> >> > So, the symptoms we're seeing here, where almost all blocks are reported as
> >> > bad when scanning BBMs, is not expected, and that's what we're trying to
> >> > debug/fix.    
> >> Well, I still think this is not something to fix ... I still think that OOB data
> >> is not relevant as to the state of bad blocks in my flash ...  
> >
> > Hm, I disagree. What if, for any reason, the BBT is lost? Don't you
> > want the full scan to work?  
> If the BBT is lost, you have the mirror BBT, it's its purpose.

If both are lost, you're screwed.

> 
> > Okay, so I have another solution for that: drop the NAND_BBT_CREATE and
> > NAND_BBT_WRITE here [1] and here [2]. That should let you read the
> > existing BBT without updating it or creating a new one if it's not
> > detected.  
> Okay, let's try the marvell-nand-bug branch with this included.
> It works :
> [   18.302123] ubi0: attached mtd5 (name "root", size 37 MiB)
> [   18.307691] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
> [   18.315003] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
> [   18.322155] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
> [   18.329167] ubi0: good PEBs: 297, bad PEBs: 0, corrupted PEBs: 0
> [   18.335789] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
> [   18.343409] ubi0: max/mean erase counter: 6/4, WL threshold: 4096, image sequence number: 30621
> [   18.352460] ubi0: available PEBs: 0, total reserved PEBs: 297, PEBs reserved for bad PEB handling: 40
> [   18.361937] ubi0: background thread "ubi_bgt0d" started, PID 411
> 
> That means the BBT reading is the issue don't you think ?

The BBT detection issue has already been fixed with Miquel's previous
version. So there shouldn't be any issue with that anymore, and your
results tend to confirm that.

> 
> Now if I keep NAND_BBT_CREATE but remove NAND_BBT_WRITE same thing, it works as
> well. That leaves only the re-enabling of the BBT write, which I'll do as soon
> as you tell me my NAND won't be damaged.

It won't, you can safely re-enable NAND_BBT_WRITE. The one that was
causing trouble previously was NAND_BBT_CREATE, because the BBT was not
found, and the NAND framework was creating a new one after scanning
BBMs, which led to the situation you reported: BBT reporting all blocks
as bad.

Thanks for helping us with this bug, I think we're close to a fully
working situation now.

Regards,

Boris
Boris Brezillon Jan. 13, 2018, 8:38 a.m. UTC | #16
On Fri, 12 Jan 2018 17:43:52 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > I think I'm still missing something. If I look at the branch you just
> > pushed, I see that ->flash_bbt was not set to 1 before this commit [1],
> > which means the pxa3xx driver was no setting the NAND_BBT_USE_FLASH
> > flag, which in turn means you were not using the on-flash-bbt.
> > When you test the old (pxa3xx) driver, are you sure you're testing
> > things with a mainline kernel? If you have extra commits on top of
> > mainline, can you push them somewhere?  
> 
> Here is the branch I'm using with my pxa3xx driver :
> git fetch https://github.com/rjarzmik/linux pxa3xx-test

May I ask why this is not in mainline?
Miquel Raynal Jan. 13, 2018, 11:05 a.m. UTC | #17
Hello,

On Sat, 13 Jan 2018 09:38:07 +0100
Boris Brezillon <boris.brezillon@free-electrons.com> wrote:

> Hello Robert,
> 
> On Fri, 12 Jan 2018 21:44:27 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> 
> > Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> >   
> > > On Fri, 12 Jan 2018 10:34:13 +0100
> > > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> > >    
> > >> Boris Brezillon <boris.brezillon@free-electrons.com> writes:    
> > > Because we though scanning of BBMs was working with the old pxa
> > > driver (which should be the case for your setup, BTW), and we
> > > thought the new driver was introducing a regression here.    
> > That's what happens :
> >  - flash_bbt=1 with old driver => everything works fine
> >  - flash_bbt=1 with marvell_nand => BBT is damaged (or so I believe
> > from Miquel's analysis)  
> 
> It shouldn't be damaged anymore. The bug has been fixed just before we
> asked you to scrub the BBT area.
> 
> >   
> > > BTW, did you ever try with the old driver and ->flash_bbt =
> > > false? If you did not, can you test?    
> > Sure, just did, same behavior as with marvell_nand :
> >  - bad erase blocks (almost) everywhere
> >  - ubifs error  
> 
> That's a relief!

Indeed, it is!

> 
> >   
> > >> I think we're still not aligned here. There are _no_ bad block
> > >> markers in the OOB on my flash, because there is a BBT at the
> > >> end.    
> > >
> > > That's not how it works. The BBT is a way to get information
> > > about bad blocks within a single read access, but, if you can
> > > preserve BBMs and keep them updated (which is the case here), you
> > > should do it, just in case you lose the BBT.    
> > You're probably right today. But this assertion is probably wrong
> > for system created in early 2000s ... :)  
> 
> I can't say, but I recommend patching the component that screw up BBMs
> in your setup anyway. It's probably not the kernel since Miquel tested
> the transition from the old to the new driver without activating the
> on-flash-bbt on his pxa boards, and all BBMs were preserved.
> 
> So, it's either barebox or another component you use to program
> things.
> 
> >   
> > >> > So, the symptoms we're seeing here, where almost all blocks
> > >> > are reported as bad when scanning BBMs, is not expected, and
> > >> > that's what we're trying to debug/fix.      
> > >> Well, I still think this is not something to fix ... I still
> > >> think that OOB data is not relevant as to the state of bad
> > >> blocks in my flash ...    
> > >
> > > Hm, I disagree. What if, for any reason, the BBT is lost? Don't
> > > you want the full scan to work?    
> > If the BBT is lost, you have the mirror BBT, it's its purpose.  
> 
> If both are lost, you're screwed.

And when you encounter a driver problem, it is very likely that both
will be smashed, as it happened this week. Now I better understand why
you feared loosing the BBT again: it forces you to recreate it by hand.

> 
> >   
> > > Okay, so I have another solution for that: drop the
> > > NAND_BBT_CREATE and NAND_BBT_WRITE here [1] and here [2]. That
> > > should let you read the existing BBT without updating it or
> > > creating a new one if it's not detected.    
> > Okay, let's try the marvell-nand-bug branch with this included.
> > It works :
> > [   18.302123] ubi0: attached mtd5 (name "root", size 37 MiB)
> > [   18.307691] ubi0: PEB size: 131072 bytes (128 KiB), LEB size:
> > 126976 bytes [   18.315003] ubi0: min./max. I/O unit sizes:
> > 2048/2048, sub-page size 2048 [   18.322155] ubi0: VID header
> > offset: 2048 (aligned 2048), data offset: 4096 [   18.329167] ubi0:
> > good PEBs: 297, bad PEBs: 0, corrupted PEBs: 0 [   18.335789] ubi0:
> > user volume: 1, internal volumes: 1, max. volumes count: 128
> > [   18.343409] ubi0: max/mean erase counter: 6/4, WL threshold:
> > 4096, image sequence number: 30621 [   18.352460] ubi0: available
> > PEBs: 0, total reserved PEBs: 297, PEBs reserved for bad PEB
> > handling: 40 [   18.361937] ubi0: background thread "ubi_bgt0d"
> > started, PID 411
> > 
> > That means the BBT reading is the issue don't you think ?  
> 
> The BBT detection issue has already been fixed with Miquel's previous
> version. So there shouldn't be any issue with that anymore, and your
> results tend to confirm that.
> 
> > 
> > Now if I keep NAND_BBT_CREATE but remove NAND_BBT_WRITE same thing,
> > it works as well. That leaves only the re-enabling of the BBT
> > write, which I'll do as soon as you tell me my NAND won't be
> > damaged.  
> 
> It won't, you can safely re-enable NAND_BBT_WRITE. The one that was
> causing trouble previously was NAND_BBT_CREATE, because the BBT was
> not found, and the NAND framework was creating a new one after
> scanning BBMs, which led to the situation you reported: BBT reporting
> all blocks as bad.
> 
> Thanks for helping us with this bug, I think we're close to a fully
> working situation now.

That is great, thank you both.
Miquèl
Robert Jarzmik Jan. 14, 2018, 10:20 a.m. UTC | #18
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

> On Fri, 12 Jan 2018 17:43:52 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>
>> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
>> 
>> > I think I'm still missing something. If I look at the branch you just
>> > pushed, I see that ->flash_bbt was not set to 1 before this commit [1],
>> > which means the pxa3xx driver was no setting the NAND_BBT_USE_FLASH
>> > flag, which in turn means you were not using the on-flash-bbt.
>> > When you test the old (pxa3xx) driver, are you sure you're testing
>> > things with a mainline kernel? If you have extra commits on top of
>> > mainline, can you push them somewhere?  
>> 
>> Here is the branch I'm using with my pxa3xx driver :
>> git fetch https://github.com/rjarzmik/linux pxa3xx-test
>
> May I ask why this is not in mainline?
Sure.

The pxa3xx comes into 3 flavors of SoCs :
 - pxa300
 - pxa310
 - pxa320

In these 3, only pxa310 has an internal PoP NAND, and requires the keep_config =
1 setting (where timings are set by the internal ROM code).

Yet zylonite_init_nand() is shared across all 3 platforms. In order to not break
other existing pxa3xx devices, I don't want to change this setting. Instead, my
plan is slowly convert pxaXXX to devicetree, and have this parameter set in
devicetree in a per board basis.

As to the .flash_bbt = 1 parameter, it's even worse. That's the decision made in
the initial flash formatting that counts. With the zylonite310 I have, the BBT
is at the end of the NAND. There might be other parts where it's only in the OOB
area. So it's difficult to change it without taking a risk of breaking others.

Cheers.
Robert Jarzmik Jan. 14, 2018, 10:35 a.m. UTC | #19
Miquel Raynal <miquel.raynal@free-electrons.com> writes:

>> It shouldn't be damaged anymore. The bug has been fixed just before we
>> asked you to scrub the BBT area.
I confirm. The last marvell branch I have (2be03e113a5b ("HACK: Dump OOB data
while reading")) does work correctly on my zylonite.

I still have to make a test of a file write, and a devicetree based setup, but I
don't think anything wrong will happen now.

Cheers.
Boris Brezillon Jan. 22, 2018, 8:51 a.m. UTC | #20
Hi Robert,

On Sun, 14 Jan 2018 11:35:30 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Miquel Raynal <miquel.raynal@free-electrons.com> writes:
> 
> >> It shouldn't be damaged anymore. The bug has been fixed just before we
> >> asked you to scrub the BBT area.  
> I confirm. The last marvell branch I have (2be03e113a5b ("HACK: Dump OOB data
> while reading")) does work correctly on my zylonite.
> 
> I still have to make a test of a file write, and a devicetree based setup, but I
> don't think anything wrong will happen now.

Did you make any progress with that? Note that, no matter the outcome
of your tests, I don't plan to queue the remaining patches for this
release, so don't be afraid of putting your Tested-by/Acked-by on
patches 5-6 if it works on your platforms ;-).

Thanks,

Boris
Boris Brezillon Jan. 22, 2018, 8:54 a.m. UTC | #21
On Sun, 14 Jan 2018 11:20:57 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > On Fri, 12 Jan 2018 17:43:52 +0100
> > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >  
> >> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> >>   
> >> > I think I'm still missing something. If I look at the branch you just
> >> > pushed, I see that ->flash_bbt was not set to 1 before this commit [1],
> >> > which means the pxa3xx driver was no setting the NAND_BBT_USE_FLASH
> >> > flag, which in turn means you were not using the on-flash-bbt.
> >> > When you test the old (pxa3xx) driver, are you sure you're testing
> >> > things with a mainline kernel? If you have extra commits on top of
> >> > mainline, can you push them somewhere?    
> >> 
> >> Here is the branch I'm using with my pxa3xx driver :
> >> git fetch https://github.com/rjarzmik/linux pxa3xx-test  
> >
> > May I ask why this is not in mainline?  
> Sure.
> 
> The pxa3xx comes into 3 flavors of SoCs :
>  - pxa300
>  - pxa310
>  - pxa320
> 
> In these 3, only pxa310 has an internal PoP NAND, and requires the keep_config =
> 1 setting (where timings are set by the internal ROM code).
> 
> Yet zylonite_init_nand() is shared across all 3 platforms. In order to not break
> other existing pxa3xx devices, I don't want to change this setting. Instead, my
> plan is slowly convert pxaXXX to devicetree, and have this parameter set in
> devicetree in a per board basis.
> 
> As to the .flash_bbt = 1 parameter, it's even worse. That's the decision made in
> the initial flash formatting that counts. With the zylonite310 I have, the BBT
> is at the end of the NAND. There might be other parts where it's only in the OOB
> area. So it's difficult to change it without taking a risk of breaking others.

Ok. It's clearer now. I wish we had this discussion earlier :-/.

Anyway, thanks for the clarification.

Boris
Robert Jarzmik Jan. 27, 2018, 10:33 a.m. UTC | #22
Boris Brezillon <boris.brezillon@free-electrons.com> writes:

> Hi Robert,
>
> On Sun, 14 Jan 2018 11:35:30 +0100
> Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>
>> Miquel Raynal <miquel.raynal@free-electrons.com> writes:
>> 
>> >> It shouldn't be damaged anymore. The bug has been fixed just before we
>> >> asked you to scrub the BBT area.  
>> I confirm. The last marvell branch I have (2be03e113a5b ("HACK: Dump OOB data
>> while reading")) does work correctly on my zylonite.
>> 
>> I still have to make a test of a file write, and a devicetree based setup, but I
>> don't think anything wrong will happen now.
>
> Did you make any progress with that? Note that, no matter the outcome
> of your tests, I don't plan to queue the remaining patches for this
> release, so don't be afraid of putting your Tested-by/Acked-by on
> patches 5-6 if it works on your platforms ;-).

Hi Boris and Miquel,

I have done the write tests with my ubifs existing filesystem, it works.
Feel free to add to patches 5 and 6 my :

Tested-by: Robert Jarzmik <robert.jarzmik@free.fr>

Cheers.
Boris Brezillon Jan. 29, 2018, 10:36 a.m. UTC | #23
On Sat, 27 Jan 2018 11:33:47 +0100
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Boris Brezillon <boris.brezillon@free-electrons.com> writes:
> 
> > Hi Robert,
> >
> > On Sun, 14 Jan 2018 11:35:30 +0100
> > Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >  
> >> Miquel Raynal <miquel.raynal@free-electrons.com> writes:
> >>   
> >> >> It shouldn't be damaged anymore. The bug has been fixed just before we
> >> >> asked you to scrub the BBT area.    
> >> I confirm. The last marvell branch I have (2be03e113a5b ("HACK: Dump OOB data
> >> while reading")) does work correctly on my zylonite.
> >> 
> >> I still have to make a test of a file write, and a devicetree based setup, but I
> >> don't think anything wrong will happen now.  
> >
> > Did you make any progress with that? Note that, no matter the outcome
> > of your tests, I don't plan to queue the remaining patches for this
> > release, so don't be afraid of putting your Tested-by/Acked-by on
> > patches 5-6 if it works on your platforms ;-).  
> 
> Hi Boris and Miquel,
> 
> I have done the write tests with my ubifs existing filesystem, it works.
> Feel free to add to patches 5 and 6 my :
> 
> Tested-by: Robert Jarzmik <robert.jarzmik@free.fr>

Thanks a lot.

Would you mind reviewing/testing patch 7?

Thanks,

Boris