diff mbox series

[net-next,1/2] mlxsw: core: Add ethtool support for QSFP-DD transceivers

Message ID 20200626144724.224372-2-idosch@idosch.org
State Changes Requested
Delegated to: David Miller
Headers show
Series mlxsw: Add support for QSFP-DD transceiver type | expand

Commit Message

Ido Schimmel June 26, 2020, 2:47 p.m. UTC
From: Vadim Pasternak <vadimp@mellanox.com>

The Quad Small Form Factor Pluggable Double Density (QSFP-DD) hardware
specification defines a form factor that supports up to 400 Gbps in
aggregate over an 8x50-Gbps electrical interface. The QSFP-DD supports
both optical and copper interfaces.

Implementation is based on Common Management Interface Specification;
Rev 4.0 May 8, 2019. Table 8-2 "Identifier and Status Summary (Lower
Page)" from this spec defines "Id and Status" fields located at offsets
00h - 02h. Bit 2 at offset 02h ("Flat_mem") specifies QSFP EEPROM memory
mode, which could be "upper memory flat" or "paged". Flat memory mode is
coded "1", and indicates that only page 00h is implemented in EEPROM.
Paged memory is coded "0" and indicates that pages 00h, 01h, 02h, 10h
and 11h are implemented. Pages 10h and 11h are currently not supported
by the driver.

"Flat" memory mode is used for the passive copper transceivers. For this
type only page 00h (256 bytes) is available. "Paged" memory is used for
the optical transceivers. For this type pages 00h (256 bytes), 01h (128
bytes) and 02h (128 bytes) are available. Upper page 01h contains static
advertising field, while upper page 02h contains the module-defined
thresholds and lane-specific monitors.

Extend enumerator 'mlxsw_reg_mcia_eeprom_module_info_id' with additional
field 'MLXSW_REG_MCIA_EEPROM_MODULE_INFO_TYPE_ID'. This field is used to
indicate for QSFP-DD transceiver type which memory mode is to be used.

Expose 256 bytes buffer for QSFP-DD passive copper transceiver and
512 bytes buffer for optical.

Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 19 ++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlxsw/reg.h     |  2 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

Comments

Andrew Lunn June 26, 2020, 3:19 p.m. UTC | #1
> +	case MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP_DD:
> +		/* Use SFF_8636 as base type. ethtool should recognize specific
> +		 * type through the identifier value.
> +		 */
> +		modinfo->type       = ETH_MODULE_SFF_8636;
> +		/* Verify if module EEPROM is a flat memory. In case of flat
> +		 * memory only page 00h (0-255 bytes) can be read. Otherwise
> +		 * upper pages 01h and 02h can also be read. Upper pages 10h
> +		 * and 11h are currently not supported by the driver.
> +		 */
> +		if (module_info[MLXSW_REG_MCIA_EEPROM_MODULE_INFO_TYPE_ID] &
> +		    MLXSW_REG_MCIA_EEPROM_CMIS_FLAT_MEMORY)
> +			modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> +		else
> +			modinfo->eeprom_len = ETH_MODULE_SFF_8472_LEN;
> +		break;

Although the upper pages 10h and 11h are not supported now, we
probably think about how they would be supported, to make sure we are
not going into a dead end.

From ethtool qsfp.c

/*
 *      Description:
 *      a) The register/memory layout is up to 5 128 byte pages defined by
 *              a "pages valid" register and switched via a "page select"
 *              register. Memory of 256 bytes can be memory mapped at a time
 *              according to SFF 8636.
 *      b) SFF 8636 based 640 bytes memory layout is presented for parser
 *
 *           SFF 8636 based QSFP Memory Map
 *
 *           2-Wire Serial Address: 1010000x
 *
 *           Lower Page 00h (128 bytes)
 *           ======================
 *           |                     |
 *           |Page Select Byte(127)|
 *           ======================
 *                    |
 *                    V
 *           ----------------------------------------
 *          |             |            |             |
 *          V             V            V             V
 *       ----------   ----------   ---------    ------------
 *      | Upper    | | Upper    | | Upper    | | Upper      |
 *      | Page 00h | | Page 01h | | Page 02h | | Page 03h   |
 *      |          | |(Optional)| |(Optional)| | (Optional) |
 *      |          | |          | |          | |            |
 *      |          | |          | |          | |            |
 *      |    ID    | |   AST    | |  User    | |  For       |
 *      |  Fields  | |  Table   | | EEPROM   | |  Cable     |
 *      |          | |          | | Data     | | Assemblies |
 *      |          | |          | |          | |            |
 *      |          | |          | |          | |            |
 *      -----------  -----------   ----------  --------------
 *
 *
 **/

Is page 03h valid for a QSFP DD? Do we add pages 10h and 11h after
page 03h, or instead of? How do we indicate to user space what pages
of data have been passed to it?

   Andrew
Adrian Pop June 26, 2020, 5:33 p.m. UTC | #2
>
> Is page 03h valid for a QSFP DD? Do we add pages 10h and 11h after
> page 03h, or instead of? How do we indicate to user space what pages
> of data have been passed to it?
>
>    Andrew

From QSFP-DD CMIS Rev 4.0: "In particular, support of the Lower Memory
and of Page 00h is required for all modules, including passive copper
cables. These pages are therefore always implemented. Additional
support for Pages 01h, 02h and bank 0 of Pages 10h and 11h is required
for all paged memory modules."

According to the same document, page 0x03 contains "User EEPROM
(NVRs)". Byte 142, bit 2, page 0x01 indicates if the user page 0x03
was implemented. I did not find anything about page 0x02 (where the
user EEPROM is stored) in the documentation for QSFP. I suppose it is
always implemented? If we really want to have it so it is similar to
QSFP, one could send 896 bytes (instead of 768) and just fill that
portion with 0 in case it's not implemented. Note that this is just an
idea, I'm not aware of best practices in cases like this.

Adrian
Andrew Lunn June 26, 2020, 7:07 p.m. UTC | #3
On Fri, Jun 26, 2020 at 06:33:55PM +0100, Adrian Pop wrote:
> >
> > Is page 03h valid for a QSFP DD? Do we add pages 10h and 11h after
> > page 03h, or instead of? How do we indicate to user space what pages
> > of data have been passed to it?
> >
> >    Andrew
> 
> >From QSFP-DD CMIS Rev 4.0: "In particular, support of the Lower Memory
> and of Page 00h is required for all modules, including passive copper
> cables. These pages are therefore always implemented. Additional
> support for Pages 01h, 02h and bank 0 of Pages 10h and 11h is required
> for all paged memory modules."
> 
> According to the same document, page 0x03 contains "User EEPROM
> (NVRs)". Byte 142, bit 2, page 0x01 indicates if the user page 0x03
> was implemented. I did not find anything about page 0x02 (where the
> user EEPROM is stored) in the documentation for QSFP. I suppose it is
> always implemented? If we really want to have it so it is similar to
> QSFP, one could send 896 bytes (instead of 768) and just fill that
> portion with 0 in case it's not implemented. Note that this is just an
> idea, I'm not aware of best practices in cases like this.

It does seem common to only return a subset of the pages. This patch
for example. We need some clear rules to known what the kernel has
passed to user space, so we can both correctly interpret when a subset
has been passed, and also ensure all drivers are doing the same thing.

Currently we have:

 *       ----------   ----------   ---------    ------------
 *      | Upper    | | Upper    | | Upper    | | Upper      |
 *      | Page 00h | | Page 01h | | Page 02h | | Page 03h   |
 *      |          | |(Optional)| |(Optional)| | (Optional) |
 *      |          | |          | |          | |            |
 *      |          | |          | |          | |            |
 *      |    ID    | |   AST    | |  User    | |  For       |
 *      |  Fields  | |  Table   | | EEPROM   | |  Cable     |
 *      |          | |          | | Data     | | Assemblies |
 *      |          | |          | |          | |            |
 *      |          | |          | |          | |            |
 *      -----------  -----------   ----------  --------------

You are saying pages 00h, 01h and 02h are mandatory for QSPF-DD.  Page
03h is optional, but when present, it seems to contain what is page
02h above. Since the QSPF KAPI has it, QSPF-DD KAPI should also have
it. So i would suggest that pages 10h and 11h come after that.

If a driver wants to pass a subset, it can, but it must always trim
from the right, it cannot remove pages from the middle.

     Andrew
Adrian Pop June 26, 2020, 10:13 p.m. UTC | #4
> You are saying pages 00h, 01h and 02h are mandatory for QSPF-DD.  Page
> 03h is optional, but when present, it seems to contain what is page
> 02h above. Since the QSPF KAPI has it, QSPF-DD KAPI should also have
> it. So i would suggest that pages 10h and 11h come after that.
>
> If a driver wants to pass a subset, it can, but it must always trim
> from the right, it cannot remove pages from the middle.
>
>      Andrew

I agree with this. Basically there are two big cases:
- passive copper transceivers with flat memory => just 00h will be
present (both lower and higher => 256 bytes)
- optical transceivers with paged memory => 00h, 01h, 02h, 10h, 11h => 768 bytes
Having page 03h after 02h (so 896 bytes in total) seems like a good
idea and the updates I'll have to do to my old patch are minor
(updating the offset value of page 10h and 11h). When I tested my
patch, I did it with both passive copper transceivers and optical
transceivers and there weren't any issues.

In this patch, Ido added a comment in the code stating "Upper pages
10h and 11h are currently not supported by the driver.". This won't
actually be a problem! In CMIS Rev. 4, Table 8-12 Byte 85 (55h), we
learn that if the value of that byte is 01h or 02h, we have a SMF or
MMF interface (both optical). In the qsfp_dd_show_sig_optical_pwr
function (in my patch) there is this bit:

+ __u8 module_type = id[QSFP_DD_MODULE_TYPE_OFFSET];
[...]
+ /**
+ * The thresholds and the high/low alarms/warnings are available
+ * only if an optical interface (MMF/SMF) is present (if this is
+ * the case, it means that 5 pages are available).
+ */
+ if (module_type != QSFP_DD_MT_MMF &&
+    module_type != QSFP_DD_MT_SMF &&
+    eeprom_len != QSFP_DD_EEPROM_5PAG)
+ return;

But Ido sets the eeprom_len to be ETH_MODULE_SFF_8472_LEN which is
512, while QSFP_DD_EEPROM_5PAG is defined as 80h * 6 = 768. So there
won't be any issues of accessing non-existent values, since I return
from the function that deals with the pages 10h and 11h early. When
the driver will support them too everything will just work so your
idea of a driver being able to pass only a subset of pages (being
allowed to trim only from the right) holds.

Adrian
Ido Schimmel June 27, 2020, 7:16 p.m. UTC | #5
On Fri, Jun 26, 2020 at 11:13:42PM +0100, Adrian Pop wrote:
> > You are saying pages 00h, 01h and 02h are mandatory for QSPF-DD.  Page
> > 03h is optional, but when present, it seems to contain what is page
> > 02h above. Since the QSPF KAPI has it, QSPF-DD KAPI should also have
> > it. So i would suggest that pages 10h and 11h come after that.
> >
> > If a driver wants to pass a subset, it can, but it must always trim
> > from the right, it cannot remove pages from the middle.
> >
> >      Andrew
> 
> I agree with this. Basically there are two big cases:
> - passive copper transceivers with flat memory => just 00h will be
> present (both lower and higher => 256 bytes)
> - optical transceivers with paged memory => 00h, 01h, 02h, 10h, 11h => 768 bytes
> Having page 03h after 02h (so 896 bytes in total) seems like a good
> idea and the updates I'll have to do to my old patch are minor
> (updating the offset value of page 10h and 11h). When I tested my
> patch, I did it with both passive copper transceivers and optical
> transceivers and there weren't any issues.

Hi Adrian, Andrew,

Not sure I understand... You want the kernel to always pass page 03h to
user space (potentially zeroed)? Page 03h is not mandatory according to
the standard and page 01h contains information if page 03h is present or
not. So user space has the information it needs to determine if after
page 02h we have page 03h or page 10h. Why always pass page 03h then?

> 
> In this patch, Ido added a comment in the code stating "Upper pages
> 10h and 11h are currently not supported by the driver.". This won't
> actually be a problem! In CMIS Rev. 4, Table 8-12 Byte 85 (55h), we
> learn that if the value of that byte is 01h or 02h, we have a SMF or
> MMF interface (both optical). In the qsfp_dd_show_sig_optical_pwr
> function (in my patch) there is this bit:
> 
> + __u8 module_type = id[QSFP_DD_MODULE_TYPE_OFFSET];
> [...]
> + /**
> + * The thresholds and the high/low alarms/warnings are available
> + * only if an optical interface (MMF/SMF) is present (if this is
> + * the case, it means that 5 pages are available).
> + */
> + if (module_type != QSFP_DD_MT_MMF &&
> +    module_type != QSFP_DD_MT_SMF &&
> +    eeprom_len != QSFP_DD_EEPROM_5PAG)
> + return;
> 
> But Ido sets the eeprom_len to be ETH_MODULE_SFF_8472_LEN which is
> 512, while QSFP_DD_EEPROM_5PAG is defined as 80h * 6 = 768. So there
> won't be any issues of accessing non-existent values, since I return
> from the function that deals with the pages 10h and 11h early. When
> the driver will support them too everything will just work so your
> idea of a driver being able to pass only a subset of pages (being
> allowed to trim only from the right) holds.

BTW, Adrian, this is the output I get with your patch on top of current
ethtool:

$ ethtool -m swp1
        Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
        Power class                               : 1
        Max power                                 : 0.25W
        Connector                                 : 0x23 (No separable connector)
        Cable assembly length                     : 0.50km
        Tx CDR bypass control                     : Yes
        Rx CDR bypass control                     : No
        Tx CDR                                    : No
        Rx CDR                                    : No
        Transmitter technology                    : 0x0a (Copper cable, unequalized)
        Attenuation at 5GHz                       : 4db
        Attenuation at 7GHz                       : 5db
        Attenuation at 12.9GHz                    : 7db
        Attenuation at 25.8GHz                    : 21db
        Module temperature                        : 0.00 degrees C / 32.00 degrees F
        Module voltage                            : 0.0000 V
        Length (SMF)                              : 0.00km
        Length (OM5)                              : 0m
        Length (OM4)                              : 0m
        Length (OM3 50/125um)                     : 0m
        Length (OM2 50/125um)                     : 17m
        Vendor name                               : Mellanox
        Vendor OUI                                : 00:02:c9
        Vendor PN                                 : MCP1660-W00AE30
        Vendor rev                                : A2
        Vendor SN                                 : MT2019VS04757
        Date code                                 : 200507  _
        Revision compliance                       : Rev. 3.0
Adrian Pop June 27, 2020, 8:42 p.m. UTC | #6
>
> Hi Adrian, Andrew,
>
> Not sure I understand... You want the kernel to always pass page 03h to
> user space (potentially zeroed)? Page 03h is not mandatory according to
> the standard and page 01h contains information if page 03h is present or

Hi Ido!

Andrew was thinking of having 03h after 02h (potentially zeroed) just
for the purpose of having a similar layout for QSFP-DD the same way we
do for QSFP. But as you said, it is not mandatory according to the
standard and I also don't know the entire codebase for ethtool and
where it might be actually needed. I think Andrew can argue for its
presence better than me.

> not. So user space has the information it needs to determine if after
> page 02h we have page 03h or page 10h. Why always pass page 03h then?
>

If we decide to add 03h but only sometimes, I think we will add an
extra layer of complexity. Sometimes after 02h we would have 03h and
sometimes 10h. In qsfp-dd.h (following the convention from qsfp.h) in
my patch there are a lot of different constants defined with respect
to the offset of the parent page in the memory layout and "dynamic
offsets" don't sound very good, at least for me. So even if there's a
way of checking in the user space which page is after 02h, a more
stable memory layout works better on the long run.

Adrian
Ido Schimmel June 28, 2020, 11:55 a.m. UTC | #7
On Sat, Jun 27, 2020 at 09:42:10PM +0100, Adrian Pop wrote:
> >
> > Hi Adrian, Andrew,
> >
> > Not sure I understand... You want the kernel to always pass page 03h to
> > user space (potentially zeroed)? Page 03h is not mandatory according to
> > the standard and page 01h contains information if page 03h is present or
> 
> Hi Ido!
> 
> Andrew was thinking of having 03h after 02h (potentially zeroed) just
> for the purpose of having a similar layout for QSFP-DD the same way we
> do for QSFP. But as you said, it is not mandatory according to the
> standard and I also don't know the entire codebase for ethtool and
> where it might be actually needed. I think Andrew can argue for its
> presence better than me.
> 
> > not. So user space has the information it needs to determine if after
> > page 02h we have page 03h or page 10h. Why always pass page 03h then?
> >
> 
> If we decide to add 03h but only sometimes, I think we will add an
> extra layer of complexity. Sometimes after 02h we would have 03h and
> sometimes 10h. In qsfp-dd.h (following the convention from qsfp.h) in
> my patch there are a lot of different constants defined with respect
> to the offset of the parent page in the memory layout and "dynamic
> offsets" don't sound very good, at least for me. So even if there's a
> way of checking in the user space which page is after 02h, a more
> stable memory layout works better on the long run.

Adrian,

Thanks for the detailed response. I don't think the kernel should pass
fake pages only to make it easier for user space to parse the
information. What you are describing is basic dissection and it's done
all the time by wireshark / tcpdump.

Anyway, even we pass a fake page 03h, page 11h can still be at a
variable offset. See table 8-28 [1], bits 1-0 at offset 142 in page 01h
determine the size of pages 10h and 11h:

0 - each page is 128 bytes in size
1 - each page is 256 bytes in size
2 - each page is 512 bytes in size

So a completely stable layout (unless I missed something) will entail
the kernel sending 1664 bytes to user space each time. This looks
unnecessarily rigid to me. The people who wrote the standard obviously
took into account the fact that the page layout needs to be discoverable
from the data and I think we should embrace it and only pass valid
information to user space.

Regardless, can Andrew and you let us know if you have a problems with
current patch set which only exposes pages 00h-02h? I see it's marked as
"Changes Requested", so I will need to re-submit.

Thanks

[1] http://www.qsfp-dd.com/wp-content/uploads/2019/05/QSFP-DD-CMIS-rev4p0.pdf
Adrian Pop June 28, 2020, 12:27 p.m. UTC | #8
Hi!

Regarding multiple banks, I think that we can have a much more
creative way of dealing with them (in the future). At page 76, we have
"In particular, support of the Lower Memory and of Page 00h is
required for all modules, including passive copper cables. These pages
are therefore always implemented. Additional support for Pages 01h,
02h and bank 0 of Pages 10h and 11h is required for all paged memory
modules."

My old patch clearly supports only the 1st (and mandatory) bank. So
the memory layout is 00h, 01h, 02h, 10h, 11h. If we will extend
ethtool to work for multiple banks, we might have something like 00h,
01h, 02h, (10h, 11h | bank 0), (10h, 11h | bank 1) etc. So we can
still check bits 1-0 of byte 142 from page 01h to determine how many
banks we have and we can still follow the "we can trim, but only to
the right" rule. Of course, this is only an idea, at the moment I
don't think we can even test something like that.

I'm sure that I can work something out to deal with sometimes having
page 03h and sometimes not, but first I think we need to decide if we
always add it or not. As I mentioned in my previous email, I think
Andrew can argue for its presence better than me. Based on that, I can
re-submit my old patch for ethtool.

Adrian

On Sun, 28 Jun 2020 at 12:56, Ido Schimmel <idosch@idosch.org> wrote:
>
> On Sat, Jun 27, 2020 at 09:42:10PM +0100, Adrian Pop wrote:
> > >
> > > Hi Adrian, Andrew,
> > >
> > > Not sure I understand... You want the kernel to always pass page 03h to
> > > user space (potentially zeroed)? Page 03h is not mandatory according to
> > > the standard and page 01h contains information if page 03h is present or
> >
> > Hi Ido!
> >
> > Andrew was thinking of having 03h after 02h (potentially zeroed) just
> > for the purpose of having a similar layout for QSFP-DD the same way we
> > do for QSFP. But as you said, it is not mandatory according to the
> > standard and I also don't know the entire codebase for ethtool and
> > where it might be actually needed. I think Andrew can argue for its
> > presence better than me.
> >
> > > not. So user space has the information it needs to determine if after
> > > page 02h we have page 03h or page 10h. Why always pass page 03h then?
> > >
> >
> > If we decide to add 03h but only sometimes, I think we will add an
> > extra layer of complexity. Sometimes after 02h we would have 03h and
> > sometimes 10h. In qsfp-dd.h (following the convention from qsfp.h) in
> > my patch there are a lot of different constants defined with respect
> > to the offset of the parent page in the memory layout and "dynamic
> > offsets" don't sound very good, at least for me. So even if there's a
> > way of checking in the user space which page is after 02h, a more
> > stable memory layout works better on the long run.
>
> Adrian,
>
> Thanks for the detailed response. I don't think the kernel should pass
> fake pages only to make it easier for user space to parse the
> information. What you are describing is basic dissection and it's done
> all the time by wireshark / tcpdump.
>
> Anyway, even we pass a fake page 03h, page 11h can still be at a
> variable offset. See table 8-28 [1], bits 1-0 at offset 142 in page 01h
> determine the size of pages 10h and 11h:
>
> 0 - each page is 128 bytes in size
> 1 - each page is 256 bytes in size
> 2 - each page is 512 bytes in size
>
> So a completely stable layout (unless I missed something) will entail
> the kernel sending 1664 bytes to user space each time. This looks
> unnecessarily rigid to me. The people who wrote the standard obviously
> took into account the fact that the page layout needs to be discoverable
> from the data and I think we should embrace it and only pass valid
> information to user space.
>
> Regardless, can Andrew and you let us know if you have a problems with
> current patch set which only exposes pages 00h-02h? I see it's marked as
> "Changes Requested", so I will need to re-submit.
>
> Thanks
>
> [1] http://www.qsfp-dd.com/wp-content/uploads/2019/05/QSFP-DD-CMIS-rev4p0.pdf
Andrew Lunn June 30, 2020, 12:21 a.m. UTC | #9
> Adrian,
> 
> Thanks for the detailed response. I don't think the kernel should pass
> fake pages only to make it easier for user space to parse the
> information. What you are describing is basic dissection and it's done
> all the time by wireshark / tcpdump.
> 
> Anyway, even we pass a fake page 03h, page 11h can still be at a
> variable offset. See table 8-28 [1], bits 1-0 at offset 142 in page 01h
> determine the size of pages 10h and 11h:
> 
> 0 - each page is 128 bytes in size
> 1 - each page is 256 bytes in size
> 2 - each page is 512 bytes in size
> 
> So a completely stable layout (unless I missed something) will entail
> the kernel sending 1664 bytes to user space each time. This looks
> unnecessarily rigid to me. The people who wrote the standard obviously
> took into account the fact that the page layout needs to be discoverable
> from the data and I think we should embrace it and only pass valid
> information to user space.

O.K. That makes things more complex. And i would say, Ido is correct,
we need to make use of the discoverable features.

I've no practice experience with modules other than plain old SFPs,
1G. And those have all sorts of errors, even basic things like the CRC
are systematically incorrect because they are not recalculated after
adding the serial number. We have had people trying to submit patches
to ethtool to make it ignore bits so that it dumps more information,
because the manufacturer failed to set the correct bits, etc.

Ido, Adrian, what is your experience with these QSFP-DD devices. Are
they generally of better quality, the EEPROM can be trusted? Is there
any form of compliance test.

If we go down the path of using the discovery information, it means we
have no way for user space to try to correct for when the information
is incorrect. It cannot request specific pages. So maybe we should
consider an alternative?

The netlink ethtool gives us more flexibility. How about we make a new
API where user space can request any pages it want, and specify the
size of the page. ethtool can start out by reading page 0. That should
allow it to identify the basic class of device. It can then request
additional pages as needed.

The nice thing about that is we don't need two parsers of the
discovery information, one in user and second in kernel space. We
don't need to guarantee these two parsers agree with each other, in
order to correctly decode what the kernel sent to user space. And user
space has the flexibility to work around known issues when
manufactures get their EEPROM wrong.

	     Andrew
Ido Schimmel June 30, 2020, 5:59 a.m. UTC | #10
On Tue, Jun 30, 2020 at 02:21:59AM +0200, Andrew Lunn wrote:
> I've no practice experience with modules other than plain old SFPs,
> 1G. And those have all sorts of errors, even basic things like the CRC
> are systematically incorrect because they are not recalculated after
> adding the serial number. We have had people trying to submit patches
> to ethtool to make it ignore bits so that it dumps more information,
> because the manufacturer failed to set the correct bits, etc.
> 
> Ido, Adrian, what is your experience with these QSFP-DD devices. Are
> they generally of better quality, the EEPROM can be trusted? Is there
> any form of compliance test.

Vadim, I know you tested with at least two different QSFP-DD modules,
can you please share your experience?

> 
> If we go down the path of using the discovery information, it means we
> have no way for user space to try to correct for when the information
> is incorrect. It cannot request specific pages. So maybe we should
> consider an alternative?
> 
> The netlink ethtool gives us more flexibility. How about we make a new
> API where user space can request any pages it want, and specify the
> size of the page. ethtool can start out by reading page 0. That should
> allow it to identify the basic class of device. It can then request
> additional pages as needed.

Just to make sure I understand, this also means adding a new API towards
drivers, right? So that they only read from HW the requested info.

> The nice thing about that is we don't need two parsers of the
> discovery information, one in user and second in kernel space. We
> don't need to guarantee these two parsers agree with each other, in
> order to correctly decode what the kernel sent to user space. And user
> space has the flexibility to work around known issues when
> manufactures get their EEPROM wrong.

Sounds sane to me... I know that in the past Vadim had to deal with
various faulty modules. Vadim, is this something we can support? What
happens if user space requests a page that does not exist? For example,
in the case of QSFP-DD, lets say we do not provide page 03h but user
space still wants it because it believes manufacturer did not set
correct bits.
Vadim Pasternak June 30, 2020, 9:37 a.m. UTC | #11
> -----Original Message-----
> From: Ido Schimmel <idosch@idosch.org>
> Sent: Tuesday, June 30, 2020 9:00 AM
> To: Andrew Lunn <andrew@lunn.ch>; Vadim Pasternak
> <vadimp@mellanox.com>
> Cc: Adrian Pop <popadrian1996@gmail.com>; netdev@vger.kernel.org;
> davem@davemloft.net; kuba@kernel.org; Jiri Pirko <jiri@mellanox.com>;
> mlxsw <mlxsw@mellanox.com>; Ido Schimmel <idosch@mellanox.com>
> Subject: Re: [PATCH net-next 1/2] mlxsw: core: Add ethtool support for QSFP-
> DD transceivers
> 
> On Tue, Jun 30, 2020 at 02:21:59AM +0200, Andrew Lunn wrote:
> > I've no practice experience with modules other than plain old SFPs,
> > 1G. And those have all sorts of errors, even basic things like the CRC
> > are systematically incorrect because they are not recalculated after
> > adding the serial number. We have had people trying to submit patches
> > to ethtool to make it ignore bits so that it dumps more information,
> > because the manufacturer failed to set the correct bits, etc.
> >
> > Ido, Adrian, what is your experience with these QSFP-DD devices. Are
> > they generally of better quality, the EEPROM can be trusted? Is there
> > any form of compliance test.
> 
> Vadim, I know you tested with at least two different QSFP-DD modules, can
> you please share your experience?
> 

I tested two types of QSFP-DD, cooper and optical from few vendors:
Innolight, SP (Source Photonics) and Mellanox customized transceivers.
We don't have enough statistics. I guess in all our systems in LAB we
validated about 150 - 200 cables. No one of them had wrong EEPROM.

But in all Mellanox systems QSFP reading works through the firmware
and firmware performs QSFP validation for stamping (some cable type
are considered as untrusted and firmware put them to the black list),
page checksum, power consuming criteria.


> >
> > If we go down the path of using the discovery information, it means we
> > have no way for user space to try to correct for when the information
> > is incorrect. It cannot request specific pages. So maybe we should
> > consider an alternative?
> >
> > The netlink ethtool gives us more flexibility. How about we make a new
> > API where user space can request any pages it want, and specify the
> > size of the page. ethtool can start out by reading page 0. That should
> > allow it to identify the basic class of device. It can then request
> > additional pages as needed.
> 
> Just to make sure I understand, this also means adding a new API towards
> drivers, right? So that they only read from HW the requested info.
> 
> > The nice thing about that is we don't need two parsers of the
> > discovery information, one in user and second in kernel space. We
> > don't need to guarantee these two parsers agree with each other, in
> > order to correctly decode what the kernel sent to user space. And user
> > space has the flexibility to work around known issues when
> > manufactures get their EEPROM wrong.
> 
> Sounds sane to me... I know that in the past Vadim had to deal with various
> faulty modules. Vadim, is this something we can support? What happens if
> user space requests a page that does not exist? For example, in the case of
> QSFP-DD, lets say we do not provide page 03h but user space still wants it
> because it believes manufacturer did not set correct bits.

Regarding faulty modules, as I wrote - validation is performed by firmware
and our software trust firmware.

Currently user space just asks for the buffer length.
I suppose in case we'll have additional API:
ethtool -m <if> <page> <offset> <size>
it would be possible to provide buffer only for the defined page and upto
valid size.

Pay attention that CMIS specification covers also others transceivers types
and some of them we are going to support through ethtool, like:
19h (OSFP 8x Pluggable Transceiver)
1Ah (SFP-DD Double Density 2x Pluggable Transceiver)
1Eh (QSFP with QSFP-DD memory map)

If I am not wrong 19h and 1Eh should have same layout as QSFP-DD and
SFP-DD is supposed to be similar, but shorter (page 02h is reserved, page
01h contains info, which for QSFP-DD sits at page 02h).
Andrew Lunn June 30, 2020, 1 p.m. UTC | #12
> Sounds sane to me... I know that in the past Vadim had to deal with
> various faulty modules. Vadim, is this something we can support? What
> happens if user space requests a page that does not exist? For example,
> in the case of QSFP-DD, lets say we do not provide page 03h but user
> space still wants it because it believes manufacturer did not set
> correct bits.

Hi Ido

I can see two scenarios.

This API is retrofitted to a firmware which only supports pre-defined
linear dumps. A page is requested which is not part of the
dump. EOPNOTSUPP seems like a good response.

The second is for a page which does not exist in the module. I would
just let the i2c bus master perform the transfer. Some might return
EIO, ENODEV if the SFP does not respond. Otherwise it might return
0xff from pullups, or random junk. Hopefully each page as a checksum,
and hopefully the vendor actually get the checksum correct? So let
userspace either deal with the error code, or whatever data is
provided.

The current code already to some extent handles missing data, it uses
the length to determine if the page is present or not. So it is not
too big a change to look error codes for individual pages.

    Andrew
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.c b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
index 08215fed193d..68f198afc9b0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
@@ -67,7 +67,8 @@  mlxsw_env_query_module_eeprom(struct mlxsw_core *mlxsw_core, int module,
 		offset -= MLXSW_REG_MCIA_EEPROM_UP_PAGE_LENGTH * page;
 		/* When reading upper pages 1, 2 and 3 the offset starts at
 		 * 128. Please refer to "QSFP+ Memory Map" figure in SFF-8436
-		 * specification for graphical depiction.
+		 * specification and to "CMIS Module Memory Map" Figure in
+		 * CMIS specification for graphical depiction.
 		 * MCIA register accepts buffer size <= 48. Page of size 128
 		 * should be read by chunks of size 48, 48, 32. Align the size
 		 * of the last chunk to avoid reading after the end of the
@@ -210,6 +211,22 @@  int mlxsw_env_get_module_info(struct mlxsw_core *mlxsw_core, int module,
 		else
 			modinfo->eeprom_len = ETH_MODULE_SFF_8472_LEN / 2;
 		break;
+	case MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID_QSFP_DD:
+		/* Use SFF_8636 as base type. ethtool should recognize specific
+		 * type through the identifier value.
+		 */
+		modinfo->type       = ETH_MODULE_SFF_8636;
+		/* Verify if module EEPROM is a flat memory. In case of flat
+		 * memory only page 00h (0-255 bytes) can be read. Otherwise
+		 * upper pages 01h and 02h can also be read. Upper pages 10h
+		 * and 11h are currently not supported by the driver.
+		 */
+		if (module_info[MLXSW_REG_MCIA_EEPROM_MODULE_INFO_TYPE_ID] &
+		    MLXSW_REG_MCIA_EEPROM_CMIS_FLAT_MEMORY)
+			modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
+		else
+			modinfo->eeprom_len = ETH_MODULE_SFF_8472_LEN;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index fcb88d4271bf..73cc7fd5020c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -8548,6 +8548,7 @@  MLXSW_ITEM32(reg, mcia, size, 0x08, 0, 16);
 #define MLXSW_REG_MCIA_TH_PAGE_NUM		3
 #define MLXSW_REG_MCIA_PAGE0_LO			0
 #define MLXSW_REG_MCIA_TH_PAGE_OFF		0x80
+#define MLXSW_REG_MCIA_EEPROM_CMIS_FLAT_MEMORY	BIT(7)
 
 enum mlxsw_reg_mcia_eeprom_module_info_rev_id {
 	MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID_UNSPC	= 0x00,
@@ -8566,6 +8567,7 @@  enum mlxsw_reg_mcia_eeprom_module_info_id {
 enum mlxsw_reg_mcia_eeprom_module_info {
 	MLXSW_REG_MCIA_EEPROM_MODULE_INFO_ID,
 	MLXSW_REG_MCIA_EEPROM_MODULE_INFO_REV_ID,
+	MLXSW_REG_MCIA_EEPROM_MODULE_INFO_TYPE_ID,
 	MLXSW_REG_MCIA_EEPROM_MODULE_INFO_SIZE,
 };