diff mbox series

[v10,03/12] peci: Add support for PECI bus driver core

Message ID 20190107214136.5256-4-jae.hyun.yoo@linux.intel.com
State Not Applicable, archived
Headers show
Series PECI device driver introduction | expand

Commit Message

Jae Hyun Yoo Jan. 7, 2019, 9:41 p.m. UTC
This commit adds driver implementation for PECI bus core into linux
driver framework.

PECI (Platform Environment Control Interface) is a one-wire bus interface
that provides a communication channel from Intel processors and chipset
components to external monitoring or control devices. PECI is designed to
support the following sideband functions:

* Processor and DRAM thermal management
  - Processor fan speed control is managed by comparing Digital Thermal
    Sensor (DTS) thermal readings acquired via PECI against the
    processor-specific fan speed control reference point, or TCONTROL. Both
    TCONTROL and DTS thermal readings are accessible via the processor PECI
    client. These variables are referenced to a common temperature, the TCC
    activation point, and are both defined as negative offsets from that
    reference.
  - PECI based access to the processor package configuration space provides
    a means for Baseboard Management Controllers (BMC) or other platform
    management devices to actively manage the processor and memory power
    and thermal features.

* Platform Manageability
  - Platform manageability functions including thermal, power, and error
    monitoring. Note that platform 'power' management includes monitoring
    and control for both the processor and DRAM subsystem to assist with
    data center power limiting.
  - PECI allows read access to certain error registers in the processor MSR
    space and status monitoring registers in the PCI configuration space
    within the processor and downstream devices.
  - PECI permits writes to certain registers in the processor PCI
    configuration space.

* Processor Interface Tuning and Diagnostics
  - Processor interface tuning and diagnostics capabilities
    (Intel Interconnect BIST). The processors Intel Interconnect Built In
    Self Test (Intel IBIST) allows for infield diagnostic capabilities in
    the Intel UPI and memory controller interfaces. PECI provides a port to
    execute these diagnostics via its PCI Configuration read and write
    capabilities.

* Failure Analysis
  - Output the state of the processor after a failure for analysis via
    Crashdump.

PECI uses a single wire for self-clocking and data transfer. The bus
requires no additional control lines. The physical layer is a self-clocked
one-wire bus that begins each bit with a driven, rising edge from an idle
level near zero volts. The duration of the signal driven high depends on
whether the bit value is a logic '0' or logic '1'. PECI also includes
variable data transfer rate established with every message. In this way, it
is highly flexible even though underlying logic is simple.

The interface design was optimized for interfacing between an Intel
processor and chipset components in both single processor and multiple
processor environments. The single wire interface provides low board
routing overhead for the multiple load connections in the congested routing
area near the processor and chipset components. Bus speed, error checking,
and low protocol overhead provides adequate link bandwidth and reliability
to transfer critical device operating conditions and configuration
information.

This implementation provides the basic framework to add PECI extensions to
the Linux bus and device models. A hardware specific 'Adapter' driver can
be attached to the PECI bus to provide sideband functions described above.
It is also possible to access all devices on an adapter from userspace
through the /dev interface. A device specific 'Client' driver also can be
attached to the PECI bus so each processor client's features can be
supported by the 'Client' driver through an adapter connection in the bus.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Gavin Schenk <g.schenk@eckelmann.de>
Cc: Vinod Koul <vkoul@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sagar Dharia <sdharia@codeaurora.org>
Cc: David Kershner <david.kershner@unisys.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Cyrille Pitchen <cyrille.pitchen@free-electrons.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jason M Biils <jason.m.bills@linux.intel.com>
Cc: Julia Cartwright <juliac@eso.teric.us>
Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
Reviewed-by: James Feist <james.feist@linux.intel.com>
Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
---
 drivers/Kconfig                 |    2 +
 drivers/Makefile                |    1 +
 drivers/peci/Kconfig            |   12 +
 drivers/peci/Makefile           |    6 +
 drivers/peci/peci-core.c        | 1527 +++++++++++++++++++++++++++++++
 include/linux/peci.h            |  142 +++
 include/uapi/linux/peci-ioctl.h |  403 ++++++++
 7 files changed, 2093 insertions(+)
 create mode 100644 drivers/peci/Kconfig
 create mode 100644 drivers/peci/Makefile
 create mode 100644 drivers/peci/peci-core.c
 create mode 100644 include/linux/peci.h
 create mode 100644 include/uapi/linux/peci-ioctl.h

Comments

Joel Stanley Jan. 14, 2019, 11:13 a.m. UTC | #1
Hello Jae,

On Tue, 8 Jan 2019 at 08:11, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>
> This commit adds driver implementation for PECI bus core into linux
> driver framework.

I would like to help you get this merged next release cycle, as we are
now carrying it in OpenBMC. I suggest we ask Greg to queue it up if
there are no objections after you've addressed my questions.

> +static u8 peci_aw_fcs(u8 *data, int len)

I was wondering what aw_fcs meant. I notice that later on you describe
it as an Assure Write Frame Check Sequence byte. You could add a
comment next to this function :)

Instead of casing to u8 every time you call this, you could have this
take a struct peci_xfer_msg * and cast when calling crc8.

> +{
> +       return crc8(peci_crc8_table, data, (size_t)len, 0);
> +}
> +
> +static int __peci_xfer(struct peci_adapter *adapter, struct peci_xfer_msg *msg,
> +                      bool do_retry, bool has_aw_fcs)
> +{
> +       ktime_t start, end;
> +       s64 elapsed_ms;
> +       int rc = 0;
> +
> +       /**

These are for kerneldoc, and the comments aren't kerneldoc. Replace
them with /* instead.

> +        * For some commands, the PECI originator may need to retry a command if
> +        * the processor PECI client responds with a 0x8x completion code. In
> +        * each instance, the processor PECI client may have started the
> +        * operation but not completed it yet. When the 'retry' bit is set, the
> +        * PECI client will ignore a new request if it exactly matches a
> +        * previous valid request.
> +        */
> +
> +       if (do_retry)
> +               start = ktime_get();
> +
> +       do {
> +               rc = adapter->xfer(adapter, msg);
> +
> +               if (!do_retry || rc)
> +                       break;
> +
> +               if (msg->rx_buf[0] == DEV_PECI_CC_SUCCESS)
> +                       break;
> +
> +               /* Retry is needed when completion code is 0x8x */
> +               if ((msg->rx_buf[0] & DEV_PECI_CC_RETRY_CHECK_MASK) !=
> +                   DEV_PECI_CC_NEED_RETRY) {
> +                       rc = -EIO;
> +                       break;
> +               }
> +
> +               /* Set the retry bit to indicate a retry attempt */
> +               msg->tx_buf[1] |= DEV_PECI_RETRY_BIT;
> +
> +               /* Recalculate the AW FCS if it has one */
> +               if (has_aw_fcs)
> +                       msg->tx_buf[msg->tx_len - 1] = 0x80 ^

Can we guarantee that msg->tx_len will be set to non-zero whenever has_aw_fcs?

I suggest checking before doing the assignment in case a new caller is
added and they make a mistake.

> +static int peci_ioctl_get_dib(struct peci_adapter *adapter, void *vmsg)
> +{
> +       struct peci_get_dib_msg *umsg = vmsg;
> +       struct peci_xfer_msg msg;
> +       int rc;
> +
> +       msg.addr      = umsg->addr;
> +       msg.tx_len    = GET_DIB_WR_LEN;
> +       msg.rx_len    = GET_DIB_RD_LEN;
> +       msg.tx_buf[0] = GET_DIB_PECI_CMD;
> +
> +       rc = peci_xfer(adapter, &msg);

Most of tx_buf is going to be uninitialised. I assume a well behaving
adapter->xfer will check this and only send the correct number of
bytes, but it might pay to zero out struct peci_xfer_msg in all of
these functions?

> +       if (rc)
> +               return rc;
> +
> +       umsg->dib = le64_to_cpup((__le64 *)msg.rx_buf);
> +
> +       return 0;
> +}

> +
> +#if IS_ENABLED(CONFIG_OF)
> +static struct peci_client *peci_of_register_device(struct peci_adapter *adapter,
> +                                                  struct device_node *node)
> +{
> +       struct peci_board_info info = {};
> +       struct peci_client *result;
> +       const __be32 *addr_be;
> +       int len;
> +
> +       dev_dbg(&adapter->dev, "register %pOF\n", node);
> +
> +       if (of_modalias_node(node, info.type, sizeof(info.type)) < 0) {

I don't understand why you're doing this. Won't this always be peci,
as your binding requires?

> +               dev_err(&adapter->dev, "modalias failure on %pOF\n", node);
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       addr_be = of_get_property(node, "reg", &len);
> +       if (!addr_be || len < sizeof(*addr_be)) {

The second check looks suspicious.

You could fix it to check the expected length (4), or use of_property_read_u32.

> +               dev_err(&adapter->dev, "invalid reg on %pOF\n", node);
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       info.addr = be32_to_cpup(addr_be);
> +       info.of_node = of_node_get(node);
> +
> +       result = peci_new_device(adapter, &info);
> +       if (!result)

Should you do an of_node_put here?

> +               result = ERR_PTR(-EINVAL);
> +
> +       of_node_put(node);

Why do you release the reference here?

> +       return result;
> +}
> +
Jae Hyun Yoo Jan. 14, 2019, 10:38 p.m. UTC | #2
Hello Joel,

On 1/14/2019 3:13 AM, Joel Stanley wrote:
> Hello Jae,
> 
> On Tue, 8 Jan 2019 at 08:11, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>>
>> This commit adds driver implementation for PECI bus core into linux
>> driver framework.
> 
> I would like to help you get this merged next release cycle, as we are
> now carrying it in OpenBMC. I suggest we ask Greg to queue it up if
> there are no objections after you've addressed my questions.
> 

Thanks a lot for your help on reviewing this patch series. I'll submit
v11 to address your comments. We could ask Greg to queue it then.

>> +static u8 peci_aw_fcs(u8 *data, int len)
> 
> I was wondering what aw_fcs meant. I notice that later on you describe
> it as an Assure Write Frame Check Sequence byte. You could add a
> comment next to this function :)
> 

Agreed. I'll add a comment like you suggested.

> Instead of casing to u8 every time you call this, you could have this
> take a struct peci_xfer_msg * and cast when calling crc8.
> 

Yes, that would be neater. Will fix it.

>> +{
>> +       return crc8(peci_crc8_table, data, (size_t)len, 0);
>> +}
>> +
>> +static int __peci_xfer(struct peci_adapter *adapter, struct peci_xfer_msg *msg,
>> +                      bool do_retry, bool has_aw_fcs)
>> +{
>> +       ktime_t start, end;
>> +       s64 elapsed_ms;
>> +       int rc = 0;
>> +
>> +       /**
> 
> These are for kerneldoc, and the comments aren't kerneldoc. Replace
> them with /* instead.
> 

Okay, I'll check all comments again in this series.

>> +        * For some commands, the PECI originator may need to retry a command if
>> +        * the processor PECI client responds with a 0x8x completion code. In
>> +        * each instance, the processor PECI client may have started the
>> +        * operation but not completed it yet. When the 'retry' bit is set, the
>> +        * PECI client will ignore a new request if it exactly matches a
>> +        * previous valid request.
>> +        */
>> +
>> +       if (do_retry)
>> +               start = ktime_get();
>> +
>> +       do {
>> +               rc = adapter->xfer(adapter, msg);
>> +
>> +               if (!do_retry || rc)
>> +                       break;
>> +
>> +               if (msg->rx_buf[0] == DEV_PECI_CC_SUCCESS)
>> +                       break;
>> +
>> +               /* Retry is needed when completion code is 0x8x */
>> +               if ((msg->rx_buf[0] & DEV_PECI_CC_RETRY_CHECK_MASK) !=
>> +                   DEV_PECI_CC_NEED_RETRY) {
>> +                       rc = -EIO;
>> +                       break;
>> +               }
>> +
>> +               /* Set the retry bit to indicate a retry attempt */
>> +               msg->tx_buf[1] |= DEV_PECI_RETRY_BIT;
>> +
>> +               /* Recalculate the AW FCS if it has one */
>> +               if (has_aw_fcs)
>> +                       msg->tx_buf[msg->tx_len - 1] = 0x80 ^
> 
> Can we guarantee that msg->tx_len will be set to non-zero whenever has_aw_fcs?
> 
> I suggest checking before doing the assignment in case a new caller is
> added and they make a mistake.
> 

The msg->tx_len is already checked by callers - peci_ioctl_wr_pkg_cfg()
and peci_ioctl_wr_pci_cfg_local() - so it's not needed to be checked
again at here.

>> +static int peci_ioctl_get_dib(struct peci_adapter *adapter, void *vmsg)
>> +{
>> +       struct peci_get_dib_msg *umsg = vmsg;
>> +       struct peci_xfer_msg msg;
>> +       int rc;
>> +
>> +       msg.addr      = umsg->addr;
>> +       msg.tx_len    = GET_DIB_WR_LEN;
>> +       msg.rx_len    = GET_DIB_RD_LEN;
>> +       msg.tx_buf[0] = GET_DIB_PECI_CMD;
>> +
>> +       rc = peci_xfer(adapter, &msg);
> 
> Most of tx_buf is going to be uninitialised. I assume a well behaving
> adapter->xfer will check this and only send the correct number of
> bytes, but it might pay to zero out struct peci_xfer_msg in all of
> these functions?
> 

The tx_buf will be initialized only amounts it needs to be in each
command. The adapter->xfer is handling exactly up to msg->tx_len so
it would be better keep the current code without using zeroing out the
struct.

>> +       if (rc)
>> +               return rc;
>> +
>> +       umsg->dib = le64_to_cpup((__le64 *)msg.rx_buf);
>> +
>> +       return 0;
>> +}
> 
>> +
>> +#if IS_ENABLED(CONFIG_OF)
>> +static struct peci_client *peci_of_register_device(struct peci_adapter *adapter,
>> +                                                  struct device_node *node)
>> +{
>> +       struct peci_board_info info = {};
>> +       struct peci_client *result;
>> +       const __be32 *addr_be;
>> +       int len;
>> +
>> +       dev_dbg(&adapter->dev, "register %pOF\n", node);
>> +
>> +       if (of_modalias_node(node, info.type, sizeof(info.type)) < 0) {
> 
> I don't understand why you're doing this. Won't this always be peci,
> as your binding requires?
> 

Since it supports only 'intel,peci-client' for now, it's not needed
actually. I'll drop it at this time. It would be added later if needed.

>> +               dev_err(&adapter->dev, "modalias failure on %pOF\n", node);
>> +               return ERR_PTR(-EINVAL);
>> +       }
>> +
>> +       addr_be = of_get_property(node, "reg", &len);
>> +       if (!addr_be || len < sizeof(*addr_be)) {
> 
> The second check looks suspicious.
> 
> You could fix it to check the expected length (4), or use of_property_read_u32.
> 

Right, I'll fix it using of_property_read_u32.

>> +               dev_err(&adapter->dev, "invalid reg on %pOF\n", node);
>> +               return ERR_PTR(-EINVAL);
>> +       }
>> +
>> +       info.addr = be32_to_cpup(addr_be);
>> +       info.of_node = of_node_get(node);
>> +
>> +       result = peci_new_device(adapter, &info);
>> +       if (!result)
> 
> Should you do an of_node_put here?
> 

Oh, this code is definitely incorrect. I should to put the of_node
reference at here if peci_new_device() fails and should keep the
reference until peci_unregister_device() is called. Will fix it.
Thank you for your pointing it out!

>> +               result = ERR_PTR(-EINVAL);
>> +
>> +       of_node_put(node);
> 
> Why do you release the reference here?
> 

This is incorrect too. Will move the of_node_get/put code into
peci_new_device() and peci_unregister_device() to avoid confusion.
Thanks again for your pointing it out. :)

>> +       return result;
>> +}
>> +
>
gregkh@linuxfoundation.org Jan. 22, 2019, 1:20 p.m. UTC | #3
On Mon, Jan 07, 2019 at 01:41:27PM -0800, Jae Hyun Yoo wrote:
> +config PECI
> +	bool "PECI support"
> +	select RT_MUTEXES
> +	select CRC8
> +	help
> +	  The Platform Environment Control Interface (PECI) is a one-wire bus
> +	  interface that provides a communication channel from Intel processors
> +	  and chipset components to external monitoring or control devices.

Why can't this be built as a module?

And why do you rely on rt mutexes?

Anyway, the driver core interaction looks good, I have a few other minor
comments on the ioctl handling:

> +typedef int (*peci_ioctl_fn_type)(struct peci_adapter *, void *);
> +
> +static const peci_ioctl_fn_type peci_ioctl_fn[PECI_CMD_MAX] = {
> +	peci_ioctl_xfer,
> +	peci_ioctl_ping,
> +	peci_ioctl_get_dib,
> +	peci_ioctl_get_temp,
> +	peci_ioctl_rd_pkg_cfg,
> +	peci_ioctl_wr_pkg_cfg,
> +	peci_ioctl_rd_ia_msr,
> +	NULL, /* Reserved */
> +	peci_ioctl_rd_pci_cfg,
> +	NULL, /* Reserved */
> +	peci_ioctl_rd_pci_cfg_local,
> +	peci_ioctl_wr_pci_cfg_local,
> +};

That's "interesting", why not have a list that has the ioctl number, and
the function pointer?  That way no need for "reserved" entries, and you
don't have to add a new item to your "is this a valid ioctl" check
below:

> +static long peci_ioctl(struct file *file, unsigned int iocmd, unsigned long arg)
> +{
> +	struct peci_adapter *adapter = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	unsigned int msg_len;
> +	enum peci_cmd cmd;
> +	int rc = 0;

No need to set rc here.

> +	u8 *msg;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

Really?  Nice, you have userspace tools running as root, what could go
wrong... :)

> +
> +	dev_dbg(&adapter->dev, "ioctl, cmd=0x%x, arg=0x%lx\n", iocmd, arg);

I understand the need for this type of thing when debugging the code
initially, but you can remove them all now, ftrace should provide what
you need now, right?

> +
> +	switch (iocmd) {
> +	case PECI_IOC_XFER:
> +	case PECI_IOC_PING:
> +	case PECI_IOC_GET_DIB:
> +	case PECI_IOC_GET_TEMP:
> +	case PECI_IOC_RD_PKG_CFG:
> +	case PECI_IOC_WR_PKG_CFG:
> +	case PECI_IOC_RD_IA_MSR:
> +	case PECI_IOC_RD_PCI_CFG:
> +	case PECI_IOC_RD_PCI_CFG_LOCAL:
> +	case PECI_IOC_WR_PCI_CFG_LOCAL:
> +		cmd = _IOC_NR(iocmd);
> +		msg_len = _IOC_SIZE(iocmd);
> +		break;

That check up there can be replaced with an iteration over the list of
ioctl functions, right?


> +
> +	default:
> +		dev_dbg(&adapter->dev, "Invalid ioctl cmd : 0x%x\n", iocmd);
> +		return -ENOTTY;
> +	}
> +
> +	if (!access_ok(argp, msg_len))
> +		return -EFAULT;
> +
> +	msg = memdup_user(argp, msg_len);
> +	if (IS_ERR(msg))
> +		return PTR_ERR(msg);

Why call access_ok() if you are going to copy the memory anyways?
memdup_user should fail with -EFAULT if you can't copy the memory
properly.

> +
> +	rc = peci_command(adapter, cmd, msg);

Are you _SURE_ you audited your ioctls to not do foolish things with
userspace values?  I sure didn't, so can you please have someone else do
this and sign off on this patch that they did so?

> +static int peci_detect(struct peci_adapter *adapter, u8 addr)
> +{
> +	struct peci_ping_msg msg;
> +
> +	msg.addr = addr;

What about the un-initialized fields in this structure?  Can you
properly handle that, and also, is this ok to be on the stack?

> +static ssize_t peci_sysfs_new_device(struct device *dev,
> +				     struct device_attribute *attr,
> +				     const char *buf, size_t count)
> +{
> +	struct peci_adapter *adapter = to_peci_adapter(dev);
> +	struct peci_board_info info = {};
> +	struct peci_client *client;
> +	char *blank, end;
> +	int rc;
> +
> +	/* Parse device type */
> +	blank = strchr(buf, ' ');
> +	if (!blank) {
> +		dev_err(dev, "%s: Missing parameters\n", "new_device");
> +		return -EINVAL;
> +	}
> +	if (blank - buf > PECI_NAME_SIZE - 1) {
> +		dev_err(dev, "%s: Invalid device type\n", "new_device");
> +		return -EINVAL;
> +	}
> +	memcpy(info.type, buf, blank - buf);
> +
> +	/* Parse remaining parameters, reject extra parameters */
> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);

Please do not tell me you are parsing a sysfs write to do some type of
new configuration.  That is not what sysfs is for, that is what configfs
is for, please use that instead, this isn't ok.



> +	if (rc < 1) {
> +		dev_err(dev, "%s: Can't parse client address\n", "new_device");
> +		return -EINVAL;
> +	}
> +	if (rc > 1  && end != '\n') {
> +		dev_err(dev, "%s: Extra parameters\n", "new_device");
> +		return -EINVAL;
> +	}
> +
> +	client = peci_new_device(adapter, &info);
> +	if (!client)
> +		return -EINVAL;
> +
> +	/* Keep track of the added device */
> +	mutex_lock(&adapter->userspace_clients_lock);
> +	list_add_tail(&client->detected, &adapter->userspace_clients);
> +	mutex_unlock(&adapter->userspace_clients_lock);
> +	dev_info(dev, "%s: Instantiated device %s at 0x%02hx\n", "new_device",
> +		 info.type, info.addr);

Don't be noisy for things that are expected to happen.


> +
> +	return count;
> +}
> +static DEVICE_ATTR(new_device, 0200, NULL, peci_sysfs_new_device);

Not that you should be doing this, but DEVICE_ATTR_RO() is what you want
here, right?


> +
> +static ssize_t peci_sysfs_delete_device(struct device *dev,
> +					struct device_attribute *attr,
> +					const char *buf, size_t count)
> +{
> +	struct peci_adapter *adapter = to_peci_adapter(dev);
> +	struct peci_client *client, *next;
> +	struct peci_board_info info = {};
> +	struct peci_driver *driver;
> +	char *blank, end;
> +	int rc;
> +
> +	/* Parse device type */
> +	blank = strchr(buf, ' ');
> +	if (!blank) {
> +		dev_err(dev, "%s: Missing parameters\n", "delete_device");
> +		return -EINVAL;
> +	}
> +	if (blank - buf > PECI_NAME_SIZE - 1) {
> +		dev_err(dev, "%s: Invalid device type\n", "delete_device");
> +		return -EINVAL;
> +	}
> +	memcpy(info.type, buf, blank - buf);
> +
> +	/* Parse remaining parameters, reject extra parameters */
> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
> +	if (rc < 1) {
> +		dev_err(dev, "%s: Can't parse client address\n",
> +			"delete_device");
> +		return -EINVAL;
> +	}
> +	if (rc > 1  && end != '\n') {
> +		dev_err(dev, "%s: Extra parameters\n", "delete_device");
> +		return -EINVAL;
> +	}

Same here, no parsing of configurations through sysfs, that is not ok.
Again, use configfs.

> +
> +	/* Make sure the device was added through sysfs */
> +	rc = -ENOENT;
> +	mutex_lock(&adapter->userspace_clients_lock);
> +	list_for_each_entry_safe(client, next, &adapter->userspace_clients,
> +				 detected) {
> +		driver = to_peci_driver(client->dev.driver);
> +
> +		if (client->addr == info.addr &&
> +		    !strncmp(client->name, info.type, PECI_NAME_SIZE)) {
> +			dev_info(dev, "%s: Deleting device %s at 0x%02hx\n",
> +				 "delete_device", client->name, client->addr);
> +			list_del(&client->detected);
> +			peci_unregister_device(client);
> +			rc = count;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&adapter->userspace_clients_lock);
> +
> +	if (rc < 0)
> +		dev_err(dev, "%s: Can't find device in list\n",
> +			"delete_device");

So you can just spam the syslog by writing odd values to the file?  Not
good.

> +
> +	return rc;
> +}
> +static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, 0200, NULL,
> +				  peci_sysfs_delete_device);

sysfs files that remove themselves are reserved for a specific circle in
hell, you really don't want to mess with them :(

> +/**
> + * struct peci_adapter - represent a PECI adapter
> + * @owner: owner module of the PECI adpater
> + * @bus_lock: mutex for exclusion of multiple callers
> + * @dev: device interface to this driver
> + * @cdev: character device object to create character device
> + * @nr: the bus number to map
> + * @name: name of the adapter
> + * @userspace_clients_lock: mutex for exclusion of clients handling
> + * @userspace_clients: list of registered clients
> + * @xfer: low-level transfer function pointer of the adapter
> + * @cmd_mask: mask for supportable PECI commands
> + *
> + * Each PECI adapter can communicate with one or more PECI client children.
> + * These make a small bus, sharing a single wired PECI connection.
> + */
> +struct peci_adapter {
> +	struct module		*owner;
> +	struct rt_mutex		bus_lock;
> +	struct device		dev;
> +	struct cdev		cdev;

Yeah, two differently reference counted variables in the same structure,
what could go wrong!  :(

Don't do this, make your cdev a pointer to be sure you get things right,
otherwise this will never work properly.

And shouldn't the character device be a class device?  Don't confuse
devices in the driver model with the interaction of them to userspace in
a specific manner, those should be separate things, right?


> +	int			nr;
> +	char			name[PECI_NAME_SIZE];

What's wrong with the name in struct device?


> +	struct mutex		userspace_clients_lock; /* clients list mutex */
> +	struct list_head	userspace_clients;
> +	int			(*xfer)(struct peci_adapter *adapter,
> +					struct peci_xfer_msg *msg);
> +	uint			cmd_mask;

u32?

> --- /dev/null
> +++ b/include/uapi/linux/peci-ioctl.h
> @@ -0,0 +1,403 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2018-2019 Intel Corporation */
> +
> +#ifndef __PECI_IOCTL_H
> +#define __PECI_IOCTL_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/* Base Address of 48d */
> +#define PECI_BASE_ADDR  0x30  /* The PECI client's default address of 0x30 */
> +#define PECI_OFFSET_MAX 8     /* Max numver of CPU clients */
> +
> +/* PCI Access */
> +#define MAX_PCI_READ_LEN 24   /* Number of bytes of the PCI Space read */

PCI?  Or PECI?  I'm confused now, what does PCI have to do with PECI?

> +
> +#define PCI_BUS0_CPU0      0x00
> +#define PCI_BUS0_CPU1      0x80
> +#define PCI_CPUBUSNO_BUS   0x00
> +#define PCI_CPUBUSNO_DEV   0x08
> +#define PCI_CPUBUSNO_FUNC  0x02
> +#define PCI_CPUBUSNO       0xcc
> +#define PCI_CPUBUSNO_1     0xd0
> +#define PCI_CPUBUSNO_VALID 0xd4

You just abused the PCI core namespace here, are you sure about that?


> +
> +/* Package Identifier Read Parameter Value */
> +#define PKG_ID_CPU_ID               0x0000  /* CPUID Info */
> +#define PKG_ID_PLATFORM_ID          0x0001  /* Platform ID */
> +#define PKG_ID_UNCORE_ID            0x0002  /* Uncore Device ID */
> +#define PKG_ID_MAX_THREAD_ID        0x0003  /* Max Thread ID */
> +#define PKG_ID_MICROCODE_REV        0x0004  /* CPU Microcode Update Revision */
> +#define PKG_ID_MACHINE_CHECK_STATUS 0x0005  /* Machine Check Status */
> +
> +/* RdPkgConfig Index */
> +#define MBX_INDEX_CPU_ID            0   /* Package Identifier Read */
> +#define MBX_INDEX_VR_DEBUG          1   /* VR Debug */
> +#define MBX_INDEX_PKG_TEMP_READ     2   /* Package Temperature Read */
> +#define MBX_INDEX_ENERGY_COUNTER    3   /* Energy counter */
> +#define MBX_INDEX_ENERGY_STATUS     4   /* DDR Energy Status */
> +#define MBX_INDEX_WAKE_MODE_BIT     5   /* "Wake on PECI" Mode bit */
> +#define MBX_INDEX_EPI               6   /* Efficient Performance Indication */
> +#define MBX_INDEX_PKG_RAPL_PERF     8   /* Pkg RAPL Performance Status Read */
> +#define MBX_INDEX_PER_CORE_DTS_TEMP 9   /* Per Core DTS Temperature Read */
> +#define MBX_INDEX_DTS_MARGIN        10  /* DTS thermal margin */
> +#define MBX_INDEX_SKT_PWR_THRTL_DUR 11  /* Socket Power Throttled Duration */
> +#define MBX_INDEX_CFG_TDP_CONTROL   12  /* TDP Config Control */
> +#define MBX_INDEX_CFG_TDP_LEVELS    13  /* TDP Config Levels */
> +#define MBX_INDEX_DDR_DIMM_TEMP     14  /* DDR DIMM Temperature */
> +#define MBX_INDEX_CFG_ICCMAX        15  /* Configurable ICCMAX */
> +#define MBX_INDEX_TEMP_TARGET       16  /* Temperature Target Read */
> +#define MBX_INDEX_CURR_CFG_LIMIT    17  /* Current Config Limit */
> +#define MBX_INDEX_DIMM_TEMP_READ    20  /* Package Thermal Status Read */
> +#define MBX_INDEX_DRAM_IMC_TMP_READ 22  /* DRAM IMC Temperature Read */
> +#define MBX_INDEX_DDR_CH_THERM_STAT 23  /* DDR Channel Thermal Status */
> +#define MBX_INDEX_PKG_POWER_LIMIT1  26  /* Package Power Limit1 */
> +#define MBX_INDEX_PKG_POWER_LIMIT2  27  /* Package Power Limit2 */
> +#define MBX_INDEX_TDP               28  /* Thermal design power minimum */
> +#define MBX_INDEX_TDP_HIGH          29  /* Thermal design power maximum */
> +#define MBX_INDEX_TDP_UNITS         30  /* Units for power/energy registers */
> +#define MBX_INDEX_RUN_TIME          31  /* Accumulated Run Time */
> +#define MBX_INDEX_CONSTRAINED_TIME  32  /* Thermally Constrained Time Read */
> +#define MBX_INDEX_TURBO_RATIO       33  /* Turbo Activation Ratio */
> +#define MBX_INDEX_DDR_RAPL_PL1      34  /* DDR RAPL PL1 */
> +#define MBX_INDEX_DDR_PWR_INFO_HIGH 35  /* DRAM Power Info Read (high) */
> +#define MBX_INDEX_DDR_PWR_INFO_LOW  36  /* DRAM Power Info Read (low) */
> +#define MBX_INDEX_DDR_RAPL_PL2      37  /* DDR RAPL PL2 */
> +#define MBX_INDEX_DDR_RAPL_STATUS   38  /* DDR RAPL Performance Status */
> +#define MBX_INDEX_DDR_HOT_ABSOLUTE  43  /* DDR Hottest Dimm Absolute Temp */
> +#define MBX_INDEX_DDR_HOT_RELATIVE  44  /* DDR Hottest Dimm Relative Temp */
> +#define MBX_INDEX_DDR_THROTTLE_TIME 45  /* DDR Throttle Time */
> +#define MBX_INDEX_DDR_THERM_STATUS  46  /* DDR Thermal Status */
> +#define MBX_INDEX_TIME_AVG_TEMP     47  /* Package time-averaged temperature */
> +#define MBX_INDEX_TURBO_RATIO_LIMIT 49  /* Turbo Ratio Limit Read */
> +#define MBX_INDEX_HWP_AUTO_OOB      53  /* HWP Autonomous Out-of-band */
> +#define MBX_INDEX_DDR_WARM_BUDGET   55  /* DDR Warm Power Budget */
> +#define MBX_INDEX_DDR_HOT_BUDGET    56  /* DDR Hot Power Budget */
> +#define MBX_INDEX_PKG_PSYS_PWR_LIM3 57  /* Package/Psys Power Limit3 */
> +#define MBX_INDEX_PKG_PSYS_PWR_LIM1 58  /* Package/Psys Power Limit1 */
> +#define MBX_INDEX_PKG_PSYS_PWR_LIM2 59  /* Package/Psys Power Limit2 */
> +#define MBX_INDEX_PKG_PSYS_PWR_LIM4 60  /* Package/Psys Power Limit4 */
> +#define MBX_INDEX_PERF_LIMIT_REASON 65  /* Performance Limit Reasons */
> +
> +/* WrPkgConfig Index */
> +#define MBX_INDEX_DIMM_AMBIENT      19
> +#define MBX_INDEX_DIMM_TEMP         24
> +
> +/* Device Specific Completion Code (CC) Definition */
> +#define DEV_PECI_CC_SUCCESS          0x40
> +#define DEV_PECI_CC_TIMEOUT          0x80
> +#define DEV_PECI_CC_OUT_OF_RESOURCE  0x81
> +#define DEV_PECI_CC_UNAVAIL_RESOURCE 0x82
> +#define DEV_PECI_CC_INVALID_REQ      0x90
> +
> +/* Completion Code mask to check retry needs */
> +#define DEV_PECI_CC_RETRY_CHECK_MASK 0xf0
> +#define DEV_PECI_CC_NEED_RETRY       0x80
> +
> +/* Skylake EDS says to retry for 250ms */
> +#define DEV_PECI_RETRY_TIME_MS       250
> +#define DEV_PECI_RETRY_INTERVAL_USEC 10000
> +#define DEV_PECI_RETRY_BIT           0x01
> +
> +#define GET_TEMP_WR_LEN   1
> +#define GET_TEMP_RD_LEN   2
> +#define GET_TEMP_PECI_CMD 0x01
> +
> +#define GET_DIB_WR_LEN   1
> +#define GET_DIB_RD_LEN   8
> +#define GET_DIB_PECI_CMD 0xf7
> +
> +#define RDPKGCFG_WRITE_LEN     5
> +#define RDPKGCFG_READ_LEN_BASE 1
> +#define RDPKGCFG_PECI_CMD      0xa1
> +
> +#define WRPKGCFG_WRITE_LEN_BASE 6
> +#define WRPKGCFG_READ_LEN       1
> +#define WRPKGCFG_PECI_CMD       0xa5
> +
> +#define RDIAMSR_WRITE_LEN 5
> +#define RDIAMSR_READ_LEN  9
> +#define RDIAMSR_PECI_CMD  0xb1
> +
> +#define WRIAMSR_PECI_CMD  0xb5
> +
> +#define RDPCICFG_WRITE_LEN 6
> +#define RDPCICFG_READ_LEN  5
> +#define RDPCICFG_PECI_CMD  0x61
> +
> +#define WRPCICFG_PECI_CMD  0x65
> +
> +#define RDPCICFGLOCAL_WRITE_LEN     5
> +#define RDPCICFGLOCAL_READ_LEN_BASE 1
> +#define RDPCICFGLOCAL_PECI_CMD      0xe1
> +
> +#define WRPCICFGLOCAL_WRITE_LEN_BASE 6
> +#define WRPCICFGLOCAL_READ_LEN       1
> +#define WRPCICFGLOCAL_PECI_CMD       0xe5

Alignment?

> +
> +#define PECI_BUFFER_SIZE 32

Why don't all of these have PECI_ at the front of them?

> +/**
> + * enum peci_cmd - PECI client commands
> + * @PECI_CMD_XFER: raw PECI transfer
> + * @PECI_CMD_PING: ping, a required message for all PECI devices
> + * @PECI_CMD_GET_DIB: get DIB (Device Info Byte)
> + * @PECI_CMD_GET_TEMP: get maximum die temperature
> + * @PECI_CMD_RD_PKG_CFG: read access to the PCS (Package Configuration Space)
> + * @PECI_CMD_WR_PKG_CFG: write access to the PCS (Package Configuration Space)
> + * @PECI_CMD_RD_IA_MSR: read access to MSRs (Model Specific Registers)
> + * @PECI_CMD_WR_IA_MSR: write access to MSRs (Model Specific Registers)
> + * @PECI_CMD_RD_PCI_CFG: sideband read access to the PCI configuration space
> + *	maintained in downstream devices external to the processor
> + * @PECI_CMD_WR_PCI_CFG: sideband write access to the PCI configuration space
> + *	maintained in downstream devices external to the processor
> + * @PECI_CMD_RD_PCI_CFG_LOCAL: sideband read access to the PCI configuration
> + *	space that resides within the processor
> + * @PECI_CMD_WR_PCI_CFG_LOCAL: sideband write access to the PCI configuration
> + *	space that resides within the processor
> + *
> + * Available commands depend on client's PECI revision.
> + */
> +enum peci_cmd {
> +	PECI_CMD_XFER = 0,
> +	PECI_CMD_PING,
> +	PECI_CMD_GET_DIB,
> +	PECI_CMD_GET_TEMP,
> +	PECI_CMD_RD_PKG_CFG,
> +	PECI_CMD_WR_PKG_CFG,
> +	PECI_CMD_RD_IA_MSR,
> +	PECI_CMD_WR_IA_MSR,
> +	PECI_CMD_RD_PCI_CFG,
> +	PECI_CMD_WR_PCI_CFG,
> +	PECI_CMD_RD_PCI_CFG_LOCAL,
> +	PECI_CMD_WR_PCI_CFG_LOCAL,
> +	PECI_CMD_MAX
> +};
> +
> +/**
> + * struct peci_xfer_msg - raw PECI transfer command
> + * @addr; address of the client
> + * @tx_len: number of data to be written in bytes
> + * @rx_len: number of data to be read in bytes
> + * @tx_buf: data to be written, or NULL
> + * @rx_buf: data to be read, or NULL
> + *
> + * raw PECI transfer
> + */
> +struct peci_xfer_msg {
> +	__u8 addr;
> +	__u8 tx_len;
> +	__u8 rx_len;
> +	__u8 tx_buf[PECI_BUFFER_SIZE];
> +	__u8 rx_buf[PECI_BUFFER_SIZE];
> +} __attribute__((__packed__));

Go kick the hardware engineer who did not align things on a 32bit
boundry.  They should have known better :(

> +
> +/**
> + * struct peci_ping_msg - ping command
> + * @addr: address of the client
> + *
> + * Ping() is a required message for all PECI devices. This message is used to
> + * enumerate devices or determine if a device has been removed, been
> + * powered-off, etc.
> + */
> +struct peci_ping_msg {
> +	__u8 addr;
> +} __attribute__((__packed__));
> +
> +/**
> + * struct peci_get_dib_msg - GetDIB command
> + * @addr: address of the client
> + * @dib: DIB data to be read
> + *
> + * The processor PECI client implementation of GetDIB() includes an 8-byte
> + * response and provides information regarding client revision number and the
> + * number of supported domains. All processor PECI clients support the GetDIB()
> + * command.
> + */
> +struct peci_get_dib_msg {
> +	__u8  addr;
> +	__u64 dib;
> +} __attribute__((__packed__));

Ick, really?  That's horrible, again, go kick them hard!

> +/**
> + * struct peci_get_temp_msg - GetTemp command
> + * @addr: address of the client
> + * @temp_raw: raw temperature data to be read
> + *
> + * The GetTemp() command is used to retrieve the maximum die temperature from a
> + * target PECI address. The temperature is used by the external thermal
> + * management system to regulate the temperature on the die. The data is
> + * returned as a negative value representing the number of degrees centigrade
> + * below the maximum processor junction temperature.
> + */
> +struct peci_get_temp_msg {
> +	__u8  addr;
> +	__s16 temp_raw;
> +} __attribute__((__packed__));

{kick kick kick}

> +
> +/**
> + * struct peci_rd_pkg_cfg_msg - RdPkgConfig command
> + * @addr: address of the client
> + * @index: encoding index for the requested service
> + * @param: specific data being requested
> + * @rx_len: number of data to be read in bytes
> + * @pkg_config: package config data to be read
> + *
> + * The RdPkgConfig() command provides read access to the Package Configuration
> + * Space (PCS) within the processor, including various power and thermal
> + * management functions. Typical PCS read services supported by the processor
> + * may include access to temperature data, energy status, run time information,
> + * DIMM temperatures and so on.
> + */
> +struct peci_rd_pkg_cfg_msg {
> +	__u8  addr;
> +	__u8  index;
> +	__u16 param;
> +	__u8  rx_len;
> +	__u8  pkg_config[4];
> +} __attribute__((__packed__));

Ok, that's better, someone finally wisened up.

> +
> +/**
> + * struct peci_wr_pkg_cfg_msg - WrPkgConfig command
> + * @addr: address of the client
> + * @index: encoding index for the requested service
> + * @param: specific data being requested
> + * @tx_len: number of data to be written in bytes
> + * @value: package config data to be written
> + *
> + * The WrPkgConfig() command provides write access to the Package Configuration
> + * Space (PCS) within the processor, including various power and thermal
> + * management functions. Typical PCS write services supported by the processor
> + * may include power limiting, thermal averaging constant programming and so on.
> + */
> +struct peci_wr_pkg_cfg_msg {
> +	__u8  addr;
> +	__u8  index;
> +	__u16 param;
> +	__u8  tx_len;
> +	__u32 value;
> +} __attribute__((__packed__));

No they didn't, ick, no one cares about speed here I guess.  Oh well :(

> +
> +/**
> + * struct peci_rd_ia_msr_msg - RdIAMSR command
> + * @addr: address of the client
> + * @thread_id: ID of the specific logical processor
> + * @address: address of MSR to read from
> + * @value: data to be read
> + *
> + * The RdIAMSR() PECI command provides read access to Model Specific Registers
> + * (MSRs) defined in the processor's Intel Architecture (IA).
> + */
> +struct peci_rd_ia_msr_msg {
> +	__u8  addr;
> +	__u8  thread_id;
> +	__u16 address;
> +	__u64 value;
> +} __attribute__((__packed__));

They got lucky!

oh well,

greg k-h
Jae Hyun Yoo Jan. 23, 2019, 9:38 p.m. UTC | #4
Hi Greg,

Thanks for sharing your time on reviewing this patch. Please find my
answers below.

On 1/22/2019 5:20 AM, Greg Kroah-Hartman wrote:
> On Mon, Jan 07, 2019 at 01:41:27PM -0800, Jae Hyun Yoo wrote:
>> +config PECI
>> +	bool "PECI support"
>> +	select RT_MUTEXES
>> +	select CRC8
>> +	help
>> +	  The Platform Environment Control Interface (PECI) is a one-wire bus
>> +	  interface that provides a communication channel from Intel processors
>> +	  and chipset components to external monitoring or control devices.
> 
> Why can't this be built as a module?

This PECI core driver should be prepared ahead of any PECI adapter
driver registration which can be called from 'kernel_init' if the
adapter driver is built-in. So this driver uses 'postcore_initcall' and
it's the reason why it can't be built as a module.

> And why do you rely on rt mutexes?

I intended to use that for giving priorities on each PECI sideband
function. For an example, Crach Dump sideband function should have a
higher priority than Temperature Monitoring sideband function. But, in
this initial implementation, it doesn't have any priority adjustment
code on the rt mutext yet so I'll remove this config dependency and will
replace rt_mutex with mutex for now. It could be replaced back with
rt_mutex later when it's actually needed.

> Anyway, the driver core interaction looks good, I have a few other minor
> comments on the ioctl handling:
> 
>> +typedef int (*peci_ioctl_fn_type)(struct peci_adapter *, void *);
>> +
>> +static const peci_ioctl_fn_type peci_ioctl_fn[PECI_CMD_MAX] = {
>> +	peci_ioctl_xfer,
>> +	peci_ioctl_ping,
>> +	peci_ioctl_get_dib,
>> +	peci_ioctl_get_temp,
>> +	peci_ioctl_rd_pkg_cfg,
>> +	peci_ioctl_wr_pkg_cfg,
>> +	peci_ioctl_rd_ia_msr,
>> +	NULL, /* Reserved */
>> +	peci_ioctl_rd_pci_cfg,
>> +	NULL, /* Reserved */
>> +	peci_ioctl_rd_pci_cfg_local,
>> +	peci_ioctl_wr_pci_cfg_local,
>> +};
> 
> That's "interesting", why not have a list that has the ioctl number, and
> the function pointer?  That way no need for "reserved" entries, and you
> don't have to add a new item to your "is this a valid ioctl" check
> below:

Yes, that would be nicer. Thanks for your pointing it out. I'll add the
ioctl number and the function pointer for the reserved entries so that
the handler function can report an error value to indicate that the
command is not implemented yet.

>> +static long peci_ioctl(struct file *file, unsigned int iocmd, unsigned long arg)
>> +{
>> +	struct peci_adapter *adapter = file->private_data;
>> +	void __user *argp = (void __user *)arg;
>> +	unsigned int msg_len;
>> +	enum peci_cmd cmd;
>> +	int rc = 0;
> 
> No need to set rc here.

You are right. Will fix it.

>> +	u8 *msg;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
> 
> Really?  Nice, you have userspace tools running as root, what could go
> wrong... :)

I didn't catch your point. Should it be removed?

>> +
>> +	dev_dbg(&adapter->dev, "ioctl, cmd=0x%x, arg=0x%lx\n", iocmd, arg);
> 
> I understand the need for this type of thing when debugging the code
> initially, but you can remove them all now, ftrace should provide what
> you need now, right?

Right, this was just for early stage debugging. Will remove it.

>> +
>> +	switch (iocmd) {
>> +	case PECI_IOC_XFER:
>> +	case PECI_IOC_PING:
>> +	case PECI_IOC_GET_DIB:
>> +	case PECI_IOC_GET_TEMP:
>> +	case PECI_IOC_RD_PKG_CFG:
>> +	case PECI_IOC_WR_PKG_CFG:
>> +	case PECI_IOC_RD_IA_MSR:
>> +	case PECI_IOC_RD_PCI_CFG:
>> +	case PECI_IOC_RD_PCI_CFG_LOCAL:
>> +	case PECI_IOC_WR_PCI_CFG_LOCAL:
>> +		cmd = _IOC_NR(iocmd);
>> +		msg_len = _IOC_SIZE(iocmd);
>> +		break;
> 
> That check up there can be replaced with an iteration over the list of
> ioctl functions, right?

Right. I'll simplify it.

>> +
>> +	default:
>> +		dev_dbg(&adapter->dev, "Invalid ioctl cmd : 0x%x\n", iocmd);
>> +		return -ENOTTY;
>> +	}
>> +
>> +	if (!access_ok(argp, msg_len))
>> +		return -EFAULT;
>> +
>> +	msg = memdup_user(argp, msg_len);
>> +	if (IS_ERR(msg))
>> +		return PTR_ERR(msg);
> 
> Why call access_ok() if you are going to copy the memory anyways?
> memdup_user should fail with -EFAULT if you can't copy the memory
> properly.

I see. I'll remove the access_ok().

>> +
>> +	rc = peci_command(adapter, cmd, msg);
> 
> Are you _SURE_ you audited your ioctls to not do foolish things with
> userspace values?  I sure didn't, so can you please have someone else do
> this and sign off on this patch that they did so?

No I didn't. I'll ask someone to do ioctl io audit on this patch and
adding a signed off tag too.

>> +static int peci_detect(struct peci_adapter *adapter, u8 addr)
>> +{
>> +	struct peci_ping_msg msg;
>> +
>> +	msg.addr = addr;
> 
> What about the un-initialized fields in this structure?  Can you
> properly handle that, and also, is this ok to be on the stack?

It's fully initialized at here because the peci_ping_msg struct has only
one member:

struct peci_ping_msg {
	__u8 addr;
};

>> +static ssize_t peci_sysfs_new_device(struct device *dev,
>> +				     struct device_attribute *attr,
>> +				     const char *buf, size_t count)
>> +{
>> +	struct peci_adapter *adapter = to_peci_adapter(dev);
>> +	struct peci_board_info info = {};
>> +	struct peci_client *client;
>> +	char *blank, end;
>> +	int rc;
>> +
>> +	/* Parse device type */
>> +	blank = strchr(buf, ' ');
>> +	if (!blank) {
>> +		dev_err(dev, "%s: Missing parameters\n", "new_device");
>> +		return -EINVAL;
>> +	}
>> +	if (blank - buf > PECI_NAME_SIZE - 1) {
>> +		dev_err(dev, "%s: Invalid device type\n", "new_device");
>> +		return -EINVAL;
>> +	}
>> +	memcpy(info.type, buf, blank - buf);
>> +
>> +	/* Parse remaining parameters, reject extra parameters */
>> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
> 
> Please do not tell me you are parsing a sysfs write to do some type of
> new configuration.  That is not what sysfs is for, that is what configfs
> is for, please use that instead, this isn't ok.

This is for run-time registration of a PECI client which is connected to
a PECI bus adapter. The life cycle of this sysfs interface will be
synced with the adapter driver. Actually, it follows what I2C core
driver currently does.

>> +	if (rc < 1) {
>> +		dev_err(dev, "%s: Can't parse client address\n", "new_device");
>> +		return -EINVAL;
>> +	}
>> +	if (rc > 1  && end != '\n') {
>> +		dev_err(dev, "%s: Extra parameters\n", "new_device");
>> +		return -EINVAL;
>> +	}
>> +
>> +	client = peci_new_device(adapter, &info);
>> +	if (!client)
>> +		return -EINVAL;
>> +
>> +	/* Keep track of the added device */
>> +	mutex_lock(&adapter->userspace_clients_lock);
>> +	list_add_tail(&client->detected, &adapter->userspace_clients);
>> +	mutex_unlock(&adapter->userspace_clients_lock);
>> +	dev_info(dev, "%s: Instantiated device %s at 0x%02hx\n", "new_device",
>> +		 info.type, info.addr);
> 
> Don't be noisy for things that are expected to happen.

I think it should print out a result on a userspace request for client
module registration.

>> +
>> +	return count;
>> +}
>> +static DEVICE_ATTR(new_device, 0200, NULL, peci_sysfs_new_device);
> 
> Not that you should be doing this, but DEVICE_ATTR_RO() is what you want
> here, right?

Actually, DEVICE_ATTR_WO() is what I can use here. The reason why I used
DEVICE_ATTR() is for keeping the function naming pattern as
'peci_sysfs_new_device()' instead of 'new_device_store()'.

>> +
>> +static ssize_t peci_sysfs_delete_device(struct device *dev,
>> +					struct device_attribute *attr,
>> +					const char *buf, size_t count)
>> +{
>> +	struct peci_adapter *adapter = to_peci_adapter(dev);
>> +	struct peci_client *client, *next;
>> +	struct peci_board_info info = {};
>> +	struct peci_driver *driver;
>> +	char *blank, end;
>> +	int rc;
>> +
>> +	/* Parse device type */
>> +	blank = strchr(buf, ' ');
>> +	if (!blank) {
>> +		dev_err(dev, "%s: Missing parameters\n", "delete_device");
>> +		return -EINVAL;
>> +	}
>> +	if (blank - buf > PECI_NAME_SIZE - 1) {
>> +		dev_err(dev, "%s: Invalid device type\n", "delete_device");
>> +		return -EINVAL;
>> +	}
>> +	memcpy(info.type, buf, blank - buf);
>> +
>> +	/* Parse remaining parameters, reject extra parameters */
>> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
>> +	if (rc < 1) {
>> +		dev_err(dev, "%s: Can't parse client address\n",
>> +			"delete_device");
>> +		return -EINVAL;
>> +	}
>> +	if (rc > 1  && end != '\n') {
>> +		dev_err(dev, "%s: Extra parameters\n", "delete_device");
>> +		return -EINVAL;
>> +	}
> 
> Same here, no parsing of configurations through sysfs, that is not ok.
> Again, use configfs.

Same as above. This is for run-time deregistration of a PECI client
which is registered on a PECI bus adapter. The life cycle of this sysfs
interface will be synced with the adapter driver. Actually, it follows
what I2C core driver currently does.

>> +
>> +	/* Make sure the device was added through sysfs */
>> +	rc = -ENOENT;
>> +	mutex_lock(&adapter->userspace_clients_lock);
>> +	list_for_each_entry_safe(client, next, &adapter->userspace_clients,
>> +				 detected) {
>> +		driver = to_peci_driver(client->dev.driver);
>> +
>> +		if (client->addr == info.addr &&
>> +		    !strncmp(client->name, info.type, PECI_NAME_SIZE)) {
>> +			dev_info(dev, "%s: Deleting device %s at 0x%02hx\n",
>> +				 "delete_device", client->name, client->addr);
>> +			list_del(&client->detected);
>> +			peci_unregister_device(client);
>> +			rc = count;
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&adapter->userspace_clients_lock);
>> +
>> +	if (rc < 0)
>> +		dev_err(dev, "%s: Can't find device in list\n",
>> +			"delete_device");
> 
> So you can just spam the syslog by writing odd values to the file?  Not
> good.

Okay. I'll remove this printing out.

>> +
>> +	return rc;
>> +}
>> +static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, 0200, NULL,
>> +				  peci_sysfs_delete_device);
> 
> sysfs files that remove themselves are reserved for a specific circle in
> hell, you really don't want to mess with them :(

It cannot remove itself. The owner of the sysfs files is an adapter
driver and only client devices can be added/removed by these sysfs
files. An adapter can't be removed by accessing of this sysfs.

>> +/**
>> + * struct peci_adapter - represent a PECI adapter
>> + * @owner: owner module of the PECI adpater
>> + * @bus_lock: mutex for exclusion of multiple callers
>> + * @dev: device interface to this driver
>> + * @cdev: character device object to create character device
>> + * @nr: the bus number to map
>> + * @name: name of the adapter
>> + * @userspace_clients_lock: mutex for exclusion of clients handling
>> + * @userspace_clients: list of registered clients
>> + * @xfer: low-level transfer function pointer of the adapter
>> + * @cmd_mask: mask for supportable PECI commands
>> + *
>> + * Each PECI adapter can communicate with one or more PECI client children.
>> + * These make a small bus, sharing a single wired PECI connection.
>> + */
>> +struct peci_adapter {
>> +	struct module		*owner;
>> +	struct rt_mutex		bus_lock;
>> +	struct device		dev;
>> +	struct cdev		cdev;
> 
> Yeah, two differently reference counted variables in the same structure,
> what could go wrong!  :(
> 
> Don't do this, make your cdev a pointer to be sure you get things right,
> otherwise this will never work properly.
> 
> And shouldn't the character device be a class device?  Don't confuse
> devices in the driver model with the interaction of them to userspace in
> a specific manner, those should be separate things, right?

Okay, I got your point. I'll split out the character device part as
a separated module and will make the module can be attached to a PECI
bus.

>> +	int			nr;
>> +	char			name[PECI_NAME_SIZE];
> 
> What's wrong with the name in struct device?

The name in struct device is used for indicating generic PECI bus
topology like below example:

Adapter devices: 'peci-0' where 0 means bus number.
Client devices: '0-30' where 0 means bus number, 30 means hexadecimal
                        address of the client device.

These devices can be found in /sys/bus/peci/devices folder.

The name at here will contain a name which is assigned by adapter
driver. In case of Aspeed PECI adapter driver in this series, it shows
'1e78b000.peci-bus'.

>> +	struct mutex		userspace_clients_lock; /* clients list mutex */
>> +	struct list_head	userspace_clients;
>> +	int			(*xfer)(struct peci_adapter *adapter,
>> +					struct peci_xfer_msg *msg);
>> +	uint			cmd_mask;
> 
> u32?

Yes. Will fix it.

>> --- /dev/null
>> +++ b/include/uapi/linux/peci-ioctl.h
>> @@ -0,0 +1,403 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2018-2019 Intel Corporation */
>> +
>> +#ifndef __PECI_IOCTL_H
>> +#define __PECI_IOCTL_H
>> +
>> +#include <linux/ioctl.h>
>> +#include <linux/types.h>
>> +
>> +/* Base Address of 48d */
>> +#define PECI_BASE_ADDR  0x30  /* The PECI client's default address of 0x30 */
>> +#define PECI_OFFSET_MAX 8     /* Max numver of CPU clients */
>> +
>> +/* PCI Access */
>> +#define MAX_PCI_READ_LEN 24   /* Number of bytes of the PCI Space read */
> 
> PCI?  Or PECI?  I'm confused now, what does PCI have to do with PECI?

It defines a constant for the PCI configuration space access through
a PECI bus. I'll add 'PECI_' at the front.

>> +
>> +#define PCI_BUS0_CPU0      0x00
>> +#define PCI_BUS0_CPU1      0x80
>> +#define PCI_CPUBUSNO_BUS   0x00
>> +#define PCI_CPUBUSNO_DEV   0x08
>> +#define PCI_CPUBUSNO_FUNC  0x02
>> +#define PCI_CPUBUSNO       0xcc
>> +#define PCI_CPUBUSNO_1     0xd0
>> +#define PCI_CPUBUSNO_VALID 0xd4
> 
> You just abused the PCI core namespace here, are you sure about that?

I'll add 'PECI_' at the front of them.

>> +
>> +/* Package Identifier Read Parameter Value */
>> +#define PKG_ID_CPU_ID               0x0000  /* CPUID Info */
>> +#define PKG_ID_PLATFORM_ID          0x0001  /* Platform ID */
>> +#define PKG_ID_UNCORE_ID            0x0002  /* Uncore Device ID */
>> +#define PKG_ID_MAX_THREAD_ID        0x0003  /* Max Thread ID */
>> +#define PKG_ID_MICROCODE_REV        0x0004  /* CPU Microcode Update Revision */
>> +#define PKG_ID_MACHINE_CHECK_STATUS 0x0005  /* Machine Check Status */
>> +
>> +/* RdPkgConfig Index */
>> +#define MBX_INDEX_CPU_ID            0   /* Package Identifier Read */
>> +#define MBX_INDEX_VR_DEBUG          1   /* VR Debug */
>> +#define MBX_INDEX_PKG_TEMP_READ     2   /* Package Temperature Read */
>> +#define MBX_INDEX_ENERGY_COUNTER    3   /* Energy counter */
>> +#define MBX_INDEX_ENERGY_STATUS     4   /* DDR Energy Status */
>> +#define MBX_INDEX_WAKE_MODE_BIT     5   /* "Wake on PECI" Mode bit */
>> +#define MBX_INDEX_EPI               6   /* Efficient Performance Indication */
>> +#define MBX_INDEX_PKG_RAPL_PERF     8   /* Pkg RAPL Performance Status Read */
>> +#define MBX_INDEX_PER_CORE_DTS_TEMP 9   /* Per Core DTS Temperature Read */
>> +#define MBX_INDEX_DTS_MARGIN        10  /* DTS thermal margin */
>> +#define MBX_INDEX_SKT_PWR_THRTL_DUR 11  /* Socket Power Throttled Duration */
>> +#define MBX_INDEX_CFG_TDP_CONTROL   12  /* TDP Config Control */
>> +#define MBX_INDEX_CFG_TDP_LEVELS    13  /* TDP Config Levels */
>> +#define MBX_INDEX_DDR_DIMM_TEMP     14  /* DDR DIMM Temperature */
>> +#define MBX_INDEX_CFG_ICCMAX        15  /* Configurable ICCMAX */
>> +#define MBX_INDEX_TEMP_TARGET       16  /* Temperature Target Read */
>> +#define MBX_INDEX_CURR_CFG_LIMIT    17  /* Current Config Limit */
>> +#define MBX_INDEX_DIMM_TEMP_READ    20  /* Package Thermal Status Read */
>> +#define MBX_INDEX_DRAM_IMC_TMP_READ 22  /* DRAM IMC Temperature Read */
>> +#define MBX_INDEX_DDR_CH_THERM_STAT 23  /* DDR Channel Thermal Status */
>> +#define MBX_INDEX_PKG_POWER_LIMIT1  26  /* Package Power Limit1 */
>> +#define MBX_INDEX_PKG_POWER_LIMIT2  27  /* Package Power Limit2 */
>> +#define MBX_INDEX_TDP               28  /* Thermal design power minimum */
>> +#define MBX_INDEX_TDP_HIGH          29  /* Thermal design power maximum */
>> +#define MBX_INDEX_TDP_UNITS         30  /* Units for power/energy registers */
>> +#define MBX_INDEX_RUN_TIME          31  /* Accumulated Run Time */
>> +#define MBX_INDEX_CONSTRAINED_TIME  32  /* Thermally Constrained Time Read */
>> +#define MBX_INDEX_TURBO_RATIO       33  /* Turbo Activation Ratio */
>> +#define MBX_INDEX_DDR_RAPL_PL1      34  /* DDR RAPL PL1 */
>> +#define MBX_INDEX_DDR_PWR_INFO_HIGH 35  /* DRAM Power Info Read (high) */
>> +#define MBX_INDEX_DDR_PWR_INFO_LOW  36  /* DRAM Power Info Read (low) */
>> +#define MBX_INDEX_DDR_RAPL_PL2      37  /* DDR RAPL PL2 */
>> +#define MBX_INDEX_DDR_RAPL_STATUS   38  /* DDR RAPL Performance Status */
>> +#define MBX_INDEX_DDR_HOT_ABSOLUTE  43  /* DDR Hottest Dimm Absolute Temp */
>> +#define MBX_INDEX_DDR_HOT_RELATIVE  44  /* DDR Hottest Dimm Relative Temp */
>> +#define MBX_INDEX_DDR_THROTTLE_TIME 45  /* DDR Throttle Time */
>> +#define MBX_INDEX_DDR_THERM_STATUS  46  /* DDR Thermal Status */
>> +#define MBX_INDEX_TIME_AVG_TEMP     47  /* Package time-averaged temperature */
>> +#define MBX_INDEX_TURBO_RATIO_LIMIT 49  /* Turbo Ratio Limit Read */
>> +#define MBX_INDEX_HWP_AUTO_OOB      53  /* HWP Autonomous Out-of-band */
>> +#define MBX_INDEX_DDR_WARM_BUDGET   55  /* DDR Warm Power Budget */
>> +#define MBX_INDEX_DDR_HOT_BUDGET    56  /* DDR Hot Power Budget */
>> +#define MBX_INDEX_PKG_PSYS_PWR_LIM3 57  /* Package/Psys Power Limit3 */
>> +#define MBX_INDEX_PKG_PSYS_PWR_LIM1 58  /* Package/Psys Power Limit1 */
>> +#define MBX_INDEX_PKG_PSYS_PWR_LIM2 59  /* Package/Psys Power Limit2 */
>> +#define MBX_INDEX_PKG_PSYS_PWR_LIM4 60  /* Package/Psys Power Limit4 */
>> +#define MBX_INDEX_PERF_LIMIT_REASON 65  /* Performance Limit Reasons */
>> +
>> +/* WrPkgConfig Index */
>> +#define MBX_INDEX_DIMM_AMBIENT      19
>> +#define MBX_INDEX_DIMM_TEMP         24
>> +
>> +/* Device Specific Completion Code (CC) Definition */
>> +#define DEV_PECI_CC_SUCCESS          0x40
>> +#define DEV_PECI_CC_TIMEOUT          0x80
>> +#define DEV_PECI_CC_OUT_OF_RESOURCE  0x81
>> +#define DEV_PECI_CC_UNAVAIL_RESOURCE 0x82
>> +#define DEV_PECI_CC_INVALID_REQ      0x90
>> +
>> +/* Completion Code mask to check retry needs */
>> +#define DEV_PECI_CC_RETRY_CHECK_MASK 0xf0
>> +#define DEV_PECI_CC_NEED_RETRY       0x80
>> +
>> +/* Skylake EDS says to retry for 250ms */
>> +#define DEV_PECI_RETRY_TIME_MS       250
>> +#define DEV_PECI_RETRY_INTERVAL_USEC 10000
>> +#define DEV_PECI_RETRY_BIT           0x01
>> +
>> +#define GET_TEMP_WR_LEN   1
>> +#define GET_TEMP_RD_LEN   2
>> +#define GET_TEMP_PECI_CMD 0x01
>> +
>> +#define GET_DIB_WR_LEN   1
>> +#define GET_DIB_RD_LEN   8
>> +#define GET_DIB_PECI_CMD 0xf7
>> +
>> +#define RDPKGCFG_WRITE_LEN     5
>> +#define RDPKGCFG_READ_LEN_BASE 1
>> +#define RDPKGCFG_PECI_CMD      0xa1
>> +
>> +#define WRPKGCFG_WRITE_LEN_BASE 6
>> +#define WRPKGCFG_READ_LEN       1
>> +#define WRPKGCFG_PECI_CMD       0xa5
>> +
>> +#define RDIAMSR_WRITE_LEN 5
>> +#define RDIAMSR_READ_LEN  9
>> +#define RDIAMSR_PECI_CMD  0xb1
>> +
>> +#define WRIAMSR_PECI_CMD  0xb5
>> +
>> +#define RDPCICFG_WRITE_LEN 6
>> +#define RDPCICFG_READ_LEN  5
>> +#define RDPCICFG_PECI_CMD  0x61
>> +
>> +#define WRPCICFG_PECI_CMD  0x65
>> +
>> +#define RDPCICFGLOCAL_WRITE_LEN     5
>> +#define RDPCICFGLOCAL_READ_LEN_BASE 1
>> +#define RDPCICFGLOCAL_PECI_CMD      0xe1
>> +
>> +#define WRPCICFGLOCAL_WRITE_LEN_BASE 6
>> +#define WRPCICFGLOCAL_READ_LEN       1
>> +#define WRPCICFGLOCAL_PECI_CMD       0xe5
> 
> Alignment?

Will fix it.

>> +
>> +#define PECI_BUFFER_SIZE 32
> 
> Why don't all of these have PECI_ at the front of them?

Okay. I'll add 'PECI_' for all of these.

>> +/**
>> + * enum peci_cmd - PECI client commands
>> + * @PECI_CMD_XFER: raw PECI transfer
>> + * @PECI_CMD_PING: ping, a required message for all PECI devices
>> + * @PECI_CMD_GET_DIB: get DIB (Device Info Byte)
>> + * @PECI_CMD_GET_TEMP: get maximum die temperature
>> + * @PECI_CMD_RD_PKG_CFG: read access to the PCS (Package Configuration Space)
>> + * @PECI_CMD_WR_PKG_CFG: write access to the PCS (Package Configuration Space)
>> + * @PECI_CMD_RD_IA_MSR: read access to MSRs (Model Specific Registers)
>> + * @PECI_CMD_WR_IA_MSR: write access to MSRs (Model Specific Registers)
>> + * @PECI_CMD_RD_PCI_CFG: sideband read access to the PCI configuration space
>> + *	maintained in downstream devices external to the processor
>> + * @PECI_CMD_WR_PCI_CFG: sideband write access to the PCI configuration space
>> + *	maintained in downstream devices external to the processor
>> + * @PECI_CMD_RD_PCI_CFG_LOCAL: sideband read access to the PCI configuration
>> + *	space that resides within the processor
>> + * @PECI_CMD_WR_PCI_CFG_LOCAL: sideband write access to the PCI configuration
>> + *	space that resides within the processor
>> + *
>> + * Available commands depend on client's PECI revision.
>> + */
>> +enum peci_cmd {
>> +	PECI_CMD_XFER = 0,
>> +	PECI_CMD_PING,
>> +	PECI_CMD_GET_DIB,
>> +	PECI_CMD_GET_TEMP,
>> +	PECI_CMD_RD_PKG_CFG,
>> +	PECI_CMD_WR_PKG_CFG,
>> +	PECI_CMD_RD_IA_MSR,
>> +	PECI_CMD_WR_IA_MSR,
>> +	PECI_CMD_RD_PCI_CFG,
>> +	PECI_CMD_WR_PCI_CFG,
>> +	PECI_CMD_RD_PCI_CFG_LOCAL,
>> +	PECI_CMD_WR_PCI_CFG_LOCAL,
>> +	PECI_CMD_MAX
>> +};
>> +
>> +/**
>> + * struct peci_xfer_msg - raw PECI transfer command
>> + * @addr; address of the client
>> + * @tx_len: number of data to be written in bytes
>> + * @rx_len: number of data to be read in bytes
>> + * @tx_buf: data to be written, or NULL
>> + * @rx_buf: data to be read, or NULL
>> + *
>> + * raw PECI transfer
>> + */
>> +struct peci_xfer_msg {
>> +	__u8 addr;
>> +	__u8 tx_len;
>> +	__u8 rx_len;
>> +	__u8 tx_buf[PECI_BUFFER_SIZE];
>> +	__u8 rx_buf[PECI_BUFFER_SIZE];
>> +} __attribute__((__packed__));
> 
> Go kick the hardware engineer who did not align things on a 32bit
> boundry.  They should have known better :(

I intended to make it as contiguous byte sequence in this order because
some PECI commands need to caclulate CRC8 checksum using this as a byte
array. I'll align these on a 32bit boundary and will use a temporary
buffer instead while calcuating a CRC8 checksum.

>> +
>> +/**
>> + * struct peci_ping_msg - ping command
>> + * @addr: address of the client
>> + *
>> + * Ping() is a required message for all PECI devices. This message is used to
>> + * enumerate devices or determine if a device has been removed, been
>> + * powered-off, etc.
>> + */
>> +struct peci_ping_msg {
>> +	__u8 addr;
>> +} __attribute__((__packed__));
>> +
>> +/**
>> + * struct peci_get_dib_msg - GetDIB command
>> + * @addr: address of the client
>> + * @dib: DIB data to be read
>> + *
>> + * The processor PECI client implementation of GetDIB() includes an 8-byte
>> + * response and provides information regarding client revision number and the
>> + * number of supported domains. All processor PECI clients support the GetDIB()
>> + * command.
>> + */
>> +struct peci_get_dib_msg {
>> +	__u8  addr;
>> +	__u64 dib;
>> +} __attribute__((__packed__));
> 
> Ick, really?  That's horrible, again, go kick them hard!

Okay, I kicked hard myself. Will align it.

>> +/**
>> + * struct peci_get_temp_msg - GetTemp command
>> + * @addr: address of the client
>> + * @temp_raw: raw temperature data to be read
>> + *
>> + * The GetTemp() command is used to retrieve the maximum die temperature from a
>> + * target PECI address. The temperature is used by the external thermal
>> + * management system to regulate the temperature on the die. The data is
>> + * returned as a negative value representing the number of degrees centigrade
>> + * below the maximum processor junction temperature.
>> + */
>> +struct peci_get_temp_msg {
>> +	__u8  addr;
>> +	__s16 temp_raw;
>> +} __attribute__((__packed__));
> 
> {kick kick kick}

Will fix it too.

>> +
>> +/**
>> + * struct peci_rd_pkg_cfg_msg - RdPkgConfig command
>> + * @addr: address of the client
>> + * @index: encoding index for the requested service
>> + * @param: specific data being requested
>> + * @rx_len: number of data to be read in bytes
>> + * @pkg_config: package config data to be read
>> + *
>> + * The RdPkgConfig() command provides read access to the Package Configuration
>> + * Space (PCS) within the processor, including various power and thermal
>> + * management functions. Typical PCS read services supported by the processor
>> + * may include access to temperature data, energy status, run time information,
>> + * DIMM temperatures and so on.
>> + */
>> +struct peci_rd_pkg_cfg_msg {
>> +	__u8  addr;
>> +	__u8  index;
>> +	__u16 param;
>> +	__u8  rx_len;
>> +	__u8  pkg_config[4];
>> +} __attribute__((__packed__));
> 
> Ok, that's better, someone finally wisened up.
> 
>> +
>> +/**
>> + * struct peci_wr_pkg_cfg_msg - WrPkgConfig command
>> + * @addr: address of the client
>> + * @index: encoding index for the requested service
>> + * @param: specific data being requested
>> + * @tx_len: number of data to be written in bytes
>> + * @value: package config data to be written
>> + *
>> + * The WrPkgConfig() command provides write access to the Package Configuration
>> + * Space (PCS) within the processor, including various power and thermal
>> + * management functions. Typical PCS write services supported by the processor
>> + * may include power limiting, thermal averaging constant programming and so on.
>> + */
>> +struct peci_wr_pkg_cfg_msg {
>> +	__u8  addr;
>> +	__u8  index;
>> +	__u16 param;
>> +	__u8  tx_len;
>> +	__u32 value;
>> +} __attribute__((__packed__));
> 
> No they didn't, ick, no one cares about speed here I guess.  Oh well :(

Will fix it too.

>> +
>> +/**
>> + * struct peci_rd_ia_msr_msg - RdIAMSR command
>> + * @addr: address of the client
>> + * @thread_id: ID of the specific logical processor
>> + * @address: address of MSR to read from
>> + * @value: data to be read
>> + *
>> + * The RdIAMSR() PECI command provides read access to Model Specific Registers
>> + * (MSRs) defined in the processor's Intel Architecture (IA).
>> + */
>> +struct peci_rd_ia_msr_msg {
>> +	__u8  addr;
>> +	__u8  thread_id;
>> +	__u16 address;
>> +	__u64 value;
>> +} __attribute__((__packed__));
> 
> They got lucky!
> 
> oh well,
> 
> greg k-h
> 

Thanks again for your comments. I'll submit a new version after
addressing your comments.

Regards,
Jae
gregkh@linuxfoundation.org Jan. 24, 2019, 6:57 a.m. UTC | #5
On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
> Hi Greg,
> 
> Thanks for sharing your time on reviewing this patch. Please find my
> answers below.
> 
> On 1/22/2019 5:20 AM, Greg Kroah-Hartman wrote:
> > On Mon, Jan 07, 2019 at 01:41:27PM -0800, Jae Hyun Yoo wrote:
> > > +config PECI
> > > +	bool "PECI support"
> > > +	select RT_MUTEXES
> > > +	select CRC8
> > > +	help
> > > +	  The Platform Environment Control Interface (PECI) is a one-wire bus
> > > +	  interface that provides a communication channel from Intel processors
> > > +	  and chipset components to external monitoring or control devices.
> > 
> > Why can't this be built as a module?
> 
> This PECI core driver should be prepared ahead of any PECI adapter
> driver registration which can be called from 'kernel_init' if the
> adapter driver is built-in. So this driver uses 'postcore_initcall' and
> it's the reason why it can't be built as a module.

Then set up your dependancies correctly.  As an example, you can have
the USB core as a module, and you are not allowed to have USB drivers
built into the kernel.  It's not difficult to do this.  Please make your
core also be a module so that you do not burden the zillions of machines
out there with code that they do not need, just because you forced the
distro to choose "Y" or "N".

> > And why do you rely on rt mutexes?
> 
> I intended to use that for giving priorities on each PECI sideband
> function. For an example, Crach Dump sideband function should have a
> higher priority than Temperature Monitoring sideband function. But, in
> this initial implementation, it doesn't have any priority adjustment
> code on the rt mutext yet so I'll remove this config dependency and will
> replace rt_mutex with mutex for now. It could be replaced back with
> rt_mutex later when it's actually needed.

thank you.

> > > +	u8 *msg;
> > > +
> > > +	if (!capable(CAP_SYS_ADMIN))
> > > +		return -EPERM;
> > 
> > Really?  Nice, you have userspace tools running as root, what could go
> > wrong... :)
> 
> I didn't catch your point. Should it be removed?

You are forcing your userspace tools to have CAP_SYS_ADMIN in order to
talk to your device.  Why?  What is wrong with the file permissions on
the device node instead?  Why can't userspace decide what user should be
able to talk to your device?

Don't require special permissions for no good reason.  This forces
whatever userspace tool that handles this, to have enough permission to
do anything it wants in the system.  And I do not think you want that.

> > > +static int peci_detect(struct peci_adapter *adapter, u8 addr)
> > > +{
> > > +	struct peci_ping_msg msg;
> > > +
> > > +	msg.addr = addr;
> > 
> > What about the un-initialized fields in this structure?  Can you
> > properly handle that, and also, is this ok to be on the stack?
> 
> It's fully initialized at here because the peci_ping_msg struct has only
> one member:
> 
> struct peci_ping_msg {
> 	__u8 addr;
> };

Ok.  But my question about "can you do this off the stack" remains.

> > > +static ssize_t peci_sysfs_new_device(struct device *dev,
> > > +				     struct device_attribute *attr,
> > > +				     const char *buf, size_t count)
> > > +{
> > > +	struct peci_adapter *adapter = to_peci_adapter(dev);
> > > +	struct peci_board_info info = {};
> > > +	struct peci_client *client;
> > > +	char *blank, end;
> > > +	int rc;
> > > +
> > > +	/* Parse device type */
> > > +	blank = strchr(buf, ' ');
> > > +	if (!blank) {
> > > +		dev_err(dev, "%s: Missing parameters\n", "new_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	if (blank - buf > PECI_NAME_SIZE - 1) {
> > > +		dev_err(dev, "%s: Invalid device type\n", "new_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	memcpy(info.type, buf, blank - buf);
> > > +
> > > +	/* Parse remaining parameters, reject extra parameters */
> > > +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
> > 
> > Please do not tell me you are parsing a sysfs write to do some type of
> > new configuration.  That is not what sysfs is for, that is what configfs
> > is for, please use that instead, this isn't ok.
> 
> This is for run-time registration of a PECI client which is connected to
> a PECI bus adapter. The life cycle of this sysfs interface will be
> synced with the adapter driver. Actually, it follows what I2C core
> driver currently does.

What i2c core driver parses configuration options in sysfs?

Ugh, I see that now.  That's horrible.  Please do not emulate that at
all.

Again, use configfs, that is what it is there for, do not spread bad
interfaces to new places in the kernel.

> > > +	if (rc < 1) {
> > > +		dev_err(dev, "%s: Can't parse client address\n", "new_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	if (rc > 1  && end != '\n') {
> > > +		dev_err(dev, "%s: Extra parameters\n", "new_device");
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	client = peci_new_device(adapter, &info);
> > > +	if (!client)
> > > +		return -EINVAL;
> > > +
> > > +	/* Keep track of the added device */
> > > +	mutex_lock(&adapter->userspace_clients_lock);
> > > +	list_add_tail(&client->detected, &adapter->userspace_clients);
> > > +	mutex_unlock(&adapter->userspace_clients_lock);
> > > +	dev_info(dev, "%s: Instantiated device %s at 0x%02hx\n", "new_device",
> > > +		 info.type, info.addr);
> > 
> > Don't be noisy for things that are expected to happen.
> 
> I think it should print out a result on a userspace request for client
> module registration.

Why?  Who is going to do anything with this?  Userspace "knows" it
worked because their function call returned success.  It is not going to
then parse a kernel log.

> > > +
> > > +	return count;
> > > +}
> > > +static DEVICE_ATTR(new_device, 0200, NULL, peci_sysfs_new_device);
> > 
> > Not that you should be doing this, but DEVICE_ATTR_RO() is what you want
> > here, right?
> 
> Actually, DEVICE_ATTR_WO() is what I can use here. The reason why I used
> DEVICE_ATTR() is for keeping the function naming pattern as
> 'peci_sysfs_new_device()' instead of 'new_device_store()'.

Sorry, yes, WO is what you want.  But anyway, this should be in
configfs, not sysfs.

> > > +static ssize_t peci_sysfs_delete_device(struct device *dev,
> > > +					struct device_attribute *attr,
> > > +					const char *buf, size_t count)
> > > +{
> > > +	struct peci_adapter *adapter = to_peci_adapter(dev);
> > > +	struct peci_client *client, *next;
> > > +	struct peci_board_info info = {};
> > > +	struct peci_driver *driver;
> > > +	char *blank, end;
> > > +	int rc;
> > > +
> > > +	/* Parse device type */
> > > +	blank = strchr(buf, ' ');
> > > +	if (!blank) {
> > > +		dev_err(dev, "%s: Missing parameters\n", "delete_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	if (blank - buf > PECI_NAME_SIZE - 1) {
> > > +		dev_err(dev, "%s: Invalid device type\n", "delete_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	memcpy(info.type, buf, blank - buf);
> > > +
> > > +	/* Parse remaining parameters, reject extra parameters */
> > > +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
> > > +	if (rc < 1) {
> > > +		dev_err(dev, "%s: Can't parse client address\n",
> > > +			"delete_device");
> > > +		return -EINVAL;
> > > +	}
> > > +	if (rc > 1  && end != '\n') {
> > > +		dev_err(dev, "%s: Extra parameters\n", "delete_device");
> > > +		return -EINVAL;
> > > +	}
> > 
> > Same here, no parsing of configurations through sysfs, that is not ok.
> > Again, use configfs.
> 
> Same as above. This is for run-time deregistration of a PECI client
> which is registered on a PECI bus adapter. The life cycle of this sysfs
> interface will be synced with the adapter driver. Actually, it follows
> what I2C core driver currently does.

Again, do not copy previous mistakes please.

> > > +
> > > +	return rc;
> > > +}
> > > +static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, 0200, NULL,
> > > +				  peci_sysfs_delete_device);
> > 
> > sysfs files that remove themselves are reserved for a specific circle in
> > hell, you really don't want to mess with them :(
> 
> It cannot remove itself. The owner of the sysfs files is an adapter
> driver and only client devices can be added/removed by these sysfs
> files. An adapter can't be removed by accessing of this sysfs.

They why the LOCKDEP notation?

> > > +/**
> > > + * struct peci_adapter - represent a PECI adapter
> > > + * @owner: owner module of the PECI adpater
> > > + * @bus_lock: mutex for exclusion of multiple callers
> > > + * @dev: device interface to this driver
> > > + * @cdev: character device object to create character device
> > > + * @nr: the bus number to map
> > > + * @name: name of the adapter
> > > + * @userspace_clients_lock: mutex for exclusion of clients handling
> > > + * @userspace_clients: list of registered clients
> > > + * @xfer: low-level transfer function pointer of the adapter
> > > + * @cmd_mask: mask for supportable PECI commands
> > > + *
> > > + * Each PECI adapter can communicate with one or more PECI client children.
> > > + * These make a small bus, sharing a single wired PECI connection.
> > > + */
> > > +struct peci_adapter {
> > > +	struct module		*owner;
> > > +	struct rt_mutex		bus_lock;
> > > +	struct device		dev;
> > > +	struct cdev		cdev;
> > 
> > Yeah, two differently reference counted variables in the same structure,
> > what could go wrong!  :(
> > 
> > Don't do this, make your cdev a pointer to be sure you get things right,
> > otherwise this will never work properly.
> > 
> > And shouldn't the character device be a class device?  Don't confuse
> > devices in the driver model with the interaction of them to userspace in
> > a specific manner, those should be separate things, right?
> 
> Okay, I got your point. I'll split out the character device part as
> a separated module and will make the module can be attached to a PECI
> bus.

I'm not saying it has to be a whole separate module, at the least it
needs to be a new structure.  As it is, your code is not correct due to
the dual-structures-trying-to-fight-it-out-for-lifecycle-rules

> > > +/**
> > > + * struct peci_xfer_msg - raw PECI transfer command
> > > + * @addr; address of the client
> > > + * @tx_len: number of data to be written in bytes
> > > + * @rx_len: number of data to be read in bytes
> > > + * @tx_buf: data to be written, or NULL
> > > + * @rx_buf: data to be read, or NULL
> > > + *
> > > + * raw PECI transfer
> > > + */
> > > +struct peci_xfer_msg {
> > > +	__u8 addr;
> > > +	__u8 tx_len;
> > > +	__u8 rx_len;
> > > +	__u8 tx_buf[PECI_BUFFER_SIZE];
> > > +	__u8 rx_buf[PECI_BUFFER_SIZE];
> > > +} __attribute__((__packed__));
> > 
> > Go kick the hardware engineer who did not align things on a 32bit
> > boundry.  They should have known better :(
> 
> I intended to make it as contiguous byte sequence in this order because
> some PECI commands need to caclulate CRC8 checksum using this as a byte
> array. I'll align these on a 32bit boundary and will use a temporary
> buffer instead while calcuating a CRC8 checksum.

If this really is the way the structure needs to be on the wire, that's
fine, it's just a complaint about the protocol and how bad accessing
those fields are going to suck for some processors.

But if you can get away with not doing it like this, that would be nice.

thanks,

greg k-h
Jae Hyun Yoo Jan. 24, 2019, 10:01 p.m. UTC | #6
Hi Greg,

On 1/23/2019 10:57 PM, Greg Kroah-Hartman wrote:
> On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
>> Hi Greg,
>>
>> Thanks for sharing your time on reviewing this patch. Please find my
>> answers below.
>>
>> On 1/22/2019 5:20 AM, Greg Kroah-Hartman wrote:
>>> On Mon, Jan 07, 2019 at 01:41:27PM -0800, Jae Hyun Yoo wrote:
>>>> +config PECI
>>>> +	bool "PECI support"
>>>> +	select RT_MUTEXES
>>>> +	select CRC8
>>>> +	help
>>>> +	  The Platform Environment Control Interface (PECI) is a one-wire bus
>>>> +	  interface that provides a communication channel from Intel processors
>>>> +	  and chipset components to external monitoring or control devices.
>>>
>>> Why can't this be built as a module?
>>
>> This PECI core driver should be prepared ahead of any PECI adapter
>> driver registration which can be called from 'kernel_init' if the
>> adapter driver is built-in. So this driver uses 'postcore_initcall' and
>> it's the reason why it can't be built as a module.
> 
> Then set up your dependancies correctly.  As an example, you can have
> the USB core as a module, and you are not allowed to have USB drivers
> built into the kernel.  It's not difficult to do this.  Please make your
> core also be a module so that you do not burden the zillions of machines
> out there with code that they do not need, just because you forced the
> distro to choose "Y" or "N".

Okay. I'll make this core config as a tristate.

>>>> +	u8 *msg;
>>>> +
>>>> +	if (!capable(CAP_SYS_ADMIN))
>>>> +		return -EPERM;
>>>
>>> Really?  Nice, you have userspace tools running as root, what could go
>>> wrong... :)
>>
>> I didn't catch your point. Should it be removed?
> 
> You are forcing your userspace tools to have CAP_SYS_ADMIN in order to
> talk to your device.  Why?  What is wrong with the file permissions on
> the device node instead?  Why can't userspace decide what user should be
> able to talk to your device?
> 
> Don't require special permissions for no good reason.  This forces
> whatever userspace tool that handles this, to have enough permission to
> do anything it wants in the system.  And I do not think you want that.

Okay, I got your point now. Thank you. I'll remove this permission check
from here.

>>>> +static int peci_detect(struct peci_adapter *adapter, u8 addr)
>>>> +{
>>>> +	struct peci_ping_msg msg;
>>>> +
>>>> +	msg.addr = addr;
>>>
>>> What about the un-initialized fields in this structure?  Can you
>>> properly handle that, and also, is this ok to be on the stack?
>>
>> It's fully initialized at here because the peci_ping_msg struct has only
>> one member:
>>
>> struct peci_ping_msg {
>> 	__u8 addr;
>> };
> 
> Ok.  But my question about "can you do this off the stack" remains.

I'll add 3 bytes of dummy padding into this structure. Also, I'll check
again u32 boundary alignment for all struct defines in peci_ioctl.h.
Would it be okay to be on stack then?

>>>> +static ssize_t peci_sysfs_new_device(struct device *dev,
>>>> +				     struct device_attribute *attr,
>>>> +				     const char *buf, size_t count)
>>>> +{
>>>> +	struct peci_adapter *adapter = to_peci_adapter(dev);
>>>> +	struct peci_board_info info = {};
>>>> +	struct peci_client *client;
>>>> +	char *blank, end;
>>>> +	int rc;
>>>> +
>>>> +	/* Parse device type */
>>>> +	blank = strchr(buf, ' ');
>>>> +	if (!blank) {
>>>> +		dev_err(dev, "%s: Missing parameters\n", "new_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	if (blank - buf > PECI_NAME_SIZE - 1) {
>>>> +		dev_err(dev, "%s: Invalid device type\n", "new_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	memcpy(info.type, buf, blank - buf);
>>>> +
>>>> +	/* Parse remaining parameters, reject extra parameters */
>>>> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
>>>
>>> Please do not tell me you are parsing a sysfs write to do some type of
>>> new configuration.  That is not what sysfs is for, that is what configfs
>>> is for, please use that instead, this isn't ok.
>>
>> This is for run-time registration of a PECI client which is connected to
>> a PECI bus adapter. The life cycle of this sysfs interface will be
>> synced with the adapter driver. Actually, it follows what I2C core
>> driver currently does.
> 
> What i2c core driver parses configuration options in sysfs?
> 
> Ugh, I see that now.  That's horrible.  Please do not emulate that at
> all.
> 
> Again, use configfs, that is what it is there for, do not spread bad
> interfaces to new places in the kernel.

Okay, I see. I'll rewrite this interface using configfs. Thanks!

>>>> +	if (rc < 1) {
>>>> +		dev_err(dev, "%s: Can't parse client address\n", "new_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	if (rc > 1  && end != '\n') {
>>>> +		dev_err(dev, "%s: Extra parameters\n", "new_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	client = peci_new_device(adapter, &info);
>>>> +	if (!client)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Keep track of the added device */
>>>> +	mutex_lock(&adapter->userspace_clients_lock);
>>>> +	list_add_tail(&client->detected, &adapter->userspace_clients);
>>>> +	mutex_unlock(&adapter->userspace_clients_lock);
>>>> +	dev_info(dev, "%s: Instantiated device %s at 0x%02hx\n", "new_device",
>>>> +		 info.type, info.addr);
>>>
>>> Don't be noisy for things that are expected to happen.
>>
>> I think it should print out a result on a userspace request for client
>> module registration.
> 
> Why?  Who is going to do anything with this?  Userspace "knows" it
> worked because their function call returned success.  It is not going to
> then parse a kernel log.

Agree. Will change it to dev_dbg().

>>>> +
>>>> +	return count;
>>>> +}
>>>> +static DEVICE_ATTR(new_device, 0200, NULL, peci_sysfs_new_device);
>>>
>>> Not that you should be doing this, but DEVICE_ATTR_RO() is what you want
>>> here, right?
>>
>> Actually, DEVICE_ATTR_WO() is what I can use here. The reason why I used
>> DEVICE_ATTR() is for keeping the function naming pattern as
>> 'peci_sysfs_new_device()' instead of 'new_device_store()'.
> 
> Sorry, yes, WO is what you want.  But anyway, this should be in
> configfs, not sysfs.

I'll rewrite this interface using configfs.

>>>> +static ssize_t peci_sysfs_delete_device(struct device *dev,
>>>> +					struct device_attribute *attr,
>>>> +					const char *buf, size_t count)
>>>> +{
>>>> +	struct peci_adapter *adapter = to_peci_adapter(dev);
>>>> +	struct peci_client *client, *next;
>>>> +	struct peci_board_info info = {};
>>>> +	struct peci_driver *driver;
>>>> +	char *blank, end;
>>>> +	int rc;
>>>> +
>>>> +	/* Parse device type */
>>>> +	blank = strchr(buf, ' ');
>>>> +	if (!blank) {
>>>> +		dev_err(dev, "%s: Missing parameters\n", "delete_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	if (blank - buf > PECI_NAME_SIZE - 1) {
>>>> +		dev_err(dev, "%s: Invalid device type\n", "delete_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	memcpy(info.type, buf, blank - buf);
>>>> +
>>>> +	/* Parse remaining parameters, reject extra parameters */
>>>> +	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
>>>> +	if (rc < 1) {
>>>> +		dev_err(dev, "%s: Can't parse client address\n",
>>>> +			"delete_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +	if (rc > 1  && end != '\n') {
>>>> +		dev_err(dev, "%s: Extra parameters\n", "delete_device");
>>>> +		return -EINVAL;
>>>> +	}
>>>
>>> Same here, no parsing of configurations through sysfs, that is not ok.
>>> Again, use configfs.
>>
>> Same as above. This is for run-time deregistration of a PECI client
>> which is registered on a PECI bus adapter. The life cycle of this sysfs
>> interface will be synced with the adapter driver. Actually, it follows
>> what I2C core driver currently does.
> 
> Again, do not copy previous mistakes please.

Agreed. Thanks!

>>>> +
>>>> +	return rc;
>>>> +}
>>>> +static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, 0200, NULL,
>>>> +				  peci_sysfs_delete_device);
>>>
>>> sysfs files that remove themselves are reserved for a specific circle in
>>> hell, you really don't want to mess with them :(
>>
>> It cannot remove itself. The owner of the sysfs files is an adapter
>> driver and only client devices can be added/removed by these sysfs
>> files. An adapter can't be removed by accessing of this sysfs.
> 
> They why the LOCKDEP notation?

For a case of debugging. Anyway, I'll rewrite this part using configfs.

>>>> +/**
>>>> + * struct peci_adapter - represent a PECI adapter
>>>> + * @owner: owner module of the PECI adpater
>>>> + * @bus_lock: mutex for exclusion of multiple callers
>>>> + * @dev: device interface to this driver
>>>> + * @cdev: character device object to create character device
>>>> + * @nr: the bus number to map
>>>> + * @name: name of the adapter
>>>> + * @userspace_clients_lock: mutex for exclusion of clients handling
>>>> + * @userspace_clients: list of registered clients
>>>> + * @xfer: low-level transfer function pointer of the adapter
>>>> + * @cmd_mask: mask for supportable PECI commands
>>>> + *
>>>> + * Each PECI adapter can communicate with one or more PECI client children.
>>>> + * These make a small bus, sharing a single wired PECI connection.
>>>> + */
>>>> +struct peci_adapter {
>>>> +	struct module		*owner;
>>>> +	struct rt_mutex		bus_lock;
>>>> +	struct device		dev;
>>>> +	struct cdev		cdev;
>>>
>>> Yeah, two differently reference counted variables in the same structure,
>>> what could go wrong!  :(
>>>
>>> Don't do this, make your cdev a pointer to be sure you get things right,
>>> otherwise this will never work properly.
>>>
>>> And shouldn't the character device be a class device?  Don't confuse
>>> devices in the driver model with the interaction of them to userspace in
>>> a specific manner, those should be separate things, right?
>>
>> Okay, I got your point. I'll split out the character device part as
>> a separated module and will make the module can be attached to a PECI
>> bus.
> 
> I'm not saying it has to be a whole separate module, at the least it
> needs to be a new structure.  As it is, your code is not correct due to
> the dual-structures-trying-to-fight-it-out-for-lifecycle-rules

I got you point clearly now. I'll use a separate structure for the
character device.

>>>> +/**
>>>> + * struct peci_xfer_msg - raw PECI transfer command
>>>> + * @addr; address of the client
>>>> + * @tx_len: number of data to be written in bytes
>>>> + * @rx_len: number of data to be read in bytes
>>>> + * @tx_buf: data to be written, or NULL
>>>> + * @rx_buf: data to be read, or NULL
>>>> + *
>>>> + * raw PECI transfer
>>>> + */
>>>> +struct peci_xfer_msg {
>>>> +	__u8 addr;
>>>> +	__u8 tx_len;
>>>> +	__u8 rx_len;
>>>> +	__u8 tx_buf[PECI_BUFFER_SIZE];
>>>> +	__u8 rx_buf[PECI_BUFFER_SIZE];
>>>> +} __attribute__((__packed__));
>>>
>>> Go kick the hardware engineer who did not align things on a 32bit
>>> boundry.  They should have known better :(
>>
>> I intended to make it as contiguous byte sequence in this order because
>> some PECI commands need to caclulate CRC8 checksum using this as a byte
>> array. I'll align these on a 32bit boundary and will use a temporary
>> buffer instead while calcuating a CRC8 checksum.
> 
> If this really is the way the structure needs to be on the wire, that's
> fine, it's just a complaint about the protocol and how bad accessing
> those fields are going to suck for some processors.
> 
> But if you can get away with not doing it like this, that would be nice.

OK. I'll align this on a 32bit boundary and align all remaining
structures in this header file as well.

Thanks a lot for your comments.

Regards,
Jae
gregkh@linuxfoundation.org Jan. 25, 2019, 7:18 a.m. UTC | #7
On Thu, Jan 24, 2019 at 02:01:10PM -0800, Jae Hyun Yoo wrote:
> On 1/23/2019 10:57 PM, Greg Kroah-Hartman wrote:
> > On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
> > > > What about the un-initialized fields in this structure?  Can you
> > > > properly handle that, and also, is this ok to be on the stack?
> > > 
> > > It's fully initialized at here because the peci_ping_msg struct has only
> > > one member:
> > > 
> > > struct peci_ping_msg {
> > > 	__u8 addr;
> > > };
> > 
> > Ok.  But my question about "can you do this off the stack" remains.
> 
> I'll add 3 bytes of dummy padding into this structure. Also, I'll check
> again u32 boundary alignment for all struct defines in peci_ioctl.h.
> Would it be okay to be on stack then?

The issue of being on the stack has nothing to do with alignment, and
everything to do with, "can your controller handle data from the stack".
Lots of busses and controllers can not (i.e. all USB devices), so you
have to properly allocate all memory that is used for data transfers
from areas that are able to do DMA properly (i.e. by using kmalloc).

That is why I asked here about that, if this is a USB driver, having the
data you wish to send from a stack variable is not allowed.  I don't
know how your hardware works, which is why I was asking this.

Note, some architectures (like x86), hide this fact as their stack
memory is able to be DMA, so you do not run into any errors.  Other
arches that Linux supports are not like that, which is why we have those
types of restrictions.

Hope this helps,

greg k-h
Jae Hyun Yoo Jan. 25, 2019, 6:51 p.m. UTC | #8
Hi Greg,

On 1/24/2019 11:18 PM, Greg Kroah-Hartman wrote:
> On Thu, Jan 24, 2019 at 02:01:10PM -0800, Jae Hyun Yoo wrote:
>> On 1/23/2019 10:57 PM, Greg Kroah-Hartman wrote:
>>> On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
>>>>> What about the un-initialized fields in this structure?  Can you
>>>>> properly handle that, and also, is this ok to be on the stack?
>>>>
>>>> It's fully initialized at here because the peci_ping_msg struct has only
>>>> one member:
>>>>
>>>> struct peci_ping_msg {
>>>> 	__u8 addr;
>>>> };
>>>
>>> Ok.  But my question about "can you do this off the stack" remains.
>>
>> I'll add 3 bytes of dummy padding into this structure. Also, I'll check
>> again u32 boundary alignment for all struct defines in peci_ioctl.h.
>> Would it be okay to be on stack then?
> 
> The issue of being on the stack has nothing to do with alignment, and
> everything to do with, "can your controller handle data from the stack".
> Lots of busses and controllers can not (i.e. all USB devices), so you
> have to properly allocate all memory that is used for data transfers
> from areas that are able to do DMA properly (i.e. by using kmalloc).
> 
> That is why I asked here about that, if this is a USB driver, having the
> data you wish to send from a stack variable is not allowed.  I don't
> know how your hardware works, which is why I was asking this.
> 
> Note, some architectures (like x86), hide this fact as their stack
> memory is able to be DMA, so you do not run into any errors.  Other
> arches that Linux supports are not like that, which is why we have those
> types of restrictions.

Thanks for your detailed explanation.

In this core driver, all PECI command messages will be translated into
the raw PECI message structure (peci_xfer_msg) and it is the actual
data which will be delivered to an adapter driver through
adapter->xfer().

I'll fix the translation logic to use heap instead of stack for handling
the peci_xfer_msg structure.

Thanks,
Jae
gregkh@linuxfoundation.org Jan. 26, 2019, 8:29 a.m. UTC | #9
On Fri, Jan 25, 2019 at 10:51:54AM -0800, Jae Hyun Yoo wrote:
> Hi Greg,
> 
> On 1/24/2019 11:18 PM, Greg Kroah-Hartman wrote:
> > On Thu, Jan 24, 2019 at 02:01:10PM -0800, Jae Hyun Yoo wrote:
> > > On 1/23/2019 10:57 PM, Greg Kroah-Hartman wrote:
> > > > On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
> > > > > > What about the un-initialized fields in this structure?  Can you
> > > > > > properly handle that, and also, is this ok to be on the stack?
> > > > > 
> > > > > It's fully initialized at here because the peci_ping_msg struct has only
> > > > > one member:
> > > > > 
> > > > > struct peci_ping_msg {
> > > > > 	__u8 addr;
> > > > > };
> > > > 
> > > > Ok.  But my question about "can you do this off the stack" remains.
> > > 
> > > I'll add 3 bytes of dummy padding into this structure. Also, I'll check
> > > again u32 boundary alignment for all struct defines in peci_ioctl.h.
> > > Would it be okay to be on stack then?
> > 
> > The issue of being on the stack has nothing to do with alignment, and
> > everything to do with, "can your controller handle data from the stack".
> > Lots of busses and controllers can not (i.e. all USB devices), so you
> > have to properly allocate all memory that is used for data transfers
> > from areas that are able to do DMA properly (i.e. by using kmalloc).
> > 
> > That is why I asked here about that, if this is a USB driver, having the
> > data you wish to send from a stack variable is not allowed.  I don't
> > know how your hardware works, which is why I was asking this.
> > 
> > Note, some architectures (like x86), hide this fact as their stack
> > memory is able to be DMA, so you do not run into any errors.  Other
> > arches that Linux supports are not like that, which is why we have those
> > types of restrictions.
> 
> Thanks for your detailed explanation.
> 
> In this core driver, all PECI command messages will be translated into
> the raw PECI message structure (peci_xfer_msg) and it is the actual
> data which will be delivered to an adapter driver through
> adapter->xfer().
> 
> I'll fix the translation logic to use heap instead of stack for handling
> the peci_xfer_msg structure.

Great!  You might want to add the checks that we have in the USB core
for this type of thing to catch users who don't remember not to send
stack-data.  Look at usb_hcd_map_urb_for_dma() for specifics.

thanks,

greg k-h
Jae Hyun Yoo Jan. 28, 2019, 5:25 p.m. UTC | #10
Hi Greg,

On 1/26/2019 12:29 AM, Greg Kroah-Hartman wrote:
> On Fri, Jan 25, 2019 at 10:51:54AM -0800, Jae Hyun Yoo wrote:
>> Hi Greg,
>>
>> On 1/24/2019 11:18 PM, Greg Kroah-Hartman wrote:
>>> On Thu, Jan 24, 2019 at 02:01:10PM -0800, Jae Hyun Yoo wrote:
>>>> On 1/23/2019 10:57 PM, Greg Kroah-Hartman wrote:
>>>>> On Wed, Jan 23, 2019 at 01:38:24PM -0800, Jae Hyun Yoo wrote:
>>>>>>> What about the un-initialized fields in this structure?  Can you
>>>>>>> properly handle that, and also, is this ok to be on the stack?
>>>>>>
>>>>>> It's fully initialized at here because the peci_ping_msg struct has only
>>>>>> one member:
>>>>>>
>>>>>> struct peci_ping_msg {
>>>>>> 	__u8 addr;
>>>>>> };
>>>>>
>>>>> Ok.  But my question about "can you do this off the stack" remains.
>>>>
>>>> I'll add 3 bytes of dummy padding into this structure. Also, I'll check
>>>> again u32 boundary alignment for all struct defines in peci_ioctl.h.
>>>> Would it be okay to be on stack then?
>>>
>>> The issue of being on the stack has nothing to do with alignment, and
>>> everything to do with, "can your controller handle data from the stack".
>>> Lots of busses and controllers can not (i.e. all USB devices), so you
>>> have to properly allocate all memory that is used for data transfers
>>> from areas that are able to do DMA properly (i.e. by using kmalloc).
>>>
>>> That is why I asked here about that, if this is a USB driver, having the
>>> data you wish to send from a stack variable is not allowed.  I don't
>>> know how your hardware works, which is why I was asking this.
>>>
>>> Note, some architectures (like x86), hide this fact as their stack
>>> memory is able to be DMA, so you do not run into any errors.  Other
>>> arches that Linux supports are not like that, which is why we have those
>>> types of restrictions.
>>
>> Thanks for your detailed explanation.
>>
>> In this core driver, all PECI command messages will be translated into
>> the raw PECI message structure (peci_xfer_msg) and it is the actual
>> data which will be delivered to an adapter driver through
>> adapter->xfer().
>>
>> I'll fix the translation logic to use heap instead of stack for handling
>> the peci_xfer_msg structure.
> 
> Great!  You might want to add the checks that we have in the USB core
> for this type of thing to catch users who don't remember not to send
> stack-data.  Look at usb_hcd_map_urb_for_dma() for specifics.

Thanks for the additional information. I'll add the checks as well.

Regards,
Jae
diff mbox series

Patch

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 4f9f99057ff8..bbb66439a307 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -228,4 +228,6 @@  source "drivers/siox/Kconfig"
 
 source "drivers/slimbus/Kconfig"
 
+source "drivers/peci/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index e1ce029d28fd..9ec44c032a42 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -186,3 +186,4 @@  obj-$(CONFIG_MULTIPLEXER)	+= mux/
 obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
 obj-$(CONFIG_SIOX)		+= siox/
 obj-$(CONFIG_GNSS)		+= gnss/
+obj-$(CONFIG_PECI)		+= peci/
diff --git a/drivers/peci/Kconfig b/drivers/peci/Kconfig
new file mode 100644
index 000000000000..4ccacb22a356
--- /dev/null
+++ b/drivers/peci/Kconfig
@@ -0,0 +1,12 @@ 
+#
+# Platform Environment Control Interface (PECI) subsystem configuration
+#
+
+config PECI
+	bool "PECI support"
+	select RT_MUTEXES
+	select CRC8
+	help
+	  The Platform Environment Control Interface (PECI) is a one-wire bus
+	  interface that provides a communication channel from Intel processors
+	  and chipset components to external monitoring or control devices.
diff --git a/drivers/peci/Makefile b/drivers/peci/Makefile
new file mode 100644
index 000000000000..9e8615e0d3ff
--- /dev/null
+++ b/drivers/peci/Makefile
@@ -0,0 +1,6 @@ 
+#
+# Makefile for the PECI core and bus drivers.
+#
+
+# Core functionality
+obj-$(CONFIG_PECI)		+= peci-core.o
diff --git a/drivers/peci/peci-core.c b/drivers/peci/peci-core.c
new file mode 100644
index 000000000000..1542be81e499
--- /dev/null
+++ b/drivers/peci/peci-core.c
@@ -0,0 +1,1527 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018-2019 Intel Corporation
+
+#include <linux/bitfield.h>
+#include <linux/crc8.h>
+#include <linux/delay.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/peci.h>
+#include <linux/pm_domain.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+/* Mask for getting minor revision number from DIB */
+#define REVISION_NUM_MASK  GENMASK(15, 8)
+
+/* CRC8 table for Assure Write Frame Check */
+#define PECI_CRC8_POLYNOMIAL  0x07
+DECLARE_CRC8_TABLE(peci_crc8_table);
+
+static struct device_type peci_adapter_type;
+static struct device_type peci_client_type;
+
+/* Max number of peci cdev */
+#define PECI_CDEV_MAX  16
+
+static dev_t peci_devt;
+static bool is_registered;
+
+static DEFINE_MUTEX(core_lock);
+static DEFINE_IDR(peci_adapter_idr);
+
+static struct peci_adapter *peci_get_adapter(int nr)
+{
+	struct peci_adapter *adapter;
+
+	mutex_lock(&core_lock);
+	adapter = idr_find(&peci_adapter_idr, nr);
+	if (!adapter)
+		goto out_unlock;
+
+	if (try_module_get(adapter->owner))
+		get_device(&adapter->dev);
+	else
+		adapter = NULL;
+
+out_unlock:
+	mutex_unlock(&core_lock);
+	return adapter;
+}
+
+static void peci_put_adapter(struct peci_adapter *adapter)
+{
+	if (!adapter)
+		return;
+
+	put_device(&adapter->dev);
+	module_put(adapter->owner);
+}
+
+static ssize_t name_show(struct device *dev,
+			 struct device_attribute *attr,
+			 char *buf)
+{
+	return sprintf(buf, "%s\n", dev->type == &peci_client_type ?
+		       to_peci_client(dev)->name : to_peci_adapter(dev)->name);
+}
+static DEVICE_ATTR_RO(name);
+
+static void peci_client_dev_release(struct device *dev)
+{
+	struct peci_client *client = to_peci_client(dev);
+
+	dev_dbg(dev, "%s: %s\n", __func__, client->name);
+	peci_put_adapter(client->adapter);
+	kfree(client);
+}
+
+static struct attribute *peci_device_attrs[] = {
+	&dev_attr_name.attr,
+	NULL
+};
+ATTRIBUTE_GROUPS(peci_device);
+
+static struct device_type peci_client_type = {
+	.groups		= peci_device_groups,
+	.release	= peci_client_dev_release,
+};
+
+/**
+ * peci_verify_client - return parameter as peci_client, or NULL
+ * @dev: device, probably from some driver model iterator
+ *
+ * Return: pointer to peci_client on success, else NULL.
+ */
+struct peci_client *peci_verify_client(struct device *dev)
+{
+	return (dev->type == &peci_client_type)
+			? to_peci_client(dev)
+			: NULL;
+}
+EXPORT_SYMBOL_GPL(peci_verify_client);
+
+static u8 peci_aw_fcs(u8 *data, int len)
+{
+	return crc8(peci_crc8_table, data, (size_t)len, 0);
+}
+
+static int __peci_xfer(struct peci_adapter *adapter, struct peci_xfer_msg *msg,
+		       bool do_retry, bool has_aw_fcs)
+{
+	ktime_t start, end;
+	s64 elapsed_ms;
+	int rc = 0;
+
+	/**
+	 * For some commands, the PECI originator may need to retry a command if
+	 * the processor PECI client responds with a 0x8x completion code. In
+	 * each instance, the processor PECI client may have started the
+	 * operation but not completed it yet. When the 'retry' bit is set, the
+	 * PECI client will ignore a new request if it exactly matches a
+	 * previous valid request.
+	 */
+
+	if (do_retry)
+		start = ktime_get();
+
+	do {
+		rc = adapter->xfer(adapter, msg);
+
+		if (!do_retry || rc)
+			break;
+
+		if (msg->rx_buf[0] == DEV_PECI_CC_SUCCESS)
+			break;
+
+		/* Retry is needed when completion code is 0x8x */
+		if ((msg->rx_buf[0] & DEV_PECI_CC_RETRY_CHECK_MASK) !=
+		    DEV_PECI_CC_NEED_RETRY) {
+			rc = -EIO;
+			break;
+		}
+
+		/* Set the retry bit to indicate a retry attempt */
+		msg->tx_buf[1] |= DEV_PECI_RETRY_BIT;
+
+		/* Recalculate the AW FCS if it has one */
+		if (has_aw_fcs)
+			msg->tx_buf[msg->tx_len - 1] = 0x80 ^
+						peci_aw_fcs((u8 *)msg,
+							    2 + msg->tx_len);
+
+		/**
+		 * Retry for at least 250ms before returning an error.
+		 * Retry interval guideline:
+		 *   No minimum < Retry Interval < No maximum
+		 *                (recommend 10ms)
+		 */
+		end = ktime_get();
+		elapsed_ms = ktime_to_ms(ktime_sub(end, start));
+		if (elapsed_ms >= DEV_PECI_RETRY_TIME_MS) {
+			dev_dbg(&adapter->dev, "Timeout retrying xfer!\n");
+			rc = -ETIMEDOUT;
+			break;
+		}
+
+		usleep_range((DEV_PECI_RETRY_INTERVAL_USEC >> 2) + 1,
+			     DEV_PECI_RETRY_INTERVAL_USEC);
+	} while (true);
+
+	if (rc)
+		dev_dbg(&adapter->dev, "xfer error, rc: %d\n", rc);
+
+	return rc;
+}
+
+static int peci_xfer(struct peci_adapter *adapter, struct peci_xfer_msg *msg)
+{
+	return __peci_xfer(adapter, msg, false, false);
+}
+
+static int peci_xfer_with_retries(struct peci_adapter *adapter,
+				  struct peci_xfer_msg *msg,
+				  bool has_aw_fcs)
+{
+	return __peci_xfer(adapter, msg, true, has_aw_fcs);
+}
+
+static int peci_scan_cmd_mask(struct peci_adapter *adapter)
+{
+	struct peci_xfer_msg msg;
+	u8 revision;
+	int rc = 0;
+	u64 dib;
+
+	/* Update command mask just once */
+	if (adapter->cmd_mask & BIT(PECI_CMD_XFER))
+		return 0;
+
+	msg.addr      = PECI_BASE_ADDR;
+	msg.tx_len    = GET_DIB_WR_LEN;
+	msg.rx_len    = GET_DIB_RD_LEN;
+	msg.tx_buf[0] = GET_DIB_PECI_CMD;
+
+	rc = peci_xfer(adapter, &msg);
+	if (rc)
+		return rc;
+
+	dib = le64_to_cpup((__le64 *)msg.rx_buf);
+
+	/* Check special case for Get DIB command */
+	if (dib == 0) {
+		dev_dbg(&adapter->dev, "DIB read as 0\n");
+		return -EIO;
+	}
+
+	/**
+	 * Setting up the supporting commands based on minor revision number.
+	 * See PECI Spec Table 3-1.
+	 */
+	revision = FIELD_GET(REVISION_NUM_MASK, dib);
+	if (revision >= 0x36) /* Rev. 3.6 */
+		adapter->cmd_mask |= BIT(PECI_CMD_WR_IA_MSR);
+	if (revision >= 0x35) /* Rev. 3.5 */
+		adapter->cmd_mask |= BIT(PECI_CMD_WR_PCI_CFG);
+	if (revision >= 0x34) /* Rev. 3.4 */
+		adapter->cmd_mask |= BIT(PECI_CMD_RD_PCI_CFG);
+	if (revision >= 0x33) { /* Rev. 3.3 */
+		adapter->cmd_mask |= BIT(PECI_CMD_RD_PCI_CFG_LOCAL);
+		adapter->cmd_mask |= BIT(PECI_CMD_WR_PCI_CFG_LOCAL);
+	}
+	if (revision >= 0x32) /* Rev. 3.2 */
+		adapter->cmd_mask |= BIT(PECI_CMD_RD_IA_MSR);
+	if (revision >= 0x31) { /* Rev. 3.1 */
+		adapter->cmd_mask |= BIT(PECI_CMD_RD_PKG_CFG);
+		adapter->cmd_mask |= BIT(PECI_CMD_WR_PKG_CFG);
+	}
+
+	adapter->cmd_mask |= BIT(PECI_CMD_XFER);
+	adapter->cmd_mask |= BIT(PECI_CMD_GET_TEMP);
+	adapter->cmd_mask |= BIT(PECI_CMD_GET_DIB);
+	adapter->cmd_mask |= BIT(PECI_CMD_PING);
+
+	return rc;
+}
+
+static int peci_cmd_support(struct peci_adapter *adapter, enum peci_cmd cmd)
+{
+	if (!(adapter->cmd_mask & BIT(PECI_CMD_PING)) &&
+	    peci_scan_cmd_mask(adapter) < 0) {
+		dev_dbg(&adapter->dev, "Failed to scan command mask\n");
+		return -EIO;
+	}
+
+	if (!(adapter->cmd_mask & BIT(cmd))) {
+		dev_dbg(&adapter->dev, "Command %d is not supported\n", cmd);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int peci_ioctl_xfer(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_xfer_msg *msg = vmsg;
+
+	return peci_xfer(adapter, msg);
+}
+
+static int peci_ioctl_ping(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_ping_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+
+	msg.addr   = umsg->addr;
+	msg.tx_len = 0;
+	msg.rx_len = 0;
+
+	return peci_xfer(adapter, &msg);
+}
+
+static int peci_ioctl_get_dib(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_get_dib_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc;
+
+	msg.addr      = umsg->addr;
+	msg.tx_len    = GET_DIB_WR_LEN;
+	msg.rx_len    = GET_DIB_RD_LEN;
+	msg.tx_buf[0] = GET_DIB_PECI_CMD;
+
+	rc = peci_xfer(adapter, &msg);
+	if (rc)
+		return rc;
+
+	umsg->dib = le64_to_cpup((__le64 *)msg.rx_buf);
+
+	return 0;
+}
+
+static int peci_ioctl_get_temp(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_get_temp_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc;
+
+	msg.addr      = umsg->addr;
+	msg.tx_len    = GET_TEMP_WR_LEN;
+	msg.rx_len    = GET_TEMP_RD_LEN;
+	msg.tx_buf[0] = GET_TEMP_PECI_CMD;
+
+	rc = peci_xfer(adapter, &msg);
+	if (rc)
+		return rc;
+
+	umsg->temp_raw = le16_to_cpup((__le16 *)msg.rx_buf);
+
+	return 0;
+}
+
+static int peci_ioctl_rd_pkg_cfg(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_rd_pkg_cfg_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc = 0;
+
+	/* Per the PECI spec, the read length must be a byte, word, or dword */
+	if (umsg->rx_len != 1 && umsg->rx_len != 2 && umsg->rx_len != 4) {
+		dev_dbg(&adapter->dev, "Invalid read length, rx_len: %d\n",
+			umsg->rx_len);
+		return -EINVAL;
+	}
+
+	msg.addr = umsg->addr;
+	msg.tx_len = RDPKGCFG_WRITE_LEN;
+	/* read lengths of 1 and 2 result in an error, so only use 4 for now */
+	msg.rx_len = RDPKGCFG_READ_LEN_BASE + umsg->rx_len;
+	msg.tx_buf[0] = RDPKGCFG_PECI_CMD;
+	msg.tx_buf[1] = 0;         /* request byte for Host ID | Retry bit */
+				   /* Host ID is 0 for PECI 3.0 */
+	msg.tx_buf[2] = umsg->index;            /* RdPkgConfig index */
+	msg.tx_buf[3] = (u8)umsg->param;        /* LSB - Config parameter */
+	msg.tx_buf[4] = (u8)(umsg->param >> 8); /* MSB - Config parameter */
+
+	rc = peci_xfer_with_retries(adapter, &msg, false);
+	if (!rc)
+		memcpy(umsg->pkg_config, &msg.rx_buf[1], umsg->rx_len);
+
+	return rc;
+}
+
+static int peci_ioctl_wr_pkg_cfg(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_wr_pkg_cfg_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc = 0, i;
+
+	/* Per the PECI spec, the write length must be a dword */
+	if (umsg->tx_len != 4) {
+		dev_dbg(&adapter->dev, "Invalid write length, tx_len: %d\n",
+			umsg->tx_len);
+		return -EINVAL;
+	}
+
+	msg.addr = umsg->addr;
+	msg.tx_len = WRPKGCFG_WRITE_LEN_BASE + umsg->tx_len;
+	/* read lengths of 1 and 2 result in an error, so only use 4 for now */
+	msg.rx_len = WRPKGCFG_READ_LEN;
+	msg.tx_buf[0] = WRPKGCFG_PECI_CMD;
+	msg.tx_buf[1] = 0;         /* request byte for Host ID | Retry bit */
+				   /* Host ID is 0 for PECI 3.0 */
+	msg.tx_buf[2] = umsg->index;            /* RdPkgConfig index */
+	msg.tx_buf[3] = (u8)umsg->param;        /* LSB - Config parameter */
+	msg.tx_buf[4] = (u8)(umsg->param >> 8); /* MSB - Config parameter */
+	for (i = 0; i < umsg->tx_len; i++)
+		msg.tx_buf[5 + i] = (u8)(umsg->value >> (i << 3));
+
+	/* Add an Assure Write Frame Check Sequence byte */
+	msg.tx_buf[5 + i] = 0x80 ^
+			    peci_aw_fcs((u8 *)&msg, 8 + umsg->tx_len);
+
+	rc = peci_xfer_with_retries(adapter, &msg, true);
+
+	return rc;
+}
+
+static int peci_ioctl_rd_ia_msr(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_rd_ia_msr_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc = 0;
+
+	msg.addr = umsg->addr;
+	msg.tx_len = RDIAMSR_WRITE_LEN;
+	msg.rx_len = RDIAMSR_READ_LEN;
+	msg.tx_buf[0] = RDIAMSR_PECI_CMD;
+	msg.tx_buf[1] = 0;
+	msg.tx_buf[2] = umsg->thread_id;
+	msg.tx_buf[3] = (u8)umsg->address;
+	msg.tx_buf[4] = (u8)(umsg->address >> 8);
+
+	rc = peci_xfer_with_retries(adapter, &msg, false);
+	if (!rc)
+		memcpy(&umsg->value, &msg.rx_buf[1], sizeof(uint64_t));
+
+	return rc;
+}
+
+static int peci_ioctl_rd_pci_cfg(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_rd_pci_cfg_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	u32 address;
+	int rc = 0;
+
+	address = umsg->reg;                  /* [11:0]  - Register */
+	address |= (u32)umsg->function << 12; /* [14:12] - Function */
+	address |= (u32)umsg->device << 15;   /* [19:15] - Device   */
+	address |= (u32)umsg->bus << 20;      /* [27:20] - Bus      */
+					      /* [31:28] - Reserved */
+	msg.addr = umsg->addr;
+	msg.tx_len = RDPCICFG_WRITE_LEN;
+	msg.rx_len = RDPCICFG_READ_LEN;
+	msg.tx_buf[0] = RDPCICFG_PECI_CMD;
+	msg.tx_buf[1] = 0;         /* request byte for Host ID | Retry bit */
+				   /* Host ID is 0 for PECI 3.0 */
+	msg.tx_buf[2] = (u8)address;         /* LSB - PCI Config Address */
+	msg.tx_buf[3] = (u8)(address >> 8);  /* PCI Config Address */
+	msg.tx_buf[4] = (u8)(address >> 16); /* PCI Config Address */
+	msg.tx_buf[5] = (u8)(address >> 24); /* MSB - PCI Config Address */
+
+	rc = peci_xfer_with_retries(adapter, &msg, false);
+	if (!rc)
+		memcpy(umsg->pci_config, &msg.rx_buf[1], 4);
+
+	return rc;
+}
+
+static int peci_ioctl_rd_pci_cfg_local(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_rd_pci_cfg_local_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	u32 address;
+	int rc = 0;
+
+	/* Per the PECI spec, the read length must be a byte, word, or dword */
+	if (umsg->rx_len != 1 && umsg->rx_len != 2 && umsg->rx_len != 4) {
+		dev_dbg(&adapter->dev, "Invalid read length, rx_len: %d\n",
+			umsg->rx_len);
+		return -EINVAL;
+	}
+
+	address = umsg->reg;                  /* [11:0]  - Register */
+	address |= (u32)umsg->function << 12; /* [14:12] - Function */
+	address |= (u32)umsg->device << 15;   /* [19:15] - Device   */
+	address |= (u32)umsg->bus << 20;      /* [23:20] - Bus      */
+
+	msg.addr = umsg->addr;
+	msg.tx_len = RDPCICFGLOCAL_WRITE_LEN;
+	msg.rx_len = RDPCICFGLOCAL_READ_LEN_BASE + umsg->rx_len;
+	msg.tx_buf[0] = RDPCICFGLOCAL_PECI_CMD;
+	msg.tx_buf[1] = 0;         /* request byte for Host ID | Retry bit */
+				   /* Host ID is 0 for PECI 3.0 */
+	msg.tx_buf[2] = (u8)address;       /* LSB - PCI Configuration Address */
+	msg.tx_buf[3] = (u8)(address >> 8);  /* PCI Configuration Address */
+	msg.tx_buf[4] = (u8)(address >> 16); /* PCI Configuration Address */
+
+	rc = peci_xfer_with_retries(adapter, &msg, false);
+	if (!rc)
+		memcpy(umsg->pci_config, &msg.rx_buf[1], umsg->rx_len);
+
+	return rc;
+}
+
+static int peci_ioctl_wr_pci_cfg_local(struct peci_adapter *adapter, void *vmsg)
+{
+	struct peci_wr_pci_cfg_local_msg *umsg = vmsg;
+	struct peci_xfer_msg msg;
+	int rc = 0, i;
+	u32 address;
+
+	/* Per the PECI spec, the write length must be a byte, word, or dword */
+	if (umsg->tx_len != 1 && umsg->tx_len != 2 && umsg->tx_len != 4) {
+		dev_dbg(&adapter->dev, "Invalid write length, tx_len: %d\n",
+			umsg->tx_len);
+		return -EINVAL;
+	}
+
+	address = umsg->reg;                  /* [11:0]  - Register */
+	address |= (u32)umsg->function << 12; /* [14:12] - Function */
+	address |= (u32)umsg->device << 15;   /* [19:15] - Device   */
+	address |= (u32)umsg->bus << 20;      /* [23:20] - Bus      */
+
+	msg.addr = umsg->addr;
+	msg.tx_len = WRPCICFGLOCAL_WRITE_LEN_BASE + umsg->tx_len;
+	msg.rx_len = WRPCICFGLOCAL_READ_LEN;
+	msg.tx_buf[0] = WRPCICFGLOCAL_PECI_CMD;
+	msg.tx_buf[1] = 0;         /* request byte for Host ID | Retry bit */
+				   /* Host ID is 0 for PECI 3.0 */
+	msg.tx_buf[2] = (u8)address;       /* LSB - PCI Configuration Address */
+	msg.tx_buf[3] = (u8)(address >> 8);  /* PCI Configuration Address */
+	msg.tx_buf[4] = (u8)(address >> 16); /* PCI Configuration Address */
+	for (i = 0; i < umsg->tx_len; i++)
+		msg.tx_buf[5 + i] = (u8)(umsg->value >> (i << 3));
+
+	/* Add an Assure Write Frame Check Sequence byte */
+	msg.tx_buf[5 + i] = 0x80 ^
+			    peci_aw_fcs((u8 *)&msg, 8 + umsg->tx_len);
+
+	rc = peci_xfer_with_retries(adapter, &msg, true);
+
+	return rc;
+}
+
+typedef int (*peci_ioctl_fn_type)(struct peci_adapter *, void *);
+
+static const peci_ioctl_fn_type peci_ioctl_fn[PECI_CMD_MAX] = {
+	peci_ioctl_xfer,
+	peci_ioctl_ping,
+	peci_ioctl_get_dib,
+	peci_ioctl_get_temp,
+	peci_ioctl_rd_pkg_cfg,
+	peci_ioctl_wr_pkg_cfg,
+	peci_ioctl_rd_ia_msr,
+	NULL, /* Reserved */
+	peci_ioctl_rd_pci_cfg,
+	NULL, /* Reserved */
+	peci_ioctl_rd_pci_cfg_local,
+	peci_ioctl_wr_pci_cfg_local,
+};
+
+/**
+ * peci_command - transfer function of a PECI command
+ * @adapter: pointer to peci_adapter
+ * @vmsg: pointer to PECI messages
+ * Context: can sleep
+ *
+ * This performs a transfer of a PECI command using PECI messages parameter
+ * which has various formats on each command.
+ *
+ * Return: zero on success, else a negative error code.
+ */
+int peci_command(struct peci_adapter *adapter, enum peci_cmd cmd, void *vmsg)
+{
+	int rc = 0;
+
+	if (cmd >= PECI_CMD_MAX || cmd < PECI_CMD_XFER)
+		return -EINVAL;
+
+	dev_dbg(&adapter->dev, "%s, cmd=0x%02x\n", __func__, cmd);
+
+	if (!peci_ioctl_fn[cmd])
+		return -EINVAL;
+
+	rt_mutex_lock(&adapter->bus_lock);
+
+	rc = peci_cmd_support(adapter, cmd);
+	if (!rc)
+		rc = peci_ioctl_fn[cmd](adapter, vmsg);
+
+	rt_mutex_unlock(&adapter->bus_lock);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(peci_command);
+
+static long peci_ioctl(struct file *file, unsigned int iocmd, unsigned long arg)
+{
+	struct peci_adapter *adapter = file->private_data;
+	void __user *argp = (void __user *)arg;
+	unsigned int msg_len;
+	enum peci_cmd cmd;
+	int rc = 0;
+	u8 *msg;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	dev_dbg(&adapter->dev, "ioctl, cmd=0x%x, arg=0x%lx\n", iocmd, arg);
+
+	switch (iocmd) {
+	case PECI_IOC_XFER:
+	case PECI_IOC_PING:
+	case PECI_IOC_GET_DIB:
+	case PECI_IOC_GET_TEMP:
+	case PECI_IOC_RD_PKG_CFG:
+	case PECI_IOC_WR_PKG_CFG:
+	case PECI_IOC_RD_IA_MSR:
+	case PECI_IOC_RD_PCI_CFG:
+	case PECI_IOC_RD_PCI_CFG_LOCAL:
+	case PECI_IOC_WR_PCI_CFG_LOCAL:
+		cmd = _IOC_NR(iocmd);
+		msg_len = _IOC_SIZE(iocmd);
+		break;
+
+	default:
+		dev_dbg(&adapter->dev, "Invalid ioctl cmd : 0x%x\n", iocmd);
+		return -ENOTTY;
+	}
+
+	if (!access_ok(argp, msg_len))
+		return -EFAULT;
+
+	msg = memdup_user(argp, msg_len);
+	if (IS_ERR(msg))
+		return PTR_ERR(msg);
+
+	rc = peci_command(adapter, cmd, msg);
+
+	if (!rc && copy_to_user(argp, msg, msg_len))
+		rc = -EFAULT;
+
+	kfree(msg);
+	return (long)rc;
+}
+
+static int peci_open(struct inode *inode, struct file *file)
+{
+	unsigned int minor = iminor(inode);
+	struct peci_adapter *adapter;
+
+	adapter = peci_get_adapter(minor);
+	if (!adapter)
+		return -ENODEV;
+
+	file->private_data = adapter;
+
+	return 0;
+}
+
+static int peci_release(struct inode *inode, struct file *file)
+{
+	struct peci_adapter *adapter = file->private_data;
+
+	peci_put_adapter(adapter);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+static const struct file_operations peci_fops = {
+	.owner          = THIS_MODULE,
+	.unlocked_ioctl = peci_ioctl,
+	.open           = peci_open,
+	.release        = peci_release,
+};
+
+static int peci_detect(struct peci_adapter *adapter, u8 addr)
+{
+	struct peci_ping_msg msg;
+
+	msg.addr = addr;
+
+	return peci_command(adapter, PECI_CMD_PING, &msg);
+}
+
+static const struct of_device_id *
+peci_of_match_device(const struct of_device_id *matches,
+		     struct peci_client *client)
+{
+#if IS_ENABLED(CONFIG_OF)
+	if (!(client && matches))
+		return NULL;
+
+	return of_match_device(matches, &client->dev);
+#else /* CONFIG_OF */
+	return NULL;
+#endif /* CONFIG_OF */
+}
+
+static const struct peci_device_id *
+peci_match_id(const struct peci_device_id *id, struct peci_client *client)
+{
+	if (!(id && client))
+		return NULL;
+
+	while (id->name[0]) {
+		if (!strncmp(client->name, id->name, PECI_NAME_SIZE))
+			return id;
+		id++;
+	}
+
+	return NULL;
+}
+
+static int peci_device_match(struct device *dev, struct device_driver *drv)
+{
+	struct peci_client *client = peci_verify_client(dev);
+	struct peci_driver *driver;
+
+	/* Attempt an OF style match */
+	if (peci_of_match_device(drv->of_match_table, client))
+		return 1;
+
+	driver = to_peci_driver(drv);
+
+	/* Finally an ID match */
+	if (peci_match_id(driver->id_table, client))
+		return 1;
+
+	return 0;
+}
+
+static int peci_device_probe(struct device *dev)
+{
+	struct peci_client *client = peci_verify_client(dev);
+	struct peci_driver *driver;
+	int status = -EINVAL;
+
+	if (!client)
+		return 0;
+
+	driver = to_peci_driver(dev->driver);
+
+	if (!driver->id_table &&
+	    !peci_of_match_device(dev->driver->of_match_table, client))
+		return -ENODEV;
+
+	dev_dbg(dev, "%s: name:%s\n", __func__, client->name);
+
+	status = dev_pm_domain_attach(&client->dev, true);
+	if (status == -EPROBE_DEFER)
+		return status;
+
+	if (driver->probe)
+		status = driver->probe(client);
+	else
+		status = -EINVAL;
+
+	if (status)
+		goto err_detach_pm_domain;
+
+	return 0;
+
+err_detach_pm_domain:
+	dev_pm_domain_detach(&client->dev, true);
+	return status;
+}
+
+static int peci_device_remove(struct device *dev)
+{
+	struct peci_client *client = peci_verify_client(dev);
+	struct peci_driver *driver;
+	int status = 0;
+
+	if (!client || !dev->driver)
+		return 0;
+
+	driver = to_peci_driver(dev->driver);
+	if (driver->remove) {
+		dev_dbg(dev, "%s: name:%s\n", __func__, client->name);
+		status = driver->remove(client);
+	}
+
+	dev_pm_domain_detach(&client->dev, true);
+
+	return status;
+}
+
+static void peci_device_shutdown(struct device *dev)
+{
+	struct peci_client *client = peci_verify_client(dev);
+	struct peci_driver *driver;
+
+	if (!client || !dev->driver)
+		return;
+
+	dev_dbg(dev, "%s: name:%s\n", __func__, client->name);
+
+	driver = to_peci_driver(dev->driver);
+	if (driver->shutdown)
+		driver->shutdown(client);
+}
+
+static struct bus_type peci_bus_type = {
+	.name		= "peci",
+	.match		= peci_device_match,
+	.probe		= peci_device_probe,
+	.remove		= peci_device_remove,
+	.shutdown	= peci_device_shutdown,
+};
+
+static int peci_check_addr_validity(u8 addr)
+{
+	if (addr < PECI_BASE_ADDR && addr > PECI_BASE_ADDR + PECI_OFFSET_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int peci_check_client_busy(struct device *dev, void *client_new_p)
+{
+	struct peci_client *client = peci_verify_client(dev);
+	struct peci_client *client_new = client_new_p;
+
+	if (client && client->addr == client_new->addr)
+		return -EBUSY;
+
+	return 0;
+}
+
+/**
+ * peci_get_cpu_id - read CPU ID from the Package Configuration Space of CPU
+ * @adapter: pointer to peci_adapter
+ * @addr: address of the PECI client CPU
+ * @cpu_id: where the CPU ID will be stored
+ * Context: can sleep
+ *
+ * Return: zero on success, else a negative error code.
+ */
+int peci_get_cpu_id(struct peci_adapter *adapter, u8 addr, u32 *cpu_id)
+{
+	struct peci_rd_pkg_cfg_msg msg;
+	int rc;
+
+	msg.addr = addr;
+	msg.index = MBX_INDEX_CPU_ID;
+	msg.param = PKG_ID_CPU_ID;
+	msg.rx_len = 4;
+
+	rc = peci_command(adapter, PECI_CMD_RD_PKG_CFG, &msg);
+	if (!rc)
+		*cpu_id = le32_to_cpup((__le32 *)msg.pkg_config);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(peci_get_cpu_id);
+
+static struct peci_client *peci_new_device(struct peci_adapter *adapter,
+					   struct peci_board_info const *info)
+{
+	struct peci_client *client;
+	int rc;
+
+	/* Increase reference count for the adapter assigned */
+	if (!peci_get_adapter(adapter->nr))
+		return NULL;
+
+	client = kzalloc(sizeof(*client), GFP_KERNEL);
+	if (!client)
+		goto err_put_adapter;
+
+	client->adapter = adapter;
+	client->addr = info->addr;
+	strlcpy(client->name, info->type, sizeof(client->name));
+
+	rc = peci_check_addr_validity(client->addr);
+	if (rc) {
+		dev_err(&adapter->dev, "Invalid PECI CPU address 0x%02hx\n",
+			client->addr);
+		goto err_free_client_silent;
+	}
+
+	/* Check online status of client */
+	rc = peci_detect(adapter, client->addr);
+	if (rc)
+		goto err_free_client;
+
+	rc = device_for_each_child(&adapter->dev, client,
+				   peci_check_client_busy);
+	if (rc)
+		goto err_free_client;
+
+	client->dev.parent = &client->adapter->dev;
+	client->dev.bus = &peci_bus_type;
+	client->dev.type = &peci_client_type;
+	client->dev.of_node = info->of_node;
+	dev_set_name(&client->dev, "%d-%02x", adapter->nr, client->addr);
+
+	rc = device_register(&client->dev);
+	if (rc)
+		goto err_free_client;
+
+	dev_dbg(&adapter->dev, "client [%s] registered with bus id %s\n",
+		client->name, dev_name(&client->dev));
+
+	return client;
+
+err_free_client:
+	dev_err(&adapter->dev,
+		"Failed to register peci client %s at 0x%02x (%d)\n",
+		client->name, client->addr, rc);
+err_free_client_silent:
+	kfree(client);
+err_put_adapter:
+	peci_put_adapter(adapter);
+	return NULL;
+}
+
+static void peci_unregister_device(struct peci_client *client)
+{
+	if (!client)
+		return;
+
+	if (client->dev.of_node)
+		of_node_clear_flag(client->dev.of_node, OF_POPULATED);
+
+	device_unregister(&client->dev);
+}
+
+static int peci_unregister_client(struct device *dev, void *dummy)
+{
+	struct peci_client *client = peci_verify_client(dev);
+
+	peci_unregister_device(client);
+
+	return 0;
+}
+
+static void peci_adapter_dev_release(struct device *dev)
+{
+	struct peci_adapter *adapter = to_peci_adapter(dev);
+
+	dev_dbg(dev, "%s: %s\n", __func__, adapter->name);
+	mutex_destroy(&adapter->userspace_clients_lock);
+	rt_mutex_destroy(&adapter->bus_lock);
+	kfree(adapter);
+}
+
+static ssize_t peci_sysfs_new_device(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t count)
+{
+	struct peci_adapter *adapter = to_peci_adapter(dev);
+	struct peci_board_info info = {};
+	struct peci_client *client;
+	char *blank, end;
+	int rc;
+
+	/* Parse device type */
+	blank = strchr(buf, ' ');
+	if (!blank) {
+		dev_err(dev, "%s: Missing parameters\n", "new_device");
+		return -EINVAL;
+	}
+	if (blank - buf > PECI_NAME_SIZE - 1) {
+		dev_err(dev, "%s: Invalid device type\n", "new_device");
+		return -EINVAL;
+	}
+	memcpy(info.type, buf, blank - buf);
+
+	/* Parse remaining parameters, reject extra parameters */
+	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
+	if (rc < 1) {
+		dev_err(dev, "%s: Can't parse client address\n", "new_device");
+		return -EINVAL;
+	}
+	if (rc > 1  && end != '\n') {
+		dev_err(dev, "%s: Extra parameters\n", "new_device");
+		return -EINVAL;
+	}
+
+	client = peci_new_device(adapter, &info);
+	if (!client)
+		return -EINVAL;
+
+	/* Keep track of the added device */
+	mutex_lock(&adapter->userspace_clients_lock);
+	list_add_tail(&client->detected, &adapter->userspace_clients);
+	mutex_unlock(&adapter->userspace_clients_lock);
+	dev_info(dev, "%s: Instantiated device %s at 0x%02hx\n", "new_device",
+		 info.type, info.addr);
+
+	return count;
+}
+static DEVICE_ATTR(new_device, 0200, NULL, peci_sysfs_new_device);
+
+static ssize_t peci_sysfs_delete_device(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct peci_adapter *adapter = to_peci_adapter(dev);
+	struct peci_client *client, *next;
+	struct peci_board_info info = {};
+	struct peci_driver *driver;
+	char *blank, end;
+	int rc;
+
+	/* Parse device type */
+	blank = strchr(buf, ' ');
+	if (!blank) {
+		dev_err(dev, "%s: Missing parameters\n", "delete_device");
+		return -EINVAL;
+	}
+	if (blank - buf > PECI_NAME_SIZE - 1) {
+		dev_err(dev, "%s: Invalid device type\n", "delete_device");
+		return -EINVAL;
+	}
+	memcpy(info.type, buf, blank - buf);
+
+	/* Parse remaining parameters, reject extra parameters */
+	rc = sscanf(++blank, "%hi%c", &info.addr, &end);
+	if (rc < 1) {
+		dev_err(dev, "%s: Can't parse client address\n",
+			"delete_device");
+		return -EINVAL;
+	}
+	if (rc > 1  && end != '\n') {
+		dev_err(dev, "%s: Extra parameters\n", "delete_device");
+		return -EINVAL;
+	}
+
+	/* Make sure the device was added through sysfs */
+	rc = -ENOENT;
+	mutex_lock(&adapter->userspace_clients_lock);
+	list_for_each_entry_safe(client, next, &adapter->userspace_clients,
+				 detected) {
+		driver = to_peci_driver(client->dev.driver);
+
+		if (client->addr == info.addr &&
+		    !strncmp(client->name, info.type, PECI_NAME_SIZE)) {
+			dev_info(dev, "%s: Deleting device %s at 0x%02hx\n",
+				 "delete_device", client->name, client->addr);
+			list_del(&client->detected);
+			peci_unregister_device(client);
+			rc = count;
+			break;
+		}
+	}
+	mutex_unlock(&adapter->userspace_clients_lock);
+
+	if (rc < 0)
+		dev_err(dev, "%s: Can't find device in list\n",
+			"delete_device");
+
+	return rc;
+}
+static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, 0200, NULL,
+				  peci_sysfs_delete_device);
+
+static struct attribute *peci_adapter_attrs[] = {
+	&dev_attr_name.attr,
+	&dev_attr_new_device.attr,
+	&dev_attr_delete_device.attr,
+	NULL
+};
+ATTRIBUTE_GROUPS(peci_adapter);
+
+static struct device_type peci_adapter_type = {
+	.groups		= peci_adapter_groups,
+	.release	= peci_adapter_dev_release,
+};
+
+/**
+ * peci_verify_adapter - return parameter as peci_adapter, or NULL
+ * @dev: device, probably from some driver model iterator
+ *
+ * Return: pointer to peci_adapter on success, else NULL.
+ */
+struct peci_adapter *peci_verify_adapter(struct device *dev)
+{
+	return (dev->type == &peci_adapter_type)
+			? to_peci_adapter(dev)
+			: NULL;
+}
+EXPORT_SYMBOL_GPL(peci_verify_adapter);
+
+#if IS_ENABLED(CONFIG_OF)
+static struct peci_client *peci_of_register_device(struct peci_adapter *adapter,
+						   struct device_node *node)
+{
+	struct peci_board_info info = {};
+	struct peci_client *result;
+	const __be32 *addr_be;
+	int len;
+
+	dev_dbg(&adapter->dev, "register %pOF\n", node);
+
+	if (of_modalias_node(node, info.type, sizeof(info.type)) < 0) {
+		dev_err(&adapter->dev, "modalias failure on %pOF\n", node);
+		return ERR_PTR(-EINVAL);
+	}
+
+	addr_be = of_get_property(node, "reg", &len);
+	if (!addr_be || len < sizeof(*addr_be)) {
+		dev_err(&adapter->dev, "invalid reg on %pOF\n", node);
+		return ERR_PTR(-EINVAL);
+	}
+
+	info.addr = be32_to_cpup(addr_be);
+	info.of_node = of_node_get(node);
+
+	result = peci_new_device(adapter, &info);
+	if (!result)
+		result = ERR_PTR(-EINVAL);
+
+	of_node_put(node);
+	return result;
+}
+
+static void peci_of_register_devices(struct peci_adapter *adapter)
+{
+	struct device_node *bus, *node;
+	struct peci_client *client;
+
+	/* Only register child devices if the adapter has a node pointer set */
+	if (!adapter->dev.of_node)
+		return;
+
+	bus = of_get_child_by_name(adapter->dev.of_node, "peci-bus");
+	if (!bus)
+		bus = of_node_get(adapter->dev.of_node);
+
+	for_each_available_child_of_node(bus, node) {
+		if (of_node_test_and_set_flag(node, OF_POPULATED))
+			continue;
+
+		client = peci_of_register_device(adapter, node);
+		if (IS_ERR(client)) {
+			dev_warn(&adapter->dev,
+				 "Failed to create PECI device for %pOF\n",
+				 node);
+			of_node_clear_flag(node, OF_POPULATED);
+		}
+	}
+
+	of_node_put(bus);
+}
+#else /* CONFIG_OF */
+static void peci_of_register_devices(struct peci_adapter *adapter) { }
+#endif /* CONFIG_OF */
+
+#if IS_ENABLED(CONFIG_OF_DYNAMIC)
+static int peci_of_match_node(struct device *dev, void *data)
+{
+	return dev->of_node == data;
+}
+
+/* must call put_device() when done with returned peci_client device */
+static struct peci_client *peci_of_find_device(struct device_node *node)
+{
+	struct peci_client *client;
+	struct device *dev;
+
+	dev = bus_find_device(&peci_bus_type, NULL, node, peci_of_match_node);
+	if (!dev)
+		return NULL;
+
+	client = peci_verify_client(dev);
+	if (!client)
+		put_device(dev);
+
+	return client;
+}
+
+/* must call put_device() when done with returned peci_adapter device */
+static struct peci_adapter *peci_of_find_adapter(struct device_node *node)
+{
+	struct peci_adapter *adapter;
+	struct device *dev;
+
+	dev = bus_find_device(&peci_bus_type, NULL, node, peci_of_match_node);
+	if (!dev)
+		return NULL;
+
+	adapter = peci_verify_adapter(dev);
+	if (!adapter)
+		put_device(dev);
+
+	return adapter;
+}
+
+static int peci_of_notify(struct notifier_block *nb,
+			  unsigned long action,
+			  void *arg)
+{
+	struct of_reconfig_data *rd = arg;
+	struct peci_adapter *adapter;
+	struct peci_client *client;
+
+	switch (of_reconfig_get_state_change(action, rd)) {
+	case OF_RECONFIG_CHANGE_ADD:
+		adapter = peci_of_find_adapter(rd->dn->parent);
+		if (!adapter)
+			return NOTIFY_OK;	/* not for us */
+
+		if (of_node_test_and_set_flag(rd->dn, OF_POPULATED)) {
+			put_device(&adapter->dev);
+			return NOTIFY_OK;
+		}
+
+		client = peci_of_register_device(adapter, rd->dn);
+		put_device(&adapter->dev);
+
+		if (IS_ERR(client)) {
+			dev_err(&adapter->dev,
+				"failed to create client for '%pOF'\n", rd->dn);
+			of_node_clear_flag(rd->dn, OF_POPULATED);
+			return notifier_from_errno(PTR_ERR(client));
+		}
+		break;
+	case OF_RECONFIG_CHANGE_REMOVE:
+		/* already depopulated? */
+		if (!of_node_check_flag(rd->dn, OF_POPULATED))
+			return NOTIFY_OK;
+
+		/* find our device by node */
+		client = peci_of_find_device(rd->dn);
+		if (!client)
+			return NOTIFY_OK;	/* no? not meant for us */
+
+		/* unregister takes one ref away */
+		peci_unregister_device(client);
+
+		/* and put the reference of the find */
+		put_device(&client->dev);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block peci_of_notifier = {
+	.notifier_call = peci_of_notify,
+};
+#else /* CONFIG_OF_DYNAMIC */
+extern struct notifier_block peci_of_notifier;
+#endif /* CONFIG_OF_DYNAMIC */
+
+/**
+ * peci_alloc_adapter - allocate a PECI adapter
+ * @dev: the adapter, possibly using the platform_bus
+ * @size: how much zeroed driver-private data to allocate; the pointer to this
+ *	memory is in the driver_data field of the returned device,
+ *	accessible with peci_get_adapdata().
+ * Context: can sleep
+ *
+ * This call is used only by PECI adapter drivers, which are the only ones
+ * directly touching chip registers.  It's how they allocate a peci_adapter
+ * structure, prior to calling peci_add_adapter().
+ *
+ * This must be called from context that can sleep.
+ *
+ * The caller is responsible for initializing the adapter's methods before
+ * calling peci_add_adapter(); and (after errors while adding the device)
+ * calling put_device() to prevent a memory leak.
+ *
+ * Return: the peci_adapter structure on success, else NULL.
+ */
+struct peci_adapter *peci_alloc_adapter(struct device *dev, unsigned int size)
+{
+	struct peci_adapter *adapter;
+
+	if (!dev)
+		return NULL;
+
+	adapter = kzalloc(size + sizeof(*adapter), GFP_KERNEL);
+	if (!adapter)
+		return NULL;
+
+	device_initialize(&adapter->dev);
+	adapter->dev.parent = dev;
+	adapter->dev.bus = &peci_bus_type;
+	adapter->dev.type = &peci_adapter_type;
+	peci_set_adapdata(adapter, &adapter[1]);
+
+	return adapter;
+}
+EXPORT_SYMBOL_GPL(peci_alloc_adapter);
+
+static int peci_register_adapter(struct peci_adapter *adapter)
+{
+	int rc = -EINVAL;
+
+	/* Can't register until after driver model init */
+	if (WARN_ON(!is_registered))
+		goto err_free_idr;
+
+	if (WARN(!adapter->name[0], "peci adapter has no name"))
+		goto err_free_idr;
+
+	if (WARN(!adapter->xfer, "peci adapter has no xfer function\n"))
+		goto err_free_idr;
+
+	rt_mutex_init(&adapter->bus_lock);
+	mutex_init(&adapter->userspace_clients_lock);
+	INIT_LIST_HEAD(&adapter->userspace_clients);
+
+	dev_set_name(&adapter->dev, "peci-%d", adapter->nr);
+
+	/* cdev */
+	cdev_init(&adapter->cdev, &peci_fops);
+	adapter->cdev.owner = THIS_MODULE;
+	adapter->dev.devt = MKDEV(MAJOR(peci_devt), adapter->nr);
+	rc = cdev_add(&adapter->cdev, adapter->dev.devt, 1);
+	if (rc) {
+		pr_err("adapter '%s': can't add cdev (%d)\n",
+		       adapter->name, rc);
+		goto err_free_idr;
+	}
+	rc = device_add(&adapter->dev);
+	if (rc) {
+		pr_err("adapter '%s': can't add device (%d)\n",
+		       adapter->name, rc);
+		goto err_del_cdev;
+	}
+
+	dev_dbg(&adapter->dev, "adapter [%s] registered\n", adapter->name);
+
+	pm_runtime_no_callbacks(&adapter->dev);
+	pm_suspend_ignore_children(&adapter->dev, true);
+	pm_runtime_enable(&adapter->dev);
+
+	/* create pre-declared device nodes */
+	peci_of_register_devices(adapter);
+
+	return 0;
+
+err_del_cdev:
+	cdev_del(&adapter->cdev);
+err_free_idr:
+	mutex_lock(&core_lock);
+	idr_remove(&peci_adapter_idr, adapter->nr);
+	mutex_unlock(&core_lock);
+	return rc;
+}
+
+static int peci_add_numbered_adapter(struct peci_adapter *adapter)
+{
+	int id;
+
+	mutex_lock(&core_lock);
+	id = idr_alloc(&peci_adapter_idr, adapter,
+		       adapter->nr, adapter->nr + 1, GFP_KERNEL);
+	mutex_unlock(&core_lock);
+	if (WARN(id < 0, "couldn't get idr"))
+		return id == -ENOSPC ? -EBUSY : id;
+
+	return peci_register_adapter(adapter);
+}
+
+/**
+ * peci_add_adapter - add a PECI adapter
+ * @adapter: initialized adapter, originally from peci_alloc_adapter()
+ * Context: can sleep
+ *
+ * PECI adapters connect to their drivers using some non-PECI bus,
+ * such as the platform bus.  The final stage of probe() in that code
+ * includes calling peci_add_adapter() to hook up to this PECI bus glue.
+ *
+ * This must be called from context that can sleep.
+ *
+ * It returns zero on success, else a negative error code (dropping the
+ * adapter's refcount).  After a successful return, the caller is responsible
+ * for calling peci_del_adapter().
+ *
+ * Return: zero on success, else a negative error code.
+ */
+int peci_add_adapter(struct peci_adapter *adapter)
+{
+	struct device *dev = &adapter->dev;
+	int id;
+
+	if (dev->of_node) {
+		id = of_alias_get_id(dev->of_node, "peci");
+		if (id >= 0) {
+			adapter->nr = id;
+			return peci_add_numbered_adapter(adapter);
+		}
+	}
+
+	mutex_lock(&core_lock);
+	id = idr_alloc(&peci_adapter_idr, adapter, 0, 0, GFP_KERNEL);
+	mutex_unlock(&core_lock);
+	if (WARN(id < 0, "couldn't get idr"))
+		return id;
+
+	adapter->nr = id;
+
+	return peci_register_adapter(adapter);
+}
+EXPORT_SYMBOL_GPL(peci_add_adapter);
+
+/**
+ * peci_del_adapter - delete a PECI adapter
+ * @adapter: the adpater being deleted
+ * Context: can sleep
+ *
+ * This call is used only by PECI adpater drivers, which are the only ones
+ * directly touching chip registers.
+ *
+ * This must be called from context that can sleep.
+ *
+ * Note that this function also drops a reference to the adapter.
+ */
+void peci_del_adapter(struct peci_adapter *adapter)
+{
+	struct peci_client *client, *next;
+	struct peci_adapter *found;
+	int nr;
+
+	/* First make sure that this adapter was ever added */
+	mutex_lock(&core_lock);
+	found = idr_find(&peci_adapter_idr, adapter->nr);
+	mutex_unlock(&core_lock);
+
+	if (found != adapter)
+		return;
+
+	/* Remove devices instantiated from sysfs */
+	mutex_lock(&adapter->userspace_clients_lock);
+	list_for_each_entry_safe(client, next, &adapter->userspace_clients,
+				 detected) {
+		dev_dbg(&adapter->dev, "Removing %s at 0x%x\n", client->name,
+			client->addr);
+		list_del(&client->detected);
+		peci_unregister_device(client);
+	}
+	mutex_unlock(&adapter->userspace_clients_lock);
+
+	/**
+	 * Detach any active clients. This can't fail, thus we do not
+	 * check the returned value.
+	 */
+	device_for_each_child(&adapter->dev, NULL, peci_unregister_client);
+
+	/* device name is gone after device_unregister */
+	dev_dbg(&adapter->dev, "adapter [%s] unregistered\n", adapter->name);
+
+	/* free cdev */
+	cdev_del(&adapter->cdev);
+
+	pm_runtime_disable(&adapter->dev);
+
+	nr = adapter->nr;
+
+	device_unregister(&adapter->dev);
+
+	/* free bus id */
+	mutex_lock(&core_lock);
+	idr_remove(&peci_adapter_idr, nr);
+	mutex_unlock(&core_lock);
+}
+EXPORT_SYMBOL_GPL(peci_del_adapter);
+
+/**
+ * peci_register_driver - register a PECI driver
+ * @owner: owner module of the driver being registered
+ * @driver: the driver being registered
+ * Context: can sleep
+ *
+ * Return: zero on success, else a negative error code.
+ */
+int peci_register_driver(struct module *owner, struct peci_driver *driver)
+{
+	int rc;
+
+	/* Can't register until after driver model init */
+	if (WARN_ON(!is_registered))
+		return -EAGAIN;
+
+	/* add the driver to the list of peci drivers in the driver core */
+	driver->driver.owner = owner;
+	driver->driver.bus = &peci_bus_type;
+
+	/**
+	 * When registration returns, the driver core
+	 * will have called probe() for all matching-but-unbound devices.
+	 */
+	rc = driver_register(&driver->driver);
+	if (rc)
+		return rc;
+
+	pr_debug("driver [%s] registered\n", driver->driver.name);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(peci_register_driver);
+
+/**
+ * peci_del_driver - unregister a PECI driver
+ * @driver: the driver being unregistered
+ * Context: can sleep
+ */
+void peci_del_driver(struct peci_driver *driver)
+{
+	driver_unregister(&driver->driver);
+	pr_debug("driver [%s] unregistered\n", driver->driver.name);
+}
+EXPORT_SYMBOL_GPL(peci_del_driver);
+
+static int __init peci_init(void)
+{
+	int ret;
+
+	ret = bus_register(&peci_bus_type);
+	if (ret < 0) {
+		pr_err("peci: Failed to register PECI bus type!\n");
+		return ret;
+	}
+
+	ret = alloc_chrdev_region(&peci_devt, 0, PECI_CDEV_MAX, "peci");
+	if (ret < 0) {
+		pr_err("peci: Failed to allocate chr dev region!\n");
+		bus_unregister(&peci_bus_type);
+		return ret;
+	}
+
+	crc8_populate_msb(peci_crc8_table, PECI_CRC8_POLYNOMIAL);
+
+	if (IS_ENABLED(CONFIG_OF_DYNAMIC))
+		WARN_ON(of_reconfig_notifier_register(&peci_of_notifier));
+
+	is_registered = true;
+
+	return 0;
+}
+
+static void __exit peci_exit(void)
+{
+	if (IS_ENABLED(CONFIG_OF_DYNAMIC))
+		WARN_ON(of_reconfig_notifier_unregister(&peci_of_notifier));
+
+	unregister_chrdev_region(peci_devt, PECI_CDEV_MAX);
+	bus_unregister(&peci_bus_type);
+}
+
+postcore_initcall(peci_init);
+module_exit(peci_exit);
+
+MODULE_AUTHOR("Jason M Biils <jason.m.bills@linux.intel.com>");
+MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
+MODULE_DESCRIPTION("PECI bus core module");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/peci.h b/include/linux/peci.h
new file mode 100644
index 000000000000..4b8be939585c
--- /dev/null
+++ b/include/linux/peci.h
@@ -0,0 +1,142 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018-2019 Intel Corporation */
+
+#ifndef __LINUX_PECI_H
+#define __LINUX_PECI_H
+
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/peci-ioctl.h>
+#include <linux/rtmutex.h>
+
+#define PECI_NAME_SIZE   32
+
+struct peci_board_info {
+	char			type[PECI_NAME_SIZE];
+	unsigned short		addr;	/* CPU client address */
+	struct device_node	*of_node;
+};
+
+/**
+ * struct peci_adapter - represent a PECI adapter
+ * @owner: owner module of the PECI adpater
+ * @bus_lock: mutex for exclusion of multiple callers
+ * @dev: device interface to this driver
+ * @cdev: character device object to create character device
+ * @nr: the bus number to map
+ * @name: name of the adapter
+ * @userspace_clients_lock: mutex for exclusion of clients handling
+ * @userspace_clients: list of registered clients
+ * @xfer: low-level transfer function pointer of the adapter
+ * @cmd_mask: mask for supportable PECI commands
+ *
+ * Each PECI adapter can communicate with one or more PECI client children.
+ * These make a small bus, sharing a single wired PECI connection.
+ */
+struct peci_adapter {
+	struct module		*owner;
+	struct rt_mutex		bus_lock;
+	struct device		dev;
+	struct cdev		cdev;
+	int			nr;
+	char			name[PECI_NAME_SIZE];
+	struct mutex		userspace_clients_lock; /* clients list mutex */
+	struct list_head	userspace_clients;
+	int			(*xfer)(struct peci_adapter *adapter,
+					struct peci_xfer_msg *msg);
+	uint			cmd_mask;
+};
+
+static inline struct peci_adapter *to_peci_adapter(void *d)
+{
+	return container_of(d, struct peci_adapter, dev);
+}
+
+static inline void *peci_get_adapdata(const struct peci_adapter *adapter)
+{
+	return dev_get_drvdata(&adapter->dev);
+}
+
+static inline void peci_set_adapdata(struct peci_adapter *adapter, void *data)
+{
+	dev_set_drvdata(&adapter->dev, data);
+}
+
+/**
+ * struct peci_client - represent a PECI client device
+ * @dev: driver model device node for the client
+ * @adapter: manages the bus segment hosting this PECI device
+ * @addr: address used on the PECI bus connected to the parent adapter
+ * @name: indicates the type of the device
+ * @detected: detected PECI clients list
+ *
+ * A peci_client identifies a single device (i.e. CPU) connected to a peci bus.
+ * The behaviour exposed to Linux is defined by the driver managing the device.
+ */
+struct peci_client {
+	struct device		dev;
+	struct peci_adapter	*adapter;
+	u8			addr;
+	char			name[PECI_NAME_SIZE];
+	struct list_head	detected;
+};
+
+static inline struct peci_client *to_peci_client(void *d)
+{
+	return container_of(d, struct peci_client, dev);
+}
+
+struct peci_device_id {
+	char		name[PECI_NAME_SIZE];
+	unsigned long	driver_data;	/* Data private to the driver */
+};
+
+/**
+ * struct peci_driver - represent a PECI device driver
+ * @probe: callback for device binding
+ * @remove: callback for device unbinding
+ * @shutdown: callback for device shutdown
+ * @driver: device driver model driver
+ * @id_table: list of PECI devices supported by this driver
+ *
+ * The driver.owner field should be set to the module owner of this driver.
+ * The driver.name field should be set to the name of this driver.
+ */
+struct peci_driver {
+	int				(*probe)(struct peci_client *client);
+	int				(*remove)(struct peci_client *client);
+	void				(*shutdown)(struct peci_client *client);
+	struct device_driver		driver;
+	const struct peci_device_id	*id_table;
+};
+
+static inline struct peci_driver *to_peci_driver(void *d)
+{
+	return container_of(d, struct peci_driver, driver);
+}
+
+/**
+ * module_peci_driver - Helper macro for registering a modular PECI driver
+ * @__peci_driver: peci_driver struct
+ *
+ * Helper macro for PECI drivers which do not do anything special in module
+ * init/exit. This eliminates a lot of boilerplate. Each module may only
+ * use this macro once, and calling it replaces module_init() and module_exit()
+ */
+#define module_peci_driver(__peci_driver) \
+	module_driver(__peci_driver, peci_add_driver, peci_del_driver)
+
+/* use a define to avoid include chaining to get THIS_MODULE */
+#define peci_add_driver(driver) peci_register_driver(THIS_MODULE, driver)
+
+int  peci_register_driver(struct module *owner, struct peci_driver *drv);
+void peci_del_driver(struct peci_driver *driver);
+struct peci_client *peci_verify_client(struct device *dev);
+struct peci_adapter *peci_alloc_adapter(struct device *dev, unsigned int size);
+int  peci_add_adapter(struct peci_adapter *adapter);
+void peci_del_adapter(struct peci_adapter *adapter);
+struct peci_adapter *peci_verify_adapter(struct device *dev);
+int  peci_command(struct peci_adapter *adpater, enum peci_cmd cmd, void *vmsg);
+int  peci_get_cpu_id(struct peci_adapter *adapter, u8 addr, u32 *cpu_id);
+
+#endif /* __LINUX_PECI_H */
diff --git a/include/uapi/linux/peci-ioctl.h b/include/uapi/linux/peci-ioctl.h
new file mode 100644
index 000000000000..c88945029b2e
--- /dev/null
+++ b/include/uapi/linux/peci-ioctl.h
@@ -0,0 +1,403 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018-2019 Intel Corporation */
+
+#ifndef __PECI_IOCTL_H
+#define __PECI_IOCTL_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* Base Address of 48d */
+#define PECI_BASE_ADDR  0x30  /* The PECI client's default address of 0x30 */
+#define PECI_OFFSET_MAX 8     /* Max numver of CPU clients */
+
+/* PCI Access */
+#define MAX_PCI_READ_LEN 24   /* Number of bytes of the PCI Space read */
+
+#define PCI_BUS0_CPU0      0x00
+#define PCI_BUS0_CPU1      0x80
+#define PCI_CPUBUSNO_BUS   0x00
+#define PCI_CPUBUSNO_DEV   0x08
+#define PCI_CPUBUSNO_FUNC  0x02
+#define PCI_CPUBUSNO       0xcc
+#define PCI_CPUBUSNO_1     0xd0
+#define PCI_CPUBUSNO_VALID 0xd4
+
+/* Package Identifier Read Parameter Value */
+#define PKG_ID_CPU_ID               0x0000  /* CPUID Info */
+#define PKG_ID_PLATFORM_ID          0x0001  /* Platform ID */
+#define PKG_ID_UNCORE_ID            0x0002  /* Uncore Device ID */
+#define PKG_ID_MAX_THREAD_ID        0x0003  /* Max Thread ID */
+#define PKG_ID_MICROCODE_REV        0x0004  /* CPU Microcode Update Revision */
+#define PKG_ID_MACHINE_CHECK_STATUS 0x0005  /* Machine Check Status */
+
+/* RdPkgConfig Index */
+#define MBX_INDEX_CPU_ID            0   /* Package Identifier Read */
+#define MBX_INDEX_VR_DEBUG          1   /* VR Debug */
+#define MBX_INDEX_PKG_TEMP_READ     2   /* Package Temperature Read */
+#define MBX_INDEX_ENERGY_COUNTER    3   /* Energy counter */
+#define MBX_INDEX_ENERGY_STATUS     4   /* DDR Energy Status */
+#define MBX_INDEX_WAKE_MODE_BIT     5   /* "Wake on PECI" Mode bit */
+#define MBX_INDEX_EPI               6   /* Efficient Performance Indication */
+#define MBX_INDEX_PKG_RAPL_PERF     8   /* Pkg RAPL Performance Status Read */
+#define MBX_INDEX_PER_CORE_DTS_TEMP 9   /* Per Core DTS Temperature Read */
+#define MBX_INDEX_DTS_MARGIN        10  /* DTS thermal margin */
+#define MBX_INDEX_SKT_PWR_THRTL_DUR 11  /* Socket Power Throttled Duration */
+#define MBX_INDEX_CFG_TDP_CONTROL   12  /* TDP Config Control */
+#define MBX_INDEX_CFG_TDP_LEVELS    13  /* TDP Config Levels */
+#define MBX_INDEX_DDR_DIMM_TEMP     14  /* DDR DIMM Temperature */
+#define MBX_INDEX_CFG_ICCMAX        15  /* Configurable ICCMAX */
+#define MBX_INDEX_TEMP_TARGET       16  /* Temperature Target Read */
+#define MBX_INDEX_CURR_CFG_LIMIT    17  /* Current Config Limit */
+#define MBX_INDEX_DIMM_TEMP_READ    20  /* Package Thermal Status Read */
+#define MBX_INDEX_DRAM_IMC_TMP_READ 22  /* DRAM IMC Temperature Read */
+#define MBX_INDEX_DDR_CH_THERM_STAT 23  /* DDR Channel Thermal Status */
+#define MBX_INDEX_PKG_POWER_LIMIT1  26  /* Package Power Limit1 */
+#define MBX_INDEX_PKG_POWER_LIMIT2  27  /* Package Power Limit2 */
+#define MBX_INDEX_TDP               28  /* Thermal design power minimum */
+#define MBX_INDEX_TDP_HIGH          29  /* Thermal design power maximum */
+#define MBX_INDEX_TDP_UNITS         30  /* Units for power/energy registers */
+#define MBX_INDEX_RUN_TIME          31  /* Accumulated Run Time */
+#define MBX_INDEX_CONSTRAINED_TIME  32  /* Thermally Constrained Time Read */
+#define MBX_INDEX_TURBO_RATIO       33  /* Turbo Activation Ratio */
+#define MBX_INDEX_DDR_RAPL_PL1      34  /* DDR RAPL PL1 */
+#define MBX_INDEX_DDR_PWR_INFO_HIGH 35  /* DRAM Power Info Read (high) */
+#define MBX_INDEX_DDR_PWR_INFO_LOW  36  /* DRAM Power Info Read (low) */
+#define MBX_INDEX_DDR_RAPL_PL2      37  /* DDR RAPL PL2 */
+#define MBX_INDEX_DDR_RAPL_STATUS   38  /* DDR RAPL Performance Status */
+#define MBX_INDEX_DDR_HOT_ABSOLUTE  43  /* DDR Hottest Dimm Absolute Temp */
+#define MBX_INDEX_DDR_HOT_RELATIVE  44  /* DDR Hottest Dimm Relative Temp */
+#define MBX_INDEX_DDR_THROTTLE_TIME 45  /* DDR Throttle Time */
+#define MBX_INDEX_DDR_THERM_STATUS  46  /* DDR Thermal Status */
+#define MBX_INDEX_TIME_AVG_TEMP     47  /* Package time-averaged temperature */
+#define MBX_INDEX_TURBO_RATIO_LIMIT 49  /* Turbo Ratio Limit Read */
+#define MBX_INDEX_HWP_AUTO_OOB      53  /* HWP Autonomous Out-of-band */
+#define MBX_INDEX_DDR_WARM_BUDGET   55  /* DDR Warm Power Budget */
+#define MBX_INDEX_DDR_HOT_BUDGET    56  /* DDR Hot Power Budget */
+#define MBX_INDEX_PKG_PSYS_PWR_LIM3 57  /* Package/Psys Power Limit3 */
+#define MBX_INDEX_PKG_PSYS_PWR_LIM1 58  /* Package/Psys Power Limit1 */
+#define MBX_INDEX_PKG_PSYS_PWR_LIM2 59  /* Package/Psys Power Limit2 */
+#define MBX_INDEX_PKG_PSYS_PWR_LIM4 60  /* Package/Psys Power Limit4 */
+#define MBX_INDEX_PERF_LIMIT_REASON 65  /* Performance Limit Reasons */
+
+/* WrPkgConfig Index */
+#define MBX_INDEX_DIMM_AMBIENT      19
+#define MBX_INDEX_DIMM_TEMP         24
+
+/* Device Specific Completion Code (CC) Definition */
+#define DEV_PECI_CC_SUCCESS          0x40
+#define DEV_PECI_CC_TIMEOUT          0x80
+#define DEV_PECI_CC_OUT_OF_RESOURCE  0x81
+#define DEV_PECI_CC_UNAVAIL_RESOURCE 0x82
+#define DEV_PECI_CC_INVALID_REQ      0x90
+
+/* Completion Code mask to check retry needs */
+#define DEV_PECI_CC_RETRY_CHECK_MASK 0xf0
+#define DEV_PECI_CC_NEED_RETRY       0x80
+
+/* Skylake EDS says to retry for 250ms */
+#define DEV_PECI_RETRY_TIME_MS       250
+#define DEV_PECI_RETRY_INTERVAL_USEC 10000
+#define DEV_PECI_RETRY_BIT           0x01
+
+#define GET_TEMP_WR_LEN   1
+#define GET_TEMP_RD_LEN   2
+#define GET_TEMP_PECI_CMD 0x01
+
+#define GET_DIB_WR_LEN   1
+#define GET_DIB_RD_LEN   8
+#define GET_DIB_PECI_CMD 0xf7
+
+#define RDPKGCFG_WRITE_LEN     5
+#define RDPKGCFG_READ_LEN_BASE 1
+#define RDPKGCFG_PECI_CMD      0xa1
+
+#define WRPKGCFG_WRITE_LEN_BASE 6
+#define WRPKGCFG_READ_LEN       1
+#define WRPKGCFG_PECI_CMD       0xa5
+
+#define RDIAMSR_WRITE_LEN 5
+#define RDIAMSR_READ_LEN  9
+#define RDIAMSR_PECI_CMD  0xb1
+
+#define WRIAMSR_PECI_CMD  0xb5
+
+#define RDPCICFG_WRITE_LEN 6
+#define RDPCICFG_READ_LEN  5
+#define RDPCICFG_PECI_CMD  0x61
+
+#define WRPCICFG_PECI_CMD  0x65
+
+#define RDPCICFGLOCAL_WRITE_LEN     5
+#define RDPCICFGLOCAL_READ_LEN_BASE 1
+#define RDPCICFGLOCAL_PECI_CMD      0xe1
+
+#define WRPCICFGLOCAL_WRITE_LEN_BASE 6
+#define WRPCICFGLOCAL_READ_LEN       1
+#define WRPCICFGLOCAL_PECI_CMD       0xe5
+
+#define PECI_BUFFER_SIZE 32
+
+/**
+ * enum peci_cmd - PECI client commands
+ * @PECI_CMD_XFER: raw PECI transfer
+ * @PECI_CMD_PING: ping, a required message for all PECI devices
+ * @PECI_CMD_GET_DIB: get DIB (Device Info Byte)
+ * @PECI_CMD_GET_TEMP: get maximum die temperature
+ * @PECI_CMD_RD_PKG_CFG: read access to the PCS (Package Configuration Space)
+ * @PECI_CMD_WR_PKG_CFG: write access to the PCS (Package Configuration Space)
+ * @PECI_CMD_RD_IA_MSR: read access to MSRs (Model Specific Registers)
+ * @PECI_CMD_WR_IA_MSR: write access to MSRs (Model Specific Registers)
+ * @PECI_CMD_RD_PCI_CFG: sideband read access to the PCI configuration space
+ *	maintained in downstream devices external to the processor
+ * @PECI_CMD_WR_PCI_CFG: sideband write access to the PCI configuration space
+ *	maintained in downstream devices external to the processor
+ * @PECI_CMD_RD_PCI_CFG_LOCAL: sideband read access to the PCI configuration
+ *	space that resides within the processor
+ * @PECI_CMD_WR_PCI_CFG_LOCAL: sideband write access to the PCI configuration
+ *	space that resides within the processor
+ *
+ * Available commands depend on client's PECI revision.
+ */
+enum peci_cmd {
+	PECI_CMD_XFER = 0,
+	PECI_CMD_PING,
+	PECI_CMD_GET_DIB,
+	PECI_CMD_GET_TEMP,
+	PECI_CMD_RD_PKG_CFG,
+	PECI_CMD_WR_PKG_CFG,
+	PECI_CMD_RD_IA_MSR,
+	PECI_CMD_WR_IA_MSR,
+	PECI_CMD_RD_PCI_CFG,
+	PECI_CMD_WR_PCI_CFG,
+	PECI_CMD_RD_PCI_CFG_LOCAL,
+	PECI_CMD_WR_PCI_CFG_LOCAL,
+	PECI_CMD_MAX
+};
+
+/**
+ * struct peci_xfer_msg - raw PECI transfer command
+ * @addr; address of the client
+ * @tx_len: number of data to be written in bytes
+ * @rx_len: number of data to be read in bytes
+ * @tx_buf: data to be written, or NULL
+ * @rx_buf: data to be read, or NULL
+ *
+ * raw PECI transfer
+ */
+struct peci_xfer_msg {
+	__u8 addr;
+	__u8 tx_len;
+	__u8 rx_len;
+	__u8 tx_buf[PECI_BUFFER_SIZE];
+	__u8 rx_buf[PECI_BUFFER_SIZE];
+} __attribute__((__packed__));
+
+/**
+ * struct peci_ping_msg - ping command
+ * @addr: address of the client
+ *
+ * Ping() is a required message for all PECI devices. This message is used to
+ * enumerate devices or determine if a device has been removed, been
+ * powered-off, etc.
+ */
+struct peci_ping_msg {
+	__u8 addr;
+} __attribute__((__packed__));
+
+/**
+ * struct peci_get_dib_msg - GetDIB command
+ * @addr: address of the client
+ * @dib: DIB data to be read
+ *
+ * The processor PECI client implementation of GetDIB() includes an 8-byte
+ * response and provides information regarding client revision number and the
+ * number of supported domains. All processor PECI clients support the GetDIB()
+ * command.
+ */
+struct peci_get_dib_msg {
+	__u8  addr;
+	__u64 dib;
+} __attribute__((__packed__));
+
+/**
+ * struct peci_get_temp_msg - GetTemp command
+ * @addr: address of the client
+ * @temp_raw: raw temperature data to be read
+ *
+ * The GetTemp() command is used to retrieve the maximum die temperature from a
+ * target PECI address. The temperature is used by the external thermal
+ * management system to regulate the temperature on the die. The data is
+ * returned as a negative value representing the number of degrees centigrade
+ * below the maximum processor junction temperature.
+ */
+struct peci_get_temp_msg {
+	__u8  addr;
+	__s16 temp_raw;
+} __attribute__((__packed__));
+
+/**
+ * struct peci_rd_pkg_cfg_msg - RdPkgConfig command
+ * @addr: address of the client
+ * @index: encoding index for the requested service
+ * @param: specific data being requested
+ * @rx_len: number of data to be read in bytes
+ * @pkg_config: package config data to be read
+ *
+ * The RdPkgConfig() command provides read access to the Package Configuration
+ * Space (PCS) within the processor, including various power and thermal
+ * management functions. Typical PCS read services supported by the processor
+ * may include access to temperature data, energy status, run time information,
+ * DIMM temperatures and so on.
+ */
+struct peci_rd_pkg_cfg_msg {
+	__u8  addr;
+	__u8  index;
+	__u16 param;
+	__u8  rx_len;
+	__u8  pkg_config[4];
+} __attribute__((__packed__));
+
+/**
+ * struct peci_wr_pkg_cfg_msg - WrPkgConfig command
+ * @addr: address of the client
+ * @index: encoding index for the requested service
+ * @param: specific data being requested
+ * @tx_len: number of data to be written in bytes
+ * @value: package config data to be written
+ *
+ * The WrPkgConfig() command provides write access to the Package Configuration
+ * Space (PCS) within the processor, including various power and thermal
+ * management functions. Typical PCS write services supported by the processor
+ * may include power limiting, thermal averaging constant programming and so on.
+ */
+struct peci_wr_pkg_cfg_msg {
+	__u8  addr;
+	__u8  index;
+	__u16 param;
+	__u8  tx_len;
+	__u32 value;
+} __attribute__((__packed__));
+
+/**
+ * struct peci_rd_ia_msr_msg - RdIAMSR command
+ * @addr: address of the client
+ * @thread_id: ID of the specific logical processor
+ * @address: address of MSR to read from
+ * @value: data to be read
+ *
+ * The RdIAMSR() PECI command provides read access to Model Specific Registers
+ * (MSRs) defined in the processor's Intel Architecture (IA).
+ */
+struct peci_rd_ia_msr_msg {
+	__u8  addr;
+	__u8  thread_id;
+	__u16 address;
+	__u64 value;
+} __attribute__((__packed__));
+
+/**
+ * struct peci_rd_pci_cfg_msg - RdPCIConfig command
+ * @addr: address of the client
+ * @bus: PCI bus number
+ * @device: PCI device number
+ * @function: specific function to read from
+ * @reg: specific register to read from
+ * @pci_config: config data to be read
+ *
+ * The RdPCIConfig() command provides sideband read access to the PCI
+ * configuration space maintained in downstream devices external to the
+ * processor.
+ */
+struct peci_rd_pci_cfg_msg {
+	__u8  addr;
+	__u8  bus;
+	__u8  device;
+	__u8  function;
+	__u16 reg;
+	__u8  pci_config[4];
+} __attribute__((__packed__));
+
+/**
+ * struct peci_rd_pci_cfg_local_msg - RdPCIConfigLocal command
+ * @addr: address of the client
+ * @bus: PCI bus number
+ * @device: PCI device number
+ * @function: specific function to read from
+ * @reg: specific register to read from
+ * @rx_len: number of data to be read in bytes
+ * @pci_config: config data to be read
+ *
+ * The RdPCIConfigLocal() command provides sideband read access to the PCI
+ * configuration space that resides within the processor. This includes all
+ * processor IIO and uncore registers within the PCI configuration space.
+ */
+struct peci_rd_pci_cfg_local_msg {
+	__u8  addr;
+	__u8  bus;
+	__u8  device;
+	__u8  function;
+	__u16 reg;
+	__u8  rx_len;
+	__u8  pci_config[4];
+} __attribute__((__packed__));
+
+/**
+ * struct peci_wr_pci_cfg_local_msg - WrPCIConfigLocal command
+ * @addr: address of the client
+ * @bus: PCI bus number
+ * @device: PCI device number
+ * @function: specific function to read from
+ * @reg: specific register to read from
+ * @tx_len: number of data to be written in bytes
+ * @value: config data to be written
+ *
+ * The WrPCIConfigLocal() command provides sideband write access to the PCI
+ * configuration space that resides within the processor. PECI originators can
+ * access this space even before BIOS enumeration of the system buses.
+ */
+struct peci_wr_pci_cfg_local_msg {
+	__u8  addr;
+	__u8  bus;
+	__u8  device;
+	__u8  function;
+	__u16 reg;
+	__u8  tx_len;
+	__u32 value;
+} __attribute__((__packed__));
+
+#define PECI_IOC_BASE  0xb7
+
+#define PECI_IOC_XFER \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_XFER, struct peci_xfer_msg)
+
+#define PECI_IOC_PING \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_PING, struct peci_ping_msg)
+
+#define PECI_IOC_GET_DIB \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_GET_DIB, struct peci_get_dib_msg)
+
+#define PECI_IOC_GET_TEMP \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_GET_TEMP, struct peci_get_temp_msg)
+
+#define PECI_IOC_RD_PKG_CFG \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_RD_PKG_CFG, struct peci_rd_pkg_cfg_msg)
+
+#define PECI_IOC_WR_PKG_CFG \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_WR_PKG_CFG, struct peci_wr_pkg_cfg_msg)
+
+#define PECI_IOC_RD_IA_MSR \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_RD_IA_MSR, struct peci_rd_ia_msr_msg)
+
+#define PECI_IOC_RD_PCI_CFG \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_RD_PCI_CFG, struct peci_rd_pci_cfg_msg)
+
+#define PECI_IOC_RD_PCI_CFG_LOCAL \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_RD_PCI_CFG_LOCAL, \
+	      struct peci_rd_pci_cfg_local_msg)
+
+#define PECI_IOC_WR_PCI_CFG_LOCAL \
+	_IOWR(PECI_IOC_BASE, PECI_CMD_WR_PCI_CFG_LOCAL, \
+	      struct peci_wr_pci_cfg_local_msg)
+
+#endif /* __PECI_IOCTL_H */