diff mbox

[v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication

Message ID 1477293002-144213-1-git-send-email-wei.w.wang@intel.com
State New
Headers show

Commit Message

Wang, Wei W Oct. 24, 2016, 7:10 a.m. UTC
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 docs/specs/vhost-user.txt | 81 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 72 insertions(+), 9 deletions(-)

Comments

Michael S. Tsirkin Oct. 30, 2016, 6:02 p.m. UTC | #1
On Mon, Oct 24, 2016 at 03:10:02PM +0800, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>

As the implementation is not going to be merged
for this QEMU release, I think we shouldn't
drop the doc either.
I'll review and respond a bit later so we can finalize
the protocol meanwhile.

> ---
>  docs/specs/vhost-user.txt | 81 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 72 insertions(+), 9 deletions(-)
> 
> diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
> index 7890d71..173f693 100644
> --- a/docs/specs/vhost-user.txt
> +++ b/docs/specs/vhost-user.txt
> @@ -17,28 +17,37 @@ The protocol defines 2 sides of the communication, master and slave. Master is
>  the application that shares its virtqueues, in our case QEMU. Slave is the
>  consumer of the virtqueues.
>  
> -In the current implementation QEMU is the Master, and the Slave is intended to
> +In the traditional implementation QEMU is the Master, and the Slave is intended to
>  be a software Ethernet switch running in user space, such as Snabbswitch.
>  
>  Master and slave can be either a client (i.e. connecting) or server (listening)
>  in the socket communication.
>  
> +The current vhost-user protocol is extended to support the vhost-pci based inter-VM
> +communication. In this case, Slave is a QEMU which runs a vhost-pci server, and
> +Master is another QEMU which runs a vhost-pci client.
> +
>  Message Specification
>  ---------------------
>  
>  Note that all numbers are in the machine native byte order. A vhost-user message
> -consists of 3 header fields and a payload:
> +consists of 4 header fields and a payload:
>  
> -------------------------------------
> -| request | flags | size | payload |
> -------------------------------------
> +----------------------------------------------
> +| request | flags | conn_id | size | payload |
> +----------------------------------------------
>  
>   * Request: 32-bit type of the request
>   * Flags: 32-bit bit field:
>     - Lower 2 bits are the version (currently 0x01)
> -   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
> +   - Bit 2 is the reply flag - needs to be sent on each reply
>     - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
>       details.
> + * Conn_id: 64-bit connection id to indentify a client socket connection. It is
> +            introduced in version 0x02 to support the "1-server-N-client" model
> +            and an asynchronous client read implementation. The connection id,
> +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client (e.g. a client who
> +            has not got its connection id from the server in the initial talk)
>   * Size - 32-bit size of the payload
>  
>  
> @@ -97,6 +106,13 @@ Depending on the request type, payload can be:
>     log offset: offset from start of supplied file descriptor
>         where logging starts (i.e. where guest address 0 would be logged)
>  
> +* Device info
> +   --------------------
> +   | virito id | uuid |
> +   --------------------
> +   Virtio id: 16-bit virtio id of the device
> +   UUID: 128-bit UUID to identify the QEMU instance that creates the device
> +
>  In QEMU the vhost-user message is implemented with the following struct:
>  
>  typedef struct VhostUserMsg {
> @@ -109,6 +125,7 @@ typedef struct VhostUserMsg {
>          struct vhost_vring_addr addr;
>          VhostUserMemory memory;
>          VhostUserLog log;
> +        DeviceInfo dev_info;
>      };
>  } QEMU_PACKED VhostUserMsg;
>  
> @@ -119,17 +136,25 @@ The protocol for vhost-user is based on the existing implementation of vhost
>  for the Linux Kernel. Most messages that can be sent via the Unix domain socket
>  implementing vhost-user have an equivalent ioctl to the kernel implementation.
>  
> -The communication consists of master sending message requests and slave sending
> -message replies. Most of the requests don't require replies. Here is a list of
> -the ones that do:
> +Traditionally, the communication consists of master sending message requests
> +and slave sending message replies. Most of the requests don't require replies.
> +Here is a list of the ones that do:
>  
>   * VHOST_GET_FEATURES
>   * VHOST_GET_PROTOCOL_FEATURES
>   * VHOST_GET_VRING_BASE
>   * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
> + * VHOST_USER_GET_CONN_ID
> + * VHOST_USER_SET_PEER_CONNECTION
>  
>  [ Also see the section on REPLY_ACK protocol extension. ]
>  
> +Currently, the communication also supports the Slave (server) sending messages
> +to the Master (client). Here is a list of them:
> + * VHOST_USER_SET_FEATURES
> + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively request to disconnect
> +   with the client)
> +
>  There are several messages that the master sends with file descriptors passed
>  in the ancillary data:
>  
> @@ -259,6 +284,7 @@ Protocol features
>  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
>  #define VHOST_USER_PROTOCOL_F_RARP           2
>  #define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
> +#define VHOST_USER_PROTOCOL_F_VHOST_PCI      4
>  
>  Message types
>  -------------
> @@ -470,6 +496,43 @@ Message types
>        The first 6 bytes of the payload contain the mac address of the guest to
>        allow the vhost user backend to construct and broadcast the fake RARP.
>  
> + * VHOST_USER_GET_CONN_ID
> +
> +      Id: 20
> +      Equivalent ioctl: N/A
> +      Master payload: u64
> +
> +      The client sends this message to the server to ask for its connection id.
> +      The connection id is then put into the message header (the conn_id field),
> +      so that the server can always know who it is talking with.
> +
> +* VHOST_USER_SET_DEV_INFO
> +
> +      Id: 21
> +      Equivalent ioctl: N/A
> +      Master payload: dev info
> +
> +      The client sends the producer device info to the server.
> +      This request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI has
> +      been negotiated.
> +
> +* VHOST_USER_SET_PEER_CONNECTION
> +
> +      Id: 22
> +      Equivalent ioctl: N/A
> +      Master payload: u64
> +
> +      The producer device requests to connect or disconnect to the consumer device.
> +      The consumer device may request to disconnect to the producer device. This
> +      request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI has been
> +      negotiated.
> +      Connection request: If the reply message indicates "success", the vhost-pci based
> +      inter-VM communication channel has been established.
> +      Disconnection request: If the reply message indicates "success", the vhost-pci based
> +      inter-VM communication channel has been destroyed.
> +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
> +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
> +
>  VHOST_USER_PROTOCOL_F_REPLY_ACK:
>  -------------------------------
>  The original vhost-user specification only demands replies for certain
> -- 
> 1.9.1
Wang, Wei W Nov. 7, 2016, 5:43 a.m. UTC | #2
Hi Marc-André,

On Monday, October 24, 2016 3:10 PM, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  docs/specs/vhost-user.txt | 81
> +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 72 insertions(+), 9 deletions(-)
>
> diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt index
> 7890d71..173f693 100644
> --- a/docs/specs/vhost-user.txt
> +++ b/docs/specs/vhost-user.txt

Do you have any comments on this doc? Thanks.

Best,
Wei
Marc-André Lureau Nov. 8, 2016, 7:47 a.m. UTC | #3
Hi

I suggest you split this patch for the various "features" you propose.

On Mon, Oct 24, 2016 at 11:10 AM Wei Wang <wei.w.wang@intel.com> wrote:

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  docs/specs/vhost-user.txt | 81
> +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 72 insertions(+), 9 deletions(-)
>
> diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
> index 7890d71..173f693 100644
> --- a/docs/specs/vhost-user.txt
> +++ b/docs/specs/vhost-user.txt
> @@ -17,28 +17,37 @@ The protocol defines 2 sides of the communication,
> master and slave. Master is
>  the application that shares its virtqueues, in our case QEMU. Slave is the
>  consumer of the virtqueues.
>
> -In the current implementation QEMU is the Master, and the Slave is
> intended to
> +In the traditional implementation QEMU is the Master, and the Slave is
> intended to
>  be a software Ethernet switch running in user space, such as Snabbswitch.
>
>
ok


>  Master and slave can be either a client (i.e. connecting) or server
> (listening)
>  in the socket communication.
>
> +The current vhost-user protocol is extended to support the vhost-pci
> based inter-VM
> +communication. In this case, Slave is a QEMU which runs a vhost-pci
> server, and
> +Master is another QEMU which runs a vhost-pci client.
> +
>


Why introduce new terminology "server" and "client"? What does it change?
This is confusing with socket client/server configuration.


>  Message Specification
>  ---------------------
>
>  Note that all numbers are in the machine native byte order. A vhost-user
> message
> -consists of 3 header fields and a payload:
> +consists of 4 header fields and a payload:
>
> -------------------------------------
> -| request | flags | size | payload |
> -------------------------------------
> +----------------------------------------------
> +| request | flags | conn_id | size | payload |
> +----------------------------------------------
>
>   * Request: 32-bit type of the request
>   * Flags: 32-bit bit field:
>     - Lower 2 bits are the version (currently 0x01)
> -   - Bit 2 is the reply flag - needs to be sent on each reply from the
> slave
> +   - Bit 2 is the reply flag - needs to be sent on each reply
>     - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK
> for
>       details.
> + * Conn_id: 64-bit connection id to indentify a client socket connection.
> It is
> +            introduced in version 0x02 to support the "1-server-N-client"
> model
> +            and an asynchronous client read implementation. The
> connection id,
> +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client (e.g. a
> client who
> +            has not got its connection id from the server in the initial
> talk)
>

I don't understand why you need a connection id, on each message. What's
the purpose? Since the communication is unicast, a single message should be
enough.

  * Size - 32-bit size of the payload
>
>
> @@ -97,6 +106,13 @@ Depending on the request type, payload can be:
>     log offset: offset from start of supplied file descriptor
>         where logging starts (i.e. where guest address 0 would be logged)
>
> +* Device info
> +   --------------------
> +   | virito id | uuid |
> +   --------------------
> +   Virtio id: 16-bit virtio id of the device
> +   UUID: 128-bit UUID to identify the QEMU instance that creates the
> device
> +
>

I wonder if UUID should be a different message.



>  In QEMU the vhost-user message is implemented with the following struct:
>
>  typedef struct VhostUserMsg {
> @@ -109,6 +125,7 @@ typedef struct VhostUserMsg {
>          struct vhost_vring_addr addr;
>          VhostUserMemory memory;
>          VhostUserLog log;
> +        DeviceInfo dev_info;
>      };
>  } QEMU_PACKED VhostUserMsg;
>
> @@ -119,17 +136,25 @@ The protocol for vhost-user is based on the existing
> implementation of vhost
>  for the Linux Kernel. Most messages that can be sent via the Unix domain
> socket
>  implementing vhost-user have an equivalent ioctl to the kernel
> implementation.
>
> -The communication consists of master sending message requests and slave
> sending
> -message replies. Most of the requests don't require replies. Here is a
> list of
> -the ones that do:
> +Traditionally, the communication consists of master sending message
> requests
> +and slave sending message replies. Most of the requests don't require
> replies.
> +Here is a list of the ones that do:
>
>   * VHOST_GET_FEATURES
>   * VHOST_GET_PROTOCOL_FEATURES
>   * VHOST_GET_VRING_BASE
>   * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
> + * VHOST_USER_GET_CONN_ID
> + * VHOST_USER_SET_PEER_CONNECTION
>
> Let's also fix  the VHOST_USER prefix of the above requests.

 [ Also see the section on REPLY_ACK protocol extension. ]
>
> +Currently, the communication also supports the Slave (server) sending
> messages
> +to the Master (client). Here is a list of them:
> + * VHOST_USER_SET_FEATURES
>
+ * VHOST_USER_SET_PEER_CONNECTION (the serve may actively request to
> disconnect
> +   with the client)
>

Oh, you are making the communication bidirectional? This is a fundamental
change in the protocol. This may be difficult to implement in qemu, since
the communication in synchronous, a request expects an immediate reply, if
it gets back a request (from the slave) in the middle, it will fail.

Currently all requests (including VHOST_USER_SET_FEATURES) are coming from
the Master. I don't understand yet the purpose of
VHOST_USER_SET_PEER_CONNECTION to propose an alternative, but I would
rather keep the unidirectional communication if possible.

 There are several messages that the master sends with file descriptors
> passed
>  in the ancillary data:
>
> @@ -259,6 +284,7 @@ Protocol features
>  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
>  #define VHOST_USER_PROTOCOL_F_RARP           2
>  #define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
> +#define VHOST_USER_PROTOCOL_F_VHOST_PCI      4
>
>  Message types
>  -------------
> @@ -470,6 +496,43 @@ Message types
>        The first 6 bytes of the payload contain the mac address of the
> guest to
>        allow the vhost user backend to construct and broadcast the fake
> RARP.
>
> + * VHOST_USER_GET_CONN_ID
> +
> +      Id: 20
> +      Equivalent ioctl: N/A
> +      Master payload: u64
> +
> +      The client sends this message to the server to ask for its
> connection id.
>

Confusing, please keep the Master/Slave terminology


> +      The connection id is then put into the message header (the conn_id
> field),
> +      so that the server can always know who it is talking with.
> +
>

Could you explain what the connection id is for?

+This request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI
has...


> +* VHOST_USER_SET_DEV_INFO
> +
> +      Id: 21
> +      Equivalent ioctl: N/A
> +      Master payload: dev info
> +
> +      The client sends the producer device info to the server.
>

"Master sends producer device info to the Slave" works, no?

Could we guarantee this message is sent before SET_VRING*?


> +      This request should be sent only when
> VHOST_USER_PROTOCOL_F_VHOST_PCI has
> +      been negotiated.
> +
>

I think this message could be useful for other purposes than vhost-pci,
thus I would give it its own flag.


> +* VHOST_USER_SET_PEER_CONNECTION
> +
> +      Id: 22
> +      Equivalent ioctl: N/A
> +      Master payload: u64
> +
> +      The producer device requests to connect or disconnect to the
> consumer device.
>

producer->Master, consummer->Slave

How does it interact with SET_VRING_ENABLE?


> +      The consumer device may request to disconnect to the producer
> device. This
> +      request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI
> has been
> +      negotiated.
> +      Connection request: If the reply message indicates "success", the
> vhost-pci based
> +      inter-VM communication channel has been established.
> +      Disconnection request: If the reply message indicates "success",
> the vhost-pci based
> +      inter-VM communication channel has been destroyed.
> +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
> +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
> +
>  VHOST_USER_PROTOCOL_F_REPLY_ACK:
>  -------------------------------
>  The original vhost-user specification only demands replies for certain
> --
> 1.9.1
>
>
thanks
Wang, Wei W Nov. 8, 2016, 11:35 a.m. UTC | #4
On 11/08/2016 03:47 PM, Marc-André Lureau wrote:
> Hi
>
> I suggest you split this patch for the various "features" you propose.

OK. I'll make it several small patches.   <*v1-AR1*>
>
>      Master and slave can be either a client (i.e. connecting) or
>     server (listening)
>      in the socket communication.
>
"Client" and "Server" have already been used in the doc here.
>
>     +The current vhost-user protocol is extended to support the
>     vhost-pci based inter-VM
>     +communication. In this case, Slave is a QEMU which runs a
>     vhost-pci server, and
>     +Master is another QEMU which runs a vhost-pci client.
>     +
>
>
>
> Why introduce new terminology "server" and "client"? What does it 
> change? This is confusing with socket client/server configuration.

OK. I will try with "Slave" and "Master" in this doc when it's possible. 
<*v1-AR2*>

>
>      Message Specification
>      ---------------------
>
>      Note that all numbers are in the machine native byte order. A
>     vhost-user message
>     -consists of 3 header fields and a payload:
>     +consists of 4 header fields and a payload:
>
>     -------------------------------------
>     -| request | flags | size | payload |
>     -------------------------------------
>     +----------------------------------------------
>     +| request | flags | conn_id | size | payload |
>     +----------------------------------------------
>
>       * Request: 32-bit type of the request
>       * Flags: 32-bit bit field:
>         - Lower 2 bits are the version (currently 0x01)
>     -   - Bit 2 is the reply flag - needs to be sent on each reply
>     from the slave
>     +   - Bit 2 is the reply flag - needs to be sent on each reply
>         - Bit 3 is the need_reply flag - see
>     VHOST_USER_PROTOCOL_F_REPLY_ACK for
>           details.
>     + * Conn_id: 64-bit connection id to indentify a client socket
>     connection. It is
>     +            introduced in version 0x02 to support the
>     "1-server-N-client" model
>     +            and an asynchronous client read implementation. The
>     connection id,
>     +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client
>     (e.g. a client who
>     +            has not got its connection id from the server in the
>     initial talk)
>
>
> I don't understand why you need a connection id, on each message. 
> What's the purpose? Since the communication is unicast, a single 
> message should be enough.

Sure, please let me explain more:
The QEMU socket is going to be upgraded to support 1 server socket being 
connected by multiple client sockets (I've made patches to achieve 
this). In other words, here, multiple masters will connect to one slave, 
and the slave creates a vhost-pci device for each master after receiving 
the necessary message info. The slave needs to know which master it is 
talking to when receiving a message, as it maintains multiple 
connections at the same time.
Also shed some light on the implementation:
The slave maintains a table for those masters. Each master has an entry 
in the table (indexed by a "conn_id"). When the slave receives a 
message, the payload is added to the corresponding table entry. When 
things are ready (i.e. it has received enough info to create a vhost-pci 
device for the master), the device creation code creates and initializes 
a vhost-pci device (e.g. initialize VirtioPCIProxy in virtio-pci.c) from 
the corresponding table entry.

>
>       * Size - 32-bit size of the payload
>
>
>     @@ -97,6 +106,13 @@ Depending on the request type, payload can be:
>         log offset: offset from start of supplied file descriptor
>             where logging starts (i.e. where guest address 0 would be
>     logged)
>
>     +* Device info
>     +   --------------------
>     +   | virito id | uuid |
>     +   --------------------
>     +   Virtio id: 16-bit virtio id of the device
>     +   UUID: 128-bit UUID to identify the QEMU instance that creates
>     the device
>     +
>
>
> I wonder if UUID should be a different message.
>
We can make uuid another message if it has other usages.
Do you see any other usages of uuid?

>
>      In QEMU the vhost-user message is implemented with the following
>     struct:
>
>      typedef struct VhostUserMsg {
>     @@ -109,6 +125,7 @@ typedef struct VhostUserMsg {
>              struct vhost_vring_addr addr;
>              VhostUserMemory memory;
>              VhostUserLog log;
>     +        DeviceInfo dev_info;
>          };
>      } QEMU_PACKED VhostUserMsg;
>
>     @@ -119,17 +136,25 @@ The protocol for vhost-user is based on the
>     existing implementation of vhost
>      for the Linux Kernel. Most messages that can be sent via the Unix
>     domain socket
>      implementing vhost-user have an equivalent ioctl to the kernel
>     implementation.
>
>     -The communication consists of master sending message requests and
>     slave sending
>     -message replies. Most of the requests don't require replies. Here
>     is a list of
>     -the ones that do:
>     +Traditionally, the communication consists of master sending
>     message requests
>     +and slave sending message replies. Most of the requests don't
>     require replies.
>     +Here is a list of the ones that do:
>
>       * VHOST_GET_FEATURES
>       * VHOST_GET_PROTOCOL_FEATURES
>       * VHOST_GET_VRING_BASE
>       * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
>     + * VHOST_USER_GET_CONN_ID
>     + * VHOST_USER_SET_PEER_CONNECTION
>
> Let's also fix  the VHOST_USER prefix of the above requests.
>
Sure, will do. <*v1-AR3*>

>      [ Also see the section on REPLY_ACK protocol extension. ]
>
>     +Currently, the communication also supports the Slave (server)
>     sending messages
>     +to the Master (client). Here is a list of them:
>     + * VHOST_USER_SET_FEATURES
>
>     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively request
>     to disconnect
>     +   with the client)
>
>
> Oh, you are making the communication bidirectional? This is a 
> fundamental change in the protocol. This may be difficult to implement 
> in qemu, since the communication in synchronous, a request expects an 
> immediate reply, if it gets back a request (from the slave) in the 
> middle, it will fail.
>

Not really.
Adding the above two doesn't affect the existing synchronous read() 
messages (basically, those VHOST_USER_GET_xx messages). Like 
VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply. Here, we 
just make the slave capable of actively sending messages to the master.

> Currently all requests (including VHOST_USER_SET_FEATURES) are coming 
> from the Master. I don't understand yet the purpose of 
> VHOST_USER_SET_PEER_CONNECTION to propose an alternative, but I would 
> rather keep the unidirectional communication if possible.
>
>      There are several messages that the master sends with file
>     descriptors passed
>      in the ancillary data:
>
>     @@ -259,6 +284,7 @@ Protocol features
>      #define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
>      #define VHOST_USER_PROTOCOL_F_RARP           2
>      #define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
>     +#define VHOST_USER_PROTOCOL_F_VHOST_PCI      4
>
>      Message types
>      -------------
>     @@ -470,6 +496,43 @@ Message types
>            The first 6 bytes of the payload contain the mac address of
>     the guest to
>            allow the vhost user backend to construct and broadcast the
>     fake RARP.
>
>     + * VHOST_USER_GET_CONN_ID
>     +
>     +      Id: 20
>     +      Equivalent ioctl: N/A
>     +      Master payload: u64
>     +
>     +      The client sends this message to the server to ask for its
>     connection id.
>
>
> Confusing, please keep the Master/Slave terminology
OK.
>
>     + The connection id is then put into the message header (the
>     conn_id field),
>     +      so that the server can always know who it is talking with.
>     +
>
>
> Could you explain what the connection id is for?

Explained above. Please let me know if I didn't make it clear.

>
> +This request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI 
> has...
>
>     +* VHOST_USER_SET_DEV_INFO
>     +
>     +      Id: 21
>     +      Equivalent ioctl: N/A
>     +      Master payload: dev info
>     +
>     +      The client sends the producer device info to the server.
>
>
> "Master sends producer device info to the Slave" works, no?

Yes, it works. The current dev info only contains a "virtio id" field 
(assume we'll take uuid out as a separate message), which tells the 
slave if it is a net, scsi, console or else. do you see any issue?
>
> Could we guarantee this message is sent before SET_VRING*?

Why do we need to guarantee this?

>
>     + This request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI has
>     +      been negotiated.
>     +
>
>
> I think this message could be useful for other purposes than 
> vhost-pci, thus I would give it its own flag.

Could you please give an example of other usage? Thanks.

>
>     +* VHOST_USER_SET_PEER_CONNECTION
>     +
>     +      Id: 22
>     +      Equivalent ioctl: N/A
>     +      Master payload: u64
>     +
>     +      The producer device requests to connect or disconnect to
>     the consumer device.
>
>
> producer->Master, consummer->Slave
>
> How does it interact with SET_VRING_ENABLE?

It's independent of SET_VRING_ENABLE:
SET_VRING_ENABLE enables a virtq to be in "active".
SET_PEER_CONNECTION enables the peer (slave or master) device to be in 
"active". The driver shouldn't send packets if the device is inactive.

>
>     + The consumer device may request to disconnect to the producer
>     device. This
>     +      request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI has been
>     +      negotiated.
>     +      Connection request: If the reply message indicates
>     "success", the vhost-pci based
>     +      inter-VM communication channel has been established.
>     +      Disconnection request: If the reply message indicates
>     "success", the vhost-pci based
>     +      inter-VM communication channel has been destroyed.
>     +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
>     +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
>     +
>
I think it would be better to add one more command here:
#define VHOST_USER_SET_PEER_CONNECTION_F_INIT 2

The master uses this command to tell the slave it's ready to create the 
vhost-pci device. Regarding the implementation, it is put at the bottom 
of vhost_net_start() function (when all the vring info have been sent 
and enabled).

Best,
Wei
Wang, Wei W Nov. 8, 2016, 11:59 a.m. UTC | #5
On Tuesday, November 8, 2016 7:36 PM, Wei Wang wrote:

Missed this one:
> > Currently all requests (including VHOST_USER_SET_FEATURES) are coming

> > from the Master. I don't understand yet the purpose of

> > VHOST_USER_SET_PEER_CONNECTION to propose an alternative, but I would

> > rather keep the unidirectional communication if possible.


The VHOST_USER_SET_PEER_CONNECTION is proposed here to turn “on/off” the device connection status. For example, when the master side VM wants to turn down, the virtio-net driver sets the virtio-net device’s PEER_CONNECTION status to “off”, which needs to first synchronize with the vhost-pci device using this message. In return, the vhost-pci device will send VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) to the virtio-net device, then the status is set to OFF, and the driver is ok to unload. (Same for the vhost-pci driver to unload)

Best,
Wei
Marc-André Lureau Nov. 8, 2016, 12:17 p.m. UTC | #6
Hi

On Tue, Nov 8, 2016 at 3:35 PM Wei Wang <wei.w.wang@intel.com> wrote:

On 11/08/2016 03:47 PM, Marc-André Lureau wrote:
> Hi
>
> I suggest you split this patch for the various "features" you propose.

OK. I'll make it several small patches.   <*v1-AR1*>
>
>      Master and slave can be either a client (i.e. connecting) or
>     server (listening)
>      in the socket communication.
>
"Client" and "Server" have already been used in the doc here.
>
>     +The current vhost-user protocol is extended to support the
>     vhost-pci based inter-VM
>     +communication. In this case, Slave is a QEMU which runs a
>     vhost-pci server, and
>     +Master is another QEMU which runs a vhost-pci client.
>     +
>
>
>
> Why introduce new terminology "server" and "client"? What does it
> change? This is confusing with socket client/server configuration.

OK. I will try with "Slave" and "Master" in this doc when it's possible.
<*v1-AR2*>

>
>      Message Specification
>      ---------------------
>
>      Note that all numbers are in the machine native byte order. A
>     vhost-user message
>     -consists of 3 header fields and a payload:
>     +consists of 4 header fields and a payload:
>
>     -------------------------------------
>     -| request | flags | size | payload |
>     -------------------------------------
>     +----------------------------------------------
>     +| request | flags | conn_id | size | payload |
>     +----------------------------------------------
>
>       * Request: 32-bit type of the request
>       * Flags: 32-bit bit field:
>         - Lower 2 bits are the version (currently 0x01)
>     -   - Bit 2 is the reply flag - needs to be sent on each reply
>     from the slave
>     +   - Bit 2 is the reply flag - needs to be sent on each reply
>         - Bit 3 is the need_reply flag - see
>     VHOST_USER_PROTOCOL_F_REPLY_ACK for
>           details.
>     + * Conn_id: 64-bit connection id to indentify a client socket
>     connection. It is
>     +            introduced in version 0x02 to support the
>     "1-server-N-client" model
>     +            and an asynchronous client read implementation. The
>     connection id,
>     +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client
>     (e.g. a client who
>     +            has not got its connection id from the server in the
>     initial talk)
>
>
> I don't understand why you need a connection id, on each message.
> What's the purpose? Since the communication is unicast, a single
> message should be enough.

Sure, please let me explain more:
The QEMU socket is going to be upgraded to support 1 server socket being
connected by multiple client sockets (I've made patches to achieve
this). In other words, here, multiple masters will connect to one slave,
and the slave creates a vhost-pci device for each master after receiving
the necessary message info. The slave needs to know which master it is
talking to when receiving a message, as it maintains multiple
connections at the same time.


You should be able to identify each connection in the slave (as a socket
server), without a need for connection id: connected sockets are
independent from each others.


Also shed some light on the implementation:
The slave maintains a table for those masters. Each master has an entry
in the table (indexed by a "conn_id"). When the slave receives a
message, the payload is added to the corresponding table entry. When
things are ready (i.e. it has received enough info to create a vhost-pci
device for the master), the device creation code creates and initializes
a vhost-pci device (e.g. initialize VirtioPCIProxy in virtio-pci.c) from
the corresponding table entry.

>
>       * Size - 32-bit size of the payload
>
>
>     @@ -97,6 +106,13 @@ Depending on the request type, payload can be:
>         log offset: offset from start of supplied file descriptor
>             where logging starts (i.e. where guest address 0 would be
>     logged)
>
>     +* Device info
>     +   --------------------
>     +   | virito id | uuid |
>     +   --------------------
>     +   Virtio id: 16-bit virtio id of the device
>     +   UUID: 128-bit UUID to identify the QEMU instance that creates
>     the device
>     +
>
>
> I wonder if UUID should be a different message.
>
We can make uuid another message if it has other usages.
Do you see any other usages of uuid?


Allows to associate data/configuration with a particular VM, in a
multi-master/single-slave scenario. But tbh, I don't see how this is
necessary, I can imagine solving this differently (having different
connection address per vm for ex). I would like to understand your use case.


>
>      In QEMU the vhost-user message is implemented with the following
>     struct:
>
>      typedef struct VhostUserMsg {
>     @@ -109,6 +125,7 @@ typedef struct VhostUserMsg {
>              struct vhost_vring_addr addr;
>              VhostUserMemory memory;
>              VhostUserLog log;
>     +        DeviceInfo dev_info;
>          };
>      } QEMU_PACKED VhostUserMsg;
>
>     @@ -119,17 +136,25 @@ The protocol for vhost-user is based on the
>     existing implementation of vhost
>      for the Linux Kernel. Most messages that can be sent via the Unix
>     domain socket
>      implementing vhost-user have an equivalent ioctl to the kernel
>     implementation.
>
>     -The communication consists of master sending message requests and
>     slave sending
>     -message replies. Most of the requests don't require replies. Here
>     is a list of
>     -the ones that do:
>     +Traditionally, the communication consists of master sending
>     message requests
>     +and slave sending message replies. Most of the requests don't
>     require replies.
>     +Here is a list of the ones that do:
>
>       * VHOST_GET_FEATURES
>       * VHOST_GET_PROTOCOL_FEATURES
>       * VHOST_GET_VRING_BASE
>       * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
>     + * VHOST_USER_GET_CONN_ID
>     + * VHOST_USER_SET_PEER_CONNECTION
>
> Let's also fix  the VHOST_USER prefix of the above requests.
>
Sure, will do. <*v1-AR3*>

>      [ Also see the section on REPLY_ACK protocol extension. ]
>
>     +Currently, the communication also supports the Slave (server)
>     sending messages
>     +to the Master (client). Here is a list of them:
>     + * VHOST_USER_SET_FEATURES
>
>     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively request
>     to disconnect
>     +   with the client)
>
>
> Oh, you are making the communication bidirectional? This is a
> fundamental change in the protocol. This may be difficult to implement
> in qemu, since the communication in synchronous, a request expects an
> immediate reply, if it gets back a request (from the slave) in the
> middle, it will fail.
>

Not really.
Adding the above two doesn't affect the existing synchronous read()
messages (basically, those VHOST_USER_GET_xx messages). Like
VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply. Here, we
just make the slave capable of actively sending messages to the master.


Yes, that's the trouble. At any time the Master may send a request and
expects an immediate reply. There is a race of getting a request from the
Slave in the middle with your proposed change. I'd rather avoid making the
request bidirectionnal if possible. (I proposed a second channel for
Slave->Master request in the past:
https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)

> Currently all requests (including VHOST_USER_SET_FEATURES) are coming
> from the Master. I don't understand yet the purpose of
> VHOST_USER_SET_PEER_CONNECTION to propose an alternative, but I would
> rather keep the unidirectional communication if possible.
>
>      There are several messages that the master sends with file
>     descriptors passed
>      in the ancillary data:
>
>     @@ -259,6 +284,7 @@ Protocol features
>      #define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
>      #define VHOST_USER_PROTOCOL_F_RARP           2
>      #define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
>     +#define VHOST_USER_PROTOCOL_F_VHOST_PCI      4
>
>      Message types
>      -------------
>     @@ -470,6 +496,43 @@ Message types
>            The first 6 bytes of the payload contain the mac address of
>     the guest to
>            allow the vhost user backend to construct and broadcast the
>     fake RARP.
>
>     + * VHOST_USER_GET_CONN_ID
>     +
>     +      Id: 20
>     +      Equivalent ioctl: N/A
>     +      Master payload: u64
>     +
>     +      The client sends this message to the server to ask for its
>     connection id.
>
>
> Confusing, please keep the Master/Slave terminology
OK.
>
>     + The connection id is then put into the message header (the
>     conn_id field),
>     +      so that the server can always know who it is talking with.
>     +
>
>
> Could you explain what the connection id is for?

Explained above. Please let me know if I didn't make it clear.

>
> +This request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI
> has...
>
>     +* VHOST_USER_SET_DEV_INFO
>     +
>     +      Id: 21
>     +      Equivalent ioctl: N/A
>     +      Master payload: dev info
>     +
>     +      The client sends the producer device info to the server.
>
>
> "Master sends producer device info to the Slave" works, no?

Yes, it works. The current dev info only contains a "virtio id" field
(assume we'll take uuid out as a separate message), which tells the
slave if it is a net, scsi, console or else. do you see any issue?
>
> Could we guarantee this message is sent before SET_VRING*?

Why do we need to guarantee this?


It would simplify the protocol to have expectations on when messages come.
In particular, an early message with devinfo would allow to
check/pre-configure the Slave for a particular device. Also
VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a device to
be reconfigured)


>
>     + This request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI has
>     +      been negotiated.
>     +
>
>
> I think this message could be useful for other purposes than
> vhost-pci, thus I would give it its own flag.

Could you please give an example of other usage? Thanks.


You could have a Slave that implements various devices, and pick the
corresponding one dynamically (we already have implementations for
net/input/gpu/scsi...)


>
>     +* VHOST_USER_SET_PEER_CONNECTION
>     +
>     +      Id: 22
>     +      Equivalent ioctl: N/A
>     +      Master payload: u64
>     +
>     +      The producer device requests to connect or disconnect to
>     the consumer device.
>
>
> producer->Master, consummer->Slave
>
> How does it interact with SET_VRING_ENABLE?

It's independent of SET_VRING_ENABLE:
SET_VRING_ENABLE enables a virtq to be in "active".
SET_PEER_CONNECTION enables the peer (slave or master) device to be in
"active". The driver shouldn't send packets if the device is inactive.


I fail to see the difference with SET_VRING_ENABLE, perhaps someone more
familiar with the protocol could help here.

>
>     + The consumer device may request to disconnect to the producer
>     device. This
>     +      request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI has been
>     +      negotiated.
>     +      Connection request: If the reply message indicates
>     "success", the vhost-pci based
>     +      inter-VM communication channel has been established.
>     +      Disconnection request: If the reply message indicates
>     "success", the vhost-pci based
>     +      inter-VM communication channel has been destroyed.
>     +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
>     +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
>     +
>
I think it would be better to add one more command here:
#define VHOST_USER_SET_PEER_CONNECTION_F_INIT 2

The master uses this command to tell the slave it's ready to create the
vhost-pci device. Regarding the implementation, it is put at the bottom
of vhost_net_start() function (when all the vring info have been sent
and enabled).


Do you have WIP branch for qemu vhost-pci? That could help to understand
the context.

thanks
Wang, Wei W Nov. 9, 2016, 8:32 a.m. UTC | #7
On 11/08/2016 08:17 PM, Marc-André Lureau wrote:
> >
>
>     >      Message Specification
>     >      ---------------------
>     >
>     >      Note that all numbers are in the machine native byte order. A
>     >     vhost-user message
>     >     -consists of 3 header fields and a payload:
>     >     +consists of 4 header fields and a payload:
>     >
>     >     -------------------------------------
>     >     -| request | flags | size | payload |
>     >     -------------------------------------
>     >     +----------------------------------------------
>     >     +| request | flags | conn_id | size | payload |
>     >     +----------------------------------------------
>     >
>     >       * Request: 32-bit type of the request
>     >       * Flags: 32-bit bit field:
>     >         - Lower 2 bits are the version (currently 0x01)
>     >     -   - Bit 2 is the reply flag - needs to be sent on each reply
>     >     from the slave
>     >     +   - Bit 2 is the reply flag - needs to be sent on each reply
>     >         - Bit 3 is the need_reply flag - see
>     >     VHOST_USER_PROTOCOL_F_REPLY_ACK for
>     >           details.
>     >     + * Conn_id: 64-bit connection id to indentify a client socket
>     >     connection. It is
>     >     +            introduced in version 0x02 to support the
>     >     "1-server-N-client" model
>     >     +            and an asynchronous client read implementation. The
>     >     connection id,
>     >     +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client
>     >     (e.g. a client who
>     >     +            has not got its connection id from the server
>     in the
>     >     initial talk)
>     >
>     >
>     > I don't understand why you need a connection id, on each message.
>     > What's the purpose? Since the communication is unicast, a single
>     > message should be enough.
>
>     Sure, please let me explain more:
>     The QEMU socket is going to be upgraded to support 1 server socket
>     being
>     connected by multiple client sockets (I've made patches to achieve
>     this). In other words, here, multiple masters will connect to one
>     slave,
>     and the slave creates a vhost-pci device for each master after
>     receiving
>     the necessary message info. The slave needs to know which master it is
>     talking to when receiving a message, as it maintains multiple
>     connections at the same time.
>
>
> You should be able to identify each connection in the slave (as a 
> socket server), without a need for connection id: connected sockets 
> are independent from each others.


Yes, that's doable. But why couldn't we do it from the protocol layer? I 
think it will be easier.

Please check below my thoughts about the implementation if we do it in 
the slave:

The interface for receiving a msg is - tcp_chr_read(QIOChannel *chan, 
GIOCondition cond, void *opaque)

QIOChannel is the one that we can use to identify the master connection 
who sends this msg (the socket server now has an array of QIOChannel, 
ioc[MAX_CLIENTS]). Everytime a msg is received, the tcp_chr_read() needs 
to compare *chan and the ioc[] array, to find out the id (indexed into 
the ioc[]), and passes the id to qemu_chr_be_write(), and all the way 
down to the final slave handler where the msg is parsed and handled. 
This needs modifications to the existing APIs, for example, the 
mentioned qemu_chr_be_write() will need one more parameter, "id". This 
will not be compatible with the existing implementation, because all 
other implementations which invoke qemu_chr_be_write() will need to be 
patched to use the new qemu_chr_be_write(,"id",).



>     >       * Size - 32-bit size of the payload
>     >
>     >
>     >     @@ -97,6 +106,13 @@ Depending on the request type, payload
>     can be:
>     >         log offset: offset from start of supplied file descriptor
>     >             where logging starts (i.e. where guest address 0
>     would be
>     >     logged)
>     >
>     >     +* Device info
>     >     +   --------------------
>     >     +   | virito id | uuid |
>     >     +   --------------------
>     >     +   Virtio id: 16-bit virtio id of the device
>     >     +   UUID: 128-bit UUID to identify the QEMU instance that
>     creates
>     >     the device
>     >     +
>     >
>     >
>     > I wonder if UUID should be a different message.
>     >
>     We can make uuid another message if it has other usages.
>     Do you see any other usages of uuid?
>
>
> Allows to associate data/configuration with a particular VM, in a 
> multi-master/single-slave scenario. But tbh, I don't see how this is 
> necessary, I can imagine solving this differently (having different 
> connection address per vm for ex).


Using connection addresses, how could you know if the two connections 
are from the same VM?


> I would like to understand your use case.


Here is an example of the use case:
VM1 has two master connections (connection X and Y) and VM2 has 1 master 
connection (Z).
X,Y,Z - each has a connection id. But X and Y send the same uuid, uuid1, 
to the slave, and Z sends uuid2 to the slave. In this way, the slave 
know X and Y are the two connections from the same VM, and Z is a 
connection from a different VM.

For connection Y, the vhost-pci device will be created in a way which 
does not need the driver to map the memory, since it has already been 
mapped by device X from the same VM.


>
>
>     >      [ Also see the section on REPLY_ACK protocol extension. ]
>     >
>     >     +Currently, the communication also supports the Slave (server)
>     >     sending messages
>     >     +to the Master (client). Here is a list of them:
>     >     + * VHOST_USER_SET_FEATURES
>     >
>     >     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively
>     request
>     >     to disconnect
>     >     +   with the client)
>     >
>     >
>     > Oh, you are making the communication bidirectional? This is a
>     > fundamental change in the protocol. This may be difficult to
>     implement
>     > in qemu, since the communication in synchronous, a request
>     expects an
>     > immediate reply, if it gets back a request (from the slave) in the
>     > middle, it will fail.
>     >
>
>     Not really.
>     Adding the above two doesn't affect the existing synchronous read()
>     messages (basically, those VHOST_USER_GET_xx messages). Like
>     VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply.
>     Here, we
>     just make the slave capable of actively sending messages to the
>     master.
>
>
> Yes, that's the trouble. At any time the Master may send a request and 
> expects an immediate reply. There is a race of getting a request from 
> the Slave in the middle with your proposed change. I'd rather avoid 
> making the request bidirectionnal if possible. (I proposed a second 
> channel for Slave->Master request in the past: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)


  If the message that the slave got has a different "request" field 
value, it simply drops it and re-read again. The implementation is not 
complex also, please see the change example to vhost_user_get_u64() below:

    if (vhost_user_write(dev, &msg_w, NULL, 0) < 0) {
        return -1;
     }
retry:
     if (vhost_user_read(dev, &msg_r) < 0) {
         return -1;
    }
     if (msg_r.request != msg_w.request)
         goto retry;


On the other side, the slave's request to the master is dropped due to 
the race. This race can be solved in the protocol layer - let the _SET_ 
request ask for an ACK, if no ACK is received, re-sent it. Also, this 
kind of race should be very rare in real usage.


>
>     >
>     > +This request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI
>     > has...
>     >
>     >     +* VHOST_USER_SET_DEV_INFO
>     >     +
>     >     +      Id: 21
>     >     +      Equivalent ioctl: N/A
>     >     +      Master payload: dev info
>     >     +
>     >     +      The client sends the producer device info to the server.
>     >
>     >
>     > "Master sends producer device info to the Slave" works, no?
>
>     Yes, it works. The current dev info only contains a "virtio id" field
>     (assume we'll take uuid out as a separate message), which tells the
>     slave if it is a net, scsi, console or else. do you see any issue?
>     >
>     > Could we guarantee this message is sent before SET_VRING*?
>
>     Why do we need to guarantee this?
>
>
> It would simplify the protocol to have expectations on when messages 
> come. In particular, an early message with devinfo would allow to 
> check/pre-configure the Slave for a particular device. Also 
> VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a 
> device to be reconfigured)


Yes, it is sent in an early age of the vhost-user protocol interaction. 
It's implemented to be sent right after sending the 
VHOST_USER_SET_PROTOCOL_FEATURES msg. On the slave side, when it 
receives SET_DEV_INFO, it pre-configures the device in a table entry (as 
mentioned before, a device will be created from the table entry at a 
later stage of the protocol interaction).

I think it should be the implementation logic, like 
VHOST_USER_SET_PROTOCOL_FEATURES. why do we need to add a guarantee in 
the protocol to specify the order?


>
>
>     >
>     >     + This request should be sent only when
>     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has
>     >     +      been negotiated.
>     >     +
>     >
>     >
>     > I think this message could be useful for other purposes than
>     > vhost-pci, thus I would give it its own flag.
>
>     Could you please give an example of other usage? Thanks.
>
>
> You could have a Slave that implements various devices, and pick the 
> corresponding one dynamically (we already have implementations for 
> net/input/gpu/scsi...)


If I understand the example correctly, the various devices still belongs 
to the vhost-pci series - in the future we would have vhost-pci-net, 
vhost-pci-scsi, vhost-pci-gpu etc. If that's the case, we may still use 
the VHOST_PCI flag.


>
>     >
>     >     +* VHOST_USER_SET_PEER_CONNECTION
>     >     +
>     >     +      Id: 22
>     >     +      Equivalent ioctl: N/A
>     >     +      Master payload: u64
>     >     +
>     >     +      The producer device requests to connect or disconnect to
>     >     the consumer device.
>     >
>     >
>     > producer->Master, consummer->Slave
>     >
>     > How does it interact with SET_VRING_ENABLE?
>
>     It's independent of SET_VRING_ENABLE:
>     SET_VRING_ENABLE enables a virtq to be in "active".
>     SET_PEER_CONNECTION enables the peer (slave or master) device to be in
>     "active". The driver shouldn't send packets if the device is inactive.
>
>
> I fail to see the difference with SET_VRING_ENABLE, perhaps someone 
> more familiar with the protocol could help here.


I'm not sure if  another email explaning this was sent out successfully, 
repost the explanation here:

The SET_PEER_CONNECTION msg is ued to turn "ON/OFF" the (slave or 
master) device connection status. For example, when the master side VM 
wants to turn down, the virtio-net driver sets the virtio-net device's 
PEER_CONNECTION status to "OFF" - before this happens, the virtio-net 
device needs to sync-up with the vhost-pci-net device first, that is, 
sending a VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg to the master. In 
return (not as a synchronous reply, because it has to sync with the 
driver to stop using the slave side resource first), the vhost-pci-net 
device sends VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg to the slave - 
this sets the virtio-net device's PEER_CONNECTION status to "OFF" and 
then the virtio driver is ready to unload. (same for the vhost-pci-net 
driver to unload)

SET_VRING_ENABLE controls the virtq status - the slave should not use 
the virtq if it's not enabled by the master. For example, a device may 
have 4 vitrqs, if vq[0].enabled==0, then the slave should not use vitrq 0.


>     >
>     >     + The consumer device may request to disconnect to the producer
>     >     device. This
>     >     +      request should be sent only when
>     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has been
>     >     +      negotiated.
>     >     +      Connection request: If the reply message indicates
>     >     "success", the vhost-pci based
>     >     +      inter-VM communication channel has been established.
>     >     +      Disconnection request: If the reply message indicates
>     >     "success", the vhost-pci based
>     >     +      inter-VM communication channel has been destroyed.
>     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
>     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
>     >     +
>     >
>     I think it would be better to add one more command here:
>     #define VHOST_USER_SET_PEER_CONNECTION_F_INIT 2
>
>     The master uses this command to tell the slave it's ready to
>     create the
>     vhost-pci device. Regarding the implementation, it is put at the
>     bottom
>     of vhost_net_start() function (when all the vring info have been sent
>     and enabled).
>
>
> Do you have WIP branch for qemu vhost-pci? That could help to 
> understand the context.


Yes, I can share them.


Best,
Wei
Marc-André Lureau Nov. 10, 2016, 12:26 p.m. UTC | #8
Hi Wei

On Wed, Nov 9, 2016 at 12:32 PM Wei Wang <wei.w.wang@intel.com> wrote:

> On 11/08/2016 08:17 PM, Marc-André Lureau wrote:
> > >
> >
> >     >      Message Specification
> >     >      ---------------------
> >     >
> >     >      Note that all numbers are in the machine native byte order. A
> >     >     vhost-user message
> >     >     -consists of 3 header fields and a payload:
> >     >     +consists of 4 header fields and a payload:
> >     >
> >     >     -------------------------------------
> >     >     -| request | flags | size | payload |
> >     >     -------------------------------------
> >     >     +----------------------------------------------
> >     >     +| request | flags | conn_id | size | payload |
> >     >     +----------------------------------------------
> >     >
> >     >       * Request: 32-bit type of the request
> >     >       * Flags: 32-bit bit field:
> >     >         - Lower 2 bits are the version (currently 0x01)
> >     >     -   - Bit 2 is the reply flag - needs to be sent on each reply
> >     >     from the slave
> >     >     +   - Bit 2 is the reply flag - needs to be sent on each reply
> >     >         - Bit 3 is the need_reply flag - see
> >     >     VHOST_USER_PROTOCOL_F_REPLY_ACK for
> >     >           details.
> >     >     + * Conn_id: 64-bit connection id to indentify a client socket
> >     >     connection. It is
> >     >     +            introduced in version 0x02 to support the
> >     >     "1-server-N-client" model
> >     >     +            and an asynchronous client read implementation.
> The
> >     >     connection id,
> >     >     +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client
> >     >     (e.g. a client who
> >     >     +            has not got its connection id from the server
> >     in the
> >     >     initial talk)
> >     >
> >     >
> >     > I don't understand why you need a connection id, on each message.
> >     > What's the purpose? Since the communication is unicast, a single
> >     > message should be enough.
> >
> >     Sure, please let me explain more:
> >     The QEMU socket is going to be upgraded to support 1 server socket
> >     being
> >     connected by multiple client sockets (I've made patches to achieve
> >     this). In other words, here, multiple masters will connect to one
> >     slave,
> >     and the slave creates a vhost-pci device for each master after
> >     receiving
>

Before handling hotplugging and multiple vhost-pci, I think it would be
simpler to support 1-1.


> >     the necessary message info. The slave needs to know which master it
> is
> >     talking to when receiving a message, as it maintains multiple
> >     connections at the same time.
> >
> >
> > You should be able to identify each connection in the slave (as a
> > socket server), without a need for connection id: connected sockets
> > are independent from each others.
>
>
> Yes, that's doable. But why couldn't we do it from the protocol layer? I
> think it will be easier.
>

Since it can be done by implementation, I would rather not change the
protocol with unnecessary data.


>
> Please check below my thoughts about the implementation if we do it in
> the slave:
>
> The interface for receiving a msg is - tcp_chr_read(QIOChannel *chan,
> GIOCondition cond, void *opaque)
>
> QIOChannel is the one that we can use to identify the master connection
> who sends this msg (the socket server now has an array of QIOChannel,
> ioc[MAX_CLIENTS]). Everytime a msg is received, the tcp_chr_read() needs
> to compare *chan and the ioc[] array, to find out the id (indexed into
> the ioc[]), and passes the id to qemu_chr_be_write(), and all the way
> down to the final slave handler where the msg is parsed and handled.
> This needs modifications to the existing APIs, for example, the
> mentioned qemu_chr_be_write() will need one more parameter, "id". This
> will not be compatible with the existing implementation, because all
> other implementations which invoke qemu_chr_be_write() will need to be
> patched to use the new qemu_chr_be_write(,"id",).
>

We could have the "id" set with the CharBackend. But again, it's an
implementation detail, and I would rather treat it as a seperate enhacement.


>
>
>
> >     >       * Size - 32-bit size of the payload
> >     >
> >     >
> >     >     @@ -97,6 +106,13 @@ Depending on the request type, payload
> >     can be:
> >     >         log offset: offset from start of supplied file descriptor
> >     >             where logging starts (i.e. where guest address 0
> >     would be
> >     >     logged)
> >     >
> >     >     +* Device info
> >     >     +   --------------------
> >     >     +   | virito id | uuid |
> >     >     +   --------------------
> >     >     +   Virtio id: 16-bit virtio id of the device
> >     >     +   UUID: 128-bit UUID to identify the QEMU instance that
> >     creates
> >     >     the device
> >     >     +
> >     >
> >     >
> >     > I wonder if UUID should be a different message.
> >     >
> >     We can make uuid another message if it has other usages.
> >     Do you see any other usages of uuid?
> >
> >
> > Allows to associate data/configuration with a particular VM, in a
> > multi-master/single-slave scenario. But tbh, I don't see how this is
> > necessary, I can imagine solving this differently (having different
> > connection address per vm for ex).
>
>
> Using connection addresses, how could you know if the two connections
> are from the same VM?
>
>
Each VM could have a private connection address/server, that the management
layer sets up. Note that the same "Slave" process could have multiple
servers. Iow, you could have -chardev socket,id=c1,server,.. -chardev
socket,id=c2,server,... -device vhost-pci,chardev=c1 device
vhost-pci,chardev=c2 (this is somewhat similar to usbredir): with this
setup, qemu could implicitely starts a singleton vhost-user "Slave" server
with two listening sockets c1&c1, or it could start a server for each
chardev.


> > I would like to understand your use case.
>
>
> Here is an example of the use case:
> VM1 has two master connections (connection X and Y) and VM2 has 1 master
> connection (Z).
> X,Y,Z - each has a connection id. But X and Y send the same uuid, uuid1,
> to the slave, and Z sends uuid2 to the slave. In this way, the slave
> know X and Y are the two connections from the same VM, and Z is a
> connection from a different VM.
>
> For connection Y, the vhost-pci device will be created in a way which
> does not need the driver to map the memory, since it has already been
> mapped by device X from the same VM.
>
>
>
For now, I would ignore the fact that the same VM could be the Master for
both c1 & c2 (in the ex above). Is passing uuid necessary and enough to
avoid duplicated mmap? is this really an issue after all? it's only virtual
memory.


> >
> >
> >     >      [ Also see the section on REPLY_ACK protocol extension. ]
> >     >
> >     >     +Currently, the communication also supports the Slave (server)
> >     >     sending messages
> >     >     +to the Master (client). Here is a list of them:
> >     >     + * VHOST_USER_SET_FEATURES
> >     >
> >     >     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively
> >     request
> >     >     to disconnect
> >     >     +   with the client)
> >     >
> >     >
> >     > Oh, you are making the communication bidirectional? This is a
> >     > fundamental change in the protocol. This may be difficult to
> >     implement
> >     > in qemu, since the communication in synchronous, a request
> >     expects an
> >     > immediate reply, if it gets back a request (from the slave) in the
> >     > middle, it will fail.
> >     >
> >
> >     Not really.
> >     Adding the above two doesn't affect the existing synchronous read()
> >     messages (basically, those VHOST_USER_GET_xx messages). Like
> >     VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply.
> >     Here, we
> >     just make the slave capable of actively sending messages to the
> >     master.
> >
> >
> > Yes, that's the trouble. At any time the Master may send a request and
> > expects an immediate reply. There is a race of getting a request from
> > the Slave in the middle with your proposed change. I'd rather avoid
> > making the request bidirectionnal if possible. (I proposed a second
> > channel for Slave->Master request in the past:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)
>
>
>   If the message that the slave got has a different "request" field
> value, it simply drops it and re-read again. The implementation is not
> complex also, please see the change example to vhost_user_get_u64() below:
>
>     if (vhost_user_write(dev, &msg_w, NULL, 0) < 0) {
>         return -1;
>      }
> retry:
>      if (vhost_user_read(dev, &msg_r) < 0) {
>          return -1;
>     }
>      if (msg_r.request != msg_w.request)
>          goto retry;
>
>
> On the other side, the slave's request to the master is dropped due to
> the race. This race can be solved in the protocol layer - let the _SET_
> request ask for an ACK, if no ACK is received, re-sent it. Also, this
> kind of race should be very rare in real usage.
>
>
Ok, that's an option, not very elegant imho, but it could work. I would be
afraid of deadlocks if both side apply the same retry logic..


>
> >
> >     >
> >     > +This request should be sent only when
> >     VHOST_USER_PROTOCOL_F_VHOST_PCI
> >     > has...
> >     >
> >     >     +* VHOST_USER_SET_DEV_INFO
> >     >     +
> >     >     +      Id: 21
> >     >     +      Equivalent ioctl: N/A
> >     >     +      Master payload: dev info
> >     >     +
> >     >     +      The client sends the producer device info to the server.
> >     >
> >     >
> >     > "Master sends producer device info to the Slave" works, no?
> >
> >     Yes, it works. The current dev info only contains a "virtio id" field
> >     (assume we'll take uuid out as a separate message), which tells the
> >     slave if it is a net, scsi, console or else. do you see any issue?
> >     >
> >     > Could we guarantee this message is sent before SET_VRING*?
> >
> >     Why do we need to guarantee this?
> >
> >
> > It would simplify the protocol to have expectations on when messages
> > come. In particular, an early message with devinfo would allow to
> > check/pre-configure the Slave for a particular device. Also
> > VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a
> > device to be reconfigured)
>
>
> Yes, it is sent in an early age of the vhost-user protocol interaction.
> It's implemented to be sent right after sending the
> VHOST_USER_SET_PROTOCOL_FEATURES msg. On the slave side, when it
> receives SET_DEV_INFO, it pre-configures the device in a table entry (as
> mentioned before, a device will be created from the table entry at a
> later stage of the protocol interaction).
>
> I think it should be the implementation logic, like
> VHOST_USER_SET_PROTOCOL_FEATURES. why do we need to add a guarantee in
> the protocol to specify the order?
>

It would help to simplify the possible protocol states, helping
implementation, testing and documentation. I think the protocol is too
liberal already, it would help to specify the various states of Master and
Slave and what messages are permitted in each state. This is out of scope
here, but a worthy goal for a v2 of the protocol.



> >
> >     >
> >     >     + This request should be sent only when
> >     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has
> >     >     +      been negotiated.
> >     >     +
> >     >
> >     >
> >     > I think this message could be useful for other purposes than
> >     > vhost-pci, thus I would give it its own flag.
> >
> >     Could you please give an example of other usage? Thanks.
> >
> >
> > You could have a Slave that implements various devices, and pick the
> > corresponding one dynamically (we already have implementations for
> > net/input/gpu/scsi...)
>
>
> If I understand the example correctly, the various devices still belongs
> to the vhost-pci series - in the future we would have vhost-pci-net,
> vhost-pci-scsi, vhost-pci-gpu etc. If that's the case, we may still use
> the VHOST_PCI flag.
>
>
My point is that those messages are not specific to vhost-pci, they can be
useful for vhost-user Slave today.

That brings me to a related question: do you need to have different
vhost-pci devices, or can we make it generic enough that it could mirror
various device kinds dynamically. As a first step, I think it would be
simpler to focus on a specific device, like vhost-pci-net, like you are
doing in your qemu series.


>
> >
> >     >
> >     >     +* VHOST_USER_SET_PEER_CONNECTION
> >     >     +
> >     >     +      Id: 22
> >     >     +      Equivalent ioctl: N/A
> >     >     +      Master payload: u64
> >     >     +
> >     >     +      The producer device requests to connect or disconnect to
> >     >     the consumer device.
> >     >
> >     >
> >     > producer->Master, consummer->Slave
> >     >
> >     > How does it interact with SET_VRING_ENABLE?
> >
> >     It's independent of SET_VRING_ENABLE:
> >     SET_VRING_ENABLE enables a virtq to be in "active".
> >     SET_PEER_CONNECTION enables the peer (slave or master) device to be
> in
> >     "active". The driver shouldn't send packets if the device is
> inactive.
> >
> >
> > I fail to see the difference with SET_VRING_ENABLE, perhaps someone
> > more familiar with the protocol could help here.
>
>
> I'm not sure if  another email explaning this was sent out successfully,
> repost the explanation here:
>
> The SET_PEER_CONNECTION msg is ued to turn "ON/OFF" the (slave or
> master) device connection status. For example, when the master side VM
> wants to turn down, the virtio-net driver sets the virtio-net device's
> PEER_CONNECTION status to "OFF" - before this happens, the virtio-net
> device needs to sync-up with the vhost-pci-net device first, that is,
> sending a VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg to the master. In
> return (not as a synchronous reply, because it has to sync with the
> driver to stop using the slave side resource first), the vhost-pci-net
> device sends VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg to the slave -
> this sets the virtio-net device's PEER_CONNECTION status to "OFF" and
> then the virtio driver is ready to unload. (same for the vhost-pci-net
> driver to unload)
>
> SET_VRING_ENABLE controls the virtq status - the slave should not use
> the virtq if it's not enabled by the master. For example, a device may
> have 4 vitrqs, if vq[0].enabled==0, then the slave should not use vitrq 0.
>
>
> >     >
> >     >     + The consumer device may request to disconnect to the producer
> >     >     device. This
> >     >     +      request should be sent only when
> >     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has been
> >     >     +      negotiated.
> >     >     +      Connection request: If the reply message indicates
> >     >     "success", the vhost-pci based
> >     >     +      inter-VM communication channel has been established.
> >     >     +      Disconnection request: If the reply message indicates
> >     >     "success", the vhost-pci based
> >     >     +      inter-VM communication channel has been destroyed.
> >     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
> >     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
> >     >     +
> >     >
> >     I think it would be better to add one more command here:
> >     #define VHOST_USER_SET_PEER_CONNECTION_F_INIT 2
> >
> >     The master uses this command to tell the slave it's ready to
> >     create the
> >     vhost-pci device. Regarding the implementation, it is put at the
> >     bottom
> >     of vhost_net_start() function (when all the vring info have been sent
> >     and enabled).
> >
> >
> > Do you have WIP branch for qemu vhost-pci? That could help to
> > understand the context.
>
>
> Yes, I can share them.
>
>
thanks for sharing, I left some more comments there.
Wang, Wei W Nov. 11, 2016, 8:24 a.m. UTC | #9
Thanks for your comments, Marc-André.

On 11/10/2016 08:26 PM, Marc-André Lureau wrote:
> Hi Wei
>
> On Wed, Nov 9, 2016 at 12:32 PM Wei Wang <wei.w.wang@intel.com 
> <mailto:wei.w.wang@intel.com>> wrote:
>
>     On 11/08/2016 08:17 PM, Marc-André Lureau wrote:
>     > >
>     >
>     >     >      Message Specification
>     >     >      ---------------------
>     >     >
>     >     >      Note that all numbers are in the machine native byte
>     order. A
>     >     >     vhost-user message
>     >     >     -consists of 3 header fields and a payload:
>     >     >     +consists of 4 header fields and a payload:
>     >     >
>     >     >     -------------------------------------
>     >     >     -| request | flags | size | payload |
>     >     >     -------------------------------------
>     >     >  +----------------------------------------------
>     >     >     +| request | flags | conn_id | size | payload |
>     >     >  +----------------------------------------------
>     >     >
>     >     >       * Request: 32-bit type of the request
>     >     >       * Flags: 32-bit bit field:
>     >     >         - Lower 2 bits are the version (currently 0x01)
>     >     >     -   - Bit 2 is the reply flag - needs to be sent on
>     each reply
>     >     >     from the slave
>     >     >     +   - Bit 2 is the reply flag - needs to be sent on
>     each reply
>     >     >         - Bit 3 is the need_reply flag - see
>     >     >     VHOST_USER_PROTOCOL_F_REPLY_ACK for
>     >     >           details.
>     >     >     + * Conn_id: 64-bit connection id to indentify a
>     client socket
>     >     >     connection. It is
>     >     >     +            introduced in version 0x02 to support the
>     >     >     "1-server-N-client" model
>     >     >     +            and an asynchronous client read
>     implementation. The
>     >     >     connection id,
>     >     >     +            0xFFFFFFFFFFFFFFFF, is used by an
>     anonymous client
>     >     >     (e.g. a client who
>     >     >     +            has not got its connection id from the server
>     >     in the
>     >     >     initial talk)
>     >     >
>     >     >
>     >     > I don't understand why you need a connection id, on each
>     message.
>     >     > What's the purpose? Since the communication is unicast, a
>     single
>     >     > message should be enough.
>     >
>     >     Sure, please let me explain more:
>     >     The QEMU socket is going to be upgraded to support 1 server
>     socket
>     >     being
>     >     connected by multiple client sockets (I've made patches to
>     achieve
>     >     this). In other words, here, multiple masters will connect
>     to one
>     >     slave,
>     >     and the slave creates a vhost-pci device for each master after
>     >     receiving
>
>
> Before handling hotplugging and multiple vhost-pci, I think it would 
> be simpler to support 1-1.


Yes, we can start from 1-1 (server-client model), but I think we still 
need to use the hotplugging method for the following reasons:

1) the vhost-pci-net device (slave) is created based on the 
configuration info from the master VM - for example, the vhost-pci-net 
BAR size is set according to the memory size of the master VM. 
Typically, the master VM gets booted later than the slave VM, since the 
qemu client socket (master) needs to connect to the server socket 
(slave). So, I think we should let the socket server manages the device 
(creation/hotplug/destruction/). Instead of creating multiple 
vhost-pci-net devices, each server socket manages only one vhost-pci 
device for now. We can boot a VM like this:
-chardev socket,id=vp-server1,server,path=/opt/vp-server1
-chardev socket,id=vp-server2,server,path=/opt/vp-server2
-vhost-pci-server socket,chrdev=vp-server1,chrdev=vp-server2

2) The socket  manages the device instance, rather than the device owns 
the socket. This is flexible for the connection setup - we can also add 
qmp commands to support the setup of a socket connection manually. For 
example, we have VM1 booted with the above commands, and VM2 gets booted 
without connecting to VM1. We still have a chance to let the admin to 
use qmp commands to set up the connection, and a vhost-pci-net device is 
then created and hotplugged to the VM.

3) Considering that we will have the "1-server-N-client" version QEMU 
socket for the next step, I think this will be easy to upgrade.

What do you think?


>     >     the necessary message info. The slave needs to know which
>     master it is
>     >     talking to when receiving a message, as it maintains multiple
>     >     connections at the same time.
>     >
>     >
>     > You should be able to identify each connection in the slave (as a
>     > socket server), without a need for connection id: connected sockets
>     > are independent from each others.
>
>
>     Yes, that's doable. But why couldn't we do it from the protocol
>     layer? I
>     think it will be easier.
>
>
> Since it can be done by implementation, I would rather not change the 
> protocol with unnecessary data.
>
>
>     Please check below my thoughts about the implementation if we do it in
>     the slave:
>
>     The interface for receiving a msg is - tcp_chr_read(QIOChannel *chan,
>     GIOCondition cond, void *opaque)
>
>     QIOChannel is the one that we can use to identify the master
>     connection
>     who sends this msg (the socket server now has an array of QIOChannel,
>     ioc[MAX_CLIENTS]). Everytime a msg is received, the tcp_chr_read()
>     needs
>     to compare *chan and the ioc[] array, to find out the id (indexed into
>     the ioc[]), and passes the id to qemu_chr_be_write(), and all the way
>     down to the final slave handler where the msg is parsed and handled.
>     This needs modifications to the existing APIs, for example, the
>     mentioned qemu_chr_be_write() will need one more parameter, "id". This
>     will not be compatible with the existing implementation, because all
>     other implementations which invoke qemu_chr_be_write() will need to be
>     patched to use the new qemu_chr_be_write(,"id",).
>
>
> We could have the "id" set with the CharBackend. But again, it's an 
> implementation detail, and I would rather treat it as a seperate 
> enhacement.
>

For this step, we don't need the connection id, so I think we can 
discuss more about it when we get to the next step :)


>
>
>
>     >     >       * Size - 32-bit size of the payload
>     >     >
>     >     >
>     >     >     @@ -97,6 +106,13 @@ Depending on the request type, payload
>     >     can be:
>     >     >         log offset: offset from start of supplied file
>     descriptor
>     >     >             where logging starts (i.e. where guest address 0
>     >     would be
>     >     >     logged)
>     >     >
>     >     >     +* Device info
>     >     >     +   --------------------
>     >     >     +   | virito id | uuid |
>     >     >     +   --------------------
>     >     >     +   Virtio id: 16-bit virtio id of the device
>     >     >     +   UUID: 128-bit UUID to identify the QEMU instance that
>     >     creates
>     >     >     the device
>     >     >     +
>     >     >
>     >     >
>     >     > I wonder if UUID should be a different message.
>     >     >
>     >     We can make uuid another message if it has other usages.
>     >     Do you see any other usages of uuid?
>     >
>     >
>     > Allows to associate data/configuration with a particular VM, in a
>     > multi-master/single-slave scenario. But tbh, I don't see how this is
>     > necessary, I can imagine solving this differently (having different
>     > connection address per vm for ex).
>
>
>     Using connection addresses, how could you know if the two connections
>     are from the same VM?
>
>
> Each VM could have a private connection address/server, that the 
> management layer sets up. Note that the same "Slave" process could 
> have multiple servers. Iow, you could have -chardev 
> socket,id=c1,server,.. -chardev socket,id=c2,server,... -device 
> vhost-pci,chardev=c1 device vhost-pci,chardev=c2 (this is somewhat 
> similar to usbredir): with this setup, qemu could implicitely starts a 
> singleton vhost-user "Slave" server with two listening sockets c1&c1, 
> or it could start a server for each chardev.
>

   I think UUID is not useful for the first version of vhost-pci-net, so 
we can probably drop it for now.

>
>     > I would like to understand your use case.
>
>
>     Here is an example of the use case:
>     VM1 has two master connections (connection X and Y) and VM2 has 1
>     master
>     connection (Z).
>     X,Y,Z - each has a connection id. But X and Y send the same uuid,
>     uuid1,
>     to the slave, and Z sends uuid2 to the slave. In this way, the slave
>     know X and Y are the two connections from the same VM, and Z is a
>     connection from a different VM.
>
>     For connection Y, the vhost-pci device will be created in a way which
>     does not need the driver to map the memory, since it has already been
>     mapped by device X from the same VM.
>
>
>
> For now, I would ignore the fact that the same VM could be the Master 
> for both c1 & c2 (in the ex above). Is passing uuid necessary and 
> enough to avoid duplicated mmap? is this really an issue after all? 
> it's only virtual memory.


I will think more about it. We may discuss it in the next step as well. 
Thanks.


>     >
>     >     >      [ Also see the section on REPLY_ACK protocol extension. ]
>     >     >
>     >     >     +Currently, the communication also supports the Slave
>     (server)
>     >     >     sending messages
>     >     >     +to the Master (client). Here is a list of them:
>     >     >     + * VHOST_USER_SET_FEATURES
>     >     >
>     >     >     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively
>     >     request
>     >     >     to disconnect
>     >     >     +   with the client)
>     >     >
>     >     >
>     >     > Oh, you are making the communication bidirectional? This is a
>     >     > fundamental change in the protocol. This may be difficult to
>     >     implement
>     >     > in qemu, since the communication in synchronous, a request
>     >     expects an
>     >     > immediate reply, if it gets back a request (from the
>     slave) in the
>     >     > middle, it will fail.
>     >     >
>     >
>     >     Not really.
>     >     Adding the above two doesn't affect the existing synchronous
>     read()
>     >     messages (basically, those VHOST_USER_GET_xx messages). Like
>     >     VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply.
>     >     Here, we
>     >     just make the slave capable of actively sending messages to the
>     >     master.
>     >
>     >
>     > Yes, that's the trouble. At any time the Master may send a
>     request and
>     > expects an immediate reply. There is a race of getting a request
>     from
>     > the Slave in the middle with your proposed change. I'd rather avoid
>     > making the request bidirectionnal if possible. (I proposed a second
>     > channel for Slave->Master request in the past:
>     > https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)
>
>
>       If the message that the slave got has a different "request" field
>     value, it simply drops it and re-read again. The implementation is not
>     complex also, please see the change example to
>     vhost_user_get_u64() below:
>
>         if (vhost_user_write(dev, &msg_w, NULL, 0) < 0) {
>             return -1;
>          }
>     retry:
>          if (vhost_user_read(dev, &msg_r) < 0) {
>              return -1;
>         }
>          if (msg_r.request != msg_w.request)
>              goto retry;
>
>
>     On the other side, the slave's request to the master is dropped due to
>     the race. This race can be solved in the protocol layer - let the
>     _SET_
>     request ask for an ACK, if no ACK is received, re-sent it. Also, this
>     kind of race should be very rare in real usage.
>
>
> Ok, that's an option, not very elegant imho, but it could work. I 
> would be afraid of deadlocks if both side apply the same retry logic..
>

It should be a traditional problem. If necessary, I think we can insert 
a delay using binary exponential backoff before "goto retry".

>
>     >
>     >     >
>     >     > +This request should be sent only when
>     >     VHOST_USER_PROTOCOL_F_VHOST_PCI
>     >     > has...
>     >     >
>     >     >     +* VHOST_USER_SET_DEV_INFO
>     >     >     +
>     >     >     +      Id: 21
>     >     >     +      Equivalent ioctl: N/A
>     >     >     +      Master payload: dev info
>     >     >     +
>     >     >     +      The client sends the producer device info to
>     the server.
>     >     >
>     >     >
>     >     > "Master sends producer device info to the Slave" works, no?
>     >
>     >     Yes, it works. The current dev info only contains a "virtio
>     id" field
>     >     (assume we'll take uuid out as a separate message), which
>     tells the
>     >     slave if it is a net, scsi, console or else. do you see any
>     issue?
>     >     >
>     >     > Could we guarantee this message is sent before SET_VRING*?
>     >
>     >     Why do we need to guarantee this?
>     >
>     >
>     > It would simplify the protocol to have expectations on when messages
>     > come. In particular, an early message with devinfo would allow to
>     > check/pre-configure the Slave for a particular device. Also
>     > VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a
>     > device to be reconfigured)
>
>
>     Yes, it is sent in an early age of the vhost-user protocol
>     interaction.
>     It's implemented to be sent right after sending the
>     VHOST_USER_SET_PROTOCOL_FEATURES msg. On the slave side, when it
>     receives SET_DEV_INFO, it pre-configures the device in a table
>     entry (as
>     mentioned before, a device will be created from the table entry at a
>     later stage of the protocol interaction).
>
>     I think it should be the implementation logic, like
>     VHOST_USER_SET_PROTOCOL_FEATURES. why do we need to add a guarantee in
>     the protocol to specify the order?
>
>
> It would help to simplify the possible protocol states, helping 
> implementation, testing and documentation. I think the protocol is too 
> liberal already, it would help to specify the various states of Master 
> and Slave and what messages are permitted in each state. This is out 
> of scope here, but a worthy goal for a v2 of the protocol.
>

OK. btw, have you got a plan what states to add to slave and master? 
Some messages (e.g. SET_VRING_CALL) are sent both in the initial and 
later stage of the protocol interaction.

>     >
>     >     >
>     >     >     + This request should be sent only when
>     >     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has
>     >     >     +      been negotiated.
>     >     >     +
>     >     >
>     >     >
>     >     > I think this message could be useful for other purposes than
>     >     > vhost-pci, thus I would give it its own flag.
>     >
>     >     Could you please give an example of other usage? Thanks.
>     >
>     >
>     > You could have a Slave that implements various devices, and pick the
>     > corresponding one dynamically (we already have implementations for
>     > net/input/gpu/scsi...)
>
>
>     If I understand the example correctly, the various devices still
>     belongs
>     to the vhost-pci series - in the future we would have vhost-pci-net,
>     vhost-pci-scsi, vhost-pci-gpu etc. If that's the case, we may
>     still use
>     the VHOST_PCI flag.
>
>
> My point is that those messages are not specific to vhost-pci, they 
> can be useful for vhost-user Slave today.
>
> That brings me to a related question: do you need to have different 
> vhost-pci devices, or can we make it generic enough that it could 
> mirror various device kinds dynamically. As a first step, I think it 
> would be simpler to focus on a specific device, like vhost-pci-net, 
> like you are doing in your qemu series.


Ok, let's start from the simpler case. Please check another email for 
the 2-step plan. Thanks.

Best,
Wei
Michael S. Tsirkin Nov. 11, 2016, 5:12 p.m. UTC | #10
On Tue, Nov 08, 2016 at 12:17:38PM +0000, Marc-André Lureau wrote:
> Yes, that's the trouble. At any time the Master may send a request and expects
> an immediate reply. There is a race of getting a request from the Slave in the
> middle with your proposed change. I'd rather avoid making the request
> bidirectionnal if possible. (I proposed a second channel for Slave->Master
> request in the past: https://lists.gnu.org/archive/html/qemu-devel/2016-04/
> msg00095.html)

I don't have an issue with this, but supporting commands from
slave on the same channel shouldn't be too hard too,
and it's less work for the management.
diff mbox

Patch

diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
index 7890d71..173f693 100644
--- a/docs/specs/vhost-user.txt
+++ b/docs/specs/vhost-user.txt
@@ -17,28 +17,37 @@  The protocol defines 2 sides of the communication, master and slave. Master is
 the application that shares its virtqueues, in our case QEMU. Slave is the
 consumer of the virtqueues.
 
-In the current implementation QEMU is the Master, and the Slave is intended to
+In the traditional implementation QEMU is the Master, and the Slave is intended to
 be a software Ethernet switch running in user space, such as Snabbswitch.
 
 Master and slave can be either a client (i.e. connecting) or server (listening)
 in the socket communication.
 
+The current vhost-user protocol is extended to support the vhost-pci based inter-VM
+communication. In this case, Slave is a QEMU which runs a vhost-pci server, and
+Master is another QEMU which runs a vhost-pci client.
+
 Message Specification
 ---------------------
 
 Note that all numbers are in the machine native byte order. A vhost-user message
-consists of 3 header fields and a payload:
+consists of 4 header fields and a payload:
 
-------------------------------------
-| request | flags | size | payload |
-------------------------------------
+----------------------------------------------
+| request | flags | conn_id | size | payload |
+----------------------------------------------
 
  * Request: 32-bit type of the request
  * Flags: 32-bit bit field:
    - Lower 2 bits are the version (currently 0x01)
-   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
+   - Bit 2 is the reply flag - needs to be sent on each reply
    - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
      details.
+ * Conn_id: 64-bit connection id to indentify a client socket connection. It is
+            introduced in version 0x02 to support the "1-server-N-client" model
+            and an asynchronous client read implementation. The connection id,
+            0xFFFFFFFFFFFFFFFF, is used by an anonymous client (e.g. a client who
+            has not got its connection id from the server in the initial talk)
  * Size - 32-bit size of the payload
 
 
@@ -97,6 +106,13 @@  Depending on the request type, payload can be:
    log offset: offset from start of supplied file descriptor
        where logging starts (i.e. where guest address 0 would be logged)
 
+* Device info
+   --------------------
+   | virito id | uuid |
+   --------------------
+   Virtio id: 16-bit virtio id of the device
+   UUID: 128-bit UUID to identify the QEMU instance that creates the device
+
 In QEMU the vhost-user message is implemented with the following struct:
 
 typedef struct VhostUserMsg {
@@ -109,6 +125,7 @@  typedef struct VhostUserMsg {
         struct vhost_vring_addr addr;
         VhostUserMemory memory;
         VhostUserLog log;
+        DeviceInfo dev_info;
     };
 } QEMU_PACKED VhostUserMsg;
 
@@ -119,17 +136,25 @@  The protocol for vhost-user is based on the existing implementation of vhost
 for the Linux Kernel. Most messages that can be sent via the Unix domain socket
 implementing vhost-user have an equivalent ioctl to the kernel implementation.
 
-The communication consists of master sending message requests and slave sending
-message replies. Most of the requests don't require replies. Here is a list of
-the ones that do:
+Traditionally, the communication consists of master sending message requests
+and slave sending message replies. Most of the requests don't require replies.
+Here is a list of the ones that do:
 
  * VHOST_GET_FEATURES
  * VHOST_GET_PROTOCOL_FEATURES
  * VHOST_GET_VRING_BASE
  * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+ * VHOST_USER_GET_CONN_ID
+ * VHOST_USER_SET_PEER_CONNECTION
 
 [ Also see the section on REPLY_ACK protocol extension. ]
 
+Currently, the communication also supports the Slave (server) sending messages
+to the Master (client). Here is a list of them:
+ * VHOST_USER_SET_FEATURES
+ * VHOST_USER_SET_PEER_CONNECTION (the serve may actively request to disconnect
+   with the client)
+
 There are several messages that the master sends with file descriptors passed
 in the ancillary data:
 
@@ -259,6 +284,7 @@  Protocol features
 #define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
 #define VHOST_USER_PROTOCOL_F_RARP           2
 #define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
+#define VHOST_USER_PROTOCOL_F_VHOST_PCI      4
 
 Message types
 -------------
@@ -470,6 +496,43 @@  Message types
       The first 6 bytes of the payload contain the mac address of the guest to
       allow the vhost user backend to construct and broadcast the fake RARP.
 
+ * VHOST_USER_GET_CONN_ID
+
+      Id: 20
+      Equivalent ioctl: N/A
+      Master payload: u64
+
+      The client sends this message to the server to ask for its connection id.
+      The connection id is then put into the message header (the conn_id field),
+      so that the server can always know who it is talking with.
+
+* VHOST_USER_SET_DEV_INFO
+
+      Id: 21
+      Equivalent ioctl: N/A
+      Master payload: dev info
+
+      The client sends the producer device info to the server.
+      This request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI has
+      been negotiated.
+
+* VHOST_USER_SET_PEER_CONNECTION
+
+      Id: 22
+      Equivalent ioctl: N/A
+      Master payload: u64
+
+      The producer device requests to connect or disconnect to the consumer device.
+      The consumer device may request to disconnect to the producer device. This
+      request should be sent only when VHOST_USER_PROTOCOL_F_VHOST_PCI has been
+      negotiated.
+      Connection request: If the reply message indicates "success", the vhost-pci based
+      inter-VM communication channel has been established.
+      Disconnection request: If the reply message indicates "success", the vhost-pci based
+      inter-VM communication channel has been destroyed.
+      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
+      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
+
 VHOST_USER_PROTOCOL_F_REPLY_ACK:
 -------------------------------
 The original vhost-user specification only demands replies for certain