Patchwork [v4,4/9] rdma/cm: Update port reservation to support AF_IB

login
register
mail settings
Submitter Sean Hefty
Date Jan. 22, 2013, 9:56 p.m.
Message ID <1358891797-14625-5-git-send-email-sean.hefty@intel.com>
Download mbox | patch
Permalink /patch/214671/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Sean Hefty - Jan. 22, 2013, 9:56 p.m.
From: Sean Hefty <sean.hefty@intel.com>

The AF_IB uses a 64-bit service id (SID), which the
user can control through the use of a mask.  The rdma_cm
will assign values to the unmasked portions of the SID
based on the selected port space and port number.

Because the IB spec divides the SID range into several regions,
a SID/mask combination may fall into one of the existing
port space ranges as defined by the RDMA CM IP Annex.  Map
the AF_IB SID to the correct RDMA port space.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---
 drivers/infiniband/core/cma.c |  107 +++++++++++++++++++++++++++++++++--------
 include/rdma/rdma_cm.h        |    5 ++
 2 files changed, 91 insertions(+), 21 deletions(-)
Or Gerlitz - Feb. 13, 2013, 11:09 a.m.
On 22/01/2013 23:56, sean.hefty@intel.com wrote:
> The AF_IB uses a 64-bit service id (SID), which the
> user can control through the use of a mask.  The rdma_cm
> will assign values to the unmasked portions of the SID
> based on the selected port space and port number.

Something here I am not sure to follow -- if AF_IB consumers are allowed 
to provide 64-bit SID, how
the IB port space is expressed in their SIDs?

Wouldn't it make sense to use a simpler approach that has a dedicated 
IB_PS in the port space
portion of the RDMA-CM IP IBTA annex and let the consumer provide 16 
bits for their part of the SID?

Or.


> Because the IB spec divides the SID range into several regions,
> a SID/mask combination may fall into one of the existing
> port space ranges as defined by the RDMA CM IP Annex.  Map
> the AF_IB SID to the correct RDMA port space.

> [...]
> --- a/include/rdma/rdma_cm.h
> +++ b/include/rdma/rdma_cm.h
> @@ -70,6 +70,11 @@ enum rdma_port_space {
>   	RDMA_PS_UDP   = 0x0111,
>   };
>   
> +#define RDMA_IB_IP_PS_MASK   0xFFFFFFFFFFFF0000ULL
> +#define RDMA_IB_IP_PS_TCP    0x0000000001060000ULL
> +#define RDMA_IB_IP_PS_UDP    0x0000000001110000ULL
> +#define RDMA_IB_IP_PS_IB     0x00000000013F0000ULL
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sean Hefty - Feb. 13, 2013, 5:44 p.m.
> > The AF_IB uses a 64-bit service id (SID), which the
> > user can control through the use of a mask.  The rdma_cm
> > will assign values to the unmasked portions of the SID
> > based on the selected port space and port number.
> 
> Something here I am not sure to follow -- if AF_IB consumers are allowed
> to provide 64-bit SID, how
> the IB port space is expressed in their SIDs?

The spec, notably the RDMA CM Annex, provides fixed values for the upper 48-bits of the service ID.  Only the lower 16-bits are available to users.  The upper bits end up identifying the various 'port spaces'. 

(Note that there's no actual association between an IB port space and port spaces in the IP network stack.)

> Wouldn't it make sense to use a simpler approach that has a dedicated
> IB_PS in the port space
> portion of the RDMA-CM IP IBTA annex and let the consumer provide 16
> bits for their part of the SID?

Ultimately, IB establishes communication using 64-bit service IDs, which is what sockaddr_ib exposes.

The SID mask was proposed by Jason as a way to simplify things.  Conceptually, AF_IB has a single port space range, the 64-bit SID.  The mask helps us break the SID into the various ranges, so we can clearly know which parts of the SID are fixed, and which can be variable.  E.g. when connecting, the entire SID must be provided.  However, when binding we want to support binding to a specific SID or only to a SID range (i.e. port space).

(I'm having trouble finding the original email threads where this was discussed.)

- Sean
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe - Feb. 13, 2013, 6:10 p.m.
On Wed, Feb 13, 2013 at 05:44:58PM +0000, Hefty, Sean wrote:

> Ultimately, IB establishes communication using 64-bit service IDs, which is what sockaddr_ib exposes.
> 
> The SID mask was proposed by Jason as a way to simplify things.
> Conceptually, AF_IB has a single port space range, the 64-bit SID.
> The mask helps us break the SID into the various ranges, so we can
> clearly know which parts of the SID are fixed, and which can be
> variable.  E.g. when connecting, the entire SID must be provided.
> However, when binding we want to support binding to a specific SID
> or only to a SID range (i.e. port space).

Right. AF_IB is about exposing the native IB CM, rather than a cooked
version of it, so allowing user space to control the entire SID is
desirable.

The intent of the masking was to provide a means for the kernel to
atomically select a process-unique ID (such as the 16 bits the IP
varient requires) for listening.

Other uses of IB CM would use a different prefix, but will still
fundamentally require a machine unique portion.

This addresses a fundamental problem in ib_ucm - there is no way for
multiple processes to co-ordinate on unique SIDs.

Sean, I've lost track of all this over time, but just to check in,
with this new interface, would it be possible to obsolete ib_ucm? That
seems desirable to me, as it is so flawed..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sean Hefty - Feb. 13, 2013, 6:19 p.m.
> Sean, I've lost track of all this over time, but just to check in,
> with this new interface, would it be possible to obsolete ib_ucm? That
> seems desirable to me, as it is so flawed..

I'd love to obsolete ib_ucm.  This interface should make it possible, though some additional implementation would be needed (e.g. for automatic path migration).

I have an immediate need to connect using native IB addresses.  So without this, I'll be forced to use ib_ucm or roll a new user space CM.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe - Feb. 13, 2013, 6:26 p.m.
On Wed, Feb 13, 2013 at 06:19:51PM +0000, Hefty, Sean wrote:
> > Sean, I've lost track of all this over time, but just to check in,
> > with this new interface, would it be possible to obsolete ib_ucm? That
> > seems desirable to me, as it is so flawed..
> 
> I'd love to obsolete ib_ucm.  This interface should make it
> possible, though some additional implementation would be needed
> (e.g. for automatic path migration).

Okay - but there are some nice gaps in the API to cleanly place such
things in future?

A long time ago we discussed having different paths for CM GMP,
inbound data and outbound data (ie the full IB asymetric pathing
model), did that get included in the API as well?

> I have an immediate need to connect using native IB addresses.  So
> without this, I'll be forced to use ib_ucm or roll a new user space
> CM.

Using ucm is alot of trouble. Many sites don't have it setup properly,
don't have proper device node permissions and the issue with unique
IDs ends up being critical..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sean Hefty - Feb. 13, 2013, 6:37 p.m.
> A long time ago we discussed having different paths for CM GMP,
> inbound data and outbound data (ie the full IB asymetric pathing
> model), did that get included in the API as well?

Yes - this support made it upstream.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index e08f830..7eed4ee 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -729,12 +729,22 @@  static int cma_addr_cmp(struct sockaddr *src, struct sockaddr *dst)
 	}
 }
 
-static inline __be16 cma_port(struct sockaddr *addr)
+static __be16 cma_port(struct sockaddr *addr)
 {
-	if (addr->sa_family == AF_INET)
+	struct sockaddr_ib *sib;
+
+	switch (addr->sa_family) {
+	case AF_INET:
 		return ((struct sockaddr_in *) addr)->sin_port;
-	else
+	case AF_INET6:
 		return ((struct sockaddr_in6 *) addr)->sin6_port;
+	case AF_IB:
+		sib = (struct sockaddr_ib *) addr;
+		return htons((u16) (be64_to_cpu(sib->sib_sid) &
+				    be64_to_cpu(sib->sib_sid_mask)));
+	default:
+		return 0;
+	}
 }
 
 static inline int cma_any_port(struct sockaddr *addr)
@@ -2139,10 +2149,29 @@  EXPORT_SYMBOL(rdma_set_afonly);
 static void cma_bind_port(struct rdma_bind_list *bind_list,
 			  struct rdma_id_private *id_priv)
 {
-	struct sockaddr_in *sin;
+	struct sockaddr *addr;
+	struct sockaddr_ib *sib;
+	u64 sid, mask;
+	__be16 port;
 
-	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
-	sin->sin_port = htons(bind_list->port);
+	addr = (struct sockaddr *) &id_priv->id.route.addr.src_addr;
+	port = htons(bind_list->port);
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		((struct sockaddr_in *) addr)->sin_port = port;
+		break;
+	case AF_INET6:
+		((struct sockaddr_in6 *) addr)->sin6_port = port;
+		break;
+	case AF_IB:
+		sib = (struct sockaddr_ib *) addr;
+		sid = be64_to_cpu(sib->sib_sid);
+		mask = be64_to_cpu(sib->sib_sid_mask);
+		sib->sib_sid = cpu_to_be64((sid & mask) | (u64) ntohs(port));
+		sib->sib_sid_mask = cpu_to_be64(~0ULL);
+		break;
+	}
 	id_priv->bind_list = bind_list;
 	hlist_add_head(&id_priv->node, &bind_list->owners);
 }
@@ -2280,31 +2309,67 @@  static int cma_bind_listen(struct rdma_id_private *id_priv)
 	return ret;
 }
 
-static int cma_get_port(struct rdma_id_private *id_priv)
+static struct idr *cma_select_inet_ps(struct rdma_id_private *id_priv)
 {
-	struct idr *ps;
-	int ret;
-
 	switch (id_priv->id.ps) {
 	case RDMA_PS_SDP:
-		ps = &sdp_ps;
-		break;
+		return &sdp_ps;
 	case RDMA_PS_TCP:
-		ps = &tcp_ps;
-		break;
+		return &tcp_ps;
 	case RDMA_PS_UDP:
-		ps = &udp_ps;
-		break;
+		return &udp_ps;
 	case RDMA_PS_IPOIB:
-		ps = &ipoib_ps;
-		break;
+		return &ipoib_ps;
 	case RDMA_PS_IB:
-		ps = &ib_ps;
-		break;
+		return &ib_ps;
 	default:
-		return -EPROTONOSUPPORT;
+		return NULL;
+	}
+}
+
+static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv)
+{
+	struct idr *ps = NULL;
+	struct sockaddr_ib *sib;
+	u64 sid_ps, mask, sid;
+
+	sib = (struct sockaddr_ib *) &id_priv->id.route.addr.src_addr;
+	mask = be64_to_cpu(sib->sib_sid_mask) & RDMA_IB_IP_PS_MASK;
+	sid = be64_to_cpu(sib->sib_sid) & mask;
+
+	if ((id_priv->id.ps == RDMA_PS_IB) && (sid == (RDMA_IB_IP_PS_IB & mask))) {
+		sid_ps = RDMA_IB_IP_PS_IB;
+		ps = &ib_ps;
+	} else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_TCP)) &&
+		   (sid == (RDMA_IB_IP_PS_TCP & mask))) {
+		sid_ps = RDMA_IB_IP_PS_TCP;
+		ps = &tcp_ps;
+	} else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_UDP)) &&
+		   (sid == (RDMA_IB_IP_PS_UDP & mask))) {
+		sid_ps = RDMA_IB_IP_PS_UDP;
+		ps = &udp_ps;
 	}
 
+	if (ps) {
+		sib->sib_sid = cpu_to_be64(sid_ps | ntohs(cma_port((struct sockaddr *) sib)));
+		sib->sib_sid_mask = cpu_to_be64(RDMA_IB_IP_PS_MASK |
+						be64_to_cpu(sib->sib_sid_mask));
+	}
+	return ps;
+}
+
+static int cma_get_port(struct rdma_id_private *id_priv)
+{
+	struct idr *ps;
+	int ret;
+
+	if (id_priv->id.route.addr.src_addr.ss_family != AF_IB)
+		ps = cma_select_inet_ps(id_priv);
+	else
+		ps = cma_select_ib_ps(id_priv);
+	if (!ps)
+		return -EPROTONOSUPPORT;
+
 	mutex_lock(&lock);
 	if (cma_any_port((struct sockaddr *) &id_priv->id.route.addr.src_addr))
 		ret = cma_alloc_any_port(ps, id_priv);
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index ad3a314..1e6c3c7 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -70,6 +70,11 @@  enum rdma_port_space {
 	RDMA_PS_UDP   = 0x0111,
 };
 
+#define RDMA_IB_IP_PS_MASK   0xFFFFFFFFFFFF0000ULL
+#define RDMA_IB_IP_PS_TCP    0x0000000001060000ULL
+#define RDMA_IB_IP_PS_UDP    0x0000000001110000ULL
+#define RDMA_IB_IP_PS_IB     0x00000000013F0000ULL
+
 struct rdma_addr {
 	struct sockaddr_storage src_addr;
 	struct sockaddr_storage dst_addr;