diff mbox

[v1,08/12] IB/cma: Add net_dev and private data checks to RDMA CM

Message ID 1434976961-27424-9-git-send-email-haggaie@mellanox.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Haggai Eran June 22, 2015, 12:42 p.m. UTC
Instead of relying on a the ib_cm module to check an incoming CM request's
private data header, add these checks to the RDMA CM module. This allows a
following patch to to clean up the ib_cm interface and remove the code that
looks into the private headers. It will also allow supporting namespaces in
RDMA CM by making these checks namespace aware later on.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/infiniband/core/cma.c | 170 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 168 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe July 13, 2015, 6:14 p.m. UTC | #1
On Mon, Jun 22, 2015 at 03:42:37PM +0300, Haggai Eran wrote:
> +	switch (ib_event->event) {
> +	case IB_CM_REQ_RECEIVED:
> +		req->device	= req_param->listen_id->device;
> +		req->port	= req_param->port;
> +		req->local_gid	= &req_param->primary_path->sgid;
> +		req->service_id	= req_param->primary_path->service_id;
> +		req->pkey	= be16_to_cpu(req_param->primary_path->pkey);

I feel pretty strongly that we should be using the pkey from the work
completion, not the pkey in the message.

The reason, if someone is using pkey like vlan, and expecting a
container to never receive packets outside the assigned pkey, then we
need to check each and every packet for the correct pkey before
associating it with that container.

When doing the namespace patches you should probably also look at
other CM GMPs than just the REQ and how the paths are setup and
consider what to do with the pkey. I'd probably suggest that the pkey
should be forced throughout the entire process to ensure it always
matches the ip device - at least for containers that is the right
thing.. I probably wouldn't turn it on for the root namespace though..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Haggai Eran July 15, 2015, 10:57 a.m. UTC | #2
On 13/07/2015 21:14, Jason Gunthorpe wrote:
> On Mon, Jun 22, 2015 at 03:42:37PM +0300, Haggai Eran wrote:
>> +	switch (ib_event->event) {
>> +	case IB_CM_REQ_RECEIVED:
>> +		req->device	= req_param->listen_id->device;
>> +		req->port	= req_param->port;
>> +		req->local_gid	= &req_param->primary_path->sgid;
>> +		req->service_id	= req_param->primary_path->service_id;
>> +		req->pkey	= be16_to_cpu(req_param->primary_path->pkey);
> 
> I feel pretty strongly that we should be using the pkey from the work
> completion, not the pkey in the message.
> 
> The reason, if someone is using pkey like vlan, and expecting a
> container to never receive packets outside the assigned pkey, then we
> need to check each and every packet for the correct pkey before
> associating it with that container.

The way I see it is that you have one RDMA CM agent in the system, and
the header parameters address it. This agent allows addressing several
namespaces, and they are de-muxed according to the parameters in the
payload.

So a container never receives a packet outside of its assigned pkeys,
but the pkey you look at (as well as the GID, and possibly the IP
address) all come from the payload.

> When doing the namespace patches you should probably also look at
> other CM GMPs than just the REQ and how the paths are setup and
> consider what to do with the pkey. I'd probably suggest that the pkey
> should be forced throughout the entire process to ensure it always
> matches the ip device - at least for containers that is the right
> thing.. I probably wouldn't turn it on for the root namespace though..

Once a connection has been established, following GMPs use a unique ID
to address this connection, so no more de-muxing is needed. What is
really missing here I guess is a mechanism that would enforce containers
to only use certain pkeys - perhaps with something like an RDMA cgroup.
It could force containers to only use approved pkeys not only with RDMA
CM, but through uverbs, and other user-space interfaces.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 15, 2015, 6:49 p.m. UTC | #3
On Wed, Jul 15, 2015 at 01:57:48PM +0300, Haggai Eran wrote:
> On 13/07/2015 21:14, Jason Gunthorpe wrote:
> > On Mon, Jun 22, 2015 at 03:42:37PM +0300, Haggai Eran wrote:
> >> +	switch (ib_event->event) {
> >> +	case IB_CM_REQ_RECEIVED:
> >> +		req->device	= req_param->listen_id->device;
> >> +		req->port	= req_param->port;
> >> +		req->local_gid	= &req_param->primary_path->sgid;
> >> +		req->service_id	= req_param->primary_path->service_id;
> >> +		req->pkey	= be16_to_cpu(req_param->primary_path->pkey);
> > 
> > I feel pretty strongly that we should be using the pkey from the work
> > completion, not the pkey in the message.
> > 
> > The reason, if someone is using pkey like vlan, and expecting a
> > container to never receive packets outside the assigned pkey, then we
> > need to check each and every packet for the correct pkey before
> > associating it with that container.
> 
> The way I see it is that you have one RDMA CM agent in the system, and
> the header parameters address it. This agent allows addressing several
> namespaces, and they are de-muxed according to the parameters in the
> payload.

That doesn't address my argument at all.

A container should never, ever, process a packet on a pkey that is not
associated with it.

So, it doesn't matter one bit how you decide to demux. After the demux
is done all three pkeys needs to be checked to be compatible with the
container: GMP, Primary Path, Alternate Path.

If any are not compatible the packet *MUST* be dropped.

If you do the above check, you may as well start with the wc's pkey,
since it reflects the logical 'netdevice' that received the packet.

> So a container never receives a packet outside of its assigned pkeys,
> but the pkey you look at (as well as the GID, and possibly the IP
> address) all come from the payload.

That isn't entirey true, at a minimum the above scheme allows an
attacking pkey to create an unbounded number of REQs delivered to
userspace on a victim container, without needing to be part of the
victim's pkey.

> > When doing the namespace patches you should probably also look at
> > other CM GMPs than just the REQ and how the paths are setup and
> > consider what to do with the pkey. I'd probably suggest that the pkey
> > should be forced throughout the entire process to ensure it always
> > matches the ip device - at least for containers that is the right
> > thing.. I probably wouldn't turn it on for the root namespace though..
> 
> Once a connection has been established, following GMPs use a unique ID
> to address this connection, so no more de-muxing is needed.

I think there are open issues here about pkey correctness. Ie the
unique ID is not locked to a pkey, so an attacking pkey can guess the
unique ID and then issue GMPs that will be accepted.

Once locked to a container the unique ID absolutely needs to drop all
packets that are outside the container's pkey space. IMHO this would
be part of the namespace patches.

Basically, pkey in a container should work exactly the same as pkey on
a HCA, packets are dropped before they ever reach the container. This
means every single resource associated with a container much check the
pkey of any packet targeted at that resource.

> What is really missing here I guess is a mechanism that would
> enforce containers to only use certain pkeys - perhaps with
> something like an RDMA cgroup.  It could force containers to only
> use approved pkeys not only with RDMA CM, but through uverbs, and
> other user-space interfaces.

Ideally yes. Without that it just means you can't use pkey
meaningfully with containers..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss July 15, 2015, 8:27 p.m. UTC | #4
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
 
> 
> > What is really missing here I guess is a mechanism that would
> > enforce containers to only use certain pkeys - perhaps with
> > something like an RDMA cgroup.  It could force containers to only
> > use approved pkeys not only with RDMA CM, but through uverbs, and
> > other user-space interfaces.
> 
> Ideally yes. Without that it just means you can't use pkey
> meaningfully with containers..
> 

That's why partition enforcement should be separated from namespaces.

If you want to restrict a container to a specific set of pkeys, use cgroups.
This would apply both to CM MADs and QPs.
- In the MAD case, CM MADs would be first matched to a namespace and rdma_id and dropped upon pkey conflict (with either the headers or the payload).
- In the QP case, modify_qp() would fail on conflict.
Partitioning needs to be enforced also for applications that don't use the CM at all...

For namespaces, it seems more natural to lookup the namespace based solely on the CM payload.
After all, it is the payload that designates the entity that you want to establish a connection to, rather than the packet headers, which are just meant to relay the packet to the proper CM agent. (Spec-wise, it is even possible to use completely different node to open a connection on behalf of another node.)
So, the sole role of pkeys in the context of namespaces is to locate the proper namespace; not for partition isolation.

So, as a first step, let's get the namespace lookup in place.
Next, add an rdma crgoup, which would control pkeys, as well as resource counts (you don't want one container to rob the whole system from all QPs...).

> Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 15, 2015, 9:03 p.m. UTC | #5
On Wed, Jul 15, 2015 at 08:27:06PM +0000, Liran Liss wrote:
> If you want to restrict a container to a specific set of pkeys, use
> cgroups.

Ideally yes, but in the absence of a cgroup the set of pkeys assigned
to the container via ipoib is a reasonable alternate.

> This would apply both to CM MADs and QPs.
> - In the MAD case, CM MADs would be first matched to a namespace and rdma_id and dropped upon pkey conflict (with either the headers or the payload).
> - In the QP case, modify_qp() would fail on conflict.
> Partitioning needs to be enforced also for applications that don't use the CM at all...

Yep, that is how pkey checking should work, cgroup or not.

So, you agree with me.

I say that until we get a cgroup capability, the pkey list of the
container is the set of IPoIB interfaces associated with it, and we
still have to do the various checks above. The first check is relavent
to this patchset and should be done by using the GMP's headers to
locate the net device.

> For namespaces, it seems more natural to lookup the namespace based
> solely on the CM payload.

How so? Which payload content do you use? The primary path? The
alternate path?

> After all, it is the payload that designates the entity that you
> want to establish a connection to, rather than the packet headers,
> which are just meant to relay the packet to the proper CM

No, that isn't right. The IBA uses the GMP's destination first, then
serviceID as the demux. Services IDs are not globally unique, they are
scoped by the destination.

The path data is just *routing* it doesn't describe at all the entity
we want to talk to, it is only a proposal for how to flow data to it.

In any event, both the GMP headers and the path data needs to be
checked against the container's pkey list. I don't know why this is so
contentions.

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss July 16, 2015, 12:01 p.m. UTC | #6
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]

> > After all, it is the payload that designates the entity that you
> > want to establish a connection to, rather than the packet headers,
> > which are just meant to relay the packet to the proper CM
> 
> No, that isn't right. The IBA uses the GMP's destination first, then
> serviceID as the demux. Services IDs are not globally unique, they are
> scoped by the destination.
> 

The destination is the physical CA port and kernel CM agent, so I don't think the answer is that clear.
Going forward along these lines:
- Name space lookup is done based on BTH.pkey, private_data.IP, and optionally GRH.DGID (if present, for extra validation)
-- Primary and alternate paths are not considered at all

- If P_Key enforcement is set up via cgroups:
-- For CM processing, we only check BTH.pkey
--- Upon conflict, the packet is dropped
-- The primary/alternate path pkeys are not validated by CM, but will be validated during QP modification

Does this sound OK?

In any case, let's complete the namespace lookup first, and then follow up with a cgroup patchset.

> The path data is just *routing* it doesn't describe at all the entity
> we want to talk to, it is only a proposal for how to flow data to it.
> 
> In any event, both the GMP headers and the path data needs to be
> checked against the container's pkey list. I don't know why this is so
> contentions.
> 
> Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 16, 2015, 6:22 p.m. UTC | #7
On Thu, Jul 16, 2015 at 12:01:55PM +0000, Liran Liss wrote:

> - Name space lookup is done based on BTH.pkey, private_data.IP, and
>   optionally GRH.DGID (if present, for extra validation)

Just changing the pkey to BTH.pkey would be fine by me.

Using GRH.DGID if available instead of the primary path hack is also
smart, I pointed that out at the beginning..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 02914bf1e548..4701167f5117 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -300,7 +300,7 @@  static enum rdma_cm_state cma_exch(struct rdma_id_private *id_priv,
 	return old;
 }
 
-static inline u8 cma_get_ip_ver(struct cma_hdr *hdr)
+static inline u8 cma_get_ip_ver(const struct cma_hdr *hdr)
 {
 	return hdr->ip_version >> 4;
 }
@@ -1024,6 +1024,169 @@  static int cma_save_net_info(struct sockaddr *src_addr,
 	return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id);
 }
 
+struct cma_req_info {
+	struct ib_device *device;
+	int port;
+	const union ib_gid *local_gid;
+	__be64 service_id;
+	u16 pkey;
+};
+
+static int cma_save_req_info(const struct ib_cm_event *ib_event,
+			     struct cma_req_info *req)
+{
+	const struct ib_cm_req_event_param *req_param =
+		&ib_event->param.req_rcvd;
+	const struct ib_cm_sidr_req_event_param *sidr_param =
+		&ib_event->param.sidr_req_rcvd;
+
+	switch (ib_event->event) {
+	case IB_CM_REQ_RECEIVED:
+		req->device	= req_param->listen_id->device;
+		req->port	= req_param->port;
+		req->local_gid	= &req_param->primary_path->sgid;
+		req->service_id	= req_param->primary_path->service_id;
+		req->pkey	= be16_to_cpu(req_param->primary_path->pkey);
+		break;
+	case IB_CM_SIDR_REQ_RECEIVED:
+		req->device	= sidr_param->listen_id->device;
+		req->port	= sidr_param->port;
+		req->local_gid	= NULL;
+		req->service_id	= sidr_param->service_id;
+		req->pkey	= sidr_param->pkey;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
+					  const struct cma_req_info *req)
+{
+	struct sockaddr_storage listen_addr_storage;
+	struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage;
+	struct net_device *net_dev;
+	int err;
+
+	err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id);
+	if (err)
+		return ERR_PTR(err);
+
+	net_dev = ib_get_net_dev_by_params(req->device, req->port, req->pkey,
+					   req->local_gid, listen_addr);
+	if (!net_dev)
+		return ERR_PTR(-ENODEV);
+
+	return net_dev;
+}
+
+static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id)
+{
+	return (be64_to_cpu(service_id) >> 16) & 0xffff;
+}
+
+static bool cma_match_private_data(struct rdma_id_private *id_priv,
+				   const struct cma_hdr *hdr)
+{
+	struct sockaddr *addr = cma_src_addr(id_priv);
+	__be32 ip4_addr;
+	struct in6_addr ip6_addr;
+
+	if (cma_any_addr(addr) && !id_priv->afonly)
+		return true;
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		ip4_addr = ((struct sockaddr_in *)addr)->sin_addr.s_addr;
+		if (cma_get_ip_ver(hdr) != 4)
+			return false;
+		if (!cma_any_addr(addr) &&
+		    hdr->dst_addr.ip4.addr != ip4_addr)
+			return false;
+		break;
+	case AF_INET6:
+		ip6_addr = ((struct sockaddr_in6 *)addr)->sin6_addr;
+		if (cma_get_ip_ver(hdr) != 6)
+			return false;
+		if (!cma_any_addr(addr) &&
+		    memcmp(&hdr->dst_addr.ip6, &ip6_addr, sizeof(ip6_addr)))
+			return false;
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+static bool cma_match_net_dev(const struct rdma_id_private *id_priv,
+			      const struct net_device *net_dev)
+{
+	const struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
+
+	return !addr->bound_dev_if ||
+	       (net_eq(dev_net(net_dev), &init_net) &&
+		addr->bound_dev_if == net_dev->ifindex);
+}
+
+static struct rdma_id_private *cma_find_listener(
+		const struct rdma_bind_list *bind_list,
+		const struct ib_cm_id *cm_id,
+		const struct ib_cm_event *ib_event,
+		const struct cma_req_info *req,
+		const struct net_device *net_dev)
+{
+	struct rdma_id_private *id_priv, *id_priv_dev;
+
+	if (!bind_list)
+		return ERR_PTR(-EINVAL);
+
+	hlist_for_each_entry(id_priv, &bind_list->owners, node) {
+		if (cma_match_private_data(id_priv, ib_event->private_data)) {
+			if (id_priv->id.device == cm_id->device &&
+			    cma_match_net_dev(id_priv, net_dev))
+				return id_priv;
+			list_for_each_entry(id_priv_dev,
+					    &id_priv->listen_list,
+					    listen_list) {
+				if (id_priv_dev->id.device == cm_id->device &&
+				    cma_match_net_dev(id_priv_dev, net_dev))
+					return id_priv_dev;
+			}
+		}
+	}
+
+	return ERR_PTR(-EINVAL);
+}
+
+static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
+						 struct ib_cm_event *ib_event)
+{
+	struct cma_req_info req;
+	struct rdma_bind_list *bind_list;
+	struct rdma_id_private *id_priv;
+	struct net_device *net_dev;
+	int err;
+
+	err = cma_save_req_info(ib_event, &req);
+	if (err)
+		return ERR_PTR(err);
+
+	net_dev = cma_get_net_dev(ib_event, &req);
+	if (IS_ERR(net_dev))
+		return ERR_PTR(PTR_ERR(net_dev));
+
+	bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
+				cma_port_from_service_id(req.service_id));
+	id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, net_dev);
+
+	dev_put(net_dev);
+
+	return id_priv;
+}
+
 static inline int cma_user_data_offset(struct rdma_id_private *id_priv)
 {
 	return cma_family(id_priv) == AF_IB ? 0 : sizeof(struct cma_hdr);
@@ -1383,7 +1546,10 @@  static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	struct rdma_cm_event event;
 	int offset, ret;
 
-	listen_id = cm_id->context;
+	listen_id = cma_id_from_event(cm_id, ib_event);
+	if (IS_ERR(listen_id))
+		return PTR_ERR(listen_id);
+
 	if (!cma_check_req_qp_type(&listen_id->id, ib_event))
 		return -EINVAL;