NATs & Private IPs #264

Open
selshowk opened this issue Jan 10, 2021 · 8 comments

Comments

@selshowk

For GCP & AWS it's relatively easy to patch the instance.create functions to support using internal IPs for VMs (I'll post some diffs later), but once you do this the VM can no longer access the internet. The solution is to create a NAT, so I'm looking at how to implement this in each provider and whether it's easy to patch cloudbridge to do so (I also responded to #170 mentioning this).

For GCP the change (which I've tested) seems to be very minimal. The following code works:

# `router` is assumed to be an existing CloudBridge router object
gr = provider.gcp_compute.routers()
nat_data = {
    'name': 'nat-cloudbridge-test',
    'sourceSubnetworkIpRangesToNat': 'ALL_SUBNETWORKS_ALL_IP_RANGES',
    'natIpAllocateOption': 'AUTO_ONLY',
    'logConfig': {'enable': False, 'filter': 'ALL'},
    'enableEndpointIndependentMapping': True,
}
gr.patch(project=provider.project_name, region=provider.region_name,
         router=router.name, body={'nats': [nat_data]}).execute()

Rather than a separate patch, the above could simply be added to the router create call, depending on an optional arg?
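
A rough sketch of what that optional arg could look like. `build_router_body`, `enable_nat` and `nat_name` are hypothetical names for illustration and are not part of CloudBridge; the NAT fields are the ones from the snippet above, and the helper only assembles the request body, it does not call GCP.

```python
def build_router_body(name, network_url, enable_nat=False, nat_name=None):
    """Build a Compute Engine router body, optionally with a Cloud NAT config.

    Hypothetical helper for illustration; not part of CloudBridge.
    """
    body = {'name': name, 'network': network_url}
    if enable_nat:
        # Same NAT settings as the patch call above
        body['nats'] = [{
            'name': nat_name or f'{name}-nat',
            'sourceSubnetworkIpRangesToNat': 'ALL_SUBNETWORKS_ALL_IP_RANGES',
            'natIpAllocateOption': 'AUTO_ONLY',
            'logConfig': {'enable': False, 'filter': 'ALL'},
            'enableEndpointIndependentMapping': True,
        }]
    return body
```

The resulting dict would presumably be passed as the `body` of the router insert request in place of the current one, so callers who don't ask for NAT get the same router as today.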

I am investigating the equivalent for AWS (and will eventually for Azure as well). Is this something you would be interested in adding in?

@nuwang
Contributor

nuwang commented Jan 11, 2021

@selshowk I may be misunderstanding what you mean here, but do you mean that instances with only private ips, which have no floating IP assigned, are not able to connect to the internet? If so, have you performed all of the configuration steps documented here? http://cloudbridge.cloudve.org/en/latest/topics/networking.html#allowing-internet-access-from-a-launched-vm
Additionally, it may be necessary to add an outbound rule. I've not personally run into an issue with this, but perhaps your networking requirements are different?

I think a NAT may be more appropriate for different circumstances?
https://www.reddit.com/r/aws/comments/75bjei/private_subnets_nats_vs_simply_only_allowing/

@selshowk
Author

@selshowk I may be misunderstanding what you mean here, but do you mean that instances with only private ips, which have no floating IP assigned, are not able to connect to the internet?

Yes this is exactly what I mean. I create a network, subnet, router, gateway, etc and hook them up according to the Getting Started guide. Previously I used to attach floating IPs but now I've modified the AWS code to simply allow the instances to have public IPs on creation (as mentioned in #255). This works and such VMs can be accessed externally and can access the internet. But if I create some instances without public IPs or floating IPs they cannot access the internet. Note that the only way to connect to such an instance to test is to do ssh-tunneling (because they are not externally accessible anyway) or ssh in from a different instance that does have a public or elastic IP.

If so, have you performed all of the configuration steps documented here? http://cloudbridge.cloudve.org/en/latest/topics/networking.html#allowing-internet-access-from-a-launched-vm

Yes I have. Are you claiming this is enough? It's not so trivial to check, because the only way to connect to an instance to check whether it has internet access is to either tunnel or ssh in from another instance with a public IP. In any case, when I do this I find that the instance cannot access the outside world (this is true on both GCP and AWS; I haven't tested Azure yet).

On GCP I was able to fix this going through the steps I posted above. Note this is pretty important on GCP because there are strict quota limits on public IP addresses (elastic or not).

I'm still working on an AWS solution, but all the documentation I find suggests that instances created without public IPs (either in subnets with map_public_ip_on_launch=True, which is the case with CB subnets, or launched without the public_ip flag set to true) cannot access the internet. For instance see here for a discussion of the topic.

Additionally, it may be necessary to add an outbound rule. I've not personally run into an issue with this, but perhaps your networking requirements are different?

Do you have a working example of cloudbridge code that launches a VM without an elastic IP that can access the internet? If so, can you share it (and also how you tested access, since the instances in question don't have public IPs)?

@nuwang
Contributor

nuwang commented Jan 11, 2021

@selshowk In that case, I can confirm that instances without public IPs are able to create outbound connections, and we use this in practice. We are definitely not using NAT gateways. As an example of the code, please take a look here for the network setup code: https://github.com/galaxyproject/cloudlaunch/blob/master/django-cloudlaunch/cloudlaunch/backend_plugins/base_vm_app.py#L220

This is the firewall creation code: https://github.com/galaxyproject/cloudlaunch/blob/cd80c2403661d5408a3329b9133df6bf43cdd40c/django-cloudlaunch/cloudlaunch/backend_plugins/base_vm_app.py#L129

And as shown here, public IP assignment is optional: https://github.com/galaxyproject/cloudlaunch/blob/cd80c2403661d5408a3329b9133df6bf43cdd40c/django-cloudlaunch/cloudlaunch/backend_plugins/base_vm_app.py#L465

In the past, we've run into issues if networks, subnets, gateways, route tables and firewalls have some missing configuration which prevents outbound communication. Try creating a brand new set if you haven't already. By default, outbound communication is allowed (on AWS this is implicit; on GCP the rule is explicitly added by cloudbridge), so something else must be interfering if you're still running into issues:

self.provider.security._vm_firewall_rules.create_with_priority(

@selshowk
Author

@nuwang I just tried what you suggested and I'm seeing results consistent with what I described: VMs created in a subnet and not assigned a public or elastic IP cannot access the internet. Below is a shortish code snippet to show this. I've only tested this on AWS.

Do you agree, or do you see something I'm doing wrong?

# Note: get_provider() and KPNAME (the key pair name) are assumed to be defined elsewhere
image_id = 'ami-071884cefc7e770ba'      # eu-west-2 (London); the image may need to change for other regions
NETWORK_NAME = "cloudbridge-connectivity-test"
CLUSTER_NAME = "cloudbridge-connectivity-test"

# this is a keypair I've already set up manually
def get_kp(kpname=KPNAME):
    provider = get_provider()
    return provider.security.key_pairs.find(name=kpname)[0]


def create_network(network_name=NETWORK_NAME):
    provider=get_provider()
    net = provider.networking.networks.create(cidr_block='10.0.0.0/16',
                                              label=f'{network_name}-network')
    zone = provider.compute.regions.get(provider.region_name).zones[0]
    sn = net.subnets.create(cidr_block='10.0.0.0/28', label=f'{network_name}-subnet')
    router = provider.networking.routers.create(network=net,
            label=f'{network_name}-router')
    router.attach_subnet(sn)
    gateway = net.gateways.get_or_create()
    router.attach_gateway(gateway)
    return net,sn,zone,router,gateway


def create_vm_firewall(net, cluster_name=CLUSTER_NAME, 
        from_port=22, to_port=22):
    provider=get_provider()
    from cloudbridge.interfaces.resources import TrafficDirection
    fw = provider.security.vm_firewalls.create(
        label=f'{cluster_name}-firewall', 
        description='A VM firewall used by CloudBridge', network=net)
    fw.rules.create(TrafficDirection.INBOUND, 'tcp', from_port, to_port, '0.0.0.0/0')
    return fw

def launch_instance(sn, zone, kp, fw, associate_public_ip=False,
        min_cpu = 2, min_ram = 4,
        cluster_name=CLUSTER_NAME, user_data = None,
        role=None
        ):
    provider=get_provider()
    img = provider.compute.images.get(image_id)
    vm_type = sorted([t for t in provider.compute.vm_types
                      if t.vcpus >= min_cpu and t.ram >= min_ram],
                      key=lambda x: x.vcpus*x.ram)[0]
    inst = provider.compute.instances.create(
        image=img, vm_type=vm_type, label=f'{cluster_name}-{role}',
        subnet=sn, zone=zone, key_pair=kp, vm_firewalls=[fw], 
        user_data = user_data)
    # Wait until ready
    inst.wait_till_ready()  # This is a blocking call
    return inst

def assign_address(inst, gateway):
    # Attach a floating (elastic) IP only if the instance has no public IP yet
    if not inst.public_ips:
        fip = gateway.floating_ips.create()
        inst.add_floating_ip(fip)
        inst.refresh()
    return inst

def do_test():
    kp=get_kp()
    net, sn, zone, router, gateway = create_network(NETWORK_NAME)
    fw = create_vm_firewall(net, cluster_name=CLUSTER_NAME)

    bastion = launch_instance(sn, zone, kp, fw, 
            cluster_name = CLUSTER_NAME, role = "bastion",
            )
    assign_address(bastion, gateway)
    # this now works
    print(f"ssh ubuntu@{bastion.public_ips[0]}")

    priv_vm = launch_instance(sn, zone, kp, fw, 
            cluster_name = CLUSTER_NAME, role = "priv",
            )
    # use ssh proxying to access private vm
    print(f"ssh -o ProxyCommand=\"ssh -W %h:%p ubuntu@{bastion.public_ips[0]}\" ubuntu@{priv_vm.private_ips[0]}")
    # Now run these commands _on_ private VM and see that they don't work
    print(f"ssh ubuntu@{bastion.public_ips[0]}")
    print(f"telnet {bastion.public_ips[0]} 22")
    # can also try to telnet to any public server/port (telnet cnn.com 80) and will get same error
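
If telnet isn't installed on the private VM, roughly the same check can be done with the Python standard library. This is only a sketch: `can_connect` is a name made up here, and it only tells you whether a TCP connection could be opened, nothing more.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures
        return False
```

Run on the private VM, something like `can_connect("cnn.com", 80)` would be expected to return False as long as the instance has no outbound route.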

@nuwang
Contributor

nuwang commented Jan 14, 2021

@selshowk I'm checking up on this now, will let you know how it goes.

@nuwang
Contributor

nuwang commented Jan 14, 2021

It looks like you're right: it seems that you need either a NAT gateway, a public IP or an IPv6 address. I think that when I tested this scenario, I must have had subnets auto-assigning public IPs or something, although we didn't assign an elastic IP, and it seems that none of our instances with only private IPs have outbound internet connectivity after all. So apologies for sending you off on a wild-goose chase on this.

In terms of a resolution, there appear to be several possible paths. The NAT gateway is one obvious path, and it may be the simplest path, but the issue with that is the ongoing hourly cost. Another would be an egress-only gateway, but it looks like you'd need to put your instances into two different subnets for it to work - the bastion host on a subnet connected to a standard internet gateway, and the private nodes on a separate subnet connected to an egress-only gateway.

It may be more cost effective to simply allow the instance to have an auto-assigned public ip, but block all incoming traffic through a security group? This avoids the cost of an elastic IP, but offers a reasonable level of isolation?

@selshowk
Author

It looks like you're right, it seems that you need to either have a NAT gateway, a public IP or an IPv6 IP. I think that when I tested this scenario, I must have had subnets auto assign public IPs or something, although we didn't assign an elastic IP, and it seems that none of our instances with private IPs only have outbound internet connectivity after all. So apologies for sending you off on a wild-goose chase on this.

No prob, I've hit similar confusions myself. Note that this problem does NOT occur with Azure -- they seem to automatically NAT somehow.

In terms of a resolution, there appear to be several possible paths. The NAT gateway is one obvious path, and it may be the simplest path, but the issue with that is the ongoing hourly cost. Another would be an egress-only gateway, but it looks like you'd need to put your instances into two different subnets for it to work - the bastion host on a subnet connected to a standard internet gateway, and the private nodes on a separate subnet connected to an egress-only gateway.

The solutions look quite different on the three providers:

  • On AWS I have to create two subnets (a public and a private one), add a NAT gateway to the public one, and route traffic from the private subnet to the public one.
  • On GCP a simple modification of the router enables NAT (see above).
  • On Azure this works out of the box.

The above is kind of ugly in terms of homogeneity of the interface, so it might not be something you want to solve?
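
For reference, the AWS steps could be sketched directly against boto3 roughly as follows. This is not CloudBridge code: `add_nat_for_private_subnet` is a made-up name, `ec2` is assumed to be a boto3 EC2 client, and both subnets are assumed to exist already, with the public one routed to an internet gateway.

```python
def add_nat_for_private_subnet(ec2, vpc_id, public_subnet_id, private_subnet_id):
    """Route a private subnet's outbound traffic through a NAT gateway.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client('ec2').
    The public subnet must already route 0.0.0.0/0 to an internet gateway.
    """
    # A public NAT gateway needs an Elastic IP of its own
    eip = ec2.allocate_address(Domain='vpc')
    nat = ec2.create_nat_gateway(SubnetId=public_subnet_id,
                                 AllocationId=eip['AllocationId'])
    nat_id = nat['NatGateway']['NatGatewayId']
    # Creation is asynchronous; wait until the gateway is usable
    ec2.get_waiter('nat_gateway_available').wait(NatGatewayIds=[nat_id])

    # Dedicated route table for the private subnet, default route via the NAT
    rt = ec2.create_route_table(VpcId=vpc_id)
    rt_id = rt['RouteTable']['RouteTableId']
    ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock='0.0.0.0/0',
                     NatGatewayId=nat_id)
    ec2.associate_route_table(RouteTableId=rt_id, SubnetId=private_subnet_id)
    return nat_id, rt_id
```

Note that the NAT gateway itself consumes one Elastic IP and is billed while it exists, so cleanup would need to delete the gateway and release the address as well.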

It may be more cost effective to simply allow the instance to have an auto-assigned public ip, but block all incoming traffic through a security group? This avoids the cost of an elastic IP, but offers a reasonable level of isolation?

I don't think that's the path we'll take but I have actually modified CB (locally) to auto-assign public IPs for the servers I do want to be externally accessible. We need to spin up a lot of short-lived, externally accessible servers so static IPs are not a good option. Given how minimal the changes are I think supporting this natively in CB makes sense. I mentioned this in #255 but I don't have my code in shape to PR (because I'm not editing cloudbridge, I'm sort of monkey-patching it).

@nuwang
Contributor

nuwang commented Jan 15, 2021

The solutions look quite different on the three providers:

on aws I have to create two subnets (a public & private one), add a NAT gateway to the public one and route traffic from the private net to the public one.
on GCP a simple modification of the router enables NAT (see above)
on Azure this works out of the box
The above is kind of ugly in terms of homogeneity of the interface so it might not be something you want to solve?

Thanks for investigating these options. From a cloudbridge perspective, I think a sensible default would be for all "private" instances to have outbound internet connectivity, but no inbound connectivity at all. For AWS, I think that simply enabling public IP auto assignment on the subnet by default would make the process fairly seamless. On GCP, I guess we can enable NAT on the router by default. Looks like Azure isn't a problem, which leaves OpenStack. I'm not 100% sure what a good solution there might be, but my guess is that it might be "auto-natted" like Azure. Will check on this.
@almahmoud @afgane Any thoughts?
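
For the AWS part, enabling auto assignment on a subnet might look roughly like this with boto3. This is a sketch, not what CloudBridge does today: `enable_auto_public_ip` is a made-up name and `ec2` is assumed to be a boto3 EC2 client.

```python
def enable_auto_public_ip(ec2, subnet_id):
    """Make instances launched into the subnet receive a public IP automatically.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client('ec2').
    Returns the subnet's MapPublicIpOnLaunch value after the change.
    """
    ec2.modify_subnet_attribute(SubnetId=subnet_id,
                                MapPublicIpOnLaunch={'Value': True})
    # Read the attribute back to confirm the setting took effect
    subnet = ec2.describe_subnets(SubnetIds=[subnet_id])['Subnets'][0]
    return subnet['MapPublicIpOnLaunch']
```

Inbound isolation would then rest entirely on the security group: a newly created AWS security group has no inbound rules and allows all outbound traffic, so instances stay unreachable from outside until a rule is added.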
