%%bash
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip" # Download the AWS CLI v1 bundled installer from Amazon
unzip awscli-bundle.zip
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
rm ./awscli-bundle.zip
rm -r ./awscli-bundle
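Once the installer finishes, it can help to verify that `aws` actually landed on the PATH before moving on. Below is a minimal Python check; the `cli_version` helper is our own sketch, not part of the AWS CLI:

```python
import shutil
import subprocess
from typing import Optional

def cli_version(name: str) -> Optional[str]:
    """Return the tool's --version output, or None if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True)
    # The AWS CLI v1 historically printed its version to stderr, so check both streams.
    return (out.stdout or out.stderr).strip() or None

print(cli_version("aws"))
```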
%%bash
if [[ -d ~/.aws ]] # If a ~/.aws directory already exists, remove it
then
rm -r ~/.aws
fi
mkdir ~/.aws
touch ~/.aws/config
echo "[default]" >> ~/.aws/config
echo "region=us-west-2" >> ~/.aws/config # THIS MUST MATCH the region in cluster-config.yaml.
echo "output=json" >> ~/.aws/config
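The same profile can be written and read back from Python with the standard `configparser` module. This sketch uses a temporary file so it does not touch the real `~/.aws/config`:

```python
import configparser
import os
import tempfile

# Build the same [default] profile the bash cell above creates.
config = configparser.ConfigParser()
config["default"] = {"region": "us-west-2", "output": "json"}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config")
    with open(path, "w") as f:
        config.write(f)
    # Read it back to confirm the values round-trip.
    check = configparser.ConfigParser()
    check.read(path)
    print(check["default"]["region"])  # us-west-2
```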
%%bash
sudo apt-get update -y
sudo apt-get install rsync -y
!ray up ./cluster-config.yaml -y # launch the cluster; -y auto-accepts the confirmation prompt
Cluster: example_cluster
Checking AWS environment settings
AWS config
IAM Profile: ray-autoscaler-v1 [default]
EC2 Key pair (all available node types): ray-autoscaler_18_us-west-2 [default]
VPC Subnets (all available node types): subnet-b29806ca, subnet-74259d3e [default]
EC2 Security groups (all available node types): sg-05ac775661dd43007 [default]
EC2 AMI (all available node types): ami-0a2363a9cff180a64 [dlami]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=subnet-74259d3e]
Launched instance i-0784be4c1d83f9a84 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Received: 34.214.186.57
SSH still not available (SSH command failed.), retrying in 5 seconds.
SSH still not available (SSH command failed.), retrying in 5 seconds.
SSH still not available (SSH command failed.), retrying in 5 seconds.
SSH still not available (SSH command failed.), retrying in 5 seconds.
02:56:42 up 0 min, 1 user, load average: 3.28, 0.81, 0.27
Success.
Updating cluster configuration. [hash=9712f8be1aa1ec7cddf165cb0530e95ab3d2576c]
New status: syncing-files
[2/7] Processing file mounts
[3/7] No worker file mounts to sync
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initalizing command runner
latest-cpu: Pulling from rayproject/ray
Digest: sha256:c3b15b82825d978fd068a1619e486020c7211545c80666804b08a95ef7665371
Status: Downloaded newer image for rayproject/ray:latest-cpu
docker.io/rayproject/ray:latest-cpu
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
2021-08-17 02:59:04,177 WARNING command_runner.py:904 -- Nvidia Container Runtime is present, but no GPUs found.
e60e3bb452763ad36ed2640a5cb51b647fd86115eadf520f7d4de4b7871ab1e4
[6/7] Running setup commands
(0/2) pip install joblib
Collecting joblib
Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
|████████████████████████████████| 303 kB 6.3 MB/s
Installing collected packages: joblib
Successfully installed joblib-1.0.1
(1/2) pip install scikit-learn
Collecting scikit-learn
Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
|████████████████████████████████| 22.3 MB 26.6 MB/s
Requirement already satisfied: scipy>=0.19.1 in ./anaconda3/lib/python3.7/site-packages (from scikit-learn) (1.7.1)
Requirement already satisfied: numpy>=1.13.3 in ./anaconda3/lib/python3.7/site-packages (from scikit-learn) (1.21.1)
Requirement already satisfied: joblib>=0.11 in ./anaconda3/lib/python3.7/site-packages (from scikit-learn) (1.0.1)
Collecting threadpoolctl>=2.0.0
Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.2.0
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Local node IP: 172.31.47.32
2021-08-16 19:59:53,639 INFO services.py:1247 -- View the Ray dashboard at http://127.0.0.1:8265
2021-08-16 19:59:53,642 WARNING services.py:1716 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 196902912 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.25gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='172.31.47.32:6379' --redis-password='5241590000000000'
Alternatively, use the following Python code:
import ray
ray.init(address='auto', _redis_password='5241590000000000')
If connection fails, check your firewall settings and network configuration.
To terminate the Ray runtime, run
ray stop
New status: up-to-date
Useful commands
Monitor autoscaling with
ray exec /work/cluster-config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Connect to a terminal on the cluster head:
ray attach /work/cluster-config.yaml
Get a remote shell to the cluster manually:
ssh -tt -o IdentitiesOnly=yes -i /root/.ssh/ray-autoscaler_18_us-west-2.pem ubuntu@34.214.186.57 docker exec -it ray_container /bin/bash
# Remove any stale head-node instance info left over from a previous run
import os
if os.path.exists('headinstance.json'):
    os.remove('headinstance.json')
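Equivalently, `pathlib` (Python 3.8+) folds the existence check into `unlink`:

```python
from pathlib import Path

# missing_ok=True makes unlink a no-op when the file is absent (Python 3.8+)
Path('headinstance.json').unlink(missing_ok=True)
```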
%%bash
aws ec2 describe-instances --filters "Name=tag:Name,Values=ray-example_cluster-head" \
    --query "Reservations[*].Instances[*].{Ip:PublicIpAddress, SgGroupId:NetworkInterfaces[*].Groups[*].GroupId}" \
    > headinstance.json
# Read the required info
import json
with open('headinstance.json') as f:
    instance_info = json.load(f)
for instance in instance_info:
    if instance[0]['Ip'] is not None:
        conn_addr, sggroupid = instance[0]['Ip'], instance[0]['SgGroupId'][0][0]
print(f'Connect to {conn_addr} with sggroupid {sggroupid}')
Connect to 34.214.186.57 with sggroupid sg-05ac775661dd43007
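The nested indexing above follows the shape of the `--query` result: a list of reservations, each holding a list of `{Ip, SgGroupId}` objects. A self-contained sketch of that parsing, using the values from this run as sample data:

```python
import json

# One reservation, containing one instance with its public IP and the
# (doubly nested) security group IDs of its network interfaces.
sample = json.loads("""
[[{"Ip": "34.214.186.57", "SgGroupId": [["sg-05ac775661dd43007"]]}]]
""")

for reservation in sample:
    inst = reservation[0]
    if inst["Ip"] is not None:
        conn_addr, sggroupid = inst["Ip"], inst["SgGroupId"][0][0]
print(conn_addr, sggroupid)
```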
!aws ec2 authorize-security-group-ingress --group-id {sggroupid} --protocol tcp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 authorize-security-group-ingress --group-id {sggroupid} --protocol udp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 authorize-security-group-egress --group-id {sggroupid} --protocol tcp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 authorize-security-group-egress --group-id {sggroupid} --protocol udp --port 10001 --cidr $(curl ipinfo.io/ip)/24
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 82 0 --:--:-- --:--:-- --:--:-- 82
{
"Return": true,
"SecurityGroupRules": [
{
"SecurityGroupRuleId": "sgr-03ee1306066d8c04c",
"GroupId": "sg-05ac775661dd43007",
"GroupOwnerId": "680074127864",
"IsEgress": false,
"IpProtocol": "tcp",
"FromPort": 10001,
"ToPort": 10001,
"CidrIpv4": "54.152.6.0/24"
}
]
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 363 0 --:--:-- --:--:-- --:--:-- 363
{
"Return": true,
"SecurityGroupRules": [
{
"SecurityGroupRuleId": "sgr-0d845b570e1315431",
"GroupId": "sg-05ac775661dd43007",
"GroupOwnerId": "680074127864",
"IsEgress": false,
"IpProtocol": "udp",
"FromPort": 10001,
"ToPort": 10001,
"CidrIpv4": "54.152.6.0/24"
}
]
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 363 0 --:--:-- --:--:-- --:--:-- 363
{
"Return": true,
"SecurityGroupRules": [
{
"SecurityGroupRuleId": "sgr-041471d1efced867b",
"GroupId": "sg-05ac775661dd43007",
"GroupOwnerId": "680074127864",
"IsEgress": true,
"IpProtocol": "tcp",
"FromPort": 10001,
"ToPort": 10001,
"CidrIpv4": "54.152.6.0/24"
}
]
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 210 0 --:--:-- --:--:-- --:--:-- 210
{
"Return": true,
"SecurityGroupRules": [
{
"SecurityGroupRuleId": "sgr-0e7a89b76dbd3d919",
"GroupId": "sg-05ac775661dd43007",
"GroupOwnerId": "680074127864",
"IsEgress": true,
"IpProtocol": "udp",
"FromPort": 10001,
"ToPort": 10001,
"CidrIpv4": "54.152.6.0/24"
}
]
}
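Each rule's `CidrIpv4` above is the /24 network around our public IP, which is exactly what `$(curl ipinfo.io/ip)/24` expands to. The same block can be computed with the standard `ipaddress` module (the IP below is illustrative):

```python
import ipaddress

def cidr_for(ip: str, prefix: int = 24) -> str:
    """Return the CIDR block of the given prefix length that contains ip."""
    # strict=False accepts a host address and masks it down to the network address
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

print(cidr_for("54.152.6.91"))  # 54.152.6.0/24
```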
import time
time.sleep(30) # Let things settle down on the head node
import ray
ray.init(address=f'ray://{conn_addr}:10001')
import socket
from collections import Counter
@ray.remote
def check_hosts():
    time.sleep(5)
    return socket.gethostname()

for run in range(5):
    if run != 0:
        time.sleep(60)  # give newly launched worker nodes time to initialize
    remote_promises = [check_hosts.remote() for _ in range(10)]
    ids = ray.get(remote_promises)
    print(Counter(ids))
(autoscaler +19s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +19s) Adding 3 nodes of type ray.worker.default.
Counter({'ip-172-31-47-32': 10})
(autoscaler +1m16s) Resized to 2 CPUs.
(autoscaler +1m22s) Resized to 3 CPUs.
(autoscaler +2m1s) Adding 1 nodes of type ray.worker.default.
Counter({'ip-172-31-47-32': 4, 'ip-172-31-35-132': 3, 'ip-172-31-45-161': 3})
Counter({'ip-172-31-47-32': 4, 'ip-172-31-35-132': 3, 'ip-172-31-45-161': 3})
(autoscaler +3m43s) Resized to 5 CPUs.
(autoscaler +3m49s) Resized to 6 CPUs.
Counter({'ip-172-31-47-32': 2, 'ip-172-31-35-132': 2, 'ip-172-31-22-184': 2, 'ip-172-31-45-161': 2, 'ip-172-31-31-234': 1, 'ip-172-31-19-139': 1})
(autoscaler +5m31s) Resized to 7 CPUs.
Counter({'ip-172-31-47-32': 2, 'ip-172-31-35-132': 2, 'ip-172-31-22-184': 2, 'ip-172-31-43-4': 1, 'ip-172-31-45-161': 1, 'ip-172-31-31-234': 1, 'ip-172-31-19-139': 1})
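The fan-out-and-count pattern in `check_hosts` can be reproduced locally, without a cluster, using the standard library's thread pool; with no remote workers, the Counter collapses to a single hostname:

```python
import socket
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def check_host() -> str:
    time.sleep(0.1)  # stand-in for real work
    return socket.gethostname()

# Submit ten tasks and gather results, mirroring the remote() / ray.get() pair.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(check_host) for _ in range(10)]
    ids = [f.result() for f in futures]

print(Counter(ids))  # a single key: every task ran on this machine
```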
!aws ec2 revoke-security-group-ingress --group-id {sggroupid} --protocol tcp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 revoke-security-group-ingress --group-id {sggroupid} --protocol udp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 revoke-security-group-egress --group-id {sggroupid} --protocol tcp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!aws ec2 revoke-security-group-egress --group-id {sggroupid} --protocol udp --port 10001 --cidr $(curl ipinfo.io/ip)/24
!ray down ./cluster-config.yaml -y
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 179 0 --:--:-- --:--:-- --:--:-- 179
{
"Return": true
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 324 0 --:--:-- --:--:-- --:--:-- 324
{
"Return": true
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 342 0 --:--:-- --:--:-- --:--:-- 342
{
"Return": true
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12 100 12 0 0 363 0 --:--:-- --:--:-- --:--:-- 375
{
"Return": true
}
Checking AWS environment settings
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.214.186.57
Stopped all 15 Ray processes.
Fetched IP: 34.214.186.57
Fetched IP: 54.244.171.137
Fetched IP: 54.201.158.156
Fetched IP: 54.203.18.219
Fetched IP: 35.164.127.32
Fetched IP: 54.245.31.197
Fetched IP: 54.191.169.84
Requested 7 nodes to shut down. [interval=1s]
0 nodes remaining after 5 second(s).
No nodes remaining.