Heterogeneous Cluster - a hello world training job

This basic example shows how to run a Heterogeneous Clusters training job consisting of two instance groups. Each instance group uses a different instance type. Each instance prints its environment information, including its instance group, and exits.

You can retrieve environment information in either of the following ways:

- Option 1: Read instance group information using the convenient sagemaker_training.environment.Environment class.
- Option 2: Read instance group information from /opt/ml/input/config/resourceconfig.json.

Note: This notebook does not demonstrate offloading data preprocessing to data_group and deep neural network training to dnn_group. Those examples are covered in two other notebooks: "TensorFlow’s tf.data.service based Amazon SageMaker Heterogeneous Clusters for training" and "PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters for training".

A. Setting up SageMaker Studio notebook

Before you start

Ensure you have selected the Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized) image for your SageMaker Studio notebook, and that it is running on the ml.t3.medium instance type.

Step 1 - Upgrade SageMaker SDK and dependent packages

Heterogeneous Clusters for Amazon SageMaker model training was announced on 07/08/2022. This feature requires up-to-date versions of the SageMaker SDK and the boto3 client libraries.

[3]:

%%bash
python3 -m pip install --upgrade boto3 botocore awscli sagemaker

Requirement already satisfied: boto3 in /usr/local/lib/python3.8/site-packages (1.24.72)
Collecting boto3
  Downloading boto3-1.24.83-py3-none-any.whl (132 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.5/132.5 kB 2.5 MB/s eta 0:00:00
Requirement already satisfied: botocore in /usr/local/lib/python3.8/site-packages (1.27.72)
Collecting botocore
  Downloading botocore-1.27.83-py3-none-any.whl (9.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.2/9.2 MB 42.5 MB/s eta 0:00:00
Requirement already satisfied: awscli in /usr/local/lib/python3.8/site-packages (1.25.73)
Collecting awscli
  Downloading awscli-1.25.84-py3-none-any.whl (3.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 35.4 MB/s eta 0:00:00
Requirement already satisfied: sagemaker in /usr/local/lib/python3.8/site-packages (2.109.0)
Collecting sagemaker
  Downloading sagemaker-2.110.0.tar.gz (576 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 576.0/576.0 kB 9.9 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /usr/local/lib/python3.8/site-packages (from boto3) (0.6.0)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.8/site-packages (from boto3) (0.10.0)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /usr/local/lib/python3.8/site-packages (from botocore) (1.25.11)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.8/site-packages (from botocore) (2.8.2)
Requirement already satisfied: colorama<0.4.5,>=0.2.5 in /usr/local/lib/python3.8/site-packages (from awscli) (0.4.3)
Requirement already satisfied: PyYAML<5.5,>=3.10 in /usr/local/lib/python3.8/site-packages (from awscli) (5.4.1)
Requirement already satisfied: docutils<0.17,>=0.10 in /usr/local/lib/python3.8/site-packages (from awscli) (0.15.2)
Requirement already satisfied: rsa<4.8,>=3.1.2 in /usr/local/lib/python3.8/site-packages (from awscli) (4.7.2)
Requirement already satisfied: attrs<22,>=20.3.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (21.2.0)
Requirement already satisfied: google-pasta in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.2.0)
Requirement already satisfied: numpy<2.0,>=1.9.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.19.5)
Requirement already satisfied: protobuf<4.0,>=3.1 in /usr/local/lib/python3.8/site-packages (from sagemaker) (3.19.1)
Requirement already satisfied: protobuf3-to-dict<1.0,>=0.1.5 in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.1.5)
Requirement already satisfied: smdebug_rulesconfig==1.0.1 in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.0.1)
Requirement already satisfied: importlib-metadata<5.0,>=1.4.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (4.8.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (21.3)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.2.5)
Requirement already satisfied: pathos in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.2.8)
Collecting schema
  Downloading schema-0.7.5-py2.py3-none-any.whl (17 kB)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/site-packages (from importlib-metadata<5.0,>=1.4.0->sagemaker) (3.6.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/site-packages (from packaging>=20.0->sagemaker) (3.0.6)
Requirement already satisfied: six in /usr/local/lib/python3.8/site-packages (from protobuf3-to-dict<1.0,>=0.1.5->sagemaker) (1.16.0)
Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.8/site-packages (from rsa<4.8,>=3.1.2->awscli) (0.4.8)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas->sagemaker) (2021.3)
Requirement already satisfied: dill>=0.3.4 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.3.4)
Requirement already satisfied: ppft>=1.6.6.4 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (1.6.6.4)
Requirement already satisfied: pox>=0.3.0 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.3.0)
Requirement already satisfied: multiprocess>=0.70.12 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.70.12.2)
Collecting contextlib2>=0.5.5
  Downloading contextlib2-21.6.0-py2.py3-none-any.whl (13 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py): started
  Building wheel for sagemaker (setup.py): finished with status 'done'
  Created wheel for sagemaker: filename=sagemaker-2.110.0-py2.py3-none-any.whl size=791666 sha256=5e4f859fef28f399b5eb60568410a22ddb2c42bbc357d0b3eae61587a14ca679
  Stored in directory: /root/.cache/pip/wheels/ad/56/4f/4c5b1ed9fb3a725a634741aa293beb6fad882af965e2ccb6ae
Successfully built sagemaker
Installing collected packages: contextlib2, schema, botocore, boto3, awscli, sagemaker
  Attempting uninstall: botocore
    Found existing installation: botocore 1.27.72
    Uninstalling botocore-1.27.72:
      Successfully uninstalled botocore-1.27.72
  Attempting uninstall: boto3
    Found existing installation: boto3 1.24.72
    Uninstalling boto3-1.24.72:
      Successfully uninstalled boto3-1.24.72
  Attempting uninstall: awscli
    Found existing installation: awscli 1.25.73
    Uninstalling awscli-1.25.73:
      Successfully uninstalled awscli-1.25.73
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.109.0
    Uninstalling sagemaker-2.109.0:
      Successfully uninstalled sagemaker-2.109.0
Successfully installed awscli-1.25.84 boto3-1.24.83 botocore-1.27.83 contextlib2-21.6.0 sagemaker-2.110.0 schema-0.7.5

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Step 2 - Restart the notebook kernel

[ ]:

# Uncomment the two lines below to restart the kernel so that the upgraded packages are loaded.
#import IPython
#IPython.Application.instance().kernel.do_shutdown(True)

Step 3 - Validate SageMaker Python SDK and TensorFlow versions

Ensure the output of the cell below shows:

- SageMaker Python SDK version 2.98.0 or above
- boto3 1.24 or above
- botocore 1.27 or above
- TensorFlow 2.6 or above

[4]:

!pip show sagemaker boto3 botocore tensorflow protobuf |egrep 'Name|Version|---'

Name: sagemaker
Version: 2.110.0
---
Name: boto3
Version: 1.24.83
---
Name: botocore
Version: 1.27.83
---
Name: tensorflow
Version: 2.6.2
---
Name: protobuf
Version: 3.19.1
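
Version strings are easy to misjudge by eye: 2.110.0 is newer than 2.98.0, although a plain string comparison would say otherwise. The following is a minimal sketch of a numeric check against the minimums listed in this step. The helper names (parse_version, meets_minimum) are hypothetical, and the installed versions are copied from the cell output above; in a live notebook they could instead be read with importlib.metadata.version.

```python
def parse_version(text):
    """Turn '2.110.0' into (2, 110, 0); only leading numeric parts are kept."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.24' and '1.24.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


# Minimums listed in this step; installed versions copied from the output above.
MINIMUMS = {"sagemaker": "2.98.0", "boto3": "1.24", "botocore": "1.27", "tensorflow": "2.6"}
INSTALLED = {"sagemaker": "2.110.0", "boto3": "1.24.83", "botocore": "1.27.83", "tensorflow": "2.6.2"}

for name, minimum in MINIMUMS.items():
    status = "OK" if meets_minimum(INSTALLED[name], minimum) else "TOO OLD"
    print(f"{name} {INSTALLED[name]} (needs >= {minimum}): {status}")
```

Comparing tuples of integers rather than raw strings is what makes 2.110.0 rank above 2.98.0.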

B. Run a heterogeneous cluster training job

Step 1: Set up training environment

Import the required libraries that enable you to use Heterogeneous Clusters for training. In this step, you also obtain this notebook’s IAM execution role and create a SageMaker session.

[7]:

import os
import json
import datetime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup

sess = sagemaker.Session()
role = get_execution_role()

Step 2: Define instance groups

Here we define two instance groups, each with a different instance type: data_group with one ml.c5.xlarge instance and dnn_group with one ml.m4.xlarge instance.

[8]:

data_group = InstanceGroup("data_group", "ml.c5.xlarge", 1)
dnn_group = InstanceGroup("dnn_group", "ml.m4.xlarge", 1)
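
For orientation, the sketch below shows how the two groups might look in a low-level training job request. It assumes that the ResourceConfig of the CreateTrainingJob API accepts an InstanceGroups list whose entries carry InstanceGroupName, InstanceType and InstanceCount. The helper name to_instance_group_specs is hypothetical; the SageMaker Python SDK builds the real request for you.

```python
# Hypothetical helper: converts (name, instance type, count) tuples into the
# InstanceGroups entries assumed in the paragraph above.
def to_instance_group_specs(groups):
    return [
        {"InstanceGroupName": name, "InstanceType": instance_type, "InstanceCount": count}
        for name, instance_type, count in groups
    ]


# Same values as the two InstanceGroup objects defined in this step.
specs = to_instance_group_specs([
    ("data_group", "ml.c5.xlarge", 1),
    ("dnn_group", "ml.m4.xlarge", 1),
])
total_instances = sum(spec["InstanceCount"] for spec in specs)

print(specs)
print(f"total instances: {total_instances}")
```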

Step 3: Review the “hello world” training code

[10]:

!pygmentize source_dir/train.py

import json
import os
import sys
from sagemaker_training import environment # This module is present on the DLC images, or you can install it with pip install sagemaker_training

if __name__ == "__main__":

    print("Option-1: Read instance group information from the sagemaker_training.environment.Environment class")
    env = environment.Environment()
    print(f"env.is_hetero: {env.is_hetero}")
    print(f"env.current_host: {env.current_host}")
    print(f"env.current_instance_type: {env.current_instance_type}")
    print(f"env.current_instance_group: {env.current_instance_group}")
    print(f"env.current_instance_group_hosts: {env.current_instance_group_hosts}")
    print(f"env.instance_groups: {env.instance_groups}")
    print(f"env.instance_groups_dict: {env.instance_groups_dict}")
    print(f"env.distribution_hosts: {env.distribution_hosts}")
    print(f"env.distribution_instance_groups: {env.distribution_instance_groups}")


    file_path = '/opt/ml/input/config/resourceconfig.json'
    print(f"Option-2: Read instance group information from {file_path}. "
          "You'll need to parse the json yourself. This doesn't require an additional library.\n")

    with open(file_path, 'r') as f:
        config = json.load(f)

    print(f'{file_path} dump = {json.dumps(config, indent=4, sort_keys=True)}')

    print(f"env.is_hetero: {'instance_groups' in config}")
    print(f"current_host={config['current_host']}")
    print(f"current_instance_type={config['current_instance_type']}")
    print(f"env.current_instance_group: {config['current_group_name']}")
    print(f"env.current_instance_group_hosts: TODO")
    print(f"env.instance_groups: TODO")
    print(f"env.instance_groups_dict: {config['instance_groups']}")
    print(f"env.distribution_hosts: TODO")
    print(f"env.distribution_instance_groups: TODO")
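
Some of the values marked TODO in Option 2 can be derived from the same JSON document. The sketch below is one way to do it, assuming the file has the shape shown in the job logs in Step 6: each entry of instance_groups carries instance_group_name, instance_type and hosts. The function name summarize_groups is hypothetical, and the sample dictionary stands in for the parsed file. The distribution_* values are not derived here, because nothing in that dump appears to carry them.

```python
# Stand-in for json.load() on /opt/ml/input/config/resourceconfig.json,
# using the values printed in this notebook's job logs.
sample_config = {
    "current_group_name": "data_group",
    "current_host": "algo-1",
    "current_instance_type": "ml.c5.xlarge",
    "hosts": ["algo-1", "algo-2"],
    "instance_groups": [
        {"hosts": ["algo-1"], "instance_group_name": "data_group", "instance_type": "ml.c5.xlarge"},
        {"hosts": ["algo-2"], "instance_group_name": "dnn_group", "instance_type": "ml.m4.xlarge"},
    ],
    "network_interface_name": "eth0",
}


def summarize_groups(config):
    """Derive group-level fields from a parsed resourceconfig.json dictionary."""
    groups = config.get("instance_groups", [])
    by_name = {group["instance_group_name"]: group for group in groups}
    current = by_name.get(config.get("current_group_name"), {})
    return {
        "is_hetero": bool(groups),
        "current_instance_group_hosts": sorted(current.get("hosts", [])),
        "instance_groups": sorted(by_name),
    }


summary = summarize_groups(sample_config)
for key, value in summary.items():
    print(f"{key}: {value}")
```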

Step 4: Configure the Estimator

In order to use SageMaker to fit our algorithm, we’ll create an Estimator that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training.

[13]:

estimator = TensorFlow(
    entry_point='train.py',
    source_dir='./source_dir',
    # instance_type and instance_count are replaced by instance_groups below.
    # instance_type='ml.m4.xlarge',
    # instance_count=1,
    instance_groups=[data_group, dnn_group],
    framework_version='2.9.1',
    py_version='py39',
    role=role,
    volume_size=10,
    max_run=3600,
    disable_profiler=True,
)

Step 5: Submit the training job

Here you are submitting the heterogeneous cluster training job.

[14]:

estimator.fit(
    job_name='hello-world-heterogeneous' +
    '-' + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)

2022-09-30 17:23:58 Starting - Starting the training job...
2022-09-30 17:24:26 Starting - Preparing the instances for training.........
2022-09-30 17:25:56 Downloading - Downloading input data...
2022-09-30 17:26:22 Training - Downloading the training image...............
2022-09-30 17:28:53 Training - Training image download completed. Training in progress....
2022-09-30 17:29:24 Uploading - Uploading generated training model
2022-09-30 17:29:24 Completed - Training job completed
..Training seconds: 0
Billable seconds: 0
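
The job name above is made unique by appending a UTC timestamp to a fixed prefix. The sketch below wraps that idea in a small helper and checks the result before submission. It assumes that training job names are limited to 63 characters of letters, digits and hyphens, starting and ending with a letter or digit; the helper name make_job_name and the prefix are hypothetical.

```python
import datetime
import re

# Assumed naming rule: letters, digits and hyphens, at most 63 characters,
# beginning and ending with a letter or digit.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, now=None):
    """Return '<prefix>-<UTC timestamp>' or raise ValueError if the name is invalid."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    name = f"{prefix}-{now.strftime('%Y%m%dT%H%M%SZ')}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


print(make_job_name("hello-world"))
```

Passing the name explicitly, as in estimator.fit(job_name=...), makes the job easier to find in the console than an auto-generated name.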

Step 6: Review the logs for environment information

Wait for the training job to finish, then review its logs in the AWS Console (click View logs from the Training Jobs node in the Amazon SageMaker Console). You’ll find two logs, one per instance (algo-1 and algo-2). Examine the printouts on each node to see how to retrieve instance group environment information. An example is shown here:

Option-1: Read instance group information from the sagemaker_training.environment.Environment class
env.is_hetero: True
env.current_host: algo-1
env.current_instance_type: ml.c5.xlarge
env.current_instance_group: data_group
env.current_instance_group_hosts: ['algo-1']
env.instance_groups: ['data_group', 'dnn_group']

Option-2: Read instance group information from {file_path}.            You'll need to parse the json yourself. This doesn't require an additional library.
/opt/ml/input/config/resourceconfig.json dump = {
    "current_group_name": "data_group",
    "current_host": "algo-1",
    "current_instance_type": "ml.c5.xlarge",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "instance_groups": [
        {
            "hosts": [
                "algo-1"
            ],
            "instance_group_name": "data_group",
            "instance_type": "ml.c5.xlarge"
        },
        {
            "hosts": [
                "algo-2"
            ],
            "instance_group_name": "dnn_group",
            "instance_type": "ml.m4.xlarge"
        }
    ],
    "network_interface_name": "eth0"
}
env.is_hetero: True
current_host=algo-1
current_instance_type=ml.c5.xlarge
env.current_instance_group: data_group
env.current_instance_group_hosts: TODO
env.instance_groups: TODO
env.instance_groups_dict: [{'instance_group_name': 'data_group', 'instance_type': 'ml.c5.xlarge', 'hosts': ['algo-1']}, {'instance_group_name': 'dnn_group', 'instance_type': 'ml.m4.xlarge', 'hosts': ['algo-2']}]
env.distribution_hosts: TODO
env.distribution_instance_groups: TODO
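
Once an instance knows its group, a training script can branch on it. The following is a minimal sketch of that dispatch, keyed on the current_group_name field shown in the dump above. The two handler functions are hypothetical placeholders that only return a label; a real job would start its data preprocessing or training code there.

```python
def run_data_preprocessing():
    # Placeholder: a real job would start its data preprocessing process here.
    return "data preprocessing"


def run_dnn_training():
    # Placeholder: a real job would start the model training loop here.
    return "model training"


# Maps each instance group name used in this notebook to the task it runs.
ROLE_BY_GROUP = {
    "data_group": run_data_preprocessing,
    "dnn_group": run_dnn_training,
}


def dispatch(config):
    """Run the task that matches this instance's group and return its label."""
    group = config["current_group_name"]
    try:
        handler = ROLE_BY_GROUP[group]
    except KeyError:
        raise ValueError(f"no role defined for instance group {group!r}") from None
    return handler()


print(dispatch({"current_group_name": "data_group"}))
print(dispatch({"current_group_name": "dnn_group"}))
```

Failing loudly on an unknown group name keeps a misconfigured instance from silently doing nothing.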

C. Next steps

In this notebook, we demonstrated how to retrieve the environment information, and differentiate which instance group an instance belongs to. Based on this, you can build logic to offload data processing tasks in your training job to a dedicated instance group. To understand how that can be done with a real-world example, we suggest going through the following notebook examples: