Download, Structure, and Preprocess Image Data for PyTorch Models

Notes:

* This notebook should be used with the conda_pytorch_latest_p36 kernel.
* You can also explore image preprocessing with TensorFlow and SageMaker Built-in Algorithms by running Download, Structure, and Preprocess Image Data for TensorFlow Models and Download, Structure, and Preprocess Image Data for SageMaker Built-In Algorithms, respectively.

The main purpose of this notebook is to demonstrate how you can preprocess image data to train PyTorch Models.

Contents

Part 1: Download the Dataset
Part 2: Structure the Dataset
Part 3: Preprocess Images for PyTorch Models
Part 4: Train the PyTorch Model

## Part 1: Download the Dataset

In this section, you will use a dataset manifest to download animal images from the COCO dataset for all ten animal classes. You will then download frog images from the CIFAR dataset and add them to your COCO animal images. In order to simulate coming to SageMaker with your own dataset, we will keep the data in an unstructured form until Part 2, where you will learn the best practices for structuring an image dataset.

[ ]:

import json
import pickle
import shutil
import urllib.request
import pathlib
import tarfile
from tqdm import tqdm
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from imageio import imread, imwrite
from joblib import Parallel, delayed, parallel_backend
from sagemaker.pytorch import PyTorch  # SageMaker estimator for PyTorch training jobs

The COCO and CIFAR Datasets

For this series of notebooks we will be sampling images from the COCO dataset and CIFAR-10 dataset (before beginning the notebooks in this series, it’s a good idea to browse each dataset website to familiarize yourself with the data). Both are datasets of images, but they come formatted very differently. The COCO dataset contains images from Flickr that represent a real-world dataset which isn’t formatted or resized specifically for deep learning. This makes it a good dataset for this guide because we want it to be as comprehensive as possible. The CIFAR-10 images, on the other hand, are preprocessed specifically for deep learning as they come cropped, resized and vectorized (i.e. not in a readable image format). This notebook will show you how to work with both types of datasets.

Download the annotations

The dataset annotation file contains info on each image in the dataset such as the class, superclass, file name and url to download the file. Just the annotations for the COCO dataset are about 242MB.

[ ]:

anno_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
urllib.request.urlretrieve(anno_url, "coco-annotations.zip");

[ ]:

shutil.unpack_archive("coco-annotations.zip")

Load the annotations into Python

The training and validation annotations come in separate files.

[ ]:

with open("annotations/instances_train2017.json", "r") as f:
    train_metadata = json.load(f)

with open("annotations/instances_val2017.json", "r") as f:
    val_metadata = json.load(f)

Extract only the animal annotations

To limit the scope of the dataset for this guide, we’re only using the images of animals in the COCO dataset.

[ ]:

category_labels = {
    c["id"]: c["name"] for c in train_metadata["categories"] if c["supercategory"] == "animal"
}

Extract metadata and image filepaths

For the train and validation sets, the data we need for the image labels and the filepaths are under different headings in the annotations. We have to extract each out and combine them into a single annotation in subsequent steps.

[ ]:

train_annos = {}
for a in train_metadata["annotations"]:
    if a["category_id"] in category_labels:
        train_annos[a["image_id"]] = {"category_id": a["category_id"]}

train_images = {}
for i in train_metadata["images"]:
    train_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

val_annos = {}
for a in val_metadata["annotations"]:
    if a["category_id"] in category_labels:
        val_annos[a["image_id"]] = {"category_id": a["category_id"]}

val_images = {}
for i in val_metadata["images"]:
    val_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

Combine label and filepath info

Later in this series of guides we’ll make our own train, validation and test splits. For this reason we’ll combine the training and validation datasets together.

[ ]:

for id, anno in train_annos.items():
    anno.update(train_images[id])

for id, anno in val_annos.items():
    anno.update(val_images[id])

[ ]:

all_annos = {}
for k, v in train_annos.items():
    all_annos.update({k: v})
for k, v in val_annos.items():
    all_annos.update({k: v})
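
The join performed in the cells above can be seen on a toy example. The sketch below uses a made-up metadata dictionary shaped like the COCO annotation file (the URLs and file names are hypothetical, not real COCO entries). It also shows one consequence of keying on the image id: an image with several animal annotations keeps only the last category seen.

```python
# Toy metadata shaped like a COCO annotation file (all values are made up).
metadata = {
    "categories": [
        {"id": 16, "name": "bird", "supercategory": "animal"},
        {"id": 17, "name": "cat", "supercategory": "animal"},
        {"id": 62, "name": "chair", "supercategory": "furniture"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 16},
        {"image_id": 1, "category_id": 17},  # second animal in the same image
        {"image_id": 2, "category_id": 62},  # not an animal, will be skipped
    ],
    "images": [
        {"id": 1, "coco_url": "http://example.com/000001.jpg", "file_name": "000001.jpg"},
        {"id": 2, "coco_url": "http://example.com/000002.jpg", "file_name": "000002.jpg"},
    ],
}

# Keep only the animal categories, as in the notebook.
labels = {c["id"]: c["name"] for c in metadata["categories"] if c["supercategory"] == "animal"}
images = {i["id"]: {"coco_url": i["coco_url"], "file_name": i["file_name"]} for i in metadata["images"]}

annos = {}
for a in metadata["annotations"]:
    if a["category_id"] in labels:
        # Later annotations for the same image overwrite earlier ones.
        annos[a["image_id"]] = {"category_id": a["category_id"], **images[a["image_id"]]}

print(annos)
```

If you need one label per image to be chosen differently (for example, the largest object), the overwrite step is where that rule would go.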

Sample the dataset

In order to make working with the data easier, we’ll select 250 images from each class at random. To make sure you get the same set of images for each run of this notebook, we’ll also set NumPy’s random seed to 0. This is a small fraction of the dataset, but it demonstrates how using transfer learning can give you good results without needing very large datasets.

[ ]:

np.random.seed(0)

[ ]:

sample_annos = {}

for category_id in category_labels:
    subset = [k for k, v in all_annos.items() if v["category_id"] == category_id]
    sample = np.random.choice(subset, size=250, replace=False)
    for k in sample:
        sample_annos[k] = all_annos[k]
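
To see why a fixed seed makes the sample reproducible, here is a minimal sketch of the same per-class sampling idea. It swaps NumPy for the standard library's `random.Random` so that it runs without extra packages; the function name `sample_per_class` and the toy annotations are assumptions made for illustration.

```python
import random

def sample_per_class(annos, per_class, seed=0):
    """Pick `per_class` image ids at random from every category, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    picked = {}
    for category_id in sorted({v["category_id"] for v in annos.values()}):
        ids = sorted(k for k, v in annos.items() if v["category_id"] == category_id)
        for k in rng.sample(ids, per_class):  # sampling without replacement
            picked[k] = annos[k]
    return picked

# Toy annotations: 20 images alternating between two hypothetical categories.
toy_annos = {i: {"category_id": 16 + i % 2} for i in range(20)}
first = sample_per_class(toy_annos, per_class=3)
second = sample_per_class(toy_annos, per_class=3)
print(sorted(first))
print(first == second)
```

As with `np.random.choice(..., replace=False)`, asking for more images than a class contains raises a `ValueError`, which is a useful early warning that a class is too small.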

Create a download function

In order to parallelize downloading the images we must wrap the download and save process with a function for multi-threading with joblib.

[ ]:

def download_image(url, path):
    data = imread(url)
    imwrite(path / url.split("/")[-1], data)

Download the sample of the dataset (2,500 images, ~5min)

[ ]:

sample_dir = pathlib.Path("data_sample_2500")
sample_dir.mkdir(exist_ok=True)

[ ]:

with parallel_backend("threading", n_jobs=5):
    Parallel(verbose=3)(
        delayed(download_image)(a["coco_url"], sample_dir) for a in sample_annos.values()
    )

Combine with CIFAR-10 frog data

The COCO dataset doesn’t include any images of frogs, but let’s say our model must also be able to label images of frogs. To fix this we can download another dataset of images which includes frogs, sample 250 frog images and add them to our existing image data. We’ll use the CIFAR-10 dataset to achieve this. These images are much smaller (32x32) so they will appear pixelated and blurry when we increase their size to 224x224. As you’ll see, the CIFAR-10 dataset comes formatted in a very different manner from the COCO dataset. We must process the CIFAR-10 data into individual image files so that it’s consistent with our COCO images.

Download and extract the CIFAR-10 dataset

[ ]:

!wget https://www.cs.toronto.edu/%7Ekriz/cifar-10-python.tar.gz

[ ]:

# Use a context manager so the archive is closed after extraction
with tarfile.open("cifar-10-python.tar.gz") as tar:
    tar.extractall()

Open first batch of CIFAR-10 dataset

The CIFAR-10 dataset comes in five training batches and one test batch. Each training batch has 10,000 randomly ordered images. Since we only need 250 frog images for our dataset, just pulling from the first batch will suffice.

[ ]:

with open("./cifar-10-batches-py/data_batch_1", "rb") as f:
    batch_1 = pickle.load(f, encoding="bytes")

[ ]:

image_data = batch_1[b"data"]

Pull 250 sample frog images

[ ]:

frog_indices = np.array(batch_1[b"labels"]) == 6
sample_frog_indices = np.random.choice(frog_indices.nonzero()[0], size=250, replace=False)
sample_data = image_data[sample_frog_indices, :]
frog_images = sample_data.reshape(len(sample_data), 3, 32, 32).transpose(0, 2, 3, 1)
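
The reshape and transpose above rely on the CIFAR-10 storage layout: each row holds 3,072 values, with the 1,024 red values first, then green, then blue, each plane stored row by row. As a rough illustration of what the NumPy calls do, the pure-Python sketch below converts one such row into a height x width x channel structure. The helper name `flat_to_hwc` and the synthetic row are assumptions made for this example.

```python
def flat_to_hwc(row, size=32, channels=3):
    """Convert one channel-first flat row into a nested [height][width][channel] list."""
    plane = size * size  # number of values in one colour plane
    return [
        [[row[ch * plane + r * size + c] for ch in range(channels)] for c in range(size)]
        for r in range(size)
    ]

# Synthetic row: red plane all 10, green plane all 20, blue plane all 30.
row = [10] * 1024 + [20] * 1024 + [30] * 1024
image = flat_to_hwc(row)
print(len(image), len(image[0]), image[0][0])
```

Every pixel ends up with its three colour values side by side, which is the layout matplotlib and image writers expect.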

View frog images

[ ]:

fig, axs = plt.subplots(3, 4, figsize=(10, 7))
indices = np.random.randint(low=0, high=len(frog_images), size=12)  # high is exclusive

for i, ax in enumerate(axs.flatten()):
    ax.imshow(frog_images[indices[i]])
    ax.axis("off")

Write sample frog images to `data_sample_2500` directory

[ ]:

frog_filenames = np.array(batch_1[b"filenames"])[sample_frog_indices]

[ ]:

for idx, filename in enumerate(frog_filenames):
    filename = filename.decode()
    data = frog_images[idx]
    if filename.endswith(".png"):
        filename = filename.replace(".png", ".jpg")
    imwrite(sample_dir / filename, data)

[ ]:

sample_dir.rename("data_sample_2750")

Add frog annotations to `sample_annos`

[ ]:

category_labels[26] = "frog"

[ ]:

next_anno_idx = np.array(list(sample_annos.keys())).max() + 1

frog_anno_ids = range(next_anno_idx, next_anno_idx + len(frog_images))

[ ]:

for idx, frog_id in enumerate(frog_anno_ids):
    sample_annos[frog_id] = {
        "category_id": 26,
        "file_name": frog_filenames[idx].decode().replace(".png", ".jpg"),
    }

## Part 2: Structure the Dataset

In this section, you will properly structure your image files for ingestion by the model. Then, we will use Python to create the new folder structure and copy the files into the correct set and label folder.

Proper folder structure

Although most tools can accommodate data in any file structure with enough tinkering, it makes most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share to make data ingestion as smooth as possible. By default, most tools will look for image data in the file structure depicted below:

+-- train
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- val
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- test
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg

You will notice that the COCO dataset does not come structured like above, so we must use the annotation data to restructure its folders to match the pattern above. Once the new directory structure is created you can use your desired framework’s data loading tool to load your image data and define transformations for it. Many datasets may already be in this structure, in which case you can skip this part.

Make train, validation and test splits

We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training) and evaluate its performance at each epoch with the next 10% (validation) and we’ll give our model’s final accuracy results using the last 10% (test). It’s important that before we split the data we make sure to shuffle it randomly so that class distribution among splits is roughly proportional.

[ ]:

np.random.seed(0)
image_ids = sorted(list(sample_annos.keys()))
np.random.shuffle(image_ids)
first_80 = int(len(image_ids) * 0.8)
next_10 = int(len(image_ids) * 0.9)
train_ids, val_ids, test_ids = np.split(image_ids, [first_80, next_10])
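
To make the arithmetic of the split concrete, the sketch below applies the same cut points to 2,750 stand-in ids (2,500 COCO images plus 250 frogs). It uses the standard library instead of NumPy, and the helper name `split_ids` is an assumption made for illustration.

```python
import random

def split_ids(ids, seed=0):
    """Shuffle ids reproducibly and cut them into 80/10/10 train/val/test lists."""
    ids = sorted(ids)                   # fixed starting order before shuffling
    random.Random(seed).shuffle(ids)
    first_80 = int(len(ids) * 0.8)      # end of the training slice
    next_10 = int(len(ids) * 0.9)       # end of the validation slice
    return ids[:first_80], ids[first_80:next_10], ids[next_10:]

train_part, val_part, test_part = split_ids(range(2750))
print(len(train_part), len(val_part), len(test_part))
```

Because the three slices come from one shuffled list, no image can land in more than one split.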

Make new folder structure and copy image files

This new folder structure can then be read by data loaders for SageMaker’s built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.

[ ]:

unstruct_dir = Path("data_sample_2750")
struct_dir = Path("data_structured")
struct_dir.mkdir(exist_ok=True, parents=True)

for name, split in zip(["train", "val", "test"], [train_ids, val_ids, test_ids]):
    split_dir = struct_dir / name
    split_dir.mkdir(exist_ok=True)
    for image_id in tqdm(split):
        category_dir = split_dir / f'{category_labels[sample_annos[image_id]["category_id"]]}'
        category_dir.mkdir(exist_ok=True)
        source_path = (unstruct_dir / sample_annos[image_id]["file_name"]).as_posix()
        target_path = (category_dir / sample_annos[image_id]["file_name"]).as_posix()
        shutil.copy(source_path, target_path)
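
After copying, it can be worth checking that every split contains every class. The sketch below is one possible check, demonstrated on a throwaway directory of empty placeholder files; the helper name `count_images` is an assumption and not part of the original notebook.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_images(root):
    """Count files per (split, class) folder under `root`."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file():
            counts[(path.parent.parent.name, path.parent.name)] += 1
    return counts

# Demonstrate on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for split, n in [("train", 3), ("val", 1), ("test", 1)]:
        for cls in ["cat", "frog"]:
            folder = Path(tmp) / split / cls
            folder.mkdir(parents=True)
            for i in range(n):
                (folder / f"{i}.jpg").touch()
    counts = count_images(tmp)

for key in sorted(counts):
    print(key, counts[key])
```

On the real data you would call `count_images("data_structured")`; a class missing from the validation or test split suggests the sample was too small or the shuffle was skipped.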

## Part 3: Preprocess Images for PyTorch Models

In this section, you will create resizing and data augmentation transforms for training with PyTorch. You will also upload your dataset to S3 for training with SageMaker.

Dependencies

For this guide we’ll use the SageMaker Python SDK version 2.9.2. By default, SageMaker Notebooks come with version 1.72.0. Other guides provided by Amazon may be set up to work with other versions of the Python SDK so you may wish to roll-back to 1.72.0. We will also be using PyTorch 1.6.0 which can also be rolled back at the end of this guide to 1.4.0.

Update the SageMaker Python SDK and PyTorch

[ ]:

import sys
original_sagemaker_version = !conda list | grep -E "sagemaker\s" | awk '{print $2}'
original_pytorch_version = !conda list | grep -E "torch\s" | awk '{print $2}'
!{sys.executable} -m pip install -q "sagemaker==2.9.2" "torch==1.6.0" "torchvision"

[ ]:

import uuid
import boto3
import torch
import shutil
import pickle
import pathlib
import sagemaker
import numpy as np
from tqdm import tqdm
import torchvision as tv
import matplotlib.pyplot as plt

[ ]:

print(f"sagemaker updated  {original_sagemaker_version[0]} -> {sagemaker.__version__}")
print(f"pytorch   updated  {original_pytorch_version[0]} -> {torch.__version__}")

Define the Resize and Augmentation Transformations

Resize

Before going to the GPU for training, all image data must have the same dimensions for length, width and channel. Typically, algorithms use a square format so the length and width are the same, and many pre-made datasets already have the images nicely cropped into squares. However, most real-world datasets will begin with images in many different dimensions and ratios. In order to prep our dataset for training we will need to resize and crop the images if they aren’t already square.

This transformation is deceptively simple. If we want to keep the images from looking squished or stretched, we need to crop them to a square, and we want to make sure the important object in each image doesn’t get cropped out. Unfortunately, there is no easy way to make sure each crop is optimal, so we typically choose a center crop, which works well most of the time.

[ ]:

resize = tv.transforms.Compose(
    [tv.transforms.Resize(224), tv.transforms.CenterCrop(224), tv.transforms.ToTensor()]
)
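
To get a feel for what `Resize(224)` followed by `CenterCrop(224)` does to an image of a given size, the sketch below reproduces the geometry in plain Python: the shorter side is scaled to 224 with the aspect ratio kept, then a 224x224 window is taken from the middle. It is an approximation for illustration; the helper name is an assumption, and Torchvision's own rounding may differ by a pixel.

```python
def resize_then_center_crop(width, height, size=224):
    """Return the resized (width, height) and the crop box (left, top, right, bottom)."""
    # Scale so the shorter side equals `size`, keeping the aspect ratio.
    if width <= height:
        new_w, new_h = size, int(size * height / width)
    else:
        new_w, new_h = int(size * width / height), size
    # Take a size x size window from the middle of the resized image.
    left = round((new_w - size) / 2)
    top = round((new_h - size) / 2)
    return (new_w, new_h), (left, top, left + size, top + size)

print(resize_then_center_crop(640, 480))  # a landscape photo
print(resize_then_center_crop(32, 32))    # a CIFAR-10 image
```

For the landscape photo, a strip on the left and right is discarded, which is why an object near the edge of a wide image can be lost by a center crop.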

[ ]:

sample = tv.datasets.ImageFolder(root="data_structured/train", transform=tv.transforms.ToTensor())

sample_resized = tv.datasets.ImageFolder(root="data_structured/train", transform=resize)

[ ]:

sample = iter(sample)
sample_resized = iter(sample_resized)

Re-run the cell below to sample another image

[ ]:

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
image = next(iter(sample))[0]
image_resized = next(iter(sample_resized))[0]

ax[0].imshow(image.permute(1, 2, 0))
ax[0].axis("off")
ax[0].set_title(f"Before - {tuple(image.shape)}")
ax[1].imshow(image_resized.permute(1, 2, 0))
ax[1].axis("off")
ax[1].set_title(f"After - {tuple(image_resized.shape)}")
plt.tight_layout()

Augmentation

An easy way to improve training is to randomly augment the images to help our training algorithm generalize better. There are many augmentations to choose from, but keep in mind that the more we add to our augment function, the more processing will be required before we can send the image to the GPU for training. Also, it’s important to note that we don’t need to augment the validation data because we want to generate a prediction on the image as it normally would be presented.

[ ]:

augment = tv.transforms.Compose(
    [
        tv.transforms.RandomResizedCrop(224),
        tv.transforms.RandomHorizontalFlip(p=0.5),
        tv.transforms.RandomVerticalFlip(p=0.5),
        tv.transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
        tv.transforms.ToTensor(),
    ]
)

[ ]:

sample = tv.datasets.ImageFolder(root="data_structured/train", transform=tv.transforms.ToTensor())

sample_augmented = tv.datasets.ImageFolder(root="data_structured/train", transform=augment)

[ ]:

sample = iter(sample)
sample_augmented = iter(sample_augmented)

Re-run the cell below to sample another image

[ ]:

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
image = next(iter(sample))[0]
image_augmented = next(iter(sample_augmented))[0]

ax[0].imshow(image.permute(1, 2, 0))
ax[0].axis("off")
ax[0].set_title(f"Before - {tuple(image.shape)}")
ax[1].imshow(image_augmented.permute(1, 2, 0))
ax[1].axis("off")
ax[1].set_title(f"After - {tuple(image_augmented.shape)}")
plt.tight_layout()

A note on applying the transformations

The training dataset will get the resize and augment functions applied to it, but the validation dataset only gets resized because it’s not directly used for training. We will apply the transforms by passing them to the corresponding dataset with the transform kwarg. However, this doesn’t actually transform the images yet. Rather, each transformation is applied by the CPU right before the image gets sent to the GPU for training. This is nice because we can experiment quickly without having to wait for all the images to be transformed.

You may be wondering why we’re applying the transformations randomly. This is done because our training algorithm will cycle through the data in epochs. Each epoch it will get a chance to view the image again so instead of sending the same image through each time, we’ll apply a random augmentation. Ideally, we’d let the algorithm see all versions of the image each epoch, but this would scale the size of the training dataset by the number of augmentations. Scaling the data storage and training time by that factor isn’t worth the relatively minor changes introduced into the dataset.
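
The idea of transforming lazily and randomly can be sketched without PyTorch. The toy class below stores each image once and applies a random horizontal flip only when an item is read, so repeated passes over the data see different versions while storage stays the same. The class name and the tiny nested-list "image" are assumptions made for illustration.

```python
import random

class LazyAugmentedDataset:
    """Applies a random horizontal flip each time an item is read."""

    def __init__(self, images, p=0.5, seed=0):
        self.images = images          # stored once, never modified
        self.p = p                    # probability of flipping on each read
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.rng.random() < self.p:
            image = [row[::-1] for row in image]  # flip left-right, on a copy
        return image

# One tiny 2x3 "image" with distinct pixel values.
original = [[1, 2, 3], [4, 5, 6]]
dataset = LazyAugmentedDataset([original])

# Reading the same index over several "epochs" yields different versions.
seen = {tuple(map(tuple, dataset[0])) for _ in range(20)}
print(len(dataset), len(seen))
```

The dataset length never changes, yet the model is shown both orientations over time, which is the trade-off described above.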

More documentation on all the transforms supported directly by Torchvision is available in the Torchvision transforms documentation.

[ ]:

data_transforms = {
    "train": tv.transforms.Compose(
        [
            tv.transforms.RandomResizedCrop(224),
            tv.transforms.RandomHorizontalFlip(p=0.5),
            tv.transforms.RandomVerticalFlip(p=0.5),
            tv.transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
            tv.transforms.ToTensor(),
        ]
    ),
    "val": tv.transforms.Compose(
        [tv.transforms.Resize(224), tv.transforms.CenterCrop(224), tv.transforms.ToTensor()]
    ),
}

Create the PyTorch datasets and dataloaders

Datasets

Datasets in PyTorch keep track of all the data in your dataset: where to find them (their path), what class they belong to and what transformations they get. In this case, we’ll use Torchvision’s handy ImageFolder to easily generate the dataset from the directory structure created in Part 2.

[ ]:

data_dir = pathlib.Path("./data_structured")
splits = ["train", "val"]

datasets = {}
for s in splits:
    datasets[s] = tv.datasets.ImageFolder(root=data_dir / s, transform=data_transforms[s])

Dataloaders

Dataloaders structure how the images get sent to the CPU and GPU for training. They include important hyperparameters such as:

* batch_size: this tells the data loader how many images to send to the training algorithm at once for backpropagation. It therefore also controls the number of gradient updates that occur in one epoch for optimizers like SGD.
* shuffle: this randomizes the order of your training data.
* num_workers: this defines how many parallel processes load and transform images before they are sent to the GPU for training. Adding more workers can therefore speed up training. However, too many workers will slow training down due to the overhead of managing them. Also, each worker consumes a considerable amount of RAM (depending on batch_size), and you should not use more workers than the CPU cores available on the EC2 instance used for training.

[ ]:

batch_size = 4
shuffle = True
num_workers = 4

dataloaders = {}
for s in splits:
    dataloaders[s] = torch.utils.data.DataLoader(
        datasets[s], batch_size=batch_size, shuffle=shuffle, num_workers=num_workers
    )
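
The link between batch size and the number of gradient updates per epoch is simple arithmetic. The sketch below assumes 2,200 training images (80% of the 2,750 sampled) and the DataLoader default of keeping the final partial batch; the helper name is an assumption made for illustration.

```python
import math

def updates_per_epoch(num_images, batch_size):
    """Number of batches (and so optimizer steps) in one pass over the data."""
    return math.ceil(num_images / batch_size)

for size in (4, 64):
    print(size, updates_per_epoch(2200, size))
```

A batch size of 4 gives many small, noisy updates per epoch, while 64 gives far fewer but smoother ones, which is one reason batch size and learning rate are usually tuned together.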

Visualize the transforms

Just to make sure everything is working, we can apply the transformations to a few images and view them to make sure the output looks good. Simply re-run the cell to see a fresh batch of images.

[ ]:

rows = 3
cols = batch_size
fig, axs = plt.subplots(rows, cols, figsize=(10, 7))

for row in range(rows):
    batch = next(iter(dataloaders["train"]))
    images, labels = batch
    for col, image in enumerate(images):
        ax = axs[row, col]
        ax.imshow(image.permute(1, 2, 0))  # CHW -> HWC for matplotlib
        ax.axis("off")

plt.tight_layout()

With your datasets and dataloaders defined, you’re now ready to define the training architecture for your model.

Upload Data to S3

Resize images and save to disk

[ ]:

data_dir = pathlib.Path("./data_structured")
splits = ["train", "val", "test"]

datasets = {}
for s in splits:
    datasets[s] = tv.datasets.ImageFolder(
        root=data_dir / s,
        transform=tv.transforms.Compose([tv.transforms.Resize(224), tv.transforms.ToTensor()]),
    )

[ ]:

resized_path = pathlib.Path("./data_resized")
resized_path.mkdir(exist_ok=True)
for s in splits:
    split_path = resized_path / s
    split_path.mkdir(exist_ok=True)
    for idx, (img_tensor, label) in enumerate(tqdm(datasets[s])):
        label_path = split_path / f"{label:02}"
        label_path.mkdir(exist_ok=True)
        filename = datasets[s].imgs[idx][0].split("/")[-1]
        tv.utils.save_image(img_tensor, label_path / filename)

Upload resized images to S3

Get S3 bucket

[ ]:

bucket_name = sagemaker.Session().default_bucket()
prefix = "DEMO-sm-preprocess-train-image-data-pytorch-algo"
s3 = boto3.resource("s3")
region = sagemaker.Session().boto_region_name

Upload data to S3 (~3min)

[ ]:

s3_uploader = sagemaker.s3.S3Uploader()

for s in splits:
    data_s3_uri = s3_uploader.upload(
        local_path=(resized_path / s).as_posix(),
        desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/{s}",
    )

## Part 4: Train the PyTorch Model

In this section, you will use the SageMaker SDK to create a PyTorch Estimator and train it on a remote EC2 instance.

### Algorithm hyperparameters

Hyperparameters represent the tuning knobs for our algorithm which we set before training begins. Typically they are pre-set to defaults so if we don’t specify them we can still run the training algorithm, but they usually need tweaking to get optimal results. What these values should be depends entirely on the dataset. Unfortunately, there’s no formula to tell us what the best settings are; we just have to try them ourselves and see what we get, but there are best practices and tips to help guide us in choosing them.

Optimizer - The optimizer refers to the optimization algorithm being used to choose the best weights. For deep learning on image data, SGD or Adam is typically used.
Learning Rate - After each batch of training we update the model’s weights to give us the best possible results for that batch. The learning rate controls by how much we should update the weights. Best practices dictate a value between 0.001 and 0.2, typically never going higher than 1. The higher the learning rate, the faster your training will converge to the optimal weights, but going too fast can lead you to overshoot the target. In this example, we’re using the weights from a pre-trained model so we’d want to start with a lower learning rate because the weights have already been optimized and we don’t want to move too far away from them.
Epochs - An epoch refers to one cycle through the training set, and having more epochs to train means having more opportunities to improve accuracy. Suitable values range from 5 to 25 epochs depending on your time and budget constraints. Ideally, the right number of epochs is right before your validation accuracy plateaus.
Batch Size - Training on batches reduces the amount of data you need to hold in RAM and can speed up the training algorithm. For these reasons the training data is nearly always batched. The optimal batch size will depend on the dataset, how large the images are and how much RAM the training computer has. For a dataset like ours reasonable values would be between 8 and 64 images per batch.
Criterion - This is the type of loss function that will be used by the optimizer to update the model’s weights during training. For training on a dataset with more than two classes, the most common loss function is Cross-Entropy Loss.
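
The overshooting effect of a large learning rate can be seen on a one-parameter toy problem. The sketch below minimizes f(w) = w² with plain gradient descent; it is an illustration only and has nothing to do with the model trained in this notebook.

```python
def gradient_descent(lr, start=1.0, steps=10):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # the learning rate scales the update
    return w

small = gradient_descent(lr=0.1)
large = gradient_descent(lr=1.1)
print(abs(small), abs(large))
```

With the small learning rate, w shrinks toward the minimum at 0 on every step. With the large one, each update jumps past the minimum to a point farther away than where it started, so the error grows instead of shrinking.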

Review the training script

The training function

Unlike other frameworks, PyTorch doesn’t use model objects with a .fit() method to train them. Instead the user must define their own training function. This adds more code to our training script, but offers more transparency for customizing and debugging the model training. This is one major reason why researchers enjoy using PyTorch. In this example we use the training function defined in the PyTorch tutorial for transfer learning: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 12,78p

Execution safety

For safety we wrap the training code in this standard if statement, though it is not strictly required.

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 81p

Parse argument variables

These argument variables are passed via the hyperparameter argument for the estimator configuration.

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 83,90p
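
SageMaker passes each entry of the estimator's hyperparameters dictionary to the entry-point script as a command-line argument, for example `--epochs 10`. The sketch below shows how a script might parse the hyperparameters used later in this part. It is not a copy of pytorch_train.py: the default values, the `--train-dir`, `--val-dir` and `--model-dir` argument names, and the local fallback paths are all assumptions made for illustration.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters: names match the keys of the `hyperparameters` dict.
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch-size", type=int, default=4)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--workers", type=int, default=0)
    # Data and output locations: SageMaker sets these environment variables
    # inside the training container; the fallbacks are for local runs only.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "data_resized/train"))
    parser.add_argument("--val-dir", default=os.environ.get("SM_CHANNEL_VAL", "data_resized/val"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    return parser.parse_args(argv)

# Simulate the arguments SageMaker would build from the hyperparameters dict.
args = parse_args(["--epochs", "10", "--batch-size", "64", "--learning-rate", "0.001", "--workers", "4"])
print(args.epochs, args.batch_size, args.learning_rate, args.workers)
```

Note that argparse turns the dash in `--batch-size` into an underscore, so the value is read as `args.batch_size`.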

Define data transformations and load data

These are the transformations from Part 3. Since the data was resized before it was saved to S3, we don’t need to do any resizing except for random cropping of the training dataset and center cropping of the validation dataset.

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 92,127p

Detect device and create and modify the base model

The base model for this guide is a ResNet18 model using pre-trained weights. We need to modify the base model by replacing its final fully connected layer with a new one sized for our 11 animal classes. The model is then loaded for the device (GPU or CPU) that our EC2 instance is using.

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 128,140p

Define loss criterion, optimization algorithm and train the model

The weights for the epoch with the best accuracy are saved so we can load the model after training and make predictions on our test data.

[ ]:

!pygmentize "training_pytorch/pytorch_train.py" | sed -n 142,150p
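
The "keep the best epoch" pattern can be sketched independently of PyTorch. The example below assumes the script follows the approach of the PyTorch transfer learning tutorial, which takes a deep copy of the model state whenever validation accuracy improves. Here a plain dictionary stands in for the model state, and the function names and accuracy values are made up for illustration.

```python
import copy

def train_keep_best(state, epochs, run_epoch):
    """Run `epochs` epochs and return the state with the best validation accuracy."""
    best_acc, best_state = float("-inf"), copy.deepcopy(state)
    for epoch in range(epochs):
        val_acc = run_epoch(state, epoch)  # trains in place, returns validation accuracy
        if val_acc > best_acc:
            best_acc = val_acc
            best_state = copy.deepcopy(state)  # snapshot, unaffected by later epochs
    return best_state, best_acc

# Stand-in for real training: records the epoch and replays fixed accuracies.
fake_accuracies = [0.60, 0.82, 0.79]

def fake_epoch(state, epoch):
    state["last_epoch"] = epoch
    return fake_accuracies[epoch]

best_state, best_acc = train_keep_best({"last_epoch": None}, 3, fake_epoch)
print(best_state, best_acc)
```

The deep copy matters: without it, the "best" state would be a reference to the live state and would silently follow the weights of later, worse epochs.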

Estimator configuration

These define the resources to use for training and how they are configured. Here are some important ones to single out:

entry_point (str) – Path (absolute or relative) to the Python source file which should be executed as the entry point to training. If source_dir is specified, then entry_point must point to a file located at the root of source_dir.
framework_version (str) – PyTorch version you want to use for executing your model training code. Defaults to None. Required unless image_uri is provided. List of supported versions: https://github.com/aws/sagemaker-python-sdk#pytorch-sagemaker-estimators.
py_version (str) – Python version you want to use for executing your model training code. One of ‘py2’ or ‘py3’. Defaults to None. Required unless image_uri is provided.
source_dir (str) – Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. The structure within this directory is preserved when training on Amazon SageMaker.
dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. If ‘git_config’ is provided, ‘dependencies’ should be a list of relative locations to directories with any additional libraries needed in the Git repo.
git_config (dict[str, str]) – Git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password and token. The repo field is required. All other fields are optional. repo specifies the Git repository where your training script is stored. If you don’t provide branch, the default value ‘master’ is used. If you don’t provide commit, the latest commit in the specified branch is used.
role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_count (int) – Number of Amazon EC2 instances to use for training.
instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
volume_size (int) – Size in GB of the EBS volume to use for storing input data during training (default: 30). Must be large enough to store training data if File Mode is used (which is the default).
model_uri (str) – URI where a pre-trained model is stored, either locally or in S3 (default: None). If specified, the estimator will create a channel pointing to the model so the training job can download it. This model can be a ‘model.tar.gz’ from a previous training job, or other artifacts coming from a different source. In local mode, this should point to the path in which the model is located and not the file itself, as local Docker containers will try to mount the URI as a volume.
output_path (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution. file:// urls are used for local mode. For example: ‘file://model/’ will save to the model folder in the current directory.

Training on an EC2 instance

Now that we’ve worked out any bugs in our training script, we can send the training job to an EC2 instance with a GPU, using a larger batch size, number of workers and number of epochs.

Define the hyperparameters for EC2 training

[ ]:

hyperparameters = {"epochs": 10, "batch-size": 64, "learning-rate": 0.001, "workers": 4}

Define the estimator configuration for EC2 training

[ ]:

estimator_config = {
    "entry_point": "pytorch_train.py",
    "source_dir": "training_pytorch",
    "framework_version": "1.6.0",
    "py_version": "py3",
    "instance_type": "ml.p3.2xlarge",
    "instance_count": 1,
    "role": sagemaker.get_execution_role(),
    "output_path": f"s3://{bucket_name}/{prefix}",
    "hyperparameters": hyperparameters,
}

Create the estimator configured for EC2 training

[ ]:

pytorch_estimator = PyTorch(**estimator_config)

Define the data channels using the proper S3 URIs

[ ]:

data_channels = {
    "train": f"s3://{bucket_name}/{prefix}/data/train",
    "val": f"s3://{bucket_name}/{prefix}/data/val",
}

[ ]:

pytorch_estimator.fit(data_channels)

Load the Trained Model and Predict

After training the model and saving its parameters (weights) to S3, we can retrieve the parameters and load them back into PyTorch to generate predictions.

Download the trained weights from S3

[ ]:

sagemaker.s3.S3Downloader().download(pytorch_estimator.model_data, "training_pytorch")
# Use a context manager so the archive is closed after extraction
with tarfile.open("training_pytorch/model.tar.gz") as tar:
    tar.extractall("training_pytorch")

Load the weights back into a PyTorch model

Since the model was trained on a GPU we need to use the map_location=torch.device('cpu') kwarg to load the model on a CPU backed notebook instance.

[ ]:

model = tv.models.resnet18()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 11)
model.load_state_dict(torch.load("training_pytorch/model.pt", map_location=torch.device("cpu")))
model.eval();

Link the model predictions (0 to 10) back to original class names (bear to zebra)

To map the index number back to the category label, we need to use the category labels created in Part 1 (Download the Dataset).

[ ]:

category_labels = {idx: name for idx, name in enumerate(sorted(category_labels.values()))}
category_labels
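
The remapping works because ImageFolder sorts the class folder names and numbers them from 0, and sorting the category names here reproduces that order. The sketch below spells this out for the ten COCO animal classes plus frog; the variable names are assumptions made for illustration.

```python
# The ten COCO animal classes plus the frog class added from CIFAR-10.
animal_names = ["bird", "cat", "dog", "horse", "sheep", "cow",
                "elephant", "bear", "zebra", "giraffe", "frog"]

# Sorting the names and numbering them from 0 mirrors how ImageFolder
# assigned class indices to the folders in data_structured.
index_to_name = dict(enumerate(sorted(animal_names)))
name_to_index = {name: idx for idx, name in index_to_name.items()}

print(index_to_name)
```

The resized folders were named with zero-padded indices ("00" to "10"), which sort in the same order, so the indices the model predicts line up with this mapping.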

Load test images for predictions

[ ]:

test_ds = tv.datasets.ImageFolder(
    root="data_resized/test",
    # Crop to 224x224, the input size used throughout this notebook
    transform=tv.transforms.Compose([tv.transforms.CenterCrop(224), tv.transforms.ToTensor()]),
)

test_ds = torch.utils.data.DataLoader(test_ds, batch_size=4, shuffle=True)

Show test images with model predictions

[ ]:

rows = 3
cols = 4
fig, axs = plt.subplots(rows, cols, figsize=(10, 7))

for row in range(rows):
    batch = next(iter(test_ds))
    images, labels = batch
    _, preds = torch.max(model(images), 1)
    preds = preds.numpy()
    for col, image in enumerate(images):
        ax = axs[row, col]
        ax.imshow(image.permute(1, 2, 0))
        ax.axis("off")
        ax.set_title(f"predicted: {category_labels[preds[col]]}")

plt.tight_layout()
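
Looking at a few predictions is a useful sanity check, but the test split is meant for a final accuracy figure. A minimal sketch of that calculation is below; the helper name and the example class indices are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return correct / len(labels)

# Made-up class indices for one batch of four images.
print(accuracy([0, 6, 3, 10], [0, 6, 2, 10]))
```

To score the whole test set, you would loop over the data loader once with `for images, labels in test_ds:`, collect `preds.tolist()` and `labels.tolist()` from every batch, and pass the two combined lists to this function.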
