Download, Structure, and Preprocess Image Data for TensorFlow Models

Notes:

  • This notebook should be used with the conda_pytorch_latest_p36 kernel.

  • You can also explore image preprocessing with PyTorch and SageMaker Built-in Algorithms by running Download, Structure, and Preprocess Image Data for PyTorch Models and Download, Structure, and Preprocess Image Data for SageMaker Built-In Algorithms, respectively.

The main purpose of this notebook is to demonstrate how you can preprocess image data to train TensorFlow models.

Contents

  1. Part 1: Download the Dataset

  2. Part 2: Structure the Dataset

  3. Part 3: Preprocess Images for TensorFlow Models

  4. Part 4: Train the TensorFlow Model

## Part 1: Download the Dataset


In this section, you will use a dataset manifest to download animal images from the COCO dataset for all ten animal classes. You will then download frog images from the CIFAR dataset and add them to your COCO animal images. In order to simulate coming to SageMaker with your own dataset, we will keep the data in an unstructured form until Part 2, where you will learn the best practices for structuring an image dataset.

[ ]:
! pip install imageio
[ ]:
import json
import pickle
import shutil
import urllib.request
import pathlib
import tarfile
from tqdm import tqdm
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from imageio import imread, imwrite
from joblib import Parallel, delayed, parallel_backend
from sagemaker.tensorflow import TensorFlow

The COCO and CIFAR Datasets


For this series of notebooks we will be sampling images from the COCO dataset and CIFAR-10 dataset (before beginning the notebooks in this series, it’s a good idea to browse each dataset website to familiarize yourself with the data). Both are datasets of images, but they come formatted very differently. The COCO dataset contains images from Flickr that represent a real-world dataset which isn’t formatted or resized specifically for deep learning. This makes it a good dataset for this guide because we want it to be as comprehensive as possible. The CIFAR-10 images, on the other hand, are preprocessed specifically for deep learning as they come cropped, resized and vectorized (i.e. not in a readable image format). This notebook will show you how to work with both types of datasets.

Download the annotations


The dataset annotation file contains info on each image in the dataset such as the class, superclass, file name and url to download the file. Just the annotations for the COCO dataset are about 242MB.

[ ]:
anno_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
urllib.request.urlretrieve(anno_url, "coco-annotations.zip");
[ ]:
shutil.unpack_archive("coco-annotations.zip")

Load the annotations into Python

The training and validation annotations come in separate files.

[ ]:
with open("annotations/instances_train2017.json", "r") as f:
    train_metadata = json.load(f)

with open("annotations/instances_val2017.json", "r") as f:
    val_metadata = json.load(f)

Extract only the animal annotations


To limit the scope of the dataset for this guide, we’re only using the images of animals in the COCO dataset.

[ ]:
category_labels = {
    c["id"]: c["name"] for c in train_metadata["categories"] if c["supercategory"] == "animal"
}

Extract metadata and image filepaths

For the train and validation sets, the data we need for the image labels and the filepaths are under different headings in the annotations. We have to extract each out and combine them into a single annotation in subsequent steps.

[ ]:
train_annos = {}
for a in train_metadata["annotations"]:
    if a["category_id"] in category_labels:
        train_annos[a["image_id"]] = {"category_id": a["category_id"]}

train_images = {}
for i in train_metadata["images"]:
    train_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

val_annos = {}
for a in val_metadata["annotations"]:
    if a["category_id"] in category_labels:
        val_annos[a["image_id"]] = {"category_id": a["category_id"]}

val_images = {}
for i in val_metadata["images"]:
    val_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

Combine label and filepath info

Later in this series of guides we’ll make our own train, validation and test splits. For this reason we’ll combine the training and validation datasets together.

[ ]:
for id, anno in train_annos.items():
    anno.update(train_images[id])

for id, anno in val_annos.items():
    anno.update(val_images[id])
[ ]:
all_annos = {}
for k, v in train_annos.items():
    all_annos.update({k: v})
for k, v in val_annos.items():
    all_annos.update({k: v})
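To make the two-step join concrete, here is a minimal sketch on a hand-made, COCO-style dictionary. The records, file names, and URLs are invented for illustration; only the key names (`categories`, `annotations`, `images`, `category_id`, `image_id`, `coco_url`, `file_name`) mirror the fields used above.

```python
# Toy metadata shaped like a COCO annotation file (all values are made up).
toy_metadata = {
    "categories": [
        {"id": 17, "name": "cat", "supercategory": "animal"},
        {"id": 18, "name": "dog", "supercategory": "animal"},
        {"id": 3, "name": "car", "supercategory": "vehicle"},
    ],
    "annotations": [
        {"image_id": 100, "category_id": 17},
        {"image_id": 101, "category_id": 3},   # not an animal, so it is filtered out
        {"image_id": 102, "category_id": 18},
        {"image_id": 102, "category_id": 17},  # second animal in the same image
    ],
    "images": [
        {"id": 100, "file_name": "000100.jpg", "coco_url": "http://example.com/000100.jpg"},
        {"id": 101, "file_name": "000101.jpg", "coco_url": "http://example.com/000101.jpg"},
        {"id": 102, "file_name": "000102.jpg", "coco_url": "http://example.com/000102.jpg"},
    ],
}

animal_labels = {
    c["id"]: c["name"] for c in toy_metadata["categories"] if c["supercategory"] == "animal"
}

# Index image records by id so each annotation can look up its file info directly.
images_by_id = {i["id"]: i for i in toy_metadata["images"]}

toy_annos = {}
for a in toy_metadata["annotations"]:
    if a["category_id"] in animal_labels:
        image = images_by_id[a["image_id"]]
        toy_annos[a["image_id"]] = {
            "category_id": a["category_id"],
            "coco_url": image["coco_url"],
            "file_name": image["file_name"],
        }

for image_id, anno in sorted(toy_annos.items()):
    print(image_id, animal_labels[anno["category_id"]], anno["file_name"])
```

Because the result is keyed by image id, an image with several animal annotations keeps only the last one seen (image 102 ends up labeled "cat" here). The loops in this notebook behave the same way, which is acceptable for single-label classification but worth knowing about.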

Sample the dataset


In order to make working with the data easier, we’ll select 250 images from each class at random. To make sure you get the same set of images on each run, we’ll also set NumPy’s random seed to 0. This is a small fraction of the dataset, but it demonstrates how using transfer learning can give you good results without needing very large datasets.

[ ]:
np.random.seed(0)
[ ]:
sample_annos = {}

for category_id in category_labels:
    subset = [k for k, v in all_annos.items() if v["category_id"] == category_id]
    sample = np.random.choice(subset, size=250, replace=False)
    for k in sample:
        sample_annos[k] = all_annos[k]
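The sketch below shows the same per-class sampling idea on toy data, to illustrate why the seed matters. The `sample_per_class` helper and the toy annotations are hypothetical. It uses a local `RandomState` rather than the global seed, and it caps the sample size because `replace=False` raises an error when a class has fewer images than requested.

```python
from collections import Counter

import numpy as np

# Hypothetical toy annotations: image id -> category id (three classes, ten images each).
toy_annos = {i: {"category_id": i % 3} for i in range(30)}


def sample_per_class(annos, per_class, seed):
    """Draw up to `per_class` image ids from every category, reproducibly."""
    rng = np.random.RandomState(seed)  # local generator, global NumPy state is untouched
    picked = {}
    for category_id in sorted({v["category_id"] for v in annos.values()}):
        subset = sorted(k for k, v in annos.items() if v["category_id"] == category_id)
        size = min(per_class, len(subset))  # guard against classes that are too small
        for k in rng.choice(subset, size=size, replace=False):
            picked[int(k)] = annos[int(k)]
    return picked


first = sample_per_class(toy_annos, per_class=4, seed=0)
second = sample_per_class(toy_annos, per_class=4, seed=0)

print(len(first))                       # 12
print(sorted(first) == sorted(second))  # True
print(Counter(v["category_id"] for v in first.values()))
```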

Create a download function

In order to parallelize downloading the images we must wrap the download and save process with a function for multi-threading with joblib.

[ ]:
def download_image(url, path):
    data = imread(url)
    imwrite(path / url.split("/")[-1], data)

Download the sample of the dataset (2,500 images, ~5min)

[ ]:
sample_dir = pathlib.Path("data_sample_2500")
sample_dir.mkdir(exist_ok=True)
[ ]:
with parallel_backend("threading", n_jobs=5):
    Parallel(verbose=3)(
        delayed(download_image)(a["coco_url"], sample_dir) for a in sample_annos.values()
    )
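With thousands of URLs, an occasional failed request is likely, and an unhandled exception in one worker would interrupt the whole batch. The following is a sketch of one way to collect failures instead, using the standard library's `ThreadPoolExecutor` rather than joblib. `fake_fetch` and `safe_download` are hypothetical stand-ins so the example runs offline; in practice the fetch step would be the `imread`/`imwrite` pair from `download_image`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a real download (hypothetical); fails for one URL on purpose."""
    if "broken" in url:
        raise IOError(f"could not fetch {url}")
    return b"image-bytes-for-" + url.encode()


def safe_download(url, fetch):
    """Return (url, data, error) instead of raising, so one bad URL cannot stop the batch."""
    try:
        return url, fetch(url), None
    except Exception as exc:
        return url, None, exc


urls = [
    "http://example.com/a.jpg",
    "http://example.com/broken.jpg",
    "http://example.com/c.jpg",
]

# Threads suit this job because the work is dominated by waiting on the network.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda u: safe_download(u, fake_fetch), urls))

succeeded = [url for url, data, err in results if err is None]
failed = [url for url, data, err in results if err is not None]
print(f"{len(succeeded)} downloaded, {len(failed)} failed")
```

The `failed` list can then be retried or removed from the annotations so that later steps do not look for files that were never written.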

Combine with CIFAR-10 frog data


The COCO dataset doesn’t include any images of frogs, but let’s say our model must also be able to label images of frogs. To fix this we can download another dataset of images which includes frogs, sample 250 frog images and add them to our existing image data. We’ll use the CIFAR-10 dataset to achieve this. These images are much smaller (32x32) so they will appear pixelated and blurry when we increase their size to 244x244. As you’ll see, the CIFAR-10 dataset comes formatted in a very different manner from the COCO dataset. We must process the CIFAR-10 data into individual image files so that it’s consistent with our COCO images.

Download and extract the CIFAR-10 dataset

[ ]:
!wget https://www.cs.toronto.edu/%7Ekriz/cifar-10-python.tar.gz
[ ]:
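# Use a context manager so the archive is closed, and avoid the name `tf`,
# which is used for the TensorFlow module later in this notebook.
with tarfile.open("cifar-10-python.tar.gz") as cifar_tar:
    cifar_tar.extractall()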

Open first batch of CIFAR-10 dataset

The CIFAR-10 dataset comes in five training batches and one test batch. Each training batch has 10,000 randomly ordered images. Since we only need 250 frog images for our dataset, just pulling from the first batch will suffice.

[ ]:
with open("./cifar-10-batches-py/data_batch_1", "rb") as f:
    batch_1 = pickle.load(f, encoding="bytes")
[ ]:
image_data = batch_1[b"data"]

Pull 250 sample frog images

[ ]:
frog_indices = np.array(batch_1[b"labels"]) == 6
sample_frog_indices = np.random.choice(frog_indices.nonzero()[0], size=250, replace=False)
sample_data = image_data[sample_frog_indices, :]
frog_images = sample_data.reshape(len(sample_data), 3, 32, 32).transpose(0, 2, 3, 1)
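The reshape and transpose above rely on how CIFAR-10 stores each image: one row of 3,072 values, with 1,024 red values first, then 1,024 green, then 1,024 blue, each in row-major order. The sketch below checks that logic on a synthetic row (the pixel values 10, 20, and 30 are made up) so the channel ordering is easy to verify by eye.

```python
import numpy as np

# Build one synthetic CIFAR-style row: 1024 red, then 1024 green, then 1024 blue values.
red = np.full(1024, 10, dtype=np.uint8)
green = np.full(1024, 20, dtype=np.uint8)
blue = np.full(1024, 30, dtype=np.uint8)
row = np.concatenate([red, green, blue])  # shape (3072,)
batch = row[np.newaxis, :]                # shape (1, 3072), like a one-image batch

# (N, 3072) -> (N, channel, height, width) -> (N, height, width, channel)
images = batch.reshape(len(batch), 3, 32, 32).transpose(0, 2, 3, 1)

print(images.shape)     # (1, 32, 32, 3)
print(images[0, 0, 0])  # [10 20 30]
```

Reshaping directly to `(N, 32, 32, 3)` without the transpose would run without error but would interleave the channels incorrectly, which is why the intermediate channel-first shape is needed.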

View frog images

[ ]:
fig, axs = plt.subplots(3, 4, figsize=(10, 7))
indices = np.random.randint(low=0, high=len(frog_images), size=12)  # high is exclusive

for i, ax in enumerate(axs.flatten()):
    ax.imshow(frog_images[indices[i]])
    ax.axis("off")

Write sample frog images to data_sample_2500 directory

[ ]:
frog_filenames = np.array(batch_1[b"filenames"])[sample_frog_indices]
[ ]:
for idx, filename in enumerate(frog_filenames):
    filename = filename.decode()
    data = frog_images[idx]
    if filename.endswith(".png"):
        filename = filename.replace(".png", ".jpg")
    imwrite(sample_dir / filename, data)
[ ]:
sample_dir.rename("data_sample_2750")

Add frog annotations to sample_annos

[ ]:
category_labels[26] = "frog"
[ ]:
next_anno_idx = np.array(list(sample_annos.keys())).max() + 1

frog_anno_ids = range(next_anno_idx, next_anno_idx + len(frog_images))
[ ]:
for idx, frog_id in enumerate(frog_anno_ids):
    sample_annos[frog_id] = {
        "category_id": 26,
        "file_name": frog_filenames[idx].decode().replace(".png", ".jpg"),
    }

## Part 2: Structure the Dataset


In this section, you will properly structure your image files for ingestion by the model. Then, we will use Python to create the new folder structure and copy the files into the correct set and label folder.

Proper folder structure


Although most tools can accommodate data in any file structure with enough tinkering, it makes most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share to make data ingestion as smooth as possible. By default, most tools will look for image data in the file structure depicted below:

+-- train
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- val
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- test
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg

You will notice that the COCO dataset does not come structured like above so we must use the annotation data to help restructure the folders of the COCO dataset so they match the pattern above. Once the new directory structures are created you can use your desired framework’s data loading tool to gracefully load and define transformation for your image data. Many datasets may already be in this structure in which case you can skip this guide.
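Once a dataset is in this layout, a quick count per split and class is a cheap sanity check before training. The sketch below builds a tiny throwaway tree in a temporary directory (the class names and file counts are made up) and tallies the `.jpg` files. The `count_images` helper is hypothetical and could be pointed at any folder that follows the pattern above.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images(root):
    """Count .jpg files per (split, class) under a split/class/file folder tree."""
    counts = Counter()
    for path in Path(root).glob("*/*/*.jpg"):
        split, label = path.parts[-3], path.parts[-2]
        counts[(split, label)] += 1
    return counts


# Build a tiny throwaway tree to demonstrate (folder and file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for split, label, n in [("train", "cat", 3), ("train", "dog", 2), ("val", "cat", 1)]:
        class_dir = Path(tmp) / split / label
        class_dir.mkdir(parents=True)
        for i in range(n):
            (class_dir / f"{i}.jpg").touch()
    counts = count_images(tmp)

for (split, label), n in sorted(counts.items()):
    print(split, label, n)
```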

Make train, validation and test splits


We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training) and evaluate its performance at each epoch with the next 10% (validation) and we’ll give our model’s final accuracy results using the last 10% (test). It’s important that before we split the data we make sure to shuffle it randomly so that class distribution among splits is roughly proportional.

[ ]:
np.random.seed(0)
image_ids = sorted(list(sample_annos.keys()))
np.random.shuffle(image_ids)
first_80 = int(len(image_ids) * 0.8)
next_10 = int(len(image_ids) * 0.9)
train_ids, val_ids, test_ids = np.split(image_ids, [first_80, next_10])
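As a rough check of the arithmetic, the sketch below performs the same shuffle-then-cut split with the standard library on 2,750 placeholder ids. The `split_ids` helper is hypothetical, and it uses integer percentages to sidestep floating-point rounding at the cut points.

```python
import random


def split_ids(ids, train_pct=80, val_pct=10, seed=0):
    """Shuffle ids reproducibly, then cut them into train/val/test lists."""
    ids = sorted(ids)                 # fixed starting order, independent of dict ordering
    random.Random(seed).shuffle(ids)  # local generator, global random state is untouched
    n = len(ids)
    first = n * train_pct // 100
    second = n * (train_pct + val_pct) // 100
    return ids[:first], ids[first:second], ids[second:]


train, val, test = split_ids(range(2750))
print(len(train), len(val), len(test))  # 2200 275 275

# Every id lands in exactly one split.
print(len(set(train) | set(val) | set(test)))  # 2750
```

A plain shuffle only keeps the class distribution roughly proportional across splits. If exact proportions matter, a stratified split (shuffling and cutting within each class) is a common alternative.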

Make new folder structure and copy image files


This new folder structure can then be read by data loaders for SageMaker’s built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.

[ ]:
unstruct_dir = Path("data_sample_2750")
struct_dir = Path("data_structured")
struct_dir.mkdir(exist_ok=True, parents=True)

for name, split in zip(["train", "val", "test"], [train_ids, val_ids, test_ids]):
    split_dir = struct_dir / name
    split_dir.mkdir(exist_ok=True)
    for image_id in tqdm(split):
        category_dir = split_dir / f'{category_labels[sample_annos[image_id]["category_id"]]}'
        category_dir.mkdir(exist_ok=True)
        source_path = (unstruct_dir / sample_annos[image_id]["file_name"]).as_posix()
        target_path = (category_dir / sample_annos[image_id]["file_name"]).as_posix()
        shutil.copy(source_path, target_path)

## Part 3: Preprocess Images for TensorFlow Models


In this section, you will create resizing and data augmentation transforms for training with the TensorFlow framework. You will also convert your data to TensorFlow’s TFRecord format for the most efficient training.

Dependencies


For this guide we’ll use the SageMaker Python SDK version 2.9.2. By default, SageMaker Notebooks come with version 1.72.0. Other guides provided by Amazon may be set up to work with other versions of the Python SDK so you may wish to roll-back to 1.72.0. In addition to updating the SageMaker SDK we’ll also update TensorFlow to 2.3.1 and install TensorFlow Datasets.

We will also debug our code by training on the instance running this notebook (Local Mode). In order to run through one epoch of training in a reasonable amount of time, we advise using a notebook backed by a p2.xlarge instance. Once your script has run completely locally and all bugs have been ironed out, you can switch back to a smaller instance.

Update SageMaker Python SDK and TensorFlow

[ ]:
import sys
original_sagemaker_version = !pip list | grep -E "sagemaker\s" | awk '{print $2}'
original_tensorflow_version = !pip list | grep -E "tensorflow\s" | awk '{print $2}'
!{sys.executable} -m pip install -q "sagemaker==2.9.2" "tensorflow-serving-api==2.3.0" "tensorflow==2.3.1" "tensorflow-datasets"
[ ]:
import uuid
import pickle
import numpy as np
import sagemaker
import boto3
from tqdm import tqdm
import tensorflow as tf
import pathlib
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
[ ]:
print(f"sagemaker  updated  {original_sagemaker_version[0]} -> {sagemaker.__version__}")
print(f"tensorflow updated  {original_tensorflow_version[0]} -> {tf.__version__}")

Loading data with TensorFlow Datasets


TensorFlow Datasets is a helpful module for getting your data ready for use with TensorFlow and Keras by generating a wrapper for the dataset and each record in it. This wrapper has methods which allow you to easily control sharding, batch size, and prefetching as well as data transformations and augmentations. TensorFlow Datasets can also import many external datasets from the internet which come already structured and annotated. However, for this guide we’ll assume that your dataset isn’t perfectly organized from the get-go.

Create the ImageFolder builder

tfds.ImageFolder is a pre-made builder for reading image data in the common folder structure we created previously.

[ ]:
image_folder = tfds.ImageFolder("./data_structured")
[ ]:
image_folder.info

Now that your image data is cataloged, you can generate a TensorFlow dataset for training and validation. These datasets are very flexible and can be used for processing, augmentation and training with just TensorFlow or with Keras as well.

TensorFlow resizing and augmentations


In this step we create separate datasets for training and validation then define the necessary transformations required before our algorithm can train on the data. We will also define image augmentations which allow us to get the most out of the data we have and improve training effectiveness.

Create training and validation datasets

The .as_dataset() method is a convenient way of generating the (image, label) tuples required by the training algorithm:

  • split - designates the data split for this dataset

  • shuffle_files - mixes the order of the files

  • as_supervised - discards any metadata, keeping just the (image, label) tuple

[ ]:
train_ds = image_folder.as_dataset(split=["train"], shuffle_files=True, as_supervised=True)[0]

# create a sample which is easy to iterate through for example purposes
sample_ds = train_ds.take(100).as_numpy_iterator()

Define resize transformation

Before going to the GPU for training, all image data must have the same dimensions for length, width and channel. Typically, algorithms use a square format so the length and width are the same, and many pre-made datasets already have the images nicely cropped into squares. However, most real-world datasets will begin with images in many different dimensions and ratios. In order to prep our dataset for training we need to resize and crop the images if they aren’t already square.

This transformation is deceptively simple: if we want to keep the images from looking squished or stretched, we need to crop them to a square, and we want to make sure the important object in each image doesn’t get cropped out. Unfortunately, there is no easy way to make sure each crop is optimal, so we typically choose a center crop, which works well most of the time.

[ ]:
def resize(image, label):
    image = tf.image.resize(image, (400, 400), preserve_aspect_ratio=True)
    image = tf.image.resize_with_crop_or_pad(image, 244, 244)
    return (image, label)
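To see what the two steps in `resize` do to images of different shapes, the following sketch reproduces only the size arithmetic in plain Python. The helper names are hypothetical and the input sizes are examples. The rounding is an assumption about how the aspect-preserving resize computes its output size, so results may differ from TensorFlow by a pixel in edge cases.

```python
def fit_within(height, width, max_height=400, max_width=400):
    """Scale (height, width) to fit inside the box while keeping the aspect ratio."""
    scale = min(max_height / height, max_width / width)
    return round(height * scale), round(width * scale)


def crop_or_pad_amount(size, target=244):
    """Positive result: pixels cropped away. Negative result: pixels of padding added."""
    return size - target


for height, width in [(480, 640), (640, 480), (200, 640)]:
    h, w = fit_within(height, width)
    print(
        (height, width), "->", (h, w),
        "| height change:", crop_or_pad_amount(h),
        "| width change:", crop_or_pad_amount(w),
    )
```

For a 480x640 landscape photo the first step yields 300x400, and the center crop then removes 56 rows and 156 columns. A very wide 200x640 image shrinks to 125x400, which is shorter than 244, so the second step pads it with 119 rows of zeros (black bars) rather than cropping.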

Re-run the cell below to see the resize transform on a different image.

[ ]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
image, label = next(sample_ds)
image_resized = resize(image, label)[0]
ax[0].imshow(image)
ax[0].axis("off")
ax[0].set_title(f"Before - {image.shape}")
ax[1].imshow(image_resized / 255)
ax[1].axis("off")
ax[1].set_title(f"After - {image_resized.shape}");

Define data augmentations

An easy way to improve training is to randomly augment the images to help our training algorithm generalize better. There are many augmentations to choose from, but keep in mind that the more we add to our augment function, the more processing will be required before we can send the image to the GPU for training. Also, it’s important to note that we don’t need to augment the validation data because we want to generate a prediction on the image as it is.

[ ]:
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, 0.2)
    image = tf.image.random_hue(image, 0.1)
    return (image, label)
[ ]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
image, label = next(sample_ds)
image_aug = augment(image, label)[0]
ax[0].imshow(image)
ax[0].axis("off")
ax[1].imshow(image_aug)
ax[1].axis("off");

Apply transformations to the datasets

The training dataset gets both the resize and augment functions applied to it, but the validation dataset only gets resized because it’s not directly used for training. We call the .map() method to apply a transformation to each record. However, this doesn’t actually transform the images yet. Rather, the transformation is applied by the CPU right before each image is sent to the GPU for training. This is nice because we can experiment quickly without having to wait for all the images to be transformed.

You may be wondering why we’re applying the transformations randomly. This is done because our training algorithm will cycle through the data in epochs. Each epoch it will get a chance to view the image again so instead of sending the same image through each time, we’ll apply a random augmentation. Ideally, we’d let the algorithm see all versions of the image each epoch, but this would scale the size of the training dataset by the number of augmentations. Scaling the data storage and training time by that factor isn’t worth the relatively minor changes introduced into the dataset.

[ ]:
train_ds = train_ds.map(resize).map(augment)
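The lazy behavior described above can be illustrated without TensorFlow. The sketch below is only an analogy: a Python generator stands in for the mapped dataset, and `toy_augment` is a hypothetical transform that randomly negates a number and counts how often it runs.

```python
import random

calls = {"count": 0}


def toy_augment(value, rng):
    """Toy 'augmentation': randomly negate a number and record that we ran."""
    calls["count"] += 1
    return -value if rng.random() < 0.5 else value


rng = random.Random(0)
data = [1, 2, 3, 4, 5, 6, 7, 8]

# Like a mapped dataset, a generator expression only records what to do.
pipeline = (toy_augment(v, rng) for v in data)
print("calls after building the pipeline:", calls["count"])  # 0

first_epoch = list(pipeline)
print("calls after one pass:", calls["count"])  # 8

# A second pass re-runs the random transform, so the outputs can differ.
second_epoch = [toy_augment(v, rng) for v in data]
print(first_epoch)
print(second_epoch)
```

Nothing is computed until the data is iterated, and each pass draws fresh random choices. That is the property that lets every epoch see a slightly different version of each image without storing any extra copies.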

Visualize the transformations

Just to make sure everything is working we can apply some transformations on a few images and view them to make sure the output looks good.

[ ]:
fig, axs = plt.subplots(3, 4, figsize=(12, 7))

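# Create the iterator once so each subplot shows the next image in the dataset
train_iter = iter(train_ds)
for ax in axs.flatten():
    sample = next(train_iter)
    ax.imshow(tf.cast(sample[0], dtype=tf.uint8))
    ax.axis("off")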

plt.tight_layout()

Save the datasets to TFRecord format


TensorFlow has its own record format which makes moving and training on image data much easier. The format is called TFRecord and it basically converts your image data into one or more binary chunks that are much easier to read and process than thousands of individual files. One downside to the TFRecord format is that the images it saves are uncompressed, so if you have large JPEG images this can really add up to a large file size. One solution is to use TFRecord’s built-in compression, but you’ll still have to uncompress the files during training, which may slow training down. The solution we’ll implement here is to perform the resizing transform before converting to a TFRecord so the uncompressed image size is much smaller.
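A back-of-the-envelope calculation shows how much resizing first can save. In the sketch below, the 640x480 source size is an assumed typical photo size and `raw_bytes` is a hypothetical helper. The 244x244 target and the 2,200 training images come from this notebook.

```python
def raw_bytes(height, width, channels=3, bytes_per_value=1):
    """Size of an uncompressed image tensor in bytes."""
    return height * width * channels * bytes_per_value


original = raw_bytes(480, 640)                          # assumed 640x480 photo, uint8
resized = raw_bytes(244, 244)                           # after resizing, uint8
resized_float = raw_bytes(244, 244, bytes_per_value=4)  # same image stored as float32

print(f"640x480 uint8:   {original:>9,} bytes")
print(f"244x244 uint8:   {resized:>9,} bytes")
print(f"244x244 float32: {resized_float:>9,} bytes")
print(f"2,200 resized uint8 images: {2200 * resized / 1e6:.0f} MB uncompressed")
```

These figures are for raw pixel data. The helper functions in this notebook encode each image as JPEG before writing it, so the files on disk should be smaller than the raw totals.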

Define helper functions

[ ]:
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _image_as_bytes_feature(image):
    """Returns a bytes_list from an image tensor."""

    if image.dtype != tf.uint8:
        # `tf.io.encode_jpeg` requires tf.uint8 input images, with values between
        # 0 and 255. We do the conversion with the following function, if needed:
        image = tf.image.convert_image_dtype(image, tf.uint8, saturate=True)

    # We convert the image tensor back into a byte list...
    image_string = tf.io.encode_jpeg(image, quality=90)

    # ... and then into a Feature:
    return _bytes_feature(image_string)


def image_example(image_tensor, label):
    image_shape = image_tensor.shape

    feature = {
        "height": _int64_feature(image_shape[0]),
        "width": _int64_feature(image_shape[1]),
        "depth": _int64_feature(image_shape[2]),
        "label": _int64_feature(label),
        "image_raw": _image_as_bytes_feature(image_tensor),
    }

    return tf.train.Example(features=tf.train.Features(feature=feature))

Define resize and rescale transformation

[ ]:
def resize_rescale(image, label):
    image = tf.image.resize(image, (400, 400), preserve_aspect_ratio=True)
    image = tf.image.resize_with_crop_or_pad(image, 244, 244)
    image = image / 255.0
    return (image, label)

Write data to TFRecord files

[ ]:
tfrecord_dir = pathlib.Path("./data_tfrecord")
tfrecord_dir.mkdir(exist_ok=True)
[ ]:
train_ds = image_folder.as_dataset(split=["train"], shuffle_files=True, as_supervised=True)[0]
val_ds = image_folder.as_dataset(split=["val"], shuffle_files=True, as_supervised=True)[0]
test_ds = image_folder.as_dataset(split=["test"], shuffle_files=True, as_supervised=True)[0]

train_ds = train_ds.map(resize_rescale)
val_ds = val_ds.map(resize_rescale)
test_ds = test_ds.map(resize_rescale)

for name, data_split in zip(["train", "val", "test"], [train_ds, val_ds, test_ds]):
    record_file = f"data_tfrecord/{name}.tfrecord"
    with tf.io.TFRecordWriter(record_file) as writer:
        for image_tensor, label in tqdm(data_split):
            tf_example = image_example(image_tensor, label)
            writer.write(tf_example.SerializeToString())

Upload datasets to S3


Get S3 Bucket

[ ]:
bucket_name = sagemaker.Session().default_bucket()
prefix = "DEMO-sm-preprocess-train-image-data-pytorch-algo"
s3 = boto3.resource("s3")
region = sagemaker.Session().boto_region_name

Upload TFRecord files

[ ]:
s3_uploader = sagemaker.s3.S3Uploader()

for data_split in ["train", "val"]:
    data_path = f"data_tfrecord/{data_split}.tfrecord"
    data_s3_uri = s3_uploader.upload(
        local_path=data_path, desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/{data_split}"
    )

## Part 4: Train the TensorFlow Model


In this section, you will use the SageMaker SDK to create a TensorFlow Estimator and train it on a remote EC2 instance.

Algorithm hyperparameters


Hyperparameters represent the tuning knobs for our algorithm which we set before training begins. Typically they are pre-set to defaults so if we don’t specify them we can still run the training algorithm, but they usually need tweaking to get optimal results. What these values should be depends entirely on the dataset. Unfortunately, there’s no formula to tell us what the best settings are; we just have to try them ourselves and see what we get, but there are best practices and tips to help guide us in choosing them.

  • Optimizer - The optimizer refers to the optimization algorithm being used to choose the best weights. For deep learning on image data, SGD or ADAM is typically used.

  • Learning Rate - After each batch of training we update the model’s weights to give us the best possible results for that batch. The learning rate controls how much we update the weights. Common practice is to use a value between 0.001 and 0.2, and rarely anything above 1. The higher the learning rate, the faster your training will converge to the optimal weights, but going too fast can lead you to overshoot the target. In this example, we’re using the weights from a pre-trained model, so we want to start with a lower learning rate because the weights have already been optimized and we don’t want to move too far away from them.

  • Epochs - An epoch refers to one cycle through the training set, and having more epochs to train means having more opportunities to improve accuracy. Suitable values range from 5 to 25 epochs depending on your time and budget constraints. Ideally, the right number of epochs is right before your validation accuracy plateaus.

  • Batch Size - Training on batches reduces the amount of data you need to hold in RAM and can speed up the training algorithm. For these reasons the training data is nearly always batched. The optimal batch size will depend on the dataset, how large the images are and how much RAM the training computer has. For a dataset like ours, reasonable values would be between 8 and 64 images per batch.

  • Loss - This is the type of loss function that will be used by the optimizer to update the model’s weights during training. For training on a dataset with more than two classes, the most common loss function is Cross-Entropy Loss. In TensorFlow, if your labels are a single number corresponding to a class (i.e. mutually exclusive) then the type of loss is Sparse Categorical Crossentropy.
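To make the loss and learning-rate items less abstract, here is a small plain-Python sketch of sparse categorical cross-entropy for a single example, followed by one SGD update on a single weight. The scores, weight, and gradient are made-up numbers. This illustrates the formulas only and is not the code the training script runs.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Loss for one example whose label is a single class index."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-class problem
loss_if_class_0 = sparse_categorical_crossentropy(logits, label=0)
loss_if_class_2 = sparse_categorical_crossentropy(logits, label=2)
print(loss_if_class_0)  # small: class 0 has the highest score
print(loss_if_class_2)  # larger: class 2 has the lowest score

# One SGD step on a single weight: move against the gradient, scaled by the learning rate.
weight, gradient, learning_rate = 0.80, 0.50, 0.001
weight = weight - learning_rate * gradient
print(weight)
```

With a learning rate of 0.001 the weight moves by only 0.0005, which is the kind of small, careful step that suits fine-tuning pre-trained weights.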

Review the training script


Helper functions

These helper functions define transformations needed to be done to our TFRecords datasets before training. For more in-depth info see the Pre-processing guide in this series.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 7,23p

Execution safety

For safety we wrap the training code in this standard if statement, though it is not strictly required.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 25p

Parse argument variables

These argument variables are passed via the hyperparameter argument for the estimator config and the input argument to the fit method.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 27,34p
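The script itself is not reproduced in this text, so the following is only a sketch of the general script-mode pattern rather than the contents of tensorflow_train.py. It assumes the hyperparameter names used later in this notebook (`epochs`, `batch-size`, `learning-rate`) and the environment variables SageMaker sets for data channels and the model directory (`SM_CHANNEL_TRAINING`, `SM_CHANNEL_VALIDATION`, `SM_MODEL_DIR`). The argument names `--train-dir` and `--val-dir` and the local fallback paths are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags named after the dict keys.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    # Data channels and the model directory are exposed as environment variables;
    # the fallback paths are placeholders for running outside SageMaker.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "data/train"))
    parser.add_argument("--val-dir", default=os.environ.get("SM_CHANNEL_VALIDATION", "data/val"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    # parse_known_args ignores any extra flags the platform may pass.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "5", "--batch-size", "16"])
print(args.epochs, args.batch_size, args.learning_rate)  # 5 16 0.001
```

Note that argparse converts the dashes in flag names to underscores, so `--batch-size` is read back as `args.batch_size`.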

Use autotune for configuring parallelization

In order to speed up training, TensorFlow can spread certain tasks across multiple cores. It can be difficult to determine the optimal number of workers to spread the work across (too few and you underutilize your GPU; too many and the overhead of scheduling the work causes lag). Luckily, TensorFlow comes with a way of determining the right number based on the computer doing the training.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 36p

Load the datasets

The training and validation datasets are loaded. Augmentation is applied to the training data, but not the validation data. We don’t need to do any resizing or rescaling because we already applied this transformation when we converted the images to TFRecord files.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 38,56p

Determine if GPU is available

This sets the training device to the GPU if one is available; otherwise it uses the CPU.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 58,63p

Create and modify the base model

First the device context is set to ensure we’re using the proper device (GPU or CPU). Then we use a ResNet50 architecture and initialize it with weights pre-trained on the ImageNet dataset. Since the top layer of the pretrained model is configured for the ImageNet classes, we need to remove the classification layer (include_top=False) and replace it with a classification layer for our 11 animals.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 65,73p

Define the optimizer and train the model

For this example we’ll use SGD to optimize the weights of the model. At the end of training the weights for the epoch with the best validation accuracy are saved so we can load the model later for predictions on our test dataset.

[ ]:
!pygmentize "training_tensorflow/tensorflow_train.py" | sed -n 75,85p

Estimator configuration


These define the resources to use for training and how they are configured. Here are some important ones to single out:

  • entry_point (str) – Path (absolute or relative) to the Python source file which should be executed as the entry point to training. If source_dir is specified, then entry_point must point to a file located at the root of source_dir.

  • framework_version (str) – TensorFlow version you want to use for executing your model training code. Defaults to None. Required unless image_uri is provided. List of supported versions: https://github.com/aws/sagemaker-python-sdk#tensorflow-sagemaker-estimators.

  • py_version (str) – Python version you want to use for executing your model training code, for example ‘py37’ as in the configuration used in this notebook. Defaults to None. Required unless image_uri is provided.

  • source_dir (str) – Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. The structure within this directory is preserved when training on Amazon SageMaker.

  • dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. If ‘git_config’ is provided, ‘dependencies’ should be a list of relative locations to directories with any additional libraries needed in the Git repo.

  • git_config (dict[str, str]) – Git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password and token. The repo field is required. All other fields are optional. repo specifies the Git repository where your training script is stored. If you don’t provide branch, the default value ‘master’ is used. If you don’t provide commit, the latest commit in the specified branch is used.

  • role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

  • instance_count (int) – Number of Amazon EC2 instances to use for training.

  • instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.

  • volume_size (int) – Size in GB of the EBS volume to use for storing input data during training (default: 30). Must be large enough to store training data if File Mode is used (which is the default).

  • model_uri (str) – URI where a pre-trained model is stored, either locally or in S3 (default: None). If specified, the estimator will create a channel pointing to the model so the training job can download it. This model can be a ‘model.tar.gz’ from a previous training job, or other artifacts coming from a different source. In local mode, this should point to the path in which the model is located and not the file itself, as local Docker containers will try to mount the URI as a volume.

  • output_path (str) - S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution. file:// urls are used for local mode. For example: ‘file://model/’ will save to the model folder in the current directory.

Training on an EC2 instance


Define hyperparameters for training

[ ]:
hyperparameters = {
    "epochs": 3,
    "batch-size": 32,
    "learning-rate": 0.001,
}
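These settings translate into a concrete number of weight updates. The short calculation below assumes the 2,200-image training split created earlier (80% of 2,750 images) and that the final, partially filled batch of each epoch is kept rather than dropped.

```python
import math

train_images = 2750 * 80 // 100  # 80% of the 2,750 sampled images
batch_size = 32
epochs = 3

steps_per_epoch = math.ceil(train_images / batch_size)  # the last batch is partial
total_updates = steps_per_epoch * epochs

print(train_images, steps_per_epoch, total_updates)  # 2200 69 207
```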

Define the estimator configuration

[ ]:
estimator_config = {
    "entry_point": "tensorflow_train.py",
    "source_dir": "training_tensorflow",
    "framework_version": "2.3",
    "py_version": "py37",
    "instance_type": "ml.p3.2xlarge",
    "instance_count": 1,
    "role": sagemaker.get_execution_role(),
    "hyperparameters": hyperparameters,
    "output_path": f"s3://{bucket_name}/{prefix}",
}
[ ]:
tf_estimator = TensorFlow(**estimator_config)

Define the data channels for training and validation

[ ]:
s3_data_channels = {
    "training": f"s3://{bucket_name}/{prefix}/data/train/train.tfrecord",
    "validation": f"s3://{bucket_name}/{prefix}/data/val/val.tfrecord",
}

Train the model

[ ]:
tf_estimator.fit(s3_data_channels)

Load trained model and predict on test data


After training the model and saving it to S3, we can retrieve it and load it back into TensorFlow to generate predictions. It’s important that after training we evaluate the model on the test data. This data has never been seen by the model for training or for choosing the best epoch.

Download the trained model from S3

[ ]:
sagemaker.s3.S3Downloader().download(tf_estimator.model_data, "training_tensorflow")
[ ]:
tfile = tarfile.open("training_tensorflow/model.tar.gz")
tfile.extractall("training_tensorflow")

Load the trained model

[ ]:
model = tf.keras.models.load_model("training_tensorflow/model")

Load images from the test dataset for predictions

[ ]:
image_folder = tfds.ImageFolder("./data_structured")
[ ]:
def tfrecord_parser(record):
    features = {
        "height": tf.io.FixedLenFeature([], tf.int64),
        "width": tf.io.FixedLenFeature([], tf.int64),
        "depth": tf.io.FixedLenFeature([], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
        "image_raw": tf.io.FixedLenFeature([], tf.string),
    }
    parsed_features = tf.io.parse_single_example(record, features)
    return tf.io.decode_jpeg(parsed_features["image_raw"]), parsed_features["label"]
[ ]:
test_ds = tf.data.TFRecordDataset(filenames=["data_tfrecord/test.tfrecord"], num_parallel_reads=2)

test_ds = test_ds.map(tfrecord_parser, num_parallel_calls=2).as_numpy_iterator()

Show test images with model predictions

Re-run cell to see more predictions

[ ]:
fig, axs = plt.subplots(3, 4, figsize=(10, 7))

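# The model predicts class indices in the label order recorded by the ImageFolder
# builder, not the COCO category ids used as keys in category_labels.
label_feature = image_folder.info.features["label"]

for ax in axs.flatten():
    sample = next(test_ds)  # test_ds is already an iterator
    image = sample[0]
    pred = model.predict(tf.expand_dims(image, axis=0))
    pred_name = label_feature.int2str(int(np.argmax(pred)))
    ax.imshow(image)
    ax.axis("off")
    ax.set_title(f"prediction: {pred_name}")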
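The model outputs one score per class, so the position of the highest score is a class index (0 to 10 here), not a COCO category id. The sketch below shows the index-to-name lookup on made-up data. It assumes, as is common for folder-based loaders, that classes are numbered by sorted folder name; the actual order can be checked with `image_folder.info.features["label"].names`.

```python
# Hypothetical class folders, listed in an arbitrary order.
class_folders = ["zebra", "cat", "frog", "dog"]

# Assumption: the loader numbers classes by sorted folder name.
index_to_name = dict(enumerate(sorted(class_folders)))

# Made-up model output: one score per class, in index order.
scores = [0.05, 0.10, 0.70, 0.15]
predicted_index = max(range(len(scores)), key=scores.__getitem__)

print(predicted_index, index_to_name[predicted_index])  # 2 frog
```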