Download, Structure, and Preprocess Image Data for SageMaker Built-In Algorithms

Notes:

* This notebook should be used with the conda_amazonei_mxnet_p36 kernel.
* You can also explore image preprocessing with TensorFlow and PyTorch by running Download, Structure, and Preprocess Image Data for TensorFlow Models and Download, Structure, and Preprocess Image Data for PyTorch Models, respectively.

The main purpose of this notebook is to demonstrate how you can preprocess image data to train SageMaker Built-In Algorithms.

Contents

  1. Part 1: Download the Dataset

  2. Part 2: Structure the Dataset

  3. Part 3: Preprocess Images for Built-in Algorithms

  4. Part 4: Train the Built-in Image Classification Algorithm

## Part 1: Download the Dataset


In this section, you will use a dataset manifest to download animal images from the COCO dataset for all ten animal classes. You will then download frog images from the CIFAR dataset and add them to your COCO animal images. In order to simulate coming to SageMaker with your own dataset, we will keep the data in an unstructured form until Part 2, where you will learn the best practices for structuring an image dataset.

[ ]:
! pip install imageio joblib opencv-python
[ ]:
import json
import pickle
import shutil
import urllib.request  # urlretrieve lives in the request submodule
import pathlib
import tarfile
from tqdm import tqdm
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from imageio import imread, imwrite
from joblib import Parallel, delayed, parallel_backend

The COCO and CIFAR Datasets


For this series of notebooks we will be sampling images from the COCO dataset and the CIFAR-10 dataset (before beginning the notebooks in this series, it’s a good idea to browse each dataset website to familiarize yourself with the data). Both are datasets of images, but they come formatted very differently. The COCO dataset contains images from Flickr that represent a real-world dataset which isn’t formatted or resized specifically for deep learning. This makes it a good dataset for this guide because we want it to be as comprehensive as possible. The CIFAR-10 images, on the other hand, are preprocessed specifically for deep learning: they come cropped, resized and vectorized (i.e. not in a readable image format). This notebook will show you how to work with both types of datasets.

Download the annotations


The dataset annotation file contains info on each image in the dataset such as the class, superclass, file name and url to download the file. Just the annotations for the COCO dataset are about 242MB.

[ ]:
anno_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
urllib.request.urlretrieve(anno_url, "coco-annotations.zip");
[ ]:
shutil.unpack_archive("coco-annotations.zip")

Load the annotations into Python

The training and validation annotations come in separate files.

[ ]:
with open("annotations/instances_train2017.json", "r") as f:
    train_metadata = json.load(f)

with open("annotations/instances_val2017.json", "r") as f:
    val_metadata = json.load(f)

Extract only the animal annotations


To limit the scope of the dataset for this guide, we’re only using the images of animals in the COCO dataset.

[ ]:
category_labels = {
    c["id"]: c["name"] for c in train_metadata["categories"] if c["supercategory"] == "animal"
}

Extract metadata and image filepaths

For the train and validation sets, the data we need for the image labels and the filepaths are under different headings in the annotations. We have to extract each out and combine them into a single annotation in subsequent steps.

[ ]:
train_annos = {}
for a in train_metadata["annotations"]:
    if a["category_id"] in category_labels:
        train_annos[a["image_id"]] = {"category_id": a["category_id"]}

train_images = {}
for i in train_metadata["images"]:
    train_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

val_annos = {}
for a in val_metadata["annotations"]:
    if a["category_id"] in category_labels:
        val_annos[a["image_id"]] = {"category_id": a["category_id"]}

val_images = {}
for i in val_metadata["images"]:
    val_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

Combine label and filepath info

Later in this series of guides we’ll make our own train, validation and test splits. For this reason we’ll combine the training and validation datasets together.

[ ]:
# Use image_id rather than id so the built-in id() is not shadowed
for image_id, anno in train_annos.items():
    anno.update(train_images[image_id])

for image_id, anno in val_annos.items():
    anno.update(val_images[image_id])
[ ]:
all_annos = {}
for k, v in train_annos.items():
    all_annos.update({k: v})
for k, v in val_annos.items():
    all_annos.update({k: v})
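
The filter-and-join steps above can be sketched on a miniature annotation file. This is only an illustration: `toy_metadata` and the helper `build_annotations` are hypothetical, and all ids, names and URLs in it are made up. It also shows a side effect worth knowing about: because the annotations are keyed by image id, an image with several animal annotations keeps only the last category seen.

```python
# Made-up miniature of the COCO annotation JSON (ids, names, URLs are hypothetical).
toy_metadata = {
    "categories": [
        {"id": 17, "name": "cat", "supercategory": "animal"},
        {"id": 18, "name": "dog", "supercategory": "animal"},
        {"id": 62, "name": "chair", "supercategory": "furniture"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 17},
        {"image_id": 1, "category_id": 18},  # same image, second animal
        {"image_id": 2, "category_id": 62},  # not an animal, filtered out
        {"image_id": 3, "category_id": 17},
    ],
    "images": [
        {"id": 1, "coco_url": "http://example.com/000001.jpg", "file_name": "000001.jpg"},
        {"id": 2, "coco_url": "http://example.com/000002.jpg", "file_name": "000002.jpg"},
        {"id": 3, "coco_url": "http://example.com/000003.jpg", "file_name": "000003.jpg"},
    ],
}


def build_annotations(metadata, supercategory="animal"):
    """Return (labels, annos); annos maps image_id -> label and file info."""
    labels = {
        c["id"]: c["name"]
        for c in metadata["categories"]
        if c["supercategory"] == supercategory
    }
    images = {i["id"]: i for i in metadata["images"]}
    annos = {}
    for a in metadata["annotations"]:
        if a["category_id"] not in labels:
            continue
        image = images[a["image_id"]]
        # Keyed by image_id, so a later annotation for the same image
        # replaces an earlier one, as in the cells above.
        annos[a["image_id"]] = {
            "category_id": a["category_id"],
            "coco_url": image["coco_url"],
            "file_name": image["file_name"],
        }
    return labels, annos


toy_labels, toy_annos = build_annotations(toy_metadata)
print(toy_labels)
print(toy_annos)
```

Image 1 ends up labelled as a dog here, since its dog annotation comes after its cat annotation.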

Sample the dataset


In order to make working with the data easier, we’ll select 250 images from each class at random. To make sure you get the same set of images for each run of this notebook, we’ll also set NumPy’s random seed to 0. This is a small fraction of the dataset, but it demonstrates how using transfer learning can give you good results without needing very large datasets.

[ ]:
np.random.seed(0)
[ ]:
sample_annos = {}

for category_id in category_labels:
    subset = [k for k, v in all_annos.items() if v["category_id"] == category_id]
    sample = np.random.choice(subset, size=250, replace=False)
    for k in sample:
        sample_annos[k] = all_annos[k]
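
If you bring your own dataset, some classes may have fewer images than the sample size, in which case sampling without replacement raises a `ValueError`. The sketch below is one way to guard against that; the helper `sample_ids_per_class` and the toy annotations are hypothetical, and it uses NumPy’s `default_rng` generator rather than the global seed used above.

```python
import numpy as np


def sample_ids_per_class(annos, n_per_class, seed=0):
    """Pick up to n_per_class image ids for every category_id in annos."""
    rng = np.random.default_rng(seed)

    # Group image ids by their category
    by_class = {}
    for image_id, anno in annos.items():
        by_class.setdefault(anno["category_id"], []).append(image_id)

    sample = {}
    for category_id, ids in sorted(by_class.items()):
        # Never ask for more images than the class actually has
        k = min(n_per_class, len(ids))
        for image_id in rng.choice(sorted(ids), size=k, replace=False):
            sample[int(image_id)] = annos[int(image_id)]
    return sample


# Hypothetical annotations: 3 classes with 10 images each
demo_annos = {i: {"category_id": 16 + i % 3} for i in range(30)}
demo_sample = sample_ids_per_class(demo_annos, n_per_class=4)
print(len(demo_sample))  # 12
```

Sorting the ids before sampling keeps the result independent of dictionary ordering, so the same seed gives the same sample.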

Create a download function

In order to parallelize downloading the images, we wrap the download-and-save step in a function that joblib can run across multiple threads.

[ ]:
def download_image(url, path):
    data = imread(url)
    imwrite(path / url.split("/")[-1], data)

Download the sample of the dataset (2,500 images, ~5min)

[ ]:
sample_dir = pathlib.Path("data_sample_2500")
sample_dir.mkdir(exist_ok=True)
[ ]:
with parallel_backend("threading", n_jobs=5):
    Parallel(verbose=3)(
        delayed(download_image)(a["coco_url"], sample_dir) for a in sample_annos.values()
    )

Combine with CIFAR-10 frog data


The COCO dataset doesn’t include any images of frogs, but let’s say our model must also be able to label images of frogs. To fix this we can download another dataset of images which includes frogs, sample 250 frog images and add them to our existing image data. These images are much smaller (32x32), so they will appear pixelated and blurry when we increase their size to 224x224. We’ll use the CIFAR-10 dataset to achieve this. As you’ll see, the CIFAR-10 dataset comes formatted in a very different manner from the COCO dataset. We must process the CIFAR-10 data into individual image files so that it’s congruent with our COCO images.

Download and extract the CIFAR-10 dataset

[ ]:
!wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
[ ]:
# The context manager closes the archive once extraction finishes
with tarfile.open("cifar-10-python.tar.gz") as cifar_tar:
    cifar_tar.extractall()

Open first batch of CIFAR-10 dataset

The CIFAR-10 dataset comes in five training batches and one test batch. Each training batch has 10,000 randomly ordered images. Since we only need 250 frog images for our dataset, just pulling from the first batch will suffice.

[ ]:
with open("./cifar-10-batches-py/data_batch_1", "rb") as f:
    batch_1 = pickle.load(f, encoding="bytes")
[ ]:
image_data = batch_1[b"data"]

Pull 250 sample frog images

[ ]:
frog_indices = np.array(batch_1[b"labels"]) == 6
sample_frog_indices = np.random.choice(frog_indices.nonzero()[0], size=250, replace=False)
sample_data = image_data[sample_frog_indices, :]
frog_images = sample_data.reshape(len(sample_data), 3, 32, 32).transpose(0, 2, 3, 1)
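
To see why the reshape and transpose above are needed, here is a small sketch on a made-up row. Each CIFAR-10 row holds 3072 values: 1024 red, then 1024 green, then 1024 blue, each plane in row-major order. The arrays `toy_batch`, `toy_images` and `upscaled` are hypothetical stand-ins, and the integer-factor enlargement at the end only illustrates why 32x32 images look pixelated at 224x224; it is not necessarily how any particular tool resizes them.

```python
import numpy as np

# Hypothetical stand-in for one row of batch_1[b"data"]: 3072 uint8 values,
# stored as 1024 red, then 1024 green, then 1024 blue.
red = np.full(1024, 255, dtype=np.uint8)
green = np.zeros(1024, dtype=np.uint8)
blue = (np.arange(1024) % 256).astype(np.uint8)
toy_batch = np.concatenate([red, green, blue])[np.newaxis, :]  # shape (1, 3072)

# (N, 3072) -> (N, channel, row, col) -> (N, row, col, channel)
toy_images = toy_batch.reshape(len(toy_batch), 3, 32, 32).transpose(0, 2, 3, 1)
print(toy_images.shape)  # (1, 32, 32, 3)

# Pixel at row 1, col 0: red=255, green=0, blue=32
print(toy_images[0, 1, 0])

# Nearest-neighbour enlargement by an integer factor: 32 * 7 = 224.
# Every source pixel becomes a 7x7 block, which is what "pixelated" looks like.
upscaled = toy_images[0].repeat(7, axis=0).repeat(7, axis=1)
print(upscaled.shape)  # (224, 224, 3)
```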

View frog images

[ ]:
fig, axs = plt.subplots(3, 4, figsize=(10, 7))
# high is exclusive, so use the array length to make every image selectable
indices = np.random.randint(low=0, high=len(frog_images), size=12)

for i, ax in enumerate(axs.flatten()):
    ax.imshow(frog_images[indices[i]])
    ax.axis("off")

Write sample frog images to data_sample_2500 directory

[ ]:
frog_filenames = np.array(batch_1[b"filenames"])[sample_frog_indices]
[ ]:
for idx, filename in enumerate(frog_filenames):
    filename = filename.decode()
    data = frog_images[idx]
    if filename.endswith(".png"):
        filename = filename.replace(".png", ".jpg")
    imwrite(sample_dir / filename, data)
[ ]:
sample_dir.rename("data_sample_2750")

Add frog annotations to sample_annos

[ ]:
category_labels[26] = "frog"
[ ]:
next_anno_idx = np.array(list(sample_annos.keys())).max() + 1

frog_anno_ids = range(next_anno_idx, next_anno_idx + len(frog_images))
[ ]:
for idx, frog_id in enumerate(frog_anno_ids):
    sample_annos[frog_id] = {
        "category_id": 26,
        "file_name": frog_filenames[idx].decode().replace(".png", ".jpg"),
    }

## Part 2: Structure the Dataset


In this section, you will properly structure your image files for ingestion by the model. Then, we will use Python to create the new folder structure and copy the files into the correct set and label folder.

Proper folder structure


Although most tools can accommodate data in any file structure with enough tinkering, it makes most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share to make data ingestion as smooth as possible. By default, most tools will look for image data in the file structure depicted below:

+-- train
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- val
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- test
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg

You will notice that the COCO dataset does not come structured as shown above, so we must use the annotation data to restructure the folders of the COCO dataset to match this pattern. Once the new directory structure is created, you can use your desired framework’s data loading tool to gracefully load your image data and define transformations for it. Many datasets may already be in this structure, in which case you can skip this part of the guide.

### Make train, validation and test splits

We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training) and evaluate its performance at each epoch with the next 10% (validation) and we’ll give our model’s final accuracy results using the last 10% (test). It’s important that before we split the data we make sure to shuffle it randomly so that class distribution among splits is roughly proportional.

[ ]:
np.random.seed(0)
image_ids = sorted(list(sample_annos.keys()))
np.random.shuffle(image_ids)
first_80 = int(len(image_ids) * 0.8)
next_10 = int(len(image_ids) * 0.9)
train_ids, val_ids, test_ids = np.split(image_ids, [first_80, next_10])
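
A shuffle makes the class distribution roughly, not exactly, proportional across splits, so it can be worth checking. The sketch below repeats the 80/10/10 split on hypothetical toy annotations (`demo_annos`) and counts the classes in each split; the helper `class_counts` is an illustration and not part of the notebook.

```python
from collections import Counter

import numpy as np

# Hypothetical annotations: 3 classes with 100 images each
demo_annos = {i: {"category_id": 16 + i % 3} for i in range(300)}

rng = np.random.default_rng(0)
ids = np.array(sorted(demo_annos))
rng.shuffle(ids)

# Split points at 80% and 90% of the shuffled ids
first_80 = int(len(ids) * 0.8)
next_10 = int(len(ids) * 0.9)
demo_train, demo_val, demo_test = np.split(ids, [first_80, next_10])


def class_counts(split_ids):
    """Count how many images of each category a split contains."""
    return Counter(demo_annos[int(i)]["category_id"] for i in split_ids)


for name, split in [("train", demo_train), ("val", demo_val), ("test", demo_test)]:
    print(name, len(split), dict(class_counts(split)))
```

If one class turns out to be badly under-represented in the validation or test split, a per-class (stratified) split is a common alternative.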

### Make new folder structure and copy image files

This new folder structure can then be read by data loaders for SageMaker’s built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.

[ ]:
unstruct_dir = Path("data_sample_2750")
struct_dir = Path("data_structured")
struct_dir.mkdir(exist_ok=True, parents=True)

for name, split in zip(["train", "val", "test"], [train_ids, val_ids, test_ids]):
    split_dir = struct_dir / name
    split_dir.mkdir(exist_ok=True)
    for image_id in tqdm(split):
        category_dir = split_dir / f'{category_labels[sample_annos[image_id]["category_id"]]}'
        category_dir.mkdir(exist_ok=True)
        source_path = (unstruct_dir / sample_annos[image_id]["file_name"]).as_posix()
        target_path = (category_dir / sample_annos[image_id]["file_name"]).as_posix()
        shutil.copy(source_path, target_path)
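
After copying, a quick count per split and class helps confirm the folders match the expected layout. The following is a minimal sketch: the helper `count_images` is hypothetical, and it is demonstrated on a throwaway temporary directory with empty placeholder files rather than on the real data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images(struct_dir, pattern="*.jpg"):
    """Count files per (split, class) in a split/class/file.jpg layout."""
    counts = Counter()
    for p in Path(struct_dir).rglob(pattern):
        # parts[-3] is the split folder, parts[-2] is the class folder
        counts[(p.parts[-3], p.parts[-2])] += 1
    return counts


# Build a tiny fake tree so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data_structured"
    for split, label, n in [("train", "cat", 3), ("train", "frog", 2), ("val", "cat", 1)]:
        class_dir = root / split / label
        class_dir.mkdir(parents=True)
        for i in range(n):
            (class_dir / f"{i}.jpg").touch()
    demo_counts = count_images(root)

print(dict(demo_counts))
```

On the real dataset you would call `count_images("data_structured")` and compare the totals with the sizes of the train, validation and test splits.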

## Part 3: Preprocess Images for Built-in Algorithms


In this section, we will explore the different ways to format your image dataset for SageMaker’s built-in algorithms. The first involves creating a manifest file for the train and validation sets, and the other has you creating .REC files (RecordIO format), which are single binary files made up of all the images for the train and validation sets. Since the RecordIO format is preferred, we will upload the .REC files to S3 for training in Part 4.

Dependencies


[ ]:
import uuid
import boto3
import shutil
import urllib.request  # urlretrieve lives in the request submodule
import pickle
import pathlib
import sagemaker
import subprocess

Application/x-image format


This format is also referred to as “Image Format” or “LST” format. The benefit of using this format is that it doesn’t require any modification or restructuring of your dataset. Instead, you create a manifest of the images for your training set and validation set. These two manifests are separate .lst files which list all the images giving each of them a unique index, the class they belong to and the relative path to the image file from the main training folder. The data in the .lst file is in tab separated values.

While it’s the easiest format to use, it requires SageMaker to do more work behind the scenes. For datasets with many images, this will cause training to take longer. For datasets with fewer images, the performance difference isn’t as pronounced.

Below are two examples of how to create your .LST manifest files. One uses your own code and the other uses a script from MXNet. If you want to create .REC files of your images, you should skip to Option 2.

Option 1: Manually generate the .LST files

[ ]:
category_ids = {name: idx for idx, name in enumerate(sorted(category_labels.values()))}
print(category_ids)
[ ]:
image_paths = pathlib.Path("./data_structured").rglob("*.jpg")

for idx, p in enumerate(image_paths):
    image_id = f"{idx:010}"
    category = category_ids[p.parts[-2]]
    path = p.as_posix()
    split = p.parts[-3]
    with open(f"{split}.lst", "a") as f:
        line = f"{image_id}\t{category}\t{path}\n"
        f.write(line)

View the contents of the train.lst file

[ ]:
!head train.lst
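
Note that the Option 1 cell opens each .lst file in append mode, so running it twice duplicates every row. The sketch below shows one way to write and read the tab-separated format with overwrite semantics instead; `write_lst`, `read_lst` and the demo rows are hypothetical helpers for illustration only.

```python
import tempfile
from pathlib import Path


def write_lst(rows, lst_path):
    """Write (index, class_id, relative_path) rows as tab-separated lines."""
    # Mode "w" starts from an empty file, so re-running does not duplicate rows
    with open(lst_path, "w") as f:
        for index, class_id, rel_path in rows:
            f.write(f"{index}\t{class_id}\t{rel_path}\n")


def read_lst(lst_path):
    """Parse a .lst file back into (index, class_id, relative_path) tuples."""
    rows = []
    with open(lst_path) as f:
        for line in f:
            index, class_id, rel_path = line.rstrip("\n").split("\t")
            # float() first, so labels written as "3" or "3.000000" both parse
            rows.append((int(index), int(float(class_id)), rel_path))
    return rows


demo_rows = [(0, 1, "cat/0001.jpg"), (1, 4, "frog/0002.jpg")]  # made-up entries
with tempfile.TemporaryDirectory() as tmp:
    lst_path = Path(tmp) / "demo.lst"
    write_lst(demo_rows, lst_path)
    write_lst(demo_rows, lst_path)  # a second run overwrites instead of appending
    parsed_rows = read_lst(lst_path)

print(parsed_rows)
```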

Option 2: Use im2rec.py script to generate the .LST files

[ ]:
script_url = "https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py"
urllib.request.urlretrieve(script_url, "im2rec.py");

python im2rec.py --list --recursive LST_FILE_PREFIX DATA_DIR

* --list - generate an LST file
* --recursive - look inside subfolders for image data
* LST_FILE_PREFIX - choose the name you want for the .lst file
* DATA_DIR - relative path to the directory with the data

[ ]:
!python im2rec.py --list --recursive train data_structured/train
[ ]:
!python im2rec.py --list --recursive val data_structured/val

View the contents of the train.lst file

[ ]:
!head train.lst

Application/x-recordio (preferred format)


This format is commonly referred to as RecordIO. It creates a new file with the .rec suffix for each of your training and validation datasets. The .rec file is a single file that contains all of the images in the dataset, so it can be streamed directly to the SageMaker training algorithm without the overhead involved in transferring thousands of individual files. For datasets with many images this provides a huge reduction in training time because SageMaker doesn’t need to download all the image files before it can run the training algorithm. If you use the im2rec.py script, it can also resize the images for you. Resizing the files before saving them in the RecordIO format reduces the amount of data you need to transfer to S3 and speeds up training, because the resizing is done ahead of time instead of during training.

1. Run Option 2 from application/x-image above and copy LST files

Once you’ve run Option 2 from above then proceed below.

[ ]:
recordio_dir = pathlib.Path("./data_recordio")
recordio_dir.mkdir(exist_ok=True)
shutil.copy("train.lst", "data_recordio/")
shutil.copy("val.lst", "data_recordio/");

2. Generate .rec files in the RecordIO Format

Once the .lst file is generated, the same im2rec.py script will also generate the .rec file.

python im2rec.py --resize 224 --quality 90 --num-thread 16 LST_FILE_PREFIX DATA_DIR/

* --resize: Have the script resize the files before saving them all to a .rec file. For the image classification algorithm the default dimensions are 224x224. Resizing now will also reduce the size of your .rec file.
* --quality: Sets the quality used when the images are encoded (for JPEG, a value from 1 to 100). Lower values add more compression, which keeps the file size of your .rec down, especially if you’re not resizing the images.
* --num-thread: Set how many threads to use to parallelize the work.
* LST_FILE_PREFIX: Name of the .lst file you’re referencing for creating the .rec file.
* DATA_DIR: Relative path to the directory which holds the data listed in the .lst file.

Training dataset
[ ]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/train data_structured/train
Validation dataset
[ ]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/val data_structured/val

Upload the data to S3


In order for SageMaker’s built-in algorithms to train on the data, it must be stored in an S3 bucket. Here, we use the SageMaker session’s default bucket, but you can use an existing bucket if you like by changing the bucket_name variable below.

Get S3 Bucket

[ ]:
bucket_name = sagemaker.Session().default_bucket()
prefix = "DEMO-sm-preprocess-train-image-data-builtin-algo"
s3 = boto3.resource("s3")
region = sagemaker.Session().boto_region_name

Upload .rec files to S3

[ ]:
s3_uploader = sagemaker.s3.S3Uploader()

data_path = recordio_dir / "train.rec"

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/train"
)
[ ]:
data_path = recordio_dir / "val.rec"

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/val"
)

## Part 4: Train the Built-in Image Classification Algorithm


In this section, you will use the SageMaker SDK to create an Estimator for SageMaker’s Built-in Image Classification algorithm and train it on a remote EC2 instance.

Built-in Image Classification algorithm


Create SageMaker training and validation channels

[ ]:
train_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/{prefix}/data/train",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

val_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/{prefix}/data/val",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

data_channels = {"train": train_data, "validation": val_data}

Configure the algorithm’s hyperparameters

https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html

* num_layers - The built-in image classification algorithm is based on the ResNet architecture. There are many different versions of this architecture, differing by how many layers they use. We’ll use the smallest one for this guide to speed up training. If the algorithm’s accuracy is hitting a plateau and you need better accuracy, increasing the number of layers may help.
* use_pretrained_model - This will initialize the weights from a pre-trained model for transfer learning. Otherwise weights are initialized randomly.
* augmentation_type - Allows you to add augmentations to your training set to help your model generalize better. For small datasets, augmentation can greatly improve training.
* image_shape - The channel, height, width of all the images.
* num_classes - Number of classes in your dataset.
* num_training_samples - Total number of images in your training set (used to help calculate progress).
* mini_batch_size - The batch size you would like to use during training.
* epochs - An epoch refers to one cycle through the training set, and having more epochs to train means having more opportunities to improve accuracy. Suitable values range from 5 to 25 epochs depending on your time and budget constraints. Ideally, the right number of epochs is right before your validation accuracy plateaus.
* learning_rate - After each batch of training we update the model’s weights to give us the best possible results for that batch. The learning rate controls by how much we should update the weights. Best practices dictate a value between 0.2 and 0.001, typically never going higher than 1. The higher the learning rate, the faster your training will converge to the optimal weights, but going too fast can lead you to overshoot the target. In this example, we’re using the weights from a pre-trained model, so we’d want to start with a lower learning rate because the weights have already been optimized and we don’t want to move too far away from them.
* precision_dtype - Whether you want to use a 32-bit float data type for the model’s weights or 16-bit. 16-bit can be used if you’re running into memory management issues. However, weights can grow or shrink rapidly, so having 32-bit weights makes your training more robust to these issues and is typically the default in most frameworks.
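
To get a feel for how num_training_samples, mini_batch_size and epochs relate, the arithmetic can be sketched as below. The numbers assume the 2,750-image sample and the 80% training split from earlier in this guide; the helper `batches_per_epoch` is hypothetical, and since this guide does not say how the algorithm treats a final partial batch, both the rounded-down and rounded-up counts are shown.

```python
import math


def batches_per_epoch(num_training_samples, mini_batch_size):
    """Return (full batches only, batches including a final partial one)."""
    full = num_training_samples // mini_batch_size
    with_partial = math.ceil(num_training_samples / mini_batch_size)
    return full, with_partial


num_images = 2750                     # 10 COCO classes x 250 + 250 frogs
n_train = int(num_images * 0.8)       # 80% training split -> 2200
full, with_partial = batches_per_epoch(n_train, mini_batch_size=64)
print(n_train, full, with_partial)    # 2200 34 35

# image_shape is passed as a "channels,height,width" string
image_shape = ",".join(str(d) for d in (3, 224, 224))
print(image_shape)                    # 3,224,224

# Rough number of weight updates over the whole job, assuming 5 epochs
print(full * 5, "to", with_partial * 5)
```

A smaller mini_batch_size gives more weight updates per epoch at the cost of noisier updates, which is one reason batch size and learning rate are usually tuned together.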

[ ]:
num_classes = len(category_labels)
num_training_samples = len(set(pathlib.Path("data_structured/train").rglob("*.jpg")))
[ ]:
hyperparameters = {
    "num_layers": 18,
    "use_pretrained_model": 1,
    "augmentation_type": "crop_color_transform",
    "image_shape": "3,224,224",
    "num_classes": num_classes,
    "num_training_samples": num_training_samples,
    "mini_batch_size": 64,
    "epochs": 5,
    "learning_rate": 0.001,
    "precision_dtype": "float32",
}

Configure the type of algorithm and resources to use

[ ]:
training_image = sagemaker.image_uris.retrieve(
    "image-classification", sagemaker.Session().boto_region_name
)
[ ]:
algo_config = {
    "hyperparameters": hyperparameters,
    "image_uri": training_image,
    "role": sagemaker.get_execution_role(),
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    "volume_size": 100,
    "max_run": 360000,
    "output_path": f"s3://{bucket_name}/data/output",
}

Create and train the algorithm

[ ]:
algorithm = sagemaker.estimator.Estimator(**algo_config)
[ ]:
algorithm.fit(inputs=data_channels, logs=True)

## Understanding the training output

[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]#011Speed: 111.811 samples/sec#011accuracy=0.452381
[09/14/2020 05:37:54 INFO 139869866030912] Epoch[0] Batch [40]#011Speed: 131.393 samples/sec#011accuracy=0.570503
[09/14/2020 05:38:10 INFO 139869866030912] Epoch[0] Batch [60]#011Speed: 139.540 samples/sec#011accuracy=0.617700
[09/14/2020 05:38:27 INFO 139869866030912] Epoch[0] Batch [80]#011Speed: 144.003 samples/sec#011accuracy=0.644483
[09/14/2020 05:38:43 INFO 139869866030912] Epoch[0] Batch [100]#011Speed: 146.600 samples/sec#011accuracy=0.664991

Training has begun:

* Epoch[0]: One epoch corresponds to one training cycle through all the data. Stochastic optimizers like SGD and Adam improve accuracy by running multiple epochs. Random data augmentation is also applied with each new epoch, allowing the training algorithm to learn on modified data.
* Batch: The number of batches processed by the training algorithm. We specified one batch to be 64 images in the mini_batch_size hyperparameter. For algorithms like SGD, the model gets a chance to update itself every batch.
* Speed: The number of images sent to the training algorithm per second. This information is important in determining how changes in your dataset affect the speed of training.
* Accuracy: The training accuracy achieved at each interval (in this case, 20 batches).

[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Train-accuracy=0.677083
[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Time cost=102.745
[09/14/2020 05:39:02 INFO 139869866030912] Epoch[0] Validation-accuracy=0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Storing the best model with validation accuracy: 0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Saved checkpoint to "/opt/ml/model/image-classification-0001.params"

The first epoch of training has ended (only the first epoch is shown here). The final training accuracy is reported as well as the accuracy on the validation set. Comparing these two numbers is important in determining whether your model is overfit or underfit, as well as in understanding the bias/variance trade-off. The saved model uses the learned weights from the epoch with the best validation accuracy.
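
If you want to compare training and validation accuracy programmatically, the log lines can be parsed with regular expressions. The sketch below works on lines copied from the output above; the patterns and the helper `parse_training_log` are assumptions based only on the log format shown here, so adjust them if your logs look different.

```python
import re

log_lines = [
    "[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]#011Speed: 111.811 samples/sec#011accuracy=0.452381",
    "[09/14/2020 05:38:43 INFO 139869866030912] Epoch[0] Batch [100]#011Speed: 146.600 samples/sec#011accuracy=0.664991",
    "[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Train-accuracy=0.677083",
    "[09/14/2020 05:39:02 INFO 139869866030912] Epoch[0] Validation-accuracy=0.729492",
]

BATCH_METRIC = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)\].*?Speed: ([\d.]+) samples/sec.*?accuracy=([\d.]+)"
)
EPOCH_METRIC = re.compile(r"Epoch\[(\d+)\] (Train|Validation)-accuracy=([\d.]+)")


def parse_training_log(lines):
    """Return (per-batch records, per-epoch accuracy dict) from log lines."""
    batches, epochs = [], {}
    for line in lines:
        m = BATCH_METRIC.search(line)
        if m:
            epoch, batch, speed, acc = m.groups()
            batches.append((int(epoch), int(batch), float(speed), float(acc)))
            continue
        m = EPOCH_METRIC.search(line)
        if m:
            epoch, kind, acc = m.groups()
            epochs.setdefault(int(epoch), {})[kind.lower()] = float(acc)
    return batches, epochs


batches, epochs = parse_training_log(log_lines)
for epoch, acc in epochs.items():
    gap = acc["train"] - acc["validation"]
    print(f"epoch {epoch}: train={acc['train']:.3f} val={acc['validation']:.3f} gap={gap:+.3f}")
```

A training accuracy that sits well above the validation accuracy is a common sign of overfitting; in the excerpt above the gap is slightly negative.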

2020-09-14 05:39:03 Uploading - Uploading generated training model
2020-09-14 05:39:15 Completed - Training job completed
Training seconds: 235
Billable seconds: 235

The final model parameters are saved as a .tar.gz file in S3 under the output_path specified in algo_config. Total billable seconds is also reported to help compute the cost of training, since you are only charged for the time the EC2 instance is training on the data. Other costs such as S3 storage also apply, but are not included here.
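
As a rough illustration of turning billable seconds into a cost estimate, the arithmetic is sketched below. The hourly rate is a placeholder and not a real price: look up the current on-demand rate for your instance type and region before relying on the result. The helper `training_cost` is hypothetical.

```python
def training_cost(billable_seconds, hourly_rate_usd):
    """Estimate training cost from billable seconds and an hourly rate."""
    return billable_seconds / 3600 * hourly_rate_usd


# Placeholder rate for illustration only, NOT an actual price
HYPOTHETICAL_HOURLY_RATE_USD = 4.00

estimated_cost = training_cost(235, HYPOTHETICAL_HOURLY_RATE_USD)
print(f"${estimated_cost:.4f}")
```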
