Set Up Amazon SageMaker for Training spacy Models

8 minute read

Purpose of this tutorial

What I want to do:

  • develop my code locally in my IDE with syntax highlighting, linting, etc.
  • test my code locally
  • run training jobs in the cloud for actual model training
  • use the NLP library spacy in Amazon SageMaker

What I don’t want to do:

  • use a predefined Estimator from SageMaker for HuggingFace or PyTorch etc. (I prefer spacy.)
  • extend a prebuilt Docker Container from SageMaker with the spacy dependency (I prefer my Docker container as lightweight as possible.)
  • run my training jobs from a SageMaker notebook instance (I want to develop and submit my code locally.)
  • create a custom Docker image with spacy to run it on SageMaker studio

The last point in particular is easy to mistake for what I actually want to do. And while trying to get there, I followed a large number of tutorials that I always had to adapt in one way or another. This is why I decided to write my own. If you, too, want to do exactly what I wanted to do, and not one of the many slightly different scenarios, I hope these instructions will be helpful to you.

Overview

  1. Build a custom Docker container that plays well with both SageMaker and spacy
  2. Push that container to the AWS Elastic Container Registry (ECR)
  3. Write a training job submission script in Python that points SageMaker to the container on ECR and handles hyperparameters etc.
  4. Write a second Python script that performs the actual training

Build a custom container

Our code will run in a Docker container on SageMaker. A Docker image is a bit comparable to a virtual machine whose initial setup is defined in a so-called Dockerfile. This is enough intuition for this tutorial, but if you want to learn more about Docker, have a look here.

So first, we’ll write a Dockerfile (just Dockerfile without file extension):

FROM python:3.9

######################
# OVERVIEW
# 0. Updates pip
# 1. Installs and configures Poetry, then installs the environment defined in pyproject.toml
# 2. Configures the kernel (ipykernel should be installed on the parent image or defined in pyproject.toml)
# 3. Downloads the spacy models used later
######################

# Update pip
RUN pip --no-cache-dir install --upgrade pip

# Install Poetry
RUN pip install poetry
# Disable virtual environments (see notes in README.md)
RUN poetry config virtualenvs.create false --local
# Copy the environment definition file and install the dependencies
# (--no-root because there is no package source to install in the image)
COPY pyproject.toml /
RUN poetry install --no-root

# Configure the kernel
RUN python -m ipykernel install --sys-prefix

# Download spacy models for German and French
RUN python -m spacy download de_core_news_sm
RUN python -m spacy download fr_core_news_sm

This bases our Docker image on a predefined one that already has Python 3.9 installed (something we would want anyway). Next, we install poetry, a package manager for Python, which will take care of installing our Python packages for us. Finally, we configure an ipykernel (following advice from here) and download some spacy models for French and German. This last step is entirely optional, of course, and depends on what you want to do with spacy later on. As I often work with French and German texts, these basic models for the two languages are a reasonable start for most of my projects.
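As a quick sanity check that the models actually made it into the image, you could run a snippet like this inside the container (for example, via docker run -it <your image> python):

import spacy

# Load one of the models downloaded in the Dockerfile
nlp = spacy.load("de_core_news_sm")

# Tag a short German sentence to confirm everything works
doc = nlp("Das ist ein kurzer Testsatz.")
print([(token.text, token.pos_) for token in doc])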

In order for poetry to install the packages you need, you have to specify them in a configuration file pyproject.toml. Here is mine:

[tool.poetry]
name = "spacy_sagemaker_container"
version = "0.1.0"
description = "A custom docker container for training spacy models on AWS SageMaker."
authors = ["FILL IN YOUR NAME <your@email.com>"]

[tool.poetry.dependencies]
python = ">=3.9.7,<3.10"
boto3 = "^1.17.51"
ipykernel = "^5.5.3"
sagemaker = "^2.39.0"
sagemaker-training = "^4.2.0"
spacy = "^3.2.4"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.1.6"]
build-backend = "poetry.core.masonry.api"

Of course, you can modify the dependencies and their versions in any way you need, but this configuration works for me at the time of writing.

Push the Docker container to the ECR

Create a new repository on ECR

I used the AWS web interface for this step. There might be better tutorials for just this step, and AWS may have changed its interface and/or naming by the time you read this, but here is (in short) what I had to do:

  1. Sign in to the AWS account you want to run your SageMaker jobs in later.
  2. Go to Services (top-left corner) and search for Elastic Container Registry
  3. Click on Create repository
  4. Fill the blank text box with a name for your new repository (e.g., it could be sagemaker_spacy)
  5. Leave the default settings for the rest and click Create repository
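If you prefer code over clicking, the same repository can also be created with boto3 (the repository name sagemaker_spacy is the example from above):

import boto3

# Create the ECR repository programmatically (equivalent to steps 2-5 above)
ecr = boto3.client("ecr")
response = ecr.create_repository(repositoryName="sagemaker_spacy")

# This URI is what we will push the image to below
print(response["repository"]["repositoryUri"])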

Build and push the Docker container

Install Docker if you haven’t already. Then execute the following shell script:

REGION=<enter your region here>
ACCOUNT_NUM=<your AWS account number>
REPO_NAME=<enter the repository name you have given above>
CONTAINER_NAME=<choose a new name for the container you are about to build>

# login to ECR with your account
aws --region "$REGION" ecr get-login-password | docker login --username AWS --password-stdin "${ACCOUNT_NUM}.dkr.ecr.${REGION}.amazonaws.com"
# build the container
docker build . -t "$REPO_NAME" -t "${ACCOUNT_NUM}.dkr.ecr.${REGION}.amazonaws.com/${REPO_NAME}:${CONTAINER_NAME}"
# push the container
docker push "${ACCOUNT_NUM}.dkr.ecr.${REGION}.amazonaws.com/${REPO_NAME}:${CONTAINER_NAME}"

Write the job submission script

Let’s write up the script to submit a training job.

import boto3
import sagemaker

# 0. Define some data storing constants

S3_BUCKET_DATA = "name of the S3 bucket storing data"
S3_PREFIX = "subfolder in the S3 buckets for the current project"
S3_BUCKET_MODELS = "name of the S3 bucket storing trained models"
EXPERIMENT_NAME = "name of the current experiment"

# 1. Get docker container URI
session = boto3.Session()
region = session.region_name

sts = boto3.client("sts")
account_id = sts.get_caller_identity()["Account"]

repository_name = "FILL IN YOUR REPOSITORY NAME"
container_name = "FILL IN YOUR CONTAINER NAME"

spacy_container_uri = (
    f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{container_name}"
)

# 2. Define Estimator
model = sagemaker.estimator.Estimator(
    image_uri=spacy_container_uri,
    role="SageMaker-ExecutionRole",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size=1,
    entry_point="spacy_train.py",
    source_dir="PATH/TO/DIR/CONTAINING/TRAINSCRIPT",
    dependencies=["spacy_configs"],
    output_path=f"s3://{S3_BUCKET_MODELS}/{S3_PREFIX}/{EXPERIMENT_NAME}",
)

# 3. Define the training data (in S3 bucket)
train_input = sagemaker.TrainingInput(
    f"s3://{S3_BUCKET_DATA}/{S3_PREFIX}/train.spacy",
    content_type="application/zlib",
)
dev_input = sagemaker.TrainingInput(
    f"s3://{S3_BUCKET_DATA}/{S3_PREFIX}/dev.spacy", content_type="application/zlib"
)

# 4. Set the spacy config file and start training
model.set_hyperparameters(config="PATH/TO/SPACY/CONFIGFILE")
model.fit({"train": train_input, "dev": dev_input})

First of all, treat the above script as a snippet rather than a fully production-ready script. I simplified mine to focus on the important parts. Obviously, you can enhance it a lot, for example by adding some handling of command-line arguments, as sketched below.
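Here is a minimal argparse sketch for that; the argument names are my own suggestions, not anything SageMaker prescribes:

import argparse

# All argument names below are just suggestions
parser = argparse.ArgumentParser(description="Submit a spacy training job to SageMaker.")
parser.add_argument("--config", required=True, help="path to the spacy config file")
parser.add_argument("--experiment-name", required=True, help="used in the S3 output path")
parser.add_argument("--instance-type", default="ml.m4.xlarge", help="SageMaker instance type")
args = parser.parse_args()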

In the first part, I define some constants related to storing of training data and trained model weights. I simply created one (company-wide) bucket for each. The S3_PREFIX then distinguishes between projects by defining a folder structure in these buckets.

After that, I retrieve the necessary information from AWS to build the full URI of the docker container we defined earlier.

Then, we define our training job with SageMaker’s general Estimator class. I only show the most important arguments, but I highly recommend checking out the documentation for more advanced things like defining a job name, setting environment variables, or tagging (e.g., we use tagging for budgeting); there is a small sketch of these after the following list. Arguments like instance_count or instance_type are rather self-explanatory. This is why I want to direct your attention to only three arguments in particular:

  1. entry_point is the name of the training script that is actually run when the training job is submitted (you could call it your training script). We will write that script in the next section.
  2. source_dir is the directory where your entry_point script is stored. I, for instance, run my scripts from the root directory of my project and keep my source code in a src directory with a modular structure that makes sense to me (see also here for more thoughts on data science project structure). So I need to specify with source_dir where my training script can be found.
  3. dependencies is a very useful argument, especially with spacy. It takes a list of paths to other directories that should be copied to the SageMaker instance before executing our training script. I specify a directory spacy_configs here that contains all my spacy config files for the different experiments I want to conduct. In that way, I just need to specify which spacy config file I want to run, e.g., as a command-line argument, and it works as if I were running the training script on my local machine.
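As for the more advanced arguments mentioned above, here is a hedged variant of the Estimator definition. base_job_name, environment, and tags are actual Estimator arguments (check the documentation for details), but the values I put there are made up:

model = sagemaker.estimator.Estimator(
    image_uri=spacy_container_uri,
    role="SageMaker-ExecutionRole",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size=1,
    entry_point="spacy_train.py",
    source_dir="PATH/TO/DIR/CONTAINING/TRAINSCRIPT",
    dependencies=["spacy_configs"],
    output_path=f"s3://{S3_BUCKET_MODELS}/{S3_PREFIX}/{EXPERIMENT_NAME}",
    base_job_name="spacy-train",  # prefix for the auto-generated job name
    environment={"MY_ENV_VAR": "some value"},  # hypothetical environment variable
    tags=[{"Key": "project", "Value": "my-nlp-project"}],  # e.g., for budgeting
)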

In the second-to-last step, I define the training and dev data that I have previously stored in my data S3 bucket. Since I use the binary spacy format, I specify the application/zlib content type, and it works like a charm.
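In case you are wondering how the data gets into the bucket in the first place: one option is the upload_data helper on sagemaker’s Session. A minimal sketch, assuming the local files live in a corpus directory (that path is hypothetical):

import sagemaker

# Upload the binary spacy files to the data bucket under the project prefix
session = sagemaker.Session()
train_uri = session.upload_data("corpus/train.spacy", bucket=S3_BUCKET_DATA, key_prefix=S3_PREFIX)
dev_uri = session.upload_data("corpus/dev.spacy", bucket=S3_BUCKET_DATA, key_prefix=S3_PREFIX)
print(train_uri, dev_uri)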

Finally, I set the path to the config file that specifies my training configuration as a hyperparameter (you will see in a minute how we can access this information in the training script). And I submit the training job by calling fit.
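One small tip: by default, fit blocks and streams the training logs to your terminal. If you prefer to fire and forget, you can pass wait=False and, optionally, an explicit job name (the name below is made up; it has to be unique per job):

# Submit the job without waiting for it to finish
model.fit(
    {"train": train_input, "dev": dev_input},
    wait=False,
    job_name="spacy-train-my-experiment-001",  # hypothetical, must be unique
)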

Write the actual training script

After we have correctly configured our training job, we simply need to access the configuration in our training script. Thanks to the excellent library sagemaker_training, this is very easy:

import os

from sagemaker_training import environment
from spacy.cli.train import train

# Read the training environment that SageMaker sets up inside the container
env = environment.Environment()

# The "config" hyperparameter holds the path to the spacy config file;
# all remaining hyperparameters are treated as config overrides below
config_path = env.hyperparameters.pop("config")

# Point spacy to the train/dev data that SageMaker copied from S3
# into the channel input directories
overrides = {
    "paths.train": os.path.join(env.channel_input_dirs["train"], "train.spacy"),
    "paths.dev": os.path.join(env.channel_input_dirs["dev"], "dev.spacy"),
}
overrides.update(env.hyperparameters)

# Use GPU 0 if the instance has one, otherwise fall back to the CPU (-1)
use_gpu: int = 0 if env.num_gpus > 0 else -1

# Run spacy training; whatever lands in env.model_dir is uploaded
# to the output S3 bucket when the job finishes
train(
    config_path,
    output_path=env.model_dir,
    use_gpu=use_gpu,
    overrides=overrides,
)

The script makes use of the spacy train command that you would otherwise call directly from the command line. It also mimics the command-line option of overriding parts of the configuration file: the hyperparameters you set in the job submission script are made accessible to the spacy train command as overrides.

This makes it a very generic script, even though we haven’t made use of this functionality in the submission script from the last section.
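To actually make use of it, you could set additional hyperparameters in the submission script; everything except config ends up in overrides and is passed to spacy train. A sketch with made-up override values (note the ** unpacking, since the dotted names are not valid Python identifiers):

# In the job submission script: everything except "config" becomes a config override
model.set_hyperparameters(
    config="PATH/TO/SPACY/CONFIGFILE",
    **{
        "training.max_steps": 2000,  # overrides max_steps in the [training] section
        "training.optimizer.learn_rate": 0.0005,  # nested sections use dots
    },
)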

Submitting a training job

And this is it. Now you can simply run the job submission script and it will execute a spacy training job corresponding to the config file you provide. Note that I did not cover any AWS authentication that might be required.
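Depending on your setup, valid credentials in your environment may be all you need. If you work with named AWS profiles, here is a minimal sketch of how you could wire one into the submission script (profile and region names are placeholders):

import boto3
import sagemaker

# Use a specific AWS profile and region instead of the defaults
boto_session = boto3.Session(profile_name="my-profile", region_name="eu-central-1")
sagemaker_session = sagemaker.Session(boto_session=boto_session)

# Pass this session to the Estimator via its sagemaker_session argument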