Soon after I started as a data scientist at an early-stage startup, I was tasked with helping productionalize and deploy analytical models as we ramped up more and more clients. The first few deploys were rushed but relatively easy - just set up the root environment with all the dependencies, and have a simple bash script call the code on a schedule.
Then, as time went on, things started to get messy. We added more data scientists, who all developed in different ways. Some of our earliest models relied on certain versions of packages, while some of the latest models depended on completely different (and sometimes breaking) versions of the same packages.
Our “production” server became incredibly fragile - one accidental update of a package version could (and did) take down multiple processes. We eventually figured out a home-grown solution, but the experience was painful. It led me to take the time, in subsequent roles, to figure out a good, reproducible way to deal with environments and deployments.
The Final Product
The final product will be a Docker project that we can run in development mode and in production mode, switching easily between the two. It’s important to note that all development will occur within a Docker container.
- Development mode will fire up a JupyterLab instance where we can prototype and test code
- Production mode will run a productionalized Python script
- Environment variables control whether we are hitting development or production systems
Additionally, once we're ready for production, we'll have a Docker image that's easily deployable via a variety of different methods.
You can find the complete project here:
Before we dig in - here’s a quick rundown of the different components we’ll rely on heavily.
A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.
Essentially, Docker containers let us bundle together everything we need for a project, right down to the operating system. Better yet, once we have a Docker image we can guarantee that it'll run exactly as we expect on another server or environment.
Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Docker Compose allows us to orchestrate and define how our containers run, using configuration to define things like which ports should be mapped, which commands should be run on startup, and more.
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
Conda handles dependency management for you. Say you want to install version 0.20 of Scikit-Learn - Conda can find everything that specific version needs (both Python and system dependencies) and install it all for you. Better yet, you can define the package versions you want as configuration, allowing you to recreate your Python environment exactly.
Combining Them Together
The core of this all will be our Dockerfile - it’s the configuration that tells Docker how to build our container. We’re going to take advantage of a prebuilt image that Anaconda provides to get started. The environment.yml defines what Python packages we want to install, and the last command simply tells conda to update itself, then install everything we defined.
We’ll have two different versions, with the production version having two additional commands to package directories into the docker image as well.
```dockerfile
# base image
FROM continuumio/miniconda3:4.5.11

# load in the environment.yml file
ADD environment.yml /

# create the environment
RUN conda update -n base conda -y && conda env update
```
```dockerfile
# base image
FROM continuumio/miniconda3:4.5.11

# load in the environment.yml file
ADD ./docker/environment.yml /

# create the environment
RUN conda update -n base conda -y && conda env update

# add the code folder and notebooks folder to the docker image - this allows us to run
# the image directly without relying on docker-compose or local file dependencies
ADD ./code/ /code
ADD ./notebooks /notebooks
```
Here’s an example environment.yml - in it we tell conda to install explicit versions of everything except JupyterLab.
```yaml
name: base
channels:
  - defaults
dependencies:
  - pandas=0.22.0
  - sqlalchemy=1.2.1
  - scikit-learn=0.20.0
  - pyodbc=4.0.23
  - jupyterlab
```
To make running things easier - instead of typing out what ports to map, what volumes to mount, and what command to run every single time as a docker run command - we’ll instead use Docker Compose.
We’ll first create a folder structure similar to this:
```
my_project
│   docker-compose.yml
│   docker-compose.prod.yml
│   .env
│
└───code
│   │
│   └───.. (this will contain all our Python scripts/model assets)
│
└───data
│   │
│   └───.. (this will contain any external data we rely on (or are testing with))
│
└───notebooks
│   │
│   └───.. (this will contain all our Jupyter notebooks during prototyping)
│
└───docker
    │   Dockerfile
    │   environment.yml
```
This will be our docker-compose.yml file. It’s our “sane default” - at worst, all that’ll happen from accidentally running docker-compose up is that JupyterLab will start up (as opposed to a script starting up that hits production systems). In practice I also sometimes end up adding other containers, such as a database container that allows for quick iteration without having to hit external systems.
```yaml
version: "3.2"
services:
  python:
    build:
      dockerfile: ./docker/Dockerfile-dev
      context: ./
    image: my_project_python
    env_file:
      - .env_dev
    ports:
      - "8888:8888"
    command: jupyter lab --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token='local_dev'
    volumes:
      - ./data:/data
      - ./code:/code
      - ./notebooks:/notebooks
```
The docker-compose.prod.yml file will be similar, with the exception of the ports and the command. You’ll notice that it doesn’t mount the local ./notebooks folder anymore (since it’ll be packaged into the image).
```yaml
version: "3.2"
services:
  python:
    build:
      dockerfile: ./docker/Dockerfile-prod
      context: ./
    image: my_project_python # replace with "registry/image" if pushing to a docker registry
    env_file:
      - .env_prod
    command: python /code/example.py
    # note - once in production I usually pull from an API or some other location, if you
    # still expect to pull data from the file system then uncomment the below two lines and
    # replace /real_data_location with the correct path
    #volumes:
    #  - /real_data_location:/data
```
And last - our .env files will contain variables that we want available within the container at runtime. We’ll populate .env_dev with development variables. When deployed, we’ll either pass in environment variables from a similar file (something like .env_prod), or pass them in using something like Airflow.
```
# credentials and database information
db_username=username
db_password=password
db_host=host
db_port=1433
db_name=test

# random seed for reproducible models
random_seed=42
```
Within our Python scripts we can then simply access these variables via os.environ.get('variable_name'). This lets us use sensitive data without hardcoding it in our scripts or storing it in our repository. Better yet, we don’t have to change our code to change the environment we run in - we just change which docker-compose file we run with.
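Putting that together, the configuration section at the top of a script might look like the sketch below. The variable names match the example .env file above; the setdefault lines only simulate what docker-compose's env_file would normally inject, and the fallback defaults are my own addition:

```python
import os

# simulate what docker-compose's env_file would inject - in a real run these
# come from .env_dev / .env_prod, never from code
os.environ.setdefault("db_username", "username")
os.environ.setdefault("db_port", "1433")
os.environ.setdefault("random_seed", "42")

# read connection details from the environment
db_username = os.environ.get("db_username")
db_port = int(os.environ.get("db_port", "1433"))  # env vars are strings - cast as needed

# a missing variable comes back as None, so fail fast with a clear message
if db_username is None:
    raise RuntimeError("db_username is not set - did you forget the env_file?")

random_seed = int(os.environ.get("random_seed", "42"))
```

Failing fast on missing variables makes a misconfigured container die immediately with a readable error, instead of halfway through a scheduled run.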
Make sure to add .env* to your .gitignore file, otherwise you may accidentally commit it to Git.
If you clone down the repo as is, it already includes a default .gitignore that handles this.
How to Use This All
As an example - here’s my normal development process. Using it I can get from development to production with little friction, knowing that my code will work as expected, and that it won’t negatively affect other processes on the production server.
Developing and Packaging
- Clone the template down. Update the environment.yml as needed with packages I know I’ll need, and run docker-compose build. This builds the development image with all the packages I defined installed within it.
- Create a .env_dev file with development environment variables, and a .env_prod file with production environment variables.
- Run docker-compose up and navigate to JupyterLab, which will be running on http://localhost:8888. We can access it by entering the token local_dev (set in the compose file’s command).
- From there, prototype and develop a model/process using Jupyter notebooks, saving any notebooks I create along the way into /notebooks as a development diary. Any final artifacts/models I plan on using in production I save within /code.
- Once I have a final version of my code, save it (and any models it relies on) into /code.
- Update the command section to point to my script’s name, and the image section to point to my docker registry (something like my_registry/my_project:0.1).
- Run docker-compose -f docker-compose.prod.yml build - this builds the production version of the image, packaging everything in the /code and /notebooks directories directly into the image.
- Run docker-compose -f docker-compose.prod.yml push, which pushes that packaged image to my organization’s docker registry.
At this point I now have an image that contains all my code, models, and other artifacts I need, that’s preinstalled with exact versions of the Python packages and dependencies I require. It’s stored in a central location where I can easily pull it down onto other servers.
Within production we have multiple options.
We could simply copy the entire project over as is, run docker-compose -f docker-compose.prod.yml pull to pull down the image, and schedule it in crontab. Make sure to create a .env_prod file in the same location with the environment variables you need. (If you didn’t use a docker registry, just run docker-compose -f docker-compose.prod.yml build first before scheduling it.)
The crontab command would look something like this:
0 22 * * * docker-compose -f /path/to/project/docker-compose.prod.yml up
Alternatively we could run the image directly and pass in the environment variables via the command line or read them in from a central environment file.
0 22 * * * docker run --env-file /env/.env_prod my_project_name python /code/my_script.py
We could do the same thing in Airflow as well - using the Docker Operator or the Bash Operator to run the image on some schedule, passing in the environment variables from Airflow’s central variable store.
The last option - and the one I’ll explore further in an upcoming post - is using MLflow to extend the above framework to both track modelling iterations, and to serve the model as a long-running REST API that we can call to make new predictions.
Note: In practice we’d likely have continuous deployment software in place that handles building the final image and pushing it into the registry once code is checked into version control. I didn’t include that since it’s very organization-specific.