TL;DR: This post walks through the full lifecycle of a profanity detection model: training with sklearn, serving with FastAPI and BentoML, containerizing with Docker, automating CI/CD with GitHub Actions, and monitoring with Prometheus, Grafana, and Evidently. It’s a hands-on guide to turning a simple ML model into a production-grade service.

The repository with the corresponding PyTorch implementation is available here.


People forget that in machine learning engineering, a good chunk of the work happens before modeling (cleaning, pre-processing, etc.), and just as much happens after training. Serving and deploying the model lets users interact with it safely, instead of you sending someone a .pth file so they can load it and run inference under torch.no_grad().

In this blog post I want to focus on what happens after the machine learning itself: the part that actually lets users touch ML models at scale.

Here’s the high-level game plan:

  1. Attacking a real problem

The simplest model that I could think of which still has some use is profanity detection: it’s not that hard to train and anyone can hit the API with any string. We also only need a tiny corpus, and binary labels keep the training loop simple enough that we can focus on the production side instead of drowning in feature engineering or other deep learning nuances.

The dataset I want to use is the Davidson 2017 “Hate/Offensive” Tweets set, which has ~25,000 labeled, single-line tweets.

Since it’s binary we can map:

  • hate + offensive speech → 1
  • anything else → 0

  2. Train & lock the artifact

We’ll run a simple training loop (we can aim for a decent F1 but that’s not the main priority here) and save the model as a .joblib.

  3. Wrapping it in an API

This is the most important part. I want to use FastAPI and BentoML: FastAPI for clean routing and request handling, and BentoML for model management, packaging, and scalable deployment.

I’ve seen people only use FastAPI but I feel that you don’t get full model versioning support or multi-model serving if you skip BentoML.

  4. Package it

I’ll show both approaches (a plain Dockerfile and bentoml build) for creating an image and pushing it to a registry.

  5. Shipping

We have a few options: Fly.io (Heroku, but for Docker), Northflank (a PaaS built for containers and microservices), ECS (reliable, but I don’t want to pay), and K8s (overkill for this).

Fly.io seems the most viable option here because it has the least overhead.

  6. Wire CI/CD

We’ll use GitHub Actions to build our image and deploy on every push to main.

  7. Observability

We’ll implement logging, latency and traffic metrics, and model-drift hooks.

  8. Closing steps

A/B rollouts, feature flags, and a retraining loop.

Now that we have a clear idea of what we are going to do, let’s get started.


1) Pretraining & Training

Deep learning is definitely overkill here, so I went with TF-IDF (Term Frequency - Inverse Document Frequency) plus Logistic Regression in sklearn. Essentially, TF-IDF gives high scores to words that are common in a document but rare across the dataset; those are the words that usually carry the most signal.

Let’s take a quick example with movie reviews:

  • Suppose we have 1000 movie reviews with 100 words each (very hypothetical).
  • The word “amazing” appears 10 times in one review, and in only 20 of the 1000 reviews.
  • TF = 10 / total words in that review = 10 / 100 = 0.1
  • IDF = log(1000 / 20) = log(50) ≈ 1.7 (using a base-10 log)
  • TF-IDF = 0.1 * 1.7 = 0.17 so “amazing” gets a decently high score.

But if the word was “the” (which is in basically every review), IDF would be near 0, so its TF-IDF score would be ~0.0005 → not meaningful.
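
If you want to sanity-check that arithmetic, here’s a tiny snippet that mirrors the made-up numbers above (note that sklearn’s TfidfVectorizer uses a smoothed natural log internally, so its exact values will differ):

import math

tf = 10 / 100                          # "amazing": 10 occurrences in a 100-word review
idf = math.log10(1000 / 20)            # appears in 20 of the 1000 reviews
print(round(tf * idf, 2))              # ≈ 0.17

idf_the = math.log10(1000 / 995)       # "the" appears in almost every review
print(round((10 / 100) * idf_the, 4))  # ≈ 0.0002, effectively noise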

This method is great for spam detection, product categorization based on descriptions, basic sentiment analysis, and in our case, profanity/toxic comment classification.

Here is what our loop would look like after we import our data:

train.py

import pandas as pd, joblib
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.read_csv("tweets.csv") # columns: text, label
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1,2))),
    ("clf",  LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)
print("val-acc:", pipeline.score(X_val, y_val))
joblib.dump(pipeline, "profanity.joblib")

So we now have our model saved as profanity.joblib and can move on to putting it into production. In the repository I’ve added eval metrics like F1 and a confusion matrix, but they aren’t the focus here.
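
Before wiring anything up, it’s worth a ten-second sanity check that the artifact loads and predicts (the sample strings here are just placeholders):

import joblib

pipe = joblib.load("profanity.joblib")
samples = ["have a great day", "some text you expect to get flagged"]
for text, proba in zip(samples, pipe.predict_proba(samples)[:, 1]):
    print(f"{proba:.3f}  {text}")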

2) Serving

I’ll start by showing how to use FastAPI on its own, then introduce BentoML to show how it supercharges the workflow.

FastAPI

We’d write a Python script that exposes our API, plus a Dockerfile to serve it.

service.py

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

model = joblib.load("profanity.joblib")
app = FastAPI()

class Req(BaseModel):
    text: str

@app.post("/predict")
def predict(r: Req):
    proba = float(model.predict_proba([r.text])[0][1])  # cast numpy float for JSON serialization
    return {"is_profane": proba > 0.5, "confidence": proba}

dockerfile

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "service:app", "--host", "0.0.0.0", "--port", "8000"]

Not bad right? Here’s how BentoML simplifies it though:

service.py

import bentoml
from bentoml.io import JSON
import joblib

# register the trained pipeline in BentoML's model store
# (you could also do this at the end of train.py instead of at import time)
pipe = joblib.load("profanity.joblib")
model_ref = bentoml.sklearn.save_model(
    "profanity_detector",
    pipe,
    signatures={"predict_proba": {"batchable": False}},  # expose predict_proba on the runner
)
runner = model_ref.to_runner()

svc = bentoml.Service("profanity_api", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(payload):
    text = payload["text"]
    proba = float(runner.predict_proba.run([text])[0][1])
    return {"is_profane": proba > 0.5, "confidence": proba}

Serving locally:

bentoml serve service.py:svc

Packaging:

bentoml build                                                       # reads bentofile.yaml for the build spec
bentoml containerize profanity_api:latest -t profanity_api:latest   # tag the image so docker run finds it
docker run -p 3000:3000 profanity_api:latest

Bento gives you healthz and metrics endpoints out of the box. healthz is useful for load balancers to ping and check whether your service is healthy or needs a restart. metrics is a Prometheus-compatible endpoint that exposes internal stats like request counts, response times, errors, memory usage, and latency histograms. This is really useful if you want to host your own observability service.

Without BentoML, you’d have to implement these endpoints by hand, which gets tedious fast. Bento takes care of all of this, and you instantly have them production-ready.
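
For a sense of what the manual route involves, here’s a rough sketch of the two endpoints in plain FastAPI using the prometheus_fastapi_instrumentator package (one of several ways to do it; treat the details as illustrative):

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# exposes request counts and latency histograms at /metrics
Instrumentator().instrument(app).expose(app)

@app.get("/healthz")
def healthz():
    # a real check might also verify the model is loaded
    return {"status": "ok"}

And that only covers the HTTP layer; anything model-specific is still on you.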

3) Continuous Integration & Continuous Deployment

Here’s the pipeline flow:

  1. Trigger on a push to main (a direct push, or a PR getting merged)
  2. Check out the repository
  3. GitHub builds the Bento
  4. The container gets built and pushed to GHCR
  5. flyctl deploy pulls the image from GHCR and deploys it globally

name: Deploy ML endpoint

on:
  push:
    branches: [main]
  workflow_dispatch: {}

jobs:
  build-push:
    runs-on: ubuntu-latest

    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - uses: bentoml/setup-bentoml-action@v1
        with:
          python-version: "3.11"
          cache: pip

      - name: Build Bento
        run: bentoml build

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Containerize & push
        uses: bentoml/containerize-push-action@v1
        with:
          bento-tag: profanity_api:latest
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/profanity-api:latest

      - name: Install flyctl
        uses: superfly/flyctl-actions/setup-flyctl@master

      - name: Deploy to Fly.io
        run: flyctl deploy --image ghcr.io/${{ github.repository_owner }}/profanity-api:latest --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

4) Observability

With BentoML, every service already exposes /metrics in Prometheus format without any extra code (which is extremely useful).

We can easily see what it is returning with the following:

curl http://localhost:3000/metrics | head
# HELP http_request_duration_seconds ...
# TYPE http_request_duration_seconds histogram

All this data is already in Prometheus format, which makes everything easier down the line.

We can quickly spin up Prom + Grafana:

docker compose up -d prometheus grafana

And now we have observability baked in 🤑

Here’s the overall flow:

  1. BentoML exposes /metrics
  2. Prometheus scrapes + stores time-series data
  3. Grafana queries Prom + visualizes insights

This obviously changes with the scale of your application - I would assume YouTube’s RecSys and similar systems have far more sophisticated observability setups. At its core, though, they are measuring the same things:

  • how their infrastructure is doing (CPU, memory, p95 latency)
  • how their application is behaving (2xx/4xx/5xx counts, request time, queue depth)
  • how the model’s inputs and predictions look (mean probability, class skew, confidence histograms) - a minimal sketch of this last one follows below
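
The infra- and HTTP-level numbers come for free; the model-level ones usually need a few lines in the service itself. Here’s a minimal sketch using prometheus_client (the metric names are made up):

from prometheus_client import Counter, Histogram

# hypothetical model-level metrics, registered once at import time
PREDICTION_CONFIDENCE = Histogram(
    "profanity_prediction_confidence",
    "Predicted probability of the profane class",
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 1.0],
)
PROFANE_PREDICTIONS = Counter(
    "profanity_positive_predictions_total",
    "Number of requests classified as profane",
)

def record_prediction(proba: float) -> None:
    # call this inside the /predict handler after scoring
    PREDICTION_CONFIDENCE.observe(proba)
    if proba > 0.5:
        PROFANE_PREDICTIONS.inc()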

5) Model & Data-Drift Monitoring

Have you ever heard that ML models can fail silently in production? Unlike a software bug, which is usually easy to spot, ML failures show up as the model producing incorrect (or even subtly incorrect) results. This happens when the input data changes and our model isn’t prepared for it.

For our profanity model, that could happen if people start using new words or slurs that the model has simply never seen. We trained on a dataset from 2017; every year since, the profane words and slurs people use have probably shifted slightly. Over the years this effect compounds and can essentially break the model, even if everything else (infra, APIs, etc.) is working well.

A nightly cron + Evidently AI gives you guardrails against this happening.

nightly_check.py

import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

baseline = pd.read_parquet("2025-06-01.parquet")   # reference window
latest = pd.read_parquet("2025-06-28.parquet")     # most recent window

rpt = Report(metrics=[DataDriftPreset()])
rpt.run(reference_data=baseline, current_data=latest, column_mapping=ColumnMapping(target=None))
rpt.save_html("drift.html")

and for scheduling we can do

import schedule, time
schedule.every().day.at("03:00").do(run_data_drift_check)  # wraps the report code above
while True:
    schedule.run_pending(); time.sleep(60)

A lot of people then take this drift.html and store it in S3, which Grafana then embeds so product managers and other non-technical people can view it without having to SSH into a specific machine.

6) Versioning, Rollbacks, Deployment Strategies

This part is very similar to its software counterpart.

1. Model Versioning

Every time a new model is trained, it should be treated like a software artifact: save it with a unique version ID plus metadata such as the training date, dataset version, training-script hash, and val/test metrics (accuracy, F1, drift, etc.).

  • For version tracking: MLflow, Weights & Biases, BentoML
  • For storage: S3, GCS, MinIO

This is very useful because we always know which model is in production, we can compare and debug past versions, and we can reproduce production behavior from months ago.
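
Since we’re already on BentoML, its model store can carry this metadata for us; here’s a sketch (the labels, metadata keys, and values are just examples):

import bentoml, joblib

pipe = joblib.load("profanity.joblib")
bentoml.sklearn.save_model(
    "profanity_detector",
    pipe,
    labels={"stage": "candidate", "dataset": "davidson-2017"},
    metadata={
        "trained_on": "2025-06-28",     # training date
        "val_f1": 0.91,                 # illustrative number, not a real result
        "train_script_sha": "abc1234",  # hash of the training script / commit
    },
)
# inspect stored versions later with: bentoml models list profanity_detector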

2. Rollbacks

Even if a new model looks promising in UAT with decent F1 scores, it can suddenly perform poorly in prod (drops in accuracy, increased latency, or user complaints).

To be able to roll back a model instantly, we can just rebind the production endpoint to an older model or roll back the image tag if it was hosted as a Docker container.

This can easily be automated with Prometheus + Grafana + Alertmanager alerts that trigger a rollback to a previous version through the CI/CD pipeline.

Here’s how simple it can be:

# assume previous model version was "profanity_api:v2" and v3 is failing in prod

# first pull the stable model
docker pull ghcr.io/your-org/profanity_api:v2

# then stop the current container
docker stop profanity-api
docker rm profanity-api

# finally re-run with previous version (v2)
docker run -d --name profanity-api -p 3000:3000 ghcr.io/your-org/profanity_api:v2

3. Deployment Strategies

Blue-Green

Deploy the new model to a parallel env while the old one is still serving. After validation, switch all traffic over.

Canary

Slowly roll out the new model to a small subset of users (5-10%) while monitoring for issues. If things look good, gradually increase traffic.

A/B testing

Route users randomly to different model versions (A vs B) and compare performance (CTR, conversions, etc.). Best for evaluating new models based on real user outcomes.

Multi-Armed Bandit (MAB)

A smarter version of A/B testing: route more traffic to the better-performing model as results come in, dynamically allocating traffic based on current performance.
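
As a toy illustration of the idea (real setups would use a proper bandit library and statistical guardrails), an epsilon-greedy router might look like this:

import random

class EpsilonGreedyRouter:
    """Routes requests to the variant with the best observed reward."""

    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {v: {"reward": 0.0, "count": 0} for v in variants}

    def choose(self):
        # explore occasionally, otherwise exploit the current best variant
        if random.random() < self.epsilon:
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda v: self.stats[v]["reward"] / max(self.stats[v]["count"], 1))

    def record(self, variant, reward):
        # reward could be a click, a thumbs-up, or "prediction not overridden"
        self.stats[variant]["count"] += 1
        self.stats[variant]["reward"] += reward

router = EpsilonGreedyRouter(["profanity_api:v2", "profanity_api:v3"])
variant = router.choose()           # route this request to `variant`
router.record(variant, reward=1.0)  # ...and log the outcome afterwards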

Shadow Deployment (my favorite; the tradeoff is that you’re wasting compute)

Spin up the new model and send real production traffic to both models (old and new), but never return the new model’s results to users. It shadows the main model, letting you compare logs and behavior.
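
Here’s a rough sketch of the pattern as a thin FastAPI proxy using httpx (the endpoint URLs are hypothetical; the key detail is that the shadow call never affects the user-facing response):

import asyncio, logging, httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
PROD_URL = "http://prod-model:3000/predict"      # hypothetical endpoints
SHADOW_URL = "http://shadow-model:3000/predict"

class Req(BaseModel):
    text: str

async def call_shadow(payload: dict) -> None:
    # fire-and-forget: log the shadow model's answer, never return it to users
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            r = await client.post(SHADOW_URL, json=payload)
            logging.info("shadow prediction: %s", r.json())
    except Exception as exc:
        logging.warning("shadow call failed: %s", exc)

@app.post("/predict")
async def predict(req: Req):
    payload = {"text": req.text}
    asyncio.create_task(call_shadow(payload))             # mirror traffic to the shadow model
    async with httpx.AsyncClient(timeout=2.0) as client:  # users only ever see the prod answer
        resp = await client.post(PROD_URL, json=payload)
    return resp.json()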

7) Automated Retraining

Automated retraining is how we keep our model fresh as the world changes (data drift, concept drift, new patterns, etc.).

We can set up a pipeline to run on a schedule to:

  1. Pull new data
  2. Clean + process it
  3. Retrain the model
  4. Evaluate it
  5. Upload to registry
  6. Deploy it (using one of the above strategies, if it beats the current prod model; see the sketch below)
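
Glued together, the loop can start out as a single script kicked off by the scheduler. A minimal sketch, where the data path, prod-metric lookup, and promotion step are all placeholders:

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

PROD_F1 = 0.90  # placeholder: pull the current prod model's metric from your registry

def retrain(data_path: str = "latest_tweets.parquet") -> None:
    df = pd.read_parquet(data_path)                       # 1. pull new data
    df = df.dropna(subset=["text", "label"])              # 2. clean + process it
    X_tr, X_val, y_tr, y_val = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )
    pipe = Pipeline([                                     # 3. retrain
        ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_tr, y_tr)
    f1 = f1_score(y_val, pipe.predict(X_val))             # 4. evaluate
    print(f"candidate F1={f1:.3f} vs prod={PROD_F1}")
    if f1 > PROD_F1:                                      # 5/6. promote only if it beats prod
        joblib.dump(pipe, "profanity_candidate.joblib")   # then upload + deploy via CI/CD

if __name__ == "__main__":
    retrain()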

A potential stack that we could use is:

  • Scheduler: Airflow (or cron if it’s simple)
  • Storage: S3, GCS, or BigQuery
  • Eval: sklearn.metrics or W&B
  • Registry: GHCR or MLflow

Given this, we form a feedback loop:

Data drift → Model degrades → Pipeline retrains → Deploy an updated model

All this keeps your system continuously learning without human intervention.

This “self-updating model factory” is essential as you scale your model. At tech companies like Google, Meta, or Snap, this loop can run daily or weekly. Some teams even have an “auto-promote” trigger (similar to the rollback automation we saw above) that deploys the new model if it beats the current prod model in eval tests.


Deploying machine learning models isn’t really about beating your previous F1 or writing highly optimized fused CUDA kernels to make inference as fast as possible. It’s about building the systems around your model that make it work reliably in the real world. I personally feel that because these models can fail “silently,” supporting an ML model takes twice the effort of supporting an ordinary software feature.

Take a look at this repository to see the full code.