Introducing TF-Recommenders

Recently, Google open-sourced TF-Recommenders (TFRS), a Keras API for building recommender systems.

TF-Recommenders is flexible, making it easy to integrate heterogeneous signals such as implicit ratings from user interactions, content embeddings, or real-time context. The library also provides losses specialized for ranking and retrieval, which can be combined for multi-task learning. The developers emphasize ease of use in research as well as robustness for deployment in web-scale applications.

In this post, we demonstrate how straightforward it is to deploy a TFRS model on Kubernetes for a highly scalable recommender system.

Application deployment with Kubernetes

As a reference, we are using the multitask demo example featured in the TF-Recommenders repo.
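That demo combines a retrieval objective with a rating (ranking) objective on shared user and movie towers. The following is a condensed sketch in the spirit of that example, not the exact demo code; unique_user_ids, unique_movie_titles, and the movies dataset are assumed to come from the usual MovieLens preprocessing.

import tensorflow as tf
import tensorflow_recommenders as tfrs

class MultiTaskModel(tfrs.models.Model):
    """Two-tower model trained with a weighted retrieval + rating loss."""

    def __init__(self, rating_weight: float, retrieval_weight: float):
        super().__init__()
        embedding_dim = 32
        # Query tower: user id -> 32-dim embedding.
        self.user_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_user_ids),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dim),
        ])
        # Candidate tower: movie title -> 32-dim embedding.
        self.movie_model = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_movie_titles),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dim),
        ])
        # Small MLP that predicts an explicit rating from both embeddings.
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        # One specialized task (loss + metrics) per objective.
        self.rating_task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()],
        )
        self.retrieval_task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(self.movie_model)
            )
        )
        self.rating_weight = rating_weight
        self.retrieval_weight = retrieval_weight

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.movie_model(features["movie_title"])
        rating_predictions = self.rating_model(
            tf.concat([user_embeddings, movie_embeddings], axis=1)
        )
        rating_loss = self.rating_task(
            labels=features["user_rating"], predictions=rating_predictions
        )
        retrieval_loss = self.retrieval_task(user_embeddings, movie_embeddings)
        # A weighted sum lets one model serve both objectives.
        return self.rating_weight * rating_loss + self.retrieval_weight * retrieval_loss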

This model generates user embeddings from:

  • model.user_model(), saved as a SavedModel

With a user embedding, we generate candidates via approximate nearest neighbors over an index of items:

  • an Annoy index of the embeddings from model.movie_model()

For more details on saving the model and the Annoy index, see our repo; a rough sketch of the export step follows.
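The sketch below assumes the variable names model and movies carry over from training; see the repo for the exact code.

import pickle

import tensorflow as tf
from annoy import AnnoyIndex

embedding_dimension = 32

# Save the query tower so TF-Serving can serve user embeddings.
tf.saved_model.save(model.user_model, "serving/user_model/1")

# Index every movie embedding with Annoy for fast approximate nearest-neighbor
# lookup, and remember which index position corresponds to which title.
content_index = AnnoyIndex(embedding_dimension, "dot")
content_index_to_movie = {}
for idx, title in enumerate(movies.as_numpy_iterator()):
    movie_embedding = model.movie_model(tf.constant([title]))[0].numpy()
    content_index.add_item(idx, movie_embedding)
    content_index_to_movie[idx] = title  # stored as bytes, decoded at serving time

content_index.build(n_trees=10)
content_index.save("content_embedding.tree")
with open("content_index_to_movie.p", "wb") as fp:
    pickle.dump(content_index_to_movie, fp)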

The deployment uses two pods, one for each process:

  • Pod 1: serves model.user_model() using a TF-Serving Docker image
  • Pod 2: serves a Flask API app that takes a user_id as input and returns the top N recommendations, using the Annoy index of model.movie_model() embeddings and the user embedding

Figure: Kubernetes app structure

We begin testing our deployment locally with minikube.

$ minikube start

In a production environment, we would use a container registry to store our Docker images, but here we use locally built images.

The minikube cluster can only find Docker images within its environment, so we configure the local environment to use the Docker daemon inside the minikube cluster.

$ eval $(minikube docker-env)

Next, we build the Docker images for each pod.

A simple Flask app does two things:

  • query our user model server for the user embedding
  • return movie recommendations using the user embedding and indexed movie embeddings

We base our query to the user-model server on the gRPC example from the TensorFlow Serving repo.

We also load a saved Annoy index of our movie-model embeddings and a dictionary that maps Annoy index positions to movie titles.

import pickle

import grpc
import numpy as np
import tensorflow as tf
from annoy import AnnoyIndex
from flask_restful import Resource
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

top_N = 10
embedding_dimension = 32

# Load the Annoy index of movie embeddings
content_index = AnnoyIndex(embedding_dimension, "dot")
content_index.load('content_embedding.tree')

# Load the mapping from Annoy index position to movie title
with open('content_index_to_movie.p', 'rb') as fp:
    content_index_to_movie = pickle.load(fp)

def get_user_embedding(user_id):
    """
    Helper function to query the user-model server for a user embedding
    input: user id
    output: 32-dim user embedding
    """
    # FLAGS.server holds the host:port of the TF-Serving service, as in the gRPC client example
    channel = grpc.insecure_channel(FLAGS.server)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    # Send request
    # See prediction_service.proto for gRPC request/response details.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'user_model'
    request.model_spec.signature_name = 'serving_default'
    request.inputs['string_lookup_1_input'].CopyFrom(
            tf.make_tensor_proto(user_id, shape=[1]))
    result = stub.Predict(request, 10.0)  # 10 secs timeout
    embedding = np.array(result.outputs["embedding_1"].float_val)
    return tf.convert_to_tensor(embedding)

Then, we set up our endpoint that takes a user id and returns a list of the top_N recommended movies.

class Recommender(Resource):
    """
    Flask API that returns a list of top_N recommended titles for a user id
    input: user id
    output: list of top_N recommended movie titles
    """
    def get(self, user_id):
        user_recs = {"user_id": [], "recommendations": []} 
        user = tf.convert_to_tensor(user_id, dtype="string")
        query_embedding = get_user_embedding(user)
        # get nearest neighbor of user embedding from indexed movie embeddings
        candidates = content_index.get_nns_by_vector(query_embedding, top_N) 
        candidates = [content_index_to_movie[x].decode("utf-8") \
                        for x in candidates] # translate from annoy index to movie title
        user_recs["user_id"].append(user.numpy().decode("utf-8"))
        user_recs["recommendations"].append(candidates)
        return user_recs
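For completeness, here is a minimal sketch of how the resource could be wired into the Flask app. The route matches the curl call shown later; the port and module layout are assumptions, not necessarily what our repo uses.

from flask import Flask
from flask_restful import Api

app = Flask(__name__)
api = Api(app)
# user_id arrives as a path parameter, e.g. GET /recommend/1
api.add_resource(Recommender, "/recommend/<string:user_id>")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # port is an assumption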

Now we build the Docker image for our Flask app. See our Dockerfile.

$ docker build -f Dockerfile -t recommender-app:latest .

For the user-model server, we use TF-Serving as our base image and build an image containing our user model. More detailed instructions here.
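Following the standard TF-Serving Docker workflow, this roughly amounts to copying the SavedModel into the stock tensorflow/serving image and committing the result (the paths and image tag below are assumptions):

$ docker run -d --name serving_base tensorflow/serving
$ docker cp ./serving/user_model serving_base:/models/user_model
$ docker commit --change "ENV MODEL_NAME user_model" serving_base user-model:latest
$ docker kill serving_base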

Now, each pod has a deployment configuration and a service configuration.

The deployment section references the docker images we’ve built locally. The service section configures how the apps will interface with each other and with outside requests.

The TF-Serving pod should only be accessible to the Flask app, which runs in the same cluster, so we expose it internally with a ClusterIP service.

In contrast, the Flask app serves requests from clients outside the cluster, so we expose it with a LoadBalancer service, which assigns an external IP. This also allows flexible scaling of the pod to handle more requests.
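As a sketch, the Service sections of the two manifests could look something like this; the selectors and the Flask target port are assumptions, while the service names and ports match the kubectl output below.

apiVersion: v1
kind: Service
metadata:
  name: user-model-service
spec:
  type: ClusterIP        # reachable only from inside the cluster
  selector:
    app: user-model
  ports:
    - port: 8500         # TF-Serving gRPC port
      targetPort: 8500
---
apiVersion: v1
kind: Service
metadata:
  name: recommender-service
spec:
  type: LoadBalancer     # exposed outside the cluster with an external IP
  selector:
    app: recommender-app
  ports:
    - port: 6000
      targetPort: 5000   # assumed Flask port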

Deploying the full app is simple using kubectl with our minikube cluster. We can deploy both pods with:

$ kubectl apply -f recommender-app.yaml
$ kubectl apply -f user-model.yaml

We can check their statuses with:

$ kubectl get deployments

NAME              READY   UP-TO-DATE   AVAILABLE   AGE
recommender-app   1/1     1            1           1m
user-model        1/1     1            1           1m

And we can see the services running with:

$ kubectl get services

NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
kubernetes            ClusterIP      10.96.0.1        <none>        443/TCP          10m
recommender-service   LoadBalancer   10.105.151.169   <pending>     6000:30242/TCP   1m
user-model-service    ClusterIP      10.96.196.146    <none>        8500/TCP         1m

Because we are deploying locally with minikube, the external IP for recommender-service shows as <pending> above; to expose the service, simply run:

$ minikube service recommender-service

|-----------|---------------------|-------------|---------------------------|
| NAMESPACE |        NAME         | TARGET PORT |            URL            |
|-----------|---------------------|-------------|---------------------------|
| default   | recommender-service |        6000 | http://192.168.49.2:30242 |
|-----------|---------------------|-------------|---------------------------|

This returns a URL we can curl. Now query the Flask API app like so:

$ curl 192.168.49.2:30242/recommend/1   # get recommendations for user 1

And the server will return something like this:

{
    "user_id": [
        "1"
    ],
    "recommendations": [
        [
            "Toy Story (1995)",
            "Jumanji (1995)",
            ...
        ]
    ]
}

After successfully testing our Kubernetes deployment locally, we can clean up.

$ kubectl delete services recommender-service user-model-service
$ kubectl delete deployments recommender-app user-model
$ minikube stop
$ minikube delete

With tools like Kubernetes and libraries like TF-Recommenders and Annoy, building fast, scalable recommender systems becomes much simpler. TF-Recommenders in particular streamlines the integration of heterogeneous signals into neural-network-based recommendations.