
User labels to swap Cloud Dataproc clusters

Confidently upgrade Spark & Hadoop, run A/B tests, and split work between clusters with Cloud Dataproc user labels

2 min read

Sometimes you need to swap Apache Spark and Apache Hadoop clusters without disrupting existing workflows and applications. For example, you may want to swap out a cluster running Spark or Hadoop in order to:

  • Try out or upgrade to new versions of Spark and Hadoop
  • Run A/B tests between clusters to test performance or other changes
  • Split work between clusters

Doing this can require complicated or brittle orchestration logic and may involve downtime. With Cloud Dataproc user labels, however, it can be easy to swap clusters or share work between clusters.

As an example, say you want to denote your production clusters and only submit work to those clusters. This is easy to do. First, you need to create a cluster with a label to indicate it is a production cluster:

gcloud dataproc clusters create <cluster_name> \
    --labels environment=production

Now you can list the active clusters that carry this label:

gcloud dataproc clusters list \
    --filter "status.state=ACTIVE AND labels.environment=production"
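For scripted submission, the same filter can feed a job submission command. The following is a minimal sketch, not a definitive implementation. The function names are hypothetical, the "take the first match" rule is only a placeholder for your own selection logic, and it assumes a default region is already set (for example with gcloud config set dataproc/region).

```shell
# Print the names of all active clusters labeled environment=production,
# one per line. Assumes a default Dataproc region is configured.
list_production_clusters() {
  gcloud dataproc clusters list \
    --filter "status.state=ACTIVE AND labels.environment=production" \
    --format "value(clusterName)"
}

# Placeholder selection logic (an assumption): take the first name on stdin.
pick_cluster() {
  head -n 1
}

# Submit a PySpark job file ($1) to whichever cluster was selected.
submit_to_production() {
  job_file="$1"
  cluster="$(list_production_clusters | pick_cluster)"
  if [ -z "$cluster" ]; then
    echo "no active production cluster found" >&2
    return 1
  fi
  gcloud dataproc jobs submit pyspark "$job_file" --cluster "$cluster"
}
```

Because the cluster name is resolved at submission time, callers such as `submit_to_production gs://my-bucket/job.py` (a hypothetical path) never need to know which cluster is currently in production.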

If you have multiple clusters with this label, you can implement whatever logic you’d like to select one. When you want to remove this cluster from the production fleet, you only need to remove or modify the label:

gcloud dataproc clusters update <cluster_name> \
    --update-labels environment=offline

Let’s take the case where you want to A/B test a new version of Apache Spark with existing jobs. You could have two clusters with the environment=production label and distribute 5% of the work to the new Spark cluster in your submission tooling (a custom script, Apache Airflow, etc.).
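One way the 5% split might look in a custom submission script is sketched below. The function names and cluster names are hypothetical, and how you tell the two clusters apart (a second label, a naming convention) is left to your tooling. The modulo step introduces a very slight bias, which is usually acceptable for traffic splitting.

```shell
# Print a random integer in 0..99, read from /dev/urandom.
roll_percent() {
  echo $(( $(od -An -N2 -tu2 /dev/urandom) % 100 ))
}

# Print the candidate cluster ($2) for roughly $3 percent of calls
# (default 5), and the stable cluster ($1) otherwise.
pick_weighted() {
  stable="$1"
  candidate="$2"
  percent="${3:-5}"
  if [ "$(roll_percent)" -lt "$percent" ]; then
    echo "$candidate"
  else
    echo "$stable"
  fi
}

# Example use (hypothetical cluster names and job path):
#   gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
#       --cluster "$(pick_weighted spark-stable spark-candidate 5)"
```

Raising the percentage over time turns the same function into a gradual rollout, and setting it to 0 sends everything back to the stable cluster.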

This also lets you gracefully drain a cluster before deleting it: once the label is removed, your submission tooling stops sending it new work, and the jobs already pending can finish.
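A drain-then-delete step could be sketched roughly as follows. The function names and the polling interval are assumptions, and the sketch presumes that every submitter selects clusters by label, so nothing new arrives once the label is gone.

```shell
# Return once the cluster ($1) has no pending or running jobs.
# Polls every $2 seconds (default 30).
wait_for_drain() {
  cluster="$1"
  interval="${2:-30}"
  while [ -n "$(gcloud dataproc jobs list --cluster "$cluster" \
      --state-filter active --format "value(reference.jobId)")" ]; do
    sleep "$interval"
  done
}

# Take the cluster out of rotation, wait for in-flight work, then delete it.
retire_cluster() {
  cluster="$1"
  gcloud dataproc clusters update "$cluster" --remove-labels environment &&
    wait_for_drain "$cluster" "${2:-30}" &&
    gcloud dataproc clusters delete "$cluster" --quiet
}
```

Calling `retire_cluster <cluster_name>` deletes the cluster only after the label removal and the drain have both succeeded, so a failed update leaves the cluster untouched.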

This means you needn’t change your job submission logic every time you want to change clusters!
