There may be a need to swap Apache Spark and Apache Hadoop clusters without disrupting existing workflows and applications. For example, there are a few cases where you may want to swap out a cluster running Spark or Hadoop in order to:
- Try out or upgrade to new versions of Spark and Hadoop
- Run A/B tests between clusters to test performance or other changes
- Split work between clusters
Trying to do this can require complicated or brittle orchestration logic and may involve downtime. With Cloud Dataproc user labels, however, it can be easy to swap clusters or share work between clusters.
As an example, say you want to denote your production clusters with a label and only submit work to those clusters. This is easy to do.
First, you need to create a cluster with a label to indicate it is a production cluster:
```shell
gcloud dataproc clusters create <cluster_name> \
    --labels environment=production
```
Now you can list the active clusters that carry this label:

```shell
gcloud dataproc clusters list \
    --filter "status.state=ACTIVE AND labels.environment=production"
```
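As a sketch, a job submission script could feed that filtered list straight into `gcloud dataproc jobs submit`. The cluster selection here simply takes the first match, and `my_job.py` is a placeholder for your own job file:

```shell
# Pick the first ACTIVE cluster labeled as production.
CLUSTER=$(gcloud dataproc clusters list \
    --filter "status.state=ACTIVE AND labels.environment=production" \
    --format "value(clusterName)" | head -n 1)

# Submit work to whichever cluster currently holds the label.
gcloud dataproc jobs submit pyspark my_job.py --cluster "$CLUSTER"
```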
If you have multiple clusters with this label, you can implement whatever logic you’d like to select one. When you want to remove a cluster from the production pool, you only need to remove or modify the label:
```shell
gcloud dataproc clusters update <cluster_name> \
    --update-labels environment=offline
```
Let’s use the case where you want to A/B test a new version of Apache Spark with existing jobs.
In that case you could run two clusters, each with its own label, and distribute 5% of the work to the new Spark cluster in your submission tooling (custom script, Apache Airflow, etc.).
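The routing step of such submission tooling can be sketched in a few lines. The cluster names and the 5% split below are illustrative, not part of any Dataproc API:

```python
import random

# Hypothetical cluster names; substitute your own.
STABLE_CLUSTER = "spark-prod"    # current Spark version
CANARY_CLUSTER = "spark-canary"  # new Spark version under test
CANARY_FRACTION = 0.05           # route ~5% of jobs to the canary


def pick_cluster(rng=random.random):
    """Return the name of the cluster a new job should be submitted to."""
    return CANARY_CLUSTER if rng() < CANARY_FRACTION else STABLE_CLUSTER
```

The chosen name would then be passed to whatever actually submits the job, for example as the `--cluster` argument of `gcloud dataproc jobs submit`.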
This also allows you to gracefully drain a cluster of all pending jobs before deleting it by simply removing the label.
This means you needn’t change your job submission logic every time you want to change clusters!
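A drain-and-delete sequence might look like the following sketch, which relabels a cluster, waits for its active jobs to finish, and then deletes it. The cluster name and the 60-second polling interval are arbitrary choices:

```shell
# Hypothetical cluster being retired.
CLUSTER=old-spark-cluster

# Take the cluster out of rotation so no new work is routed to it.
gcloud dataproc clusters update "$CLUSTER" \
    --update-labels environment=offline

# Poll until no jobs are still active on the cluster.
while [ -n "$(gcloud dataproc jobs list --cluster "$CLUSTER" \
        --state-filter active --format 'value(reference.jobId)')" ]; do
    sleep 60
done

# Safe to delete once the cluster has drained.
gcloud dataproc clusters delete "$CLUSTER" --quiet
```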