Initialize a Spark dev environment


This mini-project builds on the How to set up a Spark dev environment project. The goal is to create a bash script that takes care of initializing our dev environment.

Depending on your needs, it can be preferable to automate the initialization of your dev environment. Besides bringing up the Docker Compose services, we also need a few parameters that will help us run our jobs later on.

Running the infrastructure

Here we want to run the Docker Compose infrastructure we created in the previous article and keep the container ID to use further down the line.

#! /bin/bash

############################################################
# Infrastructure (Docker)                                  #
############################################################
echo "initializing docker"

docker compose up -d --build
# add the name of your master node container
SPARK_MAIN=spark-main 

# wait until the container is running
while [ -z "$(docker ps -f "name=$SPARK_MAIN" -f "status=running" -q )" ]
do
    echo "waiting for the container to start..."
    sleep 2
done
# if the container is running, store the container ID
CONTAINER_MASTER=$(docker ps -f "name=$SPARK_MAIN" -f "status=running" -q )
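This check is not part of the script itself, but if you want to confirm that the captured ID points at the right container, a small optional sanity check could look like this:

# optional: print the captured ID and the container's name and state
echo "Spark master container ID: $CONTAINER_MASTER"
docker inspect --format '{{.Name}} is {{.State.Status}}' "$CONTAINER_MASTER"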

Copying dependencies to the Spark jars folder

To be able to use PostgreSQL within our jobs, the PostgreSQL JDBC driver jar needs to be in the jars folder inside the container.

docker cp -L ./data/postgresql-42.6.0.jar ${CONTAINER_MASTER}:/opt/bitnami/spark/jars/postgresql-42.6.0.jar
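If you want to double-check that the driver landed where Spark expects it, a quick optional check against the same Bitnami jars path could be:

# optional: confirm the PostgreSQL driver is present in the container's jars folder
docker exec "$CONTAINER_MASTER" ls /opt/bitnami/spark/jars | grep postgresql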

Creating a config file with the URL of the master node

The URL of the master node specifies the location of the Spark master, which coordinates the execution of Spark applications across the cluster. That way we can submit jobs to the master node, which communicates with the workers to produce the end result.

The format of this URL is spark://ip_or_hostname:port_number: the communication protocol is specific to Spark (spark://), the host location is known (ip_or_hostname), and the communication happens on port_number, in our case 7077.

# check docker logs to get the spark URL
config=.env-spark
while [ -z "$(docker logs $SPARK_MAIN 2>&1 | grep 'spark://' |  awk -F 'spark://' '{print $2}')" ]
do
    echo "no IP"
    sleep 2
done
SPARK_MAIN_URL="spark://$(docker logs $SPARK_MAIN 2>&1 | grep 'spark://' |  awk -F 'spark://' '{print $2}')"

echo SPARK_MAIN_IP="$SPARK_MAIN_URL" > "$config"
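With the config file in place, later scripts can source it and point spark-submit at the cluster. A minimal sketch, assuming the .env-spark file generated above and a job already copied into the container (job.py and its path are placeholders; the spark-submit path assumes the Bitnami image layout used earlier):

# read SPARK_MAIN_IP from the generated config file
source .env-spark

# submit a job against the master URL (job.py is a placeholder)
docker exec "$CONTAINER_MASTER" /opt/bitnami/spark/bin/spark-submit \
    --master "$SPARK_MAIN_IP" \
    /opt/bitnami/spark/job.py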

Conclusion

You can make a script based on the above snippets and call it init.sh; each time you run ./init.sh in your terminal you will have a ready-to-use Spark cluster. Next, you can automate the preparation of the database and the data ingestion.
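Wiring it together is just a matter of making the file executable and running it:

# make the script executable (only needed once), then run it
chmod +x init.sh
./init.sh   # brings up the cluster, copies the driver, writes .env-spark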