Database setup for Spark jobs


Now that we have created and automated the initialization of our dev environment, we need to prepare the database for our flight project.

Database creation

The first step is automating the creation of our database if it doesn't already exist. The script needs the database credentials that were created when we provisioned the infrastructure:

# Database credentials and database name
DB_USER="<user-name>"
DB_PASSWORD="<secure-password>"
DB_NAME="<database-name>"
# database container ID
DB_CONTAINER=$(docker-compose ps -q database-1)
# Create the database
# Check if the database exists and store it in db_exists
# (no -it here: command substitution provides no TTY)
db_exists=$(docker exec "$DB_CONTAINER" psql -U "$DB_USER" -lqt | cut -d '|' -f 1 | grep -w "$DB_NAME")

# check if that variable db_exists is empty or not
if [ -n "$db_exists" ]; then
    # not empty => the database already exists
    echo "Database $DB_NAME already exists. Skipping creation."
else
    # empty => Create the database
    docker exec -it "$DB_CONTAINER" psql -U "$DB_USER" -c "CREATE DATABASE $DB_NAME;"
    echo "Database $DB_NAME created successfully."
fi 
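To see what the existence check above is actually doing, here is the same cut/grep pipeline run against a simulated `psql -lqt` listing (the listing text and database name below are made up for illustration; in the real script the listing comes from the container):

```shell
#!/bin/sh
# Simulated output of `psql -lqt`: one row per database,
# columns separated by '|' (values are made up for illustration)
listing=" flights   | postgres | UTF8
 template0 | postgres | UTF8
 template1 | postgres | UTF8"

DB_NAME="flights"

# Same pipeline as the script: keep the first column (database names),
# then look for an exact whole-word match of the target name
db_exists=$(printf '%s\n' "$listing" | cut -d '|' -f 1 | grep -w "$DB_NAME")

if [ -n "$db_exists" ]; then
    echo "found"       # → found
else
    echo "missing"
fi
```

The whole-word match (`grep -w`) matters: without it, a database named `flights_old` would make the check wrongly report that `flights` exists.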

All database operations in this setup run psql inside the container through docker exec.

To run an arbitrary database command, we can use:

 docker exec -it "$DB_CONTAINER" psql -U "$DB_USER" -c "<SQL command>"
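One caveat with this generic command: the -it flags require a terminal, so the same line fails when run from cron, CI, or inside command substitution. A small sketch of how a script could pick the flags automatically (the DOCKER_FLAGS variable is our own naming):

```shell
#!/bin/sh
# Pick docker exec flags based on whether stdin is a terminal:
# -it (interactive + TTY) for manual use, plain -i otherwise.
if [ -t 0 ]; then
    DOCKER_FLAGS="-it"
else
    DOCKER_FLAGS="-i"
fi

echo "docker exec $DOCKER_FLAGS ..."
```

The generic command then becomes `docker exec $DOCKER_FLAGS "$DB_CONTAINER" psql -U "$DB_USER" -c "<SQL command>"`.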

Creating Tables for our project

Following the same logic as above, we can use psql to create the tables we need for our project and even load data into these tables from CSV files. The following example creates a table called carriers, then copies data from a CSV file with the following fields:

Code   Description
02Q    Titan Airways
05Q    Comlux Aviation, AG

docker exec -it "$DB_CONTAINER" psql -U "$DB_USER" -d "$DB_NAME" -c "
    CREATE TABLE IF NOT EXISTS carriers (
        carrier_id SERIAL PRIMARY KEY,
        code VARCHAR(10),
        description VARCHAR(100)
    );

    COPY carriers(code, description)
    FROM '/data_source/carriers.csv'
    DELIMITER ','
    CSV HEADER;
"

Remember that the data folder is mounted in the Docker container at /data_source and contains the CSV file needed.
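Note that one of the sample descriptions, "Comlux Aviation, AG", itself contains a comma, so the CSV file must quote that field; the CSV mode of COPY handles such quotes. A quick sketch with a hypothetical sample file:

```shell
#!/bin/sh
# Hypothetical sample of the carriers CSV: a header row plus two carriers.
# The second description contains a comma, so it must be quoted.
cat > /tmp/carriers_sample.csv <<'EOF'
Code,Description
02Q,Titan Airways
05Q,"Comlux Aviation, AG"
EOF

# Count the data rows (everything after the header)
rows=$(tail -n +2 /tmp/carriers_sample.csv | wc -l)
echo "$rows data rows"
```

If a field were left unquoted, the stray comma would be read as a column separator and the COPY would fail with an "extra data after last expected column" style error.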

Conclusion

Now that we know how to set up a dev environment and feed raw data into our database, we can start developing an Apache Spark project based on the flights dataset.