10. PySpark

Apache Spark is an open-source unified analytics engine for large-scale data processing, widely used for big data processing and analytics. It ships with built-in modules for streaming, SQL, machine learning, and graph processing, and provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

In this section, we will explore how to set up and run a Spark cluster using Docker Compose. Adapted from 1.
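To give a feel for the programming model before we containerize anything, the sketch below assumes only a local PySpark installation (pip install pyspark) and runs Spark in local mode; the file and application names are illustrative, and the same code runs on a cluster by pointing the session at a master URL.

pyspark_taste.py
from pyspark.sql import SparkSession

# Local mode: the whole "cluster" runs inside this single process.
spark = SparkSession.builder.master("local[*]").appName("taste").getOrCreate()

# A tiny DataFrame; in practice this would be a large, partitioned dataset.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; the aggregation is executed in parallel
# across partitions when an action (show) is called.
df.groupBy().avg("age").show()

spark.stop()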

Spark Cluster on Docker

The Docker setup for Spark includes a directory structure that organizes configuration files, data, scripts, and logs. Below is an overview of the directory structure and its contents:

πŸ“ docker/
β”œβ”€β”€ πŸ“ config/ # (1)
β”‚   β”œβ”€β”€  log4j2.properties # (2)
β”‚   └──  spark-defaults.conf # (3)
β”œβ”€β”€ πŸ“ data/ # (4)
β”œβ”€β”€ πŸ“ scripts/ # (5)
β”œβ”€β”€ πŸ“ logs/ # (6)
β”œβ”€β”€  compose.yaml #(7)
└──  .gitignore
  1. Contains configuration files for Spark.
  2. Configuration file for Spark logging.
  3. Default configuration settings for Spark.
  4. Directory to store datasets used in Spark examples.
  5. Contains helper scripts to start and manage the Spark environment.
  6. Directory where Spark logs will be stored.
  7. Docker Compose file to configure and start Spark services.

To set up a Spark cluster using Docker Compose, use the following configuration (compose.yaml):

compose.yaml
name: spark-cluster

services:

  spark-master:
    build:
      dockerfile_inline: |
        FROM spark:latest
        USER root
        RUN apt update && apt install -y bash curl wget && \
            python3 -m pip install --upgrade pip && \
            pip3 install numpy pandas matplotlib seaborn scikit-learn requests && \
            rm -rf /var/lib/apt/lists/*
        USER spark
        ENV SPARK_HOME=/opt/spark
        WORKDIR /opt/spark
    container_name: spark-master
    hostname: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_NO_DAEMONIZE=true
      - PATH=/opt/spark/bin:$PATH
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - 8080:8080  # Spark Master Web UI
      - "7077:7077"  # Spark Master port
    volumes:
      - ./data:/opt/spark/data
      - ./apps:/opt/spark/apps
      - ./scripts:/opt/spark/scripts
      - ./config:/opt/spark/conf
      - ./logs:/opt/spark/logs
    networks:
      - spark-network

  spark-worker:
    image: spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1g
      - SPARK_NO_DAEMONIZE=true
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    volumes:
      - ./config:/opt/spark/conf
      - ./logs:/opt/spark/logs
    deploy:
      replicas: 3
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-history:
    image: spark:latest
    container_name: spark-history
    hostname: spark-history
    environment:
      - SPARK_NO_DAEMONIZE=true
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer
    ports:
      - "18080:18080"  # Spark History Server Web UI
    volumes:
      - ./config:/opt/spark/conf
      - ./logs:/opt/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

The log4j2.properties file configures logging for Spark. Below is a sample configuration (log4j2.properties):

log4j2.properties
# Set root logger level to ERROR and attach the console appender
rootLogger.level = error
rootLogger.appenderRef.console.ref = console

# Console appender configuration
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Reduce noise from specific components (optional)
logger.spark.name = org.apache.spark
logger.spark.level = warn
logger.jetty.name = org.eclipse.jetty
logger.jetty.level = warn
logger.akka.name = akka
logger.akka.level = warn

The spark-defaults.conf file sets default configurations for Spark. Below is a sample configuration (spark-defaults.conf):

spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               file:/opt/spark/logs
spark.history.fs.logDirectory    file:/opt/spark/logs
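These defaults are picked up automatically because ./config is mounted at /opt/spark/conf. If you prefer to set them per application, the hedged sketch below shows the equivalent SparkSession configuration, writing event logs to the mounted logs/ directory so finished jobs appear in the History Server (the script and application names are illustrative):

eventlog_example.py
from pyspark.sql import SparkSession

# Equivalent to the spark-defaults.conf entries above; only needed if you
# want to set or override them per application.
spark = (
    SparkSession.builder
    .appName("eventlog-example")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:/opt/spark/logs")
    .getOrCreate()
)

spark.range(1000).count()  # run any job so an event log is written
spark.stop()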

Starting

To start the Spark cluster, navigate to the docker/ directory and run the following command:

Start Spark Cluster
docker compose up -d --build

The configuration defines a Spark master, three worker nodes, and a history server, along with necessary environment variables and volume mounts for configuration files and data.

graph LR
    user@{ icon: ":octicons-person-24:", form: "square", label: "User", pos: "t", h: 60 }
    subgraph Spark Cluster
        direction LR
        master[Master Node]
        worker1([Worker Node 1])
        worker2([Worker Node 2])
        worker3([Worker Node 3])
        history[(History Server)]
    end
    user -->|Submits Jobs| master
    user -->|Accesses UI| history
    master --> worker1
    master --> worker2
    master --> worker3
    master --> history

Accessing

Once the Spark cluster is up and running, you can access the Spark Master UI and the History Server UI through your web browser (using the ports mapped in compose.yaml):

Spark Master UI: http://localhost:8080
Spark History Server UI: http://localhost:18080

These interfaces allow you to monitor the status of your Spark cluster, view running jobs, and analyze job history.

Setup

The whole sample, with the compose file, directory structure, and configuration, can be found at spark-docker.zip (~193 MB).


Exercise

Delivery

📆 Dec 4 🕒 23:59

Individual

Submit the link via Canvas.

To further explore Spark and its capabilities, consider following this comprehensive tutorial that guides you through the basics of Pyspark and how to get started with it:

Pyspark Tutorial: Getting Started with Pyspark (starting from Step 1: Creating a SparkSession)
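When following the tutorial inside this Docker setup, Step 1 can point the SparkSession at the cluster's master instead of local mode. A minimal sketch is shown below; the application name is arbitrary, and you can equally pass --master to spark-submit instead of setting it in code.

session_example.py
from pyspark.sql import SparkSession

# spark-master resolves via the spark-network defined in compose.yaml.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("pyspark-tutorial")
    .getOrCreate()
)

print(spark.version)  # quick sanity check that the session is up
spark.stop()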

Example

The ml.py script in the scripts/ directory provides a basic example of how to use Pyspark for machine learning tasks; it works with the same dataset as the tutorial.

This tutorial covers essential topics such as setting up a Spark environment, loading and processing data, and performing basic data analysis using Pyspark. The dataset used in the tutorial can be found in the data/ directory of the Docker setup. Also, feel free to experiment with the provided scripts in the scripts/ directory to deepen your understanding of Spark's functionalities.
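For orientation only, a script of this kind usually follows the pattern sketched below; the file name, feature columns, and label column are placeholders, not the actual contents of ml.py.

ml_sketch.py
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Placeholder dataset: adjust the file and column names to the tutorial's data.
df = spark.read.csv("/opt/spark/data/dataset.csv", header=True, inferSchema=True)

# Assemble numeric feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(test).select("label", "prediction").show(5)

spark.stop()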

The exercise will help you gain hands-on experience with Spark and enhance your data processing skills using Pyspark.

Executing Scripts Inside the Spark Master Container

To run a script inside the Spark master container, use the following command:

docker exec -it spark-master /opt/spark/bin/spark-submit /opt/spark/scripts/<your_script.py>

Replace <your_script.py> with the name of the script you want to execute. This command allows you to submit Spark jobs directly from within the master node of your Spark cluster.

The results can be accessed through the data/ folder, which is mounted inside the container at /opt/spark/data/.

To plot graphs or visualize results, write the figures to files (e.g., PNG) in the mounted data/ directory so they can be accessed outside the container.
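A common pattern, sketched below, is to aggregate in Spark, collect the small result to the driver with toPandas(), and save the figure into the mounted directory; pandas and matplotlib are installed in the master image by the Dockerfile above, and the file and column names here are placeholders.

plot_example.py
import matplotlib
matplotlib.use("Agg")  # headless backend: the container has no display
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plot-example").getOrCreate()

df = spark.read.csv("/opt/spark/data/dataset.csv", header=True, inferSchema=True)

# Aggregate in Spark, then bring only the small result to the driver.
pdf = df.groupBy("category").count().toPandas()

pdf.plot.bar(x="category", y="count", legend=False)
plt.tight_layout()
plt.savefig("/opt/spark/data/category_counts.png")  # appears in ./data on the host

spark.stop()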

Alternative: You can also run the scripts from inside the Spark master container.

docker exec -it spark-master bash
spark@spark-master:/opt/spark$ spark-submit ./scripts/ml.py
WARNING: Using incubator modules: jdk.incubator.vector
...

Adaptation is needed

Note that running Spark in Docker may require adapting the code to work correctly within the containerized environment: file paths, environment variables, and dependencies must match the Docker configuration.
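For example, where the tutorial reads a CSV from the local working directory or downloads it on the fly, inside the container the file should live in the mounted data/ folder. A brief sketch of that path adaptation (the file name is a placeholder):

path_adaptation.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("path-adaptation").getOrCreate()

# Tutorial-style relative path (works on your laptop):
# df = spark.read.csv("ecommerce.csv", header=True, inferSchema=True)

# Containerized path: the host's ./data directory is mounted here.
df = spark.read.csv("/opt/spark/data/ecommerce.csv", header=True, inferSchema=True)

# Write results back to the mounted directory so they are visible on the host.
df.limit(100).write.mode("overwrite").csv("/opt/spark/data/sample_out")

spark.stop()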

Delivering

The exercise is considered complete when you have successfully set up the Spark cluster using Docker Compose, accessed the Spark UIs, executed at least one Pyspark script from the scripts/ directory, and run the whole tutorial. Additionally, you should be able to analyze the results of your Spark jobs and visualize any outputs as needed.

The exercise can be delivered by sharing a brief report or summary of your experience, including any challenges faced and how you overcame them while working with Spark in a Dockerized environment. The report can include screenshots of the Spark UIs, code snippets from the executed scripts, and any insights gained from the data analysis performed with Pyspark. Add the report to your learning portfolio (GitHub Pages) for future reference.

Criteria

Points Criteria
3 Successfully set up a Spark cluster using Docker Compose.
2 Accessed and demonstrated the Spark Master and History Server UIs.
4 Successfully executed the whole tutorial, including plots and visualizations.
1 Provided a comprehensive report on the experience.