
Data processing Chatelet Station Air Quality

This small project focuses on the use of Apache Spark to learn more about data processing.

Reference: https://spark.apache.org/.

Overview

In this project, we'll explore Apache Spark. We'll use PySpark, the Python API for Spark (see the docs for details), to work with a dataset and perform various data processing tasks.

Steps followed

0. Installation & Setup

Website: https://spark.apache.org/downloads.html (or use pip install)

pip install pyspark
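
To check that the installation worked, a quick sanity check that just prints the installed version:

python -c "import pyspark; print(pyspark.__version__)"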

1. Write a Python script to perform data processing tasks

For the dataset, I chose the Chatelet Station Air Quality.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create a Spark configuration and context
conf = SparkConf().setAppName("DataProcessingChateletStationAirQuality")
sc = SparkContext(conf=conf)

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# ...
# Use of select, where, filter, sort, when, expr, col, cast, alias, avg...
# ...
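
To make the placeholder above concrete, here is a minimal sketch of the kind of transformations involved, continuing from the session created above. The file name and the column names (DATE/HOUR, NO2) are assumptions for illustration, not the actual dataset schema.

from pyspark.sql.functions import avg, col, expr, when

# Read the dataset (inferSchema is convenient but not always reliable)
df = spark.read.csv("chatelet-station-air-quality.csv", header=True, inferSchema=True)

# select / alias / cast: pick columns and coerce types explicitly
clean = df.select(
    col("`DATE/HOUR`").alias("date_hour"),
    col("NO2").cast("float").alias("no2"),
)

# where + sort: keep valid rows, highest values first
top = clean.where(col("no2").isNotNull()).sort(col("no2").desc())

# when + expr: derive a label column from a condition
labelled = top.withColumn("level", when(expr("no2 > 40"), "high").otherwise("ok"))

# avg: aggregate over the whole dataset
labelled.select(avg("no2").alias("avg_no2")).show()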

2. Run the Python script

python data_processing_chatelet-station-air-quality.py
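
Alternatively, assuming a local Spark installation with spark-submit on the PATH, the script can be handed to Spark directly:

spark-submit data_processing_chatelet-station-air-quality.py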

Where I got a bit stuck / Interesting points

  • Even if the inferSchema parameter is set to True while reading the CSV file, Spark may still fail to infer the right data type. It was therefore useful to cast columns explicitly into another data type (String -> Float), as shown in the sketch after this list.
  • Columns may have unusual names such as DATE/HOUR. In SQL queries, backticks are essential to ensure the entire string is treated as a single column identifier.
  • The ~ operator in the filter function is handy, as it inverts a condition.
  • The first() aggregate function returns the first value in a group as a Row. A convenient trick is to give the value an alias and retrieve it as a dictionary key (e.g. the examples with the 7th column in the Python file).
  • The * operator unpacks the elements of a list so they can be passed as individual arguments to a function.
  • Be careful about the amount of data manipulated, that is, mind what runs on the worker nodes versus the driver node. Using collect() can be risky because the driver machine may run out of memory.
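
A minimal sketch pulling several of these points together. It assumes the same hypothetical file and column names as above (chatelet-station-air-quality.csv, DATE/HOUR, NO2); the actual dataset schema may differ.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("chatelet-station-air-quality.csv", header=True, inferSchema=True)

# Explicit cast when inferSchema guesses wrong (String -> Float)
df = df.withColumn("NO2", col("NO2").cast("float"))

# Backticks quote an unusual column name in a SQL query
df.createOrReplaceTempView("air_quality")
spark.sql("SELECT `DATE/HOUR`, NO2 FROM air_quality").show(5)

# ~ inverts a condition: keep rows where NO2 is NOT null
df.filter(~col("NO2").isNull()).show(5)

# first() yields a Row; alias the value and read it back as a dictionary key
row = df.agg(first(col("NO2")).alias("first_no2")).collect()[0]
value = row["first_no2"]

# collect() pulls every row onto the driver, which may run out of memory;
# prefer show(), take(n), or writing results out for large data
sample = df.take(10)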

Extra: Setup of pre-commit

pip install pre-commit
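
The hooks are declared in a .pre-commit-config.yaml file at the repository root. As a starting point, a minimal config might look like this (these hooks are common choices, not necessarily the ones used in this repo):

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer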

Once the .pre-commit-config.yaml file is complete, we need to set up the git hook scripts.

pre-commit install
