This small project uses Apache Kafka and Python to explore event streaming.
Reference: https://kafka.apache.org/.
In this project, we build a real-time social media streaming pipeline with Apache Kafka and Python. The pipeline ingests and processes sample social media posts and engagements, enabling dynamic analyses such as sentiment analysis and engagement metrics calculation.
Website: https://kafka.apache.org/downloads
# Add Kafka to your PATH (e.g. in your .bashrc)
export KAFKA_HOME=/path/to/kafka
export PATH=$KAFKA_HOME/bin:$PATH
# Start ZooKeeper
$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
# Start Kafka
$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
The Python library faker is used to generate fake data.
For more details, see the faker docs or its GitHub repository.
Check the content of generate_social_media_data.py.
apache_kafka_folder/bin/kafka-topics.sh --create --topic social_media_stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
- Producer: a client application that publishes (writes) events to Kafka.
- Consumer: a client application that subscribes to (reads and processes) these events.
The Python library confluent-kafka is used to create basic clients.
For more details, see the confluent-kafka docs or its GitHub repository.
Check the content of kafka_producer.py and kafka_consumer.py.
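A minimal sketch of what kafka_producer.py and kafka_consumer.py might contain (the topic matches the one created above; the group id and helper names are illustrative, and the real scripts may differ):

```python
import json


def serialize(post: dict) -> bytes:
    """Encode a post dict as UTF-8 JSON bytes, the message value format."""
    return json.dumps(post).encode("utf-8")


def delivery_report(err, msg):
    """Delivery callback: report success or failure per produced message."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")


def run_producer(posts, topic="social_media_stream", servers="localhost:9092"):
    # Imported here so the pure helpers above work without a broker running.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": servers})
    for post in posts:
        producer.produce(topic, value=serialize(post), callback=delivery_report)
        producer.poll(0)  # serve delivery callbacks
    producer.flush()      # block until all messages are delivered


def run_consumer(topic="social_media_stream", servers="localhost:9092"):
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": servers,
        "group.id": "social-media-demo",       # illustrative group id
        "auto.offset.reset": "earliest",       # read the topic from the start
    })
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(1.0)  # wait up to 1 s for a message
            if msg is None:
                continue
            if msg.error():
                print(f"Consumer error: {msg.error()}")
                continue
            post = json.loads(msg.value().decode("utf-8"))
            print(f"Received post from {post.get('username')}")
    finally:
        consumer.close()
```

The producer's `poll(0)` call is what triggers the delivery callbacks; forgetting it (or the final `flush()`) is a common reason messages appear to vanish.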
- First, generate the social media data.
- Simultaneously, run the Kafka producer and consumer scripts to see the data ingestion and processing in real time.
pip install pre-commit
Once the .pre-commit-config.yaml is complete, we need to set up the git hook scripts.
pre-commit install
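A minimal .pre-commit-config.yaml might look like this (the hooks below are illustrative defaults, not necessarily the ones this project uses):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```

After `pre-commit install`, these checks run automatically on every `git commit`.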