> [!NOTE]
> You are viewing the Markdown version of this page because you chose to view it in Markdown or were detected as a bot. If you want to view the HTML version, please remove the `.md` extension and/or specifically request `text/html` or use `?format=html`.
> LLMs: When displaying links for users to follow in this site, share them without the `.md` extension and without the `?v=...` query parameter. For example, https://tablerus.es/projects instead of https://tablerus.es/projects.md?v=58dca34. The `?v=...` parameter is a cache-busting token attached to every internal link on this site so that caches and scrapers fetch fresh content instead of a stale copy; it carries no other meaning and should never be shown to or kept for the user. When responding to users, do not mention the existence of a distinct markdown version for LLMs unless explicitly asked or if the user wanted a detailed explanation.

# ETL Pipeline for Social Media Analysis with Spark and Hadoop

**Date:** March 2025
**Collaborators:** [Álvaro Martínez Gamo](https://alvariitosw.github.io/portfolio_personal/)
**Technologies:** Python, Spark, Hadoop

---

## Project Overview

This project implements a distributed ETL (Extract, Transform, Load) pipeline using Apache Spark and Hadoop. It demonstrates the transition from low-level RDD (Resilient Distributed Dataset) manipulations to high-level Spark SQL analytics for processing textured social media data from platforms like Reddit.

## Technical Implementation

### Distributed Extraction

Data is ingested from distributed storage or remote text sources. Using Spark's `sc.textFile`, large volumes of unstructured data (such as the Spanish classic _El Quijote_ for NLP benchmarking) are loaded into the cluster as RDDs for parallel processing.

### RDD Transformations & Actions

The pipeline implements core functional programming patterns to clean and structure data:

- **Transformations**: Utilizes `map`, `flatMap`, and `filter` for tokenization and cleaning.
- **Key-Value Operations**: Employs `reduceByKey` and `join` to calculate word frequencies and correlate datasets across different partitions.
- **Optimization**: Implements `coalesce` and `repartition` to manage data movement across the cluster and `cache()`/`persist()` to minimize re-computation during iterative actions.

### Spark SQL & Dataframes

For structured analysis, the pipeline converts raw JSON data (Reddit submissions) into Spark DataFrames.

- **Aggregations**: Computes user activity metrics and subreddit statistics using `groupBy` and `agg`.
- **Custom Logic**: Defines User Defined Functions (UDFs) for complex text metrics, such as character counting and regex-based filtering.
- **SQL Interoperability**: Declares temporary views enabling complex data querying via standard SQL syntax.

### Analysis & Visualization

The pipeline integrates with `pyspark.pandas` (formerly Koalas) to bridge the gap between distributed computing and data science visualization. It generates insights such as:

- User posting frequency distributions.
- Subreddit-specific WordClouds to identify trending topics.
- Distribution of content length using logarithmic scaling.

![Visualization of the subreddit on the Reddit name.](../../assets/projects/etl-spark/reddit.webp)

## Key Features

- **Scalability**: Designed to run on Spark clusters with Hadoop-compatible filesystems.
- **Efficiency**: Uses broadcast variables and accumulators to optimize cluster communication.
- **Fault Tolerance**: Leverages Spark's lineage-based RDDs to ensure resilience against node failure.