
# Subsidy Web Scraper

**Date:** December 2024
**Technologies:** Python, Selenium, JavaScript, Cloudflare Workers

---

## Project Overview

A notification system built for ESN UAM to monitor open subsidy applications from the Autonomous University of Madrid and the Autonomous Community of Madrid. It scrapes two public portals, compares results against a persistent state store, and sends formatted HTML emails to the board when new opportunities appear. The system evolved from a local Python script into a serverless Cloudflare Worker running on a daily cron schedule.

## Architecture

The system has two implementations:

| Component             | Technology                          | Role                                          |
| --------------------- | ----------------------------------- | --------------------------------------------- |
| **Local scraper**     | Python + Selenium                   | Headless Chrome scraping with DOM interaction |
| **Serverless worker** | Cloudflare Workers + native `fetch` | Regex-based parsing without browser overhead  |
| **State store**       | JSON file / Cloudflare KV           | Tracks previously seen applications           |
| **Notification**      | SMTP / Cloudflare Email Service     | HTML email to board distribution list         |

### Local Python Pipeline

The script launches a headless Chrome instance, navigates to the UAM scholarships page, and extracts cards marked with the status "Plazo de solicitud abierto". It then diffs the current scrape against `previous.json` and sends an email if new entries are found. The email template uses inline CSS for a clean card-based layout with a title, deadline, and direct link per subsidy.
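The diff step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes `previous.json` holds a mapping from an entry key to its details, and the function names `load_previous`, `find_new`, and `save_state` are hypothetical.

```python
import json
from pathlib import Path


def load_previous(path: Path) -> dict:
    """Return the stored entries, or an empty dict on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def find_new(current: dict, previous: dict) -> dict:
    """Keep only entries whose key is absent from the stored state."""
    return {key: entry for key, entry in current.items() if key not in previous}


def save_state(path: Path, current: dict) -> None:
    """Persist the latest scrape so the next run can diff against it."""
    path.write_text(
        json.dumps(current, ensure_ascii=False, indent=2), encoding="utf-8"
    )


if __name__ == "__main__":
    state_file = Path("previous.json")
    # Assumed shape of a scrape result: key -> details.
    scraped = {"uam:example-call": {"title": "Example call", "deadline": "TBD"}}
    new_entries = find_new(scraped, load_previous(state_file))
    if new_entries:
        print(f"{len(new_entries)} new entries would trigger an email")
    save_state(state_file, scraped)
```

Saving the state only after the email step succeeds would be a reasonable variation, so that a failed send does not silently mark entries as seen.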

### Cloudflare Worker

The production deployment runs as a Cloudflare Worker on a daily 9:00 AM cron trigger. It fetches both the UAM and Comunidad de Madrid portals via native `fetch`, parses the HTML with regex patterns instead of a browser engine, and stores the keys of seen applications in Cloudflare KV. New entries are grouped by source and rendered into an HTML email sent via Cloudflare's Email Service. Parsing with regex avoids Selenium's overhead and the latency of starting a browser, which makes the scraper practical in a sub-second serverless runtime.
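To illustrate the regex-based parsing idea, here is a small sketch. It is written in Python to match the local pipeline, whereas the worker itself runs JavaScript. The card markup below is invented for the example, since the portals' real HTML is not shown here, and `CARD_RE` and `parse_open_cards` are hypothetical names.

```python
import re
from html import unescape

# Hypothetical card markup: one <article> per subsidy with a link and a status.
CARD_RE = re.compile(
    r'<article class="card">\s*'
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="status">(?P<status>[^<]+)</span>\s*'
    r"</article>"
)

OPEN_STATUS = "Plazo de solicitud abierto"


def parse_open_cards(html: str) -> list[dict]:
    """Extract title and URL for every card whose status marks it as open."""
    cards = []
    for match in CARD_RE.finditer(html):
        if match.group("status").strip() != OPEN_STATUS:
            continue
        cards.append(
            {
                "title": unescape(match.group("title")).strip(),
                "url": match.group("url"),
            }
        )
    return cards


SAMPLE = """
<article class="card">
  <a href="/becas/1">Ayudas a asociaciones</a>
  <span class="status">Plazo de solicitud abierto</span>
</article>
<article class="card">
  <a href="/becas/2">Premio cerrado</a>
  <span class="status">Plazo de solicitud cerrado</span>
</article>
"""

if __name__ == "__main__":
    print(parse_open_cards(SAMPLE))
```

A pattern like this is tightly coupled to the page's markup, so a layout change on the portal would require updating the regex. That is the trade-off for dropping the browser.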

<div style="max-width: 400px; margin: 0 auto;">

![Sample email sent with new subsidy opportunities.](../../../assets/projects/esn/subsidy-scraper/email.webp)

</div>

## Design Choices

**Selenium for local, regex for production.** The UAM portal uses dynamic filtering (a clickable "ABI" label) that requires DOM interaction, which justified Selenium during development. The production worker bypasses this by fetching the pre-rendered HTML directly and parsing it with regex, which is simpler, faster, and free of browser dependencies.

**Title-based deduplication.** Applications are keyed by a normalized slug of their title and source. This is robust against URL changes and handles the case where the same subsidy reopens in a new call.
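One way such a key could be built is sketched below. The exact normalization rules and the `source:slug` format are assumptions made for illustration, and `slugify` and `dedup_key` are hypothetical helper names.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumerics into hyphens."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def dedup_key(source: str, title: str) -> str:
    """Build a key from source and title only; the URL is deliberately ignored."""
    return f"{slugify(source)}:{slugify(title)}"


if __name__ == "__main__":
    print(dedup_key("UAM", "  Ayudas a Asociaciones 2024/2025 "))
    # -> uam:ayudas-a-asociaciones-2024-2025
```

Because the URL plays no part in the key, a portal that moves a listing to a new address would not produce a duplicate notification under this scheme.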

**Source-aware styling.** The HTML email uses green accents for UAM subsidies and red for Comunidad de Madrid, giving recipients immediate visual context without reading the fine print.
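A rough sketch of how a per-source accent might be applied when rendering a card is shown below. The hex values, the entry fields, and the `render_card` helper are assumptions for illustration. Only the green/red split by source comes from the description above.

```python
from html import escape

# Assumed accent colors: green for UAM, red for Comunidad de Madrid.
ACCENTS = {"UAM": "#2e7d32", "Comunidad de Madrid": "#c62828"}
DEFAULT_ACCENT = "#555555"


def render_card(entry: dict) -> str:
    """Render one subsidy as an inline-styled HTML card with a colored accent."""
    accent = ACCENTS.get(entry["source"], DEFAULT_ACCENT)
    return (
        f'<div style="border-left: 4px solid {accent}; padding: 12px; margin: 8px 0;">'
        f'<strong>{escape(entry["title"])}</strong><br>'
        f'Deadline: {escape(entry["deadline"])}<br>'
        f'<a href="{escape(entry["url"], quote=True)}" style="color: {accent};">Open call</a>'
        "</div>"
    )


if __name__ == "__main__":
    print(
        render_card(
            {
                "source": "UAM",
                "title": "Example call",
                "deadline": "2025-01-31",
                "url": "https://example.org/call",
            }
        )
    )
```

Inline styles are used because many email clients strip `<style>` blocks, and escaping the scraped fields keeps stray markup from a portal out of the message.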