The Agile Monkeys is a software boutique of over 75 people, and we keep growing. We do two things:
We create products: We are a highly creative team constantly pursuing new ideas and projects we love. Right now we are focused on building products related to search and AI, and we take great joy in tackling the scalability and information-retrieval challenges that come with AI.
We deliver bespoke consulting solutions to leading enterprises, driving strategic transformation, operational excellence, and sustainable growth.
Why join us?
We’re independent and self-funded, so we get to build what we believe in — no investor politics
You’ll join a radically creative, fast-moving team that values ownership and experimentation
We work globally, but our roots are in the Canary Islands 🌴 — remote-first, international, and human-centered
We’re entering a hyper-creative phase in AI and product development — and we want your brain in the mix
Role Summary
Our customer is a nonpartisan nonprofit that ingests, curates, and publishes U.S. government statistics to help the public and policymakers make data-informed decisions. The Agile Monkeys partners with them to build modern, reliable data platforms and pipelines at scale.
We’re looking for a Senior Data Engineer to own core data pipelines across the Wild → Raw → Bronze → Silver → Gold Medallion architecture, with a focus on Dataset Mapping, Polling, and Change Data Capture (CDC). You’ll design, implement, and harden Databricks/Spark jobs, Airflow orchestrations, and Delta Lake patterns that turn messy public datasets into governed, trustworthy, and publish-ready data.
What you’ll do
Design & build pipelines in Python, PySpark, and SQL for batch/near-real-time ingestion and transformation on Databricks (Delta Lake + Unity Catalog).
Operationalize CDC & incremental load: table/version change detection, column-level deltas, idempotent reprocessing, and efficient upserts/soft-deletes (a merge sketch follows this list).
Implement polling & source change detection: hash-based snapshot detection, metadata-driven schedules, and cost-aware download strategies (a polling sketch follows this list).
Own the Medallion flow (Raw → Bronze → Silver): schema validation, schema evolution/forking, terminology and semantic mapping, and lineage preservation.
Data quality & observability: build validations (Great Expectations or similar), metrics, and alerts; publish run metadata to audit tables and dashboards (a validation sketch follows this list).
Governance: apply Unity Catalog best practices (permissions, tags, lineage), structured logging, and reproducibility across pipelines.
Performance & cost: optimize Spark/Delta (partitioning, file sizing, Z-ordering, AQE, cluster configs), control cloud spend, and tune Airflow concurrency (a maintenance snippet follows this list).
Reliability engineering: implement robust retry/exception-handling policies, backoffs, and graceful degradation across jobs and operators (a retry-policy sketch follows this list).
Collaboration & reviews: contribute to design docs, PRs, and coding standards; mentor teammates; work closely with data analysts and product.
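To make the CDC bullet concrete, here is a minimal sketch of an idempotent upsert/soft-delete merge using the Delta Lake Python API. The `record_id` key, `_op` flag, and `apply_cdc_batch` helper are illustrative assumptions, not the project's actual schema:

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def apply_cdc_batch(spark: SparkSession, changes: DataFrame, target_path: str) -> None:
    """Idempotently merge one CDC batch into a Delta table.

    Re-running the same batch yields the same end state, so a failed job
    can be retried without duplicating rows.
    """
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(changes.alias("s"), "t.record_id = s.record_id")
        # Soft-delete: mark the row instead of physically removing it,
        # preserving lineage for downstream audits.
        .whenMatchedUpdate(
            condition="s._op = 'delete'",
            set={"is_deleted": "true", "updated_at": "s.updated_at"},
        )
        .whenMatchedUpdateAll(condition="s._op = 'upsert'")
        .whenNotMatchedInsertAll(condition="s._op = 'upsert'")
        .execute()
    )
```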
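For the polling bullet, a minimal sketch of cost-aware change detection: a cheap HTTP HEAD metadata check first, then a content hash only when needed. The helper names and state values are assumptions for illustration:

```python
import hashlib
from typing import Optional

import requests


def maybe_skip_download(url: str, last_etag: Optional[str]) -> bool:
    """Cheap metadata check (HTTP HEAD) before paying for a full download."""
    head = requests.head(url, timeout=30, allow_redirects=True)
    etag = head.headers.get("ETag")
    return etag is not None and etag == last_etag


def content_digest(url: str, chunk_size: int = 1 << 20) -> str:
    """Stream the source file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def source_changed(url: str, last_digest: Optional[str], last_etag: Optional[str]) -> bool:
    """Trigger re-ingestion only when the upstream snapshot actually changed."""
    if maybe_skip_download(url, last_etag):
        return False
    return content_digest(url) != last_digest
```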
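The data-quality bullet names Great Expectations "or similar"; as a stand-in, here is a hand-rolled PySpark check showing the shape of a validation gate whose metrics could feed an audit table. The `record_id` key column is an assumption:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def validate_silver(df: DataFrame) -> dict:
    """Compute simple data-quality metrics suitable for an audit table."""
    total = df.count()
    metrics = {
        "row_count": total,
        "null_keys": df.filter(F.col("record_id").isNull()).count(),
        "duplicate_keys": total - df.dropDuplicates(["record_id"]).count(),
    }
    # Fail loudly rather than letting bad data reach publishable tables.
    if metrics["null_keys"] or metrics["duplicate_keys"]:
        raise ValueError(f"Silver validation failed: {metrics}")
    return metrics
```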
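For the performance bullet, a short example of routine Delta maintenance; the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate commonly filtered columns.
spark.sql("OPTIMIZE silver.observations ZORDER BY (dataset_id, period)")
# Remove data files no longer referenced by the table (7-day retention here).
spark.sql("VACUUM silver.observations RETAIN 168 HOURS")
```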
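And for the reliability bullet, a minimal sketch of a retry/backoff policy using standard Airflow operator arguments; the DAG id, schedule, and task body are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    # Doubles the delay on each attempt (~2m, 4m, 8m), capped below.
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="ingest_source_snapshot",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="ingest",
        python_callable=lambda: None,  # placeholder for the real ingestion step
    )
```

Setting the policy in default_args keeps every task in the DAG on the same backoff behavior unless a task explicitly overrides it.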
Tech environment
Languages: Python (PySpark), SQL
Platform: Databricks (Delta Lake, Unity Catalog)
Orchestration: Airflow (Astronomer)
Data quality: Great Expectations (or equivalent)
Cloud & storage: AWS/Azure (S3/ABFS/Blob), Parquet/Delta
CI/CD & tooling: GitHub, tests (pytest), code style, linters; YAML-driven configs; dashboards/metrics (e.g., Prometheus/Grafana or Databricks metrics)
Projects you’ll touch
Dataset Mapping: Apply curated terminology/semantic mappings to normalize heterogeneous government sources into consistent Silver datasets (a mapping sketch follows this list).
Polling Framework: Detect upstream source changes reliably (metadata and content hashing), trigger selective re-ingestion, and keep costs in check.
CDC System: Compute row- and column-level deltas, maintain detail/log/state views, and expose clear lineage for publishable facts.
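As a flavor of the Dataset Mapping work, here is a minimal sketch of applying a curated terminology mapping via a broadcast join; the table, column, and function names are assumptions for illustration:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def apply_terminology_mapping(source: DataFrame, mapping: DataFrame) -> DataFrame:
    """Replace raw source terms with canonical terms via a broadcast join.

    `mapping` has columns (source_term, canonical_term); unmapped terms are
    kept as-is but flagged for triage rather than silently dropped.
    """
    return (
        source.join(
            F.broadcast(mapping),
            source["category"] == mapping["source_term"],
            "left",
        )
        .withColumn("category_mapped", F.coalesce("canonical_term", "category"))
        .withColumn("mapping_missing", F.col("canonical_term").isNull())
        .drop("source_term", "canonical_term")
    )
```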
Minimum qualifications
5+ years building production Spark/Databricks data pipelines in Python/PySpark and SQL.
Deep knowledge of Delta Lake internals (ACID, checkpoints, optimize/vacuum, partitioning, file layout) and performance tuning.
Hands-on Airflow experience (DAG design, sensors, concurrency, retries, SLAs), ideally on Astronomer.