Data Engineer in Madrid or remote

The Agile Monkeys

Workplace
Remote
Hours
Full-Time
Internship
No

Job description

The Agile Monkeys is a software boutique of over 75 people, and we keep growing. We do two things:

  • We create products: we are a very creative team constantly pursuing new ideas and projects we love. Right now we are focused on building products related to search and AI, and we take great joy in tackling the AI challenges of scalability and information retrieval.

  • We deliver bespoke consulting solutions to leading enterprises, driving strategic transformation, operational excellence, and sustainable growth.

Why join us?

  • We’re independent and self-funded, so we get to build what we believe in — no investor politics

  • You’ll join a radically creative, fast-moving team that values ownership and experimentation

  • We work globally, but our roots are in the Canary Islands 🌴 — remote-first, international, and human-centered

  • We’re entering a hyper-creative phase in AI and product development — and we want your brain in the mix

Role Summary

Our customer is a nonpartisan nonprofit that ingests, curates, and publishes U.S. government statistics to help the public and policymakers make data-informed decisions. The Agile Monkeys partners with them to build modern, reliable data platforms and pipelines at scale.
We’re looking for a Senior Data Engineer to own core data pipelines across the Wild → Raw → Bronze → Silver → Gold Medallion architecture, with a focus on Dataset Mapping, Polling, and Change Data Capture (CDC). You’ll design, implement, and harden Databricks/Spark jobs, Airflow orchestrations, and Delta Lake patterns that turn messy public datasets into governed, trustworthy, and publish-ready data.
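
For a concrete (and purely illustrative) picture, a Raw → Bronze step in this kind of Medallion pipeline could look roughly like the PySpark sketch below; the catalog, table, path, and column names are hypothetical placeholders, not the customer's actual schema.

```python
# Minimal Raw -> Bronze sketch on Databricks (illustrative only; all names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "s3://example-bucket/raw/population_stats/"  # hypothetical landing location

# Read the raw landing files as-is and stamp them with ingestion metadata,
# so the Bronze layer stays a faithful, auditable copy of the source.
bronze_df = (
    spark.read.parquet(raw_path)
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append into a governed Delta table registered in Unity Catalog.
(
    bronze_df.write.format("delta")
    .mode("append")
    .saveAsTable("examples_catalog.bronze.population_stats")
)
```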

What you’ll do

  • Design & build pipelines in Python, PySpark, and SQL for batch/near-real-time ingestion and transformation on Databricks (Delta Lake + Unity Catalog).

  • Operationalize CDC & incremental load: table/version change detection, column-level deltas, idempotent reprocessing, and efficient upserts/soft-deletes (see the upsert sketch after this list).

  • Implement polling & source change detection: hash-based snapshot detection, metadata-driven schedules, and cost-aware download strategies.

  • Own the Medallion flow (Raw → Bronze → Silver): schema validation, schema evolution/forking, terminology and customer mappings, lineage preservation.

  • Data quality & observability: build validations (Great Expectations or similar), metrics, and alerts; publish run metadata to audit tables and dashboards.

  • Governance: apply Unity Catalog best practices (permissions, tags, lineage), structured logging, and reproducibility across pipelines.

  • Performance & cost: optimize Spark/Delta (partitioning, file sizing, Z-ordering, AQE, cluster configs), control cloud spend, and tune Airflow concurrency.

  • Reliability engineering: implement robust retry/exception-handling policies, backoffs, and graceful degradation across jobs and operators.

  • Collaboration & reviews: contribute to design docs, PRs, and coding standards; mentor teammates; work closely with data analysts and product.
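
As referenced in the CDC item above, here is a hedged sketch of an idempotent upsert with soft-deletes using Delta Lake's MERGE API; the key column, the `_op` change flag, and the table and column names are hypothetical.

```python
# Idempotent CDC upsert with soft-deletes via Delta MERGE (illustrative sketch; names are hypothetical).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical batch of detected changes: key column `record_id`, payload column
# `population`, a change timestamp `changed_at`, and an `_op` flag ("U" = upsert,
# "D" = delete). Assumes the batch is deduplicated to one row per record_id.
changes = spark.table("examples_catalog.bronze.population_stats_changes")

target = DeltaTable.forName(spark, "examples_catalog.silver.population_stats")

(
    target.alias("t")
    .merge(changes.alias("s"), "t.record_id = s.record_id")
    # Soft-delete: keep the row but flag it, preserving history and lineage.
    .whenMatchedUpdate(
        condition="s._op = 'D'",
        set={"is_deleted": "true", "updated_at": "s.changed_at"},
    )
    # Regular update: overwrite the payload with the latest values.
    .whenMatchedUpdate(
        condition="s._op = 'U'",
        set={
            "population": "s.population",
            "is_deleted": "false",
            "updated_at": "s.changed_at",
        },
    )
    # New records: insert them with the soft-delete flag cleared.
    .whenNotMatchedInsert(
        condition="s._op = 'U'",
        values={
            "record_id": "s.record_id",
            "population": "s.population",
            "is_deleted": "false",
            "updated_at": "s.changed_at",
        },
    )
    .execute()
)
```

Re-running the same deduplicated change batch leaves the target in the same state, which is what makes reprocessing safe.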

Tech environment

  • Languages: Python (PySpark), SQL

  • Platform: Databricks (Delta Lake, Unity Catalog)

  • Orchestration: Airflow (Astronomer)

  • Data quality: Great Expectations (or equivalent)

  • Cloud & storage: AWS/Azure (S3/ABFS/Blob), Parquet/Delta

  • CI/CD & tooling: GitHub, tests (pytest), code style, linters; YAML-driven configs; dashboards/metrics (e.g., Prometheus/Grafana or Databricks metrics)
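
As one way to read the "YAML-driven configs" item above, a pipeline definition might be declared and parsed like this minimal sketch; the keys, values, and file layout are purely hypothetical.

```python
# Hypothetical YAML-driven pipeline config (keys and structure are illustrative only).
import yaml

CONFIG = """
dataset: population_stats
source:
  url: https://example.gov/data/population.csv
  poll_schedule: "0 6 * * *"      # cron expression for the polling job
bronze_table: examples_catalog.bronze.population_stats
silver_table: examples_catalog.silver.population_stats
expectations:
  - column: state_code
    not_null: true
  - column: population
    min: 0
"""

# In practice this would be loaded from a versioned file in the repo.
config = yaml.safe_load(CONFIG)
print(config["dataset"], config["source"]["poll_schedule"])
```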

Projects you’ll touch

  • Dataset Mapping: Apply curated terminology/semantic mappings to normalize heterogeneous government sources into consistent Silver datasets.

  • Polling Framework: Detect upstream source changes reliably (metadata and content hashing), trigger selective re-ingestion, and keep costs in check (see the sketch after this list).

  • CDC System: Compute row- and column-level deltas, maintain detail/log/state views, and expose clear lineage for publishable facts.
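
As referenced in the Polling Framework item, content-hash change detection could look roughly like this minimal sketch; the source URL, state handling, and helper names are hypothetical.

```python
# Hash-based source change detection (illustrative sketch; names are hypothetical).
import hashlib
import urllib.request
from typing import Optional

SOURCE_URL = "https://example.gov/data/population.csv"  # hypothetical source

def content_hash(url: str) -> str:
    """Download the source and return a SHA-256 digest of its bytes."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as resp:
        for chunk in iter(lambda: resp.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(url: str, last_known_hash: Optional[str]) -> bool:
    """Compare the current digest against the one stored in pipeline state.

    A cost-aware version would first check cheap metadata (ETag/Last-Modified)
    and only download and hash the content when the metadata is inconclusive.
    """
    return content_hash(url) != last_known_hash

# In the real pipeline the previous hash would be read from an audit/state
# table; here it is just a placeholder.
if has_changed(SOURCE_URL, last_known_hash=None):
    print("Source changed: trigger selective re-ingestion")
```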

Minimum qualifications

  • 5+ years building production Spark/Databricks data pipelines in Python/PySpark and SQL.

  • Deep knowledge of Delta Lake internals (ACID, checkpoints, optimize/vacuum, partitioning, file layout) and performance tuning.

  • Hands-on Airflow experience (DAG design, sensors, concurrency, retries, SLAs), ideally on Astronomer.
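
By way of illustration, a minimal Airflow DAG with the retry, backoff, and SLA settings mentioned above might be declared like this; the DAG id, schedule, and task callables are hypothetical.

```python
# Minimal Airflow DAG sketch with retries, exponential backoff, and an SLA
# (illustrative only; ids, schedule, and callables are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def poll_source(**_):
    # Placeholder for the polling / change-detection step.
    print("checking upstream source for changes")

def ingest_to_bronze(**_):
    # Placeholder for the Raw -> Bronze ingestion step.
    print("ingesting changed files into the Bronze layer")

default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "sla": timedelta(hours=2),
}

with DAG(
    dag_id="population_stats_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    poll = PythonOperator(task_id="poll_source", python_callable=poll_source)
    ingest = PythonOperator(task_id="ingest_to_bronze", python_callable=ingest_to_bronze)

    poll >> ingest
```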

 

About The Agile Monkeys

  • Type: Agency

  • Location: Las Palmas, Gran Canaria

  • Employees: 50 - 200

  • Founded: 2011

We are a group of engineers, product designers, and creators working in software and tech innovation.

We are focused on the following areas:

- Microservices and event-driven architectures.

- Blockchain and Web3 projects.

- Security in software.

- Artificial Intelligence.
