Data Engineer in Madrid or remote

The Agile Monkeys

Workplace
Remote
Hours
Full-Time
Internship
No

Job description

The Agile Monkeys is a software boutique of over 75 people, and we keep growing. We do two things:

  • We create products: we are a very creative team constantly pursuing new ideas and projects we love. Right now we are focused on building products related to search and AI, and we take great joy in tackling the AI challenges of scalability and information retrieval.

  • We deliver bespoke consulting solutions to leading enterprises, driving strategic transformation, operational excellence, and sustainable growth.

Why join us?

  • We’re independent and self-funded, so we get to build what we believe in — no investor politics

  • You’ll join a radically creative, fast-moving team that values ownership and experimentation

  • We work globally, but our roots are in the Canary Islands 🌴 — remote-first, international, and human-centered

  • We’re entering a hyper-creative phase in AI and product development — and we want your brain in the mix

Role Summary

Our customer is a nonpartisan nonprofit that ingests, curates, and publishes U.S. government statistics to help the public and policymakers make data-informed decisions. The Agile Monkeys partners with them to build modern, reliable data platforms and pipelines at scale.
We’re looking for a Senior Data Engineer to own core data pipelines across the Wild → Raw → Bronze → Silver → Gold Medallion architecture, with a focus on Dataset Mapping, Polling, and Change Data Capture (CDC). You’ll design, implement, and harden Databricks/Spark jobs, Airflow orchestrations, and Delta Lake patterns that turn messy public datasets into governed, trustworthy, and publish-ready data.
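
For a concrete (and purely illustrative) picture, a Raw → Bronze step in this kind of Medallion pipeline could look roughly like the PySpark sketch below; the catalog, table, path, and column names are hypothetical placeholders, not the customer's actual schema.

```python
# Minimal Raw -> Bronze sketch on Databricks (illustrative only; all names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "s3://example-bucket/raw/population_stats/"  # hypothetical landing location

# Read the raw landing files as-is and stamp them with ingestion metadata,
# so the Bronze layer stays a faithful, auditable copy of the source.
bronze_df = (
    spark.read.parquet(raw_path)
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append into a governed Delta table registered in Unity Catalog.
(
    bronze_df.write.format("delta")
    .mode("append")
    .saveAsTable("examples_catalog.bronze.population_stats")
)
```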

What you’ll do

  • Design & build pipelines in Python, PySpark, and SQL for batch/near-real-time ingestion and transformation on Databricks (Delta Lake + Unity Catalog).

  • Operationalize CDC & incremental load: table/version change detection, column-level deltas, idempotent reprocessing, and efficient upserts/soft-deletes (see the upsert sketch after this list).

  • Implement polling & source change detection: hash-based snapshot detection, metadata-driven schedules, and cost-aware download strategies.

  • Own the Medallion flow (Raw → Bronze → Silver): schema validation, schema evolution/forking, terminology and customer mappings, lineage preservation.

  • Data quality & observability: build validations (Great Expectations or similar), metrics, and alerts; publish run metadata to audit tables and dashboards.

  • Governance: apply Unity Catalog best practices (permissions, tags, lineage), structured logging, and reproducibility across pipelines.

  • Performance & cost: optimize Spark/Delta (partitioning, file sizing, Z-ordering, AQE, cluster configs), control cloud spend, and tune Airflow concurrency.

  • Reliability engineering: implement robust retry/exception-handling policies, backoffs, and graceful degradation across jobs and operators.

  • Collaboration & reviews: contribute to design docs, PRs, and coding standards; mentor teammates; work closely with data analysts and product.
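
As referenced in the CDC item above, here is a hedged sketch of an idempotent upsert with soft-deletes using Delta Lake's MERGE API; the key column, the `_op` change flag, and the table and column names are hypothetical.

```python
# Idempotent CDC upsert with soft-deletes via Delta MERGE (illustrative sketch; names are hypothetical).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical batch of detected changes: key column `record_id`, payload column
# `population`, a change timestamp `changed_at`, and an `_op` flag ("U" = upsert,
# "D" = delete). Assumes the batch is deduplicated to one row per record_id.
changes = spark.table("examples_catalog.bronze.population_stats_changes")

target = DeltaTable.forName(spark, "examples_catalog.silver.population_stats")

(
    target.alias("t")
    .merge(changes.alias("s"), "t.record_id = s.record_id")
    # Soft-delete: keep the row but flag it, preserving history and lineage.
    .whenMatchedUpdate(
        condition="s._op = 'D'",
        set={"is_deleted": "true", "updated_at": "s.changed_at"},
    )
    # Regular update: overwrite the payload with the latest values.
    .whenMatchedUpdate(
        condition="s._op = 'U'",
        set={
            "population": "s.population",
            "is_deleted": "false",
            "updated_at": "s.changed_at",
        },
    )
    # New records: insert them with the soft-delete flag cleared.
    .whenNotMatchedInsert(
        condition="s._op = 'U'",
        values={
            "record_id": "s.record_id",
            "population": "s.population",
            "is_deleted": "false",
            "updated_at": "s.changed_at",
        },
    )
    .execute()
)
```

Re-running the same deduplicated change batch leaves the target in the same state, which is what makes reprocessing safe.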

Tech environment

  • Languages: Python (PySpark), SQL

  • Platform: Databricks (Delta Lake, Unity Catalog)

  • Orchestration: Airflow (Astronomer)

  • Data quality: Great Expectations (or equivalent)

  • Cloud & storage: AWS/Azure (S3/ABFS/Blob), Parquet/Delta

  • CI/CD & tooling: GitHub, tests (pytest), code style, linters; YAML-driven configs; dashboards/metrics (e.g., Prometheus/Grafana or Databricks metrics)
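
As one way to read the "YAML-driven configs" item above, a pipeline definition might be declared and parsed like this minimal sketch; the keys, values, and file layout are purely hypothetical.

```python
# Hypothetical YAML-driven pipeline config (keys and structure are illustrative only).
import yaml

CONFIG = """
dataset: population_stats
source:
  url: https://example.gov/data/population.csv
  poll_schedule: "0 6 * * *"      # cron expression for the polling job
bronze_table: examples_catalog.bronze.population_stats
silver_table: examples_catalog.silver.population_stats
expectations:
  - column: state_code
    not_null: true
  - column: population
    min: 0
"""

# In practice this would be loaded from a versioned file in the repo.
config = yaml.safe_load(CONFIG)
print(config["dataset"], config["source"]["poll_schedule"])
```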

Projects you’ll touch

  • Dataset Mapping: Apply curated terminology/semantic mappings to normalize heterogeneous government sources into consistent Silver datasets.

  • Polling Framework: Detect upstream source changes reliably (metadata and content hashing), trigger selective re-ingestion, and keep costs in check (see the sketch after this list).

  • CDC System: Compute row- and column-level deltas, maintain detail/log/state views, and expose clear lineage for publishable facts.
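
As referenced in the Polling Framework item, content-hash change detection could look roughly like this minimal sketch; the source URL, state handling, and helper names are hypothetical.

```python
# Hash-based source change detection (illustrative sketch; names are hypothetical).
import hashlib
import urllib.request
from typing import Optional

SOURCE_URL = "https://example.gov/data/population.csv"  # hypothetical source

def content_hash(url: str) -> str:
    """Download the source and return a SHA-256 digest of its bytes."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as resp:
        for chunk in iter(lambda: resp.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(url: str, last_known_hash: Optional[str]) -> bool:
    """Compare the current digest against the one stored in pipeline state.

    A cost-aware version would first check cheap metadata (ETag/Last-Modified)
    and only download and hash the content when the metadata is inconclusive.
    """
    return content_hash(url) != last_known_hash

# In the real pipeline the previous hash would be read from an audit/state
# table; here it is just a placeholder.
if has_changed(SOURCE_URL, last_known_hash=None):
    print("Source changed: trigger selective re-ingestion")
```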

Minimum qualifications

  • 5+ years building production Spark/Databricks data pipelines in Python/PySpark and SQL.

  • Deep knowledge of Delta Lake internals (ACID, checkpoints, optimize/vacuum, partitioning, file layout) and performance tuning.

  • Hands-on Airflow experience (DAG design, sensors, concurrency, retries, SLAs), ideally on Astronomer.
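
By way of illustration, a minimal Airflow DAG with the retry, backoff, and SLA settings mentioned above might be declared like this; the DAG id, schedule, and task callables are hypothetical.

```python
# Minimal Airflow DAG sketch with retries, exponential backoff, and an SLA
# (illustrative only; ids, schedule, and callables are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def poll_source(**_):
    # Placeholder for the polling / change-detection step.
    print("checking upstream source for changes")

def ingest_to_bronze(**_):
    # Placeholder for the Raw -> Bronze ingestion step.
    print("ingesting changed files into the Bronze layer")

default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "sla": timedelta(hours=2),
}

with DAG(
    dag_id="population_stats_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    poll = PythonOperator(task_id="poll_source", python_callable=poll_source)
    ingest = PythonOperator(task_id="ingest_to_bronze", python_callable=ingest_to_bronze)

    poll >> ingest
```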

 

About The Agile Monkeys

  • Type: Agency

  • Location: Las Palmas, Gran Canaria

  • Employees: 50 - 200

  • Founded: 2011

We are a group of engineers, product designers, and creators working in software and tech innovation.

We are focused on the following areas:

- Microservices and event-driven architectures.

- Blockchain and Web3 projects.

- Security in software.

- Artificial Intelligence.
