The Agile Monkeys is a software boutique of over 75 people, and we keep growing. We do two things:
We create products: We are a highly creative team constantly pursuing new ideas and projects we love. Right now we are focused on building products related to search and AI, and we take great joy in tackling the scalability and information-retrieval challenges that come with AI.
We deliver bespoke consulting solutions to leading enterprises, driving strategic transformation, operational excellence, and sustainable growth.
Why join us?
We’re independent and self-funded, so we get to build what we believe in — no investor politics
You’ll join a radically creative, fast-moving team that values ownership and experimentation
We work globally, but our roots are in the Canary Islands 🌴 — remote-first, international, and human-centered
We’re entering a hyper-creative phase in AI and product development — and we want your brain in the mix
Role Summary
Our customer is a nonpartisan nonprofit that ingests, curates, and publishes U.S. government statistics to help the public and policymakers make data-informed decisions. The Agile Monkeys partners with them to build modern, reliable data platforms and pipelines at scale.
We’re looking for a Senior Data Engineer to own core data pipelines across the Wild → Raw → Bronze → Silver → Gold Medallion architecture, with a focus on Dataset Mapping, Polling, and Change Data Capture (CDC). You’ll design, implement, and harden Databricks/Spark jobs, Airflow orchestrations, and Delta Lake patterns that turn messy public datasets into governed, trustworthy, and publish-ready data.
What you’ll do
Design & build pipelines in Python, PySpark, and SQL for batch/near-real-time ingestion and transformation on Databricks (Delta Lake + Unity Catalog).
Operationalize CDC & incremental load: table/version change detection, column-level deltas, idempotent reprocessing, and efficient upserts/soft-deletes (a merge sketch follows this list).
Implement polling & source change detection: hash-based snapshot detection, metadata-driven schedules, and cost-aware download strategies (a polling sketch follows this list).
Own the Medallion flow (Raw → Bronze → Silver): schema validation, schema evolution/forking, terminology and semantic mapping, and lineage preservation.
Data quality & observability: build validations (Great Expectations or similar), metrics, and alerts; publish run metadata to audit tables and dashboards (a validation sketch follows this list).
Governance: apply Unity Catalog best practices (permissions, tags, lineage), structured logging, and reproducibility across pipelines.
Performance & cost: optimize Spark/Delta (partitioning, file sizing, Z-ordering, AQE, cluster configs), control cloud spend, and tune Airflow concurrency (a maintenance snippet follows this list).
Reliability engineering: implement robust retry/exception-handling policies, backoffs, and graceful degradation across jobs and operators (a retry-policy sketch follows this list).
Collaboration & reviews: contribute to design docs, PRs, and coding standards; mentor teammates; work closely with data analysts and product.
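To make the CDC bullet concrete, here is a minimal sketch of an idempotent upsert/soft-delete merge using the Delta Lake Python API. The `record_id` key, `_op` flag, and `apply_cdc_batch` helper are illustrative assumptions, not the project's actual schema:

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def apply_cdc_batch(spark: SparkSession, changes: DataFrame, target_path: str) -> None:
    """Idempotently merge one CDC batch into a Delta table.

    Re-running the same batch yields the same end state, so a failed job
    can be retried without duplicating rows.
    """
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(changes.alias("s"), "t.record_id = s.record_id")
        # Soft-delete: mark the row instead of physically removing it,
        # preserving lineage for downstream audits.
        .whenMatchedUpdate(
            condition="s._op = 'delete'",
            set={"is_deleted": "true", "updated_at": "s.updated_at"},
        )
        .whenMatchedUpdateAll(condition="s._op = 'upsert'")
        .whenNotMatchedInsertAll(condition="s._op = 'upsert'")
        .execute()
    )
```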
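For the polling bullet, a minimal sketch of cost-aware change detection: a cheap HTTP HEAD metadata check first, then a content hash only when needed. The helper names and state values are assumptions for illustration:

```python
import hashlib
from typing import Optional

import requests


def maybe_skip_download(url: str, last_etag: Optional[str]) -> bool:
    """Cheap metadata check (HTTP HEAD) before paying for a full download."""
    head = requests.head(url, timeout=30, allow_redirects=True)
    etag = head.headers.get("ETag")
    return etag is not None and etag == last_etag


def content_digest(url: str, chunk_size: int = 1 << 20) -> str:
    """Stream the source file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def source_changed(url: str, last_digest: Optional[str], last_etag: Optional[str]) -> bool:
    """Trigger re-ingestion only when the upstream snapshot actually changed."""
    if maybe_skip_download(url, last_etag):
        return False
    return content_digest(url) != last_digest
```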
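The data-quality bullet names Great Expectations "or similar"; as a stand-in, here is a hand-rolled PySpark check showing the shape of a validation gate whose metrics could feed an audit table. The `record_id` key column is an assumption:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def validate_silver(df: DataFrame) -> dict:
    """Compute simple data-quality metrics suitable for an audit table."""
    total = df.count()
    metrics = {
        "row_count": total,
        "null_keys": df.filter(F.col("record_id").isNull()).count(),
        "duplicate_keys": total - df.dropDuplicates(["record_id"]).count(),
    }
    # Fail loudly rather than letting bad data reach publishable tables.
    if metrics["null_keys"] or metrics["duplicate_keys"]:
        raise ValueError(f"Silver validation failed: {metrics}")
    return metrics
```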
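For the performance bullet, a short example of routine Delta maintenance; the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate commonly filtered columns.
spark.sql("OPTIMIZE silver.observations ZORDER BY (dataset_id, period)")
# Remove data files no longer referenced by the table (7-day retention here).
spark.sql("VACUUM silver.observations RETAIN 168 HOURS")
```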
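And for the reliability bullet, a minimal sketch of a retry/backoff policy using standard Airflow operator arguments; the DAG id, schedule, and task body are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    # Doubles the delay on each attempt (~2m, 4m, 8m), capped below.
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="ingest_source_snapshot",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="ingest",
        python_callable=lambda: None,  # placeholder for the real ingestion step
    )
```

Setting the policy in default_args keeps every task in the DAG on the same backoff behavior unless a task explicitly overrides it.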
Tech environment
Languages: Python (PySpark), SQL
Platform: Databricks (Delta Lake, Unity Catalog)
Orchestration: Airflow (Astronomer)
Data quality: Great Expectations (or equivalent)
Cloud & storage: AWS/Azure (S3/ABFS/Blob), Parquet/Delta
CI/CD & tooling: GitHub, tests (pytest), code style, linters; YAML-driven configs; dashboards/metrics (e.g., Prometheus/Grafana or Databricks metrics)
Projects you’ll touch
Dataset Mapping: Apply curated terminology/semantic mappings to normalize heterogeneous government sources into consistent Silver datasets (a mapping sketch follows this list).
Polling Framework: Detect upstream source changes reliably (metadata and content hashing), trigger selective re-ingestion, and keep costs in check.
CDC System: Compute row- and column-level deltas, maintain detail/log/state views, and expose clear lineage for publishable facts.
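As a flavor of the Dataset Mapping work, here is a minimal sketch of applying a curated terminology mapping via a broadcast join; the table, column, and function names are assumptions for illustration:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def apply_terminology_mapping(source: DataFrame, mapping: DataFrame) -> DataFrame:
    """Replace raw source terms with canonical terms via a broadcast join.

    `mapping` has columns (source_term, canonical_term); unmapped terms are
    kept as-is but flagged for triage rather than silently dropped.
    """
    return (
        source.join(
            F.broadcast(mapping),
            source["category"] == mapping["source_term"],
            "left",
        )
        .withColumn("category_mapped", F.coalesce("canonical_term", "category"))
        .withColumn("mapping_missing", F.col("canonical_term").isNull())
        .drop("source_term", "canonical_term")
    )
```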
Minimum qualifications
5+ years building production Spark/Databricks data pipelines in Python/PySpark and SQL.
Deep knowledge of Delta Lake internals (ACID, checkpoints, optimize/vacuum, partitioning, file layout) and performance tuning.
Hands-on Airflow experience (DAG design, sensors, concurrency, retries, SLAs), ideally on Astronomer.