Join a world-class team of skilled engineers who build creative digital solutions to support our colleagues and clients. We make a broad organizational impact by delivering cutting-edge technology solutions that power Gartner. Gartner IT values its culture of nonstop innovation, an outcome-driven approach to success, and the notion that great ideas can come from anyone on the team.
About this role:
You will primarily be responsible for supporting the production operations of critical client-facing applications. You will ensure each application's operational readiness by evaluating its performance, reliability, scale, resiliency, and observability. You will identify and triage issues in production and partner with other engineers on the team to find the root cause. Other responsibilities include managing applications and infrastructure as code, creating and executing chaos tests, and managing alerts and dashboards.
What you’ll do:
As part of the SRE scrum team, perform full-stack triage of alerts and engage other engineers to identify the root cause of application performance and stability issues.
Collaborate with cross-functional members of the SWAT team during production incidents and provide critical technical insight to identify the root cause.
Establish relationships with stakeholders such as development teams or product owners to define service level objectives (SLOs) for application features and services
Measure performance against SLOs in partnership with development teams or other stakeholders, and ensure systems continue to meet SLOs over time.
Identify opportunities to improve performance, scalability, and stability of applications.
Participate in operational support and on-call rotation shifts for supported systems and products
Be available to work flexible hours as required for operational support and during select events, such as major releases, to ensure coordination across the globally distributed team
Conduct blameless postmortems to troubleshoot priority incidents.
Design and develop dashboards and reports to communicate key metrics
Identify opportunities to improve alerting posture and create/update alerts accordingly.
Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation
Identify opportunities to automate manual operational work (i.e., “toil”) using pipelines, new software, or other appropriate mechanisms
Work closely with the application team to understand the application architecture, perform single-point-of-failure analysis, and create scenarios for testing the resiliency of the application
Create or derive the NFR/workload model and ensure performance and resiliency are considered early in the SDLC.
Execute performance and chaos tests, and analyze the results using APM and other tools to identify performance and stability issues.
Document findings, analysis, and results, and present them to stakeholders
Incorporate automation to reduce the probability and/or impact of problem recurrence
What you’ll need:
Experience triaging production issues using APM tools such as Dynatrace, AppDynamics, or New Relic and log aggregation tools such as Splunk, ELK, etc.
Experience with SRE concepts like SLIs/SLOs and error budgets
Experience with AWS cloud, specifically services such as EC2, EKS, API Gateway, Lambda, Route 53, SNS, RDS, ElastiCache, OpenSearch, etc., or similar cloud technologies and services
Knowledge of Docker containers and related orchestration technologies
Ability to work independently and to partner with team members, with a strong sense of initiative and drive
Excellent analytical, verbal, and written communication skills, backed by data-driven analysis
Nice to Have:
Experience with a CDN such as Cloudflare and features like bot management, advanced WAF, and coding at the edge
Experience with CI/CD processes and tools (Jenkins, Argo, Harness, etc.)
Experience with chaos engineering
Proficient background in operating systems (UNIX/Linux).
Experience with Agile and DevOps development methodologies
Exposure to automation and scripting using Jenkins, Python, shell, etc.
Knowledge of infrastructure as code using Terraform.
Don’t meet every single requirement? We encourage you to apply anyway. You might just be the right candidate for this or other roles.
What you will get:
Competitive compensation
23 days annual holiday and an additional day off for your birthday.
Private Medical and Dental Care.
Life and Disability Insurance.
Public Transport Subsidy.
Ticket Restaurant Card.
Childcare Vouchers (Ticket Guarderia).
IncentiFit - annual reimbursement for health-and-wellness-related activities.
Pension Scheme.
Tuition Reimbursement.
Employee Stock Purchase Plan.
Employee Assistance Program.
Gartner Gives Charity Match.
Relocation Assistance - a specialist to help you with all the appointments and paperwork.
Limitless growth and learning opportunities.
A collaborative and positive culture - join a diverse team of professionals who are as smart and driven as you.
A chance to make an impact - your work will contribute directly to our strategy.
A hybrid work environment - enjoy the flexibility of working from home and the energy of collaborating with peers in our dynamic offices.
New offices close to Glòries and the beach in the fantastic “innovation and tech” 22@ district of sunny Barcelona.
Fresh fruit, snacks, selection of teas, fair trade organic coffee and a fridge full of beer for our Thursday Beer O’Clock.
And much more!