Live Nation

Job Summary:

JOB DESCRIPTION - LEAD SITE RELIABILITY ENGINEER - CSRE CONSULTING

Location: London, United Kingdom
Division: Ticketmaster UK Limited
Line Manager: Engagement Lead, CSRE Consulting
Contract Terms: Permanent, 40 hours per week

THE TEAM

A career at Ticketmaster will challenge and engage you. We support the creators and producers of shows and live performances, while connecting more passionate fans to these events. The pace here is fast, the atmosphere is fun and a passion for live events is a common thread that ties us together. As a global and growing business, we can truly offer a world of opportunities to expand your skills and develop your career. Visit any of our offices and you'll find a diverse mix of passionate employees, helping fans around the globe connect with the artists, teams and events they love. It truly is a unique and rewarding environment.

You will be part of the Central SRE Consulting team, which partners with product and platform engineering teams throughout Ticketmaster to improve reliability, resilience, and sustainable engineering practices. We often deliver through work that combines hands-on delivery with capability building so teams can sustain improvements independently. The team's remit is to increase adoption and maturity of SRE principles across Ticketmaster and ensure our services are appropriately scaled and reliable.

We support teams across the globe, with many peers in the USA. Most of your teammates operate in UTC/UTC+1, and we are adding people in other time zones.

THE JOB

As a Lead Site Reliability Engineer in CSRE Consulting, you will lead reliability consulting work across multiple teams or a domain, aligning stakeholders on priorities and driving delivery of sustained improvements. You will translate reliability goals into sequenced workstreams, align dependencies, and ensure teams can maintain the mechanisms after you move on.

You will mentor other consultants, codify reusable patterns, and influence shared platforms so reliability improvements propagate beyond any single team or engagement.

WHAT YOU WILL BE DOING

Lead consulting work from discovery through delivery by aligning stakeholders on priorities, sequencing work, and communicating measurable outcomes.
Establish working cadence and facilitate decision forums to surface risks, map dependencies, and drive clear ownership and timelines.
Align product, platform, and engineering stakeholders on reliability targets and trade-offs using SLOs and error budgets.
Partner regularly with Engineering Managers, product managers, Staff and Principal engineers, and platform leads to keep dependencies, decisions, and delivery aligned.
Identify systemic risks across shared dependencies and coordinate remediation across multiple teams to reduce recurring incidents.
Drive change adoption by embedding reliability mechanisms into partner team routines such as planning, PRRs, and on-call practices.
Design and implement reusable reliability mechanisms, templates, and tooling that can be adopted across teams.
Establish and evolve production readiness review practices with partner teams to improve launch quality and change safety.
Drive observability strategy for partner domains by improving signal quality, alerting philosophy, and operational dashboards.
Lead complex incident investigations and ensure learnings translate into durable fixes with clear owners and verification.
Lead reliability-focused design and code reviews and guide teams toward simpler, safer architectures.
Mentor Senior engineers and other consultants through pairing, reviews, and structured coaching to multiply impact.
Partner with internal platform engineering to influence roadmaps and deliver shared capabilities that accelerate SRE adoption.
Improve CSRE Consulting playbooks and operating practices based on repeated patterns observed across teams.

WHAT YOU NEED TO KNOW (or TECHNICAL SKILLS)

Deep practical understanding of SRE principles, including SLO governance and error budget policy in practice.
Proven ability to lead cross-team technical work and influence without authority.
Strong experience designing and troubleshooting distributed systems with cross-service failure modes.
Experience shaping observability and alerting strategy and improving operational signal quality.
Strong Kubernetes and AWS experience, including governance and cost trade-offs.
Ability to design reliability automation and tooling that is reusable and adopted by multiple teams.
Experience leading production readiness and resilience practices, including DR validation and controlled testing.
Strong software engineering fundamentals with the ability to deliver and review high-quality changes in enterprise codebases.
Advanced incident analysis skills focused on systemic risk reduction and organizational learning.
Excellent communication skills, including exec-ready summaries and clear technical diagrams.

YOU (BEHAVIOURAL SKILLS)

Lead with service and humility, creating clarity and momentum without relying on authority.
Build relationships across teams and functions, and set clear expectations for how you partner and deliver.
Facilitate alignment by framing problems, surfacing trade-offs, and running working sessions that end in decisions.
Persuade with evidence and empathy, adapting your narrative for engineers, product, and senior stakeholders.
Coach and mentor deliberately, helping others grow in reliability thinking and consulting craft.
Maintain psychological safety while raising standards, giving direct feedback with respect.
Stay persistent and patient in complex organizations, keeping work moving despite slow dependencies.
Hold ambiguity comfortably and turn messy inputs into clear plans, options, and next steps.
Favor simple mechanisms that scale adoption, not bespoke one-offs that require you to maintain them.
Operate at a sustainable pace and discourage hero culture by designing systems that do not need it.
Take pride in quality, including documentation and decision records that help teams sustain the work.
Remain adaptable, switching between hands-on debugging, stakeholder management, and planning as needed.

LIFE AT TICKETMASTER

We are proud to be a part of Live Nation Entertainment, the world's largest live entertainment company.

Our vision at Ticketmaster is to connect people around the world to the live events they love. As the world's largest ticket marketplace and the leading global provider of enterprise tools and services for the live entertainment business, we are uniquely positioned to successfully deliver on that vision.

We do it all with an intense passion for Live and an inspiring and diverse culture driven by accessible leaders, attentive managers, and enthusiastic teams. If you're passionate about live entertainment like we are, and you want to work at a company dedicated to helping millions of fans experience it, we want to hear from you.

Our work is guided by our values:
Reliability - We understand that fans and clients rely on us to power their live event experiences, and we rely on each other to make it happen.
Teamwork - We believe individual achievement pales in comparison to the level of success that can be achieved by a team.
Integrity - We are committed to the highest moral and ethical standards on behalf of the countless partners and stakeholders we represent.
Belonging - We are committed to building a culture in which all people can be their authentic selves, have an equal voice and opportunities to thrive.

EQUAL OPPORTUNITIES

We are passionate and committed to our people and go beyond the rhetoric of diversity and inclusion. You will be working in an inclusive environment and be encouraged to bring your whole self to work. We will do all that we can to help you successfully balance your work and home life. As a growing business we will encourage you to develop your professional and personal aspirations, enjoy new experiences, and learn from the talented people you will be working with. It's talent that matters to us and we encourage applications from people irrespective of their gender, race, sexual orientation, religion, age, disability status or caring responsibilities.

Live Nation Entertainment will never request payment or equipment purchases as part of the hiring process. Recruiters will only contact candidates from official Live Nation or affiliated brand email domains.

Jun 15, 2026

Full time

Lead Site Reliability Engineer (SRE)

McGregor Boyall Bromley, Kent

Lead Site Reliability Engineer (SRE) - Banking and Payments
Contract, 12 months +
Based in Bromley (Hybrid working - 3 days office)

A global financial services client is seeking an SRE to lead design and reliability engineering across banking and payments, establishing SRE standards, automation, and learning practices to improve resilience, reduce incidents, and scale engineering-led operations.

The successful candidate will be skilled in resilient engineering, risk control, and scaling operations across complex banking and payments environments. They will demonstrate flexibility, navigate ambiguity, quickly establish credibility among technical peers, and have excellent written and verbal communication skills.

Responsibilities include:
SRE strategy ownership
Banking/payments resilience
Reliability engineering transformation
SLO/SLI/error budget adoption
Incident reduction and operational scaling
Senior stakeholder influence

If this is of interest and you have the required skills, please submit your CV for immediate consideration.

McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.

Jun 11, 2026

Contractor

Lead Site Reliability Engineer (SRE)

McGregor Boyall

Lead Site Reliability Engineer (SRE) - Transformation - SRE Adoption - Banking and Payments
Contract, 12 months +
Based in London (Hybrid working - 3 days office)

A global financial services client is seeking an SRE to lead design and reliability engineering across banking and payments, establishing SRE standards, automation, and learning practices to improve resilience, reduce incidents, and scale engineering-led operations.

The successful candidate will be skilled in resilient engineering, risk control, and scaling operations across complex banking and payments environments. They will demonstrate flexibility, navigate ambiguity, quickly establish credibility among technical peers, and have excellent written and verbal communication skills.

Responsibilities include:
SRE strategy ownership
Banking/payments resilience
Reliability engineering transformation
SLO/SLI/error budget adoption
Incident reduction and operational scaling
Senior stakeholder influence

If this is of interest and you have the required skills, please submit your CV for immediate consideration.

McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.

Jun 11, 2026

Contractor

Senior Software Engineer, Substrate

Palantir

A World-Changing Company

Palantir builds the world's leading software for data-driven decisions and operations. By bringing the right data to the people who need it, our platforms empower our partners to develop lifesaving drugs, forecast supply chain disruptions, locate missing children, and more.

The Role

Substrate is the team responsible for Palantir's core production infrastructure - 100s of K8s clusters - from on-prem to the major cloud hyperscalers, whether they are internet-connected or air-gapped, small hardware footprint or large. As a Senior Software Engineer on Substrate, you will design and build Palantir's managed Kubernetes product offerings across all these environments. You and your team will be responsible for bootstrapping and operating the entire fleet of K8s clusters with zero manual steps by building industry-leading tooling and contributing to core CNCF components. You will also be responsible for ensuring scale, stability and security across a matrix of compliance regimes and hosting infrastructure types.

Your team culture emphasizes engineering rigor and operational excellence at scale. This means issues in production should be pre-empted and deeply root-caused, and investments in automation and self-healing systems are key. If you're excited about infrastructure at scale and working with Kubernetes, this is the right role for you.

Core Responsibilities

Deliver a container runtime to challenging new environment types - new clouds, on-premises, edge devices
Build automation and establish standards for operating K8s securely at scale with zero manual ops overhead
Drive innovation through adoption of novel K8s features and CNCF tools, making upstream contributions as needed
Design the next generation of Palantir's infrastructure through a deep understanding of internal systems and CNCF standards

What We Value

Systems programming experience with strong proficiency in Go, C/C++ or equivalent
Working knowledge or hands-on experience of infrastructure automation tools such as Terraform, Ansible, Puppet or K8s operators, and competent coding in Go, Java, or equivalent for the purposes of automation or scripting
Deep familiarity with hardware and OS configurations, diagnostic tooling, networking nuts and bolts
Deep familiarity with containers (Docker) and orchestration (Kubernetes) at scale
Experience working with a cloud provider (AWS/Azure/GCE), or sysadmin/SRE experience in data centers
Experience designing, building, and operating high-scale observability or infrastructure systems
Working knowledge of networking fundamentals, experience with CNIs or cloud networking infrastructure preferred

What We Require

4+ years of professional software development experience on core infrastructure with emphasis on operational excellence
2+ years of experience contributing to the system design or architecture (architecture, design patterns, reliability and scaling) of new and existing systems
Bachelor's degree in Computer Science or equivalent

Life at Palantir

We want every Palantirian to achieve their best outcomes; that's why we celebrate individuals' strengths, skills, and interests, from your first interview to your long-term growth, rather than rely on traditional career ladders. Paying attention to the needs of our community enables us to optimize our opportunities to grow and helps ensure many pathways to success at Palantir. Promoting health and well-being across all areas of Palantirians' lives is just one of the ways we're investing in our community. Learn more at Life at Palantir and note that our offerings may vary by region.

In keeping with Palantir's values and culture, we believe employees are "better together" and in-person work affords the opportunity for more creative outcomes. Therefore, we encourage employees to work from our offices to foster connectivity and innovation. Many teams do offer hybrid options (WFH a day or two a week), allowing our employees to strike the right trade-off for their personal productivity. Based on business need, there are a few roles that allow for "Remote" work on an exceptional basis. If you are applying for one of these roles, you must work from the city and/or country in which you are employed. If the posting is specified as Onsite, you are required to work from an office.

If you want to empower the world's most important institutions, you belong here. Palantir values excellence regardless of background. We are committed to making the application and hiring process accessible to everyone and will provide a reasonable accommodation for those living with a disability. If you need an accommodation for the application or hiring process, please reach out and let us know how we can help.

May 29, 2026

Full time

Senior Software Engineer / Reliability Engineering - Real-time Data

Bloomberg L.P.

Senior Software Engineer / Reliability Engineering - Real-time Data
Location: London
Business Area: Engineering and CTO
Ref #:

Description & Requirements

Our department is responsible for efficiently distributing financial data, such as stock prices and foreign exchange rates, from its source to interested users all around the world. Data can either be served in response to a request or streamed in real time.

The group owns:
- The distribution software and infrastructure
- A range of different sources of data
- Supporting services to administer and manage the system, including permissioning and metering

The team is also responsible for the Enterprise endpoint ("B-PIPE"), which allows end-users to programmatically consume data via our SDK. Data is also available through the Bloomberg Terminal and Microsoft Excel.

The main challenge faced by the group is one of scale. Data is sourced from more than 370 global exchanges, with a combined volume in excess of 60 billion messages each day. We deliver this data to hundreds of thousands of terminals and thousands of B-PIPEs. Handling this volume requires significant infrastructure: we manage multiple clusters in our main data centres, as well as a network of many thousands of servers around the world.

Group Overview

The RD Reliability Engineering group comprises three sub-teams located in Tokyo, London, and New York, providing follow-the-sun support. Our mission is to ensure systems are reliable, scalable, and observable through software engineering, while continuously improving how systems behave under load and failure conditions.

We work in an outcome-driven model, focusing on measurable improvements in availability, latency, capacity, and recovery. Our goal is to ensure systems meet defined service level objectives while minimising manual operational effort through automation and software solutions.

The systems we support must behave predictably under extreme load, recover quickly from failures, and continue to evolve without compromising stability. These are the core challenges we solve.

London Team Focus - Availability & Resiliency

The London team plays a key role in ensuring the availability and resiliency of RD infrastructure globally. We focus on:
- Detecting and preventing failures across large-scale distributed systems
- Ensuring infrastructure demonstrates sufficient capacity and failover capability during site-loss scenarios
- Reducing time to detect, diagnose, and recover from incidents
- Ensuring systems behave predictably under both normal and adverse conditions

This role provides the opportunity to influence how reliability is engineered across the platform, working closely with teams globally to improve system behaviour and design.

What You'll Do
- Build and maintain production-grade software supporting Bloomberg's global distribution infrastructure
- Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
- Analyse system behaviour under real-world and failure scenarios to validate that capacity, failover, and recovery meet resilience objectives
- Identify bottlenecks, scaling limits, and reliability risks across distributed systems
- Improve detection, diagnosis, and prevention of production issues
- Build tools and frameworks to increase system visibility and reduce time to detect and resolve incidents
- Automate operational workflows to reduce manual effort and improve system reliability
- Partner with application and infrastructure teams to improve system design, resilience, and performance
- Contribute to design discussions, incident reviews, and reliability improvements across the platform

Systems You'll Work With
- Configuration systems serving thousands of servers across the global network
- Service discovery and clustering systems for distributed infrastructure
- Monitoring and observability frameworks for large-scale server estates
- Tooling for diagnosing data quality and distribution issues

Ownership of systems may evolve over time as the team focuses on areas of highest impact.

What Success Looks Like
- Systems consistently meet defined reliability, latency, and capacity objectives
- Issues are detected and mitigated before significant customer impact
- Systems are demonstrably resilient, with proven failover capability and sufficient capacity under failure conditions
- Operational processes are automated and scalable
- Reliability is achieved through engineering improvements rather than manual intervention

What We're Looking For

We're not a traditional SRE team. We engineer reliability through software, building solutions that automate operations and improve system resilience by design.
- Experience with an object-oriented programming language (preferably Python or C++)
- Strong focus on building reliable, observable distributed systems
- Experience working with SLOs, SLIs, and production reliability metrics
- Proven ability to triage and resolve live production problems
- A mindset focused on automation and reducing operational toil
- A strength in collaborating within an inclusive team environment
- The ability to work across departments and build strong relationships with both technical and non-technical partners

Why Join Us

You'll work on systems that sit at the core of Bloomberg's real-time data platform, operating at global scale and under demanding performance and reliability requirements. This is an opportunity to:
- Solve complex distributed systems problems with real-world impact
- Influence how reliability is engineered across a critical platform
- Work with teams across multiple regions and technical domains
- Build systems that are resilient by design and operate at massive scale

If indicated, please note that years of experience are a guide; we will consider applications from all candidates who can demonstrate the skills necessary for the role.

Discover what makes Bloomberg unique, with an inside look at our culture, values, and the people behind our success.

Bloomberg is an equal opportunity employer and we value diversity at our company. We do not discriminate on the basis of age, ancestry, color, gender identity or expression, genetic predisposition or carrier status, marital status, national or ethnic origin, race, religion or belief, sex, sexual orientation, sexual and other reproductive health decisions, parental or caring status, physical or mental disability, pregnancy or parental leave, protected veteran status, status as a victim of domestic violence, or any other classification protected by applicable law.

Bloomberg is a disability inclusive employer. Please let us know if you require any reasonable adjustments to be made for the recruitment process. If you would prefer to discuss this confidentially, please email

May 29, 2026

Full time


Lead Site Reliability Engineer (SRE)

McGregor Boyall Bromley, Kent

Lead Site Reliability Engineer (SRE) - Banking and Payments
Contract, 12 months+
Based in Bromley (hybrid working, 3 days in the office)

A global financial services client is seeking an SRE to lead design and reliability engineering across banking and payments, establishing SRE standards, automation, and learning practices to improve resilience, reduce incidents, and scale engineering-led operations.

The successful candidate will be skilled in resilient engineering, risk control, and scaling operations across complex banking and payments environments. You will demonstrate flexibility, navigate ambiguity, quickly establish credibility among technical peers, and have excellent written and verbal communication skills.

Responsibilities include:
- SRE strategy ownership
- Banking/payments resilience
- Reliability engineering transformation
- SLO/SLI/error budget adoption
- Incident reduction and operational scaling
- Senior stakeholder influence

If this is of interest and you have the required skills, please submit your CV for immediate consideration.

McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.

May 25, 2026

Contractor


Principal Site Reliability Engineering Expert Director

Boston Consulting Group

Who We Are

Boston Consulting Group partners with leaders in business and society to tackle their most important challenges and capture their greatest opportunities. BCG was the pioneer in business strategy when it was founded in 1963. Today, we help clients with total transformation: inspiring complex change, enabling organizations to grow, building competitive advantage, and driving bottom-line impact.

To succeed, organizations must blend digital and human capabilities. Our diverse, global teams bring deep industry and functional expertise and a range of perspectives to spark change. BCG delivers solutions through leading-edge management consulting along with technology and design, corporate and digital ventures, and business purpose. We work in a uniquely collaborative model across the firm and throughout all levels of the client organization, generating results that allow our clients to thrive.

What You'll Do

The Principal Site Reliability Engineer (SRE) is a senior technical leader responsible for shaping how reliability, automation, and operational excellence are engineered across the organisation. Operating across domains including traditional infrastructure, cloud engineering, network operations, identity, observability, security, AI-driven operations, and automated data workflows, the role focuses on designing scalable systems, reusable engineering patterns, and standardised controls that reduce operational toil, improve resilience, and embed reliability, governance, and compliance directly into delivery pipelines and operational platforms.

This role will drive organisational change towards automation-first, measurable, and repeatable practices. A key part of the role is building and evolving reusable CI/CD and Terraform modules, engineering guardrails, observability patterns, and automation frameworks that can be adopted across multiple teams and domains without requiring each team to solve the same problems independently.

The Principal SRE also plays an important enablement role beyond deeply technical teams, helping less technical areas of the business adopt structured, governed, and scalable ways of working. This includes translating complex engineering practices into practical standards, improving how governance is implemented through engineering controls rather than manual oversight, and driving operational maturity across a broad and diverse technology landscape.

The ideal candidate is a systems thinker who understands how services, networks, identity, data flows, and operational processes fail in real-world conditions, and can apply that understanding to build automation-first, reliability-focused operating models that scale across both technical and non-technical functions.

Key Responsibilities

Cross-Domain Reliability Engineering
- Design and evolve reliability patterns across cloud, network, identity, and security domains.
- Identify systemic risks and failure modes across platforms and services, and define engineering solutions to mitigate them.
- Ensure operational activities are embedded into delivery models through automation, CI/CD integration, and event-driven workflows.

Automation & Toil Reduction at Scale
- Lead the design of automation frameworks that eliminate manual operational tasks across multiple domains.
- Translate incident learnings and operational inefficiencies into scalable automation and preventative controls.
- Drive adoption of automation-first principles, reducing dependency on human-driven processes.
- Contribute to AI-driven operational use cases, including event correlation, anomaly detection, noise reduction, operational insights, and automated remediation.
- Ensure AIOps capabilities are grounded in reliable telemetry, clear control boundaries, and measurable operational outcomes.

Observability & 24/7 Operational Excellence
- Define standards for telemetry, monitoring, alerting, and operational visibility across all critical systems.
- Ensure services are observable, measurable, and support proactive detection of issues.
- Improve operational readiness, incident response effectiveness, and time-to-recovery through engineering solutions.

CI/CD & Platform Integration
- Contribute to the design of CI/CD patterns that embed reliability, security, and operational controls into pipelines.
- Ensure infrastructure, network, identity, and security configurations are managed through code and validated automatically.
- Support integration of platform services into delivery pipelines to enable consistent, repeatable deployments.

Security & Identity Integration
- Contribute to secure-by-design patterns, including least privilege, identity-based access, and short-lived credentials.
- Support integration of security controls (e.g. secrets management, authentication, policy enforcement) into engineering workflows.
- Ensure security and compliance requirements are met through engineering controls rather than manual processes.

Network & Infrastructure Reliability
- Support the design of resilient network architectures and segmentation aligned with Zero Trust principles.
- Ensure network configurations and controls are automated, validated, and observable.
- Contribute to infrastructure design patterns that improve availability, scalability, and fault tolerance.
- Design and improve operational patterns for network reliability, segmentation, visibility, and change validation.
- Support automation and standardisation of network controls and operational procedures to reduce manual intervention and configuration drift.

Technical Leadership & Enablement
- Provide technical leadership across teams, influencing standards, architecture, and engineering practices.
- Mentor engineers on reliability engineering, automation, and systems thinking.
- Drive consistency through reusable patterns, frameworks, and documentation.

Strategic Influence & Continuous Improvement
- Contribute to reliability engineering strategy and roadmap across the organisation.
- Communicate technical concepts, risks, and recommendations to senior stakeholders and leadership.
- Lead initiatives that improve reliability maturity, engineering efficiency, and operational scalability.
- Support less technical teams and functions in adopting structured, automated, and measurable operational practices.
- Act as a bridge between engineering capability and organisational change, helping scale good practice beyond core platform teams.

Automated Data Workflows
- Design and improve automated data workflows that support operational reporting, observability, governance, and decision-making.
- Ensure operational data pipelines are reliable, timely, and aligned to engineering and business needs.

Reusable Engineering Frameworks
- Build and evolve reusable modules, patterns, and frameworks for CI/CD, Terraform, and operational automation.
- Embed governance, validation, and reliability controls into these shared engineering assets by default.

Governance by Engineering
- Translate governance requirements into practical engineering controls, automated checks, and repeatable standards.
- Help teams adopt compliant and supportable operating models without relying on manual policing or process-heavy interventions.

What You'll Bring

Required Qualifications
- 10+ years of experience in Site Reliability Engineering, Platform Engineering, or related fields.
- Strong hands-on experience across multiple domains, including:
  - Cloud platforms (AWS, Azure)
  - CI/CD and Infrastructure-as-Code (e.g. Terraform)
  - Observability tools (e.g. Datadog, Splunk)
  - Automation and scripting (e.g. Python)
- Experience designing and implementing scalable automation and reliability solutions.
- Deep understanding of distributed systems, failure modes, and resilience patterns.
- Experience integrating operational and security controls into engineering workflows.
- Strong stakeholder engagement and technical communication skills.

Preferred Qualifications
- Experience with identity and access management systems (e.g. Entra ID, Vault).
- Experience with network architecture and security controls (e.g. firewalls, segmentation).
- Familiarity with Zero Trust principles and security engineering practices.
- Experience working in large, federated organisations with diverse technology stacks.
- Exposure to compliance and regulatory requirements (e.g. PCI, HIPAA, SOX).

Additional info
- Hybrid or on-site work model.
- Operates as a senior individual contributor with broad cross-organisational influence.
- Expected to balance hands-on technical leadership with strategic direction.
- Occasional travel may be required for team or stakeholder engagement.

Boston Consulting Group is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, age, religion, sex, sexual orientation, gender identity / expression, national origin, disability, protected veteran status, or any other characteristic protected under national, provincial, or local law, where applicable, and those with criminal histories will be considered in a manner consistent with applicable state and local laws. BCG is an E-Verify Employer.

May 21, 2026

Full time


Principal Site Reliability Engineering Expert Director

Boston Consulting Group

Who We Are

Boston Consulting Group partners with leaders in business and society to tackle their most important challenges and capture their greatest opportunities. BCG was the pioneer in business strategy when it was founded in 1963. Today, we help clients with total transformation: inspiring complex change, enabling organizations to grow, building competitive advantage, and driving bottom-line impact. To succeed, organizations must blend digital and human capabilities. Our diverse, global teams bring deep industry and functional expertise and a range of perspectives to spark change. BCG delivers solutions through leading-edge management consulting along with technology and design, corporate and digital ventures, and business purpose. We work in a uniquely collaborative model across the firm and throughout all levels of the client organization, generating results that allow our clients to thrive.

What You'll Do

The Principal Site Reliability Engineer (SRE) is a senior technical leader responsible for shaping how reliability, automation, and operational excellence are engineered across the organisation. Operating across domains including traditional infrastructure, cloud engineering, network operations, identity, observability, security, AI-driven operations, and automated data workflows, the role focuses on designing scalable systems, reusable engineering patterns, and standardised controls that reduce operational toil, improve resilience, and embed reliability, governance, and compliance directly into delivery pipelines and operational platforms. This role will drive organisational change towards automation-first, measurable, and repeatable practices.

A key part of the role is building and evolving reusable CI/CD and Terraform modules, engineering guardrails, observability patterns, and automation frameworks that can be adopted across multiple teams and domains without requiring each team to solve the same problems independently.

The Principal SRE also plays an important enablement role beyond deeply technical teams, helping less technical areas of the business adopt structured, governed, and scalable ways of working. This includes translating complex engineering practices into practical standards, improving how governance is implemented through engineering controls rather than manual oversight, and driving operational maturity across a broad and diverse technology landscape.

The ideal candidate is a systems thinker who understands how services, networks, identity, data flows, and operational processes fail in real-world conditions, and can apply that understanding to build automation-first, reliability-focused operating models that scale across both technical and non-technical functions.

Key Responsibilities

Cross-Domain Reliability Engineering
Design and evolve reliability patterns across cloud, network, identity, and security domains.
Identify systemic risks and failure modes across platforms and services, and define engineering solutions to mitigate them.
Ensure operational activities are embedded into delivery models through automation, CI/CD integration, and event-driven workflows.

Automation & Toil Reduction at Scale
Lead the design of automation frameworks that eliminate manual operational tasks across multiple domains.
Translate incident learnings and operational inefficiencies into scalable automation and preventative controls.
Drive adoption of automation-first principles, reducing dependency on human-driven processes.
Contribute to AI-driven operational use cases, including event correlation, anomaly detection, noise reduction, operational insights, and automated remediation.
Ensure AIOps capabilities are grounded in reliable telemetry, clear control boundaries, and measurable operational outcomes.

Observability & 24/7 Operational Excellence
Define standards for telemetry, monitoring, alerting, and operational visibility across all critical systems.
Ensure services are observable, measurable, and support proactive detection of issues.
Improve operational readiness, incident response effectiveness, and time-to-recovery through engineering solutions.

CI/CD & Platform Integration
Contribute to the design of CI/CD patterns that embed reliability, security, and operational controls into pipelines.
Ensure infrastructure, network, identity, and security configurations are managed through code and validated automatically.
Support integration of platform services into delivery pipelines to enable consistent, repeatable deployments.

Security & Identity Integration
Contribute to secure-by-design patterns, including least privilege, identity-based access, and short-lived credentials.
Support integration of security controls (e.g. secrets management, authentication, policy enforcement) into engineering workflows.
Ensure security and compliance requirements are met through engineering controls rather than manual processes.

Network & Infrastructure Reliability
Support the design of resilient network architectures and segmentation aligned with Zero Trust principles.
Ensure network configurations and controls are automated, validated, and observable.
Contribute to infrastructure design patterns that improve availability, scalability, and fault tolerance.
Design and improve operational patterns for network reliability, segmentation, visibility, and change validation.
Support automation and standardisation of network controls and operational procedures to reduce manual intervention and configuration drift.

Technical Leadership & Enablement
Provide technical leadership across teams, influencing standards, architecture, and engineering practices.
Mentor engineers on reliability engineering, automation, and systems thinking.
Drive consistency through reusable patterns, frameworks, and documentation.

Strategic Influence & Continuous Improvement
Contribute to reliability engineering strategy and roadmap across the organisation.
Communicate technical concepts, risks, and recommendations to senior stakeholders and leadership.
Lead initiatives that improve reliability maturity, engineering efficiency, and operational scalability.
Support less technical teams and functions in adopting structured, automated, and measurable operational practices.
Act as a bridge between engineering capability and organisational change, helping scale good practice beyond core platform teams.

Automated Data Workflows
Design and improve automated data workflows that support operational reporting, observability, governance, and decision-making.
Ensure operational data pipelines are reliable, timely, and aligned to engineering and business needs.

Reusable Engineering Frameworks
Build and evolve reusable modules, patterns, and frameworks for CI/CD, Terraform, and operational automation.
Embed governance, validation, and reliability controls into these shared engineering assets by default.

Governance by Engineering
Translate governance requirements into practical engineering controls, automated checks, and repeatable standards.
Help teams adopt compliant and supportable operating models without relying on manual policing or process-heavy interventions.

What You'll Bring

Required Qualifications
10+ years of experience in Site Reliability Engineering, Platform Engineering, or related fields.
Strong hands-on experience across multiple domains, including:
Cloud platforms (AWS, Azure)
CI/CD and Infrastructure-as-Code (e.g. Terraform)
Observability tools (e.g. Datadog, Splunk)
Automation and scripting (e.g. Python)
Experience designing and implementing scalable automation and reliability solutions.
Deep understanding of distributed systems, failure modes, and resilience patterns.
Experience integrating operational and security controls into engineering workflows.
Strong stakeholder engagement and technical communication skills.

Preferred Qualifications
Experience with identity and access management systems (e.g. Entra ID, Vault).
Experience with network architecture and security controls (e.g. firewalls, segmentation).
Familiarity with Zero Trust principles and security engineering practices.
Experience working in large, federated organisations with diverse technology stacks.
Exposure to compliance and regulatory requirements (e.g. PCI, HIPAA, SOX).

Additional info
Hybrid or on-site work model.
Operates as a senior individual contributor with broad cross-organisational influence.
Expected to balance hands-on technical leadership with strategic direction.
Occasional travel may be required for team or stakeholder engagement.

Boston Consulting Group is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, age, religion, sex, sexual orientation, gender identity / expression, national origin, disability, protected veteran status, or any other characteristic protected under national, provincial, or local law, where applicable, and those with criminal histories will be considered in a manner consistent with applicable state and local laws. BCG is an E-Verify Employer.

May 21, 2026

Full time


Platform Engineer

Digital Waffle

Role: Senior Azure DevOps Engineer (Kafka)
Location: Remote (UK)
Salary: £75k + 10% Bonus
Key Skills: Azure, Terraform, Kubernetes, Kafka, Data platforms

Overview

We're partnering with a fast-growing SaaS organisation building a modern, data-driven platform that powers better insight into behaviour, performance, and collaboration at scale. They're scaling their platform capability and are now looking for a Platform Engineer to help design, build, and evolve their cloud and data infrastructure. This is a hands-on role spanning cloud infrastructure, DevOps, and data platform engineering.

The Role

You'll work across Azure cloud infrastructure and data platforms, helping to build scalable, secure systems and improve how data flows across the organisation. You'll also play a key role in improving deployment processes, system reliability, and overall platform performance, working closely with software and data teams.

Key Responsibilities
Design and manage scalable Azure cloud infrastructure
Own Infrastructure as Code using Terraform
Build and maintain CI/CD pipelines using GitHub Actions (essential)
Support GitHub-based release and deployment workflows
Work with Kafka for event-driven streaming and real-time data movement
Support and evolve data platforms (Databricks ideal)
Build and maintain data pipelines (batch + streaming / ETL / ELT)
Improve platform reliability, observability, and performance
Collaborate with engineering teams to improve developer experience

Requirements
Strong Azure cloud experience
Background in Platform Engineering, DevOps, or SRE
Strong experience with GitHub Actions for CI/CD (essential)
Kubernetes, Docker, and Terraform experience
Strong scripting skills (Python, Bash, or PowerShell)
Experience with Kafka or similar streaming platforms
Experience with Databricks or modern data platforms (e.g. Snowflake, Synapse, BigQuery)
Strong understanding of data pipelines and distributed systems
Focus on automation, scalability, and reliability

Nice to Have
Lakehouse or large-scale data platform experience
Observability tooling (Datadog, Grafana, Prometheus)
SaaS / high-growth product experience
Strong developer experience mindset

May 19, 2026

Full time


Windows Endpoint Infrastructure Engineer

Oscar Technology Glasgow, Lanarkshire

Windows Endpoint Infrastructure Engineer
Glasgow, 3 days onsite
12 month contract
£600-£700 per day, Outside IR35

We're looking for a highly experienced Windows Endpoint Infrastructure Engineer to join a large-scale enterprise environment. This is a hands-on, senior-level engineering role suited to someone with a strong operational background in building, running, and troubleshooting Windows infrastructure at scale.

You'll be part of a specialist endpoint engineering function focused on designing, deploying, and maintaining secure Windows environments across both on-premise and cloud platforms. Working closely with cybersecurity teams, you'll deliver robust, scalable endpoint solutions that protect critical systems while ensuring high performance and reliability. This role is heavily infrastructure-focused, with a strong emphasis on automation and operational excellence.

Key Responsibilities
Engineer, deploy, and support Windows endpoint infrastructure across a large enterprise estate (hundreds of thousands of devices)
Automate provisioning, configuration, and maintenance tasks using PowerShell and Python
Troubleshoot complex endpoint, OS, and infrastructure issues across distributed environments
Collaborate with cybersecurity teams to implement and enhance endpoint security controls
Maintain and optimize platform performance, reliability, and scalability
Support ongoing improvements through automation, tooling, and DevOps practices
Contribute to infrastructure design and take ownership through full lifecycle delivery and support

What We're Looking For
Extensive experience in enterprise Windows infrastructure engineering (large-scale environments)
Strong knowledge of Windows internals and endpoint management
Proven ability to operate and troubleshoot complex production environments
Solid scripting skills in PowerShell and/or Python (essential)
Experience delivering solutions end-to-end, from design through to operational support
Strong analytical thinking and problem-solving capabilities
Clear communication skills, both technical and non-technical

Nice to Have
Exposure to endpoint security tools (e.g. Microsoft Defender suite, E5 security stack)
Experience with SCCM, Intune, or similar endpoint management platforms
Familiarity with system hardening, disk encryption, and security best practices
Experience with monitoring/logging tools such as Splunk
Background working in DevOps or SRE environments
Broader systems knowledge (networking, storage, Unix/macOS)
Interest in cybersecurity and working closely with security-focused teams

If this sounds like a fit, APPLY NOW!

Oscar Associates (UK) Limited is acting as an Employment Business in relation to this vacancy. To understand more about what we do with your data please review our privacy policy in the privacy section of the Oscar website.

May 19, 2026

Contractor


SRE Consultant

Akkodis City, London

SRE Managing Consultant
Cloud Operating Model & Reliability Transformation
Security Clearance: SC eligible (UK residency required)

Shape the Future of Cloud Reliability

Are you passionate about building resilient, scalable cloud platforms that truly support the business? Do you thrive at the intersection of engineering excellence, operating models, and senior stakeholder advisory?

We're looking for a Managing Consultant in Site Reliability Engineering (SRE) to help organisations shift from reactive operations to measurable, product-aligned reliability, embedding SRE as a core engineering discipline across cloud and hybrid environments. You'll work with senior leaders, engineering teams, and platform organisations to design operating models that deliver availability, reliability, scalability, and operational excellence at scale.

What You'll Be Doing

As part of a growing Cloud Advisory capability, you'll lead and shape client engagements focused on reliability, resilience, and modern cloud operations. Key responsibilities include:
Define and embed SRE engagement models aligned to modern engineering and traditional ITSM/ITIL practices
Establish SLIs, SLOs, and Error Budgets
Shape observability strategies using metrics, logs, and traces
Design incident response models and post-incident learning loops
Reduce toil through automation and engineering excellence
Deliver SRE capability assessments and roadmaps
Act as a trusted senior advisor to stakeholders

What We're Looking For
Extensive experience in SRE, cloud operations, or DevOps
Proven consulting or advisory background
Experience with AWS, Azure, or GCP
Strong observability and incident management expertise
Ability to obtain UK SC clearance

Modis International Ltd acts as an employment agency for permanent recruitment and an employment business for the supply of temporary workers in the UK. Modis Europe Ltd provide a variety of international solutions that connect clients to the best talent in the world. For all positions based in Switzerland, Modis Europe Ltd works with its licensed Swiss partner Accurity GmbH to ensure that candidate applications are handled in accordance with Swiss law. Both Modis International Ltd and Modis Europe Ltd are Equal Opportunities Employers. By applying for this role your details will be submitted to Modis International Ltd and/or Modis Europe Ltd. Our Candidate Privacy Information Statement which explains how we will use your information is available on the Modis website.

May 02, 2026

Full time


Senior Site Reliability Engineer

83Zero Ltd Wokingham, Berkshire

Senior Site Reliability Engineer - Active SC Required!
Up to £75,000 + benefits
Wokingham - Hybrid (UK-based)

We're seeking a Senior Site Reliability Engineer to play a key role in designing and operating highly reliable, scalable systems in a fast-paced environment. You'll act as a technical leader within the team, driving best practices across reliability engineering, automation, and system performance.

What you'll be doing:
Designing and improving system reliability, scalability, and observability
Leading incident management and driving root cause analysis
Building and maintaining robust CI/CD pipelines and automation frameworks
Partnering with development teams to embed SRE principles into the SDLC
Mentoring junior engineers and promoting engineering best practices

What we're looking for:
Strong experience in SRE, DevOps, or platform engineering roles
Deep understanding of cloud infrastructure (AWS, Azure, or GCP)
Hands-on experience with Kubernetes and containerised environments
Strong scripting/programming skills (Python, Go, or similar)
Experience with monitoring, alerting, and observability tooling
Proven ability to troubleshoot complex distributed systems

Why apply?
Opportunity to influence technical direction and best practices
Work on large-scale, mission-critical systems
Leadership exposure with clear progression to principal level

Apr 01, 2026

Full time


