Contract: Site Reliability Engineer (Automation & Observability)
Start Date: ASAP
Duration: 12 months
Location: Glasgow
Rate: £400 - £450 per day inside of IR35
Reference: 20130

We are looking for a Site Reliability Engineer (Automation & Observability) to help strengthen our client's data protection environment. The role focuses on building automation, monitoring, and alerting capabilities that im
Jan 08, 2026
Full time
Cambridge University Press & Assessment
Cambridge, Cambridgeshire
Job Title: Technical Lead
Salary: £51,400 - £68,800
Location: Cambridge/Hybrid
Contract: Permanent

The Tech Lead at Cambridge University Press and Assessment plays a pivotal role in delivering high-quality, scalable, and resilient applications that support the global spectrum of education needs. By providing technical leadership across the full technology stack, the Tech Lead ensures systems are designed for maintainability, extensibility, and long-term success, empowering teams to unlock potential and deliver transformative solutions for learners, teachers, and researchers.

We are Cambridge University Press & Assessment, a world-leading academic publisher and assessment organisation and a proud part of the University of Cambridge.

About the role

We're seeking a hands-on Tech Lead with strong leadership and collaboration skills to guide the migration of legacy 3-tier enterprise applications to scalable, cloud-native architectures on AWS, balancing technical innovation with effective stakeholder engagement. In this role, you will not only lead and mentor an agile, cross-functional engineering team, but also remain deeply involved in the technical delivery, applying your expertise directly to architect, code, and implement solutions. You will set technical standards, write and review code, and solve complex challenges alongside your team, ensuring resilient, high-performance systems that empower our global platforms and serve millions of users worldwide.

This is an opportunity to establish the foundation for future innovation, including the adoption of event-driven architecture (EDA) and modern practices to support successful cloud migration, while building strong relationships across teams and stakeholders to drive alignment and adoption. We are looking for someone who leads by example, demonstrates recent hands-on experience with modern technologies, and is passionate about both technical excellence and team development.
Key Responsibilities

Shape Cloud-Native Architecture: Lead the migration of legacy applications to AWS, designing scalable, resilient architectures using approaches like microservices, serverless, and containerisation, while communicating design rationale to diverse stakeholders for buy-in and feedback.
Drive Observability: Implement robust observability frameworks to ensure system performance, reliability, and proactive issue resolution, fostering a team culture of shared accountability and continuous improvement.
Prepare for Event-Driven Architecture: Design systems with flexibility for EDA patterns, collaborating with cross-functional groups to identify opportunities for decoupled, scalable solutions.
Build Web-App Solutions: Develop responsive, high-performance web applications using modern frameworks integrated with cloud-native backends, adapting to user feedback and business priorities through iterative collaboration.
Optimise Cloud Deployments: Leverage cloud services and Infrastructure as Code to ensure cost-effective, scalable solutions, while mentoring teams on best practices and facilitating knowledge-sharing sessions.
Ensure Security and Compliance: Champion secure practices aligned with industry standards, influencing stakeholders to prioritise security in decision-making processes.
Lead Automation and CI/CD: Drive automation through CI/CD pipelines to accelerate delivery and minimise technical debt, promoting agile ways of working that encourage team input and adaptability.
Mentor and Collaborate: Guide engineers, foster a culture of learning and inclusivity, and collaborate with product teams, architects, and stakeholders to align technical solutions with business goals, using strong communication skills to bridge technical and non-technical perspectives.
Facilitate Stakeholder Engagement: Act as a key liaison in stakeholder groups, translating technical concepts into actionable insights, resolving conflicts, and building consensus to ensure project success.

About you

As a hands-on Tech Lead, you'll bring broad expertise in cloud architecture and migration strategies, coupled with exceptional soft skills to thrive in collaborative environments. Your experience includes transforming legacy enterprise applications into modern, scalable solutions, with a focus on horizontal skills like problem-solving, adaptable design thinking, and leading through influence. You excel in agile, cross-functional teams, where your ability to communicate effectively, mentor diverse groups, and manage stakeholder expectations is as critical as your technical acumen. You will be expected to apply your technical skills directly by writing code, architecting systems, and solving complex technical challenges alongside your team.

Key soft and horizontal skills include:

Leadership and Mentoring: Proven track record of guiding teams, fostering collaboration, and promoting a growth mindset to unlock team potential.
Communication and Stakeholder Management: Skilled at articulating complex technical ideas to non-technical audiences, facilitating workshops, and building relationships to align on shared goals.
Adaptability and Collaboration: Thrive in dynamic settings, adapting to evolving priorities while collaborating across functions to drive innovation and resolve challenges.
Problem-Solving and Resilience: Approach issues with a holistic, user-centred mindset, balancing technical feasibility with business impact.

On the technical side, you'll have proficiency in areas like microservices, serverless, containerisation, and building web applications, with experience in observability, security standards, Infrastructure as Code, and CI/CD practices.
We value transferable skills over specific tool expertise, as technical depth can be developed through our supportive learning environment. Ideally, you're also keen to explore event-driven architecture (EDA) and Domain-Driven Design (DDD), with a background that equips you to address industry-specific challenges in education and publishing.

If you would like to know more about this opportunity and what will make you successful, please see the full job description attached to the bottom of this vacancy on our careers site.

Rewards and benefits

We will support you to be at your best in work and to live well outside of it. In addition to competitive salaries, we offer a world-class, flexible rewards package, featuring family-friendly and planet-friendly benefits including:

28 days annual leave plus bank holidays
Private medical and Permanent Health Insurance
Discretionary annual bonus
Group personal pension scheme
Life assurance up to 4 x annual salary
Green travel schemes

We are a hybrid working organisation, and we offer a range of flexible working options from day one. We expect most hybrid-working colleagues to spend 40-60% of their time at their dedicated office or location. We will also consider other work arrangements if you wish to work more flexibly or require adjustments due to a disability.

Ready to pursue your potential? Apply now.

We review applications on an ongoing basis. The closing date for all applications is 7th January, with interviews taking place shortly after.

Please note that successful applicants will be subject to satisfactory background checks including DBS due to working in a regulated industry.

Cambridge University Press & Assessment is an approved UK employer for the sponsorship of eligible roles and applicants under the Skilled Worker visa route. Please refer to the gov.uk website for guidance to understand the eligibility based on the role you are applying for.

Why join us

Joining us is your opportunity to pursue potential. You'll belong to a collaborative team that's exploring new and better ways to serve students, teachers and researchers across the globe - for the benefit of individuals, society and the world. Sharing our mission will inspire your own growth, development and progress, in an environment which embraces difference, change and aspiration.

Cambridge University Press & Assessment is committed to being a place where anyone can enjoy a successful career, where it's safe to speak up, and where we learn continuously to improve together.

We welcome applications from all candidates, regardless of demographic characteristics (age, disability, educational attainment, ethnicity, gender, marital status, neurodiversity, religion, sex, gender identity and sexual identity), cultural background, or social class. We believe better outcomes come through diversity of thought, background and approach. We welcome applications from people from all backgrounds and communities, actively seeking to employ people from a wide range of different communities.
Jan 07, 2026
Full time
Overview

As a DevOps Engineer at Payter, you will play a crucial role in the company's growth by delivering key software solutions. Joining a small, close-knit team, you will engage in software delivery, collaborating closely with domain owners to deliver high-quality, clean, testable code in line with standards, strategies, and best practices.

At Payter, we are innovators, pioneers, and leaders in the dynamic realm of unattended/self-service contactless and cashless payment technology in a wide range of markets such as Electric Vehicle Charging, Transportation, Retail, Hospitality, Vending, Charity, Parking, and beyond. The adaptable Payter platform accommodates a diverse range of payment technologies (NFC, EMV, Apple Pay, Google Pay, etc.), international banking processes, closed-loop payment and loyalty schemes and telemetry. Through continuous innovation and in-house development, we redefine how vendors connect with their customers, empowering them to boost revenue, enhance user experiences, and access real-time sales and performance data.

We support a broad range of technologies, including Contact & Contactless EMV, Mifare, WiFi, 5G, Bluetooth, Touch Screens and more. Our state-of-the-art products have an extremely long service life, are of high quality, compliant with multiple international standards, boast great design, are user-friendly for all, multifunctional, and easy to integrate.
Examples of successful collaboration include:

EV Charging: Fastned, Shell, BP, Ionity, Alfen, EVBOX
Cashless Charity Donations: Hartstichting, WWF, Save the Children, Royal British Legion
Food & Drink Vending: Coca Cola, Lavazza, Starbucks, Jacobs Douwe Egberts, Costa, Heineken, Maas International, Franke, WMF, Wurlitzer, Selecta
Hospitality & public locations: Compass Group, Sodexo, Albron, TU Delft, TU Eindhoven
Gaming & Entertainment: Pinball, Slot Machines, Gaming Arcades, Efteling
Petrol Stations services Laundry, Car Wash, Kiosks, Toilets: Shell, BP, Exxon
Special Products: Photo Booths, Dog Wash Station

Responsibilities

Lead the design and implementation of scalable, secure cloud infrastructure for our new platform while maintaining and enhancing existing production systems alongside the current DevOps team
Architect and implement comprehensive Infrastructure as Code solutions using Terraform, establishing reusable modules and advanced deployment patterns across multiple environments
Design, build, and optimise complex CI/CD pipelines using Bitbucket Pipelines, incorporating advanced deployment strategies including canary releases and automated rollback mechanisms
Implement sophisticated monitoring, logging, and alerting systems using GCP native tools, with emphasis on proactive anomaly detection, predictive failure analysis, and SLO/SLI management
Drive automation initiatives across the organisation, eliminating manual processes and establishing GitOps workflows and Everything-as-Code practices for consistent infrastructure management
Lead security implementation ensuring PCI-DSS, PCI-PIN, and PCI-P2PE compliance, including infrastructure hardening, access controls, and integrated security scanning within deployment pipelines
Work with existing DevOps engineers and collaborate with Software Engineering teams to optimise deployment processes and foster a culture of continuous improvement
Establish and maintain disaster recovery procedures, backup strategies, and high-availability architectures to ensure business continuity and system resilience

Qualifications

5+ years of Senior DevOps/Site Reliability Engineering experience with demonstrated expertise in leading complex, enterprise-scale cloud infrastructure projects and team mentorship
Expert-level proficiency with Google Cloud Platform including advanced services (GKE, Cloud Run Services, VPC networking, IAM, Cloud Armor, API Gateways) and cost optimisation strategies
Deep expertise in Bitbucket and Bitbucket Pipelines including complex workflow orchestration, parallel execution strategies, advanced branching models, and pipeline optimisation techniques
Advanced Terraform experience including module development, remote state management, workspace strategies, and enterprise-scale infrastructure provisioning patterns
Proven experience with Kubernetes administration including cluster management, advanced networking, service mesh implementation, resource optimisation, and troubleshooting distributed systems
Demonstrated ability to design and implement highly available, fault-tolerant systems with experience in event-driven architectures, microservices, and distributed system challenges
Advanced knowledge of database administration (SQL/NoSQL), caching technologies, and data pipeline optimisation in cloud-native environments
Strong expertise in security practices, compliance frameworks (PCI-DSS, PCI-PIN, PCI-P2PE), infrastructure hardening, and security monitoring; hands-on experience in network security and access management would be advantageous

Technical Skills

Cloud Platform: Expert-level Google Cloud Platform (GCP) including advanced service integration, networking, and cost management
Version Control & CI/CD: Advanced Bitbucket and Bitbucket Pipelines expertise with complex workflow design and optimisation capabilities
Infrastructure as Code: Deep Terraform proficiency including enterprise module development, state management, and advanced provisioning patterns
Container Orchestration: Advanced Kubernetes (GKE) administration including service mesh, advanced networking, and performance tuning
Monitoring & Observability: Google Cloud Logging, Cloud Monitoring, distributed tracing, and advanced alerting with SLI/SLO implementation
Security & Compliance: PCI-DSS, PCI-PIN and PCI-P2PE standards implementation, infrastructure security hardening, and automated security scanning integration
Database & Caching: Advanced experience with cloud databases (AlloyDB, Cloud SQL, Firestore, BigQuery), Memcached, and data pipeline optimisation
Networking & Load Balancing: Expert knowledge of GCP VPC design, Cloud Load Balancer configuration, firewall management, and hybrid cloud connectivity
Software Development Practices: Advanced Git workflows, automated testing integration, code review processes, and understanding of software architecture patterns

Soft Skills

Excellent technical communication skills, demonstrated by experience in bringing the team on the journey, and promoting DevOps best practices.
Advanced problem-solving and analytical thinking capabilities, with proven expertise in diagnosing complex distributed system issues and implementing comprehensive solutions.
Outstanding communication skills, capable of clearly articulating technical challenges and solutions to diverse audiences, including executives, product managers, and development teams.
A strong commitment to quality and continuous improvement, coupled with a passion for delivering enterprise-grade, maintainable solutions that inspire pride within teams.

What we have to offer

Competitive compensation including a discretionary bonus based on business results;
Great benefits like 25 leave days plus extra monthly "wellbeing days", a travel allowance and an attractive pension plan;
Flexible working options within the Netherlands (Rotterdam office or hybrid/remote) or fully remote in the UK; we are not hiring outside these regions at this time.
Thriving in a close-knit environment valuing flexibility, work-life balance, and mental well-being;
Join Payter and become part of an international scale-up, shaping the future in a booming market where you can have impact and growth opportunities.
Jan 01, 2026
Full time
At Moody's, we unite the brightest minds to turn today's risks into tomorrow's opportunities. We do this by striving to create an inclusive environment where everyone feels welcome to be who they are-with the freedom to exchange ideas, think innovatively, and listen to each other and customers in meaningful ways. Moody's is transforming how the world sees risk. As a global leader in ratings and integrated risk assessment, we're advancing AI to move from insight to action-enabling intelligence that not only understands complexity but responds to it. We decode risk to unlock opportunity, helping our clients navigate uncertainty with clarity, speed, and confidence. If you are excited about this opportunity but do not meet every single requirement, please apply! You still may be a great fit for this role or other open roles. We are seeking candidates who model our values: invest in every relationship, lead with curiosity, champion diverse perspectives, turn inputs into actions, and uphold trust through integrity. Skills and Competencies Minimum 1 year experience in DevOps, Site Reliability Engineering, or infrastructure engineering roles. Previous experience of one or more cloud platforms-Google Cloud Platform (GCP), Azure, or AWS. Experience working with CI/CD tools such as CircleCI and GitHub Actions. Skilled in infrastructure automation using Terraform and Kubernetes. Coding and debugging skills in one programming language or more such as Python, C#, Rust. Basic understanding of artificial intelligence concepts, with curiosity and enthusiasm for learning how AI tools can be used to improve processes and drive efficiency. Interest in exploring AI systems and a willingness to develop awareness of responsible AI practices, including risk management and ethical use. Education Bachelor's degree in Computer Science, Engineering, or a related technical field required. Advanced degree (Master's or PhD) in a relevant discipline is a plus. 
Responsibilities Support the reliability, scalability, and automation of Moody's MAP platform services through infrastructure automation, CI/CD enablement, and cross-team collaboration. Develop and maintain infrastructure-as-code solutions using Terraform and Kubernetes. Contribute to build and deploy pipelines using CircleCI and GitHub Actions. Maintain and monitor production services on MAP, ensuring performance, availability, and security. Participate in incident response and root cause analysis. Collaborate with development, security, and platform teams to onboard strategic applications and ensure compliance. Develop and refine observability tooling and automation frameworks. Support onboarding of other engineers and assist in documentation and knowledge sharing. Ensure infrastructure and deployment practices align with Moody's ISMS and audit requirements. The Team Our Platform Engineering team is responsible for building and maintaining the foundational systems and services that power Moody's technology ecosystem. We enable innovation by providing scalable, reliable, and secure platforms that support the development and deployment of cutting edge applications. By joining our team, you will be part of exciting work in cloud infrastructure, DevOps, and AI driven platform solutions. Moody's is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability, protected veteran status, sexual orientation, gender expression, gender identity or any other characteristic protected by law. Candidates for Moody's Corporation may be asked to disclose securities holdings pursuant to Moody's Policy for Securities Trading and the requirements of the position. Employment is contingent upon compliance with the Policy, including remediation of positions in those holdings as necessary.
Jan 01, 2026
Full time
Blip is a leading tech company focused on software engineering solutions for sports entertainment. We operate at scale. As part of Flutter Entertainment, we play an essential role in the Group's goal of becoming the global leader in online sports betting and iGaming, developing innovative products and platforms for over 14 million monthly customers worldwide. We are serious about Tech. We are problem-solvers with big ambitions, keeping a people-first mindset at the core of our work. We prioritize flexibility as we strive to deliver the best technological products and tackle the greatest industry challenges. Recognizing that everyone brings their own strengths, backgrounds and new perspectives, we empower you to be yourself. That uniqueness shapes the culture of belonging we are so proud of. The Role We are seeking a motivated and experienced senior engineer to join our dynamic organisation. As a Senior Site Reliability Engineer in our UK&I division, you will be responsible for overseeing a group of employees, providing direction and support to ensure goals are met and operations run smoothly. If you have a strong background in team management and are ready to take on a new challenge, we want to hear from you. Come be a part of our team and make a positive impact on our organisation's success. What You'll Be Doing Engage in and improve the whole lifecycle of services, from design and deployment to operation and refinement. Take an active part in root cause investigation, identification, and resolution (where necessary) of production problems. Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health. 
Be an active part of performance and capacity testing; Optimize reliability monitoring & alerting; Scale systems sustainably through mechanisms like automation; Evolve systems by pushing for changes that improve reliability and velocity; Iteratively audit performance and reliability vulnerabilities; Define and revise Service Level Indicators (SLIs); Practice sustainable incident response and blameless postmortems. What You'll Bring Deep familiarity with building and troubleshooting release and build pipelines (e.g. Jenkins, Buildkite, GitHub Actions) Experience implementing creative approaches to monitoring distributed systems while leveraging industry best practices (e.g. instrumenting a tagging taxonomy across disparate systems) Experience building, managing, and deploying an application utilizing containerized microservices in a distributed infrastructure (e.g. AWS, GCP, self-hosted cloud) Experience leveraging new technologies when it best serves a business need Comprehensive understanding of incident management best practices Opinionated and knowledgeable approach to implementing industry best practices Demonstrated experience developing teams, encouraging growth, and serving as a technical mentor and leader Shows strength and comprehension in at least one programming language (e.g. Java, Python, Scala, Kotlin) Experience making large directional technical decisions (e.g. deciding which technology or pattern to create or leverage) Experience being "on-call" for a service, and familiarity with incident notification tooling (e.g. PagerDuty, Opsgenie) Comprehensive understanding of SRE principles (e.g. 
working knowledge of the Google SRE book) Demonstrated strength in leading a project in an agile/scrum environment Thrives in a diverse work environment We'd Like You To Master In Experience managing complex telemetry solutions which directly contributed to overall reliability Design greenfield solutions leveraging Configuration Management/Infrastructure as Code tools (e.g. Chef, Puppet, Terraform) Create automated tooling that contributed to multiple teams' velocity Demonstrated experience with project management best practices Shows the ability to break down large technical concepts into effective communication with stakeholders from across the organization Extensive knowledge of networking best practices, tools, and observability Experience developing and deploying automated service configuration at the edge (e.g. CDN configuration, certificate renewal) Work consulting with a team, being able to advise on their technology, workflows, dev tooling, monitoring, and alerting best practices Identified the need for and led development of automation that significantly reduced toil (e.g. deployment pipelines, distributed dev environments) Built and maintained a system and culture that supported and implemented SLOs Has shown to be a thought leader contributing to the broader industry conversation about SRE principles and topics (e.g. speaking at conferences) We are committed to creating a diverse and inclusive workplace. We strongly encourage people from all backgrounds, ways of thinking, and working to apply. We are committed to including everyone regardless of their race, disability, age, gender identity, sexual orientation, and religion. Everyone brings different perspectives and experiences; you don't have to meet all the requirements listed to apply for this role. If you need any adjustments to apply for the position and to ensure this role aligns with your needs, please send an email to . We will only respond to inquiries related to disabilities.
Jan 01, 2026
Full time
Location: London, England, United Kingdom Join Axon and be a Force for Good. Your Impact From body-worn cameras to mobile apps and dispatch tools, Axon Assistant enables first responders to access critical information exactly when they need it. As a Senior SDET, you'll play a pivotal role in ensuring this mission-critical, intelligent software ships with the highest levels of quality, performance, and trust. You'll architect test infrastructure, lead automation strategy, and own the systems that validate our most advanced multi-platform experiences. Work Location: This role is based out of our London office and follows a hybrid schedule. We rely on in-person collaboration and ask that team members work onsite Tuesdays through Fridays, with the flexibility to work remotely on Mondays, unless there is an approved workplace accommodation. We believe that connection fuels innovation, and our in-office culture is designed to foster meaningful teamwork, mentorship, and shared success. What You'll Do Architect and implement automation frameworks, test strategies, and quality infrastructure across web, mobile, and on-device platforms. Design scalable validation systems for real-time voice interaction, AI/LLM-driven features, and distributed cloud services. Partner with engineers to shape code for testability and embed quality early in the development process. Lead cross-functional quality initiatives to improve CI/CD pipelines, observability, and release readiness. Drive performance, load, and resilience testing, especially for latency-sensitive, real-time systems. Mentor other SDETs and developers in automation strategy, debugging, and risk mitigation. Own root cause analysis for complex, system-level issues - using telemetry, tracing, and logs. Contribute to documentation of tools, architecture, and best practices that scale across teams. What You Bring 7+ years of experience in test automation, software engineering, or SDET roles. 
Strong experience building and scaling test automation frameworks and developer-focused tools. Deep understanding of distributed systems, API testing, and CI/CD pipelines. Hands-on experience testing AI/ML-powered systems, real-time services, or multi-modal UIs. Track record of owning quality strategy and delivery for mission-critical software in production. Nice to Have Background working in regulated or high-trust domains like public safety, healthcare, or finance. Ability to influence technical direction and raise the bar for engineering quality across teams. Who you are Engineering-Driven - You bring a software-first mindset to quality and reliability. Strategic Thinker - You balance fast execution with long-term test infrastructure health. Mentor and Collaborator - You help others grow through reviews, pairing, and shared accountability. System Debugger - You enjoy diving into logs, tracing services, and unraveling complex bugs. Mission-Oriented - You take pride in building systems that people rely on in high-stakes moments. Benefits that benefit you Competitive base salary and RSUs Comprehensive pension plan with matching contribution Private health insurance & cash plans 30 days paid holiday + UK public holidays Enhanced maternity/paternity leave Life assurance & income protection Career growth support and wellness resources Don't meet every single requirement? That's ok. At Axon, we Aim Far. We think big with a long-term view because we want to reinvent the world to be a safer, better place. We are also committed to building diverse teams that reflect the communities we serve. Studies have shown that women and people of color are less likely to apply to jobs unless they check every box in the job description. If you're excited about this role and our mission to Protect Life but your experience doesn't align perfectly with every qualification listed here, we encourage you to apply anyways. You may be just the right candidate for this or other roles. 
The above job description is not intended as, nor should it be construed as, exhaustive of all duties, responsibilities, skills, efforts, or working conditions associated with this job. The job description may change or be supplemented at any time in accordance with business needs and conditions. Some roles may also require legal eligibility to work in a firearms environment. Axon's mission is to Protect Life and is committed to the well-being and safety of its employees as well as Axon's impact on the environment. All Axon employees must be aware of and committed to the appropriate environmental, health, and safety regulations, policies, and procedures. Axon employees are empowered to report safety concerns as they arise and activities potentially impacting the environment. We are an equal opportunity employer that promotes justice, advances equity, values diversity and fosters inclusion. We're committed to hiring the best talent - regardless of race, creed, color, ancestry, religion, sex (including pregnancy), national origin, sexual orientation, age, citizenship status, marital status, disability, gender identity, genetic information, veteran status, or any other characteristic protected by applicable laws, regulations and ordinances - and empowering all of our employees so they can do their best work. If you have a disability or special need that requires assistance or accommodation during the application or the recruiting process, please email . Please note that this email address is for accommodation purposes only. Axon will not respond to inquiries for other purposes.
Jan 01, 2026
Full time
Are you looking for a new challenge as a System Engineer and excited to take ownership of designing, maintaining, and optimizing our IT systems and infrastructure? Do you thrive on ensuring reliability, security, and scalability across the environments that power DeepHealth's operations worldwide? Are you passionate about driving continuous improvement, streamlining processes, and collaborating with cross-functional teams to deliver efficient, compliant, and high-performing systems? Then please read on.

Responsibilities:
- Prepare, perform, and maintain customer installations of our advanced AI medical imaging software, ensuring seamless integration with existing IT infrastructure.
- Gather customer requirements and installation environment details for smooth, efficient deployments.
- Provide expert technical support to customers and partners, troubleshooting and resolving incidents.
- Contribute to continuous improvement of the Delivery and Implementation team's workflows, aiming to streamline processes and enhance the customer experience.
- Be adaptable and available to support customers outside of regular working hours (on-call rotation may be required).
- Collaborate with development teams to ensure seamless deployment and integration of applications.
- Document system configurations, procedures, and best practices.
- Stay up-to-date with emerging technologies and industry trends.

Skills, education and experience:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Minimum of 5 years of experience in system engineering or a similar role.
- Proficiency in at least one scripting language (e.g., Python, PowerShell, Bash).
- Solid knowledge of Linux, networking, and security.
- Public/private cloud (ideally GCP), Kubernetes and container orchestration.
- Linux shell and shell scripting; Linux tool chain installation, upgrade and maintenance.
- Experience in Docker container troubleshooting.
- Familiarity with CI/CD pipelines integrated with automation tool stacks (Terraform, Ansible and ArgoCD).
- Knowledge of monitoring platforms (GCP/AWS/Azure observability tool stacks).
- Strong understanding of networking concepts (TCP/IP, DNS, DHCP, VPN) is a plus.
- Excellent problem-solving, analytical, and communication skills.
- Ability to work independently and as part of a team.
- Proactive and comfortable with switching between multiple projects.
- Promotes good public relations on the phone and in person.

What can DeepHealth offer you?
- Competitive pay & pension - We value your work and reward it fairly
- Generous time off - 25 paid holidays for a full-time (40h/week) role
- Flexibility that works for you - Enjoy flexible hours and hybrid/remote options
- Travel allowance - An NS business card or an allowance for your travel per km
- Work from home budget - A joining home office budget and a WFH allowance
- Fun, international vibe - Join a collaborative team with diverse backgrounds
- Team celebration - Our yearly event brings everyone together in one exciting location

Did you know?
DeepHealth, a wholly-owned subsidiary of RadNet, is dedicated to advancing healthcare through AI-powered informatics. Our core offering, the DeepHealth OS, is a cloud-native operating system that centralizes and orchestrates data to enhance value across the entire enterprise. We aim to expand the radiologist's influence beyond traditional radiology, encompassing the full care pathway. By providing personalized workflows, DeepHealth empowers all users within the care continuum, making their work simpler and more impactful. DeepHealth leverages advanced AI technologies in breast, lung, and prostate health, alongside operational efficiencies, to create comprehensive, end-to-end efficiency throughout the organization. For more information, visit DeepHealth's website.

Physical Demands
This position often requires sitting, standing, walking, bending, twisting, reaching with hands and arms, using hands and fingers, handling or feeling, speaking, listening, and high-level cognitive thinking. The candidate must also be able to lift up to 10 pounds occasionally.

Working Environment
Hybrid/Remote

Accommodations
Reasonable accommodations may be made to enable people with disabilities to perform the essential functions of the job.

Salary Range: €70,000 - €80,000 per year
60,000 GBP - 70,000 GBP
Jan 01, 2026
Full time
Are you an engineer who thrives on solving complex reliability challenges across cloud platforms? We're looking for a Site Reliability Engineer who can combine strong technical capability with a pragmatic approach to automation, monitoring, and service delivery. You'll help keep Tribal's education-driven SaaS products highly available, scalable, and performant.

At Tribal
Tribal is a leading EdTech business providing market-leading software solutions to the global education market. We research, develop, and deliver the products, services, and solutions that education institutions worldwide rely on to support their core mission: educating students, delivering exceptional learning experiences, and achieving successful outcomes. Our Platform Engineering function is at the heart of this, ensuring our systems are designed and maintained to the highest standards of reliability and security. As part of the SRE & Operations team, you'll play a key role in delivering Tribal's products through the public cloud as SaaS services across AWS and Azure.

The Role
As a Site Reliability Engineer, you'll design, build, and operate large-scale systems with an emphasis on reliability, efficiency, and automation. You'll work across deployment, monitoring, and incident response to ensure our platforms stay healthy and our customers experience uninterrupted service.

You'll be involved in:
- Maintaining and improving production systems for availability, latency, and scalability
- Supporting application deployment and configuration to production environments
- Building or enhancing automation tools (Ansible, scripts, utilities)
- Implementing and managing observability tools such as DataDog or New Relic
- Analyzing logs and metrics to identify trends and improve reliability
- Supporting incident response and performing root-cause analysis
- Collaborating closely with engineering and customer teams to deliver proactive, preventative support
- Participating in on-call and out-of-hours rotations in line with Tribal's On-Call Policy

This is a full-time, fully remote UK-based role, with occasional national travel for team collaboration or customer engagements.

What you'll bring
- Strong experience with AWS (or Azure) environments
- Solid knowledge of Linux, Apache, and PHP in a production context
- Familiarity with automation/configuration tools such as Ansible
- Experience with monitoring and logging platforms (e.g. DataDog, New Relic, Azure Monitor)
- Good understanding of database fundamentals (SQL Server / Oracle)
- Hands-on troubleshooting and problem-solving skills
- Customer-facing experience with incident or service management tools (RemedyForce, ServiceNow)
- Strong written and verbal communication skills, able to translate technical details clearly

Nice-to-have:
- Experience coding or scripting (Python, PowerShell, or Bash)
- Understanding of CI/CD pipelines (Azure DevOps or similar)
- ITIL Foundation or cloud certifications (AWS SysOps Administrator, AWS Solutions Architect)

Note to applicants: We welcome applications from individuals who already have the right to work in the UK. As an equal opportunity employer, Tribal celebrate diversity and are committed to creating an inclusive environment for all employees. We make sure that our recruitment and selection processes never discriminate based upon any protected characteristics and actively welcome applications from all groups, not least those underrepresented in the tech sector.

Note to all applicants - Tribal reserve the right to close an advertisement to applications ahead of the advertised closure date. For this reason, shortlisting may take place prior to the closing date on some occasions. With this in mind, please do not hesitate to apply early.
Jan 01, 2026
Full time
About Mistral
At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life. We democratize AI through high performance, optimized, open source and cutting edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on premises or in cloud environments. Our offerings include le Chat, the AI assistant for life and work. We are a dynamic, collaborative team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation. Our teams are distributed between France, USA, UK, Germany and Singapore. We are creative, low ego and team spirited. Join us to be part of a pioneering company shaping the future of AI. Together, we can make a meaningful impact. See more about our culture on

Role Summary
We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer facing applications. You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers' expectations.
Location: Paris or London
Reporting line: Head of Engineering

What you will do
As a Site Reliability Engineer, you balance the day to day operations on production systems with long term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.

Operations (50%)
- Design, build, and maintain scalable, highly available and fault tolerant infrastructures to support our web services and ML workloads
- Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters
- Operate systems and troubleshoot issues in production environments (interrupts, on call responses, users admin, data extraction, infrastructure scaling, etc.)
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
- Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client facing APIs and large training runs
- Participate occasionally in on call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Development (50%)
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform
- Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model training experiments
- Build a cloud agnostic platform offering an abstraction layer between science and infrastructure
- Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API based features, web apps, dashboards, etc.)
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
- Document processes and procedures to ensure consistency and knowledge sharing across the team
- Contribute to open source projects, research publications, blog articles and conferences

About you
- Master's degree in Computer Science, Engineering or a related field
- 7+ years of experience in a DevOps/SRE role
- Strong experience with cloud computing and highly available distributed systems
- Exposure to site reliability issues in critical environments (issue root cause analysis, in production troubleshooting, on call rotations)
- Experience working against reliability KPIs (observability, alerting, SLAs)
- Hands on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes)
- Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
- Familiarity with infrastructure as code tools like Terraform or CloudFormation
- Proficiency in scripting languages (Python, Go, Bash) and knowledge of software development best practices
- Strong understanding of networking, security, and system administration concepts
- Excellent problem solving and communication skills
- Self motivated and able to work well in a fast paced startup environment

Your application will be all the more interesting if you also have:
- Experience in an AI/ML environment
- Experience of high performance computing (HPC) systems and workload managers (Slurm)
- Worked with modern AI oriented solutions (Fluidstack, Coreweave, Vast)

Location & Remote
This role is primarily based at one of our European offices (Paris, France and London, UK). We will prioritize candidates who either reside in Paris or are open to relocating. We strongly believe in the value of in person collaboration to foster strong relationships and seamless communication within our team. In certain specific situations, we will also consider remote candidates based in one of the countries listed in this job posting - currently France, UK, Germany, Belgium, Netherlands, Spain and Italy. In that case, we ask all new hires to visit our Paris office:
- for the first week of their onboarding (accommodation and travelling covered)
- then at least 3 days per month

What we offer
- Competitive salary and equity
- Health insurance
- Transportation allowance
- Sport allowance
- Meal vouchers
- Private pension plan
- Parental: Generous parental leave policy
- Visa sponsorship
Jan 01, 2026
Full time
Company Overview:
Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.

Job Summary:
We're looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You'll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7 x 365, all while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.

What You'll Do:
- Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
- Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
- Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
- Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting for Incident resolution, and provide tooling for our support organisation
- Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement
- Maintain and enhance ORI's observability stack: Prometheus, Grafana, and custom monitoring integrations
- Operate and support services in 24x7 production environments, including on-call rotation
- Contribute to Incident postmortem analyses and root cause analysis, document learnings, and automate remediations
- Mentor junior engineers and act as an Operational requirements consultant to other departments
- Communicate technical decisions clearly to non-technical stakeholders and customers
- Uphold a culture of: do, document, automate
- Willingness to cross train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
- Willingness to cross train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering

What you bring:
- 5+ years' proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
- Expert-level Linux administration, especially Ubuntu distributions
- Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
- Familiarity with Out of Band management tools (IPMI, Redfish, PXE, etc.)
- Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
- Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
- Deep understanding of observability principles and tools (Prometheus, Grafana)
- Hands on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
- Strong grasp of ITSM and service operation best practices
- Excellent communication and mentorship skills
- Comfortable interfacing with internal stakeholders and external customers
- Bonus: Knowledge of HPC workloads and GPU based infrastructure
- Bonus: Experience with InfiniBand networks and HPC performance tuning

Nice to have:
- Bachelor's or Master's level degree in Computer Science, Engineering or related field, or equivalent experience
- LPIC Certifications
- ITIL Foundation level qualification or equivalent experience

How you work:
- You approach problems with a systems mindset - balancing practical execution with long-term scalability
- You elevate the team, setting high standards for technical quality and engineering excellence.
- You hold yourself and others accountable - giving direct feedback and expecting the same
- You take initiative, owning challenges end to end and proactively driving solutions.
- You invest in others, mentoring to build both capability and confidence.
- You communicate clearly - translating complexity into clarity across engineering and business audiences

Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive. Here are just some of the great things you can expect from us:
- 30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
- A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
- Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
- Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside of our day to day jobs.
- Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
- Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
- Gympass subscription to a variety of gyms and wellbeing apps
- Participation in the company shares program
- Enhanced parental pay & leave

Diversity, Equity, Inclusion and Belonging
We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without attention to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.
Jan 01, 2026
Full time
Company Overview: Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.
Job Summary: We're looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You'll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7/365, all while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.
What You'll Do:
Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement
Maintain and enhance ORI's observability stack: Prometheus, Grafana, and custom monitoring integrations
Operate and support services in 24x7 production environments, including on-call rotation
Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
Mentor junior engineers and act as an operational requirements consultant to other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of: do, document, automate
Willingness to cross train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
Willingness to cross train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
What you bring:
5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
Expert-level Linux administration, especially Ubuntu distributions
Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
Familiarity with out-of-band management tools (IPMI, Redfish, PXE, etc.)
Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
Deep understanding of observability principles and tools (Prometheus, Grafana)
Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers
Bonus: Knowledge of HPC workloads and GPU-based infrastructure
Bonus: Experience with InfiniBand networks and HPC performance tuning
Nice to have:
Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience
LPIC Certifications
ITIL Foundation level qualification or equivalent experience
How you work:
You approach problems with a systems mindset - balancing practical execution with long-term scalability.
You elevate the team, setting high standards for technical quality and engineering excellence.
You hold yourself and others accountable - giving direct feedback and expecting the same.
You take initiative, owning challenges end to end and proactively driving solutions.
You invest in others, mentoring to build both capability and confidence.
You communicate clearly - translating complexity into clarity across engineering and business audiences.
Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive. Here are just some of the great things you can expect from us:
30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside your day-to-day job.
Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
Gympass subscription to a variety of gyms and wellbeing apps
Participation in the company shares program
Enhanced parental pay & leave
Diversity, Equity, Inclusion and Belonging
We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without attention to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.
Site Reliability Engineer
UK - Remote (ALK)
Job Family: Site Reliability Engineering
Your Business Sector: AECO - Architects, Engineers, Construction & Owners
What You Will Do: We are seeking a skilled and motivated Site Reliability Engineer to join our team in Trimble's Core Cloud Platform. The ideal candidate will have a strong background in cloud platforms, infrastructure as code, and automation via programming/scripting languages. You will work with a distributed team to drive the reliability, scalability, and security of the team's services and infrastructure. The Core Cloud Platform group builds the foundational common services used by dozens of Trimble products and millions of users.
Key Responsibilities:
Develop and maintain infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
Implement and enhance observability solutions using tools like New Relic, Datadog, Sumo Logic and Splunk for monitoring, logging, and alerting
Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
Lead incident response efforts, conduct root cause analysis, and implement long-term solutions for complex issues
Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
Participate in on-call rotations and handle critical incidents with confidence and expertise
Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team
What Skills & Experience You Should Bring:
Bachelor's or Master's degree in Computer Engineering or a related field
At least 5 years of technical experience with a proven ability to take ownership
Strong collaboration skills, with experience leading cross-functional work
Demonstrated success in managing infrastructure in production environments
Expertise in capacity planning and cost optimisation for efficient operations
Extensive experience with cloud-provider-hosted infrastructure (Amazon Web Services & Azure)
Proficient in high-level scripting languages (Python) and Infrastructure as Code (IaC) tools (Terraform, CloudFormation), along with containerisation
Experience with Kubernetes or other containerisation technologies
Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, etc.
Experience with incident management processes and monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, Sumo Logic, etc.
Strong understanding of networking and security concepts
About Trimble: Dedicated to the world's tomorrow, Trimble is a technology company delivering solutions that enable our customers to work in new ways to measure, build, grow and move goods for a better quality of life. Core technologies in positioning, modeling, connectivity and data analytics connect the digital and physical worlds to improve productivity, quality, safety, transparency and sustainability. From purpose-built products and enterprise lifecycle solutions to industry cloud services, Trimble is transforming critical industries such as construction, geospatial, agriculture and transportation to power an interconnected world of work. For more information about Trimble (NASDAQ: TRMB), visit:
How to Apply: Please submit an online application for this position by clicking on the 'Apply Now' button located in this posting.
Application Deadline: Applications could be accepted until at least 30 days from the posting date.
Join a Values-Driven Team: Belong, Grow, Innovate. At Trimble, our core values of Belong, Grow, and Innovate aren't just words - they're the foundation of our culture. We foster an environment where you are seen, heard, and valued (Belong); where you have an opportunity to build a career and drive our collective growth (Grow); and where your innovative ideas shape the future (Innovate). We believe in empowering local teams to create impactful strategies, ensuring our global vision resonates with every individual. Become part of a team where your contributions truly matter. If you need assistance or would like to request an accommodation in connection with the application process, please contact .
Job Title: Site Reliability Engineer
Location: UK - Remote
Top skills: JavaScript/TypeScript, RxJS, Systems Programming, Agile Methodologies, SQL
Jan 01, 2026
Full time
MARGO is hiring an experienced ITS Application Services Engineer to join a Global Markets technology team. You'll support and enhance the low-latency infrastructure powering electronic trading systems - ensuring connectivity, performance, and reliability across FX and Fixed Income platforms.
Location: Hybrid, Central London
Your mission: Deliver and maintain a world-class, low-latency trading environment - managing messaging middleware, network connectivity, observability tooling, and capacity planning to meet front office demands.
What you'll do:
Connectivity & Monitoring
Deploy and support global market connectivity (Solace/Tibco, multicast, TCP/IP) across 30+ sites.
Implement SRE principles: automated health checks, alerting, SLIs/SLOs, and capacity forecasts.
Incident & Performance Management
Lead deep-dive post-incident analyses (e.g., market events), using packet captures (Wireshark/tcpdump/Corvil).
Maintain dashboards (Geneos, Prometheus, Grafana) for proactive issue detection.
Automation & Resiliency
Build scripts (Python/Bash/PowerShell) and IaC for self-healing systems and reduced manual toil.
Execute lab-based testing and benchmarking for new connectivity solutions.
Collaboration & Delivery
Partner with global network, platform, and development teams to onboard applications and define standards.
Own change management processes, operational handoffs, and documentation (topologies, runbooks).
Technical fit - what we're looking for:
Strong background in low-latency networking: TCP/IP, multicast, traffic shaping, performance tuning.
Proven experience with messaging middleware (Solace, 29West, Tibco, LBM) in performance-sensitive environments.
Hands-on packet analysis using Wireshark, tcpdump, Corvil (custom decoder skills a bonus).
Scripting/automation with Python, Bash, or PowerShell.
Familiarity with observability platforms (ITRS Geneos, Prometheus, Grafana).
Experience supporting real-time trading applications, feed handlers, matching engines.
Excellent communication - able to translate technical metrics into business insights.
Nice to have:
Exposure to FIX/market data protocols and order routing architectures.
Hands-on use of CI/CD pipelines and IaC tools (Terraform, Ansible).
Experience with capacity planning and error budget management.
Tech stack & skills:
Middleware: Solace/Tibco/LBM
Networking: TCP/IP, multicast, BGP/OSPF, traffic shaping
Monitoring: ITRS Geneos, Prometheus, Grafana, Corvil
Automation: Python, Bash, PowerShell, IaC
Systems: Linux (Red Hat/CentOS), hybrid cloud infrastructure
Recruitment process:
Intro call with MARGO Talent Acquisition
Technical deep dive with MARGO consultant
Meeting with the bank's hiring team
Jan 01, 2026
Full time
A leading online gaming company based in Manchester is looking for a Site Reliability Engineer to enhance system reliability and observability. You will collaborate across multiple functions and contribute to continuous improvement initiatives within the software development life cycle. The ideal candidate should have expertise in Site Reliability Engineering principles, automation tools, and observability technologies. This role supports a hybrid working arrangement, providing flexibility while ensuring operational efficiency.
Jan 01, 2026
Full time
Founded in 2013, GSR is a leading market maker and programmatic trading firm in the fast-evolving world of cryptocurrency trading. With over 200 employees across seven countries, we provide billions of dollars in liquidity daily to cryptocurrency protocols and exchanges. We build long-term relationships with crypto communities and institutional investors by offering exceptional service, expertise, and tailored trading solutions. GSR works with token issuers, exchanges, investors, miners, and more than 30 cryptocurrency exchanges around the world. In volatile markets we are a trusted partner to crypto native builders and to those exploring the industry for the first time. Our team of veteran finance and technology executives from Goldman Sachs, Two Sigma, and Citadel, among others, has developed one of the world's most robust trading platforms designed to navigate issues unique to the digital asset markets. We have continuously improved our technology throughout our history, allowing our clients to scale and execute their strategies with the highest level of efficiency. Working at GSR is an opportunity to be deeply embedded in every major sector of the cryptocurrency ecosystem.
About the Role
We are seeking a Site Reliability Engineer (SRE) to design, optimize, and support highly available systems across our global trading infrastructure. As part of GSR's SRE team, you will manage a multi-regional cloud environment while integrating and automating our physical server inventory using Infrastructure as Code (IaC). You will work across all layers of infrastructure, including:
Networking & Exchange Connectivity
Microservice Orchestration & Observability
Disaster Recovery & Security Optimization
Your mission is to improve latency, scalability, and reliability, ensuring GSR remains a best-in-class market maker. We value engineers who drive automation, reduce friction, and enhance developer velocity through better tooling, CI/CD, and infrastructure design.
Who We're Looking For
Core Skills
Containers & Orchestration: Strong expertise in container security and Kubernetes (multi-cluster/global deployment is a plus).
Distributed Systems & Messaging: Knowledge of clusters, storage, Kafka, Aeron, and experience with multicast or HPC.
Automation & IaC: Proficiency in Python, Golang, or Rust with experience in IaC tools and immutable infrastructure.
Continuous Delivery & Config Management: Familiarity with FluxCD, ArgoCD, and custom CD deployments. Strong grasp of CI/CD pipelines.
Linux & Networking: Solid understanding of Linux internals, cgroups, routing, switching, firewalls, and DNS/service discovery.
Databases: Experience with MySQL, MongoDB, and database administration (Flyway or Liquibase a plus).
Bonus Experience
Data center operations
Crypto, fintech, bare-metal provisioning or trading experience
What We Offer
A collaborative and transparent company culture founded on Integrity, Innovation and Performance.
Competitive Salary with two discretionary bonus payments a year.
Benefits such as Healthcare, Dental, Vision, Retirement Planning, 30 days holiday and free lunches when in the office.
Hybrid working pattern in all of our offices from London, New York, Singapore, Zug and Malaga.
Regular Town Halls and offsites, team lunches and drinks.
A Corporate and Social Responsibility program as well as charity fundraising matching and volunteer days.
Immigration and relocation support where required.
GSR is proudly an Equal Employment Opportunity employer. We do not discriminate based upon any applicable legally protected characteristics such as race, religion, colour, country of origin, sexual orientation, gender, gender identity, gender expression or age. We operate a meritocracy; all aspects of people engagement from the decision to hire or promote as well as our performance management process will be based on the business needs and individual merit. Learn more about us at .
Jan 01, 2026
Full time
# Our Privacy Statement & Cookie Policy Site Reliability Engineer page is loaded Site Reliability Engineerremote type: Remote Job: Hybridlocations: GBR-London-5 Canada Squaretime type: Full timeposted on: Posted Todayjob requisition id: JREQ195152Thomson Reuters is seeking a Site Reliability Engineer to join the Case Center product within our Service Management, Technology team. The Site Reliability Engineer will support the reliability, performance, and operability of customer environments by contributing to routine change, incident, and problem management processes, as well as by driving continuous improvements in observability and automation across both non-production and production environments. The role will collaborate closely with product engineering, content, customer service, and wider technology teams to deliver resilient and high-quality service outcomes. About the Role: As a Site Reliability Engineer, you will: Lead proactive monitoring and health management for production and non-production environments; identify options for problem resolution and initiate appropriate actions. Own incident response for complex cases, including triage, stabilisation, root-cause analysis, post-incident review, and knowledge capture. Plan and execute standard installations, upgrades, migrations, configuration, and maintenance activities; contribute to technical plans and environment hardening initiatives. Develop, configure, and support tooling for system monitoring, troubleshooting, and automation to improve repeatability and time-to-restore. Maintain and evolve observability (alerts, dashboards, runbooks) to reduce noise and improve mean-time-to-detect and mean-time-to-restore. Liaise with application development, content, customer service, and software/hardware support teams to manage escalations and coordinate change. 
Contribute to the development of automation and internal tooling (e.g., packaging, checks, and deployment pipelines) to increase operational throughput and consistency. Produce and maintain operational documentation, standards, and implementation patterns to support secure, repeatable, and compliant operations. Maintain accurate, auditable records for change, deployment, and security across environments. Participate in a collaborative on-call rotation; occasional out-of-hours work may be required to support availability service levels. About You: Adequate familiarity with deploying, managing, and troubleshooting applications and services on Microsoft Azure. Proficiency in PowerShell and/or Python, with confidence using the command line (Bash) for diagnostics and tooling. A solid understanding of Windows Server administration, including configuration, security, and maintenance tasks. Experience in using and integrating APIs, particularly in configuring and employing them effectively within business processes. A foundational understanding of Artificial Intelligence principles and their practical application in improving system efficiency and performance. Experience or familiarity with one or more of the following is desirable: IIS Web Server, SQL Server and querying, HTML, Networking and firewall management, ITIL Framework, ServiceNow, Microsoft Power BI, Microsoft Power Apps, DataDog. A Bachelor's Degree in Computer Science, Computer Engineering, or a closely related field; alternatively, a professional certification in Microsoft Azure may be considered. Hybrid Work Model: We've adopted a flexible hybrid working environment (2-3 days a week in the office depending on the role) for our office-based roles while delivering a seamless experience that is digitally and physically connected. 
Flexibility & Work-Life Balance: Flex My Way is a set of supportive workplace policies designed to help manage personal and professional responsibilities, whether caring for family, giving back to the community, or finding time to refresh and reset. This builds upon our flexible work arrangements, including work from anywhere for up to 8 weeks per year, empowering employees to achieve a better work-life balance. Career Development and Growth: By fostering a culture of continuous learning and skill development, we prepare our talent to tackle tomorrow's challenges and deliver real-world solutions. Our Grow My Way programming and skills-first approach ensures you have the tools and knowledge to grow, lead, and thrive in an AI-enabled future. Industry Competitive Benefits: We offer comprehensive benefit plans to include flexible vacation, two company-wide Mental Health Days off, access to the Headspace app, retirement savings, tuition reimbursement, employee incentive programs, and resources for mental, physical, and financial wellbeing. Culture: Globally recognized, award-winning reputation for inclusion and belonging, flexibility, work-life balance, and more. We live by our values: Obsess over our Customers, Compete to Win, Challenge (Y)our Thinking, Act Fast / Learn Fast, and Stronger Together. Social Impact: Make an impact in your community with our Social Impact Institute. We offer employees two paid volunteer days off annually and opportunities to get involved with pro-bono consulting projects and Environmental, Social, and Governance (ESG) initiatives. Making a Real-World Impact: We are one of the few companies globally that helps its customers pursue justice, truth, and transparency. 
Together, with the professionals and institutions we serve, we help uphold the rule of law, turn the wheels of commerce, catch bad actors, report the facts, and provide trusted, unbiased information to people all over the world. Thomson Reuters informs the way forward by bringing together the trusted content and technology that people and organizations need to make the right decisions. We serve professionals across legal, tax, accounting, compliance, government, and media. Our products combine highly specialized software and insights to empower professionals with the data, intelligence, and solutions needed to make informed decisions, and to help institutions in their pursuit of justice, truth, and transparency. Reuters, part of Thomson Reuters, is a world-leading provider of trusted journalism and news. As a global business, we rely on the unique backgrounds, perspectives, and experiences of all employees to deliver on our business goals. To ensure we can do that, we seek talented, qualified employees in all our operations around the world regardless of race, color, sex/gender, including pregnancy, gender identity and expression, national origin, religion, sexual orientation, disability, age, marital status, citizen status, veteran status, or any other protected classification under applicable law. Thomson Reuters is proud to be an Equal Employment Opportunity Employer providing a drug-free workplace. We also make reasonable accommodations for qualified individuals with disabilities and for sincerely held religious beliefs in accordance with applicable law. More information on requesting an accommodation. Learn more on how to protect yourself from fraudulent job postings. More information about Thomson Reuters can be found on
Jan 01, 2026
Full time
Site Reliability Engineer
Remote type: Remote Job: Hybrid
Locations: GBR-London-5 Canada Square
Time type: Full time
Posted on: Posted Today
Job requisition id: JREQ195152
Thomson Reuters is seeking a Site Reliability Engineer to join the Case Center product within our Service Management, Technology team. The Site Reliability Engineer will support the reliability, performance, and operability of customer environments by contributing to routine change, incident, and problem management processes, as well as by driving continuous improvements in observability and automation across both non-production and production environments. The role will collaborate closely with product engineering, content, customer service, and wider technology teams to deliver resilient and high-quality service outcomes. About the Role: As a Site Reliability Engineer, you will: Lead proactive monitoring and health management for production and non-production environments; identify options for problem resolution and initiate appropriate actions. Own incident response for complex cases, including triage, stabilisation, root-cause analysis, post-incident review, and knowledge capture. Plan and execute standard installations, upgrades, migrations, configuration, and maintenance activities; contribute to technical plans and environment hardening initiatives. Develop, configure, and support tooling for system monitoring, troubleshooting, and automation to improve repeatability and time-to-restore. Maintain and evolve observability (alerts, dashboards, runbooks) to reduce noise and improve mean-time-to-detect and mean-time-to-restore. Liaise with application development, content, customer service, and software/hardware support teams to manage escalations and coordinate change.
WALT Labs, a leading managed service provider, is dedicated to empowering businesses by harnessing the power of cloud technology. Our team specializes in delivering customized solutions tailored to meet the unique needs of our clients, driving growth and operational efficiency across industries. From supporting small businesses with seamless data migration to enabling large corporations to manage complex infrastructure projects, we provide exceptional service while staying at the forefront of cloud technology advancements. We are seeking a skilled Site Reliability Engineer - UK with a strong focus on Google Cloud Platform (GCP) to join our dynamic team. In this role, you'll be responsible for maintaining cloud infrastructure, managing incidents, and ensuring seamless operations for our clients. You'll use tools like incident.io and JIRA to manage and resolve support requests efficiently. This is an in-office role: Monday - Friday, 9 AM - 6 PM GMT / BST. Qualifications for Site Reliability Engineer: Proven experience with Google Cloud Platform (GCP) services - 3+ years. (Kubernetes a must!) Understanding of Google Workspace (admin experience a plus). Familiarity with incident.io (or equivalent) for incident tracking and management. Proficiency in using JIRA for task management and support workflows. Strong experience working with observability tools (Grafana and DataDog). Strong troubleshooting and problem-solving skills in cloud environments. Understanding of cloud security and performance optimization best practices. Knowledge of scripting or automation tools (e.g., Python, Terraform) is a plus. Excellent written communication and customer service skills. Certifications in GCP (e.g., Google Cloud Associate or Professional certifications) are highly desirable. Ability to work under pressure and prioritize tasks effectively. Role responsibilities: Provide technical support and resolve issues related to Google Cloud Platform (GCP) services and AWS.
Provide client support for Google Workspace. Manage and respond to cloud incidents using incident.io, ensuring timely resolution. Use JIRA to log, track, and prioritize support tickets and workflow tasks. Monitor and maintain cloud infrastructure for performance, reliability, and security. Collaborate with teams to identify and implement solutions to technical challenges. Assist in deploying, configuring, and optimizing GCP resources. Create and maintain documentation for troubleshooting processes and best practices. Proactively identify opportunities to improve cloud environments and support processes. Support clients and stakeholders by providing clear communication and updates during incident resolution. Stay up-to-date with the latest GCP developments and contribute to team knowledge sharing. Benefits Private Medical Insurance Paid Time Off that increases with longevity (additional 1.5 days every 2 years) Professional development and advancement opportunities Pension Growth opportunities
Jan 01, 2026
Full time
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. At Nscale, our Support and Operations team plays a critical role in maintaining service availability, driving service reliability and rapid response to customer tickets globally. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future. About the Role (Job Purpose) Nscale is seeking an Infrastructure Support Manager to lead the daily operations and support of our global datacenter infrastructure. This role will manage a team of engineers providing monitoring, troubleshooting, and incident response for mission critical GPU, networking, and storage systems across multiple datacenters. You will ensure that incidents are resolved quickly, infrastructure health is continuously monitored, and support processes are followed consistently. This leadership role is key to guaranteeing operational excellence and reliability across Nscale's datacenter footprint. What You'll be Doing (Responsibilities) Lead and manage a team of infrastructure support engineers across global datacenter sites. Oversee daily monitoring and support of GPU, networking, and storage systems. Ensure rapid and effective incident response, escalation, and resolution. Develop and maintain support processes, runbooks, and escalation procedures. 
Collaborate with engineering, buildout, and operations teams to improve reliability and reduce recurring issues. Conduct root cause analysis and implement preventative measures for critical incidents. Track and report on support metrics (SLAs, uptime, MTTR, incident volume) to leadership. Drive adoption of monitoring, observability, and automation tools across the team. Mentor and develop team members, fostering a culture of operational excellence. Participate in the on call rotation and ensure adequate coverage across regions. About You (Skills / Qualifications) Proven experience in datacenter infrastructure support or operations management. Strong technical knowledge of servers, GPUs, networking, and storage systems. Solid understanding of monitoring and observability practices and tools (e.g., Prometheus, Grafana, Datadog). Experience leading support teams in mission critical 24/7 environments. Excellent troubleshooting and problem solving skills with a focus on root cause analysis. Familiarity with ITIL or other support frameworks for incident, problem, and change management. Strong leadership, communication, and coaching skills with the ability to manage global teams. Ability to collaborate across engineering, operations, and vendor partners. Nice to have: Experience in AI/ML or high performance computing infrastructure support. Knowledge of GPU orchestration and containerized environments (e.g., Kubernetes). Familiarity with automation and Infrastructure as Code (Terraform, Ansible, Pulumi). Exposure to sustainability and datacenter energy efficiency practices. What We Can Offer You At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core. Highly competitive package (base + equity) with reviews every 12 months. 
Join the fastest growing tech startup, your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting edge AI. Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support. Human First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments. Join our thriving remote first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio economic backgrounds. If there's anything we can do to accommodate your specific situation, please let us know. The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.
Jan 01, 2026
Full time
Job Description Joining Busuu means being part of one of the top EdTech companies in the world, a multiple award-winner recognised for its innovation and impact in language learning. Busuu's vision is to empower people through languages. We are the world's largest online community for language learning, with 120+ million registered users. We make learning a language easy by combining AI-powered courses with feedback from our global community of native speakers and lesson content designed for real life. Busuu is part of the global Chegg family. Chegg is the leading student-first connected learning platform and a NYSE listed company. At Busuu, we're building technology that helps millions of people learn languages every day. We're looking for a Senior Platform Engineer to help us design, scale, and improve the infrastructure that powers our platform. You'll be part of a team focused on reliability, automation, and observability - enabling our engineers to deliver faster and with confidence. What you'll do: Build and scale reliable cloud infrastructure, preferably in AWS (experience with Google Cloud is also great). Operate and optimise Kubernetes clusters at scale. Improve observability across systems using OpenTelemetry and New Relic. Automate infrastructure and deployments with Terraform, Terragrunt, and GitOps (ArgoCD). Design and evolve secure, scalable network architectures across multiple environments. Collaborate on CI/CD pipelines (GitHub Actions) and contribute to improving developer experience. Participate in our on-call rotation, ensuring reliability and quick incident response. A solid foundation in cloud infrastructure, Kubernetes, and Infrastructure as Code. Hands-on experience with monitoring, tracing, and logging. Strong understanding of networking principles, routing, and connectivity in cloud-native architectures. Programming or scripting skills (Python preferred).
Familiarity with databases (MySQL/PostgreSQL, MongoDB, Redis) and Kafka or similar event systems. A problem-solving mindset, comfortable taking ownership and learning new things on the fly. Nice to have: Experience supporting AI/ML workloads (model deployment, scalability, observability). Expertise in database performance tuning. Experience working remotely in distributed teams. At Busuu we want to ensure that you have access to some great benefits: Our centrally located offices are well-equipped with free breakfast, plenty of snacks and fresh fruit. You get 2 free lunches per week at our office that you can choose out of a wide selection of restaurants in the area. Busuu offers a great Private Health Insurance scheme. There is a personal training budget just for you, so you can learn more in your field to ensure our employees can continuously grow and progress in their careers. We like to support our teams with their work-life balance so we offer flexible working hours and a hybrid model of working. We offer enhanced maternity and paternity leave. Staying connected as a team is very important to us, so we have lots of social activities for you to join such as team lunches, Thursday socials, quarterly team, and company events. What happens next: We aim to have a simple and speedy hiring process and we want to make sure that we are right for you as much as the other way around. CV application review - We will review it as quickly as possible. Let's chat - Quick chat with our recruiter about your experience and the role. Culture fit interview - On-site or video call with the Engineering Manager. Technical questions - Technical call with the team. Technical test - We will send you a technical test to complete in your own time. Coding review - Technical interview and task review with the team. Our platform is for everyone, and so is our workplace. We pride ourselves on embracing our differences, whether they're cultural, racial, religious, or otherwise.
This means each one of us comes to work knowing that we have a voice - and a safe, judgement-free zone to speak freely. If you like the sound of that, join us. We'd love to hear what you have to say. Busuu is the world's largest language-learning community, touching the lives of 120 million learners across the world. At Busuu, we make learning a language easier by combining AI-powered courses, instant feedback from our global community of native speakers, and lessons with qualified teachers. If you love languages, want to work with smart, creative, energetic people and possess the initiative, confidence and good judgement to make independent decisions every day, then you'll love working with us!
Jan 01, 2026
Full time
London, Waterloo (Hybrid, 4 days in-office - Wednesday is our set work from home day, though you can come in on Wednesday too if you wish)

We are disrupting one of the world's largest asset classes, property. With £2Bn+ assets on our platform and 30,000+ users across 70 countries, we're building the future of asset ownership and, in doing so, are able to address wealth inequality. Our product simplifies property investing from start to finish, making real estate investment accessible to everyone.

What you'll love doing:
Working in cross-functional product teams, taking infrastructure and reliability initiatives from concept to production.
Navigating ambiguity in a fast-moving environment where ownership and freedom are core to how we operate.
Building and maintaining robust, scalable infrastructure across our GCP cloud environment.
Working with Kubernetes, Terraform, Cloudflare, and modern observability tooling to ensure our platform runs smoothly.
Collaborating closely with engineering teams to design CI/CD pipelines, improve deployment practices, and champion reliability as a core engineering principle.
Helping to define SRE practices for a high-growth fintech platform.
Mentoring other engineers as we scale our teams and impact.

What you'll be doing:
Designing, implementing, and maintaining our cloud infrastructure on Google Cloud Platform (GCP), ensuring scalability, reliability, and security.
Owning our Kubernetes clusters and containerization strategy - from Docker image optimisation to cluster management and deployment orchestration.
Building and evolving our Infrastructure as Code using Terraform, creating modular, testable, well-documented configurations that scale with our rapid growth.
Managing and optimising our Cloudflare infrastructure, including Workers for edge computing, DNS, CDN, security policies, and performance optimisation.
Deploying AI-powered product features in isolated and secure serverless environments.
Implementing comprehensive monitoring and observability using Prometheus and Grafana, defining SLIs/SLOs, and proactively identifying issues before they impact users.
Designing and maintaining CI/CD pipelines with appropriate quality gates, testing strategies, and deployment techniques (blue-green, canary) to enable fast, safe releases.
Ensuring security best practices across our infrastructure - from network design and access controls to secrets management and vulnerability scanning.
Working with engineering teams to improve application reliability, performance, and observability through instrumentation and architectural guidance.
Enabling developer productivity through self-service tooling, clear documentation, and automation of operational tasks.

What we're looking for:

Essential:
5+ years in SRE, DevOps, or platform engineering roles with production-grade infrastructure experience
Strong hands-on experience with Google Cloud Platform (GCP)
Expert-level knowledge of Kubernetes and Docker - you've deployed, managed, and troubleshot production clusters
Proficiency in Terraform for infrastructure as code
Experience with Cloudflare services, including Workers, DNS, CDN, and security features
Experience implementing and managing observability stacks with Prometheus and Grafana
Strong understanding of CI/CD principles, pipeline design, and deployment strategies
Experience with cloud networking, security groups, VPCs, and network peering
Solid scripting skills (Shell, Python, or similar)

Desirable:
Experience with blue-green or canary deployment techniques
Familiarity with programming languages like Go or TypeScript
Background in implementing security automation and quality gates
Experience with configuration management tools
Understanding of SRE principles: SLIs, SLOs, error budgets, and blameless post-mortems
Experience with edge computing and serverless architectures
Track record of mentoring engineers and fostering a culture of reliability

What we are building:
The first end-to-end real estate investment offering - making the dream of owning real estate more accessible to everyone globally.

Diversity & inclusion at GetGround:
We encourage applications from all sections of society and we believe in the criticality of an inclusive culture. We are committed to equal employment opportunity regardless of race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity or any other basis as protected by law.
42% of our employees identify as female or non-specified, 58% as male
22 nationalities represented across offices in 5 countries
Our work on Design Accessibility Inclusion is at the heart of our culture - we celebrate and reflect on key D&I and cultural events such as Black History Month, International Women's Day and Pride

For more information on how we store your candidate data, please see our recruitment privacy policy.
Jan 01, 2026
Full time
Once For All is a high-growth, cloud-based, SaaS subscription business. Our technology helps our customers to manage their supply chain governance, risk management and compliance. We work across the public and private sectors and have over 250k customers across the UK in 20 different sectors, including construction, transport, retail, hospitality, education, facility and property management, manufacturing, and local and central government.

Role Summary
Join our Reliability and Platform group, partnering with 10 Agile SCRUM teams to scale and harden a suite of microservices on Microsoft Azure. You will own production reliability for tier-1 services, set and track SLOs, automate operations, and lead incident response to keep our next-generation Supplier Risk Assessment platform fast, secure, and available. This is a fully remote role.

Job Responsibilities
Define SLOs, SLIs, and error budgets for critical services.
Architect resilient multi-region and zone-aware workloads on Azure and AKS.
Build infrastructure as code with Terraform or Bicep. Enforce policy as code.
Design safe releases with progressive delivery, automated rollbacks, and feature flags.
Lead on-call rotations, incident response, postmortems, and corrective actions.
Implement end-to-end observability: metrics, logs, traces, dashboards, alerts.
Plan capacity, tune performance, and optimize cost without impacting reliability.
Secure the stack with Managed Identity, Key Vault, workload identity, and network segmentation.
Establish backup, disaster recovery, and tested restore procedures with clear RPO and RTO.
Mentor engineers and raise reliability standards across product teams.

Candidate Requirements
10+ years in SRE, platform, or production-facing engineering roles running large-scale systems.
7+ years hands-on with Microsoft Azure: AKS, Front Door or Application Gateway, VNets, Private Link, Key Vault, Monitor, Log Analytics, Application Insights, Service Bus, Storage, SQL or Cosmos DB.
6+ years operating Kubernetes in production, including at least 3 years on AKS (network policies, PodDisruptionBudgets, HPA/VPA, node pools, upgrade playbooks).
5+ years infrastructure as code with Terraform or Bicep and Git-based workflows.
5+ years designing observability and SLO-based alerting using OpenTelemetry and Kusto queries.
4+ years running canary or blue-green deployments in Azure DevOps or GitHub Actions.
Proven incident command experience with measurable MTTR and MTTD improvements.
Strong automation skills in Python or Go, plus Bash and PowerShell.
Solid understanding of security hardening, container image scanning, SBOM, and least privilege.
Experience with performance testing, p95 and p99 tuning, caching and connection pool strategies.

Nice To Have
Multi-tenant SaaS and data sovereignty patterns.
Service mesh, eBPF, or advanced traffic shaping.
Compliance and audit trail design.
FinOps practice with cost per request or per tenant KPIs.

What We Offer
Health and Wellbeing: Private Medical Insurance or wellness fund, 24/7 Employee Assistance Programme.
Financial Benefits: Pension, Life Assurance (3x salary).
Time Off: 25 days holiday + 8 bank holidays, holiday purchase scheme (+5 days), paid and unpaid volunteering days.
Growth and Development: Ongoing CPD, team offsites, and company events.
Everyday Perks: Home office budget, high-spec laptop and peripherals.
Work Setup: Fully remote within UK time zones, optional access to our Basingstoke office.

Tech Stack You Will Use
Azure, AKS, Terraform or Bicep, Azure DevOps or GitHub Actions, Docker, Helm, Service Bus, Storage, SQL Server, Cosmos DB, Key Vault, Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, OpenTelemetry, feature flagging tools.

Interview Process
Intro and role overview with Talent.
Technical deep dive on Azure and AKS architecture.
Practical exercise: propose SLOs and an alert plan for a sample service, plus a release safety plan.
Culture and collaboration interview with Engineering.
Jan 01, 2026
Full time