127 Site Reliability jobs in Ireland
Site Reliability Engineer
Posted today
Job Description
Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish. Apple Corporate Finance Systems Engineering is looking for an Application Support Engineer for a critical payments application. The Corporate Systems group at Apple focuses on creative ways to engineer business solutions to meet the growing needs of Apple's Finance, Apple Pay, iTunes, Sales, Retail, and IT Service organizations. At its core, our portfolio comprises engineered custom solutions to process very high-volume micro-transactions from Apple Pay, iTunes Downloads, App Store, iPhone Activations, sales from Retail, Online, and Resellers, etc. These solutions are based on cutting-edge enterprise technologies ranging from server-side Java, web technologies, search platforms, and cloud technology to Oracle and NoSQL databases. Accurately processing such high-volume transactions is our core strength.
Description
In this role, you will be a part of the operations team for a payments platform. The role requires the candidate to embrace a complex software product architecture to support a continuous stream of payments data to ensure accurate invoicing, disbursements, receipts and payments. This role is responsible for infrastructure design and scalability to support the platform and for the related product development. The role will actively monitor the production health of the platform, including tickets, operational retrospectives and the definition and management of necessary controls. As the business evolves, this role will identify utilities that can assist in operations and execute the product development for them. Monitoring, alerting, profiling tools, fault tolerance, etc. will be part of the charter for this role. The platform is a part of very fast-paced initiatives from a range of business units. Good communication, building business relationships and thriving in a fast-paced environment are necessary for this role. Will you join us in crafting solutions that do not yet exist?
Minimum Qualifications
- BS/MS Computer Science or Equivalent.
- Relevant experience supporting critical production applications.
- Fundamental knowledge in Java and fluency in scripting languages such as Bash and/or Python.
- Significant experience querying and modifying data in SQL and NoSQL databases.
- Experience in identifying performance bottlenecks and suggesting optimizations.
Preferred Qualifications
- Experience working in financial systems.
- Experience with application monitoring and profiling tools.
- Familiarity with container technology and CI/CD systems.
Site Reliability Engineer
Posted today
Job Description
Who We Are:
Bed Bath & Beyond is a leading online furniture and home furnishings retailer, headquartered in the USA, with an innovative software development base in Sligo that builds and supports the e-commerce platforms for our global retail sites.
What We Do:
We innovate to deliver simple, fast, secure, and delightful experiences for our customers, partners, and teams. Our software engineering teams thrive in our positive, open, excellence driven and innovative culture. Our team members play a critical role in keeping that magic.
Are you passionate about automation, modern infrastructure, and helping development teams move faster and more securely? We're looking for an SRE Engineer who thrives in a fast-paced environment and loves solving complex technical challenges.
You'll join our team, working alongside developers and infrastructure engineers to maintain, modernize, and support our internal platforms. From Linux systems to Kubernetes clusters, and Git repos to secrets management, you'll have a direct impact on how code ships and scales.
Key responsibilities
- Be the go-to Linux administrator for internal and production tooling
- Support and manage Kubernetes environments, including Vault, Consul, ArgoCD, and PostgreSQL Operator (PGO)
- Maintain and improve internal tools like Jenkins, Bitbucket, Nexus, and Jira
- Automate infrastructure using Ansible, Shell Scripts, or Python
- Participate in an on-call rotation for urgent issues (light duty with learning opportunities)
- Help migrate legacy systems and Git repositories to modern SaaS platforms
- Drive system reliability and performance through proactive testing and engineering
- Contribute to documentation, knowledge sharing, and internal best practices
Required skills
- 5+ years of Linux systems administration experience (performance tuning, scripting, troubleshooting)
- Infrastructure-as-code skills using Ansible, Puppet, or similar
- Proficiency in Bash, Python, or equivalent scripting languages
- Ability to work independently with strong critical thinking and problem-solving skills
- Experience in agile, fast-moving teams
- Familiarity with compliance (SOX/PCI), Java web apps, and Apache/Tomcat environments
- Experience managing Kubernetes in production (CKA certification a bonus)
Nice to Have
- Experience with Hashicorp Vault and Consul in secure environments
- Experience with Disaster Recovery and Disaster Preparation a plus
- Openstack or cloud technologies experience a plus
- Hands-on experience with Git repositories (Bitbucket Server/Cloud preferred)
- Experience with tools like Jenkins, Nexus, SonarQube, Jira, or Spring Boot
Education
- Bachelor's degree in Computer Science, Information Systems or related field, or equivalent experience
What We Offer
- Pension contributions, share options, bonus, private health and dental for you and family, paid maternity and paternity leave
- Hybrid working model
- Option to work abroad for up to 3 months each year
What We Value:
- Life/Work Balance
- Pride in Production
- Trust
- Challenge yourself, inspire others
- Success through diversity
Equal Employment Opportunity:
It is our commitment to ensure that all employment decisions are made without regard to gender, civil status, family status, sexual orientation, age, disability, race, religion or membership of the Traveller community (protected characteristics under the Employment Equality Acts).
Site Reliability Engineer
Posted today
Job Description
Job Description: Site Reliability Engineer
Kneat enables regulated organizations to move from paper-based validation to intelligent, digitized, paperless solutions. And we do it through the ongoing development of a powerful, purpose-built software platform. In 2014, after eight years of intensive software development, we launched Kneat Gx—the world's most advanced validation software to help revolutionize the speed, precision, transparency, and intelligence of validation in the Life Sciences sector. Our solution is now used by some of the world's leading Life Sciences companies.
What we're looking for
As Kneat continues to expand, we are looking for an enthusiastic Site Reliability Engineer (SRE) to join our Engineering department.
Position overview
Reporting to our Site Reliability (SRE) Team Lead, the Site Reliability Engineer (SRE) will help to build-out the next generation of our SaaS platform.
The successful candidate will support Kneat's delivery of a best-in-class SaaS platform to our customers. They will focus on ensuring the optimum performance and availability of the infrastructure hosting Kneat's SaaS platform and on ensuring that we deliver the availability and scalability that our customers demand. We are seeking a detail-orientated individual with a passion for automation and cloud delivery models. If this sounds like you, we want to hear from you.
Responsibilities
Work as a member of the Site Reliability Engineering team to build, administer and maintain our 24X7 production environment, staging/development environments and supporting infrastructure.
To be a point of contact for technical issues on a variety of components e.g. production systems, staging environment, development and deployment tools.
Continuously endeavour to improve the stability, performance and scalability of the platform that hosts the Kneat SaaS solution.
Partner closely with the DBA team to ensure the smooth operation of the solution, and to support the Cloud Automation team and the Deployment team in the delivery of the solution.
Collaborate with application engineering teams to solve business needs with provided cloud services.
Work with the Security function to ensure security best practices are followed across systems and that reporting metrics are available.
Participate in the on-call support rota to provide 24/7 support.
Minimum qualifications
3+ years of experience in a Systems Administration/SRE/DevOps capacity (with a strong focus on scripting and automation).
Experience administering Linux and/or Windows-based infrastructure in highly available environments.
Solid understanding of and experience with AWS services and cloud operation concepts, including IAM, EC2, S3, Redshift, Lambda, OpenSearch, VPC, Security Groups, Load Balancer and ASG, as well as cloud security.
Deep understanding of Microservices architecture and deployments.
Web Server administration experience (NGINX / IIS preferred).
Automation - Strong scripting experience (preferably Powershell & Python, but other scripting languages also considered).
Infrastructure-as-Code (IaC) experience, preferably with Ansible, Terraform
Experience of containerization tools and technologies such as Docker, Kubernetes and Helm.
Understanding of TCP/IP, DNS, routing, VPN, load-balancing, SMTP.
Working Experience in supporting Patch Management using tools such as Qualys.
Good understanding of DR Testing.
Working Knowledge of Prometheus, Grafana and ELK, or other log-aggregation / monitoring solutions.
Degree or equivalent in a computing or engineering discipline.
Strong team player with a results-oriented track record.
Excellent written and verbal communication skills.
Self-motivated and enthusiastic with a continuous learning mindset.
Nice to haves:
Familiarity with DevOps technologies - CI-CD stacks, Git, Azure DevOps
SQL Server administration experience.
Knowledge of best practices in cloud security, compliance, and governance for AWS environments.
Previous experience of maintaining infrastructure and delivering change in a regulated environment.
Experience of Agile / Kanban methodology.
Ability to have a positive impact on team members and communicate openly and directly to individuals or groups at all levels.
AWS Certified
What we offer you
At Kneat, we truly value ideas and collaboration so we've created an environment that builds, protects, and celebrates teamwork. Our strong culture is central to our continued success.
We offer programs and rewards that one would expect from a highly successful and growing technology company.
A fantastic culture, team, and energy.
Competitive compensation.
Comprehensive benefits package.
Flexible work arrangements.
Training and professional development.
Kneat is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to Equal Employment Opportunity (EEO) regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status.
Reasonable accommodations may be made to enable qualified individuals with disabilities or special needs to perform these essential functions. If you have a disability or special need that requires accommodation to complete this application form, please contact us at Ext 2004) or email us at for assistance
Site Reliability Engineer
Posted today
Job Description
Skills
Mandatory – Production support, SRE skills, Event-driven, PCF, Kafka, Splunk
Key Responsibilities
- Design, build, and maintain event-driven architectures to support scalable and resilient applications.
- Manage and optimize PCF (Pivotal Cloud Foundry) deployments, ensuring application performance and availability.
- Operate and administer Apache Kafka clusters, including monitoring, scaling, security, and troubleshooting.
- Develop and manage observability and monitoring solutions using Splunk, ensuring proactive issue detection and resolution.
- Collaborate with development teams to integrate SRE best practices (SLIs, SLOs, SLAs, error budgets, etc.).
- Automate operational tasks, CI/CD pipelines, and system monitoring to reduce manual interventions.
- Conduct incident management, root cause analysis, and postmortem reviews.
- Implement capacity planning, scaling strategies, and fault-tolerant solutions.
- Contribute to infrastructure as code (IaC) and cloud-native deployments.
Requirements
Required Skills & Qualifications
- Strong understanding of event-driven architecture and microservices.
- Hands-on experience with PCF (Pivotal Cloud Foundry) – deployment, scaling, and troubleshooting.
- Expertise in Kafka – cluster management, topics, partitions, producers/consumers, schema registry, and stream processing.
- Experience with Splunk – logging, dashboards, alerting, and operational insights.
- Solid knowledge of Linux, networking, and system performance tuning.
- Proficiency in scripting/programming (Python, Shell, or similar).
- Experience with CI/CD pipelines (Jenkins, GitLab, or similar).
- Strong troubleshooting and problem-solving skills in distributed systems.
- Good understanding of SRE principles – monitoring, automation, SLIs/SLOs/SLAs.
Nice-to-Have Skills
- Exposure to Kubernetes and container orchestration.
- Experience with cloud platforms (AWS, GCP, or Azure).
- Familiarity with infrastructure as code (Terraform, Ansible, etc.).
- Knowledge of other observability tools (Prometheus, Grafana, ELK stack).
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent work experience).
- 3–6 years of experience in SRE, DevOps, or related roles.
Site Reliability Engineer
Posted today
Job Description
Site Reliability Engineer (BizOps) – 12-Month Contract, South Dublin
Don't miss out on this great opportunity. Apply today for an immediate start.
What's involved:
- Work on Automation & Monitoring
- Salesforce Admin & Case Resolution
- Collaborate with product and development teams to build resilient systems
- Lead incident response & root cause analysis
- Drive DevOps and automation practices, including CI/CD pipeline management and tooling improvements.
- Shape operational design, capacity planning, monitoring strategies, and more
What you need:
- 3+ years' experience as an SRE/ BizOps Engineer
- Experience with Salesforce (Service Cloud/ Sales Cloud) is a plus
- In depth knowledge of scripting/ coding (Python, Shell)
- CI/CD pipeline deployment
- Proven work experience in an agile environment
What's on offer:
- €300- €350 per day depending on experience
- Remote work- must be based in Ireland
- High chance of contract extension or going permanent
Apply Now
If you would like to discuss this opportunity in person or with one of our IT Resourcers, please forward your CV to Vantage Resources or contact Eric Seery for a confidential discussion. Vantage Resources will not forward your details without prior discussion and approval.
Vantage Resources is an equal opportunity employer. All qualified applicants will receive equal consideration for engagement and/or employment. An inclusive and diverse workforce is an essential part of the development of our organisation's culture which we believe enhances both our working environment and the service we provide to our customers.
Site Reliability Engineer
Posted today
Job Description
Job Description – Site Reliability Engineer (SRE)
Key Responsibilities
- Design, build, and maintain event-driven architectures to support scalable and resilient applications.
- Manage and optimize PCF (Pivotal Cloud Foundry) deployments, ensuring application performance and availability.
- Operate and administer Apache Kafka clusters, including monitoring, scaling, security, and troubleshooting.
- Develop and manage observability and monitoring solutions using Splunk, ensuring proactive issue detection and resolution.
- Collaborate with development teams to integrate SRE best practices (SLIs, SLOs, SLAs, error budgets, etc.).
- Automate operational tasks, CI/CD pipelines, and system monitoring to reduce manual interventions.
- Conduct incident management, root cause analysis, and postmortem reviews.
- Implement capacity planning, scaling strategies, and fault-tolerant solutions.
- Contribute to infrastructure as code (IaC) and cloud-native deployments.
Requirements
Required Skills & Qualifications
- Strong understanding of event-driven architecture and microservices.
- Hands-on experience with PCF (Pivotal Cloud Foundry) – deployment, scaling, and troubleshooting.
- Expertise in Kafka – cluster management, topics, partitions, producers/consumers, schema registry, and stream processing.
- Experience with Splunk – logging, dashboards, alerting, and operational insights.
- Solid knowledge of Linux, networking, and system performance tuning.
- Proficiency in scripting/programming (Python, Shell, or similar).
- Experience with CI/CD pipelines (Jenkins, GitLab, or similar).
- Strong troubleshooting and problem-solving skills in distributed systems.
- Good understanding of SRE principles – monitoring, automation, SLIs/SLOs/SLAs.
Nice-to-Have Skills
- Exposure to Kubernetes and container orchestration.
- Experience with cloud platforms (AWS, GCP, or Azure).
- Familiarity with infrastructure as code (Terraform, Ansible, etc.).
- Knowledge of other observability tools (Prometheus, Grafana, ELK stack).
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent work experience).
- 3–6 years of experience in SRE, DevOps, or related roles.
Site Reliability Engineer
Posted today
Job Description
Site Reliability Engineer *** Great company benefits *** Remote working
Our client is keen to add a Site Reliability Engineer to their growing engineering team across Ireland.
This is an ideal opportunity for someone interested in joining a fast-paced start-up, pioneering novel metrics, data products, and intelligence solutions, whilst offering insights into the economics, markets, usage, health, and other aspects of the FinTech space.
Key Responsibilities
● System Architecture and Design: Collaborate with software engineering teams to design scalable, highly available, and resilient systems. Drive architectural improvements to enhance system reliability and performance.
● Implement Infrastructure as Code to manage services and deployments in a multi-cloud, multi-project configuration.
● Automation and Tooling: Develop automation tools and scripts to streamline deployment, monitoring, and incident response processes. Implement and maintain infrastructure as code frameworks.
● Monitoring and Alerting: Configure and maintain monitoring systems to detect and mitigate potential issues proactively. Define alerting thresholds and response procedures to ensure timely incident resolution.
● Incident Management: Respond to and resolve critical incidents, perform root cause analysis, and implement preventive measures to minimize the likelihood of recurrence. Participate in an on-call rotation to provide 24/7 support as needed.
● Capacity Planning and Performance Optimization: Analyze system performance metrics, identify bottlenecks, and propose optimizations to improve resource utilization and efficiency.
● Security and Compliance: Work closely with security teams to implement best practices for data protection, access control, and compliance with regulatory requirements. Conduct periodic security audits and vulnerability assessments.
● Documentation and Knowledge Sharing: Document system configurations, procedures, and troubleshooting steps. Share knowledge and best practices with team members to foster a culture of continuous learning and improvement.
Must Have:
● Proven experience in an independent contributor role working with cloud platforms: GCP, AWS, Azure, Infrastructure-as-Code tooling: Terraform, Helm, and CI/CD orchestration platforms: GitlabCI, ArgoCD, Github Actions or similar GitOps workflows.
● Excellent problem-solving skills and the ability to independently troubleshoot complex issues.
● Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams.
● Strong Architectural & Security Mindset.
Should Have:
● Strong understanding of Linux/Unix systems administration and networking concepts.
● Hands-on experience with configuring and running monitoring tools like Prometheus, Grafana, etc.
● 5+ years experience of maintaining infrastructure-as-code on Google Cloud Platform, Amazon Web Services or Azure.
● Experience working in SOC 2 Type 1 and Type 2 certified companies.
Desirable:
● Proficiency in scripting and programming languages such as BASH, Golang, Python and TypeScript.
● 2+ years hands-on experience operating highly available Kubernetes clusters.
● Experience being involved in incident management and resolution.
● Experience with AI development tools and related security considerations.
● Passion for the Blockchain Industry & Decentralised Systems.
● Experience with Blockchain Infrastructure, either in a personal or professional capacity.
If this role is of interest to you please apply now or contact Ciarán Bergin in Parker Stewart with any additional questions you may have.
Site Reliability Engineer
Posted today
Job Description
Apple Services Engineering team is one of the most exciting examples of Apple's long-held passion for combining art and technology. Join Apple Services Engineering Cloud Service Infrastructure team, as a Site Reliability Engineer, to help support and scale cloud services for millions of Apple users. We are building and supporting new and existing critical infrastructural systems and frameworks which provide and support services like structured and unstructured storage, caching, queueing, searching, and much more at hyperscale. These form the platform upon which many iCloud and other backend systems at Apple are built. The team is responsible for the next generation platform that will power Apple's infrastructural services. These services operate at extremely large scale and store exabytes of data. The platform will support a variety of services based on open-source software, such as Kubernetes, Cassandra, Zookeeper, Kafka, Redis, etc, alongside internally developed services.
Description
The Apple Services Engineering Cloud Services SRE organization are domain experts in fleet management, systems, and software engineering. We build automations, instrument reliability tools, and respond to alerts and incidents which may pose a risk to the reliability of the platform. The team's focus is on infrastructure capabilities and processes, improving the reliability and efficiency of the systems, at scale. We are looking for a strong, enthusiastic developer to join as a member of this group. This person will have a tremendous amount of individual responsibility and influence over the direction the core platform of many critical Apple internet services takes for years to come. You are someone with ideas and real passion for software delivered as a service to improve reuse, efficiency, and simplicity. This engineer's work will affect hundreds of millions of users and be essential to the success of some of the most visible current and future Apple features.
Minimum Qualifications
- Strong emphasis on SRE as an engineering subject area, with proficiency in at least one of the following languages: Golang, Rust, Python, Swift
- Successful track-record and proven experience as a backend internet services software developer
- Knowledge of SDLC, including continuous integration, testing methodologies, TDD and agile development methodologies
- Understanding of base internet infrastructure services, including DNS, DHCP, LDAP, server virtualization, and server monitoring, with experience in critical, large-scale distributed systems combining hardware, operating systems, and software
- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements.
- Bachelor's or Master's in Computer Science, Computer Engineering, or equivalent experience.
Preferred Qualifications
- Working with large bare-metal infrastructure and release management.
- Experience with large scale server provisioning, fleet management and maintenance
- Experience with development within Kubernetes ecosystem, including operator framework, controllers and CRDs
- Hardware bootstrap and associated security (PXE, BIOS, TPM, secure boot, trusted computing)
- Automating operations processes via services and tools
- Configuration management and fleet orchestration via Puppet, Chef, Ansible, or others
Site Reliability Engineer
Posted today
Job Description
Job Title:
Site Reliability Engineer
Job Location:
Dublin, Ireland
Job Type:
Permanent
Job description:
Mandatory skill –
Tools - Splunk, Dynatrace, Jenkins, Blazemeter, xMatters, Grafana
Work - Production support, deployment work, planned maintenance
Support - 24x7 once in a while
Job Description:
The Cyber Security Service Biz Ops Team is looking for a Site Reliability Engineer familiar with and who has practiced the Standard Engineering Principles.
Major Responsibilities: The role of business operations is to be the production readiness steward for the platform. This is accomplished by closely partnering with developers to design, build, implement, and support technology services. A business operations engineer will ensure operational criteria like system availability, capacity, performance, monitoring, self-healing, and deployment automation are implemented throughout the delivery process. Business Operations plays a key role in leading the DevOps transformation at Client through our tooling and by being an advocate for change and standards throughout the development, quality, release, and product organizations.
We accomplish this transformation through supporting daily operations with a hyper focus on triage and then root cause by understanding the business impact of our products. The goal of every Biz Ops team is to shift left to be more proactive and upfront in the development process, and to proactively manage production and change activities to maximize customer experience, and increase the overall value of supported applications. Biz Ops teams also focus on risk management by tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments. A Biz Ops focus is also on streamlining and standardizing traditional application specific support activities and centralizing points of interaction for both internal and external partners by communicating effectively with all key stakeholders.
Ultimately, the role of Biz Ops is to align Product and Customer Focused priorities with Operational needs. We regularly review our run state not only from an internal perspective, but also understanding and providing the feedback loop to our development partners on how we can improve the customer experience of our applications.
Site Reliability Engineer
Posted today
Job Description
Role: Site Reliability Engineer
Location: Dublin, Ireland (Hybrid)
Job type: Permanent
Mandatory skills:
Axon OR Kafka
Manage and operate Apache Kafka (must-have skill): configure topics, manage partitions, ensure high availability, monitor metrics (e.g., consumer lag, throughput), and troubleshoot issues like message loss or latency.
Work with Axon Framework (must-have skill): design and maintain event-driven systems using CQRS/ES (Command Query Responsibility Segregation / Event Sourcing) patterns, integrate with Kafka for event streaming, and ensure scalability and resilience of distributed applications.
Manage and operate other messaging/streaming platforms such as NATS or MQ as needed.
Requirements
Responsibilities:
Design, develop, and maintain automation scripts, tools, and integrations using languages such as Python, Go, Java, or Bash.
Write clean, maintainable code, perform debugging, interact with APIs, and manage version control (Git) with unit testing.
Administer Linux/Unix systems, including process management, file systems, permissions, kernel tuning, shell scripting, server configuration, updates, and security hardening.
Build and manage cloud infrastructure on AWS, GCP, or Azure, leveraging IaC tools like Terraform or CloudFormation.
Architect and operate scalable and highly available systems using Kubernetes, ECS, or other container orchestration tools.
Configure and troubleshoot network protocols and services (TCP/IP, HTTP, DNS, VPNs, firewalls, load balancing) with diagnostic tools (e.g., Wireshark, traceroute).
Implement observability practices using Splunk, Dynatrace, Prometheus, Grafana, Datadog, Jaeger/Zipkin. Define SLIs/SLOs and build dashboards for actionable insights into system health.
Develop and maintain CI/CD pipelines with Jenkins, GitLab CI, or GitHub Actions to automate build, test, and deployment processes, including rollback strategies.
Diagnose and resolve production issues through logs, metrics, and debugging tools. Participate in incident management, perform root cause analysis (RCA), and contribute to blameless postmortems.
Implement security best practices: secrets management (Vault), zero-trust architectures, vulnerability management, and compliance standards (SOC 2, GDPR).
Qualifications:
BS in Computer Science or a related technical field (e.g., Physics, Mathematics) OR equivalent practical experience.
4–5 years of hands-on experience in software development, systems administration, and cloud infrastructure management.
Proven expertise in Apache Kafka OR Axon Framework (must-have).