Building Resilient IT Teams and Solutions

Episode 12   Published January 2, 2025   13 minute watch

Episode Summary

In this episode of the Automate IT podcast, Jeremy Maldonado discusses the importance of resilience and adaptability in the IT field as we enter 2025. He emphasizes the need for personal and team resilience, effective communication, and the ability to learn from setbacks. The conversation covers practical steps for achieving IT resilience, including risk identification, researching services, testing, and having a disaster recovery plan. Jeremy concludes by encouraging listeners to embrace resilience for a successful year ahead.

Transcript

Jeremy Maldonado (00:00)

Hello everyone. Firstly, on behalf of Automox, I would like to give you a warm welcome to the year 2025. I'm wishing you a peaceful and productive start to your year. And I hope that you took some time to reset and rest, because I think we can all agree that the world of IT can sometimes be really demanding on us. My name is Jeremy and I want to welcome you to the Automate IT podcast, or Automate It podcast, where I plan to talk to you about my experience with navigating the world of IT from the perspective of a person who is coming face to face with new learning experiences just about every day.

I very much consider myself someone who is constantly learning. And if you feel like you're very much the same way, then please join me for a bit, grab a cup of coffee, water, anything you like, and let's just have a casual chat to kick off the year.

Let's start it off with the question, how was your December? Was it easy? Did anything blow up?

I feel like some of us fully expect that nothing can go perfectly. However, just because something might not go according to plan doesn't mean you get a complete failure. It does not matter if you're talking about baking a cherry pie, like something I tried to do over my holiday break, and I burned it terribly. I followed the recipe exactly. Didn't matter. Or we can talk about whether you were trying to deploy an application using a content sharing service that worked for three weeks and then, all of a sudden, the day before your vacation, it decided not to deploy properly, because that happened to me prior to this vacation too.

So today, the topic for our conversation is going to be resilience and adaptability. This is very much an IT topic because, after all, this is supposed to be an IT podcast, but I can't help but make this about things that are not just specific to technology and software development. I whipped out a dictionary for this. Sorry, here's my dictionary right now.

I want to thank you, Stella, for sharing your knowledge with me. Resilience is the capability and the capacity to quickly recover from difficulties. So when we talk about building resilience in our technology, I think when you talk about building resilient people, that makes me sound way too philosophical for this.

I don't think anyone came here for my philosophy. I do have books about that. We can talk about that. Maybe that's for another chat though. When we talk about building resilience, we all have the capability to learn from difficult experiences. I'm not even talking about IT anymore. I used to wait tables prior to this. I used to manage restaurants prior to this.

Some of those experiences were far more stressful than anything I've had to overcome in my 10 years working in IT. But leaning on these past experiences is how we build resilience. And this resilience that we've gained as people, at work and outside of work, is resilience that we also bring into our organizations to help one another.

Before we can talk about technical topics for IT resilience, can we first talk about personal topics for maintaining resilience in our day-to-day work? So when you or your teammates are speaking to one another, can you talk about learning from setbacks, calculating change and challenge before trying to take on something new? Can you talk about setting realistic goals?

I think for a team to be resilient, they have to work together to foster a team culture of open communication, encouraging everyone to contribute. When everyone is able to speak, they can bring up concerns that most likely others on the team just weren't able to think of yet. Conversation allows us to open up different parts of our brain, where we keep what we consider unconscious or subconscious thoughts.

Not only is honest communication good for learning and development, but having good communication with your team also provides resourcefulness. When faced with new challenges or problems, can your team pull together and build a creative and effective solution? Resilience does not always have to be about toughness. It can be about creativity.

For example, my background comes mostly from a Linux perspective. So at the beginning of my time with Linux systems, I gave myself this impression that solving problems meant learning everything. But learning everything just simply isn't possible. Creativity is responsible for solving more problems than, I think, really anyone could probably imagine.

Learning from experience allows you and the people who work with you to approach new challenges inside of new environments. And this is very much the definition of adaptability. So I sure hope I don't regret saying this, but for any situation that you or your organization could possibly face, you are not the only person that's ever had to come face to face with that event.

Google is a prime example. Google just about any scenario that's ever happened to you and chances are it's happened to someone else, and this is a good thing. This goes back to resourcefulness.

Your environments and your ecosystems are susceptible to the same events and concerns in your company as in everyone else's too. Let's think about natural disasters, for example. Hurricanes can ravage the coast and cause downtime not only for staff but for entire data centers at one time. We have a responsibility to ensure that our data and our infrastructure can be operational.

We can even use a smaller example, like an application outage rendering a certain web page or a client application unusable.

In today's landscape, you really have no excuse for not taking backups, especially with a plethora of remote backup services. And as someone who administers my own remote Linux servers, constantly hosting websites and applications for myself, whether for testing or for small businesses that I just happen to be associated with, there's a plethora of services that I can use to keep my data available.

I can certainly tell you stories where a broken DNS configuration file broke a website on my system, where DNS wasn't working as a result and a website couldn't resolve, or where I had a corrupted InnoDB database that required me to restore from backup. Backups save lives. I know that sounds dramatic, but I'm really speaking from the heart here.

I also think that when it comes to this topic, we have to talk about failover, specifically hardware failure, and I'm talking about the age-old quote that we all know, using air quotes here: "If one server goes down, another server comes back up to pick up the load," end quote. Cloud service providers have this ability, and hardware monitoring has also come a long way. For Linux, Pacemaker is well thought of as a cluster manager for high availability environments, and for Windows environments, WSFC is also well documented all over the internet as well as in Microsoft's own documentation, which is what I've referred to in the past. All of these topics serve one purpose, which is making our environments more resilient. Clustering and high availability have been put to good work.

I tried to outline this as best as I could, but for IT resilience, I think I've summed this up into four steps that I tried my best to write down in my sketchbook here. The first thing is identifying risk, and communication is key here. Strong teams with great communicators are going to set up your organization for success. And is your team able to rely on each other's experiences? This is a role where senior members of your team who are tenured should be able to speak up on past experiences. Anyone can tell you hindsight is 20-20; what worked really well can very well work again. And the people who were here to experience problematic occurrences or big changes in the past should be willing to speak up and share their experience, and communicate challenges as well as successes and failures.

Step two is researching services, products, hardware, and practices. Like I said when I spoke about Pacemaker for Linux high availability, or WSFC for Windows Server failover clustering, some of these services and products are very well documented. Today, we're fortunate enough to have that many resources to help us build out our plans to keep our organizations on their feet. Redundancy can be implemented at various levels, including server failovers, network devices, or even external services, for which I don't want to name examples because I'm not sponsored, but I think you get my point.

Your third step should be testing and monitoring. Everything has to be tested. If you work in IT, you likely already know that anything can be broken. There's no such thing as a website that cannot go down or a system that cannot fail to boot. While it might be hard to replicate an entire hurricane or a natural disaster taking down an entire service, we could replicate this by just walking up to a server and unplugging it from the wall. Testing and monitoring can also be a hassle to implement, but it's extremely important to understand that, at a moment's notice, your team can be creative and be agile.

And I think the fourth step, which is probably the one that I didn't need to outline, is having a disaster recovery plan.

Procedures should be properly outlined and knowledge must be shared. Resilient IT starts with resilient people. As I said earlier, most types of events that can happen to your organization have likely already happened to someone else's. These experiences can be leaned on, and experiences need to be shared.

Tools, software, and access to systems should be extended to all available staff to ensure that, should a disaster strike, the team is ready to keep the organization on its feet. If you have a team of, for example, more than five people, but only one of them has access to certain systems or services that are vital for maintaining availability during some sort of disaster, that could be massively problematic.
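As a rough illustration of that access check, here is a minimal Python sketch. It assumes you can export a simple inventory of which people can reach which critical systems; the system names, the people, and the `find_access_gaps` helper are all hypothetical and not tied to any particular tool.

```python
from typing import Dict, List, Optional, Set


def find_access_gaps(
    access: Dict[str, Set[str]],
    min_people: int = 2,
    unavailable: Optional[Set[str]] = None,
) -> List[str]:
    """Return the systems that fewer than `min_people` available staff can reach."""
    away = unavailable or set()
    return sorted(
        system
        for system, people in access.items()
        if len(people - away) < min_people
    )


# Hypothetical inventory: critical system -> people who have access to it.
access_map = {
    "backup-console": {"alice"},
    "dns-provider": {"alice", "bob", "carol"},
    "cluster-admin": {"bob", "dave"},
}

# Systems that depend on a single person on a normal day.
at_risk = find_access_gaps(access_map)
print(at_risk)

# The same check with one team member out of office.
at_risk_with_bob_away = find_access_gaps(access_map, unavailable={"bob"})
print(at_risk_with_bob_away)
```

Running a check like this before a holiday period, with the people who will be away passed as `unavailable`, is one way to spot systems that would be left with a single person, or nobody, able to respond.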

When we're in the holidays, key members of certain teams might be unavailable. Having everyone on the team able to contribute to a problem should be a massive part of your disaster recovery plan. Resilience isn't just about bouncing back, it's about learning, evolving, and being prepared for what's next. Whether it's fine-tuning a disaster recovery plan or simply sharing knowledge within your team, the steps that you take today will make all of the difference tomorrow.

Before I sign off, I wanted to thank you for tuning in. Here's to building adaptable, resilient, dependable systems and teams so that we can tackle 2025. If you stayed with me this entire time, I really am grateful. Thank you to Automox for letting me speak to you for the last, I don't know, about 13 minutes. Welcome to 2025 again.

On behalf of Automox, thank you for giving me your time. Be kind to yourself at work and at home, and I'll see you next time. Thank you.