Needle in the application stack: Finding and predicting hybrid IT issues
- By Mav Turner
- Jul 06, 2017
Trying to find the root cause of IT problems can often feel like looking for a needle in a haystack. Worse, there are often multiple haystacks -- and sometimes the haystack you need to search is on a completely different farm.
Federal administrators running hybrid IT environments have to look in a lot of different haystacks to identify the source of their networks’ problems. Issues may exist on-premises in their complex application stacks. Or they could exist far away, somewhere in the cloud. Without proper and complete visibility into all aspects of the network, it can be very difficult to tell where the problem lies -- and finding those proverbial needles can be nearly impossible.
Today, many federal network managers do not have the ability to continuously monitor both on- and off-site environments. Core dependencies stretch across boundaries, and individual tools used to monitor what happens on-premises will not necessarily pick up all of the interactions throughout hybrid infrastructures.
A hybrid IT canvas
This is a problem for a world that is becoming increasingly hybrid. Today’s federal IT managers need a network view that is broad and expansive and offers visibility into resources and applications, wherever they reside. This canvas must take into account everything in their networks, including virtualization, storage, applications, servers, cloud and internet providers, users and more. Managers must be able to see, correlate and understand the data being collected from all of these resources and share it with their colleagues.
But seeing the data is only the first step toward finding that needle. Network managers must deploy methods that allow them to compare data types side by side to more easily identify the cause of potential issues. Being able to look at data derived from application performance counters, storage performance or virtual machine host memory utilization can help operations teams easily pinpoint anomalies and find the right haystack to tackle.
Timelines can be laid on top of this information to further identify the cause of slowdowns or outages. For instance, a manager who is alerted to a non-responding application at 11:15 a.m. can review the disparate data streams and look for warning signs in those streams around the time that the issue first occurred. Managers can share these dashboards with their teams to get everyone on the same page and verify that the problems are resolved quickly.
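As a rough illustration of that timeline review, the sketch below collects events from several data streams that fall within a window leading up to an incident. The stream names, sample events and the 15-minute window are hypothetical, chosen only to mirror the 11:15 a.m. example; a real monitoring platform would supply its own data model.

```python
from datetime import datetime, timedelta


def events_near(incident_time, streams, window_minutes=15):
    """Return (timestamp, source, message) tuples from every stream that fall
    in the window leading up to the incident, sorted by time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for source, events in streams.items():
        for timestamp, message in events:
            if incident_time - window <= timestamp <= incident_time:
                hits.append((timestamp, source, message))
    return sorted(hits)


# Hypothetical sample data: three monitoring streams on the day of the incident.
incident = datetime(2017, 7, 6, 11, 15)
streams = {
    "storage": [
        (datetime(2017, 7, 6, 9, 0), "snapshot completed"),
        (datetime(2017, 7, 6, 11, 5), "latency above 50 ms"),
    ],
    "vm_host": [(datetime(2017, 7, 6, 11, 12), "memory utilization 97%")],
    "app_counters": [(datetime(2017, 7, 6, 11, 14), "request queue growing")],
}

timeline = events_near(incident, streams)
for timestamp, source, message in timeline:
    print(timestamp.strftime("%H:%M"), source, message)
```

Merging the streams into one ordered list is what lets a warning sign in storage sit next to the application symptom it may have caused.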
Being able to see events that took place during a specific timeframe can help assess the impact they had on other IT resources. This dependency mapping can be critical in complex environments where one application depends on another. Unlike in traditional IT, dependencies in a cloud environment are highly dynamic. Databases can move around, and containers can pop up and disappear. Being able to quickly and automatically identify dependencies and the impact that events can have on connected resources -- whether on-premises or hosted -- can save precious problem-solving time.
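One simple way to picture dependency mapping is as a graph walk: given a record of which resource depends on which, find everything downstream of a failed component. The sketch below is a minimal illustration under that assumption; the resource names and the `depends_on` map are invented for the example, and real tools typically discover these relationships automatically rather than from a hand-written dictionary.

```python
from collections import deque


def impacted_by(failed, depends_on):
    """Return every resource that directly or indirectly depends on `failed`.

    `depends_on` maps a resource name to the resources it relies on.
    """
    # Invert the map so each resource points to the things that rely on it.
    dependents = {}
    for resource, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(resource)

    # Breadth-first walk outward from the failed resource.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(failed)  # guard against cycles that lead back to the start
    return seen


# Hypothetical hybrid environment: on-premises apps relying on a hosted store.
depends_on = {
    "web_portal": ["auth_service", "orders_db"],
    "auth_service": ["directory"],
    "reporting": ["orders_db"],
    "orders_db": ["cloud_storage"],
}

print(sorted(impacted_by("cloud_storage", depends_on)))
```

Because cloud dependencies change often, a map like this would need to be rebuilt from fresh discovery data rather than maintained by hand.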
A window into the future
Speaking of time, monitoring systems should not only provide managers with a view into their entire hybrid IT environments -- they should also provide windows into the future.
Too often, federal IT managers are forced to react to problems -- the needle falls into a haystack, and their users ask them to find it after the fact. But reacting to an incident is usually more time- and resource-intensive than preventing the problem in the first place. Plus, repair processes can adversely impact end users’ quality of service.
It’s far better to use predictive analytics to avoid the issues altogether. By collecting and analyzing all of the aforementioned network and systems data, federal IT managers can better predict when capacity problems or failures may happen and take steps to mitigate issues before they occur. Using algorithms that detect trends and anomalous patterns, systems can alert managers before an event occurs, offer insight into its potential impact and advise on how best to react.
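To make trend-based prediction concrete, here is a minimal sketch that fits a straight line to daily utilization samples and estimates how many days remain before a capacity threshold is crossed. The function name, the sample figures and the 90 percent threshold are assumptions for illustration; commercial predictive analytics generally use richer models than a single linear fit.

```python
def days_until_threshold(samples, threshold):
    """Estimate days until a linear trend through `samples` reaches `threshold`.

    `samples` holds one utilization reading per day, oldest first.
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None

    # Ordinary least-squares fit of utilization against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling usage never reaches the threshold

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Measure from the most recent sample; 0.0 means already at or past it.
    return max(0.0, crossing_day - (n - 1))


# Hypothetical disk utilization (percent) over five days.
usage = [60, 62, 64, 66, 68]
remaining = days_until_threshold(usage, threshold=90)
print(remaining)
```

An alert raised when the estimate drops below some lead time gives managers room to add capacity before users notice a problem.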
For example, at some point in the past, an agency may have experienced an issue with CPU and memory being oversubscribed on a set of virtual machines. If the events that led up to that issue recur, the manager can receive recommendations on how to address the problem before it becomes a real concern. Those recommendations could include relieving memory problems or high CPU usage by moving a VM from one host to another, allowing IT managers to optimize workloads and avert problems.
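A recommendation of that kind could, in its simplest form, look like the sketch below: find the host with the highest memory utilization above a limit, then suggest moving one of its VMs to the least-loaded host that can absorb it. The data layout, the 85 percent limit and the "largest VM first" rule are all assumptions made for the example, and CPU is ignored for brevity; this is not how any particular product decides.

```python
def recommend_move(hosts, limit=0.85):
    """Suggest (vm, source_host, target_host) to relieve an overloaded host.

    `hosts` maps a host name to {"capacity_gb": float, "vms": {vm: used_gb}}.
    Returns None when no host exceeds `limit` or no safe target exists.
    """
    def utilization(host, extra_gb=0.0):
        used = sum(hosts[host]["vms"].values()) + extra_gb
        return used / hosts[host]["capacity_gb"]

    overloaded = [host for host in hosts if utilization(host) > limit]
    if not overloaded:
        return None
    source = max(overloaded, key=utilization)

    # Try the largest VM first, since moving it frees the most memory.
    by_size = sorted(hosts[source]["vms"].items(), key=lambda kv: kv[1], reverse=True)
    for vm, used_gb in by_size:
        # A target qualifies only if it stays within the limit after the move.
        targets = [
            host for host in hosts
            if host != source and utilization(host, used_gb) <= limit
        ]
        if targets:
            return vm, source, min(targets, key=utilization)
    return None


# Hypothetical two-host cluster with one oversubscribed host.
hosts = {
    "host_a": {"capacity_gb": 64, "vms": {"vm1": 30, "vm2": 28}},
    "host_b": {"capacity_gb": 64, "vms": {"vm3": 20}},
}
print(recommend_move(hosts))
```

The check on the target host matters: a move that simply shifts the oversubscription elsewhere would not avert the problem.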
A network that runs smoothly
One of the primary jobs of any federal IT manager is to keep their network running smoothly so the user experience does not degrade. Sometimes that involves sorting through increasingly complex hybrid IT environments to find that one little needle. But it could also involve not letting that needle get lost in the first place. Either way, managers must discover and implement new ways to gain complete network and system visibility and continuously monitor all of their resources.
Mav Turner is senior director, product management, for SolarWinds.