National lab mines operational IT data to boost customer experience
- By Stephanie Kanowitz
- Mar 16, 2016
When Pacific Northwest National Laboratory CIO Brian Abrahamson asked his IT staff to put more emphasis on customer experience, the team that handles monitoring and automation turned the spotlight on PNNL data.
“We had the insight that we could use the data that we have to understand what our customers were experiencing,” said Daryl Anderson, chief infrastructure architect at the lab. The IT team’s primary objectives, he said, were to “understand how our operations are impacting our customers and what can we do to proactively make their experience better.”
But that was difficult to do because of the glut of available information -- about 400 gigabytes of data from 30 to 40 sources and about 13,000 devices. To manage all that machine-generated data with one tool, the team deployed Splunk IT Service Intelligence, a scalable monitoring and analytics solution that addressed the IT staff’s two main objectives: It shows any operations problems at a glance and learns to predict future issues.
Released in November 2015, Splunk ITSI provides end-to-end visibility into the operational health of IT systems through dashboards and visualizations that can be mapped to user-determined key performance indicators (KPIs). It collects and indexes terabytes of data and metrics across data centers and cloud-based infrastructures. If it doesn’t have a collection component for what an agency needs, the company can easily build a software development kit or application programming interface (API) to do it.
“We actually look at a collection of components, if you will, as they make up an entire end-to-end service,” said Andi Mann, Splunk's chief technology advocate. “If you think of email, for example, it’s not just your Outlook or your inbox. It’s also the network that sends the email, the storage at the server that’s storing all the email. This is a service…. ITSI works with that service view. So instead of, for example, just looking at server uptime or response rates or things like that, you’re actually seeing what end users see. You’re seeing the real service, the entirety of the service.”
For instance, there’s the Glass Tables user interface, which lets users monitor IT services through color-coded visualizations showing overall health plus specifics such as the number of users and number of open help-desk tickets. It also allows for Deep Dives, an investigative tool that shows KPI search results over time, and lets users visually correlate root causes of events in a graphical dashboard.
What’s more, the solution is capable of machine learning so it can assess what baseline operations look like and predict problems to let IT managers stop them before they start. Users can run correlation searches against learned indicators to pinpoint areas for concern.
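The article does not describe how ITSI learns a baseline, so the following is only a minimal sketch of the general idea, assuming a simple mean-and-standard-deviation model. The function names, the threshold and the sample response times are all hypothetical, and none of this is Splunk's actual algorithm or API.

```python
from statistics import mean, stdev


def learn_baseline(history):
    """Summarize past KPI samples as a (mean, standard deviation) pair."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a sample lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds (not PNNL data).
history = [120, 118, 125, 122, 119, 121, 124, 120]
baseline = learn_baseline(history)

print(is_anomalous(123, baseline))  # close to the usual range
print(is_anomalous(180, baseline))  # far outside the usual range
```

A production system would likely account for time of day, seasonality and trends rather than a single static mean, but the principle of comparing new readings against learned normal behavior is the same.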
"This was key to freeing up their creativity by delivering strategic goals of driving greater automation, removing menial tasks and allowing their talented crew to do more innovative work," Mann said.
Every morning, the PNNL team starts with an operations meeting at which they study their infrastructure health dashboard, which offers an overview of all the servers, systems and network devices. It shows any outages that have occurred and also lets the team see which problems are high priority.
“It’s a great source for us to see quickly every morning if there’s focuses and priorities that we need to attack the day with,” Anderson said.
The team also created the objective lifecycle replacement dashboard to let IT staff get a better understanding of an individual computer’s performance. It made a score card for every computer using the data being pulled into Splunk. Now, IT workers can replace computers when they stop performing, rather than according to a set timetable, said Justin Brown, an IT engineer at PNNL.
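The article does not say which measurements feed PNNL's score card, so the sketch below simply assumes three illustrative ones (boot time, crashes and CPU saturation) and combines them into a 0-to-100 score. The field names, limits and cutoff are hypothetical, chosen only to show how performance-based replacement might be expressed.

```python
from dataclasses import dataclass


@dataclass
class ComputerMetrics:
    hostname: str
    boot_seconds: float        # average boot time
    crashes_per_month: float
    cpu_busy_fraction: float   # share of time the CPU is saturated, 0.0 to 1.0


# Hypothetical "worst acceptable" value for each metric.
LIMITS = {
    "boot_seconds": 120.0,
    "crashes_per_month": 4.0,
    "cpu_busy_fraction": 0.5,
}


def score(m):
    """Return a 0-100 score, where 100 means no measurable degradation."""
    # Each penalty is the metric's share of its limit, capped at 1.0.
    penalties = [min(getattr(m, name) / limit, 1.0) for name, limit in LIMITS.items()]
    return round(100 * (1 - sum(penalties) / len(penalties)))


def needs_replacement(m, cutoff=40):
    """Replace on measured performance rather than on a fixed timetable."""
    return score(m) < cutoff


healthy = ComputerMetrics("ws-001", boot_seconds=30, crashes_per_month=0, cpu_busy_fraction=0.1)
worn_out = ComputerMetrics("ws-002", boot_seconds=110, crashes_per_month=5, cpu_busy_fraction=0.6)

for machine in (healthy, worn_out):
    print(machine.hostname, score(machine), needs_replacement(machine))
```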
Additionally, the team created website visitors dashboards that show how many visitors a PNNL website had in the past 90 days and how frequently those users visited it. Now, when a site is going to be taken down, regular users can be notified.
“It gives them a list of emails and everything else so they can quickly put together a personalized message that says, ‘Hey, Jim, I noticed you use our site a lot, and we’re going to be down tomorrow. Just wanted to let you know,’” Brown said. “It’s allowed us to be a lot more personal and taken a lot of the effort out of it as well.”
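As a rough illustration of what such a dashboard computes, the sketch below counts visits per user over a 90-day window and drafts a notice for the frequent ones. It assumes visit records are available as (email, timestamp) pairs; the function names, the three-visit threshold and the sample data are hypothetical and do not reflect PNNL's actual Splunk searches.

```python
from collections import Counter
from datetime import datetime, timedelta


def regular_visitors(visits, now, window_days=90, min_visits=3):
    """Return {email: visit_count} for users with at least `min_visits` in the window.

    `visits` is an iterable of (email, timestamp) pairs.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(email for email, ts in visits if cutoff <= ts <= now)
    return {email: n for email, n in counts.items() if n >= min_visits}


def downtime_notice(name, site, when):
    """Draft a short personalized message for one regular visitor."""
    return (
        f"Hey, {name}, I noticed you use {site} a lot, "
        f"and we're going to be down {when}. Just wanted to let you know."
    )


# Hypothetical visit log.
now = datetime(2016, 3, 16)
visits = [
    ("jim@example.org", now - timedelta(days=2)),
    ("jim@example.org", now - timedelta(days=20)),
    ("jim@example.org", now - timedelta(days=45)),
    ("ann@example.org", now - timedelta(days=5)),     # only one visit
    ("bob@example.org", now - timedelta(days=100)),   # outside the window
]

for email, count in regular_visitors(visits, now).items():
    print(email, count)
    print(downtime_notice("Jim", "our site", "tomorrow"))
```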
PNNL runs Splunk ITSI on premises, but it is also available for cloud and hybrid environments.
When the lab started using Splunk a few years ago, it took several months to migrate and understand the data and to work out the KPIs and dashboards. The IT staff prioritized data migration by looking at the effects on customer service, Anderson said.
Having ironed all that out before introducing ITSI made for an easier migration, Brown explained. “Now as new data comes in, we have a better handle” on it, he said.
Looking ahead, the team is working on adding mobile alerting capabilities that will use the infrastructure health dashboard to determine who needs alerts about their systems and how best to communicate those alerts, said Arzu Gosney, monitoring and automation technical lead at PNNL. The team is “really going with mobile … and notifying the right people at the right times for the right messages,” Gosney said.
PNNL’s need for a single view into its IT operations is common in the federal government, Mann said.
“What we’re seeing in federal IT, and especially federal government IT, is adoption of new kinds of services,” Mann said, such as cloud computing, microservices, containerization, open data and APIs. “We’re seeing a lot of fragmentation of IT across these different components and services and providers and technology, and that’s creating a lack of visibility.”
Because each technology keeps an ongoing record of everything that happens -- who’s logging on and off, what transactions are being made, whether users run a packet over a network -- when a problem arises, IT managers have to comb through the machine data from each system.
“It’s a needle-in-the-haystack-type of problem except you’ve got 20 different haystacks to look through,” Mann said. “Being able to look through all of those haystacks with one system lets [PNNL] find those outages faster. The really cool thing, though, in my mind, is that it lets them start to predict what’s going to happen. You can start to see trend lines, you can start to see into the future.”
Stephanie Kanowitz is a freelance writer based in northern Virginia.