In today’s market, two of the main goals of IT infrastructure managers are to increase the mean time between failures (MTBF) and to decrease the mean time to resolve failures (MTTR). These indicators show how equipment failures in the technology environment affect the business, since many of the enterprise’s services, processes, and activities depend on IT.
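As a rough illustration of how these two indicators can be derived from an incident log, here is a minimal Python sketch. It assumes each incident is recorded as a (failure start, service restored) pair and that incidents do not overlap; the function name, the observation period, and the sample incidents are hypothetical and are not taken from any customer data.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(incidents, period_start, period_end):
    """Return (mean time between failures, mean time to resolution).

    incidents: list of (failure_start, service_restored) datetime pairs,
    assumed not to overlap and to fall inside the observation period.
    """
    if not incidents:
        raise ValueError("at least one incident is needed")
    # Total time the service was down during the period.
    downtime = sum((end - start for start, end in incidents), timedelta())
    # Everything else in the period counts as operating time.
    uptime = (period_end - period_start) - downtime
    count = len(incidents)
    return uptime / count, downtime / count


# Hypothetical 30-day observation window with three made-up incidents.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
incidents = [
    (datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 11)),   # 1 hour down
    (datetime(2024, 1, 12, 14), datetime(2024, 1, 12, 16)),  # 2 hours down
    (datetime(2024, 1, 20, 9), datetime(2024, 1, 20, 12)),   # 3 hours down
]

mtbf, mttr = mtbf_and_mttr(incidents, period_start, period_end)
print("Mean time between failures:", mtbf)
print("Mean time to resolution:", mttr)
```

With these sample values, the 720-hour window contains 6 hours of downtime across three incidents, so the mean time between failures is 238 hours and the mean time to resolution is 2 hours.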
In management, a situation, project, or problem is considered well handled when the result meets the “three Es” of administration: it is effectual because it was handled quickly, efficient because the response addressed the actual problem, and effective because it had a positive impact on the business. These three concepts also apply to the demands handled by the IT team. A failure that takes a long time to be addressed can trigger various other problems, and a problem that receives only a stopgap fix will almost certainly return and keep disrupting the technical team’s work. Although many professionals recognize this ideal scenario, many others treat it as utopian and detached from everyday reality. This view is held not only by managers, but also by the employees who are part of the operation.
Another relevant point is that, when a network problem happens, the wrong team is often notified because the real cause of the failure is not yet known. The incident then gets passed from team to team, and it takes progressively longer to resolve.
The big question is: how can you be efficient and effective in managing both the team and the IT environment?
In this article, we will look at a case study of a customer who uses Telcomanager’s monitoring solutions and how they resolved a recurring failure in their data network.
Case Study: Server Monitoring
Good network management depends on network visibility. The IT equipment infrastructure must be monitored in detail, with performance metrics, charts, reports, alarms, and the like.
A Telcomanager customer once reported that a service crucial to the company’s billing became unavailable at a certain time of day, which was a peak hour, that is, a time of high demand for that service. Every day, at that same time, the service went down.
The service was provided by a local server. The support team first directed the incident to the group that developed the application hosted on that server, and the development staff in turn passed it to the support team responsible for the network infrastructure.
In this situation, who will solve the problem?
That’s when SLAview came on the scene. Telcomanager’s technical consultant noticed that traffic was decreasing before the service became unavailable. This was unexpected, since the initial suspicion was that the server could not handle the high demand from users. Yet while traffic was decreasing, CPU consumption was increasing.
We therefore concluded that the problem was not in the network infrastructure, but in a process that was consuming too much of the server’s processing capacity. The SLAview graphs showed the precise moment at which CPU consumption started to rise, and from there it was discovered that a specific database procedure was overloading the machine.
Ultimately, responsibility for handling the incident lay with the database team, who found an SQL query that was running for too long and overloading the server’s CPU.
This scenario demonstrates the importance of correlating different metrics for the same object. With SLAview, the manager can understand the behavior of each object and speed up problem-solving by sending the incident to the right team. After that, the issue did not recur.
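The kind of correlation described in this case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how SLAview works internally: the samples below are made-up values that mimic the pattern in the case study (traffic falling while CPU rises), and the function names and the 90% threshold are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def first_breach(timestamps, values, threshold):
    """Return the first timestamp whose value reaches the threshold, or None."""
    for stamp, value in zip(timestamps, values):
        if value >= threshold:
            return stamp
    return None


# Made-up samples for one server, collected every five minutes.
times = ["13:40", "13:45", "13:50", "13:55", "14:00", "14:05"]
traffic_mbps = [80, 78, 70, 55, 30, 5]
cpu_percent = [35, 40, 55, 75, 92, 99]

r = pearson(traffic_mbps, cpu_percent)
breach = first_breach(times, cpu_percent, threshold=90)

print(f"traffic vs CPU correlation: {r:.2f}")
print("CPU first reached 90% at:", breach)
```

A strongly negative coefficient here means that CPU load is rising while traffic falls, which points away from user demand as the cause and toward a process on the server itself. The breach time gives the database or application team a starting point for checking what was running at that moment.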
SLAview is Telcomanager’s monitoring system, delivered on a proprietary appliance and sized according to demand. It is a versatile system that supports various types of collection, such as SNMP, ICMP queries, and even script-based collections that can be used to integrate with APIs. It also offers several types of customizable output, such as reports, dashboards, topology maps, graphs, and alarms.
Network visibility is therefore indispensable for IT management to be effectual, effective, and efficient in solving incidents and failures, and that visibility needs to come from a robust and flexible system like SLAview. We also saw the importance of good incident management: processes need to be well defined within the technical team so that failures are routed to the right people and solved correctly.
With this in mind, Telcomanager, Latin America’s leader in the network management software industry and in the market since 2002 with a unique and innovative methodology, provides intelligent data monitoring solutions that give complete visibility into the customer’s infrastructure, allowing your company to keep track of the main aspects of its network.