Monitoring a company's entire IT environment is essential, especially in complex environments that combine several network technologies, equipment from different vendors, and diverse systems. A single failure can halt production across the business and cause serious material, immaterial, and financial losses.
A failed database backup that nobody is notified about, storage close to its maximum capacity, or an unavailable server service can each be critical, because network and IT infrastructure is an essential asset for most companies today. The value of a monitoring environment is that many failures like these can be avoided or, when they do happen, can be resolved quickly and accurately.
This article presents a case study showing how Telcomanager solutions, specifically SLAview, help address several problems for a client with a robust and complex technology park.
Diversified and complete monitoring
The client in question has an IT environment with equipment from several manufacturers, including servers, databases, storage, sensors, UPS units (nobreaks), printers, routers, WAN links of different technologies, switches, and AWS machines, among others. All of these are currently monitored by SLAview.
SLAview is a robust Telcomanager system that comes embedded in an appliance manufactured by Telcomanager and sized according to demand. It performs performance and fault monitoring across the client's entire IT environment. Below, we highlight some of the client's most critical monitoring cases and how SLAview supports its technology park.
On servers, SLAview monitors CPU consumption, memory, interfaces, processes, services, and storage, among other indicators. Consider service monitoring in particular: for this client, it is critical to monitor the availability of services such as Apache, because a service outage can affect a production area.
For this reason, auto-recovery alarms have been configured in SLAview so that, when a service goes down, the Telcomanager solution itself restarts the service, which usually solves the customer's problem.
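As an illustration only, the auto-recovery idea can be sketched in Python as follows. This is not SLAview's implementation or API. The names auto_recover, is_running, and restart are hypothetical, and the check and restart actions are passed in as callables so that the sketch runs without touching a real service.

```python
from typing import Callable


def auto_recover(
    service: str,
    is_running: Callable[[str], bool],
    restart: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Restart a service that is down and report the outcome.

    Returns "ok" if the service was already up, "recovered" if a restart
    brought it back, and "failed" if it is still down after max_attempts.
    """
    if is_running(service):
        return "ok"
    for _ in range(max_attempts):
        restart(service)
        if is_running(service):
            return "recovered"
    return "failed"


# Simulated service state standing in for a real host (assumption).
state = {"apache": False}


def fake_is_running(name: str) -> bool:
    return state[name]


def fake_restart(name: str) -> None:
    state[name] = True


result = auto_recover("apache", fake_is_running, fake_restart)
print(result)
```

In a real deployment the two callables would wrap whatever mechanism the host uses to check and restart a service. The "failed" outcome is the case that should be escalated to a person, since the restart did not solve the problem.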
In databases, several metrics (KPIs) are monitored, such as backup status, instances, tablespace, database status, and transaction log, among others. Consider transaction log monitoring in more detail. The transaction log records the transactions executed in the database so that, if a problem occurs, changes can be rolled back. The log has limited space. If that space fills up, the database stops, which can cascade through the client's business and bring production to a halt.
When the space allocated to this log approaches its limit, alarms are sent to the responsible database teams so that they can act proactively and prevent the database from stopping.
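A minimal sketch of this kind of check, assuming the used and allocated log sizes are already available in megabytes. The function names and the 85% threshold are illustrative assumptions, not values taken from SLAview or from the client's configuration.

```python
def log_usage_percent(used_mb: float, allocated_mb: float) -> float:
    """Return how much of the allocated log space is in use, as a percentage."""
    if allocated_mb <= 0:
        raise ValueError("allocated_mb must be positive")
    return 100.0 * used_mb / allocated_mb


def should_alert(used_mb: float, allocated_mb: float,
                 threshold_percent: float = 85.0) -> bool:
    """True when log usage has reached the alert threshold (assumed 85%)."""
    return log_usage_percent(used_mb, allocated_mb) >= threshold_percent


# Example readings (made-up numbers for illustration).
print(should_alert(used_mb=9000, allocated_mb=10000))
print(should_alert(used_mb=4000, allocated_mb=10000))
```

The threshold is set below 100% on purpose: the point of the alarm is to give the database team time to act before the log is actually full.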
Monitoring UPS units (nobreaks) is fundamental to the customer because, during a power outage, they keep the equipment running temporarily until the generator takes over. It is important to monitor the batteries because faulty or uncharged batteries give the impression that everything is fine until a power outage occurs, at which point they cannot keep the equipment running long enough for the generator to start.
For this reason, SLAview triggers alarms to notify the responsible teams of problems with the state of the batteries.
Sensor monitoring is centralized in custom dashboards that show the collected readings. Several statistics are collected from the sensors; the most important are temperature and humidity.
Temperature sensors: used to measure the temperature in the data center, which is extremely important when the site's cooling system fails.
Humidity sensors: used to measure the moisture concentration in the data center. If the concentration is higher than acceptable, water vapor can condense and damage the equipment on site.
SLAview notifies local IT and cooling staff when a reading rises to a level that is not yet critical but is cause for concern. This allows the teams to anticipate the problem and act proactively.
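The distinction between a reading that is "of concern" and one that is critical can be sketched as a two-level threshold. The class and function names and all numeric limits below are illustrative assumptions; the article does not state the client's actual limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float   # level that is of concern but not yet critical
    critical: float  # level that requires immediate action


def classify(value: float, limits: Thresholds) -> str:
    """Map a sensor reading to "normal", "warning", or "critical"."""
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return "normal"


# Illustrative limits only (assumptions, not the client's settings).
TEMPERATURE_C = Thresholds(warning=27.0, critical=32.0)
HUMIDITY_PCT = Thresholds(warning=60.0, critical=70.0)

print(classify(28.5, TEMPERATURE_C))
print(classify(45.0, HUMIDITY_PCT))
```

Notifying at the "warning" level is what gives the local IT and cooling staff room to act before the reading becomes critical.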
Distributed management of alarm notifications
Specific alarm notifications are assigned to specific teams. This narrows down who receives each alert, so that alerts go to the technicians who know how to deal with the problem. Some example situations:
- Temperature sensor alarm detected by SLAview?
The technician and the refrigeration team are notified; for the refrigeration team, the notification goes straight to the team's e-mail.
- Switch alarms?
The local technician and the company's network team are notified.
- Database alarms?
The DBA team is notified.
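The routing described in these examples amounts to a lookup from alarm type to recipients. The sketch below is a hypothetical illustration in Python, not SLAview configuration: the alarm type keys, recipient labels, and the fallback recipient are all assumed names.

```python
from typing import Dict, List

# Alarm type -> recipients, mirroring the three examples above.
# All keys and recipient labels are hypothetical.
ROUTES: Dict[str, List[str]] = {
    "temperature_sensor": ["technician", "refrigeration_team_email"],
    "switch": ["local_technician", "network_team"],
    "database": ["dba_team"],
}

# Assumed catch-all for alarm types without an explicit route.
FALLBACK: List[str] = ["it_operations"]


def recipients_for(alarm_type: str) -> List[str]:
    """Return the recipients for an alarm type, or the fallback list."""
    return list(ROUTES.get(alarm_type, FALLBACK))


for alarm in ("temperature_sensor", "switch", "database"):
    print(alarm, "->", ", ".join(recipients_for(alarm)))
```

Keeping the routes in a single table makes it easy to see, and to change, who is notified for each kind of alarm.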
In addition, the client receives alarm notifications via Telegram, where different groups receive notifications for specific alarms, and via Microsoft Teams. On top of this, SLAview notifications were integrated into the internal ticket system that the client uses on a daily basis.
This case shows that SLAview is a robust system for effectively monitoring large, medium, and small technology parks. In addition to the points covered above, the solution can monitor many other aspects of the network and IT infrastructure.
Telcomanager, Latin America's leader in the network management software industry, has been in the market since 2002 with a unique and innovative methodology. It provides intelligent monitoring solutions that give complete visibility into the customer's infrastructure, allowing companies to keep track of the main aspects of their networks.
Posted on 08/26/2022