UNDERSTANDING OBSERVABILITY VS. MONITORING. PART 1

The rise of cloud computing, the DevOps movement, and distributed microservice-based architecture have come together to make observability vital for modern systems. We’re going to dive into what observability is and how to approach the metrics we need to track.

Observability is a way of spotting and troubleshooting the root causes of problems involving software systems whose internals we might not understand. It extends the concept of monitoring, applying it to complex systems with unpredictable and/or complex failure scenarios.

I’ll start with some of the basic principles of observability that I’ve been helping to implement across a growing number of products and teams at Nord Security.

 

Monitoring vs. Observability

“Monitoring” and “observability” are often used interchangeably, but these concepts have a few fundamental differences.

Monitoring is the process of using telemetry data to understand the health and performance of your application. Monitoring telemetry data is preconfigured, implying that the user has detailed information on their system’s possible failure scenarios and wants to detect them as soon as they happen.

In the classical approach to monitoring, we define a set of metrics, collect them from our software system, and react to any changes in the values of these metrics that are of interest to us.

For example:

Excessive CPU usage can indicate that we need to scale the system up to compensate for increasing load;

A drop in successfully served requests after a fresh release can indicate that the newly released version of the API is malfunctioning;

Health checks process binary metrics that represent whether the system is alive at all or not.
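The classical approach above can be sketched as a small set of predefined rules evaluated against each metrics sample. This is a minimal illustration, not a production alerting setup: the metric names (`cpu_percent`, `success_ratio`, `healthy`) and the thresholds are assumptions chosen only to mirror the three examples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str                      # alert name reported when the rule fires
    metric: str                    # key to look up in the metrics sample
    fires: Callable[[float], bool]  # predicate on the metric value


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("high_cpu", "cpu_percent", lambda v: v > 85.0),
    Rule("success_rate_drop", "success_ratio", lambda v: v < 0.99),
    Rule("health_check_failed", "healthy", lambda v: v < 1.0),
]


def evaluate(sample: Dict[str, float], rules: List[Rule] = RULES) -> List[str]:
    """Return the names of the rules that fire for one metrics sample."""
    return [
        rule.name
        for rule in rules
        if rule.metric in sample and rule.fires(sample[rule.metric])
    ]


alerts = evaluate({"cpu_percent": 92.0, "success_ratio": 0.995, "healthy": 1.0})
```

Every rule here encodes a failure scenario we already anticipated, which is exactly the limitation of monitoring on its own.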

Observability extends this approach. Observability is the ability to understand the state of the system by performing continuous real-time analysis of the data it outputs.

Instead of just collecting and watching predefined metrics, we continuously collect different output signals. The most common types of signals – the three pillars of observability – are:

  • Metrics: Numeric data aggregates representing software system performance;

  • Logs: Time-stamped messages gathered by the software system and its components while working;

  • Traces: Maps of the paths taken by requests as they move through the software system.

The development of complex distributed microservice architectures has led to complex failure scenarios that can be hard or even impossible to predict. Simple monitoring is not enough to catch them. Observability helps by improving our understanding of the internal state of the system.

Metrics

Choosing the right metrics to collect is key to establishing an observability layer for our software system. Here are a few different popular approaches that define a unified framework of must-have metrics in any software system.

USE

Originally described by Brendan Gregg, this approach focuses more on white-box monitoring – monitoring of the infrastructure itself. Here’s the framework:

  • Utilization – resource utilization.

    • % of CPU / RAM / Network I/O being utilized.

  • Saturation – how much remaining work hasn’t been processed yet.

    • CPU run queue length;

    • Storage wait queue length;

  • Errors – errors per second

    • CPU cache miss;

    • Storage system fail events;

Note: Defining “saturation” in this approach can be a tricky task and may not be possible in specific cases.
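As a rough sketch of how the three USE values could be derived for one resource, assume we already have a raw sample covering a fixed interval. The `ResourceSample` fields are hypothetical names introduced for this example; how the raw numbers are collected depends on the platform and is not shown.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResourceSample:
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work items waiting at the end of the interval
    error_count: int         # error events seen during the interval


def use_metrics(sample: ResourceSample) -> Dict[str, float]:
    """Derive Utilization, Saturation and Errors from one raw sample."""
    return {
        "utilization_percent": 100.0 * sample.busy_seconds / sample.interval_seconds,
        "saturation_queue_length": sample.queue_length,
        "errors_per_second": sample.error_count / sample.interval_seconds,
    }


cpu = use_metrics(ResourceSample(busy_seconds=45.0, interval_seconds=60.0,
                                 queue_length=3, error_count=6))
```

Saturation is represented by a queue length here because that matches the examples above; for resources without a natural queue, another proxy would be needed.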

Four Golden Signals

Originally described in Google’s Site Reliability Engineering (SRE) book, the Four Golden Signals framework is defined as follows:

  • Latency – time to process requests;

  • Traffic – requests per second;

  • Errors – errors per second;

  • Saturation – how full the service is, usually measured as utilization of its most constrained resources.

RED

Originally described by Tom Wilkie, this approach focuses on black-box monitoring – monitoring the microservices themselves. This simplified subset of the Four Golden Signals uses the following framework:

  • Rate – requests per second;

  • Errors – errors per second;

  • Duration – time to process requests.
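To make the RED definitions concrete, here is a minimal sketch that computes them from a list of request records collected over a time window. Two choices are assumptions of this example rather than part of RED itself: treating HTTP status codes of 500 and above as errors, and summarizing duration with a nearest-rank 95th percentile.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_metrics(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors and Duration for one time window."""
    if not requests:
        return {"rate_per_second": 0.0, "errors_per_second": 0.0, "duration_p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank percentile: smallest value covering 95% of the requests.
    rank = max(1, math.ceil(len(durations) * 95 / 100))
    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors_per_second": errors / window_seconds,
        "duration_p95_ms": durations[rank - 1],
    }


sample = [Request(duration_ms=10.0 * i, status=500 if i > 8 else 200) for i in range(1, 11)]
red = red_metrics(sample, window_seconds=10.0)
```

In practice these values are usually computed by the metrics backend from counters and histograms rather than from raw request lists.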

Choosing and following one of these approaches lets you unify monitoring across the whole system and makes it easier to understand what is happening. The approaches complement one another, and your choice may depend on which part of the system you want to monitor. They also don’t exclude additional business-related metrics that vary from one component of the software system to another.

Logs

System logs are a useful source of additional context when investigating what is going on inside a system. They are immutable, time-stamped text records that provide context to your metrics.

Logs should be kept in a unified structured format like JSON. Use additional log storage and visualization tools to simplify working with the massive amount of text data the software system produces. One well-known and popular solution for log storage is Elasticsearch.
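One way to emit structured logs is a custom formatter on top of Python’s standard `logging` module. The sketch below is only an illustration: the field names and the convention of passing extra fields under a `context` key are assumptions of this example, not an established standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("payments")
log.info("charge failed", extra={"context": {"request_id": "abc123"}})
```

Because every entry is a JSON object with the same keys, a log store can index fields such as `request_id` instead of parsing free text.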

Traces

Traces help us better understand the request flow in our system by representing the full path any given request takes through a distributed software system. This is very helpful in identifying failing nodes and bottlenecks.

Traces themselves are hierarchical structures of spans, where each span is a structure representing the request and its context in every node in its path. Most common tracing visualization tools like Jaeger or Grafana display traces as waterfall diagrams showing the parent and child spans caused by the request.
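The span hierarchy described above can be modeled as a simple tree. The following sketch uses a hypothetical `Span` type and made-up service names; real tracing systems attach far more context to each span (trace IDs, tags, status), which is left out here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    start_ms: int      # offset from the start of the trace
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def waterfall(span: Span, depth: int = 0) -> List[str]:
    """Render a span tree as indented lines, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} [{span.start_ms}ms +{span.duration_ms}ms]"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(waterfall(child, depth + 1))
    return lines


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration, a candidate bottleneck."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.duration_ms)


trace = Span("GET /checkout", 0, 120, [
    Span("cart-service", 30, 80, [Span("db query", 35, 70)]),
    Span("auth-service", 5, 20),
])
print("\n".join(waterfall(trace)))
```

The indented output is a text version of the waterfall diagram: each level of indentation is a child span, and `slowest_leaf(trace)` points at the node where most of the time was spent.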

Conclusion

Building an observable software system lets you identify failure scenarios and possible risks during the whole system life cycle. A combination of metrics, extensive log collection, and traces helps us understand what’s happening inside our system at any moment and speeds up investigations of abnormal behavior.

This article was just the first step. We’ve covered the standard approaches to metrics and briefly discussed traces and logs. But to implement an observable software system, we need to set up its components correctly to supply us with the signals we need. In part 2, we’ll discuss instrumentation approaches and modern standards in this field.

About Version 2 Digital

Version 2 Digital is one of the most dynamic IT companies in Asia. The company distributes a wide range of IT products across various areas including cyber security, cloud, data protection, end points, infrastructures, system monitoring, storage, networking, business productivity and communication products.

Through an extensive network of channels, point of sales, resellers, and partnership companies, Version 2 offers quality products and services which are highly acclaimed in the market. Its customers cover a wide spectrum which include Global 1000 enterprises, regional listed companies, different vertical industries, public utilities, Government, a vast number of successful SMEs, and consumers in various Asian cities.

About Nord Security
The web has become a chaotic space where safety and trust have been compromised by cybercrime and data protection issues. Therefore, our team has a global mission to shape a more trusted and peaceful online future for people everywhere.

Did Iranian Hackers Cause The Fire At An Israeli Power Plant?

Almost immediately after a fire broke out in an active power plant in southern Israel on July 14, 2022, an Iranian hacking group claimed responsibility. While it’s understandable why the group, which goes by the name #Altahrea, would want to boost their hacker profile by saying they caused the fire, there is ample evidence that they actually had nothing to do with it. 

The Orot Yosef power plant, part of the Edeltech group, is located in Ramat Hovav, Israel and has been in operation since 1989. 

Orot Yosef Power Plant

To understand why we believe this fire was not the work of hackers, let’s take a look at how this plant operates and what might have happened to cause the fire. (Yossi Reuven, research lead of SCADAfence’s security team, also spoke about the attack to Techmonitor.ai.)

Gas turbines can be used in conjunction with steam boilers by passing hot gasses from the boiler through a gas turbine to produce mechanical drive for electricity generation. This combined arrangement is commonly referred to as “cogeneration.” Cogeneration is thermodynamically the most efficient method for generating electrical power, and it is the method used by the Orot Yosef facility. 

Why is this important? Understanding the process used by a facility is crucial to determining what event took place. Gas turbines require a correctly ratioed air-to-fuel mixture to operate. Running a turbine too rich or too lean (too little air or too much air, respectively) can cause significant damage to the turbine. This means that if someone with malicious intent were able to compromise the air handling and run the turbine at maximum output with a lean mixture, there is a good chance of detonation, overheating, loss of power, and damage to the turbine. These issues would all relate to the turbine housing and would make for a far more catastrophic event.

We know that GE turbines were purchased and installed in the plant in 1989, as you can see in the image below from the Global Energy Observatory (GEO), a publicly available database of global energy information.

GEO entry for Orot Power Plant

The Power Plant Fire 

Shortly after the fire began, the Iranian hacker group #Altahrea posted a photo on Telegram of a fire that appears to have started in the building known as the “Air Filter House”.

Most of the technology that resides inside the filter house is there to detect if the system is clogged. When a clog happens, it triggers the shutdown of the turbine to protect it from too much debris passing through the filter system, which can shorten the lifespan of the turbine.

Fire is a major risk for filter houses that have poor maintenance cycles. If filters are not replaced routinely, particulates and debris build up and all it takes for the filter cartridge pairs to go up in flames is a single spark. 

Based on open-source intel, it is likely that this facility is running an Electrostatic Precipitator.

Power plant information from open source database

An Electrostatic Precipitator (ESP) is typically used for pollution control to remove dirt from flue gasses in exhaust systems. Because this facility can use diesel as a secondary fuel for power generation, it is possible that an ESP is present.

Another detail that provides relevant information is a redacted picture of Shodan.io’s Industrial Webcrawler revealing a Phoenix Contact EMpro PLC running a Webserver exposed to the internet as shown below.

Shodan.io shows information on the Phoenix Contact EMpro

The EMpro is used to measure voltages and current in a power supply system. These measurements are used primarily to manage critical load balancing across a system and not for any critical process control of the filter house. If the device were compromised, it would only allow an individual to carry out relatively small actions, and only if the device had its Digital Output wired up.

This all raises the question: is it possible that a remote monitoring device was compromised in a way that allowed an adversary to trigger a discharge inside the filter house, which then ultimately triggered a fire? Possibly. However, it would require ideal conditions, including a lapse in maintenance and a buildup of debris. I would expect about the same probability if someone discarded a lit cigarette and the filter house drew it into the filter cartridge stage. In this case, that is a more likely cause of the fire than the Iranian hackers who claimed credit.

To learn more about how the SCADAfence Platform can protect your OT network, request a demo today.

About SCADAfence
SCADAfence helps companies with large-scale operational technology (OT) networks embrace the benefits of industrial IoT by reducing cyber risks and mitigating operational threats. Our non-intrusive platform provides full coverage of large-scale networks, offering best-in-class detection accuracy, asset discovery and user experience. The platform seamlessly integrates OT security within existing security operations, bridging the IT/OT convergence gap. SCADAfence secures OT networks in manufacturing, building management and critical infrastructure industries. We deliver security and visibility for some of the world’s most complex OT networks, including Europe’s largest manufacturing facility. With SCADAfence, companies can operate securely, reliably and efficiently as they go through the digital transformation journey.

GREYCORTEX Mendel 3.9

We have released a new version of GREYCORTEX Mendel

GREYCORTEX Mendel 3.9 is more interactive, safer and allows even deeper data analysis than ever before. We have increased the interoperability of Mendel with other tools and extended the hardware support.

GREYCORTEX Mendel 3.9 Features List

Interactive Visualization of Detected Threats

Detect an attack on your infrastructure easily and in time

You’ll see detected events even more clearly thanks to the new interactive dashboard, based on GREYCORTEX’s and MITRE ATT&CK®’s knowledge. You’ll easily see if someone is attacking your infrastructure according to known tactics and techniques, no matter whether Mendel is helping secure your IT or OT environment.

New API features

Connect Mendel to other systems via APIs

New two-way connectivity with other security tools (SIEM, BI and others) enables external visualization or deeper data analysis. Mendel’s API currently covers:

  • direct database access to stored network data
  • capturing traffic and downloading data in pcap data format
  • management of false positives
  • third-party security information sources (blacklists based on IP addresses and malicious files)
  • integration with the MISP security platform

User Activity Log

Control who is looking into your Mendel

Mendel is even more secure. It records user activity in the system itself, helping to meet even the strictest of security policies and corporate standards.

Extended Support of Hardware Devices

No more surprises from unavailable devices

We optimized Mendel to run on up-to-date hardware devices with new generations of CPUs, such as DELL and HP servers, and have wide support for new network card models from Napatech, Intel and Broadcom.

Improved Visibility and Data Analysis

Understand completely what happened in your network

You can now view the data for all the use cases you have defined and get broader insights than the system views already set up by the standard user interface. In combination with the new attributes and metrics, you can specify your database queries over stored network data even more precisely. You can also export or import saved views between machines. For further investigation, use Mendel’s ability to bring the parameters of the displayed data into the main filter.

Working on: Microsensors for IT and OT Networks

Find out basic information about the devices in your network

A microsensor, either as a small device or in a virtualized form, scans your network and in a follow-up report you can see: what devices are in the network; what vulnerabilities they have; which manufacturers they are from; or what protocols they use.

The tool is already ready to use in an alpha version. If you are interested in the solution, please contact us for more information.

About GREYCORTEX
GREYCORTEX uses advanced artificial intelligence, machine learning, and data mining methods to help organizations make their IT operations secure and reliable.

MENDEL, GREYCORTEX’s network traffic analysis solution, helps corporations, governments, and the critical infrastructure sector protect their futures by detecting cyber threats to sensitive data, networks, trade secrets, and reputations, which other network security products miss.

MENDEL is based on 10 years of extensive academic research and is designed using the same technology which was successful in four US-based NIST Challenges.

GREYCORTEX Mendel 3.9 Now Available

June 20, 2022 – We have released a new version of GREYCORTEX Mendel. Version 3.9 is more interactive, safer and allows even deeper data analysis than ever before. We have increased the interoperability of Mendel with other tools and extended the hardware support.
