(Security-) Data Analytics, close to the data

At my previous company I was an advocate for “In-Stream Analytics”. Let me explain.

Camelia.boban, CC BY-SA 3.0 , via Wikimedia Commons

What is the biggest challenge when it comes to security data analytics across multiple cloud environments? To me the answer is simple: egress cost.

Ask any data scientist and it quickly becomes clear that everything starts with data. You need good, well-organized, and above all plenty of data to drive analytics and gain the right insights.

If you run your systems and applications across multiple clouds, you will notice that most data analytics platforms require the data to be in one place before it can be analyzed. This is true for traditional security platforms (e.g. SIEMs) as well as for data analytics platforms (e.g. BigQuery).

Egress cost ranges from $0.06 up to $0.10 per GB depending on the cloud provider, so moving hundreds of GB per day or per month comes at a price. And that price does not even account for the systems needed to analyze and store this data in the “central” cloud.
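To make that concrete, here is a back-of-the-envelope calculation. The volume and rate below are illustrative assumptions within the $0.06–$0.10 range mentioned above, not any vendor's actual pricing:

```python
def monthly_egress_cost(gb_per_day: float, rate_per_gb: float, days: int = 30) -> float:
    """Estimated cost of shipping log data to a central cloud for one month."""
    return gb_per_day * days * rate_per_gb

# Assumed example: 500 GB/day at $0.08/GB
cost = monthly_egress_cost(500, 0.08)
print(f"${cost:,.2f} per month")  # → $1,200.00 per month
```

And that recurs every month, before you have paid a cent for storage or compute in the central cloud.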

So what would happen if we:

  • Train a (machine learning) model on central data.
  • Normalize the data in the different clouds.
  • Perform data analytics using the cloud native tools in the respective cloud.
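The normalization step (the second bullet) might look like this in practice. A minimal sketch in Python: the input field names are modeled loosely on AWS CloudTrail and Google Cloud audit log events, and the common schema is entirely made up for illustration:

```python
# Hypothetical shared schema that every cloud's events get mapped onto.
COMMON_FIELDS = ("timestamp", "principal", "action", "resource")

def normalize_aws(event: dict) -> dict:
    """Map a CloudTrail-style event onto the common schema (field names assumed)."""
    return {
        "timestamp": event["eventTime"],
        "principal": event["userIdentity"]["arn"],
        "action": event["eventName"],
        "resource": event.get("resources", [{}])[0].get("ARN", ""),
    }

def normalize_gcp(event: dict) -> dict:
    """Map a Cloud Audit Log-style event onto the common schema (field names assumed)."""
    return {
        "timestamp": event["timestamp"],
        "principal": event["protoPayload"]["authenticationInfo"]["principalEmail"],
        "action": event["protoPayload"]["methodName"],
        "resource": event["protoPayload"]["resourceName"],
    }
```

Each normalizer runs inside its own cloud, so the raw logs never leave; only the model (and perhaps aggregated findings) crosses cloud boundaries.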

As a result:

  • Egress costs are gone, as no data needs to be moved.
  • Cloud free tiers and discounts apply to each cloud.
  • Cloud native data analytics and data pipelining can be used.

As an example, most data analytics tools can work with Pub/Sub. A Pub/Sub equivalent can be found in every cloud, each under a different name (of course :))


In theory, all of these services are pretty close to Kafka, or at least support similar commands. While there are several tools for normalizing Kafka data, Google Pub/Sub additionally supports schemas, so you can easily format the data according to the data model you use.
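Pub/Sub schemas are defined in Avro (or Protocol Buffers), and the broker rejects messages that do not match. Below is a sketch of an Avro schema for the hypothetical common event model from earlier, plus a minimal local validity check; the schema fields are illustrative assumptions, and in a real setup Pub/Sub itself enforces the schema on publish:

```python
# Avro record definition, in the shape Pub/Sub schema support accepts.
# The field names mirror the assumed common model; adjust to your own.
EVENT_SCHEMA = {
    "type": "record",
    "name": "SecurityEvent",
    "fields": [
        {"name": "timestamp", "type": "string"},
        {"name": "principal", "type": "string"},
        {"name": "action",    "type": "string"},
        {"name": "resource",  "type": "string"},
    ],
}

def matches_schema(message: dict, schema: dict = EVENT_SCHEMA) -> bool:
    """Minimal local pre-check: every schema field is present as a string."""
    return all(isinstance(message.get(f["name"]), str) for f in schema["fields"])
```

Validating locally before publishing keeps malformed events out of the topic and gives you the same data model in every cloud.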

Data schema/normalization

Once you have created a topic and made sure that the data is ingested into Pub/Sub, you can use the native analytics tools with your data model.

On Google Cloud you might want to use something like Vertex AI to analyze the data. To do that, you would trigger a Vertex AI run via Pub/Sub as described here:

The same is pretty much doable on all cloud platforms.
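The glue between the topic and the analytics run is typically a small Pub/Sub-triggered function. A sketch of such an entry point, assuming the Cloud Functions event shape where the message payload arrives base64-encoded; the Vertex AI submission itself is only indicated as a comment, since it needs project credentials and a pipeline spec:

```python
import base64
import json

def handle_pubsub_event(event: dict) -> dict:
    """Sketch of a Pub/Sub-triggered function entry point.

    Decodes the message payload; in a real deployment this is where you
    would kick off the analytics step, e.g. submit a Vertex AI pipeline
    run (call omitted here, shown only as a comment).
    """
    payload = json.loads(base64.b64decode(event["data"]))
    # e.g. aiplatform.PipelineJob(...).submit()  # assumed, not shown
    return payload
```

The equivalents on other clouds (Lambda + SNS/SQS, Azure Functions + Event Hubs) follow the same decode-then-dispatch pattern.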

The easiest way to start is most likely Jupyter notebooks, where you can write your data analytics in Python and simply share the script across clouds. A data analytics overview for the biggest cloud vendors can be found here:


That’s it for today, please leave a 👏 and be excellent to each other.