ASAB

Streaming Data from Various Sources

One of the main research and development topics nowadays is data processing and analysis, which can help companies discover relevant information about their customers or technologies using reports, visualizations, dashboards, and other business intelligence outputs. In the previous article, I recalled our team’s workshop where we put a foundation to our data mining and analysis endeavors. The open-source product ASAB, which you can see and contribute to at GitHub, forms a basis for request processing, event management, and metrics computation. However, its focus is not to process data from various sources and send them to business intelligence applications, data warehouses, or databases. Rather, these tasks are solved with another layer of data handling, which this article is about. This layer is called BSPump, a short form of Black Swan Pump.

Origins of Black Swan

As with other foundations for our applications, this one also started with a common workshop. After a review of ASAB functions and its possibilities, we had a discussion about asynchronous data processing and its implementation in Python’s library, called asyncio. We then realized that we could use experience from our previous products and implement the data processing as multiple independent instances of the so-called pipelines. Generally, a pipeline is a linear set of connected data processors, where the first one of these processors receives raw data from a specified source, and the last one pushes the transformed, processed, and enriched bulk of data into a specified database, file, or application.

The BSPump pipeline

Picture: Schema of a BSPump pipeline.

My own task was to focus on Influx database outputs and the kind of processors that databases receive bulks of data from - the kind we call a sink. In the beginning, we had no clear definition of what the pipelines and processors should look like in the code or how they could be easily connected and configured. However, after a series of talks, a solution finally emerged, which you can view on GitHub. Like ASAB, BSPump is also open-source and you are free to contribute to it. Our basic idea dwells in using the publish-subscribe mechanism, which can start, finish, or temporarily pause certain processors in pipelines and in simple data flow illustrated in designs of abstract classes and their methods. I hope the previous description did not overwhelm you, but now you have an idea about what is going on with data in the BSPump.

Pipelines

The concept of pipelines with the publish-subscribe mechanism is a flexible and strong one. Not only can the pipelines run alongside one another and process data in real-time, they can also subscribe for events (such as system interrupts) to finish necessary data sending via sinks to output data stores or applications before they are shut down. In this way, we can be sure there are no data losses along the way. While I was working on the concept of database sinks and while my colleague Mila was focusing on source processors (reading data from logs and other inputs), Honza tried to implement an Elasticsearch connector from our previous project, which would also be used in sink processors. We work with Elasticsearch a lot and use Kibana visualizations that are formed from its indexed data, so implementing an Elasticsearch connector was one of our first decisions and considerations when it came to BSPump. Our team was quite busy with implementing all the features and we had to decide what to do next after the workshop had finished. Ales made a few refinements afterwards related to the design and architecture, but the workshop itself was successful and created a basis for BSPump, which we have been extending since then.

A real-time stream processor

So, technically speaking, BSPump can process data coming from a source stream in real-time, enrich them with information (like precise location), and then transform them into a specified output format or send them to data stores like Elasticsearch. One of the most exciting features is the computation of defined metrics (which form the basis for data mining analysis) and anomaly detection. The data transformation can be used for anonymizations of personal information such as emails, names as part of the GDPR solutions. If you are interested in the project or would like to contribute to it, please see our GitHub project or contact us at info@teskalabs.com or on Gitter. BSPump is open-source and ready to integrate thoughts and solutions from a wide community!

Continue to next article

About the Author

Premysl Cerny

Software Developer at TeskaLabs




You Might Be Interested in Reading These Articles

Inotify in ASAB Library

From blocking read challenge, ctypes and bitmasks to a solution that enables the ASAB framework to react to changes in the file system in real time.

Continue reading ...

asab development tech eliska

Published on August 15, 2023

5 Things to Look for in an Enterprise Mobile Development Platform Solution

Today many enteprises are looking to have their own mobile applications. With the right solution, you can build a mobile app that will fit your organization’s needs like a glove and be in the driver’s seat of the development.

Continue reading ...

mobile development

Published on September 01, 2015

Building a private cloud on AMD Ryzen and Linux Containers

At our company, we develop our own software products that we offer to our clients and often also run ourselves. So far our company has operated its IT infrastructure — about 30 virtual servers—on a public cloud, specifically on MS Azure.

Continue reading ...

development tech

Published on July 01, 2018