ASAB

Streaming Data from Various Sources

Data processing and analysis is one of today's main research and development topics. It helps companies discover relevant information about their customers or technologies through reports, visualizations, dashboards, and other business intelligence outputs. In the previous article, I recalled our team’s workshop, where we laid the foundation for our data mining and analysis endeavors. The open-source product ASAB, which you can view and contribute to on GitHub, forms a basis for request processing, event management, and metrics computation. However, it is not designed to process data from various sources and send it to business intelligence applications, data warehouses, or databases. These tasks are handled by another data handling layer, which is the subject of this article. This layer is called BSPump, short for Black Swan Pump.

Origins of Black Swan

As with the other foundations of our applications, this one also started with a team workshop. After reviewing ASAB's functions and possibilities, we discussed asynchronous data processing and its implementation in Python’s asyncio library. We then realized that we could draw on experience from our previous products and implement data processing as multiple independent instances of so-called pipelines. Generally, a pipeline is a linear sequence of connected data processors: the first one receives raw data from a specified source, and the last one pushes the transformed, processed, and enriched data in bulk into a specified database, file, or application.
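To make the idea of a pipeline more concrete, here is a minimal sketch in plain Python with asyncio. It does not use the BSPump API; the names `Pipeline`, `list_source`, `parse`, and `enrich` are hypothetical and exist only for illustration. It shows a source feeding events through a chain of processors into a sink.

```python
import asyncio


class Pipeline:
    """Hypothetical linear pipeline: source -> processors -> sink."""

    def __init__(self, source, processors, sink):
        self.source = source          # async generator function yielding raw events
        self.processors = processors  # ordered list of callables: event -> event
        self.sink = sink              # callable that stores the final event

    async def run(self):
        async for event in self.source():
            for processor in self.processors:
                event = processor(event)
                if event is None:
                    break  # a processor may drop an event
            else:
                # Reached only when no processor dropped the event.
                self.sink(event)


async def list_source():
    # Stand-in for a real source such as a log file or a message queue.
    for line in ["alice,3", "bob,5"]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield line


def parse(event):
    # Turn a raw CSV-like line into a structured event.
    name, count = event.split(",")
    return {"name": name, "count": int(count)}


def enrich(event):
    # Add information that was not present in the raw data.
    event["source"] = "example"
    return event


collected = []  # stand-in for a database, file, or application
pipeline = Pipeline(list_source, [parse, enrich], collected.append)
asyncio.run(pipeline.run())
```

Because each pipeline is an independent coroutine, several of them could be scheduled on the same event loop, which is roughly what running "multiple independent instances" means here.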

The BSPump pipeline

Picture: Schema of a BSPump pipeline.

My own task was to focus on InfluxDB outputs and on the kind of processor that hands bulks of data over to a database, which we call a sink. In the beginning, we had no clear definition of what the pipelines and processors should look like in code or how they could be easily connected and configured. However, after a series of talks, a solution finally emerged, which you can view on GitHub. Like ASAB, BSPump is open-source and you are free to contribute to it. Our basic idea rests on two things: a publish-subscribe mechanism, which can start, finish, or temporarily pause certain processors in pipelines, and a simple data flow expressed in the design of abstract classes and their methods. I hope this description did not overwhelm you and that you now have an idea of what happens to data in BSPump.

Pipelines

The concept of pipelines combined with the publish-subscribe mechanism is flexible and powerful. Not only can pipelines run alongside one another and process data in real time, they can also subscribe to events (such as system interrupts) so that their sinks finish sending pending data to output data stores or applications before shutdown. This way, data that has already entered a pipeline is not lost when the application stops. While I was working on the concept of database sinks and my colleague Mila was focusing on source processors (reading data from logs and other inputs), Honza worked on bringing over an Elasticsearch connector from our previous project, to be used in sink processors as well. We work with Elasticsearch a lot and use Kibana visualizations built from its indexed data, so implementing an Elasticsearch connector was one of our first decisions when it came to BSPump. The team was busy implementing all the features, and once the workshop had finished, we had to decide what to do next. Ales made a few refinements to the design and architecture afterwards, but the workshop itself was successful and created the basis for BSPump, which we have been extending ever since.
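The way a sink can use publish-subscribe to avoid losing data at shutdown can be sketched as follows. This is an illustrative example only, not the ASAB or BSPump implementation: the `PubSub` and `BufferedSink` classes, the `"shutdown"` message name, and the `bulk_size` parameter are all assumptions made for this sketch.

```python
class PubSub:
    """Hypothetical minimal publish-subscribe hub."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, message_type, callback):
        self.subscribers.setdefault(message_type, []).append(callback)

    def publish(self, message_type):
        for callback in self.subscribers.get(message_type, []):
            callback()


class BufferedSink:
    """Hypothetical sink that writes events to storage in bulks."""

    def __init__(self, pubsub, storage, bulk_size=3):
        self.buffer = []
        self.storage = storage      # stand-in for a database client
        self.bulk_size = bulk_size
        # Flush whatever is still buffered when the application shuts down.
        pubsub.subscribe("shutdown", self.flush)

    def process(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.bulk_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.storage.append(list(self.buffer))  # one bulk write
            self.buffer.clear()


pubsub = PubSub()
storage = []
sink = BufferedSink(pubsub, storage)

for n in range(4):
    sink.process(n)

# Without this message, the last event would stay in the buffer and be lost.
pubsub.publish("shutdown")
```

After the loop, one full bulk has been written and one event is still buffered; publishing the shutdown message writes the remaining event as a second, smaller bulk.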

A real-time stream processor

So, technically speaking, BSPump can process data coming from a source stream in real time, enrich it with additional information (such as a precise location), and then transform it into a specified output format or send it to data stores like Elasticsearch. One of the most exciting features is the computation of defined metrics (which form the basis for data mining analysis) and anomaly detection. Data transformation can also be used to anonymize personal information, such as emails and names, as part of GDPR solutions. If you are interested in the project or would like to contribute to it, please see our GitHub project or contact us at info@teskalabs.com or on Gitter. BSPump is open-source and ready to integrate ideas and solutions from a wide community!
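As an illustration of the anonymization transformation mentioned above, a processor could look roughly like the following sketch. It is written in plain Python rather than against the BSPump API, and the function names, the salt value, the email pattern, and the `fields` parameter are assumptions for this example. Note that salted hashing is strictly speaking pseudonymization; whether it is sufficient depends on the particular GDPR requirements.

```python
import hashlib
import re

# Simplified email pattern, good enough for a sketch.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, salt="example-salt"):
    # Replace a personal value with a stable, non-reversible token.
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "anon-" + digest[:12]


def anonymize(event, fields=("name",)):
    """Return a copy of the event with personal data replaced by tokens."""
    cleaned = dict(event)
    # Replace whole fields known to hold personal data.
    for field in fields:
        if field in cleaned:
            cleaned[field] = pseudonymize(str(cleaned[field]))
    # Replace email addresses embedded in any text value.
    for key, value in cleaned.items():
        if isinstance(value, str):
            cleaned[key] = EMAIL_PATTERN.sub(
                lambda match: pseudonymize(match.group(0)), value
            )
    return cleaned


event = {"name": "Jane Doe", "message": "contact jane@example.com today", "count": 2}
result = anonymize(event)
```

Because the same input always maps to the same token, metrics such as counts per user can still be computed on the anonymized stream.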

About the Author

Premysl Cerny

Software Developer at TeskaLabs



