Streaming Data from Various Sources
Data processing and analysis is one of today's main research and development topics. It helps companies discover relevant information about their customers or technologies through reports, visualizations, dashboards, and other business intelligence outputs. In the previous article, I recalled our team’s workshop, where we laid the foundation for our data mining and analysis endeavors. The open-source product ASAB, which you can see and contribute to on GitHub, forms a basis for request processing, event management, and metrics computation. However, its focus is not on processing data from various sources and sending them to business intelligence applications, data warehouses, or databases. These tasks are handled by another layer of data handling, which is the subject of this article. This layer is called BSPump, short for Black Swan Pump.
Origins of Black Swan
As with other foundations for our applications, this one also started with a joint workshop. After reviewing ASAB's functions and possibilities, we discussed asynchronous data processing and its implementation in Python’s asyncio library. We then realized that we could use experience from our previous products and implement the data processing as multiple independent instances of so-called pipelines. Generally, a pipeline is a linear sequence of connected data processors: the first processor receives raw data from a specified source, and the last one pushes the transformed, processed, and enriched bulk of data into a specified database, file, or application.
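The pipeline idea described above can be sketched in a few lines of plain asyncio. This is only an illustration of the concept, not BSPump's actual API: the `Pipeline` class, the `list_source` generator, and the `parse` and `double` processors are hypothetical names invented for this example.

```python
import asyncio


class Pipeline:
    """A linear chain: one source, any number of processors, one sink."""

    def __init__(self, source, processors, sink):
        self.source = source          # callable returning an async iterator
        self.processors = processors  # list of plain functions: event -> event
        self.sink = sink              # callable that stores the final event

    async def run(self):
        async for event in self.source():
            for process in self.processors:
                event = process(event)
                if event is None:
                    # A processor may drop an event by returning None.
                    break
            else:
                # Reached only when no processor dropped the event.
                self.sink(event)


async def list_source():
    """Hypothetical source: yields raw lines as if read from a log."""
    for line in ["alice 3", "bob 5"]:
        yield line
        await asyncio.sleep(0)  # hand control back to the event loop


def parse(line):
    """Turn a raw line into a structured event."""
    name, count = line.split()
    return {"user": name, "count": int(count)}


def double(event):
    """Example transformation that returns a modified copy of the event."""
    return {**event, "count": event["count"] * 2}


collected = []  # stands in for a database, file, or application
pipeline = Pipeline(list_source, [parse, double], collected.append)
asyncio.run(pipeline.run())
```

Because each pipeline is just a coroutine, several of them could be scheduled side by side on the same event loop, which is the property the "multiple independent instances" idea relies on.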
Picture: Schema of a BSPump pipeline.
My own task was to focus on Influx database outputs and on the kind of processor that delivers bulks of data to a database, which we call a sink. In the beginning, we had no clear definition of what the pipelines and processors should look like in the code or how they could be easily connected and configured. However, after a series of talks, a solution finally emerged, which you can view on GitHub. Like ASAB, BSPump is open-source and you are free to contribute to it. Our basic idea rests on two things: the publish-subscribe mechanism, which can start, finish, or temporarily pause certain processors in pipelines, and a simple data flow expressed in the design of abstract classes and their methods. I hope this description did not overwhelm you and that you now have an idea of what happens to data in BSPump.
The concept of pipelines with the publish-subscribe mechanism is flexible and robust. Not only can the pipelines run alongside one another and process data in real-time, they can also subscribe to events (such as system interrupts) so that sinks finish sending pending data to output data stores or applications before the pipelines are shut down. In this way, we can be sure there are no data losses along the way.
While I was working on the concept of database sinks and my colleague Mila was focusing on source processors (reading data from logs and other inputs), Honza worked on adapting an Elasticsearch connector from our previous project so that it could also be used in sink processors. We work with Elasticsearch a lot and use Kibana visualizations built from its indexed data, so implementing an Elasticsearch connector was one of our first decisions when it came to BSPump. Our team was quite busy implementing all the features, and once the workshop had finished we had to decide what to do next. Ales made a few refinements to the design and architecture afterwards, but the workshop itself was successful and created a basis for BSPump, which we have been extending ever since.
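The way a sink can use publish-subscribe to avoid losing data on shutdown can be sketched as follows. This is a simplified illustration under stated assumptions, not BSPump's implementation: the `PubSub` and `BufferedSink` classes, the `"shutdown"` message name, and the `bulk_size` parameter are all hypothetical.

```python
class PubSub:
    """Minimal publish-subscribe hub keyed by message type."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, message_type, callback):
        self._subscribers.setdefault(message_type, []).append(callback)

    def publish(self, message_type):
        for callback in self._subscribers.get(message_type, []):
            callback()


class BufferedSink:
    """Collects events and writes them in bulks; flushes on shutdown."""

    def __init__(self, pubsub, bulk_size=3):
        self.buffer = []
        self.bulk_size = bulk_size
        self.written = []  # stands in for the output data store
        # Ask to be notified before the application exits.
        pubsub.subscribe("shutdown", self.flush)

    def process(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.bulk_size:
            self.flush()

    def flush(self):
        # Write whatever is pending, even if the bulk is not full.
        if self.buffer:
            self.written.append(list(self.buffer))
            self.buffer.clear()


pubsub = PubSub()
sink = BufferedSink(pubsub)
for n in range(5):
    sink.process(n)

# Two events are still buffered here; the shutdown message flushes them.
pubsub.publish("shutdown")
```

In a real application the `"shutdown"` message would typically be published from a signal handler, so the sink itself never needs to know why the application is stopping.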
A real-time stream processor
So, technically speaking, BSPump can process data coming from a source stream in real-time, enrich them with information (like precise location), and then transform them into a specified output format or send them to data stores like Elasticsearch. One of the most exciting features is the computation of defined metrics (which form the basis for data mining analysis) and anomaly detection. The data transformation can also be used to anonymize personal information such as emails and names as part of GDPR solutions. If you are interested in the project or would like to contribute to it, please see our GitHub project or contact us at firstname.lastname@example.org or on Gitter. BSPump is open-source and ready to integrate thoughts and solutions from a wide community!
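The enrichment and anonymization steps mentioned in this section can be sketched as two small processors. This is a minimal illustration, not code from BSPump: the `LOCATIONS` lookup table and the `enrich` and `anonymize` functions are hypothetical, and replacing an email with a SHA-256 digest is strictly speaking pseudonymization, used here only as a simple stand-in for whatever technique a real GDPR solution would require.

```python
import hashlib

# Hypothetical lookup table; a real deployment would use a proper data source.
LOCATIONS = {"10.0.0.1": "Prague"}


def enrich(event):
    """Add a location field derived from the event's IP address."""
    return {**event, "location": LOCATIONS.get(event["ip"], "unknown")}


def anonymize(event):
    """Replace the email with its SHA-256 digest (pseudonymization)."""
    digest = hashlib.sha256(event["email"].encode("utf-8")).hexdigest()
    return {**event, "email": digest}


event = {"ip": "10.0.0.1", "email": "user@example.org"}
result = anonymize(enrich(event))
```

Both functions return a new dictionary instead of modifying their input, so the original event stays untouched and the processors can be chained in any order.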