Streaming Data from Various Sources
Data processing and analysis is one of today's main research and development topics. It helps companies discover relevant information about their customers or technologies through reports, visualizations, dashboards, and other business intelligence outputs. In the previous article, I recalled our team’s workshop, where we laid the foundation for our data mining and analysis endeavors. The open-source product ASAB, which you can view and contribute to on GitHub, forms a basis for request processing, event management, and metrics computation. However, its focus is not on processing data from various sources and sending it to business intelligence applications, data warehouses, or databases. These tasks are handled by another layer of data handling, which is the subject of this article. This layer is called BSPump, short for Black Swan Pump.
Origins of Black Swan
As with other foundations for our applications, this one also started with a team workshop. After a review of ASAB's functions and possibilities, we discussed asynchronous data processing and its implementation in Python’s asyncio library. We then realized that we could use experience from our previous products and implement the data processing as multiple independent instances of so-called pipelines. Generally, a pipeline is a linear chain of connected data processors, where the first processor receives raw data from a specified source, and the last one pushes the transformed, processed, and enriched bulk of data into a specified database, file, or application.
Picture: Schema of a BSPump pipeline.
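To make the idea of a pipeline more tangible, here is a minimal sketch in plain Python with asyncio. It is not BSPump's actual API: the names `source`, `parse`, `enrich`, `ListSink`, and `run_pipeline` are made up for illustration, and the sink simply collects events in a list instead of writing to a database.

```python
import asyncio


async def source(lines):
    """Hypothetical source: yields raw events one by one."""
    for line in lines:
        yield line
        await asyncio.sleep(0)  # hand control back to the event loop


def parse(event):
    """Processor: turn 'name=value' text into a dictionary."""
    name, value = event.split("=", 1)
    return {"name": name, "value": float(value)}


def enrich(event):
    """Processor: return a copy of the event with an extra field."""
    return {**event, "unit": "ms"}


class ListSink:
    """Hypothetical sink: stores events in memory instead of a database."""

    def __init__(self):
        self.stored = []

    def process(self, event):
        self.stored.append(event)


async def run_pipeline(src, processors, sink):
    """Pass each event from the source through every processor, then to the sink."""
    async for event in src:
        for processor in processors:
            event = processor(event)
        sink.process(event)


sink = ListSink()
asyncio.run(run_pipeline(source(["latency=12.5", "latency=8"]), [parse, enrich], sink))
print(sink.stored)
```

The point of the sketch is the shape: one source at the start, one sink at the end, and an ordered list of processors in between that can be swapped or extended without touching the rest.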
My own task was to focus on Influx database outputs and on the kind of processor that delivers bulks of data to a database, which we call a sink. In the beginning, we had no clear definition of what the pipelines and processors should look like in the code or how they could be easily connected and configured. However, after a series of talks, a solution finally emerged, which you can view on GitHub. Like ASAB, BSPump is open-source and you are free to contribute to it. Our basic idea rests on two things: the publish-subscribe mechanism, which can start, finish, or temporarily pause certain processors in pipelines, and a simple data flow expressed in the design of abstract classes and their methods. I hope this description did not overwhelm you, and that you now have an idea of what happens to data in BSPump.
The concept of pipelines combined with the publish-subscribe mechanism is flexible and powerful. Not only can the pipelines run alongside one another and process data in real time, they can also subscribe to events (such as system interrupts) so that their sinks finish sending data to the output data stores or applications before shutdown. This way, we can be sure no data is lost along the way. While I was working on the concept of database sinks and my colleague Mila was focusing on source processors (reading data from logs and other inputs), Honza worked on implementing an Elasticsearch connector based on our previous project, which would also be used in sink processors. We work with Elasticsearch a lot and use Kibana visualizations built on its indexed data, so implementing an Elasticsearch connector was one of our first decisions when it came to BSPump. Our team was busy implementing all the features, and once the workshop had finished we had to decide what to do next. Ales made a few refinements to the design and architecture afterwards, but the workshop itself was successful and created a basis for BSPump, which we have been extending ever since.
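The shutdown behavior mentioned above, where a sink subscribes to an event and flushes its remaining data before the application stops, can be sketched roughly as follows. This is an illustration only, not BSPump's real classes: `PubSub`, `BufferedSink`, the `"shutdown"` message name, and the `bulk_size` parameter are all assumptions, and a list stands in for the database.

```python
class PubSub:
    """Tiny in-process publish-subscribe hub (illustrative only)."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, message_type, callback):
        self._subscribers.setdefault(message_type, []).append(callback)

    def publish(self, message_type):
        for callback in self._subscribers.get(message_type, []):
            callback()


class BufferedSink:
    """Hypothetical sink that writes events in bulks and flushes on shutdown."""

    def __init__(self, pubsub, bulk_size=3):
        self.bulk_size = bulk_size
        self.buffer = []
        self.written = []  # stands in for the database
        pubsub.subscribe("shutdown", self.flush)

    def process(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.bulk_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered as one bulk, then start a new buffer.
        if self.buffer:
            self.written.append(list(self.buffer))
            self.buffer.clear()


pubsub = PubSub()
bulk_sink = BufferedSink(pubsub)
for n in range(5):
    bulk_sink.process(n)

# Two events are still buffered here; the shutdown message forces them out.
pubsub.publish("shutdown")
print(bulk_sink.written)  # [[0, 1, 2], [3, 4]]
```

Because the sink only knows about the message name and not about who publishes it, the same mechanism could carry other signals, such as a request to pause or resume a pipeline.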
A real-time stream processor
So, technically speaking, BSPump can process data coming from a source stream in real time, enrich it with information (like precise location), and then transform it into a specified output format or send it to data stores like Elasticsearch. One of the most exciting features is the computation of defined metrics (which form the basis for data mining analysis) and anomaly detection. The data transformation can also be used to anonymize personal information, such as emails and names, as part of GDPR solutions. If you are interested in the project or would like to contribute to it, please see our GitHub project or contact us at email@example.com or on Gitter. BSPump is open-source and ready to integrate thoughts and solutions from a wide community!
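As a small illustration of the anonymization use case mentioned above, here is a sketch of a processor function that replaces email addresses in an event with a short hash-based token. It uses only Python's standard library and is not taken from BSPump; the names `EMAIL_RE`, `pseudonymize_email`, and `anonymize` are hypothetical, and the regular expression is a simplification that will not match every valid address. Note also that an unsalted hash like this is closer to pseudonymization than to full anonymization.

```python
import hashlib
import re

# Simplified pattern; real-world email matching has more edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_email(match):
    """Replace a matched email with a stable token derived from its hash."""
    digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()
    return f"user-{digest[:12]}"


def anonymize(event):
    """Return a copy of the event with emails in string fields replaced."""
    return {
        key: EMAIL_RE.sub(pseudonymize_email, value) if isinstance(value, str) else value
        for key, value in event.items()
    }


event = {"message": "login by Jane.Doe@example.org", "count": 2}
print(anonymize(event))
```

The same address always maps to the same token, so events from one user can still be correlated downstream without the address itself being stored.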