Big data is an important topic, as a 2018 study by the industry association Bitkom shows. The association surveyed more than 600 companies on trend topics and found the following:
- 57 percent plan to invest in big data or are already doing so
- The top five topics are big data (57%), Industry 4.0 (39%), 3D printing (38%), robotics (36%) and VR (25%)
- But: newer concepts and possibilities such as artificial intelligence and blockchain are still rarely used
Reading tip: What is big data?
Big data is being implemented only hesitantly
According to the study, the potential of big data is being exploited only hesitantly. The main reasons given are data protection requirements (63%), technical implementation (54%) and a lack of skilled workers (42%).
Reading tip: What is a data scientist?
I am currently focused on the technical implementation. To derive real value from big data, numerous technical prerequisites have to be in place. I would like to illustrate this with a practical example that can serve as an impulse for your own practice.
Practical example: Building a Data Lake
In my job, I often work on customer projects that build big data architectures. Below, I have condensed the majority of these projects into a kind of blueprint, which I would like to present today. To illustrate it, I have written it up as a case study.
The customer is a fictitious large bank. The system copies data in raw form from various sources, e.g. databases and data streams, into a staging area. A data stream is a continuous flow of data records whose end usually cannot be predicted in advance, e.g. a bank's transfers, deposits to an account, a pulse monitor in a hospital or the temperature readings of a weather station. The staging area ensures that the raw data is stored in its current form with timestamps. The advantage is that the data is still available even if the external data source is lost.
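The staging step described above can be sketched in a few lines. This is a minimal illustration, not the bank's actual implementation; all names and the record layout are my own assumptions.

```python
import datetime

def stage_record(staging_area, source_name, raw_record):
    """Store a raw record together with its source and an ingestion timestamp.

    The payload is kept completely unchanged, so it can be replayed
    later even if the external source is lost (illustrative sketch).
    """
    entry = {
        "source": source_name,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "payload": raw_record,  # raw form, no transformation yet
    }
    staging_area.append(entry)
    return entry

# Usage: stage one event from a (hypothetical) payment stream
staging = []
stage_record(staging, "payments_stream", {"account": "A-1", "amount": 120.50})
```

The key design point is that staging stores data as-is; any normalization happens in a later layer.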
The staging area stores the data on different disks in a secure and redundant storage format. Various (self-developed) algorithms then store the data again in a uniform format in the norming area. SQL queries export data to CSV files, which are stored on SharePoint. From these, numerous external IT consultants produce weekly reports in PowerPoint and Excel, which are filed in folders on SharePoint alongside the source data.
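The legacy SQL-to-CSV export flow can be sketched as follows. I use an in-memory SQLite database as a stand-in; the table schema and names are purely illustrative.

```python
import csv
import io
import sqlite3

# In-memory stand-in for the bank's normed database (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (account TEXT, amount REAL)")
conn.executemany("INSERT INTO transfers VALUES (?, ?)",
                 [("A-1", 100.0), ("A-2", 250.5)])

# The legacy flow: run a SQL query and dump the result set as CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["account", "amount"])
writer.writerows(conn.execute("SELECT account, amount FROM transfers"))
csv_report = buffer.getvalue()  # in the old setup this file went to SharePoint
```

This is exactly the kind of manual export-and-report chain the new architecture later replaces.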
Summary of the architecture:
- Data comes from different source systems
- The data is processed, standardized and persisted at the end of each day
- Various queries are made
- Reports are generated weekly by external service providers (MS Office and Sharepoint as storage location)
This is where we come into play. The customer commissioned my team and me to design a new architecture, for the following reasons:
- High effort of the reports
- Little flexibility (especially for ad-hoc requests)
- No versioned raw data
- High costs of external service providers
- Requirements from BCBS 239 and MaRisk (principles for effective risk data aggregation and risk reporting) no longer feasible
We then began rebuilding the customer's architecture. In the first step, we revised the loading processes: new data is loaded daily (at night) via batch processes, while the data streams are ingested continuously.
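The two loading paths can be sketched side by side. This is a deliberately simplified model, assuming a plain list as the landing zone; the real system of course uses a distributed store.

```python
def load_batch(source_rows, lake):
    """Nightly batch load: append a whole day's extract at once."""
    lake.extend(source_rows)

def load_stream(event, lake):
    """Streaming load: append each event as soon as it arrives."""
    lake.append(event)

lake = []
load_batch([{"id": 1}, {"id": 2}], lake)   # e.g. triggered by a nightly job
for event in ({"id": 3}, {"id": 4}):       # e.g. a continuous payment stream
    load_stream(event, lake)
```

The point is that both paths land in the same raw store; only the trigger differs (scheduler vs. arriving event).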
All this data is stored on hard drives and receives a version stamp. This concept is called a data lake with metadata management. A data lake is a very large data store, an oversized hard drive so to speak, that captures data from a wide variety of sources in its raw format.
The data lake enabled us to perform so-called data lineage. Think of data lineage as a patient record for data: it captures when data was created, how old it currently is, and every change it has undergone. It is usually displayed very clearly in diagram form.
Advantage: we could prove exactly which data basis a report was built on at time X, and use data lineage to trace any subsequent changes to the data.
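A lineage log of this kind can be modeled very simply. The following is a minimal sketch under my own assumptions about record IDs and event names, not the project's actual tooling.

```python
import datetime

def record_lineage(lineage_log, record_id, action, detail):
    """Append one lineage event (creation or change) for a record.

    Like a patient record: what happened to the data and when,
    so every report can be traced back to its exact data basis.
    """
    lineage_log.setdefault(record_id, []).append({
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
    })

log = {}
record_lineage(log, "tx-42", "created", "loaded from payments_stream")
record_lineage(log, "tx-42", "changed", "currency normalized to EUR")
# log["tx-42"] now holds the full, ordered history of the record
```

Replaying this history is what makes the "which data basis at time X" proof possible.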
The basis of our data lake was Hadoop, with modern monitoring by Prometheus, Grafana and Icinga 2, built as highly available clusters. The Hadoop Distributed File System (HDFS) is a distributed file system that stores the data across many machines, and the MapReduce algorithm splits complex, compute-intensive tasks into many small parts that run on multiple machines in parallel. This makes evaluations directly on the raw data at runtime possible.
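The MapReduce idea can be illustrated without a cluster. The sketch below counts transaction types across two data chunks, standing in for two machines; the data and field names are invented for illustration.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map: compute partial counts of transaction types within one chunk.

    In a real cluster, this runs in parallel on the node holding the chunk.
    """
    return Counter(tx["type"] for tx in chunk)

def reduce_phase(a, b):
    """Reduce: merge the partial counts coming back from all nodes."""
    return a + b

# Two "machines", each holding part of the raw data
chunks = [
    [{"type": "transfer"}, {"type": "deposit"}],
    [{"type": "transfer"}, {"type": "transfer"}],
]
partials = [map_phase(c) for c in chunks]           # distributed step
totals = reduce(reduce_phase, partials, Counter())  # aggregation step
# totals == Counter({'transfer': 3, 'deposit': 1})
```

Because the map step only touches its own chunk, the work scales out across machines; only the small partial results need to travel.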
By the way: due to the MaRisk and BCBS 239 guidelines for banks, we loaded risk data into a separate cluster that only authorized persons could access. The concern is that combining data in the lake can reveal so many cross-connections that, for security reasons, we stored certain data separately.
Now we want to process the raw data from the lake. First, we copied the data for standardized reports, which are used weekly or requested by certain departments, onto separate disks as data marts. This allowed us to enforce access control and improve performance. There were fixed (permanent) and volatile (project-based) data marts.
Limitation: I am aware that data marts are a concept from the data warehouse world, but in our context they were genuinely helpful, since we could not rely 100% on the data lake.
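Building a data mart is essentially copying a filtered subset of the lake to its own store. The following sketch assumes invented field names and a list-based lake, purely for illustration.

```python
def build_data_mart(lake, department, predicate):
    """Copy a department-specific subset of the lake into its own store.

    Fixed marts serve the weekly standard reports; volatile marts are
    created per project and dropped afterwards (simplified model).
    """
    return {
        "owner": department,
        "rows": [row for row in lake if predicate(row)],  # copy, not a view
    }

lake = [{"type": "risk", "amount": 10}, {"type": "retail", "amount": 5}]
risk_mart = build_data_mart(lake, "risk_controlling",
                            lambda r: r["type"] == "risk")
```

Because the mart is a physical copy, access can be granted per department and queries no longer compete with the lake for I/O.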
Using the customer's own internal algorithms, we standardized the data either at runtime or for the data marts. We also used software to derive new insights and models from the data, for example by forming clusters with artificial intelligence; among other things, we had data correlated or examined for deviations. A specialized AI service provider took this over. Another element is the data catalog: a catalog of metadata that holds the display rules for all data and the relationships between the different datasets.
Note: the data catalog has the important function that no relationship between certain personal risk data can be established without access to the catalog.
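The gatekeeping role of the catalog can be sketched as follows: relationships between datasets are only visible through the catalog, and the catalog checks authorization first. Class and dataset names are my own illustrative choices.

```python
class DataCatalog:
    """Minimal metadata catalog: relationships gated by an access check.

    Without catalog access, no relationship between sensitive
    datasets can be established (simplified sketch).
    """
    def __init__(self, authorized_users):
        self._authorized = set(authorized_users)
        self._relations = {}   # dataset -> set of related datasets

    def add_relation(self, dataset, related):
        self._relations.setdefault(dataset, set()).add(related)

    def related(self, user, dataset):
        if user not in self._authorized:
            raise PermissionError("no catalog access")
        return self._relations.get(dataset, set())

catalog = DataCatalog(authorized_users={"risk_officer"})
catalog.add_relation("customers", "risk_scores")
```

An authorized user can now look up that "customers" links to "risk_scores"; anyone else gets a `PermissionError` and thus cannot even discover that the link exists.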
Now we come to the evaluation logic. We want the various stakeholders in the company to be able to send standardized requests to our query server. A load balancer distributes the requests sensibly, and access control protects them. The supported interfaces include:
- Hive (SQL-like, Hadoop-compatible query language)
- Tableau (Software Tool)
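The combination of load balancer and access control in front of the query server can be sketched like this. This is a hypothetical round-robin model, not the customer's actual implementation; node and role names are invented.

```python
import itertools

class QueryServer:
    """Round-robin load balancer with a simple access check in front.

    Requests (e.g. Hive queries or Tableau connections) are only
    dispatched for known roles (illustrative sketch).
    """
    def __init__(self, nodes, allowed_roles):
        self._nodes = itertools.cycle(nodes)   # round-robin over the cluster
        self._allowed = set(allowed_roles)

    def dispatch(self, role, query):
        if role not in self._allowed:
            raise PermissionError(f"role {role!r} not allowed")
        return (next(self._nodes), query)      # node that will run the query

server = QueryServer(["node-1", "node-2"],
                     {"controlling", "department", "data_scientist"})
server.dispatch("controlling", "SELECT * FROM weekly_report")
```

Each accepted request is handed to the next node in turn, so load spreads evenly while unauthorized roles are rejected before they touch any data.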
Now we come to the actual report creation for the end users. There are three user groups in the company:
- classic controlling,
- departments (and project managers) and
- data scientists.
All three roles can send requests to our query server, subject to access control. Each role has its own type of evaluation:
- Fixed and automated reports for controlling,
- Interactive and customizable dashboards for the departments and
- Explorative Reports in Tableau for the Data Scientists.
In summary, there are the standardized automatic reports, which can be viewed as CSV or Excel, and interactive dashboards for individual real-time reporting via our own software. In addition, a team of data scientists uses Tableau to gain new insights from the data (exploratory reports).
Summary of the new architecture:
- Raw data is loaded into the data holding (Hadoop Cluster)
- Map-Reduce algorithm intelligently distributes the evaluation
- Evaluations are carried out directly from the raw data layer
- Transformation always at runtime
- Data catalog for storing data relationships
- Modeling by AI
- Query Server allows various requests from different languages
- Data Lineage for versioning the data
- Automated reports and interactive/exploratory dashboards through in-house development
Big data can change companies significantly and is high on the agenda. However, the main hurdles are the technical implementation and processing of the data. Classical concepts can no longer handle such data volumes, and legal requirements put companies under pressure to store data in a legally secure manner and to provide up-to-date reports.
In this case study, I have given an example of a technical implementation that can serve as a stimulus for practice. I built a data lake and used various well-known concepts such as Hadoop. My experience shows that implementing these concepts correctly helps unlock the potential of big data. What matters in the end is deriving meaningful reports and insights from the data.
Image source: Shop photo created by mindandi – de.freepik.com