/*Blog module replace 'read more' text*/

Supporting data migration related data cleansing process using the KNIME Analytics Platform

MINDSPIRE BLOG

Follow MINDSPIRE on social!

Aug 10, 2022 | Data Migration Blog

Introduction

According to our experience, due to the lack of resources, data cleansing activities during data migration projects are often overlooked, which of course cannot be considered an ideal approach.

In the framework of our pilot project, we examined which tool could be used simply, efficiently and without developer knowledge to support the unique data cleaning tasks of a data migration project.

With the help of such a solution and the related methodology, it could be possible to avoid the question of data quality fading into the background during the execution of ETL (Extract, Transform, Load) tasks.

Ákos Erdész

Ákos Erdész

Business Analyst / Data Migration Expert

Ákos has been working in the banking sector for almost 14 years. In the past 5-6 years, he participated in several major bank data migration projects in Hungary and the Central and Eastern European region. As a migration specialist and data set owner, his main tasks are to conduct logical mapping workshops, develop transformation rules and prepare parameterization in the Delta Migration Tool and evaluate migration tests. In addition to project activities, his other great passion is Excel, which he holds regular training sessions for his colleagues.

Preface

Our company, MINDSPIRE Consulting, has been offering data migration services to its banking and financial institution clients for many years, and has completed several successful ETL projects in Hungary and in the CEE region. Based on our previous experience, we are constantly developing our data migration methodology and tools and our goal is to be able to offer the most comprehensive services to our customers.

As a part of this effort, we are constantly researching new methodology elements and approaches that are directly or indirectly related to data migration in banking and can be useful additions to our data migration methodology or to our DELTA data migration tool. During our activities of this nature, we have previously developed a special reporting function for our DELTA solution, and created our own data migration reconciliation methodology.

In this blog post, we address another such related area, data cleansing, or data scrubbing. Although we have not yet developed a specific data cleaning functionality in the MINDSPIRE DELTA tool, the data transformation processes we have performed almost always include the checking, correcting, and deduplicating the source data to be migrated, so we have already gained a lot of experience in this area.

Using our data migration and data cleansing experience, MINDSPIRE, together with Onespire Ltd., a member of Inovivo Group, started an interesting pilot project as well. We are sharing the details of this undertaking below.

Data cleansing on migration projects

During a data migration project, IT managers, business managers and various developers and IT specialists may encounter a wide variety of issues related to data cleanliness. From a simple typing mistake in a customer master data – such as name or mailing address – to logical data gaps and complex data consistency errors affecting several database tables.

Despite the fact that data cleansing and data migration activities are rather closely related to each other, according to our experience, prior to a migration project, data cleaning tasks in systems containing the source data to be transferred are usually overlooked. Also, it can be stated in general that data cleansing as an activity is not part of the daily operational processes. This is typically due to a lack of resource capacity. At the start of data migration projects, most of the time, the goal is to carry out the minimal data corrections necessary for a technically successful migration, but not to fully solve the known data cleanliness problems.

The correction of discovered data errors can be performed in several ways during the execution of the project. Ideally, errors will be partially or fully corrected in the source system, based on the prepared error reports.

If this is not possible, then typically the task of transferring the data to the target systems in the required structure, format and appropriate data quality falls to the experts developing the transformation logic.

Error corrections can be performed using data tables developed in SQL, built into the transformation logic. The data cleansing method should be selected according to the extent to which the data error affects several data categories (for example, client, invoice, credit, bank card, etc.) or whether it is a unique error specific only to the given data category.

With having numerous domestic and foreign data migration references in the banking industry, we have already encountered a variety of data errors and inconsistencies.

One of the difficulties of the data cleansing process lies precisely in the fact that a solution developed in a given project cannot be used in an unchanged form during the execution of another task. So, at the beginning of a new project, after the data quality assessment, a completely customized concept has to be developed to handle the data cleanliness issues. The management of the data errors in the given projects can be significantly facilitated by the methodological experience accumulated during previous successful data migration activities and the earlier developed technical solutions.

Outlook

Based on the above factors, we started to investigate what alternative solutions exist beyond our current toolkit, which can support the execution of data scrubbing tasks during the implementation of a data migration project.

We looked for a solution in the realm of Data Science, exploiting the competencies found within our group.

First, we examined how the Data Science (DS) toolkit can cover the data cleansing needs arising in a migration project.

In doing so, we determined that the DS methodology has many aspects that are not relevant to a data migration project (for example, scaling and normalization). However, we also found several methodological elements (such as replacing missing values, removing duplicates, type conversion, etc.) that can be easily transferred to the data migration methodology.

As part of a pilot project (Proof of Concept, POC), we have created a data cleansing process using KNIME, which is a popular tool in the Data Science environment.

In the framework of the POC, we chose the KNIME platform because it has a number of specific functions that can be used effectively to solve data scrubbing tasks, in addition to its ease of use.

Introducing the KNIME tool

What is KNIME?

The KNIME Analytic Platform is a free, open-source data analysis, reporting and integration platform capable of covering the entire data science project life cycle, effectively supporting data extraction – data transformation – data loading (ETL) processes.

KNIME gives a wide range of possibilities to the community of more than 100,000 users from 66+ countries, who, in addition to ETL processes, also prefer the tool for data cleaning, training algorithms, deep learning, predictive analytics, interactive visualization, and report creation.

What can the KNIME platform be used for?

KNIME is good at identifying data patterns, so this hidden information can be exploited. For example, the document identifiers usually consist of a specific number of characters in a given format, records that differ from this can be effectively filtered out using the solution. The tool thus supports business decisions in diverse areas such as healthcare, financial services, including the banking sector, manufacturing, production, retail, research and development, consulting or the government sector.

Using the KNIME tool

KNIME does not require programming skills, a complete process can be built using the drag and drop method on the user interface (workbench). A workflow consists of the so-called nodes, which are elementary units performing a data function, namely they perform a certain task on the data, for example reading a csv file, filtering a column, connecting several tables. The nodes are grouped into categories with descriptive names, with an efficient search added, so users can more easily find the most suitable data function from the more than 2,000 available variations.

Introducing the KNIME data cleansing pilot project

The data scrubbing pilot project performed by Onespire’s Data Science competence center

The purpose of the pilot project was to investigate whether KNIME can be used to support the data cleaning tasks of a data migration project. A significant advantage of the KNIME platform, in addition to its ease of use, is its flexibility. A workflow built with the tool can be easily and quickly modified by inserting new steps or by replacing and configuring the previous steps. Its disadvantage is that in case of larger amounts of data, we may face performance issues using the free version.

The first completed KNIME sample workflow (demo workflow)

KNIME sample workflow

Structure of the KNIME workflow

The entire workflow, performing the data cleansing of a 10-record client master data, was divided into four separate tasks. Two experts from our Data Science competence center worked as a team to develop the workflow, along the following steps:

  1. Defining the data scrubbing phase based on the theoretical background of Data Science’s data cleansing methodology.
  2. Selecting or building a sample database.
  3. Creating the workflow structure using the KNIME workbench.
  4. Iterative process of testing and correction.

During the actual execution, the data cleansing process was compiled by selecting the appropriate nodes. In the meantime, it was possible to exploit the potential of the KNIME community, as we were able to borrow ideas or even ready-made solutions for many specific problems from the user base of more than 100,000 people. The nodes were selected mainly on a “trial and error” basis, when one of them did not produce the expected result, it was easy to replace it with another one. With that, the complete assembly of the workflow took 6-8 hours including testing.

The workflow itself ran on a limited number of records, which “intentionally” contained incorrect customer data.

The steps of the workflow

The workflow originated from a general data cleansing methodology used in the Data Science world.

As a first step, we had to give access to the data, which was stored in two separate csv files in this case. By selecting the appropriate Joiner node and specifying the key between the two tables, KNIME creates a data table from the two source files.

Following that, we have performed the data cleaning tasks on the source data in several steps:

  • converting date fields to the same format,
  • converting different currencies into the same currency (during real projects this can only be used in certain cases),
  • correction of misspellings in country names,
  • filtering out inappropriate data columns,
  • replacing the missing values according to a predefined logic (for example determining missing travel cost to the given destination country using average calculation) (during real projects this can only be used in certain cases),
  • performing type conversions,
  • identifying outliers,
  • carrying out the normalization on the travel cost column,
  • finally, compiling the cleansed data table.

The steps above reflect a general Data Science data scrubbing approach, however not all elements of this are relevant to the data cleansing tasks of a migration project. On one hand, bringing the data in a given field to the same format or verifying the formal correctness of the data stored in a field (e.g.: ID number data) is extremely important, but the normalization of the data or the replacement of missing data by average calculation for example, is not a relevant aspect during data migration.

During the sample project, we have examined in detail the data cleaning methods according to the Data Science methodology and their possible role in data migration projects.

The result of this is summarized in the table below.

KNIME Analytics Platform user interface

KNIME workflow

Supporting data migration related data cleansing process using the KNIME Analytics Platform: Conclusion

The accomplished data cleansing pilot project confirmed our preliminary assumption that there are many similarities between the methods defined by Data Science and used in practice, as well as the mostly ad-hoc solutions used for data scrubbing tasks arising in data migration projects.

Therefore, it is definitely worth using methodological know-how or even practical knowledge with Data Science tool systems during the planned expansion of our data migration and cleansing services. Based on the experience of the pilot project so far, the methodology in which the KNIME workflow is used to plan, build and test the data cleansing function, iterating the steps until the desired result is achieved, seems to be optimal.

However, in the financial sector a final data cleansing solution can best be imagined in an independent module written in a robust programming language, which can ensure processing large amounts of data with adequate response times.

For IT and business managers and information technology experts working on data migration projects, discovering the KNIME tool can be beneficial because the solution supports the analysis of the patterns hidden behind the data, and without developer knowledge.

Therefore, the tool can be used much more widely, which makes it possible to involve additional colleagues in the implementation of projects often facing resource problems.

If you have any questions about supporting the data cleaning processes performed during data migration with the KNIME Analytics Platform, request a callback from our experts!

More information is available about our data migration services here.

Data cleansing methods according to the Data Science methodology and their possible role in data migration projects

TaskData Science aspectData migration aspect
Data profilingVerifying whether a given column conforms to certain standards or patterns.It is important to check all database fields which store data in a specific structure, such as document number or ZIP code.
VisualizationsBy analyzing and displaying data using statistical methods such as mean, standard deviation, range, or quantiles, we may find values that are unexpected and thus erroneous.It is irrelevant from a data migration point of view.
Cleansing irrelevant dataIrrelevant data is those that is not actually required and does not fit the context of the problem to be solved.After the business (logical) mapping workshops, it is possible to determine the scope of the source tables and the fields within them that are relevant for the data migration scope.
Removing duplicatesDuplicates are data points that are repeated in the dataset.It is also an important aspect in data migration projects, where we mainly (but not exclusively) pay special attention to the deduplication of client records. For this purpose, we define project-specific business attributes, duplicates are identified by examining their similarity and their scoring-based evaluation.
Type conversionEnsuring numbers are stored as numeric data types and dates as date objects or timestamps. Values that cannot be converted to the specified type should be changed to N/A, which indicates that the value is incorrect and needs to be corrected.A significant part of these verifications already take place during the loading into the migration database, during which the identified inconsistencies are reported back to the source data owners.
Syntax errors- Removing empty spaces: Extra spaces at the beginning or at the end of the string should be removed.



Some of the syntax errors can be managed when loading into the migration database, for example: removing empty spaces, pad strings.
- Pad strings: Strings can be padded with spaces or other characters to reach a certain width. For example, some numeric codes are often marked with leading zeros so that they always have the same number of digits.The identification of other syntax errors typically involves a deeper data analysis task, for example: Budapest had 17 different versions in the city column of a bank’s database.
- Fix typing mistakes in character strings.
- “Fuzzy matching”: An algorithm identifying the distance between the expected character string(s) and the given character string(s). It basically counts how many operations are needed to convert one string to another.
- Barplot is useful for displaying all unique values. It is noticeable that some values are different, but have the same meaning, for example "information_technology" and "IT". Sometimes the difference is caused just by the spelling, as in the case of "other" and "Other".
Standardization- In addition to recognizing typing errors, our task is also to store each value in a standardized format.It is also relevant from a data migration point of view.
- In case of numerical values, it has to be ensured that each value has a certain unit of measurement.
- For dates, the US version is different from the European version. Recording a date as a timestamp (thousandths of a second) number is not the same as recording a date as a date object.
Scaling / TransformationScaling means transforming the data so that it falls within a specific range, such as 0-100 or 0-1. It can also help certain types of data to be easier to display and compare.It is irrelevant from a data migration point of view.
NormalizationWhile normalization also scales values to a given range between 0 and 1, it aims to ensure that the normalized data is distributed normally. In most cases, we normalize the data if statistical methods will be used that assume a normal distribution.It is irrelevant from a data migration point of view.
Missing values- Drop: If the missing values in a column occur rarely and randomly, the simplest and most favorable solution is to drop the rows containing the missing observations. If most of the values in a column are missing and occur randomly, a typical decision is to discard the entire column.During the data migration, the lack of data relevant to the target system must be corrected primarily in the source systems. When this is not possible, then the data must be replaced "artificially", according to the logic built into the data migration transformation.
- Impute: calculating the missing values based on other observations.
- Flag: Missing numeric data can be replaced with 0, but these zeros should be ignored when calculating statistical values or displaying the distribution.
Although categorical data can be filled with: “missing”: A new category indicating that this data is missing.
Outlier searchThere are values that differ significantly from all other observations. Outliers are innocent until proven guilty. By this we mean that they should not be removed unless we have a good (business) reason to do so.It is irrelevant from a data migration point of view.
In-record & cross-datasets errorsExample: A data set is given about the cost of living in cities. The 'total' column should equal the sum of rent, transportation and food.For example, the date of account status changes must be later than the account opening date, etc.
VerificationAfter data cleansing, the results are checked to verify their correctness.It is also relevant from a data migration point of view.
RiportingCreating a report on the changes made and the quality of the currently stored data.It is also relevant from a data migration point of view.

Please see MINDSPIRE’s related:

 

Data Migration
Services

Data migration services icon 2022

Data Migration
References

Supporting data migration related data cleansing process using the KNIME Analytics Platform
Project management contact form

Do you have a question or comment about this post?

Send us your message and our experts will contact you!

Data migration icon

Our latest Data Migration References

Are you interested in our Banking Data Migration services?

You can find more information here:

Share This