
Dealing With Dirty Data

This article describes what dirty data is and how to deal with it.

What is Dirty Data anyway?


In the context of databases, dirty data is data that contains errors. These include spelling or punctuation mistakes, incorrect data associated with a field, incomplete or outdated data, and records that are duplicated in the database.

 Other common causes of dirty data are:

  • Wrong field sizes
  • Wrong and inconsistent formats
  • Logical inconsistencies, such as typing a zip code into a phone number box
  • User errors
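
The causes listed above can often be caught with simple per-record checks before loading. The following is a minimal Python sketch, not part of Advanced ETL Processor; the field names (`name`, `phone`), the size limit and the patterns are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical limit and patterns, chosen only for illustration.
MAX_NAME_LENGTH = 30
ZIPCODE_PATTERN = re.compile(r"\d{5}(-\d{4})?")
PHONE_PATTERN = re.compile(r"\+?[\d\s().-]{7,20}")


def find_problems(record):
    """Return a list of problems found in one record (a dict of strings)."""
    problems = []
    name = record.get("name", "")
    phone = record.get("phone", "").strip()

    # Incomplete data and wrong field size.
    if not name.strip():
        problems.append("name is missing")
    elif len(name) > MAX_NAME_LENGTH:
        problems.append("name exceeds field size")

    # Logical inconsistency: a zip code typed into the phone number box.
    if ZIPCODE_PATTERN.fullmatch(phone):
        problems.append("phone looks like a zip code")
    # Wrong or inconsistent format.
    elif not PHONE_PATTERN.fullmatch(phone):
        problems.append("phone has an unexpected format")

    return problems
```

Records that return a non-empty list can be rejected or routed to a review file instead of being loaded.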

Most of the problems arise when working with text or Excel files.

Life is much easier when the data source is an ODBC-compliant database; however, there are still some potential problems.

Imagine that you are loading orders from different countries into your Oracle data warehouse.

Part of the data comes from text files, part from MS Excel files, and some arrives over a direct ODBC connection to the source database.
Some files are the result of manual consolidation of multiple files.

The data warehouse table definition is:

  • COUNTRY_ID INTEGER
  • ORDER_ID INTEGER
  • ORDER_DATE DATE
  • AMOUNT NUMBER(10,2)


Every country has a different format for the ORDER_DATE and AMOUNT fields. This situation is all too familiar to many ETL consultants.

 

In order to load the data, we need to make sure that the format of the AMOUNT and ORDER_DATE fields is consistent.

For the AMOUNT field, we need to get rid of dollar signs, pound signs and commas.

This can easily be done using the replace function of Advanced ETL Processor.

What you see is what you load
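
For readers who want to see the same clean-up outside the tool, here is a minimal Python sketch. It does not show how Advanced ETL Processor implements its replace function. The function name `clean_amount` is hypothetical, and the sketch assumes that commas are thousands separators and that the decimal separator is a dot.

```python
from decimal import Decimal


def clean_amount(raw):
    """Strip currency symbols and thousands separators, then parse the number.

    Assumption: commas are thousands separators and "." is the decimal point.
    """
    cleaned = raw.strip()
    for unwanted in ("$", "£", ",", " "):
        cleaned = cleaned.replace(unwanted, "")
    # Decimal keeps the two decimal places exact, matching NUMBER(10,2).
    return Decimal(cleaned)
```

For example, `clean_amount("$1,234.50")` returns `Decimal("1234.50")`. Input that is still not a number after the replacements raises `decimal.InvalidOperation`, which makes bad rows easy to spot.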

 

For the ORDER_DATE field, we will apply multiple date formats.

The result of the Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format.

What you see is what you load
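
The idea of applying several date formats in turn can be sketched in Python as follows. This is an illustration only, not the tool's own Date Format function. The list of candidate formats and the name `normalise_date` are assumptions; in practice the list would hold whatever formats the source countries actually use.

```python
from datetime import datetime

# Hypothetical input formats; the order matters when formats are ambiguous.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")


def normalise_date(raw):
    """Try each candidate format and return 'YYYY-MM-DD HH:NN:SS.ZZZ'."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # This format did not match; try the next one.
        millis = parsed.microsecond // 1000
        return parsed.strftime("%Y-%m-%d %H:%M:%S.") + f"{millis:03d}"
    raise ValueError(f"unrecognised date: {raw!r}")
```

With these formats, `normalise_date("31/01/2024")` and `normalise_date("31.01.2024")` both return `'2024-01-31 00:00:00.000'`. A value that matches none of the formats raises `ValueError` rather than being loaded silently.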
 

 Full Data Transformation:

 Result of Data Transformation:

 

 

This is just a small example of how Advanced ETL Processor can help you validate and transform data.

About Advanced ETL Processor


Advanced ETL Processor is an ETL tool designed to automate extracting data from any database, transforming and validating it, and loading it into any database. A typical use would be to extract data from an Excel file, validate date formats, sort the data, deduplicate it, load it into an Oracle database, and run a stored procedure or SQL script once loading is completed. Unlike Oracle SQL*Loader, BCP, DTS or SSIS, Advanced ETL Processor can also add new records and update existing ones based on the primary key.


 

Testimonials

"The DBSL Integration solution eliminated our data access bottleneck that previously impeded company growth. We are now able to provide solutions to long-standing problem areas such as automated order processing and business reporting limitations. Additionally, the solution allows for new opportunities to simply hook on to our existing data sources. From development through testing, the DBSL support team continues to be helpful, resourceful and responsive to our company needs."

John Kil,
IT Manager

Our customers

BP

BBC

HSBC


