What is Dirty Data anyway? In reference to databases it is a data that contain errors. Dirty data can contain such mistakes as spelling or punctuation, incorrect data associated with a field, incomplete or outdated data or even data that is duplicated in the database.
Other common causes of dirty data are: - Wrong fields sizes
- Wrong and inconsistent formats
- Logical inconsistency like typing zipcode into phone number box
- User Errors
Most of the problems comes when working with Text or Excel Files
Life is much easier when data source is ODBC compliant database however there are still some potential problems
Imagine that you are loading orders from different countries into your oracle datawarehouse.
Part of the data comes from text files, part from MS Excel files and some of the data is direct ODBC connenction to the source database. Some files are result of manual consolidation of multiple files
Datawarehouse Table Definition is - COUNTRY_ID INTEGER
- ORDER_ID INTEGER
- ORDER_DATE DATE
- AMOUNT NUMBER(10.2)
Every country have different formats for ORDER_DATE and Amount field. This situation is far too familiar for many ETL Consultants
In order to load data we need to make sure that format of Amount and Order_Date fields is consistent. For amount field we need to get rid of dollars, pounds and commas. It could easily done by using replace function of Advanced ETL Processor. For ORDER_DATE field we will apply multiple date formats. Result of Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format Full Transformation Result of Transformation This is just a small example how Advanced ETL Processor can help you to validate and transform data. Please feel free to Download a Free Trial or get the latest version today from only $340. Read Advanced ETL Processor FAQ , Watch Online Flash Demo , Read Product Documentation (PDF).
|