Wednesday, September 8, 2010

ETL Sorting or SQL Sorting (ORDER BY)

ETL Magic



Often hear about ETL (Extract, Transform, Load) solution for better data integration between applications, systems or solution within an enterprise. What are the major consideration before choosing an appropriate ETL solution?

Sorting, Merging, Unification, Transformation and Loading are quite commonly happen within Data Integration stage. Often, we heard of NSort, CoSort (IRI), SyncSort and others famous 3rd party ETL solution and tools. Each of these offers some great performance if not exceptional in their respective technical strength with many data integration efficiency (Terabytes of data size) bar being set higher and higher.

EnTerprise solution missing Link?



Quite honestly, many solution architects would think these ready made ETL solution is best used to glue their data movement processes within the sub-solutions altogether to not just making data extraction, transformation and loading much faster (conceptually), but also act as a missing link to gel data from many independent components within the enterprise solution to ensure they get the right data in timely fashion before going live.

Question of the day



What would be the topic of discussion today is rather, will those 3rd parties solution offers best performance value to the enterprise solution and often considered as the missing link in the solutions integration?

ETL Rules of thumb: -



If data is already in the database and you would want to extract all these data into another database, consider doing the sorting and filtering related tasks within your database server itself. Ordering, filtering and unification tasks within the database server is always consider as "as good" when compared to the other ETL solution.

If the data has to be somewhat inserted into tables within database, consider doing sorting, unification, merging (joins), filtering, data enrichment (meta tagging) and etc. altogether within the database server. Again, the same principles applied here. Database server can handle all those as good as the third parties ETL.

If the batch data integration process can be avoided by putting it into an incremental data integration approach, by all means doing it in an incremental fashion with proper application/data partitioning.

If the data is migrating from one database platform into another, try to look for some out of the box integration solutions. For example, SSIS for Microsoft SQL Server. (SSIS is free and has also recently breaking some integration records.)

Never ever thinking of re-inventing your own ETL solution, from ground up, if you are not intending to step into the red ocean, ETL business.

Consider 3rd parties ETL solutions when the data source is originated from a non-database model and the data target is another non-database model.

Consider 3rd parties ETL solutions when majority of the data transformation, extraction, unification, merging, filtering and sorting involves many different types of data sources.

Consider 3rd parties ETL solutions when it has more than one intended data target that are of different platforms.

Take Away



SQL ordering, filtering, conditional relationship/merging (joins) are almost equally as fast as the 3rd parties ETL tools.

No real magics with ready made ETL solutions, it is the solution integration architecture that counts.

Fancy solutions will cost more in terms of investment capital, learning curve, maintenance efforts, business opportunity, end product selling prices (to your customers).

No comments:

Post a Comment