Sunday, March 18, 2012

A Simple External Sort

Background


Sort is a simple word and in computing field surely everyone has learned some simple sorting algorithm. Its use are trivial and often we applied it to help solving some complex issues, such as merging, unification, searching, indexing; to name a few.

Often, buying a tool to just do efficient sorting can be expensive, especially for those that comes as part of the ETL package. The consideration is rather on how much data must be processed and how soon must we accomplished the data sorting.

Consideration


  • Considers sorting data at your Database Server.
  • Favors incremental update over batch. You don't need sorting when you do incremental update.
  • Considers an efficient ETL tool if all the above does not applied.
  • Considers writing your own highly threaded sorting tools if all the above are not applicable.

Also, refers to ETL Sorting or SQL Sorting

External Sort


In case you find yourself into a resolution to consider writing your own sorting tools, consider this Simple External Sort.

External Sort, by definition, uses external storage as the intermediate storage to ensure the ability to sort data sample larger than the permitted memory constraint.

Here is the pseudo approach of a simple External Sort.
  1. Splits a subset of data into multiple smaller data collection where it is bound to the permitted memory size.
  2. Running Quick Sort at each and every data collection created, in a separate thread respectively.
  3. Writes every quick sorted data collection into a file and stores externally to your hard disk drive.
  4. Repeat all the above until the data sample completely processed.
  5. You will now have multiple sorted data subset in files stored externally in hard disk drive, you can now merge them like a Merge Sort to build your final sorted data sample.

To be exact, this is the External Merge + Quick Sort.

Future Work


From my observation, if you have done this Simple External Sort properly, it is faster than the Sort function available within .NET Collection Data Type. The performance improvement of this Simple External Sort is proportion to the data sample you selected. Personally, I recorded ~10 times faster than the sort available in .NET when the record size over a million.

What one can consider for the future work to improve this Simple External Sort is to bake in an Advance IO Elimination Algorithm to avoid writing of final quick sorted data chunk into the Hard Disk Drive, but manipulating it off the memory directly for the merging with other data chunks that already persisted to hard disk drive. This way, you can safely shorten *at least* one file chunk IO round-trip for better performance and lesser IO.

Hope this helps.

No comments:

Post a Comment