Sunday, March 18, 2012

A Simple External Sort

Background


Sort is a simple word and in computing field surely everyone has learned some simple sorting algorithm. Its use are trivial and often we applied it to help solving some complex issues, such as merging, unification, searching, indexing; to name a few.

Often, buying a tool to just do efficient sorting can be expensive, especially for those that comes as part of the ETL package. The consideration is rather on how much data must be processed and how soon must we accomplished the data sorting.

Consideration


  • Considers sorting data at your Database Server.
  • Favors incremental update over batch. You don't need sorting when you do incremental update.
  • Considers an efficient ETL tool if all the above does not applied.
  • Considers writing your own highly threaded sorting tools if all the above are not applicable.

Also, refers to ETL Sorting or SQL Sorting

External Sort


In case you find yourself into a resolution to consider writing your own sorting tools, consider this Simple External Sort.

External Sort, by definition, uses external storage as the intermediate storage to ensure the ability to sort data sample larger than the permitted memory constraint.

Here is the pseudo approach of a simple External Sort.
  1. Splits a subset of data into multiple smaller data collection where it is bound to the permitted memory size.
  2. Running Quick Sort at each and every data collection created, in a separate thread respectively.
  3. Writes every quick sorted data collection into a file and stores externally to your hard disk drive.
  4. Repeat all the above until the data sample completely processed.
  5. You will now have multiple sorted data subset in files stored externally in hard disk drive, you can now merge them like a Merge Sort to build your final sorted data sample.

To be exact, this is the External Merge + Quick Sort.

Future Work


From my observation, if you have done this Simple External Sort properly, it is faster than the Sort function available within .NET Collection Data Type. The performance improvement of this Simple External Sort is proportion to the data sample you selected. Personally, I recorded ~10 times faster than the sort available in .NET when the record size over a million.

What one can consider for the future work to improve this Simple External Sort is to bake in an Advance IO Elimination Algorithm to avoid writing of final quick sorted data chunk into the Hard Disk Drive, but manipulating it off the memory directly for the merging with other data chunks that already persisted to hard disk drive. This way, you can safely shorten *at least* one file chunk IO round-trip for better performance and lesser IO.

Hope this helps.

Wednesday, February 29, 2012

SCRUM - Dos and Don'ts

Background


Many people practice Agile these days and many of them are with SCRUM Methodology. We need to draw a line on what we must do and what we must not. Reason are simple, it violates the essence of Agile.

Read here: SCRUM Principles.

Dos and Don'ts


  • Do: Allows only one person to play the SCRUM Master role within the team.
  • Don't: Never run a SCRUM with more than one SCRUM Master or rotates the SCRUM Master.
  • Do: Ensure SCRUM Master and Product Owner roles were played by different person.
  • Don't: Never make Product Owner and SCRUM Master the same person. It creates conflict of interests.
  • Do: Scales your SCRUM team when it makes a large team or it is geographically separated.
  • Don't: Never make a team larger than 7 nor geographically separated. Your first impediment before your SCRUM starts will be distance, and distance and distance!
  • Do: Favour the approach of SCRUM for SCRUM. Creates sub teams to run proper SCRUM and runs another final SCRUM with ALL sub SCRUM Teams' SCRUM Masters and SCRUM Product Owners respectively.
  • Don't: Never run a SCRUM that cannot be scaled.
  • Do: A SCRUM Master/Product Owner works best when he/she has 1 team to run. Less good when with 2 teams.
  • Don't: Never let the same SCRUM Master/Product Owner runs more than a team if you want to build a successful SCRUM team.
  • Do: Listen to everyone's concern in retrospective and make sure you have them copied.
  • Don't: Never defence for the team when in SPRINT Retrospective.
  • Do: Team oriented and takes ownership.
  • Don't: Never take anything beyond the team. Don't be personal, and it is never anything about you or anybody.