External Sorting

From

Revision as of 15:04, 25 March 2009 by 209.237.84.181 (Talk)
Jump to: navigation, search

An external sort is one in which the entire data to be sorted does not fit into internal memory at the same time. Therefore, only portions of the data may be in the internal memory at any given time, while the remainder stays on the disk. Because of the large amount of disk I/O required for an external sort, any external sorting algorithm is much slower than an internal sorting algorithm and should only be used when there is no other alternative. The number of comparisons is not a particularly significant performance measure for an external sort due to the time required for I/O.


Most external sorting algorithms fall into the category of a Sort-Merge algorithm. This means that the algorithm has two stages. The first is stage involves sorting portions of the data, and the second stage involves a series of merges of the sorted portions.


Let’s consider a basic file merge to begin with. Assume that I have two files, F0 and F1. These files contain integers and the individual files are sorted. To merge the two files, we open a third file, F2, and create a two place buffer in main memory. One place in the buffer is for an integer from F0 and the other place is for an integer from F1. The merge algorithm is as follows:

Let the buffer be called B where B[0]is for F0 and B[1] is for F1.
 
Load B[0] with the first element from F0 and B[1] with the first element from F1.
 
While neither file is empty do 
	If B[0] < B[1] 
		write B[0] to F2 and replace B[0] with the next value from F0; 
	Else
		write B[1] to F2 and replace B[1] with the next value from F1. 

When one file becomes empty, write the remainder of the other file to F2.
 


As an example, consider the merge of the following:


(...need tables or diagram here...)


Simple Sort-Merge


This is about the simplest of the external sorting algorithms. It is not, strictly speaking, a sort-merge algorithm, but it does require sorting and merging. Suppose I have a very large
file called source that needs to be sorted. In addition, I have two temporary files called F1 and F2 that will be used for the process and an internal array called A of size k. Normally, this array is chosen to be as large as possible and still fit into the memory.

The first step is to read into the array A a block of size k from the file source. This block is sorted by an internal sorting algorithm and written to file F1. This ends the setup step. The second step is to read into the array a block of size k from the file source, sort the array using an internal sorting algorithm, and then merge the array A with file F1, writing the result to file F2. This step is repeated until the file source has been completely read. However, with each repetition, we alternate between merging from F1 to F2 and from F2 to F1.

As we read the file source in blocks of k, we will eventually have read the entire file. When this happens, the last file that was the destination of the merge contains the sorted version. As an example of a simple sort-merge, we will use the following data with a block size of 4: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.


Balanced m-Way Sort-Merge

A second external sort is the Balanced m-Way Sort-Merge. For this algorithm, we have a large unsorted file called source. We also have 2m auxiliary files called F1, F2, …, F2m. Finally we have a block size called k.


The basic algorithm is in two phases. In the sort phase, the source file is read into the memory in blocks of k and each block is sorted by an internal sorting algorithm. This means that k is chosen to be a large value, but not so large that the memory cannot hold an array of k elements from source. After a block is sorted, it is written to one of the first m auxiliary files. The blocks are written in a round-robin fashion. That is, block 1 is written to F1, block 2 is written to F2, … block m is written to Fm, block m + 1 is written to F1, block m + 2 is written to F2, and so on. This process continues until the source file has been read completely. Note that the last block is likely to be less than size k.


Once the source file has been completely distributed over the first m auxiliary files, the second phase—the merge phase—begins. In this phase, the first block of each file F1 to Fm are merged to a block on Fm+1. Then the second block of each auxiliary file F1 to Fm are merged to a block on Fm+2. This pattern continues until all the blocks from F1 to Fm are merged to blocks on Fm+1 to F2m. Note that, in general, the blocks on Fm+1 to F2m are of size m*k. Now the pattern of merging blocks continues but taking a block from each file Fm+1 to F2m and merging them to a block on F1 to Fm. With each cycle, the block sizes grow by a factor of m. Eventually, as the blocks get larger, we end up with one block. When this happens, the one block is the sorted source file.


As an example, we will use a 2-way balanced sort-merge with a block size of 4 to sort the following: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.


Polyphase Sort-Merge

The problem with a Balanced m-Way Sort-Merge is that the entire file to be sorted must pass through the computer’s memory on each pass. The Polyphase Sort-Merge reduces the total number of bytes that need to be transferred by using a unique distribution scheme based on a kth order Fibonacci sequence.


Example: Suppose I have a source file that divides into 25 blocks, where each block can be sorted by an internal sorting algorithm. In addition, I have 5 auxiliary files called F1, …, F5 that are used as temporary files. Now, in the sort phase of the algorithm, I read my source file one block at a time, sort the block, and write the block to one of the files F1 to F4. Note that I am leaving F5 empty as it will be the destination of the merge phase for the first pass. However, unlike a Balanced m-Way Sort-Merge, I do not distribute the blocks equally across the auxiliary files. The blocks are distributed in the following manner.

(...need table here...)

Personal tools
MediaWiki Appliance - Powered by TurnKey Linux