External Sorting

From

Revision as of 15:01, 25 March 2009 by 209.237.84.181 (Talk)
Jump to: navigation, search

An external sort is one in which the entire data to be sorted does not fit into internal memory at the same time. Therefore, only portions of the data may be in the internal memory at any given time, while the remainder stays on the disk. Because of the large amount of disk I/O required for an external sort, any external sorting algorithm is much slower than an internal sorting algorithm and should only be used when there is no other alternative. The number of comparisons is not a particularly significant performance measure for an external sort due to the time required for I/O.


Most external sorting algorithms fall into the category of a Sort-Merge algorithm. This means that the algorithm has two stages. The first is stage involves sorting portions of the data, and the second stage involves a series of merges of the sorted portions.


Let’s consider a basic file merge to begin with. Assume that I have two files, F0 and F1. These files contain integers and the individual files are sorted. To merge the two files, we open a third file, F2, and create a two place buffer in main memory. One place in the buffer is for an integer from F0 and the other place is for an integer from F1. The merge algorithm is as follows:

Let the buffer be called B where B[0]is for F0 and B[1] is for F1.
 
Load B[0] with the first element from F0 and B[1] with the first element from F1.
 
While neither file is empty do 
	If B[0] < B[1] 
		write B[0] to F2 and replace B[0] with the next value from F0; 
	Else
		write B[1] to F2 and replace B[1] with the next value from F1. 

When one file becomes empty, write the remainder of the other file to F2.
 


As an example, consider the merge of the following:


(...need tables or diagram here...)


Simple Sort-Merge


This is about the simplest of the external sorting algorithms. It is not, strictly speaking, a sort-merge algorithm, but it does require sorting and merging. Suppose I have a very large
file called source that needs to be sorted. In addition, I have two temporary files called F1 and F2 that will be used for the process and an internal array called A of size k. Normally, this array is chosen to be as large as possible and still fit into the memory.

The first step is to read into the array A a block of size k from the file source. This block is sorted by an internal sorting algorithm and written to file F1. This ends the setup step. The second step is to read into the array a block of size k from the file source, sort the array using an internal sorting algorithm, and then merge the array A with file F1, writing the result to file F2. This step is repeated until the file source has been completely read. However, with each repetition, we alternate between merging from F1 to F2 and from F2 to F1.

As we read the file source in blocks of k, we will eventually have read the entire file. When this happens, the last file that was the destination of the merge contains the sorted version. As an example of a simple sort-merge, we will use the following data with a block size of 4: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.


Balanced m-Way Sort-Merge

A second external sort is the Balanced m-Way Sort-Merge. For this algorithm, we have a large unsorted file called source. We also have 2m auxiliary files called F1, F2, …, F2m. Finally we have a block size called k.


The basic algorithm is in two phases. In the sort phase, the source file is read into the memory in blocks of k and each block is sorted by an internal sorting algorithm. This means that k is chosen to be a large value, but not so large that the memory cannot hold an array of k elements from source. After a block is sorted, it is written to one of the first m auxiliary files. The blocks are written in a round-robin fashion. That is, block 1 is written to F1, block 2 is written to F2, … block m is written to Fm, block m + 1 is written to F1, block m + 2 is written to F2, and so on. This process continues until the source file has been read completely. Note that the last block is likely to be less than size k.


Once the source file has been completely distributed over the first m auxiliary files, the second phase—the merge phase—begins. In this phase, the first block of each file F1 to Fm are merged to a block on Fm+1. Then the second block of each auxiliary file F1 to Fm are merged to a block on Fm+2. This pattern continues until all the blocks from F1 to Fm are merged to blocks on Fm+1 to F2m. Note that, in general, the blocks on Fm+1 to F2m are of size m*k. Now the pattern of merging blocks continues but taking a block from each file Fm+1 to F2m and merging them to a block on F1 to Fm. With each cycle, the block sizes grow by a factor of m. Eventually, as the blocks get larger, we end up with one block. When this happens, the one block is the sorted source file.


As an example, we will use a 2-way balanced sort-merge with a block size of 4 to sort the following: 81, 94, 11, 96, 12, 35, 17, 95, 28, 14, 39, 58, 75, 15.

Personal tools
MediaWiki Appliance - Powered by TurnKey Linux