FastSort

Commercial Component

We assume that you have already learned what is described in:

If you want to find the right Transformer for your purposes, see Transformers Comparison.

Short Summary

FastSort sorts input records using a sort key. FastSort is faster than ExtSort but requires more system resources.

Component | Same input metadata | Sorted inputs | Inputs | Outputs | Java | CTL
FastSort | - | no | 1 | 1-N | - | -

Abstract

FastSort is a high performance sort component reaching the optimal efficiency when enough system resources are available. FastSort can be up to 2.5 times faster than ExtSort but consumes significantly more memory and temporary disk space.

The component takes input records and sorts them using a sorting key - a single field or a set of fields. You can specify sorting order for each field in the key separately. The sorted output is sent to all connected ports.
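The idea of a sort key with a separate order per field can be sketched in a few lines of Python. This is only an illustration of the concept, not FastSort's implementation; the function name `sort_records`, the `(field, ascending)` pair notation, and the sample records are all hypothetical.

```python
from operator import itemgetter

def sort_records(records, sort_key):
    """Sort dict records by a list of (field, ascending) pairs.

    Python's sort is stable, so sorting by the least significant field
    first and the most significant field last gives a multi-field order.
    """
    result = list(records)
    for field, ascending in reversed(sort_key):
        result.sort(key=itemgetter(field), reverse=not ascending)
    return result

records = [
    {"city": "Brno", "age": 30},
    {"city": "Aachen", "age": 25},
    {"city": "Brno", "age": 41},
]

# Hypothetical key: city ascending, then age descending.
sorted_records = sort_records(records, [("city", True), ("age", False)])
```

Records sharing the same city stay grouped together and are ordered by age within the group, which mirrors specifying a direction for each key field.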

Good results can be obtained with the default settings; only the sorting key needs to be specified. However, a number of parameters are available for tuning to achieve the best performance.

Ports

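Port type | Number | Required | Description | Metadata
Input | 0 | yes | for input data records | the same input and output metadata [ 1)]
Output | 0 | yes | for sorted data records | the same input and output metadata [ 1)]
Output | 1-N | no | for sorted data records | the same input and output metadata [ 1)]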

[ 1)] Because all output metadata must be the same as the input metadata, metadata can be propagated through this component.

FastSort Attributes

Attribute | Req | Description | Possible values
Basic
Sort key | yes | List of fields (separated by semicolons) by which the data records are sorted, including the sorting order for each field separately, see Sort Key. |
Estimated record count [ 1)] | | Estimated number of input records to be sorted. A rough guess of the order of magnitude is sufficient, see Estimated Record Count. | auto (default) / 1-N
In memory only | | If true, internal sorting is forced and all attributes except Sort key and Run size are ignored. | false (default) / true
Temp directories | | List of paths to temporary directories for storing sorted runs, separated by semicolons. | system TEMP dir (default) / other dir
Advanced
Run size (records) [ 1)] [ 2)] | | Number of records sorted at once in memory. Largely affects speed and memory requirements, see Run Size. | auto from Estimated record count (if set) / 20,000 (default) / 1000-N
Max open files | | Limits the number of temp files that can be created during sorting. Too low a number (500 or less) significantly reduces performance, see Max Open Files. | unlimited (default) / 1-N
Concurrency (threads) | | Number of worker threads doing the sorting. The default value gives the best results; overriding it may even slow the graph run down, see Concurrency. | auto (default) / 1-N
Number of read buffers [ 2)] | | How many chunks of data are held in memory at a time, see Number of Read Buffers. | auto (default) / 1-N
Average record size (bytes) [ 2)] | | Estimate of the average record size in bytes, see Average Record Size. | auto (default) / 1-N
Maximum memory (MB, GB) [ 2)] | | Rough estimate of the maximum memory that can be used, see Maximum Memory. | auto (default) / 1-N
Tape buffer (bytes) | | Buffer used by a worker for filling the output. Affects performance slightly, see Tape Buffer. | 8192 (default) / 1-N
Compress temporary files | | If true, temporary files are compressed. For more information, see Compress Temporary Files. | false (default) / true
Deprecated
Sorting locale | | Locale used for correct sorting order. | none (default) / any locale
Case sensitive | | By default (Sorting locale is none), upper-case characters are sorted separately and precede lower-case characters, which are also sorted separately. If Sorting locale is set, upper- and lower-case characters are sorted together: if Case sensitive is true, a lower-case character precedes the corresponding upper-case character, while false preserves the order in which the data strings appear in the input. | false (default) / true

[ 1)] Estimated record count is a helper attribute used for calculating the (rather unnatural) Run size automatically, as approximately Estimated record count to the power of 0.66. If Run size is set explicitly, Estimated record count is ignored.

Reasonable Run sizes vary from 5,000 to 200,000 based on the record size and the total number of records.

[ 2)] These attributes affect the automatic guess of Run size. Generally, the following inequality must hold:

Number of read buffers * Average record size < Maximum memory
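The two footnotes above can be turned into a rough back-of-the-envelope calculation. The sketch below is not FastSort's actual code: the function names are hypothetical, the run size follows only the stated approximation (Estimated record count to the power of 0.66), and the temp file estimate assumes one temp file per sorted run.

```python
import math

def estimate_run_size(estimated_record_count, exponent=0.66):
    """Approximate Run size as estimated_record_count ** 0.66."""
    return max(1, round(estimated_record_count ** exponent))

def estimate_temp_files(record_count, run_size):
    """Assumption of this sketch: each sorted run goes to one temp file."""
    return math.ceil(record_count / run_size)

def buffers_fit(read_buffers, avg_record_size, max_memory_bytes):
    """Check: Number of read buffers * Average record size < Maximum memory."""
    return read_buffers * avg_record_size < max_memory_bytes

run_size = estimate_run_size(1_000_000)
temp_files = estimate_temp_files(1_000_000, run_size)
fits = buffers_fit(read_buffers=4, avg_record_size=200,
                   max_memory_bytes=64 * 1024 * 1024)
```

For one million records this gives a run size a little under 10,000 and roughly a hundred runs. The values FastSort picks at runtime also depend on Maximum memory and record size, so they may differ.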

Advanced Description

Sorting Null Values

FastSort treats null values as equal: records in which the same Sort key fields are null are processed as if those nulls were equal to each other.
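A minimal Python sketch of this behavior follows. It maps every null (`None`) in a key field to one and the same key value, so all nulls compare as equal. Placing nulls before non-null values is an assumption of the sketch; the text above does not say where FastSort puts them. The helper name `null_equal_key` is hypothetical.

```python
def null_equal_key(fields):
    """Build a sort key function in which all nulls of a field compare equal.

    Each field becomes a pair (is_not_null, value). Every null maps to the
    same pair (False, 0), so nulls are equal to each other and, in this
    sketch, sort before any non-null value (an assumption).
    """
    def key(record):
        return tuple(
            (record[f] is not None, record[f] if record[f] is not None else 0)
            for f in fields
        )
    return key

records = [
    {"name": "b", "id": 1},
    {"name": None, "id": 2},
    {"name": "a", "id": 3},
    {"name": None, "id": 4},
]

by_name = sorted(records, key=null_equal_key(["name"]))
ids = [r["id"] for r in by_name]
```

Because the two null records have equal keys and Python's sort is stable, they keep their input order relative to each other.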

FastSort Tweaking

You do not need to set any of these attributes, but setting them can sometimes increase performance. For example, you may have limited memory, need to sort a great number of records, or have very large records. In such cases, you can fit FastSort to your needs.

  1. Estimated Record Count

    A basic attribute that tells FastSort roughly how many records it will have to deal with. The attribute is complementary to Run size; you don't need to set it if Run size is specified. On the other hand, if you don't want to spend much time on attribute settings, giving the rough number of records saves memory that would otherwise be allocated during the graph run. Run size is determined from this count together with Maximum memory, record size, etc.

  2. Run Size

    The core attribute for FastSort; determines how many records form a "run" (i.e., a bunch of sorted records in temp files). The smaller the Run size, the more temp files are created, the less memory is used, and the greater the speed achieved. On the other hand, higher values might cause memory issues. There is no rule of thumb as to whether Run size should be high or low to get the best performance. Generally, the more records you are about to sort, the bigger the Run size you might want. The rough formula for Run size is Estimated record count^0.66. Note that memory consumption multiplies with Number of read buffers and Concurrency, so higher Run sizes result in much higher memory footprints.

  3. Max Open Files

    FastSort uses relatively large numbers of temporary files during its operation. In case you hit quota or OS-specific limits, you can limit the maximum number of files to be created. The following table should give you a better idea:

    Dataset size | Number of temp. files | Default Run size | Note
    1,000,000 | ~100 | ~10,000 |
    10,000,000 | ~250 | ~45,000 |
    1,000,000,000 | 20,000 to 2,000 | 50,000 to 500,000 | Depends on available memory

    Note that the numbers in the table above are not exact and might be different on your system. Such large numbers of files can cause problems by hitting user quotas or other runtime limitations; see Performance Bottlenecks for help with solving such issues.

  4. Concurrency

    Tells FastSort how many runs (chunks) should be sorted at a time in parallel. By default, it is automatically set to 1 or 2 based on the number of CPU cores in your system. Overriding this value makes sense if your system has lots of CPU cores and you think your disk performance can handle working with so many parallel data streams.

  5. Maximum Memory

    You can set the maximum amount of memory dedicated to a single component. This is a guide for FastSort when computing Run size, i.e., if Run size is set explicitly, this setting is ignored. A unit must be specified, e.g., '200MB', '1gb', etc.

  6. Average Record Size

    You can set Average record size in bytes. If omitted, it will be computed as an average record size from the first 1000 parsed records.

  7. Number of Read Buffers

    This setting is tightly related to the number of threads (Concurrency): it must be equal to or greater than Concurrency. The more read buffers there are, the less chance that the workers will block each other. Defaults to Concurrency + 2.

  8. Compress Temporary Files

    Along with Temporary files charset, this option lets you reduce the space required for temporary files. Compression can save a lot of space but can reduce performance by up to 30%, so be careful with this setting.

  9. Tape Buffer

    Size (in bytes) of a file output buffer. The default value is 8kB. Decreasing this value might avoid memory exhaustion for large numbers of runs (e.g. when Run size is very small compared to the total number of records). However, the impact of this setting is quite small.
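To make the roles of Run size and temporary files more concrete, here is a minimal external merge sort sketch in Python. It illustrates the general technique only and is not FastSort's implementation: it is single-threaded, so Concurrency, Number of read buffers, Max open files and Tape buffer are not modeled, and it writes one temp file per run. All names (`external_sort`, `run_size`, `temp_dir`) are hypothetical.

```python
import heapq
import pickle
import tempfile
from itertools import islice

def _write_run(sorted_chunk, temp_dir):
    """Write one sorted run to a temp file and rewind it for reading."""
    f = tempfile.TemporaryFile(dir=temp_dir)
    for record in sorted_chunk:
        pickle.dump(record, f)
    f.seek(0)
    return f

def _read_run(f):
    """Yield the records of one run in order, then close its temp file."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        pass
    finally:
        f.close()

def external_sort(records, key, run_size, temp_dir=None):
    """Sort records using sorted runs of at most run_size records each."""
    it = iter(records)
    runs = []
    while True:
        # Only run_size records are held and sorted in memory at a time.
        chunk = sorted(islice(it, run_size), key=key)
        if not chunk:
            break
        runs.append(_write_run(chunk, temp_dir))
    # Merge all runs lazily; each run contributes one record at a time.
    yield from heapq.merge(*(_read_run(f) for f in runs), key=key)

data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]
result = list(external_sort(data, key=lambda r: r, run_size=3))
```

With ten records and a run size of 3, the sketch creates four runs. A smaller run size means less memory per run but more temp files to merge, which is the trade-off described under Run Size and Max Open Files.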

Tips & Tricks

Performance Bottlenecks