DataSampler

Commercial Component

We suppose that you have already learned what is described in:

If you want to find the appropriate Transformer for your purpose, see Transformers Comparison.

Short Summary

DataSampler passes only some input records to the output. There is a range of filtering strategies you can select from to control the transformation.

Component | Same input metadata | Sorted inputs | Inputs | Outputs | Java | CTL
DataSampler | - | no | 1 | 1-N | no | no

Abstract

DataSampler receives data on its single input edge. It then filters input records and passes only some of them to the output. You can control which input records are passed by selecting one of the filtering strategies called Sampling methods. The input and output metadata have to match each other.

Icon

Ports

Port type | Number | Required | Description | Metadata
Input | 0 | yes | For input data records | Any
Output | 0 | yes | For sampled data records | Input 0

DataSampler Attributes

Attribute | Req | Description | Possible values
Basic
Sampling method | yes | The filtering strategy that determines which records will be passed to the output. The individual strategies you can choose from are described in Advanced Description. | Simple, Systematic, Stratified, PPS
Required sample size | yes | The desired size of the output data, expressed as a fraction of the input. For example, if you want the output to be roughly 15% of the input size, set this attribute to 0.15. | (0; 1)
Sampling key | 1) | A field name the Sampling method uses to define strata. Field names can be chained in a sequence separated by a colon, semicolon or pipe. Every field can be followed by an order indicator in brackets (a for ascending, d for descending, i for ignore and r for automatic estimate). | e.g. Surname(a); FirstName(i); Salary(d)
Advanced
Random seed | | A long number that is used in the random generator. It ensures that results are random but remain identical on every graph run. | <0; N>

Legend:

1) The attribute is required in all sampling methods except for Simple.
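
The Sampling key syntax described in the attribute table can be illustrated with a small parser. This is only a sketch of the documented syntax (field names separated by a colon, semicolon or pipe, each optionally followed by an order indicator in brackets); the function name parse_sampling_key is hypothetical and is not part of DataSampler or its API.

```python
import re

# Hypothetical helper, not part of DataSampler: it only mirrors the key syntax
# described in the attribute table (separators ":", ";" or "|", optional
# order indicator a, d, i or r in brackets).
_KEY_PART = re.compile(r"^\s*([A-Za-z_]\w*)\s*(?:\(\s*([adir])\s*\))?\s*$")


def parse_sampling_key(key):
    """Return a list of (field_name, order_indicator_or_None) tuples."""
    parts = []
    for chunk in re.split(r"[:;|]", key):
        if not chunk.strip():
            continue  # tolerate a trailing separator
        match = _KEY_PART.match(chunk)
        if match is None:
            raise ValueError(f"cannot parse key part: {chunk!r}")
        parts.append((match.group(1), match.group(2)))
    return parts


print(parse_sampling_key("Surname(a); FirstName(i); Salary(d)"))
# [('Surname', 'a'), ('FirstName', 'i'), ('Salary', 'd')]
```

Fields without an order indicator come back with None in the second position, so a key such as Surname|FirstName parses as well.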

Advanced Description

A typical use case for DataSampler looks like this: you want to check whether your data transformation works properly. If you are processing millions of records, it can be useful to take only a few thousand of them and inspect the result. This component lets you create such a data sample.
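
Because Required sample size is a fraction rather than a record count, the use case above needs a small conversion. The helper below is a hypothetical illustration and the record counts are made up; it only shows the arithmetic and checks that the result stays inside the allowed interval (0; 1).

```python
def required_sample_size(total_records, wanted_records):
    """Fraction to enter as Required sample size; must fall inside (0; 1)."""
    if total_records <= 0 or not 0 < wanted_records < total_records:
        raise ValueError("wanted_records must be between 0 and total_records")
    return wanted_records / total_records


# Illustrative numbers only: roughly 5,000 records out of 5,000,000.
print(required_sample_size(5_000_000, 5_000))  # 0.001
```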

DataSampler offers four Sampling methods to create a representative sample of the whole data set: Simple, Systematic, Stratified, and PPS (probability proportional to size).

Of these methods, Simple random sampling is the simplest and quickest, and it suffices in most cases. Systematic sampling with no sorting order is as fast as Simple and also produces a strongly representative data sample. Stratified sampling is the trickiest: it is useful only if the data set can be split into separate groups of reasonable sizes; otherwise the data sample is much bigger than requested. For a deeper insight into sampling methods in statistics, see Wikipedia.
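
To make the comparison more concrete, here is a minimal sketch of three of the methods using their textbook definitions. It is not DataSampler's implementation: the function names, the record layout, and in particular the "at least one record per stratum" rule are assumptions made for illustration. That rule shows one plausible way a stratified sample can grow beyond the requested size when the strata are very small. PPS is left out.

```python
import random
from collections import defaultdict


def simple_sample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed: the same sample on every run
    return [r for r in records if rng.random() < fraction]


def systematic_sample(records, fraction, seed=0):
    """Keep every k-th record, k = round(1 / fraction), from a random start."""
    step = max(1, round(1 / fraction))
    start = random.Random(seed).randrange(step)
    return records[start::step]


def stratified_sample(records, fraction, key, seed=0):
    """Sample each stratum separately (assumed: at least one record each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        count = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, count))
    return sample


# Made-up input: 1000 records spread evenly over 4 departments.
records = [{"id": i, "dept": i % 4} for i in range(1000)]

print(len(simple_sample(records, 0.1, seed=42)))      # close to 100, varies with the seed
print(len(systematic_sample(records, 0.1, seed=42)))  # 100
print(len(stratified_sample(records, 0.1, key=lambda r: r["dept"], seed=42)))  # 100

# With one stratum per record, every stratum contributes one record,
# so the "10%" sample ends up containing all 1000 records.
print(len(stratified_sample(records, 0.1, key=lambda r: r["id"], seed=42)))    # 1000
```

Passing the same seed twice returns the same sample, which corresponds to the behaviour described for the Random seed attribute.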