If you want to find the appropriate Transformer for your purpose, see Transformers Comparison.
DataSampler passes only some input records to the output. There is a range of filtering strategies you can select from to control the transformation.
Component | Same input metadata | Sorted inputs | Inputs | Outputs | Java | CTL |
---|---|---|---|---|---|---|
DataSampler | - | no | 1 | 1-N | no | no |
DataSampler receives data on its single input edge. It then filters input records and passes only some of them to the output. You can control which input records are passed by selecting one of the filtering strategies called Sampling methods. The input and output metadata have to match each other.
Port type | Number | Required | Description | Metadata |
---|---|---|---|---|
Input | 0 | yes | For input data records | Any |
Output | 0 | yes | For sampled data records | Input0 |
Attribute | Req | Description | Possible values |
---|---|---|---|
Basic | |||
Sampling method | yes | The filtering strategy that determines which records will be passed to the output. The individual strategies you can choose from are described in Advanced Description. | Simple, Systematic, Stratified, PPS |
Required sample size | yes | The desired size of output data expressed as a fraction of the input. If you want the output to be e.g. 15% (roughly) of the input size, set this attribute to 0.15. | (0; 1) |
Sampling key | 1) | A field name the Sampling method uses to define strata. Field names can be chained in a sequence separated by a colon, semicolon or pipe. Every field can be followed by an order indicator in brackets (a for ascending, d for descending, i for ignore and r for automatic estimate). | e.g. Surname(a); FirstName(i); Salary(d) |
Advanced | |||
Random seed | | A long number that is used in the random generator. It ensures that results are random but remain identical on every graph run. | <0; N> |
Legend:
1) The attribute is required in all sampling methods except for Simple.
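The Sampling key syntax from the table above can be illustrated with a small parser. This is only a sketch in Python, not the component's own code: the function name `parse_sampling_key` and the `ORDER_NAMES` mapping are hypothetical, and because the source does not say what happens when an order indicator is omitted, the sketch simply returns `None` in that case.

```python
import re

# Hypothetical mapping of the order indicators listed in the attribute table.
ORDER_NAMES = {"a": "ascending", "d": "descending", "i": "ignore", "r": "auto"}

def parse_sampling_key(key):
    """Split a Sampling key string into (field name, order) pairs."""
    parts = []
    # Field names may be separated by a colon, semicolon or pipe.
    for token in re.split(r"[:;|]", key):
        token = token.strip()
        if not token:
            continue
        # A field name, optionally followed by one indicator in brackets.
        match = re.fullmatch(r"(\w+)\s*(?:\(([adir])\))?", token)
        if match is None:
            raise ValueError(f"Cannot parse key part: {token!r}")
        field, indicator = match.groups()
        # Assumption: a missing indicator is reported as None.
        parts.append((field, ORDER_NAMES.get(indicator)))
    return parts

print(parse_sampling_key("Surname(a); FirstName(i); Salary(d)"))
# [('Surname', 'ascending'), ('FirstName', 'ignore'), ('Salary', 'descending')]
```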
A typical use case for DataSampler looks like this: you want to check whether your data transformation works properly. If you are processing millions of records, it can be useful to take only a few thousand of them and inspect the result. This component lets you create such a data sample.
DataSampler offers four Sampling methods to create a representative sample of the whole data set:
Simple - every record has an equal chance of being selected. The filtering is based on a double value chosen (approximately uniformly) from the <0.0d; 1.0d) interval. A record is selected if the drawn number is lower than Required sample size.
Systematic - has a random start and then proceeds by selecting every k-th element of the ordered list. The first element and the interval are derived from Required sample size. For the results to be representative, the method depends on the data set being arranged in the sort order given by Sampling key. There are also cases where you might need to sample an unsorted input. Although you always have to specify Sampling key, you can suppress its sort order by setting the order indicator to i for "ignore", so that the data set's sort order is disregarded. Example key setting: "InvoiceNumber(i)".
Stratified - if the data set contains a number of distinct categories, the set can be organised by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements are selected on a random basis. At least one record from each stratum is selected.
PPS (Probability Proportional to Size Sampling) - the probability of selecting each record is set proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the field you have chosen in Sampling key. The method then uses Systematic sampling for each group of records.
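The Simple and Systematic methods described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's implementation: the function names are hypothetical, the interval k is assumed to be roughly 1 / Required sample size, and the input is assumed to be already in the desired order (as if the order indicator were i). The `seed` argument plays the role of the Random seed attribute.

```python
import random

def simple_sample(records, sample_size, seed=None):
    """Keep a record when a uniform draw from [0.0, 1.0) is below sample_size."""
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    return [rec for rec in records if rng.random() < sample_size]

def systematic_sample(records, sample_size, seed=None):
    """Pick a random start, then take every k-th record."""
    rng = random.Random(seed)
    # Assumption: the interval k is about 1 / sample_size.
    step = max(1, round(1 / sample_size))
    start = rng.randrange(step)  # random start within the first interval
    return records[start::step]

data = list(range(1000))
first = simple_sample(data, 0.15, seed=42)
second = simple_sample(data, 0.15, seed=42)
print(first == second)                              # True: same seed, same sample
print(len(systematic_sample(data, 0.1, seed=42)))   # 100
```

The size of the Simple sample only approximates the requested fraction, because every record is decided by an independent draw. The Systematic sample size is fixed by the interval.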
Comparing the methods, Simple random sampling is the simplest and quickest one, and it suffices in most cases. Systematic sampling with no sorting order is as fast as Simple and also produces a strongly representative sample. Stratified sampling is the trickiest one. It is useful only if the data set can be split into separate groups of reasonable sizes; otherwise the sample is much bigger than requested. For a deeper insight into sampling methods in statistics, see Wikipedia.
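The remark that Stratified sampling can return far more records than requested follows from the rule that at least one record is taken from each stratum. The Python sketch below shows the effect. It is an illustration only: `stratified_sample` is a hypothetical helper, and the proportional allocation per stratum is an assumption, since the source does not state how the component sizes each stratum's share.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, sample_size, seed=None):
    """Sample each stratum independently, taking at least one record from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)  # group records by the key field
    sample = []
    for members in strata.values():
        # Assumption: proportional share per stratum, but never fewer than one.
        count = max(1, round(len(members) * sample_size))
        sample.extend(rng.sample(members, count))
    return sample

# 1,000 records split into 500 tiny strata of two records each.
tiny = [{"id": i, "group": i // 2} for i in range(1000)]
print(len(stratified_sample(tiny, "group", 0.05, seed=1)))  # 500, not the requested 50

# The same 1,000 records split into two strata of 500 records each.
large = [{"id": i, "group": i % 2} for i in range(1000)]
print(len(stratified_sample(large, "group", 0.05, seed=1)))  # 50
```

With many small strata, the one-record minimum dominates and the output grows to the number of strata. With a few strata of reasonable size, the output matches the requested fraction.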