DataSampler

Commercial Component

We suppose that you have already learned what is described in:

If you want to find the appropriate Transformer for your purpose, see Transformers Comparison.

Short Summary

DataSampler passes only some input records to the output. There is a range of filtering strategies you can select from to control the transformation.

Component	Same input metadata	Sorted inputs	Inputs	Outputs	Java	CTL
DataSampler	-	no	1	1-N	no	no

Abstract

DataSampler receives data on its single input edge. It then filters input records and passes only some of them to the output. You can control which input records are passed by selecting one of the filtering strategies called Sampling methods. The input and output metadata have to match each other.

Ports

Port type	Number	Required	Description	Metadata
Input	0	yes	For input data records	Any
Output	0	yes	For sampled data records	Input0

DataSampler Attributes

Attribute	Req	Description	Possible values
Basic
Sampling method	yes	The filtering strategy that determines which records will be passed to the output. The individual strategies you can choose from are described in Advanced Description.	Simple\|Systematic\|Stratified\|PPS
Required sample size	yes	The desired size of output data expressed as a fraction of the input. If you want the output to be e.g. 15% (roughly) of the input size, set this attribute to 0.15.	(0; 1)
Sampling key	1)	A field name the Sampling method uses to define strata. Field names can be chained in a sequence separated by a colon, semicolon or pipe. Every field can be followed by an order indicator in brackets (a for ascending, d for descending, i for ignore and r for automatic estimate).	e.g. Surname(a); FirstName(i); Salary(d)
Advanced
Random seed		A `long` number used to seed the random generator. It ensures that the results are random but remain identical on every graph run.	<0; N>

Legend:

1) The attribute is required in all sampling methods except for Simple.
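
The Sampling key format described in the table above can be illustrated with a small parsing sketch. It is not the component's own parser: the function name `parse_sampling_key` is hypothetical, and returning `None` for a field without an order indicator is an assumption, since the default is not stated here.

```python
import re

# Hypothetical helper, not part of DataSampler: splits a Sampling key string
# into (field, order) pairs. Separators are colon, semicolon or pipe.
_KEY_PART = re.compile(r"^\s*([^()\s]+)\s*(?:\(\s*([adir])\s*\))?\s*$")

def parse_sampling_key(key):
    parts = [p for p in re.split(r"[:;|]", key) if p.strip()]
    result = []
    for part in parts:
        match = _KEY_PART.match(part)
        if match is None:
            raise ValueError(f"Malformed key part: {part!r}")
        # Assumption: order is None when no indicator is given.
        result.append((match.group(1), match.group(2)))
    return result

print(parse_sampling_key("Surname(a); FirstName(i); Salary(d)"))
# [('Surname', 'a'), ('FirstName', 'i'), ('Salary', 'd')]
```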

Advanced Description

A typical use case for DataSampler looks like this. You want to check whether your data transformation works properly. If you are processing millions of records, it might be useful to take only a few thousand and inspect them. That is why you would use this component to create a data sample.

DataSampler offers four Sampling methods to create a representative sample of the whole data set:

Simple - every record has an equal chance of being selected. The filtering is based on a double value chosen (approx. uniformly) from the <0.0d; 1.0d) interval. A record is selected if the drawn number is lower than Required sample size.
Systematic - has a random start. It then proceeds by selecting every k-th element of the ordered list. The first element and the interval are derived from Required sample size. For the results to be representative, the method depends on the data set being arranged in the sort order given by Sampling key. There are also cases in which you might need to sample an unsorted input. Even though you always have to specify Sampling key, you can suppress its sort order by setting the order indicator to i for "ignore". The data set's sort order is then disregarded. Example key setting: "InvoiceNumber(i)".
Stratified - if the data set contains a number of distinct categories, the set can be organised by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements are selected on a random basis. At least one record from each stratum is selected.
PPS (Probability Proportional to Size Sampling) - the selection probability of each record is set proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the field you have chosen in Sampling key. The method then uses Systematic sampling for each group of records.
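
The Simple and Systematic methods can be sketched in a few lines of Python. This is only an illustration, not the component's implementation: the function names are hypothetical, Python's `random.Random` stands in for the component's generator, and the rule for deriving the interval k (here `round(1 / fraction)`) is an assumption, because the text only says that the interval is derived from Required sample size.

```python
import random

def simple_sample(records, fraction, seed=None):
    """Keep a record when a draw from [0.0, 1.0) is lower than fraction."""
    rng = random.Random(seed)  # a fixed seed gives identical results on every run
    return [r for r in records if rng.random() < fraction]

def systematic_sample(records, fraction, seed=None):
    """Pick a random start, then take every k-th record."""
    rng = random.Random(seed)
    # Assumption: interval k = round(1 / fraction), start drawn from [0, k).
    k = max(1, round(1 / fraction))
    start = rng.randrange(k)
    return records[start::k]

data = list(range(1000))
first = simple_sample(data, 0.15, seed=42)
second = simple_sample(data, 0.15, seed=42)
print(first == second)  # True: same seed, same sample
print(len(systematic_sample(data, 0.15, seed=42)))  # 142 or 143, depending on the start
```

Note that the Simple method only approximates the requested size, while the Systematic sketch returns an almost fixed number of records for a given input length.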

Comparing the methods, Simple random sampling is the simplest and quickest one. It suffices in most cases. Systematic sampling with no sort order is as fast as Simple and also produces a highly representative data sample. Stratified sampling is the trickiest one. It is useful only if the data set can be split into separate groups of reasonable sizes. Otherwise, the data sample is much bigger than requested. For a deeper insight into sampling methods in statistics, see Wikipedia.
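
The warning about Stratified sampling follows from the rule that at least one record is selected from each stratum. The sketch below shows the effect under stated assumptions: the function name `stratified_sample` is hypothetical, and the per-stratum size `max(1, round(fraction * size))` is an assumed rule, not the component's documented one.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Sample each stratum independently, keeping at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Assumption: take round(fraction * size) records, but never fewer than one.
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample

data = list(range(1000))
# 4 strata of 250 records each: the sample matches the requested 10 %.
print(len(stratified_sample(data, lambda r: r % 4, 0.1, seed=1)))    # 100
# 500 strata of 2 records each: one record per stratum yields 50 %.
print(len(stratified_sample(data, lambda r: r % 500, 0.1, seed=1)))  # 500
```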
