We assume that you have already learned what is described in:
If you want to find the right Joiner for your purposes, see Joiners Comparison.
ApproximativeJoin merges sorted data from two data sources on a common matching key. Afterwards, it distributes records to the output based on a user-specified Conformity limit.
Component | Same input metadata | Sorted inputs | Slave inputs | Outputs | Output for drivers without slave | Output for slaves without driver | Joining based on equality |
---|---|---|---|---|---|---|---|
ApproximativeJoin | no | yes | 1 | 2-4 | yes | yes | yes |
ApproximativeJoin is a fuzzy joiner that is usually used in quite special situations. It requires the input be sorted and is very fast as it processes data in the memory. However, it should be avoided in case of large inputs as its memory requirements may be proportional to the size of the input.
The data attached to the first input port is called master as in the other Joiners. The second input port is called slave.
Unlike other joiners, this component uses two keys for joining. First of all, the records are matched in a standard way using Matching Key. Each pair of these matched records is then reviewed again and the conformity (similarity) of these two records is computed using Join key and a user-defined algorithm. The conformity level is then compared to Conformity limit and each record is sent either to the first (greater conformity) or to the second output port (smaller conformity). The rest of the records is sent to the third and fourth output port.
ApproximativeJoin receives data through two input ports, each of which may have a different metadata structure.
The conformity is then computed for matched data records. The records with greater conformity are sent to the first output port. Those with smaller conformity are sent to the second output port. The third output port can optionally be used to capture unmatched master records. The fourth output port can optionally be used to capture unmatched slave records.
Port type | Number | Required | Description | Metadata |
---|---|---|---|---|
Input | 0 | yes | Master input port | Any |
1 | yes | Slave input port | Any | |
Output | 0 | yes | Output port for the joined data with greater conformity | Any, optionally including additional fields:
_total_conformity_ and _keyName_conformity_ .
See Additional fields. |
1 | yes | Output port for the joined data with smaller conformity | Any, optionally including additional fields:
_total_conformity_ and _keyName_conformity_ .
See Additional fields. | |
2 | no | Optional output port for master data records without slave matches | Input 0 | |
3 | no | Optional output port for slave data records without master matches | Input 1 |
Attribute | Req | Description | Possible values |
---|---|---|---|
Basic | |||
Join key | yes | Key according to which the incoming data flows with the same value of Matching key are compared and distributed between the first and the second output port. Depending on the specified Conformity limit. See Join key. | |
Matching key | yes | This key serves to match master and slave records. | |
Transform | 1) | Transformation in CTL or Java defined in the graph for records with greater conformity. | |
Transform URL | 1) | External file defining the transformation in CTL or Java for records with greater conformity. | |
Transform class | 1) | External transformation class for records with greater conformity. | |
Transform for suspicious | 2) | Transformation in CTL or Java defined in the graph for records with smaller conformity. | |
Transform URL for suspicious | 2) | External file defining the transformation in CTL or Java for records with smaller conformity. | |
Transform class for suspicious | 2) | External transformation class for records with smaller conformity. | |
Conformity limit (0,1) | This attribute defines the limit of conformity for pairs of records. To the records with conformity higher than this value the transformation is applied, to those with conformity less than this value, the transformation for suspicious is applied. | 0.75 (default) | between 0 and 1 | |
Advanced | |||
Transform source charset | Encoding of external file defining the transformation. | ISO-8859-1 (default) | |
Deprecated | |||
Locale | Locale to be used when internationalization is used. | ||
Case sensitive | If set to true , upper and lower
cases of characters are considered different. By default, they
are processed as if they were equal to each other. | false (default) | true | |
Error actions | Definition of the action that should be performed when the specified transformation returns some Error code. See Return Values of Transformations. | ||
Error log | URL of the file to which error messages for specified Error actions should be written. If not set, they are written to Console. | ||
Slave override key | In older versions of
CloudConnect, slave part of
Join key. Join key
was defined as the sequence of individual expressions
consisting of master field names each of them was followed by
parentheses containing the 6 parameters mentioned below. These
individual expressions were separated by semicolon. The
Slave override key was a sequence of
slave counterparts of the master Join key
fields. Thus, in the case mentioned above, Slave
override key would be
fname;lname , whereas Join
key would be first_name(3 0.8 true false
false false);last_name(4 0.2 true false false
false) . | ||
Slave override matching key | In older versions of
CloudConnect, slave part of
Matching key. Matching
key was defined as a master field name. Slave
override matching key was its slave counterpart. Thus, in the
case mentioned above
($masterField=$slaveField ), Slave
override matching key would be this
slaveField only. And Matching
key would be this
masterField . |
Legend:
1) One of these must be set. These transformation attributes
must be specified. Any of these transformation attributes must use a
common CTL template for Joiners or implement a
RecordTransform
interface.
2) One of these must be set. These transformation attributes
must be specified. Any of these transformation attributes must use a
common CTL template for Joiners or implement a
RecordTransform
interface.
See CTL Scripting Specifics or Java Interfaces for more information.
See also Defining Transformations for detailed information about transformations.
You can define the Matching key using the Matching key wizard. You only need to select the desired master (driver) field in the Master key pane on the left and drag and drop it to the Master key pane on the right in the Master key tab. (You can also use the provided buttons.)
In the Slave key tab, you must select one of the slave fields in the Fields pane on the left and drag and drop it to the Slave key field column at the right from the Master key field column (containing the master field the Master key tab) in the Key mapping pane.
You have to define the limit of conformity (Conformity limit (0,1)). The defined value distributes incoming records according to their conformity. The conformity can be greater or smaller than the specified limit. You have to define transformations for either group. The records with smaller conformity are marked "suspicious" and sent to port 1, while records with higher conformity go to port 0 ("good match").
The conformity calculation is a challenge so let us try to explain at least in basic terms. First, groups of records are made based on Matching key. Afterwards, all records in a single group are compared to each other according to the Join Key specification. The strength of comparison selected in particular Join key fields determines what "penalty" characters get (for comparison strength, see Join key):
Identical - is a character-by-character comparison.
The penalty is given for each different character (similar to String.equals()
).
Tertiary - ingores differences in lower/upper case
(similar to String.equalsIgnoreCase()
), if it
is the only comparison strength activated. If activated together
with Identical, then a difference in
diacritic (e.g. 'c' vs. '
') is a full penalty and a difference in case
(e.g. 'a' vs. 'A') is half a penalty.
Secondary - a plain letter and its diacritic derivates
for the same language are considered equal. The language used
during comparison is taken from the metadata on the field. When no
metadata is set on the field, it is treated as en
and should work identically to Primary
(i.e. derivatives are treated as equal).
Example:
language=sk
: 'a', 'á', 'ä' are equal because they are all Slovak characters
language= sk
: 'a', '
' are different because '
' is a Polish (and not Slovak) character
Primary - all diacritic-derivates are considered equal regardless of language settings.
Example:
language=any: 'a', 'á', 'ä', ' ' are equal because they are all derivatives of 'a'
As a final step, the total conformity is calculated as a weighted average of field conformities.
You can define the Join key with the help of the Join key wizard. When you open the Join key wizard, you can see two tabs: Master key tab and Slave key tab.
In the Master key tab, you must select the driver (master) fields in the Fields pane on the left and drag and drop them to the Master key pane on the right. (You can also use the buttons.)
In the Slave key tab, you can see the Fields pane (containing all slave fields) on the left and the Key mapping pane on the right.
You must select some of these slave fields and drag and drop them to the Slave key field column at the right from the Master key field column (containing the master fields selected in the Master key tab in the first step). In addition to these two columns, there are other six columns that should be defined: Maximum changes, Weight and the last four representing strength of comparison.
The maximum changes property contains the integer number that is equal to the the number of letters that should be changed so as to convert one data value to another value. The maximum changes property serves to compute the conformity. The conformity between two strings is 0, if more letters must be changed so as to convert one string to the other.
The weight property defines the weight of the field in computing the similarity. Weight of each field difference is computed as the quotient of the weight defined by user and the sum of the weights defined by user.
The strength of comparison can be identical, tertiary, secondary or primary.
identical
Only identical letters are considered equal.
tertiary
Upper and lower case letters are considered equal.
secondary
Diacritic letters and their Latin equivalents are considered equal.
primary
Letters with additional features such as a peduncle, pink, ring and their Latin equivalents are considered equal.
In the wizard, you can change any boolean value of these
columns by simply clicking. This switches true
to false
, and vice versa. You can also change
any numeric value by clicking and typing the desired value.
When you click
, you will obtain a sequence of assignments of driver (master) fields and slave fields preceded by dollar sign and separated by semicolon. Each slave field is followed by parentheses containing six mentioned parameters separated by white spaces. The sequence will look like this:$driver_field1=$slave_field1(parameters);...;$driver_fieldN=$slave_fieldN(parameters)
Example 56.2. Join Key for ApproximativeJoin
$first_name=$fname(3 0.8 true false false
false);$last_name=$lname(4 0.2 true false false
false)
. In this Join key,
first_name
and last_name
are fields from the first (master) data flow and
fname
and lname
are fields
from the second (slave) data flow.
Metadata on the first and second output ports can contain
additional fields of numeric data type. Their names must be the
following: "_total_conformity_"
and some number
of "_keyName_conformity_"
fields. In the last
field names, you must use the field names of the Join
key attribute as the keyName
in
these additional field names. To these additional fields the
values of computed conformity (total or that for
keyName
) will be written.
When you define your join attributes you must specify a transformation that maps fields from input data sources to the output. This can be done using the Transformations tab of the Transform Editor. However, you may find that you are unable to specify more advanced transformations using this easist approach. This is when you need to use CTL scripting.
For detailed information about CloudConnect Transformation Language see Part XI, CTL - CloudConnect Transformation Language. (CTL is a full-fledged, yet simple language that allows you to perform almost any imaginable transformation.)
CTL scripting allows you to specify custom field mapping using the simple CTL scripting language.
All Joiners share the same transformation template which can be found in CTL Templates for Joiners.
If you define your transformation in Java, it must implement the following interface that is common for all Joiners: