EmailFilter

Commercial Component

We assume that you have already learned what is described in:

If you want to find the right Transformer for your purposes, see Transformers Comparison.

Short Summary

EmailFilter filters input records according to the specified condition.

Component | Same input metadata | Sorted inputs | Inputs | Outputs | Java | CTL
EmailFilter | - | no | 1 | 0-2 | - | -

Abstract

EmailFilter receives incoming records through its input port and checks the specified fields for valid e-mail addresses. Data records that are accepted as valid are sent out through the optional first output port, if it is connected. Specified fields from the rejected records can be sent out through the optional second output port, if it is connected to another component. Metadata on the optional second output port may also contain up to two additional fields with information about the error.

Ports

Port type | Number | Required | Description | Metadata
Input | 0 | yes | For input data records | Any
Output | 0 | no | For valid data records | Input 0 1)
Output | 1 | no | For rejected data records | Any 2)

Legend:

1): Metadata cannot be propagated through this component.

2): Metadata on output port 1 can contain any of the input data fields plus up to two additional fields. Fields whose names are the same as those in the input metadata are filled in with the input values of these fields.

Table 55.2. Error Fields for EmailFilter

Field number | Field name | Data type | Description
FieldA | the Error field attribute value | string | Error field
FieldB | the Status field attribute value | integer 1) | Status field

Legend:

1): The following error codes are most common:

EmailFilter Attributes

Attribute | Req | Description | Possible values
Basic
Field list | yes | List of selected input field names whose values should be verified as valid or invalid e-mail addresses. Expressed as a sequence of field names separated by colon, semicolon, or pipe. |
Level of inspection | | Various methods used for the e-mail address verification can be specified. Each level includes and extends its predecessor(s) on the left. See Level of Inspection for more information. | SYNTAX / DOMAIN (default) / SMTP / MAIL
Accept empty | | By default, even an empty field is accepted as a valid address. This can be switched off by setting the attribute to false. See Accept Conditions for more information. | true (default) / false
Error field | | Name of the output field to which the error message can be written (for rejected records only). |
Status field | | Name of the output field to which the error code can be written (for rejected records only). |
Multi delimiter | | Regular expression used to split an individual field value into multiple e-mail addresses. If empty, each field is treated as a single e-mail address. | [,;] (default) / other
Accept condition | | By default, a record is accepted if at least one field value is verified as a valid e-mail address. If set to STRICT, a record is accepted only if all field values from all fields of the Field list are valid. See Accept Conditions for more information. | LENIENT (default) / STRICT
Advanced
E-mail buffer size | | Maximum number of records that are read into memory, after which they are bulk processed. See Buffer and Cache Size for more information. | 2000 (default) / 1-N
E-mail cache size | | Maximum number of cached e-mail address verification results. See Buffer and Cache Size for more information. | 2000 (default) / 0 (caching is turned off) / 1-N
Domain cache size | | Maximum number of cached DNS query results. Ignored at the SYNTAX level. | 3000 (default) / 0 (caching is turned off) / 1-N
Domain retry timeout (ms) | | Timeout in milliseconds for each DNS query attempt. The maximum time in milliseconds spent on resolving therefore equals Domain retry timeout multiplied by Domain retry count. | 800 (default) / 1-N
Domain retry count | | Number of retries for failed DNS queries. | 2 (default) / 1-N
Domain query A records | | By default, in accordance with the SMTP standard, an A record is searched if no MX record can be found. If set to false, DNS queries are twice as fast, but the SMTP standard is broken. | true (default) / false
SMTP connect attempts (ms,...) | | Attempts for connection and HELO, expressed as a sequence of numbers separated by commas. The numbers are the delays between individual attempts to connect. | 1000,2000 (default)
SMTP anti-greylisting attempts (s,...) | | Anti-greylisting feature. Attempts and delays between individual attempts, expressed as a sequence of numbers separated by commas. If empty, anti-greylisting is turned off. See SMTP Grey-Listing Attempts for more information. | 30,120,240 (default)
SMTP retry timeout (s) | | TCP timeout in seconds after which an SMTP request fails. | 300 (default) / 1-N
SMTP concurrent limit | | Maximum number of parallel tasks when anti-greylisting is on. | 10 (default) / 1-N
Mail From | | The From field of a dummy message sent at the MAIL level. | CloudConnect <cloudconnect@cloudconnect.org> (default) / other
Mail Subject | | The Subject field of a dummy message sent at the MAIL level. | Hello, this is a test message (default) / other
Mail Body | | The Body of a dummy message sent at the MAIL level. | Hello,\nThis is CloudConnect text message.\n\nPlease ignore and don't respond. Thank you, have a nice day! (default) / other
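
To illustrate how the Field list and Multi delimiter attributes might be interpreted, here is a minimal Python sketch. It is not the component's implementation: the function names are hypothetical, and trimming whitespace around names and addresses is an assumption that the attribute descriptions do not state.

```python
import re
from typing import List


def parse_field_list(field_list: str) -> List[str]:
    """Split a Field list value on colon, semicolon, or pipe."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [name.strip() for name in re.split(r"[:;|]", field_list) if name.strip()]


def split_addresses(value: str, multi_delimiter: str = "[,;]") -> List[str]:
    """Split one field value into individual e-mail addresses."""
    # An empty Multi delimiter means the whole value is a single address.
    if not multi_delimiter:
        return [value]
    # Assumption: whitespace around each address is trimmed.
    return [part.strip() for part in re.split(multi_delimiter, value)]
```

For example, `parse_field_list("email1;email2|email3")` yields three field names, and `split_addresses("a@x.com,b@y.org")` yields two addresses with the default delimiter.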

Advanced Description

Buffer and Cache Size

Increasing E-mail buffer size avoids unnecessary repeated queries to the DNS and to SMTP servers, because more records are processed in a single batch. Increasing E-mail cache size may improve performance even further, since addresses stored in the cache can be verified instantly. However, both attributes require extra memory, so set them to the largest values your system can afford.
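
The effect of a bounded result cache can be sketched as follows. This is only an illustration: the class name is hypothetical, and the least-recently-used eviction policy is an assumption, since the text above does not say how the component evicts entries. A size of 0 turns caching off, as described for the cache size attributes.

```python
from collections import OrderedDict


class ResultCache:
    """Bounded cache of e-mail verification results (hypothetical sketch)."""

    def __init__(self, max_size=2000):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, address):
        """Return the cached result, or None if the address is not cached."""
        if address not in self._entries:
            return None
        # Mark the entry as most recently used.
        self._entries.move_to_end(address)
        return self._entries[address]

    def put(self, address, is_valid):
        """Store a result; a max_size of 0 means caching is turned off."""
        if self.max_size == 0:
            return
        self._entries[address] = is_valid
        self._entries.move_to_end(address)
        # Assumption: evict the least recently used entry when full.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)
```

A larger `max_size` trades memory for fewer repeated DNS and SMTP lookups, which is the trade-off described above.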

Accept Conditions

By default, even an empty field from the input data records specified in the Field list is considered to be a valid e-mail address, because the Accept empty attribute is set to true by default. If you want to be more strict, you can switch this attribute to false.

With Accept condition set to LENIENT (the default), at least one valid e-mail address is sufficient for the record to be accepted.

On the other hand, if Accept condition is set to STRICT, all e-mail addresses in the Field list must be valid (either including or excluding empty values, depending on the Accept empty attribute).

Thus, be careful when setting these two attributes: Accept empty and Accept condition. If there is an empty field among the fields specified in the Field list and all other non-empty values are verified as invalid addresses, the record is accepted if Accept condition is set to LENIENT and Accept empty is set to true. In reality, however, such a record does not contain any useful, valid e-mail address. It contains only an empty string, which is enough for the record to be accepted.
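
The interaction of the two attributes can be sketched in a few lines of Python. This is a simplified model, not the component's code: the function name and the input shape are hypothetical, and treating an empty value as valid exactly when Accept empty is true is an assumption based on the description above.

```python
def is_accepted(results, accept_empty=True, condition="LENIENT"):
    """Decide whether a record is accepted.

    results is a list of (value, is_valid_address) pairs, one per
    address checked in the record (hypothetical input shape).
    """
    verdicts = []
    for value, is_valid in results:
        if value == "":
            # Assumption: an empty value counts as valid only if Accept empty is true.
            verdicts.append(accept_empty)
        else:
            verdicts.append(is_valid)
    if condition == "STRICT":
        return all(verdicts)  # every value must pass
    return any(verdicts)      # LENIENT: one passing value is enough
```

With the defaults, a record holding one empty value and one invalid address is accepted, which is the pitfall described above. Setting `accept_empty=False` or `condition="STRICT"` rejects it.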

Level of Inspection

  1. SYNTAX

    At the first level of validation (SYNTAX), the syntax of e-mail expressions is checked. Both non-strict conditions and international characters (except in the TLD) are allowed.

  2. DOMAIN

    At the second level of validation (DOMAIN), which is the default, the DNS is queried for domain validity and mail exchange server information. The following four attributes can be set to optimize the ratio of performance to false-negative responses: Domain cache size, Domain retry timeout, Domain retry count, and Domain query A records. The number of queries sent to the DNS server is specified by the Domain retry count attribute. Its default value is 2. The time interval between individual queries is defined by Domain retry timeout in milliseconds. By default, it is 800 milliseconds. Thus, the total time during which the queries are being resolved equals Domain retry count x Domain retry timeout. The results of queries can be cached. The number of cached results is defined by Domain cache size. By default, 3000 results are cached. Setting this attribute to 0 turns caching off. You can also decide whether A records should be searched if no MX record is found (Domain query A records). By default, this attribute is set to true, so an A record is searched if no MX record is found. Setting the attribute to false makes the search twice as fast, although it breaks the SMTP standard.

  3. SMTP

    At the third level of validation (SMTP), attempts are made to connect to the SMTP server. You need to specify the number of attempts and the time intervals between individual attempts. This is defined using the SMTP connect attempts attribute, which is a sequence of integers separated by commas. Each number is the time (in milliseconds) between two attempts to connect to the server. Thus, the first number is the interval between the first and the second attempts, the second number is the interval between the second and the third attempts, etc. The default value means three attempts, with an interval of 1000 milliseconds between the first and the second attempts and 2000 milliseconds between the second and the third attempts.

    Additionally, at the SMTP and MAIL levels, EmailFilter can increase accuracy and eliminate false negatives caused by servers that use greylisting. Greylisting is a very common anti-spam technique based on denying delivery from unknown hosts. A host becomes known, and its mail is no longer denied, once it retries the delivery after a specified period of time, usually ranging from 1 to 5 minutes. Most spammers do not retry the delivery after an initial failure, for the sake of high performance. The anti-greylisting feature of EmailFilter retries each failed SMTP/MAIL test a specified number of times with specified delays. The address is considered invalid only after the last retry fails.

  4. MAIL

    At the fourth level (MAIL), if all previous checks have been successful, you can send a dummy message to the specified e-mail address. The message has the following properties: Mail From, Mail Subject, and Mail Body. By default, the message is sent from CloudConnect <cloudconnect@cloudconnect.org>, its subject is Hello, this is a test message, and its body is as follows: Hello,\nThis is CloudConnect test message.\n\nPlease ignore and don't respond. Thank you and have a nice day!
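
The way each level includes its predecessors, and the worst-case DNS resolution time at the DOMAIN level, can be sketched as follows. The function names are hypothetical and the code only restates the arithmetic given above. It does not perform any real DNS or SMTP operations.

```python
# Inspection levels in increasing order of thoroughness.
LEVELS = ["SYNTAX", "DOMAIN", "SMTP", "MAIL"]


def checks_for(level="DOMAIN"):
    """Return the checks performed at a level, including all lower levels."""
    return LEVELS[: LEVELS.index(level) + 1]


def max_dns_resolution_ms(retry_count=2, retry_timeout_ms=800):
    """Worst-case resolving time: Domain retry count x Domain retry timeout."""
    return retry_count * retry_timeout_ms
```

With the default attribute values, `checks_for()` covers SYNTAX and DOMAIN, and `max_dns_resolution_ms()` gives 2 x 800 = 1600 milliseconds.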

SMTP Grey-Listing Attempts

To turn on the anti-greylisting feature, specify the SMTP anti-greylisting attempts attribute. Its default value is 30,120,240. These numbers mean that four attempts can be made, with intervals of 30 seconds (between the first and the second), 120 seconds (between the second and the third), and 240 seconds (between the third and the fourth). You can replace the default values with any other comma-separated sequence of integers. The maximum number of parallel tasks performed when anti-greylisting is turned on is specified by the SMTP concurrent limit attribute. Its default value is 10.
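
As a rough illustration of how such a delay sequence translates into a schedule, the following Python sketch computes the time offset of each attempt relative to the first one. The function name is hypothetical. Treating an empty value as a single attempt with no retries is an assumption based on the statement that an empty attribute turns anti-greylisting off.

```python
from itertools import accumulate


def attempt_offsets(spec):
    """Return the start offset of each attempt, counted from the first attempt.

    spec is a comma-separated sequence of delays between consecutive attempts.
    """
    delays = [int(part) for part in spec.split(",") if part.strip()]
    # The first attempt starts at offset 0; each delay adds one more attempt.
    return [0] + list(accumulate(delays))
```

For the default value, `attempt_offsets("30,120,240")` describes four attempts, the last one starting 390 seconds after the first. The same arithmetic applies to the SMTP connect attempts value 1000,2000, where the unit is milliseconds.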