ParallelReader

Commercial Component

We assume that you have already learned what is described in:

If you want to find the right Reader for your purposes, see Readers Comparison.

Short Summary

ParallelReader reads data from flat files using multiple threads.

Component: ParallelReader
Data source: flat file
Input ports: 0
Output ports: 1-2
Each to all outputs [ 1)]: no
Different to different outputs [ 2)]: no
Transformation: no
Transf. req.: no
Java: no
CTL: no

[ 1)] Component sends each data record to all connected output ports.

[ 2)] Component sends different data records to different output ports using return values of the transformation. See Return Values of Transformations for more information.

Abstract

ParallelReader reads delimited flat files (e.g., CSV, tab-delimited), fixed-length files, or mixed text files. Reading is performed in several parallel threads, which improves reading speed. The input file is divided into a set of chunks, and each reading thread parses only the records in its own part of the file. The component can read a single file as well as a collection of files placed on a local disk or remotely. Remote files are accessible via the FTP protocol.
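The chunked reading described above can be sketched in Python. This is an assumed approach, not the component's actual implementation: the file is split into byte ranges, and each range start is shifted forward to the next record delimiter so that every thread parses only whole records (a newline delimiter is assumed here for simplicity).

```python
# Sketch of chunked parallel reading: split a delimited file into N byte
# ranges aligned to record boundaries, then read each range in a thread.
# Hypothetical illustration only; function names are not the component's API.
import os
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(path, n_chunks, delimiter=b"\n"):
    """Compute (start, end) byte ranges, each ending on a record boundary."""
    size = os.path.getsize(path)
    bounds = []
    with open(path, "rb") as f:
        start = 0
        for i in range(1, n_chunks + 1):
            end = size if i == n_chunks else (size * i) // n_chunks
            if end < size:
                f.seek(end)
                f.readline()      # advance to the end of the current record
                end = f.tell()
            bounds.append((start, end))
            start = end
            if start >= size:
                break
    return bounds

def read_chunk(path, start, end):
    """Read one byte range and split it into records."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()

def parallel_read(path, n_threads=2):
    """Read all records of a file using n_threads reading threads."""
    bounds = chunk_bounds(path, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda b: read_chunk(path, *b), bounds)
    return [rec for part in parts for rec in part]
```

Note that with two or more threads the real component does not guarantee record order, which is why the Level of parallelism attribute documents exactly that trade-off.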

Depending on the component settings and the data structure, either the fast, simple parser (SimpleDataParser) or the robust one (CharByteDataParser) is used.

Parsed data records are sent to the first output port. The component also has an optional logging output port that provides detailed information about incorrect records. Only if Data Policy is set to Controlled and a proper Writer (Trash or CSVWriter) is connected to port 1 are all incorrect records, together with information about the incorrect value, its location, and the error message, sent out through this error port.

Icon

Ports

Port type | Number | Required | Description                | Metadata
Output    | 0      | yes      | for correct data records   | any [ 1)]
Output    | 1      | no       | for incorrect data records | specific structure, see the table below

[ 1)] Metadata on the output port can use Autofilling Functions.

Table 53.2. Error Metadata for Parallel Reader

Field number | Field content       | Data type | Description
0            | record number       | integer   | position of the erroneous record in the dataset (record numbering starts at 1)
1            | field number        | integer   | position of the erroneous field in the record (1 stands for the first field, i.e., the one at index 0)
2            | raw record          | string    | the erroneous record in raw form (including delimiters)
3            | error message       | string    | detailed information about the error
4            | first record offset | long      | the initial file offset of the parsing thread
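As an illustration of the structure above, one record arriving on the error port might carry values like the following. All values are hypothetical, and the dictionary keys are descriptive labels taken from the table, not the component's actual metadata field names.

```python
# Hypothetical example of one error-port record, mapped to the
# five metadata fields documented in the table above.
error_record = {
    "record_number": 3,            # integer: 3rd record in the dataset (1-based)
    "field_number": 2,             # integer: 2nd field of that record failed
    "raw_record": '3;"abc";X\n',   # string: the record as read, delimiters included
    "error_message": "cannot parse 'abc' as integer",  # string: error detail
    "first_record_offset": 0,      # long: file offset where the parsing thread started
}
```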

ParallelReader Attributes

Basic

File URL (required)
  Specifies which data source(s) will be read. See Supported File URL Formats for Readers.

Charset
  Encoding of the records being read.
  Values: ISO-8859-1 (default) | <other encodings>

Data policy
  Determines what should be done when an error occurs. See Data Policy for more information.
  Values: Strict (default) | Controlled | Lenient

Trim strings
  Specifies whether leading and trailing whitespace should be removed from strings before they are set to data fields; see Trimming Data. If true, the use of the robust parser is forced.
  Values: false (default) | true

Quoted strings
  Fields that contain a special character (comma, newline, or double quote) must be enclosed in quotes (only a single or double quote is accepted as the quote character). If true, such special characters inside a quoted string are not treated as delimiters, and the quotes are removed.
  Values: false (default) | true

Advanced

Skip leading blanks
  Specifies whether to skip leading whitespace (e.g., blanks) before setting input strings to data fields. If not explicitly set (i.e., left at the default), the value of the Trim strings attribute is used. See Trimming Data. If true, the use of the robust parser is enforced.
  Values: false (default) | true

Skip trailing blanks
  Specifies whether to skip trailing whitespace (e.g., blanks) before setting input strings to data fields. If not explicitly set (i.e., left at the default), the value of the Trim strings attribute is used. See Trimming Data. If true, the use of the robust parser is enforced.
  Values: false (default) | true

Max error count
  Maximum number of tolerated error records in the input file(s); applicable only if the Controlled Data Policy is set.
  Values: 0 (default) - N

Treat multiple delimiters as one
  If true, a sequence of repeated delimiter characters is interpreted as a single delimiter.
  Values: false (default) | true

Verbose
  By default, less comprehensive error notification is provided and performance is slightly higher. If set to true, more detailed information is provided at the cost of some performance.
  Values: false (default) | true

Level of parallelism
  Number of threads used to read input data files. The order of records is not preserved if this is 2 or higher. If the file is too small, this value is automatically switched to 1.
  Values: 2 (default) | 1-n

Distributed file segment reading
  When a graph runs in a CloudConnect Server environment, the component can process only the appropriate part of the file. The whole file is divided into segments by CloudConnect Server, and each cluster worker processes only its own part of the file. By default, this option is turned off.
  Values: false (default) | true

Parser
  By default, the most appropriate parser is applied. Alternatively, the parser used to process the data may be set explicitly. If an improper one is set, an exception is thrown and the graph fails. See Data Parsers.
  Values: auto (default) | <other>

Advanced Description

The Quoted strings attribute considerably changes the way your data is parsed. If it is set to true, all field delimiters inside quoted strings are ignored (after the first quote character has actually been read), and the quote characters are removed from the field.

Example input:

1;"lastname;firstname";gender

Output with Quoted strings == true:

{1}, {lastname;firstname}, {gender}

Output with Quoted strings == false:

{1}, {"lastname}, {firstname";gender}