In a typical Data Warehouse instance, data can be distributed across three or more nodes of a shared or dedicated cluster. By default, Data Warehouse and the underlying Vertica database can manage automatically this distribution. However, in a high-performance database, developers may require better control over how data is distributed.
In a Vertica database, segmentation controls how data from a table can be distributed across nodes of a cluster. When a table or projection is created, you can specify segmentation parameters to define how data is distributed.
If segmentation is not specified explicitly, data is segmented by a hash of the projection columns; columns with fewer than 8 bytes are listed first, followed by larger columns up to the first 32 columns of the table.
If segmentation is not configured for your Data Warehouse instance, the default segmentation is likely to utilize broadcast joins, which can impact performance.
You can specify a custom segmentation using SEGMENTED BY or UNSEGMENTED clauses of the CREATE PROJECTION or CREATE TABLE commands.
The following segmentation options are available:
Hash segmentation Using SEGMENTED BY expression ALL NODES clause after CREATE PROJECTION or CREATE TABLE command The expression is expected to return an integer x for each row in the range: 0 <= x < 2^63^. Compute this expression using the HASH function on one or more columns.
Replication Using the UNSEGMENTED ALL NODES clause after CREATE PROJECTION or CREATE TABLE command This option is recommended for small tables with no more than a few million rows that are joined with large onesNOTE: Avoid using range segmentation, which Vertica supports. Range segmentation ties your data to explicitly named nodes on the actual underlying cluster.
For more information about tuning the segmentation of your physical design, see Minimize Network Joins.