Minimize Network Joins

Your Data Warehouse instance is typically created within a shared or dedicated cluster of 3+ nodes running Vertica, and the data in your tables may be spread across the cluster nodes. Data Warehouse provides options for controlling how your data is distributed.

If you retain the system defaults, it is very likely that your table joins will join records sitting on different parts of the clusters, which affect performance. This section provide recommendations on how to minimize network joins.

The following picture includes two randomly segmented projections. To join the tables using these projections by cust_id, you must combine records from different nodes. For example, orders associated with customer #2 are split across all three nodes:

Solution #1: All joined records on the same node

To make the join operation more efficient, the records to be joined should be available on the same node, as in the following:

Segmentation can be defined in CREATE TABLE statements, like the following:

CREATE TABLE customers (
  cust_id INTEGER,
  name VARCHAR(255)
) SEGMENTED BY HASH(cust_id) ALL NODES;
CREATE TABLE orders (
  order_id INTEGER,
  cust_id, INTEGER,
  order_dt DATE,
  total DECIMAL(12,2)
) SEGMENTED BY HASH(cust_id) ALL NODES;

Solution #2: Replicate data across all nodes.

For smaller tables (up to a few hundred thousand records), network joins can be avoided by replicating all customer data across all nodes:

The corresponding SQL is the following:

CREATE TABLE customers (
  cust_id INTEGER,
  name VARCHAR(255)
) UNSEGMENTED ALL NODES;
CREATE TABLE orders (
  order_id INTEGER,
  cust_id, INTEGER,
  order_dt DATE,
  total DECIMAL(12,2)
);

Solution #3: Constrain data to a single node.

If all tables are small (up to a few million records), consider retaining all data on a single node only:

You can force this constraint by segmenting your projections by a constant that is unique to your implementation such as the Data Warehouse identifier identifier (the HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif string in the example below).

CREATE TABLE customers (
  cust_id INTEGER,
  name VARCHAR(255)
) SEGMENTED BY HASH('HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif') ALL NODES;
CREATE TABLE orders (
  order_id INTEGER,
  cust_id, INTEGER,
  order_dt DATE,
  total DECIMAL(12,2)
) SEGMENTED BY HASH('HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif') ALL NODES;