Skip to main content

DataLoader

Definition

DataLoader can be used now by calling function load_data. It may be called on datanode only.

 load_data( <table-name>,
           <path-to-the-file>,
           <delimiter>,
           <buffer-size>,
           <on-null-behavior>,
           <null-string>,
           <action>)

 

Meaning of these parameters:

Parameter Name

Type

Description

<table-name>

text

a name of table that you want to push data to

<path-to-the-file>

text

a path to the file

<delimiter>

text

a delimiter used in the file

<buffer-size>

integer

a number of rows that will be pushed into GPU table during one iteration,

for example if we have 1.000.000 rows in the file and we set <buffer-size> to

200.000, there will be 5 iterations of pushing data into GPU table; it’s in order to

avoid allocating too much memory

<on-null-behavior>

text

an option that indicates what DataLoader should do when finds ‘null’ value; thereare 3 possibilites:

a. ‘error’ – throw an error and stop loading data,

b. ‘null’ – use database’s null’s mechanism – IT’S NOT SUPPORTED YET,

c. <some string> – a value that will be used to replace a null; the default option is ‘error’

<null-string>

text

a string that tells how the ‘null’ value is represented in the file, it may be, for example, an empty string, or ‘NULL’ string and so on; it’s an equivalent of ‘NULL’ option in standard COPY

<action>

text

an option that indicates what exactly should be done; there are 3 possibilites:

a. ‘append’ – it appends existing data in the table

b. ‘overwrite’ – it clears the table writes from the beginning

c. ‘validate’ – use this option to validate the data in the file if you get any errors while

loading; It doesn’t load data, just checks and prints first 10 errors so you can findĀ and fix them; the default value is ‘append’

As the result, you should see the message:

Succesfully copied <N> rows.

where <N> is the number of copied rows.

If you get any errors in the data set, you will also see the message (that may be combined with the previous message):

Met <N> invalid rows.

It means that you should probably validate your data.

Usage

Let us create the following table:

CREATE FOREIGN TABLE PART
(
    P_PARTKEY INTEGER NOT NULL,
    P_NAME VARCHAR(55) NOT NULL,
    P_MFGR VARCHAR(25) NOT NULL,
    P_BRAND VARCHAR(10) NOT NULL,
    P_TYPE VARCHAR(25) NOT NULL,
    P_SIZE INTEGER NOT NULL,
    P_CONTAINER VARCHAR(10) NOT NULL,
    P_RETAILPRICE REAL NOT NULL,
    P_COMMENT VARCHAR(23) NOT NULL,
    placeholder VARCHAR(1)
)
SERVER
     gm_fdw_server
OPTIONS
(
     max_size '10000'
)
DISTRIBUTE BY REPLICATION;

Now we can load the data using the following command:

EXECUTE DIRECT ON ( d0 )'
      SELECT load_data(
         ''part'', -- table name
         ''/home/postgres/part_0.tbl'', -- path to the file
         ''|'', -- used delimiter in the file
        100000, -- number of rows to be buffered
        ''0'', -- all met null values will be replaced with '0'
        '''', -- null-string - in this case all empty values will be treated as a null value
        ''append'' -- we want to append data in that table
)';

The uploaded file has 33406 rows. It’s more than the ‘max_size’ option, that’s why we get the message:

EXECUTE DIRECT ON ( d0 )'
SELECT load_data(''part'',''/home/postgres/part_0.tbl'',''|'',10000,''0'','''',''append'')';
load_data
---------------------------------
Succesfully copied 10000 rows.
(1 row)

Comments

DataLoader DOES NOT do any data distribution. So if you need some specific distribution you have to do it, for example, by using COPY TO & FROM ‘normal’ table.