Usage

Usage of the upsert scripts ordinarily requires following these steps:

  1. Prepare the data in staging tables that have the same structure as the corresponding base tables.
  2. Create a SQL script (a driver script) to be run with execsql, and use the INCLUDE metacommand to include the file containing the upsert scripts.
  3. Define execsql substitution variables to control the amount of automatic logging that is carried out during QA checks and data updates and additions. This step is optional; if the variables are not defined, no logging will be done.
  4. Run the staged_to_load script using execsql’s EXECUTE SCRIPT metacommand to initiate a control table.
  5. Run the load_staging script using execsql’s EXECUTE SCRIPT metacommand to carry out the complete upsert operation, including QA checks.
  6. Check the control table for a record of QA failures. This step may be unnecessary if the user will be monitoring status messages while the upsert script is running.

Other usage approaches are possible. For example:

  • Settings in the control table may be changed after the staged_to_load script is run and before the load_staging script is run.
  • If the control table is created by some other process, only the load_staging script need be called.
  • Scripts to perform QA checks and the upsert operation may be called separately by the driver script if additional QA checks need to be incorporated or an alternate order of operations is desired.

An Illustration

The following set of execsql metacommands illustrate the typical contents of the driver script, and include steps 2-6 above.

-- Include the upsert script file.
-- !x! include pg_upsert.sql

-- Define substitution variables to control logging.
-- !x! sub logfile     upsert_log.txt
-- !x! sub log_sql     No
-- !x! sub log_errors  Yes
-- !x! sub log_changes No

-- Initialize a control table named 'upsert_ctl'.
-- !x! execute script staged_to_load with (control_table=upsert_ctl, table_list="table_a, table_b, table_c")

-- Run all QA checks and merge the staged data if all checks passed.
-- !x! execute script load_staging with (base_schema=public, staging=staging, control_table=upsert_ctl, do_commit=Yes)

-- Check the control table for errors.  Display it and halt if there are any.
create temporary view upsert_errors as
select table_name, null_errors, fk_errors
from   upsert_ctl
where  null_errors is not null or pk_errors is not null or fk_errors is not null;
-- !x! if(hasrows(upsert_errors)) {halt message "Upsert errors:" display upsert_errors}

This illustration is written to run with a Postgres database. Differences in script arguments that are needed for MariaDB/MySQL are described below.

Substitution Variables

The substitution variables to control logging are:

logfile
The name of a text file in which status and outcome information will be recorded.
log_sql
A value of “Yes” or “No” that determines whether or not the SQL statements (UPDATE and INSERT statements) used to carry out the upsert operation should be recorded in the log file.
log_errors
A value of “Yes” or “No” to indicate whether or not the data values that violate foreign key constraints should be recorded in the log file.
log_changes
A value of “Yes” or “No” to indicate whether or not the updated and inserted data should be recorded in the log file.

Scripts and Script Arguments

The script staged_to_load initializes the control table. The arguments to this script are:

control_table
The name of the control table to create. This will be created as a temporary table in Postgres.
table_list
A comma-separated list of the names of the base tables that are to be updated from staging tables of equivalent names. If there is more than one table to be updated, this list should be quoted.

The script load_staging carries out all of the QA checks and performs the upsert operation. Its arguments are:

base_schema (Postgres and MS-SQL Server only)
For Postgres, this specifies the name of the database schema containing the base tables. This argument is not used with MariaDB/MySQL because it does not support schemas, and the word “schema” is used instead to identify the database.
staging (Postgres and MS-SQL Server) or stage_pfx (MariaDB/MySQL)
For Postgres and MS-SQL Server, the name of the schema containing the staging tables. For MariaDB/MySQL, the table name prefix that identifies staging tables.
control_table
The name of the control table, as created either by the staged_to_load script or directly by the user.

More details about the control table and staging tables are provided in the following sections.

The Control Table

The control table identifies all of the base tables on which the upsert operation is to be performed and includes additional specifications for the upsert operation, and also is used to record the results of the operation. The control table contains the following columns:

table_name
The name of the base table to have data added or updated. This column is initialized by the staged_to_load script.
exclude_cols
A comma-separated list of single-quoted names of columns in the base table that are not to be updated.
exclude_null_checks
A comma-separated list of single-quoted names of non-null columns in the base table for which null checks should not be performed. Ordinarily these should be columns that will be filled in by triggers or by the DBMS’s autonumbering feature.
display_changes
A value of “Yes” or “No” determining whether the load_staging script should display the rows to be added and updated for each table, and prompt the user to cancel or continue. This column is intialized to “Yes” by the staged_to_load script.
display_final
A value of “Yes” or “No” determining whether each entire table should be displayed after all rows have been updated and added, with a prompt to cancel or continue. This column is initialized to “No” by the staged_to_load script.
null_errors
A column that is populated by the load_staging script if the staging table is found to have any rows with null values in a non-nullable column of the base table. If such QA violations exist, this column will contain a list of all of the columns containing nulls, and the number of null rows will be shown in parentheses following the column name.
pk_errors
A column that is populated by the load_staging script if the staging table is found to have duplicate primary key values. If such QA violations exist, this column will contain will identify the number of primary key values that are duplicated, and the number of rows containing duplicates for each of those.
fk_errors
A column that is populated by the load_staging script if the staging table is found to have invalid foreign keys. If such errors exist, this column will contain a list of all of the foreign key constraint names having invalid values; the number of distinct invalid values will be shown in parentheses following the foreign key name. The number of rows having each invalid value is not included, but this information is written to the logfile and displayed interactively, if those options are enabled.
rows_updated
A column that is populated by the load_staging script with the number of rows that have had one or more values updated.
rows_inserted
A column that is populated by the load_staging script with the number of rows that have been added to the table.

Staging Tables

The data that are to be added to the database’s base tables must be prepared in staging tables before the upsert scripts are run. The staging tables must have the same structure as the target base tables, with similar or identical names.

When using the Postgres or SQL Server versions of these scripts, the staging tables must have the same names as the target base tables and must be in a separate schema from the base tables; the name of the staging and base schemas is specified when the upsert scripts are run.

When using MariaDB or MySQL, the staging tables must have the same names as the target base tables, but with a consistent prefix on the names that designates them as staging tables; the name of the prefix is specified when the upsert scripts are run.

The structure of each staging table must be consistent with that of the corresponding base table in the following respects:

  • Column names in the staging table must be identical to the names of the corresponding columns in the base table.
  • Column types in the staging table must be compatible with those in the base table, such that no explicit casts or other conversions are required when data are inserted or updated.
  • The staging table must include all primary key columns of the base table.
  • All columns having non-null constraints without default values in the base table must be present in the staging table.
  • All columns having a foreign key into another table must be present in the staging table.

A staging table may have fewer columns than the corresponding base table if the omitted columns may be null, do not form part of a primary key or foreign key, and do not contain any data to be added to the base table.

Creating a Driver Script

To use the upsert scripts, the SQL script file containing the upsert scripts must be included in another script file–the driver script–that contains the commands to run the upsert scripts. The upsert script file can either be physically copied into the driver script, or included by using an execsql INCLUDE metacommand. Use of the INCLUDE metacommand allows a single copy of the upsert script file to be maintained, and referenced from multiple driver scripts.

The driver script ordinarily will also include the SQL statements and execsql metacommands necessary to prepare data in the staging tables, including any QA checks other than those carried out by the upsert scripts.

Order of Operations

The load_staging script carries out operations in the following order:

  1. Checks every staging table for the presence of null values in non-null columns.
  2. Checks every staging table for duplicate primary key values.
  3. Checks every staging table for valid foreign keys.
  4. If no QA violations are found, carries out the upsert operation on all tables in hierarchical order.

The QA checks are carried out on all tables even if violations are found. This is so that all errors can be reported by a single run of the script. If a different order of operations is desired (for example, to terminate the process as soon as any table is found to have a QA violation), then the driver script must call the lower-level scripts to perform QA checks separately on individual tables.

Handling Auto-filled Columns

Some tables may have columns with non-null constraints that contain sequential values that are automatically populated by the DBMS or that are filled in by triggers. Despite having a non-null constraint, these columns in a base table should not be populated from the staging table, and these columns in the staging table should ordinarily not be checked for null values. There are two ways to handle these columns if the base table is not to be updated from the staging table:

  • Exclude these columns from the staging table and list their names in the exclude_null_checks column of the control table.
  • Include these columns in the staging table and list their names in both the exclude_cols and exclude_null_checks columns of the control table.

If these auto-filled columns make up the primary key of the table (they are identity columns), then the staging table must contain legitimate values because the upsert scripts compare the primary key columns of base and staging tables to determine which rows to update and which rows to insert.