Project Description
Imagine what human society would be like without washrooms. In the era of information proliferation, data likewise needs its waste localized and its disposal controlled.
The DataWashroom provides the filter core of a data-purification system, covering the common functions of data isolation and waste separation through a simple instruction set (7 instructions). The instruction engine handles pre-compiling, building, publishing (deploying), version control and documentation generation. (It compiles rather than interprets.) The compiler's input is metadata in wash-instruction tables; its output is a stored procedure.

For more on the intentions behind this project, see also the [Data Quality] section on the related project's wiki page: http://view.codeplex.com

Instruction Set
There are 7 built-in wash instructions, shown as follows:

Instruction Set

  • The basic manipulation instructions mainly perform data isolation from source systems;
  • The relationship clean-up instructions perform the waste separation.

Clean-up is not just about deleting no-longer-needed historical data or garbage data. The clean-up here particularly emphasizes the treatment of bad data. In order to generalize a common solution from countless symptomatic treatments of bad data, the design of the built-in wash instructions above is based on the following conventions:

  • When duplicate keys are found, none of them should be deleted physically. Instead, additional tags take part in the discriminator to uniquely identify a business key. For example,

    Duplicate key discriminator

The duplicate key treatment is subdivided into two alternative methods:
 - Check Unique Key (CHK_UK)
   Only adopt qualified rows, exclude all duplicate rows.
 - Sort Duplicate Key (RNK_DK)
   Assign a sequential number to every row within each partition of the supposed unique key by a specified ordering rule, so that the first (top 1) row of each supposed unique key can be picked out directly.

  • When keys are missing, all supposed foreign keys should be made up in the primary table or bridge table if they were missing, to avoid missing relationships, so that all subsequent users can consume the data straightforwardly and much more efficiently. For example,

    Make up missing keys

Wash Cycle
All 7 wash instructions inherit from the base WASH_ISTR. They are organized in sequence under a wash cycle.

Washing Cycle Diagram

WASH_CYCLE
A wash cycle is a sequential workflow of wash instructions.
CYCLE_ID: The unique mnemonic identifier for a wash cycle; consider a naming convention within the enterprise (like a short namespace).
PROCEDURE_NAME: Defines the name of the stored procedure to be generated for the wash cycle, just like the executable file name of an application. The schema name must be included in the stored procedure name (e.g. ABC.PRETREATMENT).
DESCRIPTION_: An arbitrary introduction to the wash cycle; it can be as brief as an application name.
OWNER_: The owner of the wash cycle.
LATEST_VERSION: A maintenance-free column initialized to 0. It is used internally by version control.

Wash Instructions

WASH_ISTR
This is the base table of all 7 wash-instruction tables. Each row of this table is a wash instruction declaration. For a top-down design, this table can also be used to store the requirements or outline design of every wash step of a wash cycle. To the compiler, this base table is treated like a header file in C++.
ISTR_ID: Defines the unique mnemonic identifier for a wash instruction; consider a naming convention within the enterprise.
CYCLE_ID: The wash cycle that the instruction belongs to.
ISTR_TYPE: The type of instruction, one of: DELETE, COPY, MERGE, CHK_UK, RNK_DK, MUP_MK, MUP_NA. It indicates which derived instruction table holds the definition of the instruction.
ISTR_ORDER: The step ordinal of the instruction within its wash cycle.
DESCRIPTION_: A brief description of what this step does. This info is optional, but it is useful for generating per-step progress status and for documentation generation.

7 derived wash instruction tables:

  • The main practice of the following 3 basic-manipulation instructions is to refresh data from source systems into the isolation system. The refresh may include basic corrections, tagging, null-reference transformations and the like (rules based on columns of the same row or on constants), as well as pivot transformations, all expressed by a view on the source data. The refresh can be a full, partial or incremental refresh.
WASH_ISTR_DELETE
Removes rows from a table. For instance, to refresh data from a source system, a DELETE step may need to run before a COPY.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the table from which the rows are to be deleted. Schema name is always required in the table name. (E.g. ABC.WORK_TABLE)
DST_FILTER: Specifies the conditions used to limit the number of rows that are deleted (E.g. ID_TYPE = 'ISIN'). If this column is NULL, DELETE removes all the rows from the table.
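To illustrate how an instruction is declared, the sketch below fills hypothetical metadata rows for a DELETE step. The cycle ID, instruction ID and the VPI owner schema of the metadata tables are assumptions for this example, not part of the product:

```sql
-- Declare the step in the base instruction table (hypothetical IDs).
INSERT INTO VPI.WASH_ISTR (ISTR_ID, CYCLE_ID, ISTR_TYPE, ISTR_ORDER, DESCRIPTION_)
VALUES ('DPT.CLEAR_WORK', 'DPT.WASH_A', 'DELETE', 1, 'Clear the work table before refresh');

-- Define the step details in the derived DELETE instruction table.
INSERT INTO VPI.WASH_ISTR_DELETE (ISTR_ID, DST_TABLE, DST_FILTER)
VALUES ('DPT.CLEAR_WORK', 'ABC.WORK_TABLE', 'ID_TYPE = ''ISIN''');
```

The triggers behind the instruction tables validate these rows on input.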
 
WASH_ISTR_COPY
Copies all matching columns (by column name) of data from a source table or view to a destination table.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: Specifies the source table or view to be copied from. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be copied (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source will be copied.
DST_TABLE: Specifies the destination table to which the rows are to be copied. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
 
WASH_ISTR_MERGE
Merges specified matching columns (by column name) of data from a source table or view to a target table - Updating specified columns in a target table if a matching row exists, or inserting the data as a new row if a matching row does not exist.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: Specifies the source table or view to be merged from. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be merged (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source will be consumed as the source data.
DST_TABLE: Specifies the destination table to which the rows are to be copied. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
JOIN_COLUMNS: Specifies the list of columns on which source table/view is joined with the target table to determine where they match. Multiple columns must be delimited by commas. (E.g. POS_ID, GRP_ID, DATE_) The column names must exist in both source table/view and destination table.
UPDATE_COLUMNS: Specifies a comma-separated list of columns to be updated in the destination table by matching rows from the source table/view (matched by JOIN_COLUMNS). A column that is referenced in the JOIN_COLUMNS list cannot be included in the UPDATE_COLUMNS list.
INSERT_COLUMNS: Specifies a comma-separated list of columns which will be copied from the source table/view to the destination table when a matching row does not exist. A column in the JOIN_COLUMNS list can also be included in the INSERT_COLUMNS list.
UPDATE_COLUMNS and INSERT_COLUMNS are optional, but at least one of them must be specified; both may be specified.
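As a rough sketch of what a MERGE instruction might compile to, using the table, view and column names from the examples above (the actual generated code may differ):

```sql
-- Assuming JOIN_COLUMNS = "POS_ID, DATE_", UPDATE_COLUMNS = "PRICE",
-- INSERT_COLUMNS = "POS_ID, DATE_, PRICE", SRC_FILTER = "ID_TYPE = 'ISIN'".
MERGE INTO ABC.TABLE_DST D
USING (SELECT * FROM ABC.VIEW_SRC WHERE ID_TYPE = 'ISIN') S  -- SRC_VIEW + SRC_FILTER
ON (D.POS_ID = S.POS_ID AND D.DATE_ = S.DATE_)               -- JOIN_COLUMNS
WHEN MATCHED THEN
    UPDATE SET D.PRICE = S.PRICE                             -- UPDATE_COLUMNS
WHEN NOT MATCHED THEN
    INSERT (D.POS_ID, D.DATE_, D.PRICE)                      -- INSERT_COLUMNS
    VALUES (S.POS_ID, S.DATE_, S.PRICE);
```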
  • The following 2 duplicate-key-treatment instructions are aimed mostly at the situation where the source system does not define proper primary keys or unique constraints; some biased designs don't even allow business-key constraints to sit in the table.
WASH_ISTR_CHK_UK
Check Unique Key – Checks uniqueness by a supposed business key, and tags each row one way if it is unique or another way if it is duplicate.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the target table to be checked. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_FILTER: Specifies the search conditions used to limit the number of rows to be checked (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the table will be checked.
KEY_COLUMNS: Specifies a column or a list of columns which is supposed to be a unique key. A composite key (two or more columns) must be delimited by commas (E.g. DATE_, POS_ID).
SET_EXPR_IF_UNIQUE: Specifies a comma-separated list of clause(s) {column_name = expression}[ ,...n ] for tagging a row if its supposed key is unique. For example: IS_QUALIFIED = 'Y', REDIRECT_CODE = ORIG_CODE.
SET_EXPR_IF_DUPLICATE: Specifies a comma-separated list of clause(s) {column_name = expression}[ ,...n ] for tagging a row if its supposed key is duplicate. For example: IS_QUALIFIED = 'N', REDIRECT_CODE = 'N/A'.
At least one of SET_EXPR_IF_UNIQUE and SET_EXPR_IF_DUPLICATE must be specified; normally both should be specified.
 
WASH_ISTR_RNK_DK
Rank Duplicate Key – Checks uniqueness by a supposed business key, ranks every row within its partition of the supposed key, and assigns a sequential number to every row, starting at 1 for the first row in each partition.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the target table to be checked. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_FILTER: Specifies the search conditions used to limit the number of rows to be checked (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the table will be checked.
KEY_COLUMNS: Specifies a column or a list of columns which is supposed to be a unique key. A composite key (two or more columns) must be delimited by commas (E.g. DATE_, POS_ID).
ORDER_BY: The ORDER_BY clause determines the sequence in which the rows are assigned their unique ROW_NUMBER within a specified partition (E.g. TRAN_NO DESC, PRICE). It is required.
RN_COLUMN: Specifies the column for storing the ROW_NUMBER. It is required; the column type must be NUMBER or a compatible type.
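In plain Oracle SQL, the effect of an RNK_DK instruction is roughly the following; the table and column names are hypothetical and the real generated DML may differ:

```sql
-- Assuming KEY_COLUMNS = "DATE_, POS_ID", ORDER_BY = "TRAN_NO DESC, PRICE",
-- RN_COLUMN = "RN" on a hypothetical table ABC.TABLE_DST.
MERGE INTO ABC.TABLE_DST T
USING (SELECT ROWID AS RID,
              ROW_NUMBER() OVER (PARTITION BY DATE_, POS_ID
                                 ORDER BY TRAN_NO DESC, PRICE) AS RN
       FROM ABC.TABLE_DST) S
ON (T.ROWID = S.RID)
WHEN MATCHED THEN
    UPDATE SET T.RN = S.RN;

-- Consumers can then pick the surviving row of each supposed key directly:
-- SELECT ... FROM ABC.TABLE_DST WHERE RN = 1;
```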
  • The direct purpose of the following 2 missing-key-treatment instructions is to avoid LEFT/RIGHT OUTER JOINs in most cases. (A common cause of this situation is that the source table does not define proper foreign-key constraints, non-null constraints, etc.)
WASH_ISTR_MUP_MK
Make up Missing Keys – the compensation inserts unique rows that are contained in the source table/view (SELECT DISTINCT the supposed foreign key plus accompanying values) but were originally missing in the target table (by supposed primary key).
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: The source table/view which references the universal set of supposed foreign key. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be matched (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source table/view will be taken as the universal set.
SRC_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is supposed to be a foreign key and will be used to join with DST_KEY_COLUMNS (supposed primary key of target table).
DST_TABLE: The target table to be checked and to be made up. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is supposed to be a primary key of target table and can be used to join with SRC_KEY_COLUMNS of SRC_VIEW.
DST_VAL_COLUMNS: (Optional) When inserting a new compensating row into the target table, some non-key columns may need to be assigned special values (such as 'Dummy', 'Unknown', 'N/A', -1, '1900-01-01'…). DST_VAL_COLUMNS specifies a column or a comma-separated list of columns (E.g. GRP_ID, ROW_SRC) to be assigned. A column listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS.
SRC_VALUES: (Optional) If DST_VAL_COLUMNS is specified, then SRC_VALUES is required. SRC_VALUES specifies a value expression or a comma-separated list of value expressions which will be loaded into DST_VAL_COLUMNS while making up. An expression can be a constant (E.g. -1, 'Made-up') or an expression referencing SRC_VIEW (E.g. -1, LAST_UPD). Naturally, only unique rows (by key columns) will be inserted into the target table if all expressions contain only constants. Otherwise, SRC_VIEW is responsible for ensuring the uniqueness of the result set whenever an expression references columns of SRC_VIEW.
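The compensation logic of MUP_MK can be pictured as the following sketch; the names are taken from the column examples above, and the actual generated code may differ:

```sql
-- Assuming SRC_KEY_COLUMNS = DST_KEY_COLUMNS = "REF_ID",
-- DST_VAL_COLUMNS = "ROW_SRC", SRC_VALUES = "'Made-up'".
INSERT INTO ABC.TABLE_DST (REF_ID, ROW_SRC)
SELECT DISTINCT S.REF_ID, 'Made-up'   -- constant SRC_VALUES keep the rows unique
FROM   ABC.VIEW_SRC S
WHERE  NOT EXISTS
       (SELECT 1 FROM ABC.TABLE_DST D WHERE D.REF_ID = S.REF_ID);
```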
 
WASH_ISTR_MUP_NA
Make up an N/A key – inserts a special primary key into the target table as a reserved row for substituting N/A cases (such as null references, exception replacement …) if it did not exist before.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: The target table to be made up. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is the primary key of the target table. (E.g. REF_ID)
DST_VAL_COLUMNS: (Optional) When inserting the reserved row into the target table, some non-key columns may need to be assigned special attributes (such as 'Dummy', 'Unknown', 'N/A', -1, '1900-01-01'…). DST_VAL_COLUMNS specifies a column or a comma-separated list of columns (E.g. GRP_ID, ROW_SRC) to be assigned. A column listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS.
MAKE_UP_KEYS: Introduces a constant or a comma-separated list of constants for the primary-key columns to be inserted if the key does not exist (E.g. -1). There must be one data value for each column in the DST_KEY_COLUMNS list, in the same order. If the same key already exists, nothing will be updated on that row.
MAKE_UP_VALUES: (Optional) If DST_VAL_COLUMNS is specified, then MAKE_UP_VALUES is required. MAKE_UP_VALUES introduces a constant or a comma-separated list of constants which will be assigned to the columns of DST_VAL_COLUMNS. The values in MAKE_UP_VALUES must be in the same order as the columns in the DST_VAL_COLUMNS list. (E.g. -1, 'N/A')
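The effect of a MUP_NA instruction can be sketched as a conditional insert, using the names from the examples above (the actual generated code may differ):

```sql
-- Assuming DST_KEY_COLUMNS = "REF_ID", MAKE_UP_KEYS = "-1",
-- DST_VAL_COLUMNS = "GRP_ID, ROW_SRC", MAKE_UP_VALUES = "-1, 'N/A'".
MERGE INTO ABC.TABLE_DST D
USING (SELECT -1 AS REF_ID FROM DUAL) S
ON (D.REF_ID = S.REF_ID)
WHEN NOT MATCHED THEN
    INSERT (D.REF_ID, D.GRP_ID, D.ROW_SRC)
    VALUES (-1, -1, 'N/A');
-- No WHEN MATCHED clause: if the key already exists, nothing is updated.
```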

The above covers all the metadata that may need to be filled into the instruction tables. Triggers behind these tables check the validity of the input metadata.

Build and Deploy

  • Development Phase
WASH_DEPLOY.BUILD_AND_PUBLISH
The combined function for building and publishing a wash cycle.
Parameters (default in parentheses):
- inCycle_ID: The wash cycle to be built and deployed.
- inVer_Comment (''): Comment for this version.
- inGen_Progress ('Y'): 'Y' generates progress status for every step; 'N' does not generate the progress-status code.
- inStep_Commit ('N'): 'Y' commits after every step (consider for mass data); 'N' commits only once at the end of the whole wash cycle.
Return Value: The new version number of this publish. (Every publish generates a new version.)
Output: Along with the stored procedure of the wash cycle being installed, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).
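A typical call might look like the block below; the cycle ID 'DPT.WASH_A' and the VPI owner schema are assumptions for illustration:

```sql
DECLARE
    tVersion PLS_INTEGER;
BEGIN
    -- Build and publish the wash cycle, generating progress-status code
    -- and committing once at the end of the whole cycle.
    tVersion := VPI.WASH_DEPLOY.BUILD_AND_PUBLISH('DPT.WASH_A', 'nightly rebuild', 'Y', 'N');
    DBMS_OUTPUT.PUT_LINE('Published version: ' || tVersion);
END;
/
```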

In many instances, a developer may need to run individual steps for trial/debug convenience. The following diagram shows the internal procedures of the whole build-and-deployment management.
(Labels in gray indicate the corresponding name of table, view, stored procedure or function)

Build and Deploy

WASH_ENGINE.PRECOMPILE
The procedure precompiles every wash-instruction step under a wash cycle into DML SQL.
Parameters:
- inCycle_ID: The wash cycle to be precompiled.
Output: All DML SQL statements are planned in the table WASH_DML_PLAN.

VIEW_WASH_CYCLE_CODE can be used to preview the complete code of the stored procedure that will be generated for the wash cycle after precompilation.
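For step-by-step debugging, the precompile and preview can be run separately, roughly as below; the cycle ID and the VPI owner schema are assumptions for illustration:

```sql
BEGIN
    VPI.WASH_ENGINE.PRECOMPILE('DPT.WASH_A');   -- plan every step into WASH_DML_PLAN
END;
/

-- Inspect the planned DML and preview the stored procedure to be generated.
SELECT * FROM VPI.WASH_DML_PLAN;
SELECT * FROM VPI.VIEW_WASH_CYCLE_CODE;
```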

WASH_DEPLOY.PUBLISH_PROCEDURE
The function publishes a precompiled wash cycle as a stored procedure.
Parameters (default in parentheses):
- inCycle_ID: The wash cycle to be published (deployed).
- inVer_Comment (''): Comment for this version.
- inGen_Progress ('Y'): 'Y' generates progress status for every step; 'N' does not generate the progress-status code.
- inStep_Commit ('N'): 'Y' commits after every step (consider for mass data); 'N' commits only once at the end of the whole wash cycle.
Return Value: The new version number of this publish. (Every publish generates a new version.)
Output: Along with the stored procedure of the wash cycle being installed, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).
  • Deployment Phase

As shown in the diagram above, there are two ways of production deployment:

- Method 1 (recommended)
Distribute a manifest of metadata to production if the production database has the same DataWashroom environment set up as the deployment manager.

- Method 2 (left side)
Distribute the final stored procedure to production if the production database is not allowed to host the DataWashroom for any reason, or if you don't want people to think you were too lazy to write that boring code. No one else will care whether your code was written at your desk or generated in the washroom.

Utilities by Product

All database packages of this project are organized in the package dependency graph below.

Packages Dependence

The bottom 3 blue packages PROGRESS_TRACK, DEPLOY_UTILITY and CACHE_UTILITY were designed to be used independently.

  • PROGRESS_TRACK
    This is a generalized redesign (from the parent project http://view.codeplex.com) of the Progress Status Tracker, which provides 4 public interfaces:
- PROGRESS_TRACK.REGISTER
- PROGRESS_TRACK.GO_STEP
- PROGRESS_TRACK.ON_ERROR
- PROGRESS_TRACK.POLLING

A common usage is like the sample below (Oracle version):
    ...
    tProgress_ID    PLS_INTEGER;
BEGIN
    ...
    tProgress_ID := VPI.PROGRESS_TRACK.REGISTER(tTotal_Steps);
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 1 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 2 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 3 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information n ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Done.');
EXCEPTION
WHEN OTHERS THEN
    VPI.PROGRESS_TRACK.ON_ERROR(tProgress_ID, SQLERRM);
    ...
END;
Refer to the generated code of a wash stored procedure (with the inGen_Progress option turned on) for more information about binding it to a business flow.
The view VIEW_PROGRESS_LOG or VIEW_WASH_LAST_PROGRESS_LOG can then be used to monitor the ongoing progress in the background.
Alternatively, a UI can use the procedure PROGRESS_TRACK.POLLING to display a progress bar.
  • DEPLOY_UTILITY
DEPLOY_UTILITY.EXPORT_INSERT_SQL
This function exports the rows of a table as a script of INSERT SQL statements. It works like the [Export Query Results... to SQL file] toolbar button of a PL/SQL IDE, but is provided as an API and handles the reserved characters [&] and ['].
Parameters (default in parentheses):
- inTable_Name: The table whose rows are to be exported.
- inSrc_Filter (NULL): (Optional) Specifies the search conditions used to limit the rows to be exported (E.g. ID_TYPE = 'ISIN'). If NULL, all rows of the table will be exported.
- inOrder_By (NULL): (Optional) Specifies the order of the rows generated in the INSERT script. (E.g. TRAN_NO DESC, PRICE)
Return Value: The complete script of INSERT SQL statements.
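A sketch of calling the export function; the CLOB return type and the VPI owner schema are assumptions for illustration:

```sql
DECLARE
    tScript CLOB;   -- return type assumed for this sketch
BEGIN
    tScript := VPI.DEPLOY_UTILITY.EXPORT_INSERT_SQL(
                   'ABC.TABLE_DST',         -- inTable_Name
                   'ID_TYPE = ''ISIN''',    -- inSrc_Filter (optional)
                   'DATE_');                -- inOrder_By (optional)
    DBMS_OUTPUT.PUT_LINE(tScript);
END;
/
```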
  • CACHE_UTILITY

CACHE_CONTROL is the central control table which defines a unique identifier and expiration for each cache.
Cache Periods
 
The CACHE_UTILITY package provides a global cache control similar to reader-writer locks (refer to MSDN).
The refresh lock differs from a write lock:
- It only operates in the critical period while the cache is expiring, as shown above.
- Only one refresher can obtain the token for exclusive access to refresh; other blocked refreshers will not refresh again when they enter the lock, as seen in the circuit diagram below.
 
Read-Write Lock

  • CACHE_UTILITY.ENTER_REFRESH_LOCK
  • CACHE_UTILITY.EXIT_REFRESH_LOCK
  • CACHE_UTILITY.ENTER_READ_LOCK
  • CACHE_UTILITY.EXIT_READ_LOCK

It is not recommended to use these APIs directly at the application level.
The next section [Singleton] explains the risk of deadlocks and the corresponding precautions in a higher-level encapsulation.

Extensibility and Applicability

Refer to the source code of the 7 built-in wash instructions; extended instructions can also be customized when necessary.

The DataWashroom was originally designed for data pretreatment, not for entire processing pipelines: it simplifies the designs of subsequent processing systems by ensuring data quality in the central pretreatment stage.
The generated stored procedure therefore has the following characteristics:

  • Parameterless
    It's sufficient for most centralized pretreatment applications.
    If you really need to pass some parameters to the generated stored procedure, the technique of implementing a parameterized view can be used for reference (Oracle solution):
    1. Define all parameters as member variables in a database package;
      encapsulate each of them with a get accessor (public function) and an optional set accessor, just like a property of a class;
    2. Create a view to expose these member variables;
    3. Set values for these member variables before executing the generated stored procedure.
      These package-based variables act as globals within a single session (connection) scope.
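The three steps above can be sketched as follows; the package ABC.WASH_PARAMS, its AS_OF_DATE parameter and the view name are hypothetical, only for illustration:

```sql
-- Step 1: package member variable with get/set accessors.
CREATE OR REPLACE PACKAGE ABC.WASH_PARAMS AS
    PROCEDURE SET_AS_OF_DATE(inDate DATE);   -- set accessor
    FUNCTION  GET_AS_OF_DATE RETURN DATE;    -- get accessor
END WASH_PARAMS;
/
CREATE OR REPLACE PACKAGE BODY ABC.WASH_PARAMS AS
    gAs_Of_Date DATE := TRUNC(SYSDATE);      -- member variable, session scoped
    PROCEDURE SET_AS_OF_DATE(inDate DATE) IS BEGIN gAs_Of_Date := inDate; END;
    FUNCTION  GET_AS_OF_DATE RETURN DATE IS BEGIN RETURN gAs_Of_Date; END;
END WASH_PARAMS;
/
-- Step 2: a view exposing the variable, so source views can join to it.
CREATE OR REPLACE VIEW ABC.VIEW_WASH_PARAMS AS
    SELECT ABC.WASH_PARAMS.GET_AS_OF_DATE AS AS_OF_DATE FROM DUAL;

-- Step 3: set the value, then run the generated (parameterless) procedure.
BEGIN
    ABC.WASH_PARAMS.SET_AS_OF_DATE(DATE '2013-10-01');
    ABC.PRETREATMENT;
END;
/
```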
  • Singleton
    A centralized data pretreatment system is normally hosted as a singleton instance to avoid unnecessary repetition of warm-up time. Two common scenarios:
    1. Act as a scheduled job without considering any concurrency.
      A scheduled job calls the generated stored procedure directly. This usage pattern should always be considered the top priority for most back-office applications. If daily (nighttime) refreshing cannot satisfy the timeliness requirement, consider the following alternatives.
    2. Be triggered on demand by a central controller which handles the critical section and the expiry window.
      WASH_GLOBAL_CACHE Package:
      Always enclose queries that select from the global cache in a pair of WASH_GLOBAL_CACHE.ENTER_LOCK(inCycle_ID) and WASH_GLOBAL_CACHE.EXIT_LOCK(inCycle_ID) calls.
      (Note: don't forget to call EXIT_LOCK in the exception block to make sure the read lock is always released.)
      The CACHE_EXPIRY (minutes) is defined in the CACHE_CONTROL table; a new entry is created there whenever a new wash CYCLE_ID is created, with a default of 240 minutes (4 hours).

    The following diagram shows the internal implementation of the cache-lock mechanism.

    Global Wash Cache

    In practice, a corresponding schema would be allocated to a cache to organize its target tables. This helps consumers be clear about which cache they are using. Combined with a naming convention (E.g. a schema-name suffix or a table-name prefix), whether a schema contains a single cache or multiple caches, all roads lead to clarity and unnecessary configuration is reduced.

    Risks of deadlock
    • A call to WASH_GLOBAL_CACHE.EXIT_LOCK might be forgotten; the lock would then never be released until the session terminates.
      A pattern similar to one introduced on MSDN helps avoid this:
      1. Create a stored procedure that encapsulates the DML operations which query from a cache (a cache is a set of tables). Keep this stored procedure private in your package body; do not expose it as a public member of the package. For example:
        PROCEDURE MY_DM_FROM_CACHEA(...)
        AS
        BEGIN
            ...
            INSERT INTO XYZ.LOCAL_TABLE1 (...)
            SELECT ... FROM ABC.GLOBAL_CACHE_TABLE1, ... WHERE ...;
            ...
            UPDATE XYZ.LOCAL_TABLE2 L
            SET COL1 = (SELECT G.VAL1 FROM ABC.GLOBAL_CACHE_TABLE2 G WHERE ...);
            ...
        END MY_DM_FROM_CACHEA;
      2. Create another wrapper stored procedure that handles the lock and calls the private stored procedure. For example:
        PROCEDURE PUB_DM_FROM_CACHEA(...)
        AS
            tCycle_ID CONSTANT VARCHAR2(32) := 'DPT.WASH_A';
        -- The Cycle ID is also the Cache ID.
        BEGIN
            VPI.WASH_GLOBAL_CACHE.ENTER_LOCK(tCycle_ID);
            MY_DM_FROM_CACHEA(...);
            VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_ID);
        EXCEPTION
            WHEN OTHERS THEN
                VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_ID);
                RAISE;
        END PUB_DM_FROM_CACHEA;
    • Multiple caches' data might be used in the same query, and multiple concurrent sessions might interleave them in an unpredictable order.
      Following aspects should be taken into comprehensive precautions:
      1. Cache Organizing
        In order to minimise these circumstances, arrange tables in the same cache if they usually couple together in most use cases. Investigation and analysis are therefore crucial to cache planning.
      2. Multi-cache Lock
        WASH_GLOBAL_CACHE will reject nested locks and cross locks.
        OK:
        ENTER_LOCK('CACHEA');
          DM_CACHEA(...);
        EXIT_LOCK('CACHEA');
        ENTER_LOCK('CACHEB');
          DM_CACHEB(...);
        EXIT_LOCK('CACHEB');

        Throws an exception (nested locks):
        ENTER_LOCK('CACHEA');
          ENTER_LOCK('CACHEB');
            DM_CACHEAB(...);
          EXIT_LOCK('CACHEB');
        EXIT_LOCK('CACHEA');

        Throws an exception (cross locks):
        ENTER_LOCK('CACHEA');
          DM_CACHEA(...);
        ENTER_LOCK('CACHEB');
          DM_CACHEAB(...);
        EXIT_LOCK('CACHEA');
          DM_CACHEB(...);
        EXIT_LOCK('CACHEB');
        In order to lock multiple caches as one combined critical section, use the overloads of ENTER_LOCK and EXIT_LOCK that take an array of cache IDs, as in the example below:
        PROCEDURE PUB_DM_FROM_CACHEAB(...)
        AS
            tCycle_IDs CONSTANT VCH_ID_ARY := VCH_ID_ARY('DPT.WASH_A', 'MKT.WASH_B' ...);   -- An array of Cache IDs.
        BEGIN
            VPI.WASH_GLOBAL_CACHE.ENTER_LOCK(tCycle_IDs);
        
            MY_DM_FROM_CACHEAB(...);
        
            VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_IDs);
        EXCEPTION
            WHEN OTHERS THEN
                VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_IDs);
                RAISE;
        END PUB_DM_FROM_CACHEAB;
        The internal implementation of the overloads guarantees that all caches (parameterized by an array) will always be locked in the same order by different sessions.
      3. Cache-lock Monitor
        It is impossible to avoid deadlocks completely. The view VPI.VIEW_CACHE_LOCK lists the locks currently held by the global cache control; it can be polled by your own alarm system.
        Structure your application design as succinctly as possible: that is the ultimate approach. Everybody gets sick sometimes as long as we live, but this should not be an excuse not to keep clean.
    • Effective scope of cache lock
      Cache locks are automatically released by the database engine when a session terminates, so client-side applications need special attention: don't split one piece of database processing into many small pieces spread across many database connections (sessions) - open a connection, do a small piece of manipulation, close the connection, open another connection, do another small piece, close it... Cache locks do not work for such a client application. It is almost like a man who rushes into the washroom, does a piece of the No.1 job, goes back to the office, runs to the washroom again right away, does a piece of the No.2 job, goes back to the office again... and so on - how could he reserve a spot in a public washroom like a reserved parking spot?
    • Cache Safe Period
      As an additional option, if some end applications are not that time sensitive, it can be fine to query caches without any explicit lock during the safe period. For example, for a manually generated report used only for in-house informal research, you can check all referenced caches at the beginning:
      - if the rest of the safe period is not enough time to run the related queries (by estimation), decline the user with a message such as: "My cache system is busy refreshing at the moment, please try again later. Sorry for the inconvenience, my dear!";
      - otherwise go ahead as you like, at your own risk of data incompleteness in case your query time runs longer than estimated... You should have a good sense of whether the result data is integral, so that you can retry later.
      The view VPI.VIEW_CACHE_SAFE_PERIOD can be used to determine the safety; column names are shown in upper case in the figure below:

       
      In order to make this option feasible, a higher-level system plan may preassign some consumers as 'female' applications/reports (with lock/period) and other clients as 'male' applications/reports (without lock/period), to ensure the cache is never starved.
       
  • Sequentiality
    The adoption of sequential execution was based on the design target, a consideration similar to that of most washing machines.
    This is one reason why, although the instruction infrastructure allows custom instructions to be extended, it is not encouraged to over-extend far beyond the target of data pretreatment. (A cardiovascular medicine is not intended to treat haemorrhoids, or vice versa.)

    Another project, a data mid-processing aided-design system built on the foundations of http://dbparallel.codeplex.com/ and http://view.codeplex.com/, may be planned later; it will be good at parallelism.

Prevention over Treatment

A note like this is often seen in public washrooms:
Hand Washing
(See: the Before and After Triggers)

An example for data: [START_DATE, END_DATE] is a common mode for storing slowly changing data in back-office systems, and the date ranges are error prone (overlapping or fragmentary) under manual maintenance without validation. However, introducing a checksum can prevent such mistakes as easily as washing hands. E.g.

CREATE TABLE ABC.SEC_HOLDINGS
(
    SEC_ID      VARCHAR2(16)                                     NOT NULL,
    HOLDINGS    NUMBER(19)                                       NOT NULL,
    START_DATE  DATE DEFAULT TO_DATE('1990-01-01', 'yyyy-mm-dd') NOT NULL,
    END_DATE    DATE DEFAULT TO_DATE('9999-12-31', 'yyyy-mm-dd') NOT NULL,

    CONSTRAINT PK_SEC_HOLDINGS PRIMARY KEY (SEC_ID, START_DATE),
    CONSTRAINT UK_SEC_HOLDINGS UNIQUE (END_DATE, SEC_ID),
    CONSTRAINT CK_SEC_HOLDINGS_START_END CHECK (START_DATE <= END_DATE)
);
CREATE OR REPLACE TRIGGER ABC.TRG_SEC_HOLDINGS
AFTER INSERT OR DELETE OR UPDATE OF START_DATE, END_DATE ON ABC.SEC_HOLDINGS
DECLARE
    tSec_ID VARCHAR2(16);
BEGIN
    SELECT
        MIN(SEC_ID) INTO tSec_ID
    FROM
    (
        SELECT
            SEC_ID,
            SUM(END_DATE - START_DATE + 1) - 
               (MAX(END_DATE) - MIN(START_DATE) + 1) AS CHECKSUM
        FROM
            ABC.SEC_HOLDINGS
        GROUP BY
            SEC_ID
    )
    WHERE   ROWNUM    = 1
        AND CHECKSUM <> 0;

    IF tSec_ID IS NOT NULL THEN
        RAISE_APPLICATION_ERROR(-20021,
            'Some [START_DATE, END_DATE]s are overlapped or fragmentary!
             First issue SEC_ID: ' || tSec_ID);
    END IF;
END TRG_SEC_HOLDINGS;
This is just a sample (for a small table) to show the CHECKSUM algorithm. In practice (for an Oracle solution), a row-level trigger should be used in conjunction with the statement-level trigger to validate only the changed IDs when the table holds a large amount of data; (for a SQL Server solution) it is even simpler. The main point is that it is always much easier to prevent a mistake at the source input. That said, this sample does not recommend abusing triggers - mysophobia! After all, life is not only for eating, and data is not only for inputting.
However, if the same validation processing had to be repeated in every subsequent consumer's work, it would become a calamity for the whole system's performance and clarity. It is as if a car factory did not establish a vendor quality-assurance system (before parts reach the assembly line), but instead asked every workstation on the assembly line to inspect every accessory/part while assembling it into a car - repairing faulty parts and marking special flags... imagine how complex and inefficient the assembly line would be!
Cleaning the runway of an airfield is the airport company's responsibility. Airline companies should not be held responsible for runway clearance.
 

Data Pollution Report

Distributing a complete list of bad data (based on known rules) every day to every data producer may help people look for the accurate pathogeny. Sooner or later, every problem application (design) will be laid bare. No matter how arrogant or humble, strong or weak an application is, all data sits equal on the toilet.

A utility for generating these queries (views) for the data pollution report will be introduced in a later release.

Requirement on requirement
In order to produce a compact and well-structured requirement for a clean system, some basic principles are recommended to keep in mind during requirements analysis.

  1. Disentangle functional dependencies between attributes in source systems based on true business meanings.
    As an isolation of basis data, the target system needs to carry out 3NF or BCNF. Some normalized relationships may already be safeguarded in the source systems; the requirement specification is required to list the rest of them, and some redundant attributes can be dropped.
  2. Unify different units of measurement into a standardized system of units.
    For example, converting all units of measurement into internationally standard measurement units (miles/h => km/h, yards => metres) if source systems use multiple different systems of units; converting all native dollars into the settlement currency; converting mixed ISO code systems(two-letter code and three-letter code) into a pure ISO code system (either one of them)... etc.
  3. Act as a global cache of basis data from source systems; providing downstream back-office systems with a clean data source optimized for query performance is a key service purpose of the DataWashroom. It is the requirement specification's responsibility to investigate:
    • Which attributes are often used as keys to look up others?
    • Which attributes are usually coupling together in consumers' queries?
    • What is the amount of data and the growth rate of its size?
    • What are the probability distributions of the different data partitions used in most queries?
    Etc. These investigations provide substantial grounds for the physical design of the target schema.
     
  4. Summarize data quality requirements into 3 basic parts:
    • Regular rules within the same row.
      Such as: non-null, reasonable value ranges and comparisons, etc.
    • Business-key uniqueness and relationship conflicts.
    • Relation-chain directivity and connectivity.

The sufficient analysis is intended to simplify solutions through inductive trimming and summarization, not to complicate matters.

Supported Databases

  • Oracle
    - Supported.
     
  • SQL Server
    - To be implemented later...

Contribute

  • All feedback is welcome through the CodePlex project (comments, patches, or items in the Issue Tracker);
  • Implementations for other databases are earnestly requested;

Last edited Oct 8, 2013 at 3:38 AM by AbelCheng, version 159