Oracle-Developer.com

 

Navigation: Home  | Discussion Forums (Get expert advice)  |  Scripts  |  About Us  | Links  | Job Openings  

 

 

Shared on dba-oracle.com

 

Enterprise Solutions with Oracle ETL

By V.J. Jain, July 2007 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle Partner)

 

 

Oracle 9i and 10g have added some very powerful tools to support high performance ETL.  While you might be familiar with Oracle external tables and pipelined table functions, you might not fully realize the potential of Oracle to offer an attractive alternative to an off the shelf ETL program such as Informatica, Data Stage, or Ab Initio.

 

The requirement for high performance ETL is growing as organizations realize the importance of business intelligence and the need for a comprehensive data warehouse.  The majority of these organizations have made a decision to use an expensive off the shelf product without realizing that they have already licensed the necessary tools to build a potentially better solution.  This article discusses using the Oracle ETL features with the Oracle Scheduler to provide a real-time, high scale, and high performance ETL solution.

 

Before you can understand the solution, it is important that you familiarize yourself with how external tables and pipelined table functions are used. 

 

External Tables

 

External tables are used to provide relational access to the contents of a file from an Oracle directory.  The external table script defines the rules for reading the file.  Once the table is created, the external table is accessed just like a regular table.  Here’s an example

 


Create directory DAT_DIR as ‘/usr/tmp’;

 

 

File: sample.csv saved to /usr/tmp

ID, DATE, FIRST_NAME, LAST_NAME, STATUS

1, 17-JUL-07, John, Adams, Active

2, 17-JUL-07, Tyler, Howell, Active

3, 17-JUL-07, Jim, Lopez, Active

4, 17-JUL-07, Carlos, White, Inactive

5, 17-JUL-07, Scott, Tiger, Active

 

External table definition

 

 

CREATE TABLE SAMPLE_EXT

(

ID NUMBER,

DATE VARCHAR2(10),  

FIRST_NAME VARCHAR2(30),

LAST_NAME VARCHAR2(30),

STATUS VARCHAR2(10)

)                      

 ORGANIZATION EXTERNAL

 (

   TYPE ORACLE_LOADER

   DEFAULT DIRECTORY DAT_DIR

   ACCESS PARAMETERS

   (

            records delimited by newline LOAD WHEN (ID != ‘ID’)

            badfile BAD_DIR:'SAMPLE_EXT%a_%p.bad'

            logfile LOG_DIR:'SAMPLE_EXT%a_%p.log'

fields terminated by ',' optionally enclosed by '"' LRTRIM

            MISSING FIELD VALUES ARE NULL

   )

   LOCATION (‘sample.csv')

 )

 PARALLEL 4

 REJECT LIMIT 1000;

 

 

Note:  I could have used “skip 1” in the table definition but if you can’t guarantee whether all file sources will use a header line, the method above might be a better alternative.  Also, several of the properties in this definition are optional. 

 

Now that the external table has been created and the file has been placed on the server, we can access the data using SQL such as

 

 

Select * from SAMPLE_EXT

/

 

ID  DATE  FIRST_NAME  LAST_NAME  STATUS

1  17-JUL-07  John  Adams  Active

2  17-JUL-07  Tyler  Howell  Active

3  17-JUL-07  Jim  Lopez  Active

4  17-JUL-07  Carlos  White  Inactive

5  17-JUL-07  Scott  Tiger  Active

 

Select * from SAMPLE_EXT where STATUS = ‘Inactive’

/

 

ID  DATE  FIRST_NAME  LAST_NAME  STATUS

4  17-JUL-07  Carlos  White  Inactive

 

 

 

As you can see, we have complete relational access to this file even though the file is outside the database.  It is easy to load the contents of this file into a relational table.

 

First create a matching relational table

 

 

CREATE TABLE SAMPLE_TAB

(

ID NUMBER,

DATE VARCHAR2(10),  

FIRST_NAME VARCHAR2(30),

LAST_NAME VARCHAR2(30),

STATUS VARCHAR2(10)

);

 

 

Now, insert the data from the external table to the relational table

 

 

Insert into sample_tab

Select * from sample_ext

/

 

 

While loading a CSV file is very easy, external tables can also be used for substantially complex files.  It is a common misconception that external tables are only useful for simple formats such as comma or tab delimited files.  This is far from the truth.  In actuality, since external tables are relationally accessible, you can utilize the full flexibility of the SQL language to get the data in the desired format.  If the file format happens to too complex for SQL to structure appropriately, a pipelined table function can be used for even the most complex file formats.

 

When using external tables for more complex file formats, I have found that placing each row from the file into single columned row in the external table offers great flexibility for transforming the data. 

 

Consider the following file…

 

sample2.dat (actual file is several megabytes)

23:18:00 PST 18 Feb 2007

<Inventory name=" CORVETTE">

            <AppId id="1"/>

            <Color>Yellow</Color>

</Inventory>

 

23:18:00 PST 18 Feb 2007

<Inventory name="CAMARO">

            <AppId id="2"/>

            <Color>White</Color>

</Inventory>

 

23:18:00 PST 18 Feb 2007

<Inventory name="PROWLER">

            <AppId id="3"/>

            <Color>Blue</Color>

</Inventory>

 

This file is obviously a blend between some text and XML.  Obviously, this is not a preferred format – but often times in ETL, you do not have control over the data sources.  How would you use an external table to efficiently load a file structured like this?  The solution is probably easier than you think.

 

First, create the external table…

 

 

CREATE TABLE SAMPLE2_EXT (

field1 varchar2(4000)

)                      

 ORGANIZATION EXTERNAL

 (

   TYPE ORACLE_LOADER

   DEFAULT DIRECTORY DAT_DIR

   ACCESS PARAMETERS

   (

             records delimited by newline

             badfile DAT_DIR:'SAMPLE2_EXT%a_%p.bad'

             logfile DAT_DIR:'SAMPLE2_EXT%a_%p.log'

             fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(4000))

   )

   LOCATION ('sample2.dat')

 )

REJECT LIMIT 10;

 

 

 

Now create the relational table…

 

 

CREATE TABLE SAMPLE2

(

  BEGIN_DATE  DATE,

  INVENTORY   VARCHAR2(30),

  APPID_ID    VARCHAR2(30),

  COLOR       VARCHAR2(30)

);

 

 

 

Extract, transform, and load data from external table into the relational table…

 

 

SQL> insert /* append */ into sample2

  2  select /*+ materialize */

  3    max(BEGIN_DATE) BEGIN_DATE,

  4    max(INVENTORY) INVENTORY,

  5    max(APPID_ID) APPID_ID

  6    ,max(COLOR) COLOR

  7    FROM

  8   (

  9   select r, grp,

 10     case when (length(translate(substr(field1,1,1),' 0123456789',' '))) is null then to_date(to_

char(to_timestamp_tz(field1, 'HH24:MI:SS TZD DD Mon YYYY'), 'DD-MON-YYYY HH:MI:SS AM'), 'DD-MON-YYYY

 HH:MI:SS AM') end BEGIN_DATE,

 11     case when (instr(field1,'<Inventory')>0) then extractValue(XMLType(replace(field1, '">', '"/

>')), '/Inventory/@name') end INVENTORY,

 12    case when (instr(field1,'<AppId')>0) then extractValue(XMLType(replace(field1, '">', '"/>')),

 '/AppId/@id') end APPID_ID,

 13    case when (instr(field1,'<Color')>0) then extractValue(XMLType(field1), '/Color') end COLOR

 14    from

 15    (   

 16   select r, field1, nvl(max(grp) over (order by r),0) grp FROM

 17     (

 18      select A.r, A.field1, case when (length(translate(substr(field1,1,1),' 0123456789',' '))) i

s null then A.r end grp from

 19      (

 20       select rownum r, field1 from SAMPLE2_EXT A

 21      ) A

 22     )      

 23   )

 24   )

 25  group by grp

 26  /

 

1674 rows created.

 

Elapsed: 00:00:06.85

SQL>

 

 

 

 

On my personal database the query above extracts, transforms, and loads 1,674 records into a relational table in under 7 seconds.  This is good performance considering that the format has both XML and non-XML content.  Here’s a sample of what is in our relational table…

 

 

select * from sample2

where rownum <=3

/

 

BEGIN_DATE    INVENTORY      APPID_ID          COLOR

 

2/18/2007 11:18:00 PM  CORVETTE       1          Yellow

2/18/2007 11:18:00 PM  CAMARO          2          White

2/18/2007 11:18:00 PM  PROWLER        3          Blue

 

 

 

So, the example above has demonstrated a relatively simple use of a single columned external table to transform raw data into a relational format.  However, what happens when the format is more complex?

 

Pipelined Table Functions

 

While they can get quite complicated, a table function is simply a function that returns rows.  All the row preparation happens inside the table function and each row will be piped back to the session as they are ready.  The output rows are based on a type definition. 

 

Here’s a simple example of a table function that accepts a VARCHAR2 input and returns rows of with 1 character in each row. 

 

 

CREATE TYPE row_typ AS OBJECT

       (p_row_output varchar2(10));

/

 

CREATE TYPE row_typ_tab AS TABLE OF row_typ;

/

 

 

create or replace function example_tab_fnc (pi_value IN varchar2)

return row_typ_tab PIPELINED AS

l_length number;

l_index        PLS_INTEGER := 1;

PRAGMA AUTONOMOUS_TRANSACTION;

BEGIN

LOOP

            l_length := length(pi_value);

            EXIT WHEN l_index > l_length;

            PIPE ROW ( row_typ(substr(pi_value, l_index, 1)));

            l_index := l_index + 1;

END LOOP;

RETURN;

 

END;

 

select * from table (example_tab_fnc('ABCDE'))

/

 

P_ROW_OUTPUT

 

A

B

C

D

E

 

 

 

It is obvious that this feature could certainly prove useful in an ETL environment.  While the demonstration above is an overly simplistic example, pipelined table functions can be used to transform complex files into a usable format.  For instance, a table function can accept a REF cursor pointing at an external table as an input.  Consider the following file

 

Sample3.dat

[CLASS_XXX]

[COLUMN_LISTING]

COL1,COL2,COL3,COL4,COL5

[COLUMN_VALUES]

XXX 219/  1/  0/  0/ 1,95,0,0,0,0

XXX 219/  1/  0/  1/ 1,23,0,0,0,0

XXX 219/  1/  0/  2/ 1,4,0,0,0,4

XXX 219/  1/  0/  3/ 1,37,1,0,0,33

[CLASS_YYY]

[COLUMN_LISTING]

COL21,COL22,COL23,COL24

[COLUMN_VALUES]

YYY 219/ 1/ 1/ 1/ 1,11092,49.52,627,179827

YYY 219/ 1/ 2/ 1/ 1,11092,199.52,628,179828

YYY 219/ 1/ 3/ 1/ 1,11092,459.52,629,179829

[CLASS_ZZZ]

[COLUMN_LISTING]

COL31,COL32,COL33,COL34,COL35,COL36,COL37

[COLUMN_VALUES]

ZZZ 219/  0/  0/ 1/ 1,0,0.00,0,0,7201,0,0

ZZZ 219/  0/  1/ 2/ 1,0,1.00,0,1,7202,0,2

ZZZ 219/  0/  2/ 1/ 2,0,2.00,0,2,7203,0,3

 

Description:  This file details the inventory of different warehouses.  3 different warehouses are included in the same file, each in its own class.  Each class has a dynamic listing of columns and column values.  Preceding the values line, there is a string such as “XXX 219/  1/  0/  3/ 1”, “YYY 219/ 1/ 1/ 1/ 1”, and “ZZZ 219/  0/  0/ 1/ 1”  .  Each of the values in each of these strings maps to source properties that form an inventory locator and must be stored for each row.

 

For example, “ZZZ 219/  0/  0/ 1/ 1”  maps to

Class: ZZZ

Store Number: 219

Pallet: 0

Lot: 0

Case: 1

Item: 1

 

Since there could be thousands of different columns for each class, it is not feasible to have a separate table column for each possibility.  The best solution is to store the data vertically by using a standard label for column name and column value.  Our desired relational storage would include the following columns

 

CLASS STORE_NUMBER PALLET LOT CASE_NUMBER ITEM COL_NAME COL_VALUE

 

In order to have our table function pipe rows with these columns, we need to define an appropriate type

 

 

CREATE TYPE sample3_typ AS OBJECT

(

            CLASS varchar2(10),

            STORE_NUMBER number(6),

            PALLET number(6),

            LOT number(6),

            CASE_NUMBER number(6),

            ITEM number(6),

            COL_NAME varchar2(100),

            COL_VALUE varchar2(100)

);

/

 

CREATE TYPE sample3_typ_tab AS TABLE OF sample3_typ;

/

 

 

If the number of columns were fixed, we might try defining an external table that leverages the comma delimiters in the file format.  However, since the number of columns is unknown, we will define our external table to have a single column.  In the original file, the column listing was over 16,000 characters long.  In order to allow for this condition, we will define the column to be of type CLOB. 

 

 

CREATE TABLE SAMPLE3_EXT (

field1 clob

)                      

 ORGANIZATION EXTERNAL

 (

   TYPE ORACLE_LOADER

   DEFAULT DIRECTORY DAT_DIR

   ACCESS PARAMETERS

   (

             records delimited by newline

             badfile DAT_DIR:'SAMPLE3_EXT%a_%p.bad'

             logfile DAT_DIR:'SAMPLE3_EXT%a_%p.log'

             fields LRTRIM REJECT ROWS WITH ALL NULL FIELDS (field1 char(32000))

   )

   LOCATION ('sample3.dat')

 )

 PARALLEL 10

 REJECT LIMIT 10;

 

 

 

Create the relational table…

 

 

CREATE TABLE SAMPLE3

(

  CLASS         VARCHAR2(10 BYTE),

  STORE_NUMBER  NUMBER(6),

  PALLET        NUMBER(6),

  LOT           NUMBER(6),

  CASE_NUMBER   NUMBER(6),

  ITEM          NUMBER(6),

  COL_NAME      VARCHAR2(100 BYTE),

  COL_VALUE     VARCHAR2(100 BYTE)

);

 

 

 

In order to transform the raw data into the relational format, I generally prefer to try to use SQL instead of a table function.  However, since the number of columns is unknown and we are storing the data vertically, the best solution here is to use a pipelined table function to prepare our rows as required.  

 

 

Now that I have the external table and type definitions, I need to create a table function that accepts a REF cursor (pointing to the external table) as an input parameter.  This function will start at the beginning of the CURSOR which is the first row in the external table and loop through the rows of the external table while returning rows to the session.  This function uses the keywords, CLASS, COLUMN_LISTING, and COLUMN_VALUES to instruct the actions of the function on the next loop.  The CLASS keyword sets the local variable for class.  The COLUMN_LISTING keyword builds a collection to store the column names.  The COLUMN_VALUES keyword causes the preparation and piping of each row as they are parsed. 

 

 

CREATE OR REPLACE function sample3_tab_fnc (pi_row IN sys_refcursor)

return sample3_typ_tab PIPELINED AS

l_string       LONG;

l_comma_index  PLS_INTEGER;

l_index        PLS_INTEGER := 1;

TYPE ListTyp IS TABLE OF VARCHAR2(100);

l_names ListTyp := ListTyp();

l_namesIter NUMBER :=1;

l_Iter NUMBER :=1;

l_cIter NUMBER :=1;

l_element varchar2(100);

l_field1 varchar2(32000);

pi_rownumber number := 1;

 

l_class varchar2(10);

l_store_number varchar2(6);

l_pallet varchar2(6);

l_lot varchar2(6);

l_case varchar2(6);

l_item varchar2(6);

l_action varchar2(10);

PRAGMA AUTONOMOUS_TRANSACTION;

BEGIN

LOOP

     FETCH pi_row INTO l_field1;

     EXIT WHEN pi_row%NOTFOUND;

             if instr(l_field1, ',', -1) = length(l_field1) then

                         l_string  := l_field1;

             else

                         l_string  := l_field1 || ',';

             end if;

             l_index :=1;

             l_Iter :=1;

             l_cIter :=1;

           

             if l_field1 = '[CLASS_XXX]' then

                        l_class := 'XXX';

                        l_action := null;

                        l_names  := ListTyp();

                        l_namesIter :=1;

                        GOTO end_loop;           

             elsif l_field1 = '[CLASS_YYY]' then

                        l_class := 'YYY';

                        l_action := null;

                        l_names := ListTyp();

                        l_namesIter :=1;

                        GOTO end_loop;           

             elsif l_field1 = '[CLASS_ZZZ]' then

                        l_class := 'ZZZ';

                        l_action := null;

                        l_names := ListTyp();

                        l_namesIter :=1;

                        GOTO end_loop;

             elsif l_field1 = '[COLUMN_LISTING]' then

                        l_action := 'COLUMNS';

                        l_namesIter :=1;

                        GOTO end_loop;

             elsif l_field1 = '[COLUMN_VALUES]' then

                        l_action := 'VALUES';

                        GOTO end_loop;

             end if;

             

            --Get Value Names into l_names

            if l_action = 'COLUMNS' then

                        LOOP

                        l_comma_index := INSTR(l_string, ',', l_index);

                        EXIT WHEN l_comma_index = 0;

                                    l_names.EXTEND;

                                    l_names(l_namesIter) := SUBSTR(l_string, l_index, l_comma_index - l_index);

                                    l_index := l_comma_index + 1;

                                    l_namesIter := l_namesIter + 1;

                        END LOOP;

            end if;

           

            --Get Element and Pipe Rows

            if l_action = 'VALUES' then

                        LOOP

                                    l_comma_index := INSTR(l_string, ',', l_index);

                                    EXIT WHEN l_comma_index = 0;

                                                if l_Iter = 1 then

                                                            l_element := SUBSTR(l_string, l_index, l_comma_index - l_index);

                                                            if l_class IN ('XXX', 'YYY', 'ZZZ') then

                                                                        l_store_number := trim(substr(l_element,4,instr(l_element, '/',1,1)-4));

                                                                        l_pallet := trim(substr(l_element,instr(l_element, '/',1,1)+1,instr(l_element, '/',1,2)-instr(l_element, '/',1,1)-1 ));

                                                                        l_lot := trim(substr(l_element,instr(l_element, '/',1,2)+1,instr(l_element, '/',1,3)-instr(l_element, '/',1,2)-1 ));

                                                                        l_case := trim(substr(l_element,instr(l_element, '/',1,3)+1,instr(l_element, '/',1,4)-instr(l_element, '/',1,3)-1 ));

                                                                        l_item := trim(substr(l_element,instr(l_element, '/',1,4)+1,instr(l_element, '/',1,5)-instr(l_element, '/',1,4)-1 ));                                                                     

                                                            end if;                                                              

                                                else

                                                            PIPE ROW ( sample3_typ(trim(l_class), l_store_number, l_pallet, l_lot, l_case, l_item, trim(l_names(l_cIter)), trim(SUBSTR(l_string, l_index, l_comma_index - l_index))) );

                                                            l_cIter := l_cIter + 1;

                                                end if;

                                                l_Iter := 2; --move out of element row

                                                l_index := l_comma_index + 1;

                        END LOOP;

            end if;

<<end_loop>>

null;     

END LOOP;

RETURN;

END;

/

 

 

 

 

The data can be selected by using SQL

 

 

select * from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))

/

 

CLASS STORE_NUMBER         PALLET            LOT      CASE_NUMBER           ITEM     COL_NAME      COL_VALUE

 

XXX       219       1          0          0                      COL1    95

XXX       219       1          0          0                      COL2    0

XXX       219       1          0          0                      COL3    0

XXX       219       1          0          0                      COL4    0

XXX       219       1          0          0                      COL5    0

XXX       219       1          0          1                      COL1    23

XXX       219       1          0          1                      COL2    0

XXX       219       1          0          1                      COL3    0

XXX       219       1          0          1                      COL4    0

XXX       219       1          0          1                      COL5    0

XXX       219       1          0          2                      COL1    4

XXX       219       1          0          2                      COL2    0

XXX       219       1          0          2                      COL3    0

XXX       219       1          0          2                      COL4    0

XXX       219       1          0          2                      COL5    4

XXX       219       1          0          3                      COL1    37

XXX       219       1          0          3                      COL2    1

XXX       219       1          0          3                      COL3    0

XXX       219       1          0          3                      COL4    0

XXX       219       1          0          3                      COL5    33

YYY     219       1          1          1                      COL21  11092

YYY     219       1          1          1                      COL22  49.52

YYY     219       1          1          1                      COL23  627

YYY     219       1          1          1                      COL24  179827

YYY     219       1          2          1                      COL21  11092

YYY     219       1          2          1                      COL22  199.52

YYY     219       1          2          1                      COL23  628

YYY     219       1          2          1                      COL24  179828

YYY     219       1          3          1                      COL21  11092

YYY     219       1          3          1                      COL22  459.52

YYY     219       1          3          1                      COL23  629

YYY     219       1          3          1                      COL24  179829

ZZZ       219       0          0          1                      COL31  0

ZZZ       219       0          0          1                      COL32  0.00

ZZZ       219       0          0          1                      COL33  0

ZZZ       219       0          0          1                      COL34  0

ZZZ       219       0          0          1                      COL35  7201

ZZZ       219       0          0          1                      COL36  0

ZZZ       219       0          0          1                      COL37  0

ZZZ       219       0          1          2                      COL31  0

ZZZ       219       0          1          2                      COL32  1.00

ZZZ       219       0          1          2                      COL33  0

ZZZ       219       0          1          2                      COL34  1

ZZZ       219       0          1          2                      COL35  7202

ZZZ       219       0          1          2                      COL36  0

ZZZ       219       0          1          2                      COL37  2

ZZZ       219       0          2          1                      COL31  0

ZZZ       219       0          2          1                      COL32  2.00

ZZZ       219       0          2          1                      COL33  0

ZZZ       219       0          2          1                      COL34  2

ZZZ       219       0          2          1                      COL35  7203

ZZZ       219       0          2          1                      COL36  0

ZZZ       219       0          2          1                      COL37  3

 

51 rows selected.

 

Elapsed: 00:00:00.76

SQL>

 

 

 

 

Using the same procedure on a larger file in my personal database, the query above extracts, transforms, and loads 31,233 records into the relational table in under 1 second. 

 

 

SQL> insert /*+ append */ into sample3

  2  select * from table (sample3_tab_fnc(CURSOR(select * from SAMPLE3_EXT)))

  3  /

 

31233 rows created.

 

Elapsed: 00:00:00.96

SQL>

 

 

 

This design allows for the content of the data file to change dynamically without requiring modifications to the code.  For example, if a new class (Warehouse) is added or new columns are added or removed, the data will continue to load without interruption.

 

 

Building an enterprise ETL solution

 

The previous examples provide convincing evidence that Oracle external tables and pipelined table functions allow an efficient, highly flexible method for performing ETL.  But how are these features used in a high scale, dynamic, real-time loading environment?  The provided solution consists of a few elements that are the building blocks of the enterprise solution. 

 

 

Dynamic External Table Definitions

 

The first element of the solution is the ability to dynamically control the data source location for an external table.  The following command can change SAMPLE3_EXT from using sample3.dat to using sample3a.dat

 

 

SQL> ALTER TABLE SAMPLE3_EXT LOCATION('sample3a.dat');

Table altered.

 

 

This means that we can dynamically change the file source from PL/SQL.   Keep this in mind as I continue explain the other elements of the solution.

 

NFS – Network File Systems

 

While having a network drive is not necessary, it can help since the incoming directories can be on the network drive instead of the Oracle drive.  This can help to minimize risk of disk space problems that affect the database.  This can also improve the security model since the NFS drive can be written to without access to the Oracle server. 

 

Create and Populate Custom Queue

 

The second element of the solution is a table that will keep track of all ETL processes.  The queue table should reflect all the normal information such as date, start, end, status, path, etc.  There are a few options on how to populate a custom queue table.  One option is to have a process on the operating system that monitors folders for new files.  Another option is to define the incoming directories as Oracle directories.  You can use a directory listing custom procedure in Oracle that invokes a Java procedure to get relational access to a listing of files in that directory.  The procedure copies the filenames in the directory into a temporary table that can be queried.  Tom Kyte gives a great explanation for how to accomplish this at http://asktom.oracle.com/pls/asktom/f?p=100:11:183203105787997::::P11_QUESTION_ID:439619916584

 

For example, consider an incoming folder “/incoming/vendor” and a corresponding Oracle directory “VENDOR_DIR”.  New files “file1.dat”, “file2.dat”, and “file3.dat” are placed in our folder.  The directory listing function is executed from a procedure scheduled in the DBMS_SCHEDULER. 

 

 

Begin

 Get_directory_list(‘/incoming/vendor’);

End;

 

Select filename from directory_list_tmp;

/

 

FILENAME

file1.dat

file2.dat

file3.dat

 

 

The file listing is selected from the scheduled process and the files are compared with the files that are already queued.  The unrecognized files are inserted into the queue.  This simple process adds all incoming files to our custom queue table on a near real-time basis.

 

ETL Procedures

 

As previously described, external tables and pipelined table functions can be used in PL/SQL procedures to load the data from the files to the relational tables.  There should be a different procedure for each file format.  Each procedure should accept the file path as a parameter.  Before initiating the load, the procedure should dynamically change the file source using the syntax mentioned above.  Personally, I like to change the log and bad files to be file specific.  At the end of the procedure, the corresponding queue record should be updated appropriately.    

 

Concurrency (Multi-threaded external tables)

 

Because the Oracle ETL procedures are very fast, a single thread (single concurrency per external table) should be sufficient for the majority of requirements.  However, if you need to multi-thread your external table, this is fairly simple to do by using a semaphore. 

 

 

Putting the pieces together

 

There should be a scheduled job that scans and processes the queue based on your business rules.  If files in the queue have not been started, this process should submit a new job to DBMS_SCHEDULER to run the appropriate ETL procedure for that file and update the record in the queue appropriately.  Since you are passing the file path to the ETL procedure, the procedure is able to dynamically redefine the data source for that external table to point to the file.  Rules for handling queued records in error or complete status can be added to this job.  Also, the rules for concurrency should be defined in this job. 

 

 

At any time, the queue table can provide a real-full view of your system’s entire ETL activity.  Having all elements of the entire ETL architecture within the database has many benefits including reporting, cost management, error handling and maintenance.  Any solution is the sum of its parts.  The sum of the parts described above is a solution that is an efficient alternative for an off the shelf ETL tool. 

 

 

Interested in implementing this solution? Contact Varun Jain, Inc. for Oracle ETL/XMLDB Consulting Services.


 

 

 

Owned and Operated by Varun Jain, Inc, www.varunjaininc.com

Copyright ©2007 Oracle-Developer.com