Oracle-Developer.com
| Navigation:
Discussion Forums (Get expert advice) |
Publications |
Scripts |
About Us | Links |
Enterprise Solutions with Oracle ETL
By V.J.
Jain, July 2007 (varun.jain@varunjaininc.com)
Varun Jain, Inc. (an Oracle Partner)
|
Oracle 9i and 10g have added some very powerful tools to support high performance ETL. While you might be familiar with Oracle external tables and pipelined table functions, you might not fully realize the potential of Oracle to offer an attractive alternative to an off the shelf ETL program such as Informatica, Data Stage, or Ab Initio.
The requirement for high performance ETL is growing as organizations realize the importance of business intelligence and the need for a comprehensive data warehouse. The majority of these organizations have made a decision to use an expensive off the shelf product without realizing that they have already licensed the necessary tools to build a potentially better solution. This article discusses using the Oracle ETL features with the Oracle Scheduler to provide a real-time, high scale, and high performance ETL solution.
Before you can understand the solution, it is important that you familiarize yourself with how external tables and pipelined table functions are used.
External Tables
External tables are used to provide relational access to the contents of a file from an Oracle directory. The external table script defines the rules for reading the file. Once the table is created, the external table is accessed just like a regular table. Here’s an example
File: sample.csv saved to /usr/tmp ID, DATE, FIRST_NAME, LAST_NAME, STATUS 1, 17-JUL-07, John, Adams, Active 2, 17-JUL-07, Tyler, Howell, Active 3, 17-JUL-07, Jim, Lopez, Active 4, 17-JUL-07, Carlos, White, Inactive 5, 17-JUL-07, Scott, Tiger, Active
External table definition
Note: I could have used “skip 1” in the table definition but if you can’t guarantee whether all file sources will use a header line, the method above might be a better alternative. Also, several of the properties in this definition are optional.
Now that the external table has been created and the file has been placed on the server, we can access the data using SQL such as
As you can see, we have complete relational access to this file even though the file is outside the database. It is easy to load the contents of this file into a relational table.
First create a matching relational table
Now, insert the data from the external table to the relational table
While loading a CSV file is very easy, external tables can also be used for substantially complex files. It is a common misconception that external tables are only useful for simple formats such as comma or tab delimited files. This is far from the truth. In actuality, since external tables are relationally accessible, you can utilize the full flexibility of the SQL language to get the data in the desired format. If the file format happens to too complex for SQL to structure appropriately, a pipelined table function can be used for even the most complex file formats.
When using external tables for more complex file formats, I have found that placing each row from the file into single columned row in the external table offers great flexibility for transforming the data.
Consider the following file…
sample2.dat (actual file is several megabytes) 23:18:00 PST 18 Feb 2007 <Inventory name=" CORVETTE"> <AppId id="1"/> <Color>Yellow</Color> </Inventory>
23:18:00 PST 18 Feb 2007 <Inventory name="CAMARO"> <AppId id="2"/> <Color>White</Color> </Inventory>
23:18:00 PST 18 Feb 2007 <Inventory name="PROWLER"> <AppId id="3"/> <Color>Blue</Color> </Inventory>
This file is obviously a blend between some text and XML. Obviously, this is not a preferred format – but often times in ETL, you do not have control over the data sources. How would you use an external table to efficiently load a file structured like this? The solution is probably easier than you think.
First, create the external table…
Now create the relational table…
Extract, transform, and load data from external table into the relational table…
On my personal database the query above extracts, transforms, and loads 1,674 records into a relational table in under 7 seconds. This is good performance considering that the format has both XML and non-XML content. Here’s a sample of what is in our relational table…
So, the example above has demonstrated a relatively simple use of a single columned external table to transform raw data into a relational format. However, what happens when the format is more complex?
Pipelined Table Functions
While they can get quite complicated, a table function is simply a function that returns rows. All the row preparation happens inside the table function and each row will be piped back to the session as they are ready. The output rows are based on a type definition.
Here’s a simple example of a table function that accepts a VARCHAR2 input and returns rows of with 1 character in each row.
It is obvious that this feature could certainly prove useful in an ETL environment. While the demonstration above is an overly simplistic example, pipelined table functions can be used to transform complex files into a usable format. For instance, a table function can accept a REF cursor pointing at an external table as an input. Consider the following file
Sample3.dat [CLASS_XXX] [COLUMN_LISTING] COL1,COL2,COL3,COL4,COL5 [COLUMN_VALUES] XXX 219/ 1/ 0/ 0/ 1,95,0,0,0,0 XXX 219/ 1/ 0/ 1/ 1,23,0,0,0,0 XXX 219/ 1/ 0/ 2/ 1,4,0,0,0,4 XXX 219/ 1/ 0/ 3/ 1,37,1,0,0,33 [CLASS_YYY] [COLUMN_LISTING] COL21,COL22,COL23,COL24 [COLUMN_VALUES] YYY 219/ 1/ 1/ 1/ 1,11092,49.52,627,179827 YYY 219/ 1/ 2/ 1/ 1,11092,199.52,628,179828 YYY 219/ 1/ 3/ 1/ 1,11092,459.52,629,179829 [CLASS_ZZZ] [COLUMN_LISTING] COL31,COL32,COL33,COL34,COL35,COL36,COL37 [COLUMN_VALUES] ZZZ 219/ 0/ 0/ 1/ 1,0,0.00,0,0,7201,0,0 ZZZ 219/ 0/ 1/ 2/ 1,0,1.00,0,1,7202,0,2 ZZZ 219/ 0/ 2/ 1/ 2,0,2.00,0,2,7203,0,3
Description: This file details the inventory of different warehouses. 3 different warehouses are included in the same file, each in its own class. Each class has a dynamic listing of columns and column values. Preceding the values line, there is a string such as “XXX 219/ 1/ 0/ 3/ 1”, “YYY 219/ 1/ 1/ 1/ 1”, and “ZZZ 219/ 0/ 0/ 1/ 1” . Each of the values in each of these strings maps to source properties that form an inventory locator and must be stored for each row.
For example, “ZZZ 219/ 0/ 0/ 1/ 1” maps to Class: ZZZ Store Number: 219 Pallet: 0 Lot: 0 Case: 1 Item: 1
Since there could be thousands of different columns for each class, it is not feasible to have a separate table column for each possibility. The best solution is to store the data vertically by using a standard label for column name and column value. Our desired relational storage would include the following columns
CLASS STORE_NUMBER PALLET LOT CASE_NUMBER ITEM COL_NAME COL_VALUE
In order to have our table function pipe rows with these columns, we need to define an appropriate type
If the number of columns were fixed, we might try defining an external table that leverages the comma delimiters in the file format. However, since the number of columns is unknown, we will define our external table to have a single column. In the original file, the column listing was over 16,000 characters long. In order to allow for this condition, we will define the column to be of type CLOB.
Create the relational table…
In order to transform the raw data into the relational format, I generally prefer to try to use SQL instead of a table function. However, since the number of columns is unknown and we are storing the data vertically, the best solution here is to use a pipelined table function to prepare our rows as required.
Now that I have the external table and type definitions, I need to create a table function that accepts a REF cursor (pointing to the external table) as an input parameter. This function will start at the beginning of the CURSOR which is the first row in the external table and loop through the rows of the external table while returning rows to the session. This function uses the keywords, CLASS, COLUMN_LISTING, and COLUMN_VALUES to instruct the actions of the function on the next loop. The CLASS keyword sets the local variable for class. The COLUMN_LISTING keyword builds a collection to store the column names. The COLUMN_VALUES keyword causes the preparation and piping of each row as they are parsed.
The data can be selected by using SQL
Using the same procedure on a larger file in my personal database, the query above extracts, transforms, and loads 31,233 records into the relational table in under 1 second.
This design allows for the content of the data file to change dynamically without requiring modifications to the code. For example, if a new class (Warehouse) is added or new columns are added or removed, the data will continue to load without interruption.
Building an enterprise ETL solution
The previous examples provide convincing evidence that Oracle external tables and pipelined table functions allow an efficient, highly flexible method for performing ETL. But how are these features used in a high scale, dynamic, real-time loading environment? The provided solution consists of a few elements that are the building blocks of the enterprise solution.
Dynamic External Table Definitions
The first element of the solution is the ability to dynamically control the data source location for an external table. The following command can change SAMPLE3_EXT from using sample3.dat to using sample3a.dat
This means that we can dynamically change the file source from PL/SQL. Keep this in mind as I continue explain the other elements of the solution.
NFS – Network File Systems
While having a network drive is not necessary, it can help since the incoming directories can be on the network drive instead of the Oracle drive. This can help to minimize risk of disk space problems that affect the database. This can also improve the security model since the NFS drive can be written to without access to the Oracle server.
Create and Populate Custom Queue
The second element of the solution is a table that will keep track of all ETL processes. The queue table should reflect all the normal information such as date, start, end, status, path, etc. There are a few options on how to populate a custom queue table. One option is to have a process on the operating system that monitors folders for new files. Another option is to define the incoming directories as Oracle directories. You can use a directory listing custom procedure in Oracle that invokes a Java procedure to get relational access to a listing of files in that directory. The procedure copies the filenames in the directory into a temporary table that can be queried. Tom Kyte gives a great explanation for how to accomplish this at http://asktom.oracle.com/pls/asktom/f?p=100:11:183203105787997::::P11_QUESTION_ID:439619916584
For example, consider an incoming folder “/incoming/vendor” and a corresponding Oracle directory “VENDOR_DIR”. New files “file1.dat”, “file2.dat”, and “file3.dat” are placed in our folder. The directory listing function is executed from a procedure scheduled in the DBMS_SCHEDULER.
The file listing is selected from the scheduled process and the files are compared with the files that are already queued. The unrecognized files are inserted into the queue. This simple process adds all incoming files to our custom queue table on a near real-time basis.
ETL Procedures
As previously described, external tables and pipelined table functions can be used in PL/SQL procedures to load the data from the files to the relational tables. There should be a different procedure for each file format. Each procedure should accept the file path as a parameter. Before initiating the load, the procedure should dynamically change the file source using the syntax mentioned above. Personally, I like to change the log and bad files to be file specific. At the end of the procedure, the corresponding queue record should be updated appropriately.
Concurrency (Multi-threaded external tables)
Because the Oracle ETL procedures are very fast, a single thread (single concurrency per external table) should be sufficient for the majority of requirements. However, if you need to multi-thread your external table, this is fairly simple to do by using a semaphore.
Putting the pieces together
There should be a scheduled job that scans and processes the queue based on your business rules. If files in the queue have not been started, this process should submit a new job to DBMS_SCHEDULER to run the appropriate ETL procedure for that file and update the record in the queue appropriately. Since you are passing the file path to the ETL procedure, the procedure is able to dynamically redefine the data source for that external table to point to the file. Rules for handling queued records in error or complete status can be added to this job. Also, the rules for concurrency should be defined in this job.
At any time, the queue table can provide a real-full view of your system’s entire ETL activity. Having all elements of the entire ETL architecture within the database has many benefits including reporting, cost management, error handling and maintenance. Any solution is the sum of its parts. The sum of the parts described above is a solution that is an efficient alternative for an off the shelf ETL tool.
|
Sponsored by Varun Jain, Inc. Oracle Applications and Database Consulting
Copyright ©2007 Oracle-Developer.com