Life@Net

XML processing in ETL if xml schema is dynamic with Performance Improvement Suggestions:

Posted in abinitio, Hash Parsing, parsing, performance, perl, tree Parsing, xml by Abhi Bhi Zinda Hun on July 13, 2008
Performance Improvement Suggestions for xml processing in ETL if xml schema is dynamic :
Its again a lengthy blog but, it is of much interest if you are concerned of getting good performance for huge volume (in billions ) of xml data processing in ETL .

I have done some analysis for improving execution time of etl process against huge volume of xml data . I hope this analysis has got something which we should consider before designing an etl process which takes xml data as source and processes it to a normal delimited data for further porcessing . The topic is primarily based on abinitio as etl tool and it is to highlight how to improve performance when xml data is having dynamic xsd ( you can not go by xml-to-dml utility of abinitio as your xsd keeps changing ) .

At first , I thought to write an effiicient parser ( tree parser ) using shell and that might reduce execution time . But, it was of no use as it started taking more time than abinitio modules for xml tree parsing ( I wrote one in shell using awk and smaller commands for one special case -only elements and no attributes .) .Therefore despite a strong shell guy, for language considerations, I finally agreed with Jack Schwartz’s notes for xml parser coding. Further search revealed, perl has already number of xml parsers and these are really fast enough ( as perl is interpreted language) . For perl xml parser distributions: Refer: http://cpan.uwinnipeg.ca/dist/XML/ . There is no meaning of reinventing the wheel .

But, something in c++ is still needed for improved performance of tree xml parser . That can be a task for c++ masters and this is the area where shell fails to give good performance .

Currently, AbInitio provides tree approach to parse xml data ( if you want to handle dynamic xsd ) which basically seems to be perl’s dom ( document object model) parser (libxml) wrapped by abinitio’s own layer ( basically get_element_root,get_element_count etc; of abinitio seems to be perl subroutines of similar methods from
dom parser object ) .

The DOM represents an XML document as a tree of doubly-linked nodes, with the first child at each level linked up to its parent and across to siblings. So, a number of functions is defined on the tree, with implementations in c++ in perl dom xml module ( We are using , abinitio wrapper functions on perl methods ( finally calling a c++ methods)).

Although you can navigate a DOM tree by following the links, it is generally more efficient in terms of programmer time to use the XPath protocol whereas it is bad in performance as it gives facilities for navigation to nodes, retrieval of sets of nodes, and so forth. . Efficient xpathing enables a great deal of flexibility in dealing with the documents: facilities like accessing in any order, reorganizing, adding, and deleting .

However, it requires the parsing of the entire XML document, as well as the creation of the tree data structure, before the processing and business logic take place. Since the tree data structure is generally stored in memory, these methods have much larger memory feeding than other methods. The problem is exacerbated by the fact that storing the document in memory as a tree needs several times as much storage as the original XML document did. For larger documents both of these can be significant — the parsing and tree creation time become substantial, and the memory requirements can overrun the available resources.

On the other hand, XML::Simple (http://cpan.uwinnipeg.ca/dist/XML-Simple) believes in hardcoded element names and parses using hash array method ( singly linekd list) which makes it more efficient in performance than dom one but, less flexible and accessible. I know it is not the best way to go when you need to build a representation of the XML document in memory and then either search it or transform it , which is where tree parsing comes in . But, if the XML document won’t fit in memory or is a stream of unknown length, you can’t use XML::Simple at all.
But, we can always make XML document fit into memory as we are already reading recordwise and we are getting one xml flow as input each time . So, Tree parsing will give worst case performance here and xml:simple parsing will be ideal provided reconstruction logic is not needed .

So, I think we should get rid of reconstruction feature and simple parsing mechanism needs to be introduced for good performance .If we will write a recursive tree parsing code using abinitio and even after solving any major bottleneck in algorithm w.r.t performance ,you can not get good execution time . There is a limit for performance of dom parsing and you can not go beyond a certain limit . So, the problem is not with the tree parsing code . It is the strategy which you will adopt needs to be tested for performance improvement . And if we will do this then xml:simple module will come first to use .

Performance Improvement Suggestions for xml processing :
1:
I think abinitio also in turn uses perl xml (::dom) package or some c++ dom xml parser for tree type xml parsing ( if we will write xml parsing xfr using abinitio xml methods ).Though, I am not sure .

But, time complexity of perl module will be very close to c++ modules as it reduces performance by self-interpretability( no compile time ) and it infact
uses xml parse module written in c++ ( xml::simple) .
So, perl would be handy here for parsing thing and then we can use our abinitio tool for further transformations . As I stated, xml:simple parser module will
be ideal as we are getting row by row xml data and not the entire doc as input ( where tree parsing or sax ( event driven one from perl) parsing is needed )
.
2:
Removal of namespaces (in element name ) should not be left on abinitio and I think it should be done before handing data to abinitio
. Perl or shell ( would prefer perl over shell for performance ) .

3:
If you dont need data from attributes of xml element then there is no need of keeping attributes in input file and one should only conc on improving performance and getting task done .
So, I think we should even remove attributes from xml data before handing data to abinitio ( it is anyways optional). Major performance boost will take placeby ch anging the parser ( tree to simple) .

4:
Now, even perl has a number of xml modules which are used as per reqmnt for parsing xml data . But, fastest one is xml::simple module which expects very simple xml data ( only elements and attributes which should fit into memory ) . I think we should remove namespaces and also attributes using shell/perl and then xml::simple module will be used to create a simple flat file . Then, this file will be used by abinitio .
I think net execution time would be dramatically reduced after this ( hoping for 1/4th execution time or less for similar volume of data as per dom parsing design) .

Design:

Step 1:
Input XML File —-> Filter (If needed ) —>Perl Script ( To remove attributes if needed and namespaces ) In Run Program Comp ( MFS Layout ) –> Perl Script
( To parse xml data using xml::simple module and create a normal xml file ) In Run Program Comp ( MFS Layout ) —–> MFS Output File
Use Run program in mfs layout to get good performance of perl program for huge volume of data .
Step 2:
MFS Output File —-> Your Transformation Logics —> ..and so on .

Example Problem:
Input XML Data:

Treat “(” with “” in sample data wherever tags are opened and closed ( due to html formating it is shown like this ) :

(TRAN_DATA)
(Timestamp)2008010112:13:14(/Timestamp)
(cust_id)12345(/cust_id)
(cust_nm)R.K(/cust_nm)
(city)kolkata(/city)
(sex)M(/sex)
(msg)
(request)
(orderid)1234(/orderid)
(tran_event)
(prodid)12(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)Laptop(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)56000(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)76(/prodid)
(DiscFlag)0(/DiscFlag)
(prodname)BedSheet(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)800(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)92(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)TV(/prodname)
(agentid type=”permanent”)6597(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)20000(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)2(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)Bucket(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)100(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(/request)
(response)
(Total_Num)4(/Total_Num)
(Total_Amount)76900(/Total_Amount)
(Total_Discount)12000(/Total_Discount)
(Total_Net)64900(/Total_Net)
(DeliveredFlg)1(/DeliveredFlg)
(Feedback)Satisfied(/Feedback)
(/response)
(/msg)
(/TRAN_DATA)

Store it as tran.xml . It contains daily transaction data in xml format for a particular customer . The data consists of one dynamic part of transection event data which corresponds to each product purchased by a customer . Even if in future the data is changed by some other elements and the business does not need that then also our logic should work .
We need denormalized data of following record format from daily transaction data :
record
string(“|”) Timestamp=NULL(“”);
string(“|”) cust_id=NULL(“”);
string(“|”) cust_nm
string(“|”) city=NULL(“”);
string(“|”) sex=NULL(“”);
string(“|”) orderid=NULL(“”);
string(“|”) prodid=NULL(“”);
string(“|”) DiscFlag=NULL(“”);
string(“|”) prodname=NULL(“”);
string(“|”) agentid=NULL(“”);
string(“|”) agent_type=NULL(“”);
string(“|”) strt_ts=NULL(“”);
string(“|”) end_ts=NULL(“”);
string(“|”) amount=NULL(“”);
string(“|”) PayMode=NULL(“”);
string(“|”) Total_Num=NULL(“”);
string(“|”) Total_Amount=NULL(“”);
string(“|”) Total_Discount=NULL(“”);
string(“|”) Total_Net=NULL(“”);
string(“|”) DeliveredFlg=NULL(“”);
string(“\n”) Feedback=NULL(“”);
end

Pseudo Code For Step1 (only parsing) :

#!/usr/bin/perl -w
# Auhtor: Chandra Tewary
# Date:2008-07-13
# Program to parse xml data into normal file

# Use xml parser module
use XML::Simple;

# Create xml object
my $xml = new XML::Simple;

# Read xml input file in array
my @arr = ;

# Read XML Array
my $data = $xml->XMLin(“@arr”);

# Named values Of Elements
$f1=”Timestamp”;
$f2=”cust_id”;
$f3=”cust_nm”;
$f4=”city”;
$f5=”sex”;
$f6=”msg”;
$f7=”request”;
$f8=”orderid”;
$f9=”tran_event”;
$f10=”prodid”;
$f11=”DiscFlag”;
$f12=”strt_ts”;
$f13=”end_ts”;
$f14=”prodname”;
$f15=”agentid”;
$f16=”amount”;
$f17=”PayMode”;
$f18=”response”;
$f19=”Total_Num”;
$f20=”Total_Amount”;
$f21=”Total_Discount”;
$f22=”Total_Net”;
$f23=”DeliveredFlg”;
$f24=”Feedback”;
# Name Value for element for attributed elements
$cont=”content”;
# Count number of occureneces of tran event data
my @farr = grep (/\/tran_event/,@arr) ;
my $cnt=$#farr + 1;
# print output
for ($i=0 ; $i{$f6}->{$f7}->{$f8}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f10}|
$data->{$f6}->{$f7}->{$f9}->[$i]->{$f11}|$data->{$f6}->{$f7}->{$f9}
->[$i]->{$f12}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f13}|$data->{$f6}->
{$f7}->{$f9}->[$i]->{$f14}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f15}->{$cont}|
$data->{$f6}->{$f7}->{$f9}->[$i]->{$f15}->{type}|$data->{$f6}->{$f7}->{$f9}->
[$i]->{$f16}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f17}|$data->{$f6}->{$f18}->
{$f19}|$data->{$f6}->{$f18}->{$f20}|$data->{$f6}->{$f18}->{$f21}|$data->
{$f6}->{$f18}->{$f22}|$data->{$f6}->{$f18}->{$f23}|$data->{$f6}->
{$f18}->{$f24}\n”;
}

Output will be a ‘|’ delimited and ready for normal use:

C:\pls>xml_parse.pl tran.xml

2008010112:13:14|12345|R.K|kolkata|M|1234|12|1|2008010112:13:18|
2008010112:13:44|Laptop|6547|contract|56000|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|76|0|2008010112:13:18|
2008010112:13:44|BedSheet|6547|contract|800|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|92|1|2008010112:13:18|
2008010112:13:44|TV|6597|permanent|20000|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|2|1|2008010112:13:18|
2008010112:13:44|Bucket|6547|contract|100|Card|4|76900|12000|64900|
1|Satisfied


This data will be readable by abinitio using below record format :
record
string(“|”) Timestamp=NULL(“”);
string(“|”) cust_id=NULL(“”);
string(“|”) cust_nm
string(“|”) city=NULL(“”);
string(“|”) sex=NULL(“”);
string(“|”) orderid=NULL(“”);
string(“|”) prodid=NULL(“”);
string(“|”) DiscFlag=NULL(“”);
string(“|”) prodname=NULL(“”);
string(“|”) agentid=NULL(“”);
string(“|”) agent_type=NULL(“”);
string(“|”) strt_ts=NULL(“”);
string(“|”) end_ts=NULL(“”);
string(“|”) amount=NULL(“”);
string(“|”) PayMode=NULL(“”);
string(“|”) Total_Num=NULL(“”);
string(“|”) Total_Amount=NULL(“”);
string(“|”) Total_Discount=NULL(“”);
string(“|”) Total_Net=NULL(“”);
string(“|”) DeliveredFlg=NULL(“”);
string(“\n”) Feedback=NULL(“”);
end

Pseudo Code is tested on activeperl and may need some chnage for linux perl .

Follow

Get every new post delivered to your Inbox.