XML processing in ETL if xml schema is dynamic with Performance Improvement Suggestions:
At first , I thought to write an effiicient parser ( tree parser ) using shell and that might reduce execution time . But, it was of no use as it started taking more time than abinitio modules for xml tree parsing ( I wrote one in shell using awk and smaller commands for one special case -only elements and no attributes .) .Therefore despite a strong shell guy, for language considerations, I finally agreed with
Jack Schwartz’s notes for xml parser coding. Further search revealed, perl has already number of xml parsers and these are really fast enough ( as perl is interpreted language) . For perl xml parser distributions: Refer: http://cpan.uwinnipeg.ca/dist/XML/ . There is no meaning of reinventing the wheel .But, something in c++ is still needed for improved performance of tree xml parser . That can be a task for c++ masters and this is the area where shell fails to give good performance .
dom parser object ) .
However, it requires the parsing of the entire XML document, as well as the creation of the tree data structure, before the processing and business logic
take place. Since the tree data structure is generally stored in memory, these methods have much larger memory feeding than other methods. The problem is exacerbated by the fact that storing the document in memory as a tree needs several times as much storage as the original XML document did. For larger documents both of these can be significant — the parsing and tree creation time become substantial, and the memory requirements can overrun the available resources.
I think abinitio also in turn uses perl xml (::dom) package or some c++ dom xml parser for tree type xml parsing ( if we will write xml parsing xfr using abinitio xml methods ).Though, I am not sure .
Removal of namespaces (in element name ) should not be left on abinitio and I think it should be done before handing data to abinitio
If you dont need data from attributes of xml element then there is no need of keeping attributes in input file and one should only conc on improving performance and getting task done .
So, I think we should even remove attributes from xml data before handing data to abinitio ( it is anyways optional). Major performance boost will take placeby ch anging the parser ( tree to simple) .
Input XML File —-> Filter (If needed ) —>Perl Script ( To remove attributes if needed and namespaces ) In Run Program Comp ( MFS Layout ) –> Perl Script
MFS Output File —-> Your Transformation Logics —> ..and so on .
Treat “(” with “” in sample data wherever tags are opened and closed ( due to html formating it is shown like this ) :
(TRAN_DATA)
(Timestamp)2008010112:13:14(/Timestamp)
(cust_id)12345(/cust_id)
(cust_nm)R.K(/cust_nm)
(city)kolkata(/city)
(sex)M(/sex)
(msg)
(request)
(orderid)1234(/orderid)
(tran_event)
(prodid)12(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)Laptop(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)56000(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)76(/prodid)
(DiscFlag)0(/DiscFlag)
(prodname)BedSheet(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)800(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)92(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)TV(/prodname)
(agentid type=”permanent”)6597(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)20000(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(tran_event)
(prodid)2(/prodid)
(DiscFlag)1(/DiscFlag)
(prodname)Bucket(/prodname)
(agentid type=”contract”)6547(/agentid)
(strt_ts)2008010112:13:18(/strt_ts)
(end_ts)2008010112:13:44(/end_ts)
(amount)100(/amount)
(PayMode)Card(/PayMode)
(/tran_event)
(/request)
(response)
(Total_Num)4(/Total_Num)
(Total_Amount)76900(/Total_Amount)
(Total_Discount)12000(/Total_Discount)
(Total_Net)64900(/Total_Net)
(DeliveredFlg)1(/DeliveredFlg)
(Feedback)Satisfied(/Feedback)
(/response)
(/msg)
(/TRAN_DATA)
string(“|”) Timestamp=NULL(“”);
string(“|”) cust_id=NULL(“”);
string(“|”) cust_nm
string(“|”) city=NULL(“”);
string(“|”) sex=NULL(“”);
string(“|”) orderid=NULL(“”);
string(“|”) prodid=NULL(“”);
string(“|”) DiscFlag=NULL(“”);
string(“|”) prodname=NULL(“”);
string(“|”) agentid=NULL(“”);
string(“|”) agent_type=NULL(“”);
string(“|”) strt_ts=NULL(“”);
string(“|”) end_ts=NULL(“”);
string(“|”) amount=NULL(“”);
string(“|”) PayMode=NULL(“”);
string(“|”) Total_Num=NULL(“”);
string(“|”) Total_Amount=NULL(“”);
string(“|”) Total_Discount=NULL(“”);
string(“|”) Total_Net=NULL(“”);
string(“|”) DeliveredFlg=NULL(“”);
string(“\n”) Feedback=NULL(“”);
end
#!/usr/bin/perl -w
# Auhtor: Chandra Tewary
# Date:2008-07-13
# Program to parse xml data into normal file
# Use xml parser module
use XML::Simple;
# Create xml object
my $xml = new XML::Simple;
# Read xml input file in array
my @arr = ;
my $data = $xml->XMLin(“@arr”);
# Named values Of Elements
$f1=”Timestamp”;
$f2=”cust_id”;
$f3=”cust_nm”;
$f4=”city”;
$f5=”sex”;
$f6=”msg”;
$f7=”request”;
$f8=”orderid”;
$f9=”tran_event”;
$f10=”prodid”;
$f11=”DiscFlag”;
$f12=”strt_ts”;
$f13=”end_ts”;
$f14=”prodname”;
$f15=”agentid”;
$f16=”amount”;
$f17=”PayMode”;
$f18=”response”;
$f19=”Total_Num”;
$f20=”Total_Amount”;
$f21=”Total_Discount”;
$f22=”Total_Net”;
$f23=”DeliveredFlg”;
$f24=”Feedback”;
$cont=”content”;
my @farr = grep (/\/tran_event/,@arr) ;
my $cnt=$#farr + 1;
for ($i=0 ; $i{$f6}->{$f7}->{$f8}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f10}|
$data->{$f6}->{$f7}->{$f9}->[$i]->{$f11}|$data->{$f6}->{$f7}->{$f9}
->[$i]->{$f12}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f13}|$data->{$f6}->
{$f7}->{$f9}->[$i]->{$f14}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f15}->{$cont}|
$data->{$f6}->{$f7}->{$f9}->[$i]->{$f15}->{type}|$data->{$f6}->{$f7}->{$f9}->
[$i]->{$f16}|$data->{$f6}->{$f7}->{$f9}->[$i]->{$f17}|$data->{$f6}->{$f18}->
{$f19}|$data->{$f6}->{$f18}->{$f20}|$data->{$f6}->{$f18}->{$f21}|$data->
{$f6}->{$f18}->{$f22}|$data->{$f6}->{$f18}->{$f23}|$data->{$f6}->
{$f18}->{$f24}\n”;
}
C:\pls>xml_parse.pl tran.xml
2008010112:13:14|12345|R.K|kolkata|M|1234|12|1|2008010112:13:18|
2008010112:13:44|Laptop|6547|contract|56000|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|76|0|2008010112:13:18|
2008010112:13:44|BedSheet|6547|contract|800|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|92|1|2008010112:13:18|
2008010112:13:44|TV|6597|permanent|20000|Card|4|76900|12000|64900|
1|Satisfied
2008010112:13:14|12345|R.K|kolkata|M|1234|2|1|2008010112:13:18|
2008010112:13:44|Bucket|6547|contract|100|Card|4|76900|12000|64900|
1|Satisfied
This data will be readable by abinitio using below record format :
string(“|”) Timestamp=NULL(“”);
string(“|”) cust_id=NULL(“”);
string(“|”) cust_nm
string(“|”) city=NULL(“”);
string(“|”) sex=NULL(“”);
string(“|”) orderid=NULL(“”);
string(“|”) prodid=NULL(“”);
string(“|”) DiscFlag=NULL(“”);
string(“|”) prodname=NULL(“”);
string(“|”) agentid=NULL(“”);
string(“|”) agent_type=NULL(“”);
string(“|”) strt_ts=NULL(“”);
string(“|”) end_ts=NULL(“”);
string(“|”) amount=NULL(“”);
string(“|”) PayMode=NULL(“”);
string(“|”) Total_Num=NULL(“”);
string(“|”) Total_Amount=NULL(“”);
string(“|”) Total_Discount=NULL(“”);
string(“|”) Total_Net=NULL(“”);
string(“|”) DeliveredFlg=NULL(“”);
string(“\n”) Feedback=NULL(“”);
end
2 comments