Sunday, April 4, 2010

How can I achieve this with less code...

Hi Folks,



I have 2 database tables, 'products' and 'products_staging'.
Every day, the 'products_staging' table is updated with a fresh
batch of products (which are imported from xml files in a seperate
process using Navicat software). I need to somehow compare this
database with the 'products' database to look for specific changes
- i.e changes in price, deleted products, or new products. As some
of the fields in the 'products' database are manually altered, such
as the product name, I cannot do a straight sync with them.
Therefore, I use the only field that will remain unchanged as the
comparison field.



The somewhat crude method I have come up with is to loop
through the 'products_staging' table, looking for that specific
field against the 'products' table. If it finds it, it then checks
for a change in the price field (i.e has the price of that product
changed). If there is a change, the 'products' table is updated.
Likewise, if it's not found, it's treated as a new product and
inserted as such and so on. At the end of the loop, the record is
deleted from the 'products_staging' table - therefore at the end of
the process, the table is empty ready for the next import the
following day.



At the moment, the database handles around 300,000 rows - but
at some point I'm likely to be dealing with many more than that.
It's working as it is, but taking a considerable time to do so -
around 4 hours to go through the lot.



Any suggestions on a better way of accomplishing this? My
existing crude code is pasted below (the use of maxrows and the
redirect is to prevent time-outs - parse2.cfm is virtually
identical to this);



How can I achieve this with less code...
silstorm,



I only skimmed your code, so I may be wrong. But couldn't you
do this with a few sql statements? This is untested and obviously
simplified but something like this (MS SQL syntax)



How can I achieve this with less code...
Firstly, if possible, get the DBs to talk to each other
without involving

CF. That will just be slowing things down. CF's for
generating HTML

pages, and whilst it CAN interact with databases, it's not
the best tool

for the job.



If that's not possible, use CF as a bridge between them, but
only to bulk

insert all the staging data into a temporary table (either
literally a

temporary table, or a permanent one used for holding data
temporarily ;-)

on the production DB, and then use a DB procedure to process
all the data,

again using only the DB server to do so.



Lastly, if you MUST use CF to do the data manipulation, then
minimise your

hits to the DB. Instead of doing single hits to the
production DB in your

prodconnect query, get ALL the data you're likely to need for
updates in

one hit, then use CF to generate a payload for a bulk update;
similarly any

missing data from production can be inferred from that the
difference

between the staging and update data, and that can be used as
your source of

payload for a bulk insert.



As a rule of thumb, if I find myself needing to perform
single-row queries

within a loop of another query, I start thinking I'm doing
something wrong.



Oh, yeah: when dealing with large amounts of queries like
that, ALWAYS use

%26lt;cfqueryparam%26gt; tags instead of static SQL strings.
Every non-parameterised

query you pass to the DB engine will need to be compiled
first, and then

it's also cached. Which slows you down, and eats memory on
the DB server.



This will leave the problem that you're not going to be
generating your

report as you go. However I imagine that's a ''nice to have''
compared to

ensuring the thing actually works. You can generate your
report

separately, after the data processing has been done. I'm
fairly certain

you would be able to coerce a log out of the DB server,
somehow, anyway.



--

Adam


%26gt;Firstly, if possible, get the DBs to talk to each other
without involving

%26gt;CF. That will just be slowing things down. CF's for
generating HTML

%26gt;pages, and whilst it CAN interact with databases, it's
not the best tool

%26gt;for the job.



I was focused on eliminating the looping first. But I agree
completely.



The poster mentioned that the data was imported daily. Both
of the cfquery's appear to use the same datasource. So I assumed
(perhaps falsely) that the the tables were either in the same
database or that databases could already communicate with each
other.



%26gt;You can generate your report separately, after the data
processing has

%26gt; been done. I'm fairly certain you would be able to
coerce a log out of the

%26gt; DB server, somehow, anyway.



Agreed. As for the rest of your comments...



''Yeah. What he said'' ;-)




Thanks to both of you for your replies. Naturally, if I can
take CF out of the equation it would be the best route, so I will
experiment with direct database queries (new ground for me!)



It's almost 1am here now and much beer has been consumed, so
I will look into what you have both said tomorrow and update
accordingly. I wanted to thank you both for your input before I
headed off for the night.

much beer has been consumed

Tut tut tut. Doesn't fire the Coldfusion neurons as well as
coffee.








%26gt;
much beer has been consumed

%26gt; Tut tut tut. Doesn't fire the Coldfusion neurons as well
as coffee.



;-)



I'm not sure that @ 1am on a Saturday night / Sunday morning
the CF neurons

are *supposed* to be firing! Beer sounds like a much more
appropriate

notion.



--

Adam


silstorm



I would take CF out of the process completely. Below is some
TRANSACT_SQL that can be run on the database to do this job.

It will look for change in description, change in price,
Added records to product table that are new to staging table,
deleted records from the products table that are missing from the
staging table, and clear the staging table.



It requires a table to archive the deleted products rows.

Also, the productID cannot be changed (Primary Key)



You should later add error trapping with COMMIT TRANSACTION
on no errors.



Hope this helps. MUCH FASTER!!



CREATE PROCEDURE [dbo].[testStaging] AS



--PRODUCT_STAGE DESC CHANGE

UPDATE product

set .product.ProdDesc =product_stage.ProdDesc

from product inner join product_stage on product.prodID =
product_stage.prodID

where product.prodDesc %26lt;%26gt; product_stage.prodDesc



--PRODUCT_STAGE PRICE CHANGE

UPDATE product

set product.ProdPrice = product_stage.ProdPrice

from product inner join .product_stage on product.prodID =
product_stage.prodID

where product.ProdPrice %26lt;%26gt; product_stage.ProdPrice



--NEW RECORDS IN PRODUCT_STAGE

INSERT into product

select ps.prodID, ps.ProdDesc, ps.prodPrice

from product_Stage ps where ps.prodID not IN(select prodID
from Product)







--MISSING RECORDS IN PRODUCT_STAGE

--MOVE DELETE RECORD TO ARCHIVE TABLE

INSERT into product_delArc

select prodID, ProdDesc, prodPrice, USER, GETDATE()

from product where prodID not IN(select prodID from
Product_Stage)



--DELETE RECORD FROM PRODUCT TABLE

DELETE

from product where prodID not IN(select prodID from
Product_Stage)





--CLEAR PRODUCT_STAGE

delete product_stage from

product_stage ps inner join product p on p.prodID = ps.prodID

GO

No comments:

Post a Comment