Tuesday, March 6, 2012

Aggregate Component very slow?

Does anyone know of a way to speed up an aggregation? I am pulling in a flat file that goes to an aggregate component. The flat file has approx 10m records and I am grouping by the first 14 columns, the input is sorted in that order. There are no more than 2 rows for each grouping and most have one. The problem is that it takes about 2 minutes to read in the flat file, but 30 min later the aggregation has only put out about 30k rows. I know that there will be around 9m rows after grouping so obviously it is going to take forever to get through all of them. Any ideas would be appreciated. We are doing a POC for SSIS and judging from the posts that I've seen sorting and aggregation seem to have a lot of perfomance issues, which surprizes me in an ETL tool. We are running on a dedicated Xeon box with two 3GHZ cpus and 4GB of ram, so the problem is not there.

Thank You!
Harry
GuideOne InsuranceHello.

Strange. I have had stunning performance on everything I have tried. Haven't tried dual 64-bit, though. Does the same happen on a 32-bit workstation? Have you tried to set processor affinity on the process so that it only uses one processor?

A humble suggestion: Minimize size of data flow tasks. Move functionality to SQL tasks.

I believe I have about the same amount of data you have (10M records, 10-20 columns of text which I pull via FTP and from some in-house systems, I have been running it on SQL 2005 for a years time)

I have found it to make sense to move much of my SSIS Data Flow based functionality into control flow SQL tasks, which I mirror as stored procedures for reference and testing.

This has a number of advantages
1. Any changes in structure is easier.
2. Documentation is easier (I find it easier to print stored procedures than data flow components)
3. Testing and iterativ development is easier.
4. Special needs are easier met. I have found writing CLR code to run inside the server much, much easier to develop than script-based tasks (I have used both). Writing custom data flow components were a real nightmare on the beta I tried it. An extra bonus has been the user defined aggregates, which makes my life easier.
5. Trust. I trust SQL "GROUP BY" more than data flow aggregate. Same goes for sorting.

Disadvantages with moving functionality from data flow to control flow has been
1. Keeping two identical sources is error prone (stored procedures that mimics SQL Server tasks)
2. Paralellism. Since a task needs to be finished before the next one can start, this should teoretically take more time. Still, the new engine is extremely performant, so I have met my requirements easily. Getting the data via FTP has been the bottleneck.
3. Storage. Obviously, saving temporary data along the way puts extra pressure on the storage system. I actually used to run defrag automatically before my dts package, don't know if it helped, but I don think it can hurt.

Just wanted to share some experiences.

Hope this is helpful.|||Gorm,

Thank you for the insights, they are very helpful. The data that I am trying to aggregate is from a flat file that has been ftp'ed from a z/os box. What I ended up doing is creating a "scratch" table, loading the flat file to there and grouping it with a query from the scratch table. That ran reasonably quick. I agree that SQL 2005 seems to do a great job on most things that I have tried. I does still bother me that SSIS has performance issues with sorts, aggregates and merges. I would think those would be the best performers in an ETL tool. That is going to hurt them in a lot of shops. I have managed to find work arounds for most of them doing it, as you it sounds like you are, in the SQL. We are currently running 32 bit. We originally had the 64 bit stood up, but there were a lot of issues there so we backed off. We are just evaluating the product for our warehousing which currently runs on DB2 7.2 on a z/os box. I really haven't tried much with the stored procedure route yet, but that is great thinking. Do you know of some good resource manual/tutorials for SP devlopment?

Once again thank you for your help and suggestions!!
Harry
GuideOne Insurance

No comments:

Post a Comment