Performance tuning in PDI is something you need to learn and stay cognizant of as you go through your development cycle.
In the fast-becoming-outdated world of waterfall, where performance tuning could easily be an entire work breakdown structure on your project plan, you could build for functionality and then iterate for performance.
In the agile world that does not hold. Performance tuning should be something you do during development, since with agile you are already working iteratively.
PDI gives you several places for performance tuning, and we will cover those in subsequent posts. For this particular post we are looking at the 'Change number of copies to start' option, available when you right-click any step in your transformation.
By selecting this option on the menu and changing the number from, say, '1' to '2', you are telling PDI to execute multiple copies of that step. Why is this important? Because it allows you to fully utilize the resources available on your server by running the copies of that step in parallel. What happens under the covers will be covered later.
When changing the number of copies to start, it's very important that you also change the type of Data Movement to 'Distribute'. If you don't, you could wind up with duplicate rows in your target table.
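To see why the data movement setting matters, here is a toy model in Python. It is not PDI code and does not use any PDI API; the function names are made up for illustration. It assumes, as the warning above implies, that a non-distributing mode hands every row to every copy, while 'Distribute' hands each row to exactly one copy in round-robin order.

```python
def distribute(rows, copies):
    """Round-robin: each row goes to exactly one step copy (hypothetical model)."""
    buckets = [[] for _ in range(copies)]
    for i, row in enumerate(rows):
        buckets[i % copies].append(row)
    return buckets


def copy_to_all(rows, copies):
    """Assumed non-distributing behavior: every step copy receives every row."""
    return [list(rows) for _ in range(copies)]


def rows_written(buckets):
    """Everything the step copies would write to the target table, combined."""
    return sorted(row for bucket in buckets for row in bucket)


rows = [1, 2, 3, 4, 5]
written_distributed = rows_written(distribute(rows, 2))
written_copied = rows_written(copy_to_all(rows, 2))

print(written_distributed)  # each row appears once
print(written_copied)       # each row appears once per copy
```

With two copies, the distributed run writes each row once, while the copied run writes each row twice, which is where the dupes come from.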
A good formula to start with for figuring out how many processes to start is:
Number of Copies to Start = (Number of Processors -1).
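If you want to work that number out on the server itself, a minimal Python sketch of the formula could look like the following. The function name is hypothetical, and the floor of 1 for single-processor machines is my assumption, since the formula would otherwise give zero copies.

```python
import os


def copies_to_start(processors=None):
    """Starting point from the formula: number of processors minus one.

    Assumption: never return less than 1, so a single-processor
    machine still runs one copy of the step.
    """
    if processors is None:
        # os.cpu_count() can return None when the count is unknown.
        processors = os.cpu_count() or 1
    return max(1, processors - 1)


print(copies_to_start(8))  # an 8-processor server gives 7
print(copies_to_start())   # whatever this machine reports
```

Treat the result as a first guess to benchmark against, not a fixed rule.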
There will be more on this subject as we dive deeper in later posts, but this should be enough to get you started down the path to perfection.
The art is in the science.
Doug W.
I've read about this, but I think it's not always possible to apply. The better steps for using multiple copies are lookup and input/output.
I would appreciate it if you could explain this property in more depth and include a benchmark of both points.
Good blog.