Performance tuning in PDI is something you need to learn and keep in mind throughout your development cycle.
In the fast-fading world of waterfall projects, where performance tuning could be an entire work breakdown structure on your project plan, you could build for functionality first and then iterate for performance.
In the agile world that no longer holds. Performance tuning should be part of development itself, since with agile you are already working iteratively.
PDI gives you several places to tune performance, and we will cover those in subsequent posts. This post looks at the 'Change number of copies to start' option, available when you right-click any step in your transformation.
By selecting this option and changing the number from, say, '1' to '2', you are telling PDI to execute multiple copies of that step. Why is this important? Because it lets you make fuller use of the resources available on your server by running copies of that step in parallel. What happens under the covers will be covered later.
When you change the number of copies to start, it is very important that you also set the Data Movement type to 'Distribute'. If you don't, you could wind up with duplicate rows in your target table.
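To see why the Data Movement setting matters, here is a toy model in Python. It is only a sketch of the idea, not how PDI is implemented: it assumes that 'Distribute' hands each row to exactly one copy in round-robin order, and that the alternative hands every row to every copy. The function names are made up for this illustration.

```python
def distribute(rows, copies):
    """Toy 'Distribute': each row goes to exactly one copy, round-robin."""
    buckets = [[] for _ in range(copies)]
    for i, row in enumerate(rows):
        buckets[i % copies].append(row)
    return buckets


def copy_to_all(rows, copies):
    """Toy alternative: every copy receives every row."""
    return [list(rows) for _ in range(copies)]


rows = ["a", "b", "c", "d"]

# Pretend each copy writes everything it receives to the target table.
written_distribute = [r for bucket in distribute(rows, 2) for r in bucket]
written_copy = [r for bucket in copy_to_all(rows, 2) for r in bucket]

print(len(written_distribute))  # same number of rows as the input
print(len(written_copy))        # twice as many rows: every row written twice
```

In this model the second setting writes each row once per copy, which is exactly the kind of duplication you do not want in a target table.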
A good formula to start with for figuring out how many processes to start is:
Number of Copies to Start = (Number of Processors - 1)
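If you want to work that number out on the server itself, a few lines of Python will do it. This is just a sketch of the formula above. The helper name is made up, and the floor of 1 for single-processor machines is my own assumption, since the formula would otherwise give zero.

```python
import os


def copies_to_start(processors=None):
    """Starting point from the formula: processors - 1, never below 1."""
    if processors is None:
        # os.cpu_count() can return None, so fall back to 1 (assumption).
        processors = os.cpu_count() or 1
    return max(1, processors - 1)


print(copies_to_start(8))  # 7
print(copies_to_start())   # depends on the machine you run it on
```

Treat the result as a first guess to test against, not a final setting.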
There will be more on this subject as we dive deeper in later posts, but this should be enough to get you started down the path to perfection.
The art is in the science.
Doug W.
Learning to Dive a Swamp
Welcome to my PDI blog. Here we will keep the topics short and informative, yet there will be something here for everyone. Whether you are looking to snorkel around for best practices or take a two-tank dive with the crocodile hunter, it's all going to be here.
Thursday, September 1, 2011
Pentaho PDI Tip: Performance Tuning- Change Number of Copies to Start
Pentaho PDI Tip: Copy Files Step- Using Regular Expressions
A great way to process files in a Job is to use the 'Copy Files' step.
You can use this step in any number of ways, from post-transform processing of data in FTP'd files, to backup and archive directories, to building simple utilities.
One of the gotchas that I see often is what to put in the wildcard field. This field is a regular expression, so if you are on a Windows machine and want all of the Excel files in a directory, *.xls won't quite get you there.
Converting that to a regular expression would read: ^.*\.xls$ The backslash escapes the dot so that it matches a literal period, and the $ anchors the match at the end of the file name.
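A quick way to check a pattern before putting it in the wildcard field is to try it against a few sample names. The sketch below uses Python's re module, which is an assumption of convenience: PDI runs on Java, and while the two regex flavors agree on a simple pattern like this one, it is worth confirming in PDI itself. The file names are invented, and the pattern is a strict form with the dot escaped and the end anchored.

```python
import re

# The shell-style wildcard is not a valid regular expression at all.
try:
    re.compile("*.xls")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

# Any characters, then a literal ".xls" at the end of the name.
pattern = re.compile(r"^.*\.xls$")

names = ["report.xls", "budget.xlsx", "notes_xls.txt", "xls"]
matches = [n for n in names if pattern.fullmatch(n)]

print(glob_is_valid_regex)  # False
print(matches)              # ['report.xls']
```

Only the name that really ends in .xls gets through. If you also want .xlsx files, the pattern needs to say so.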
There are many sites out there that can help you get up to speed on regular expressions, and once you get the hang of them, they can make your life quite a bit easier when dealing with strings and string processing.
In the context of string searching, happy hunting.
Doug W.
Labels: Copy Files, PDI, Pentaho Data Integration, Regular Expressions, Tip