The easiest way to benchmark Tuplex against other ETL frameworks is to use Amazon Web Services. To facilitate installation and benchmarking, the folder scripts/aws contains a number of bash scripts that help with spinning up (on-demand) AWS EC2 instances and running some of the preconfigured experiments.
The first step is to install and configure the aws-cli. For example, under Mac OS X run
brew install aws-cli
aws configure
Enter your credentials and proceed to the next step.
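As a quick sanity check before moving on, the following sketch verifies that the aws command is available. The function name check_aws_cli is hypothetical and not part of scripts/aws; aws sts get-caller-identity is a standard aws-cli call that reports which account the configured credentials belong to.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/aws): checks that the aws CLI is installed.
check_aws_cli() {
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws-cli not found on PATH; install it first" >&2
        return 1
    fi
    # Print the installed version for reference.
    aws --version
}

# Usage, once `aws configure` has been run:
#   check_aws_cli && aws sts get-caller-identity
```

If the second command in the usage line fails, the credentials entered during aws configure are most likely missing or invalid.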
Configuring a Virtual Private Cloud
Since it takes some time to correctly configure a virtual private cloud (VPC) and set up all network settings in a secure way, an easy setup script is provided in this directory. To set up a VPC, run this script.
The wizard asks which availability zone should be used and creates a key pair if it does not exist yet. (Note: if you want to use a key pair that already exists, just enter its name and make sure it has been added on AWS EC2.) The script will create two files:
<keyname>-config.json contains the parameters of the configured virtual private cloud (vpc), security group (sg) and subnet, as well as the key.
<keyname>-remove.sh is a bash script to shut down/remove the created virtual private cloud.
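To confirm that the setup produced both files, a small check along these lines may help. It is only a sketch: check_vpc_files is a hypothetical helper, not one of the provided scripts, and it assumes the two files were written to the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/aws): verifies that the two files
# generated by the VPC setup exist for a given key name.
# Assumption: the files are located in the current working directory.
check_vpc_files() {
    keyname="$1"
    for f in "${keyname}-config.json" "${keyname}-remove.sh"; do
        if [ -f "$f" ]; then
            echo "found $f"
        else
            echo "missing $f"
            return 1
        fi
    done
}

# Usage (replace mykey with the key name entered in the wizard):
#   check_vpc_files mykey
```

Keeping the remove script around matters, since it is the documented way to shut down the VPC once benchmarking is finished.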
Each experiment should run entirely from a single script. To create a custom experiment, use an existing experiment script as a starting point. To start an experiment, simply run its script, for example one of those listed below. The following experiments are available:
runcxxexception_experiment.sh benchmarks different exception handling mechanisms on m5.large / 50GB EBS / 8GB Memory / 4GB Swapfile using gcc-6. Creates a file results_cxxexception.csv with the result data.
parallelizeAndSquare_experiment.sh benchmarks Pandas/Spark 2.3/Tuplex on a simple map function.
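Once an experiment has produced a result file such as results_cxxexception.csv, a quick preview can confirm that data was written. The sketch below makes no assumption about the column layout; preview_results is a hypothetical helper, and where the file ends up depends on the experiment script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/aws): prints the line count and the
# first few lines of a result CSV. The default file name is taken from the
# runcxxexception experiment; pass another path as the first argument.
preview_results() {
    file="${1:-results_cxxexception.csv}"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    # wc -l counts newline-terminated lines, including any header line.
    echo "lines: $(wc -l < "$file" | tr -d ' ')"
    head -n 5 "$file"
}

# Usage:
#   preview_results results_cxxexception.csv
```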