Running Experiments

To benchmark Tuplex against other ETL frameworks, the easiest way is to use Amazon Web Services (AWS). To facilitate installation and benchmarking, the folder scripts/aws contains a number of bash scripts that help with spinning up on-demand AWS EC2 instances and running some of the preconfigured experiments.

The first step is to install and configure the aws-cli. For example, on Mac OS X run

brew install aws-cli
aws configure

Enter your credentials and proceed to the next step.
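Before moving on, it can help to confirm that the CLI is installed and that the credentials you entered are accepted. The following is a minimal sketch, not part of the provided scripts: the function name check_aws_credentials is hypothetical, and it relies only on the standard aws sts get-caller-identity call, which reports the identity behind the configured credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped in scripts/aws): returns 0 if the AWS CLI
# is available and the configured credentials are accepted by AWS.
check_aws_credentials() {
    # Fail early if the aws command is not available at all.
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws CLI not found; install it first" >&2
        return 1
    fi
    # Ask AWS who we are; this fails if the credentials are missing or invalid.
    if ! aws sts get-caller-identity >/dev/null 2>&1; then
        echo "credentials rejected; re-run 'aws configure'" >&2
        return 1
    fi
    echo "AWS credentials OK"
}
```

After sourcing the snippet, calling check_aws_credentials prints "AWS credentials OK" on success and returns a non-zero status otherwise.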

Configuring a Virtual Private Cloud

Since it takes some time to correctly configure a virtual private cloud (VPC) and set up all network settings in a secure way, a setup script is provided within this directory. To set up a VPC, run

./setup-aws.sh

The wizard will ask which availability zone should be used and will create a key pair if it doesn’t exist yet. (Note: If you want to use a key pair that already exists, just enter its name and make sure it has been added on AWS EC2.) The script will create two files:

  1. <keyname>-config.json contains parameters of the configured virtual private cloud (vpc), security group (sg) and subnet as well as the key.
  2. <keyname>-remove.sh is a bash script to shut down/remove the created virtual private cloud.
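To confirm that the setup step produced both files, a small check along the following lines may be useful. This is only a sketch: the helper names config_file, remove_script and check_setup_output are hypothetical, and it assumes the files are written to the current directory using the naming scheme listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped in scripts/aws).
# Derive the two file names that setup-aws.sh is described as creating.
config_file()   { echo "$1-config.json"; }
remove_script() { echo "$1-remove.sh"; }

# Returns 0 if both files exist in the current directory, 1 otherwise.
check_setup_output() {
    local keyname="$1" missing=0 f
    for f in "$(config_file "$keyname")" "$(remove_script "$keyname")"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For a key pair named mykey, check_setup_output mykey would look for mykey-config.json and mykey-remove.sh and report whichever one is absent.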

Launching experiments

Each experiment should run entirely from a single script. To create a custom experiment, use an existing experiment script as a template. To start an experiment, simply run

./runcxxexception_experiment.sh <keyname>-config.json

for example. The following experiments are available:

  1. runcxxexception_experiment.sh benchmarks different exception handling mechanisms on an m5.large instance (50GB EBS / 8GB memory / 4GB swapfile) using gcc-6. It creates a file results_cxxexception.csv with the result data.
  2. parallelizeAndSquare_experiment.sh benchmarks Pandas/Spark2.3/Tuplex on a simple map function.
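When launching experiments repeatedly, a thin wrapper that checks its inputs before starting anything on AWS can save a failed run. The sketch below is an illustration only: run_experiment is a hypothetical name, and it assumes the experiment script and the config file both sit in the current directory and that the script takes the config file as its only argument, as in the invocation shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not shipped in scripts/aws): runs one experiment
# script against the config file created for the given key name.
run_experiment() {
    local keyname="$1" experiment="$2"
    local config="${keyname}-config.json"
    # The config file is produced by the setup step; refuse to run without it.
    if [ ! -f "$config" ]; then
        echo "config not found: $config (run ./setup-aws.sh first)" >&2
        return 1
    fi
    # The experiment script must exist and be executable.
    if [ ! -x "./$experiment" ]; then
        echo "experiment script not found or not executable: $experiment" >&2
        return 1
    fi
    # Same call as running the script by hand with the config file.
    "./$experiment" "$config"
}
```

For example, run_experiment mykey runcxxexception_experiment.sh is equivalent to calling ./runcxxexception_experiment.sh mykey-config.json after the two checks pass. Once all experiments are finished, the generated <keyname>-remove.sh script can be used to remove the virtual private cloud; whether that script expects any arguments is not stated here, so check it before running.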