Getting started

Download

Tuplex is available for macOS and Linux. The current version has been tested on macOS 10.13-10.14 and Ubuntu 16.04 LTS. Tuplex has not yet been officially released; to obtain a copy, please contact the database group at Brown.

Tuplex is not yet available via PyPI, and there are no prebuilt binaries for Mac or Linux. To run Tuplex, you therefore need to compile it from source.


Installation

Tuplex consists of three components: the Python frontend, the C++ backend, and the Tuplex WebUI. The frontend and backend packages are located in the <build-prefix>/dist/python folder and can be installed via

python3 setup.py install

or for development mode

python3 setup.py develop

Analogously, to use the Tuplex WebUI, the corresponding package needs to be installed.
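After installing, a quick way to confirm that the package is visible to the interpreter is to ask Python's import machinery for it. The following is a minimal sketch; the helper name is_installed is hypothetical and not part of Tuplex, and the check only tells you whether the package can be located, not whether the C++ backend loads correctly.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located by the import system."""
    # find_spec returns None when no module of that name can be found.
    return importlib.util.find_spec(package_name) is not None


# "tuplex" is the package name used throughout this guide.
print("tuplex installed:", is_installed("tuplex"))
```

If this prints False, the setup.py step was most likely run with a different Python interpreter than the one you are using now.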


Basic usage

Tuplex behaves like a standard Python package and can be installed using pip. After building the project, go into dist/python, which should contain a standard Python setup.py file. To test whether the package works, type python3 in a terminal

python3

which should start the CPython interpreter in interactive mode.

Python 3.7.2 (default, Jan 13 2019, 12:50:15)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Typing import tuplex imports the system. Tuplex then automatically creates a Tuplex interactive shell within the interpreter.

Welcome to

_____            _
|_   _|   _ _ __ | | _____  __
 | || | | | '_ \| |/ _ \ \/ /
 | || |_| | |_) | |  __/>  <
 |_| \__,_| .__/|_|\___/_/\_\ v0.1.2rc
          |_|

using Python 3.7.1 (default, Nov 28 2018, 11:51:54)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Interactive Shell mode
>>>

To create a context object, which serves as the central entry point for all pipelines, type c = Context() or c = tuplex.Context(). Depending on your configuration, this might take a while, since Tuplex may start up its history server. To run a simple job, execute e.g. the following snippet.

c.parallelize([1, 2, 3]).map(lambda x: x * x).collect()

Alternatively, you can write your pipeline in a file pipeline.py and execute it via python3 pipeline.py. Tuplex can also be used in Jupyter notebooks.

[Image: _images/jupyter.png]

You can find more examples on how to use the Tuplex Python API under Examples.
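For the script-based workflow mentioned above, a pipeline.py could be laid out roughly as follows. This is a sketch under assumptions: the helper names square_all and main are hypothetical, and only Context, parallelize, map and collect are taken from this guide. The import is guarded so that the script reports a missing installation instead of failing with a traceback.

```python
# pipeline.py
import sys


def square_all(context, values):
    """Square every element of `values` using the given context object."""
    return context.parallelize(values).map(lambda x: x * x).collect()


def main():
    try:
        import tuplex
    except ImportError:
        # Tuplex has to be built from source and installed first.
        print("tuplex is not installed in this interpreter", file=sys.stderr)
        return 1
    print(square_all(tuplex.Context(), [1, 2, 3]))
    return 0


if __name__ == "__main__":
    main()
```

Keeping the pipeline in a function that receives the context makes it straightforward to reuse the same code from a script, the interactive shell, or a notebook.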


Overview

A simple example of Tuplex in action is

from tuplex import *
c = Context()
res = c.parallelize([1, 2, 3, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (3, 9), (4, 16)]
print(res)

This produces a list of tuples whose second component holds the square of the first.

Now imagine a user would like to execute the following snippet:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)

Using regular Python or Apache Spark, this would result in an error and the job would crash. Tuplex, however, collects tuples that cause errors in a special memory region, which allows them to be resolved later using its API.

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]) \
       .resolve(NullException, 0) \
       .map(lambda x: (x, x * x)).collect()
# prints [(1, 1), (2, 4), (0, 0), (4, 16)]
print(res)
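The behaviour shown in the last two snippets can be imitated in plain Python to make the idea concrete. The sketch below is not how Tuplex is implemented; run_with_resolver and its replacements argument are hypothetical names, and plain Python raises TypeError for None * None where the Tuplex example uses NullException. Inputs that raise are either set aside or, if a replacement value is registered for the exception type, retried with that replacement.

```python
def run_with_resolver(values, udf, replacements=None):
    """Apply `udf` to each value without aborting on the first failure.

    Inputs whose exception type has a registered replacement are retried
    with that replacement; all other failing inputs are collected separately.
    """
    replacements = replacements or {}
    results, unresolved = [], []
    for value in values:
        try:
            results.append(udf(value))
        except Exception as exc:
            if type(exc) in replacements:
                # Retry the UDF with the replacement input.
                results.append(udf(replacements[type(exc)]))
            else:
                # Keep the offending input and its error for later inspection.
                unresolved.append((value, exc))
    return results, unresolved


def square(x):
    return (x, x * x)


rows, errors = run_with_resolver([1, 2, None, 4], square)
print(rows)         # [(1, 1), (2, 4), (4, 16)]
print(len(errors))  # 1

rows, errors = run_with_resolver([1, 2, None, 4], square, {TypeError: 0})
print(rows)         # [(1, 1), (2, 4), (0, 0), (4, 16)]
```

The two printed lists match the outputs of the Tuplex snippets above: without a resolver the failing row is dropped from the result and kept aside, and with a replacement of 0 it reappears as (0, 0).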

Of course, this toy example does not really require a sophisticated resolution mechanism. However, when processing large quantities of input files, a job may fail after an unpredictable amount of time: in the best case after seconds, in the worst case after a couple of days or even weeks. In most frameworks, making a pipeline robust to its input data means writing more tests, exercising the pipeline in different scenarios, and implementing use-case-specific error handling, all at the expense of speed and of writing pipelines efficiently. Tuplex helps to solve this problem by treating errors as first-class citizens, which makes pipeline deployment and maintenance easier, with close-to-zero overhead for exception handling.

Moreover, its new execution model makes it possible to speed up ETL tasks by generating artificial exceptions and resolving them automatically.

Core classes

tuplex.Context Main object used to construct an ETL pipeline.

tuplex.DataSet Abstraction holding a list of tuples. DataSets are mapped, filtered, or processed in a monadic way using user-defined functions (UDFs).


Compiling Tuplex from source

The following prerequisites are required to compile Tuplex from scratch. For Ubuntu 16.04 Xenial, docker/install_dependencies_xenial.sh provides a script to install all necessary dependencies. On macOS, brew provides a convenient way to install them.

INSTALL_PREFIX=/usr/local

# brew packages
brew install doxygen
brew install bison
brew install boost-python3
brew install llvm
brew install graphviz
brew install cmake

# Celero (for benchmarking)
git clone https://github.com/DigitalInBlue/Celero.git /tmp/Celero && \
pushd /tmp/Celero && \
git checkout tags/v2.2.0 && \
mkdir build && \
cd build && \
cmake -DBUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} .. && \
make -j 4 && \
make install && \
popd && \
rm -rf /tmp/Celero

On macOS, you can alternatively run the install_dependencies.sh script.

Tuplex uses CMake to build the backend. Further, the following dependencies need to be installed in order to compile the project successfully:

  • Python 3.5-3.7
  • Boost framework 1.66.0 with Python3 support
  • LLVM 5.0.0
  • yaml-cpp 0.6.2

Further, the following build tools need to be installed

Dependencies that are automatically installed by CMake include

To build the project, use the usual cmake approach:

mkdir build
cd build
cmake ..
cmake --build . --config Release
make test

If Boost is installed in a non-standard directory, you can specify its location via -DBOOST_LIBRARYDIR=<boostdir>.

To get more detailed output for the gtests, run the test executable of a specific target directly, e.g. for the codegen component:

./dist/bin/testcodegen

To execute the Python end-to-end tests, go to <build-prefix>/dist/python. Then run

python3 -m pytest

to execute all unit tests (you need to have the pytest package installed).