Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has Python APIs similar to those of Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bitcode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that let Tuplex provide speed comparable to that of a pipeline written in hand-optimized C++.
Because Tuplex compiles data science pipelines with inline Python to native code, it runs them 5–91x faster than systems that call into a Python interpreter.
Tuplex makes wrangling data easy: it works interactively in the Python toplevel, integrates with Jupyter Notebooks, and provides familiar APIs, all backed by its data-driven compiler. Tuplex jobs never crash on malformed inputs because Tuplex's dual-mode execution model separates the common-case inputs from exception-case inputs (e.g., malformed data, wrong types) and reports them separately.
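To make the idea of separating common-case rows from exception-case rows concrete, here is a minimal sketch in plain Python. It is not Tuplex's API or implementation: `run_dual_mode` is a hypothetical helper, and both paths here run in the interpreter, whereas Tuplex compiles the common-case path to native code. The sketch only shows the reporting behavior described above, where a bad row is set aside instead of aborting the job.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_dual_mode(
    rows: Iterable[Any], fn: Callable[[Any], Any]
) -> Tuple[List[Any], List[Tuple[int, Any, str]]]:
    """Apply fn to every row, keeping successes and failures apart.

    Hypothetical helper for illustration only; not part of Tuplex.
    """
    results: List[Any] = []
    exceptions: List[Tuple[int, Any, str]] = []
    for index, row in enumerate(rows):
        try:
            # Common case: the row has the expected shape and type.
            results.append(fn(row))
        except Exception as exc:
            # Exception case: record the row and the error type, then move on.
            exceptions.append((index, row, type(exc).__name__))
    return results, exceptions


rows = [1, 2, None, 4, "five"]
results, exceptions = run_dual_mode(rows, lambda x: (x, x * x))

print(results)
print(exceptions)
```

The job finishes even though two of the five rows cannot be squared, and the failing rows are reported with their position and error type so they can be inspected or handled separately.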
|Linux, Python 3.7–3.9:|$ pip install tuplex|
|macOS, Catalina or later:|$ docker run -p 8888:8888 tuplex/tuplex|
|Development version from our GitHub repository:|$ git clone https://github.com/tuplex/tuplex|
Leonhard F. Spiegelberg, Rahul Yesantharao, Malte Schwarzkopf and Tim Kraska. Tuplex: Data Science in Python at Native Code Speed. Proceedings of SIGMOD 2021, June 2021. URL: https://doi.org/10.1145/3448016.3457244
Leonhard F. Spiegelberg and Tim Kraska. Tuplex: Robust, Efficient Analytics When Python Rules (demo paper). Proceedings of the VLDB Endowment, 12(12):1958–1961, August 2019. URL: https://doi.org/10.14778/3352063.3352109
|Andrew Wei||Andy Ly||Benjamin Givertz|
|Colby Anderson||Yunzhi Shao||Raghu Nimmagadda|