Tuplex Python API

Warning

The Tuplex project is still under development and the API is thus subject to frequent change.

Modules

Typical usage of the Python client involves importing the tuplex module first, e.g.

from tuplex import *

Depending on whether this code is executed from the Python interpreter in interactive mode or by running a file through the interpreter, an interactive FastETL shell providing a REPL (read-evaluate-print loop) may be started.


Context

class tuplex.context.Context(conf=None, **kwargs)[source]
__init__(conf=None, **kwargs)[source]

creates a new Context object, the main entry point for all operations with the Tuplex big data framework

Parameters:
  • conf (str) or (dict) – Can be either the path to a YAML configuration file used to configure this particular Tuplex context, or a dictionary of Tuplex configuration options. For keys and their meaning, see the list of Keyword Arguments below.
  • **kwargs – Arbitrary keyword arguments, see below for more information
Keyword Arguments:
 
  • executorMemory (str) or (int) – Specify how much memory each executor should use. If given as an int, it will be interpreted as a number of bytes. Alternatively, a memory amount can be specified in string syntax, e.g. '1G' for 1 GB of memory.
  • executorCount (int) – Number of executors (threads) to use. Defaults to std::thread::hardware_concurrency()
  • driverMemory (str) or (int) – executorMemory for the driver
  • partitionSize (str) or (int) – executorMemory will be divided into blocks of size partitionSize. This also corresponds more or less 1:1 to the task size and is thus a parameter for tuning parallelism.
  • runTimeMemory (str) or (int) – In addition to the executorMemory, each executor allocates a memory region used to store temporary objects when processing a single tuple, e.g. for string copy operations, arrays, etc. This key allows the memory to be set via a memory string or as an integer number of bytes.
  • runTimeMemoryBlockSize (str) or (int) – Size of blocks used to allocate runTimeMemory
  • useLLVMOptimizer (str) or (bool) – Specify whether LLVM optimizer passes should be applied to the generated LLVM IR.
  • autoUpcast (str) or (bool) – When transferring data to Python, e.g. [1, 3.0, 4.0], the inferred element type will be float. When this parameter is set to True, 1 will be automatically upcast to float and no exception raised. When the parameter is False, the tuple containing 1 will raise a ValueError.
  • allowUndefinedBehavior (str) or (bool) – When set to True, certain errors won't be raised; e.g., division by zero will be ignored. This allows for better speed.
  • scratchDir (str) – Tuplex allows processing of larger-than-memory datasets. If the main memory budget is exceeded, executors will cache files at scratchDir.
  • logDir (str) – Tuplex produces a log file log.txt by default. Use logDir to specify where to store it.
  • historyDir (str) – Tuplex stores its database and logs within this directory when the WebUI is enabled.
  • normalcaseThreshold (float) – threshold used to detect the normal case
  • webui (bool) – whether to use the WebUI. Defaults to True.
  • webui.url (str) – URL of the history server to connect to. Default: localhost
  • webui.port (str) – port to use when connecting to the history server. Default: 6543
  • webui.mongodb.url (str) – URL of the MongoDB storage to connect to. If an empty string, Tuplex will start a local MongoDB instance and shut it down on exit.
  • webui.mongodb.port (int) – port for MongoDB instance
  • webui.mongodb.path (str) – local path where files for the locally started MongoDB instance are stored.
  • webui.exceptionDisplayLimit (int) – maximum number of exceptions to display in the UI; must be at least 1.
  • csv.maxDetectionRows (int) – maximum number of rows to determine types for CSV files.
  • csv.maxDetectionMemory (str) or (int) – maximum number of bytes to use when performing type detection, separator inference, etc. over CSV files.
  • csv.separators (list) – list of single-character strings that are viable separators when autodetecting, e.g. [',', ';', '\t'].
  • csv.quotechar (str) – single character denoting the character used as quote char according to the RFC-4180 standard, e.g. '"'
  • csv.comments (list) – list of single-character strings which indicate the start of a comment line, e.g. ['#', '~']
  • csv.generateParser (str) or (bool) – Whether to use the C++ parser or an LLVM code-generated parser.
  • csv.selectionPushdown (str) or (bool) – When enabled, the physical planner will generate a parser that only serializes the data required within the pipeline.
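
A Context can thus be configured either via keyword arguments or via a conf dictionary. The following is a minimal sketch; the option values shown are illustrative only:

from tuplex import Context

# configure via keyword arguments ...
c = Context(executorMemory='1G', executorCount=4, webui=False)

# ... or, equivalently, via a configuration dictionary
c = Context(conf={'executorMemory': '1G', 'executorCount': 4, 'webui': False})
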
csv(pattern, delimiter=None, header=None, quotechar='"')[source]

reads CSV (comma separated values) files. This function may either be provided with parameters that help determine the delimiter, whether a header is present, or what kind of quote character is used. Overall, CSV parsing is done according to the RFC-4180 standard (cf. https://tools.ietf.org/html/rfc4180)

Parameters:
  • pattern (str) – a file pattern, e.g. /data/file.csv or /data/*.csv or /*/*csv
  • delimiter (str) – optional argument, if set Tuplex will use this as delimiter. If set to None, Tuplex will automatically infer the delimiter.
  • header (bool) – optional argument, if set to None Tuplex will automatically infer whether a header is present or not.
  • quotechar (str) – defines quoting according to RFC-4180.
Returns:

A Tuplex Dataset object that allows further ETL operations

Return type:

tuplex.dataset.DataSet
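
For example, a minimal sketch (file paths are illustrative):

c = Context()

# let Tuplex infer delimiter and header presence automatically
ds = c.csv('/data/*.csv')

# or fix them explicitly
ds = c.csv('/data/file.csv', delimiter=',', header=True, quotechar='"')
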

parallelize(value_list, columns=None)[source]

passes data to the Tuplex framework. Must be a list of primitive objects (e.g. of type bool, int, float, str) or a list of (nested) tuples of these types.

Parameters:
  • value_list (list) – a list of objects to pass to the Tuplex backend.
  • columns (list) – a list of strings, or None, to pass to the Tuplex backend in order to name the columns. Named columns allow for dict-style access within functions.
Returns:

A Tuplex Dataset object that allows further ETL operations

Return type:

tuplex.dataset.DataSet
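
For example, a minimal sketch (data and column names are illustrative):

c = Context()
ds = c.parallelize([(1, 'a'), (2, 'b')], columns=['num', 'letter'])

# named columns permit dict-style access within UDFs
ds.map(lambda x: x['num']).collect()
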


DataSet

class tuplex.dataset.DataSet(columns=[], data=[], name='UnknownDS', parent=None, context=None)[source]
collect()[source]

action that generates a physical plan, processes the data and collects the result as a list of tuples.

Returns:A list of tuples
Return type:(list)
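
E.g., a minimal sketch, assuming a Context object c created as above:

c.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()
# expected result: [2, 4, 6]
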
filter(ftor)[source]

performs a filter operation using the provided UDF over the dataset and returns a dataset for further processing.

Parameters:ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or a reference to a function that returns a boolean. Tuples for which the functor returns True will be kept, the others discarded.
Returns:A Tuplex Dataset object that allows further ETL operations
Return type:tuplex.dataset.DataSet
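
For example, a minimal sketch that keeps only even numbers, assuming a Context object c as above:

c.parallelize([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).collect()
# expected result: [2, 4]
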
map(ftor)[source]

performs a map operation using the provided UDF over the dataset and returns a dataset for further processing.

Parameters:ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or a reference to a function. Currently there are two supported syntactical options for functions: a function may either take a single parameter, which is then interpreted as a tuple of the underlying data, or multiple parameters, e.g. lambda a, b: a + b would sum the two columns. If the signature does not match, Tuplex will point out the mismatch whenever an action is called.
Returns:A Tuplex Dataset object that allows further ETL operations
Return type:tuplex.dataset.DataSet
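
A minimal sketch of both syntactical options, assuming a Context object c as above:

# single parameter, interpreted as the whole tuple
c.parallelize([(1, 2), (3, 4)]).map(lambda x: x[0] + x[1]).collect()

# multiple parameters, one per column
c.parallelize([(1, 2), (3, 4)]).map(lambda a, b: a + b).collect()

# expected result in both cases: [3, 7]
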
mapColumn(column, ftor)[source]

maps a single column directly. The UDF takes as its argument directly the value of the specified column and will overwrite that column with the result. If you need access to multiple columns, use withColumn instead.

Parameters:
  • column (str) – name of the column to map
  • ftor (lambda) or (function) – function to call

Returns:DataSet
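
For example, a minimal sketch (data and column names are illustrative), assuming a Context object c as above:

ds = c.parallelize([(1, 'a'), (2, 'b')], columns=['num', 'letter'])

# the UDF receives only the value of 'num'; its result overwrites that column
ds.mapColumn('num', lambda x: x + 1).collect()
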
resolve(exception, ftor)[source]

adds a resolver operator to the pipeline. The signature of ftor must be identical to that of the preceding operator.

Parameters:
  • exception – which exception to apply the resolution for
  • ftor (lambda) or (function) – a function used to resolve this exception. May itself also throw exceptions.

Returns:DataSet
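
A minimal sketch resolving a division-by-zero error, assuming a Context object c as above; note that the resolver takes the same parameters as the preceding map operator:

ds = c.parallelize([(1, 0), (2, 1)]).map(lambda a, b: a / b)

# rows raising ZeroDivisionError yield 0.0 instead of being reported as exceptions
ds.resolve(ZeroDivisionError, lambda a, b: 0.0).collect()
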
show(nrows=None)[source]

action that generates a physical plan, processes the data and prints the result as a nicely formatted ASCII table to stdout.

Parameters:nrows (int) – number of rows to collect. If None, all rows will be collected.
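
For example, a minimal sketch, assuming a Context object c as above:

# print at most 10 rows as an ASCII table to stdout
c.parallelize([(1, 'a'), (2, 'b')]).show(nrows=10)
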
take(nrows=5)[source]

action that generates a physical plan, processes the data and collects the top nrows results as a list of tuples.

Parameters:nrows (int) – number of rows to collect. Defaults to 5.
Returns:A list of tuples
Return type:(list)
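
For example, a minimal sketch, assuming a Context object c as above:

c.parallelize([1, 2, 3, 4, 5, 6]).take()   # first 5 rows
c.parallelize([1, 2, 3, 4, 5, 6]).take(3)  # first 3 rows
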
withColumn(column, ftor)[source]

appends a new column to the dataset by calling ftor over the existing tuples.

Parameters:
  • column (str) – name for the new column/variable
  • ftor (lambda) or (function) – function to call

Returns:DataSet
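
For example, a minimal sketch (data and column names are illustrative), assuming a Context object c as above:

ds = c.parallelize([(1, 2), (3, 4)], columns=['a', 'b'])

# the UDF sees the full row; its result becomes the new column 'sum'
ds.withColumn('sum', lambda x: x['a'] + x['b']).collect()
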