Tuplex Python API

Warning

The Tuplex project is still under development and the API is thus subject to frequent change.

Modules

Typical usage of the Python client involves importing the tuplex module first, e.g.

from tuplex import *

Depending on whether this code is executed from the Python interpreter in interactive mode or by running a file through the interpreter, an interactive FastETL shell providing a REPL (read-evaluate-print loop) may be started.
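
For orientation, here is a minimal end-to-end sketch that uses only the API documented below; the data and the lambda are illustrative:

from tuplex import *

c = Context()
ds = c.parallelize([1, 2, 3, 4])
print(ds.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]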


Context

class tuplex.context.Context(conf=None, name='', **kwargs)[source]
__init__(conf=None, name='', **kwargs)[source]

creates a new Context object, the main entry point for all operations with the Tuplex big data framework

Parameters
  • conf (str) or (dict) – Can be either the path to a YAML configuration file that is used to configure this particular Tuplex context or a dictionary with Tuplex configuration options. For keys and their meaning see the list of Keyword Arguments below (a short configuration example follows that list).

  • name (str) – An optional name can be given to the context object. When given an empty string, Tuplex will choose a random name.

  • **kwargs – Arbitrary keyword arguments, confer Keyword Arguments section for more information.

Keyword Arguments
  • executorMemory (str) or (int) – Specify how much memory each executor should use. If given as an int, it will be interpreted as the number of bytes. Alternatively, a memory amount can be specified in string syntax, e.g. ‘1G’ for 1GB of memory.

  • executorCount (int) – Number of executors (threads) to use. Defaults to std::thread::hardware_concurrency()

  • driverMemory (str) or (int) – executorMemory for the driver

  • partitionSize (str) or (int) – executorMemory will be divided into blocks of size partitionSize. This also corresponds more or less 1:1 to the task size and is thus a parameter to tune parallelism.

  • runTimeMemory (str) or (int) – Besides the executorMemory, each executor allocates a memory region used to store temporary objects when processing a single tuple, e.g. for string copy operations, arrays, etc. This key allows setting that memory via a memory string or as an integer in bytes.

  • runTimeMemoryBlockSize (str) or (int) – Size of blocks used to allocate runTimeMemory

  • useLLVMOptimizer (str) or (bool) – Specify whether LLVM Optimizers should be applied to generated LLVM IR or not.

  • autoUpcast (str) or (bool) – When transferring data to Python, e.g. [1, 3.0, 4.0], the inferred type will be float. When this parameter is set to True, 1 will be automatically cast to float and no exception is raised. If the parameter is False, the tuple holding the data 1 will raise a ValueError.

  • allowUndefinedBehavior (str) or (bool) – When set to true, certain errors won’t be raised, e.g. division by zero will be ignored. This allows for better speed.

  • scratchDir (str) – Tuplex allows processing of larger-than-memory datasets. If the main memory budget is exceeded, executors will cache files at scratchDir.

  • logDir (str) – Tuplex produces a log file log.txt by default. Use logDir to specify where to store it.

  • historyDir (str) – Tuplex stores the database and logs within this directory when the webui is enabled.

  • normalcaseThreshold (float) – used to detect the normal case

  • webui (bool) – whether to use the WebUI interface. By default true.

  • webui.url (str) – URL where to connect to for history server. Default: localhost

  • webui.port (str) – port to use when connecting to history server. Default: 6543

  • webui.mongodb.url (str) – URL where to connect to MongoDB storage. If empty string, Tuplex will start and exit a local mongodb instance.

  • webui.mongodb.port (int) – port for MongoDB instance

  • webui.mongodb.path (str) – local path where to store files for MongoDB instance to be started.

  • webui.exceptionDisplayLimit (int) – How many exceptions to display in UI max, must be at least 1.

  • csv.maxDetectionRows (int) – maximum number of rows to determine types for CSV files.

  • csv.maxDetectionMemory (str) or (int) – maximum number of bytes to use when performing type detection, separator inference, etc. over CSV files.

  • csv.separators (list) – list of single character strings that are viable separators when autodetecting. E.g. [',', ';', '\t'].

  • csv.quotechar (str) – single character denoting the character that is used as quote char according to RFC-4180 standard. E.g. '"'

  • csv.comments (list) – list of single-character strings which indicate the start of a comment line, e.g. ['#', '~']

  • csv.generateParser (str) or (bool) – Whether to use the C++ parser or an LLVM code-generated parser

  • csv.selectionPushdown (str) or (bool) – When enabled, the physical planner will generate a parser that only serializes data required within the pipeline.
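
As a short, non-authoritative configuration sketch, a Context may be created with a dictionary of the options listed above; the values chosen here are arbitrary examples rather than recommended settings:

from tuplex import *

conf = {'executorCount': 4,
        'executorMemory': '2G',
        'driverMemory': '2G',
        'webui': False}
c = Context(conf=conf, name='example-context')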

cp(pattern, target_uri)[source]
copies all files matching the pattern to a target uri. If more than one file is found, a folder is created containing all the files relative to the longest shared path prefix.

Parameters
  • pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem i.e. file://.

  • target_uri – a uri, i.e. path prefixed with s3:// or file://. If no prefix is used, defaults to file://

Returns: None

csv(pattern, columns=None, header=None, delimiter=None, quotechar='"', null_values=[''], type_hints={})[source]

reads csv (comma separated values) files. This function may either be provided with parameters that help to determine the delimiter, whether a header is present, or what kind of quote char is used. Overall, CSV parsing is done according to the RFC-4180 standard (cf. https://tools.ietf.org/html/rfc4180)

Parameters
  • pattern (str) – a file glob pattern, e.g. /data/file.csv or /data/*.csv or /*/*csv

  • columns (list) – optional list of columns, will be used as header for the CSV file. If header is True, the first line will be automatically checked against the column names. If header is None, then it will be inferred whether a header is present and a check against the columns performed.

  • header (bool) – optional argument, if set to None Tuplex will automatically infer whether a header is present or not.

  • delimiter (str) – optional argument, if set Tuplex will use this as delimiter. If set to None, Tuplex will automatically infer the delimiter.

  • quotechar (str) – defines quoting according to RFC-4180.

  • null_values (list) – list of strings to be identified as null value, i.e. they will be parsed as None

  • type_hints (dict) – dictionary of hints for column types. Columns can be indexed using either integers or strings.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
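
A hedged usage sketch for csv(); the path, null values and type hints are illustrative, and the assumption that type hints are given as typing-module types follows the schema convention described for parallelize below:

import typing
from tuplex import *

c = Context()
ds = c.csv('/data/sales*.csv',
           header=True,
           null_values=['', 'NA'],
           type_hints={0: typing.Optional[int]})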

ls(pattern)[source]

return a list of strings of all files found matching the pattern. The same pattern can be supplied to read inputs.

Parameters

pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem i.e. file://.

Returns: list of strings
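
The file system helpers cp, ls and rm share the same pattern semantics; a small sketch with illustrative paths (the bucket name is hypothetical):

c = Context()
print(c.ls('file:///data/*.csv'))                # list matching files
c.cp('/data/*.csv', 's3://example-bucket/bak/')  # copy to a target uri
c.rm('/tmp/tuplex_scratch/*')                    # remove matching files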

options(nested=False)[source]

retrieves all framework parameters as dictionary

Parameters
  • nested (bool) – When set to true, this will return a nested dictionary. May be helpful to provide a better overview.

Returns

dictionary with configuration keys and values for this context

optionsToYAML(file_path='config.yaml')[source]

saves options as a YAML file to a (local) file path

Parameters

file_path (str) – local filepath where to store file

parallelize(value_list, columns=None, schema=None)[source]

passes data to the Tuplex framework. Must be a list of primitive objects (e.g. of type bool, int, float, str) or a list of (nested) tuples of these types.

Parameters
  • value_list (list) – a list of objects to pass to the Tuplex backend.

  • columns (list) – a list of strings or None to pass to the Tuplex backend in order to name the columns. This then allows dict access to columns within functions.

  • schema – a schema defined as a tuple of typing types. If None, the most likely schema will be inferred.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
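
A minimal sketch of parallelize with named columns (the column names are illustrative):

c = Context()
ds = c.parallelize([(1, 'a'), (2, 'b'), (3, 'c')], columns=['id', 'label'])
ds.show()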

rm(pattern)[source]

removes all files matching the pattern.

Parameters

pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem i.e. file://.

Returns: None

text(pattern, null_values=None)[source]

reads text files.

Parameters
  • pattern (str) – a file glob pattern, e.g. /data/file.csv or /data/*.csv or /*/*csv

  • null_values (List[str]) – a list of strings to interpret as None. When an empty list or None, empty lines will be read as the empty string ''

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
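
A short sketch of text(); the path and the null value list are illustrative:

c = Context()
lines = c.text('/data/logs/*.txt', null_values=['\\N'])
print(lines.collect())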


DataSet

class tuplex.dataset.DataSet[source]
aggregate(combine, aggregate, initial_value)[source]
Parameters
  • combine – a UDF to combine two aggregates (results of the aggregate function or the initial_value)

  • aggregate – a UDF which produces a result

  • initial_value – a neutral initial value.

Returns

Dataset
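
The calling convention of the two UDFs is not spelled out above; the sketch below assumes the combine UDF receives two partial aggregates and the aggregate UDF receives (aggregate, row), computing a plain sum:

c = Context()
total = c.parallelize([1, 2, 3, 4]) \
         .aggregate(lambda a, b: a + b,          # combine two partial aggregates
                    lambda agg, row: agg + row,  # fold one row into an aggregate
                    0) \
         .collect()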

aggregateByKey(combine, aggregate, initial_value, key_columns)[source]
Parameters
  • combine – a UDF to combine two aggregates (results of the aggregate function or the initial_value)

  • aggregate – a UDF which produces a result

  • initial_value – a neutral initial value.

  • key_columns – the columns to group the aggregate by

Returns

Dataset
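
Analogously, a hedged sketch for aggregateByKey; the column names, and the assumption that the aggregate UDF receives the full row, are illustrative:

c = Context()
ds = c.parallelize([('a', 1), ('a', 2), ('b', 5)], columns=['key', 'value'])
per_key = ds.aggregateByKey(lambda a, b: a + b,
                            lambda agg, row: agg + row[1],
                            0,
                            ['key']).collect()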

cache(store_specialized=True)[source]

materializes rows in main memory for reuse across several pipelines. Can also be used to benchmark certain pipeline costs

Parameters

store_specialized (bool) – whether to store the normal case and general case separately or merge everything into one normal case. This affects optimizations for operators called on a cached dataset.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
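
A brief sketch of caching a parsed file once and reusing it in several pipelines (the path is illustrative):

c = Context()
cached = c.csv('/data/large*.csv').cache()
a = cached.filter(lambda x: x[0] > 0).collect()
b = cached.map(lambda x: x[0]).collect()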

collect()[source]

action that generates a physical plan, processes the data and then collects the result as a list of tuples.

Returns

A list of tuples, or values if the dataset has only one column.

Return type

(list)

property columns

retrieve names of columns if assigned

Returns

Returns None if columns haven’t been named yet or a list of strings representing the column names.

Return type

None or List[str]

property exception_counts

dictionary mapping exception class names to integer counts. Returns None if an error occurred in the dataset. Note that Python has an exception hierarchy, e.g. an IndexError is a LookupError. The counts returned here correspond to whatever type is actually raised.

filter(ftor)[source]

performs a filter operation using the provided UDF over the dataset and returns a dataset for further processing.

Parameters

ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or a reference to a function that returns a boolean. Tuples for which the functor returns True will be kept, the others discarded.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
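
For example, keeping only positive values (a minimal sketch):

c = Context()
pos = c.parallelize([-2, -1, 0, 1, 2]).filter(lambda x: x > 0).collect()  # [1, 2]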

ignore(eclass)[source]

ignores exceptions of type eclass caused by previous operator

Parameters

eclass – exception type/class to ignore

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet

join(dsRight, leftKeyColumn, rightKeyColumn, prefixes=None, suffixes=None)[source]

(inner) join with another dataset.

Parameters
  • dsRight – other dataset

  • leftKeyColumn – column name of the column to use as key in the caller

  • rightKeyColumn – column name of the column to use as key in the dsRight dataset

  • prefixes – tuple or list of 2 strings. One element may be None.

  • suffixes – tuple or list of 2 strings. One element may be None.

Returns: Dataset

leftJoin(dsRight, leftKeyColumn, rightKeyColumn, prefixes=None, suffixes=None)[source]

left (outer) join with another dataset.

Parameters
  • dsRight – other dataset

  • leftKeyColumn – column name of the column to use as key in the caller

  • rightKeyColumn – column name of the column to use as key in the dsRight dataset

  • prefixes – tuple or list of 2 strings. One element may be None.

  • suffixes – tuple or list of 2 strings. One element may be None.

Returns: Dataset
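
A hedged sketch of join and leftJoin on two small in-memory datasets; the column names are illustrative:

c = Context()
left = c.parallelize([(1, 'apple'), (2, 'pear')], columns=['id', 'fruit'])
right = c.parallelize([(1, 0.5)], columns=['id', 'price'])
inner = left.join(right, 'id', 'id').collect()       # rows with matching keys only
outer = left.leftJoin(right, 'id', 'id').collect()   # keeps left rows without a match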

map(ftor)[source]

performs a map operation using the provided udf function over the dataset and returns a dataset for further processing.

Parameters

ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or a reference to a function. Currently there are two supported syntactical options for functions: a function may either take a single parameter, which is then interpreted as a tuple of the underlying data, or a list of parameters, e.g. lambda a, b: a + b would sum the two columns. If there is no match, Tuplex will point out the mismatch whenever an action is called.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
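
Both syntactical options mentioned above, shown in a small sketch:

c = Context()
ds = c.parallelize([(1, 2), (3, 4)])
sums = ds.map(lambda a, b: a + b).collect()   # multi-parameter form, one parameter per column
firsts = ds.map(lambda x: x[0]).collect()     # single parameter interpreted as the whole tuple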

mapColumn(column, ftor)[source]

maps a single column directly. The UDF takes the value of the specified column as its argument and will overwrite that column with the result. If you need access to multiple columns, use withColumn instead. If the column name already exists, it will be overwritten.

Parameters
  • column (str) – name for the column to map

  • ftor – function to call

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
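
For instance, normalizing a single column in place (the column name is illustrative):

c = Context()
ds = c.parallelize([('Alice', 30), ('BOB', 25)], columns=['name', 'age'])
ds = ds.mapColumn('name', lambda name: name.lower())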

renameColumn(oldColumnName, newColumnName)[source]

rename a column in the dataset.

Parameters
  • oldColumnName (str) – old column name. Must exist.

  • newColumnName (str) – new column name

Returns

Dataset

resolve(eclass, ftor)[source]

Adds a resolver operator to the pipeline. The signature of ftor needs to be identical to the one of the preceding operator.

Parameters
  • eclass – Which exception to apply resolution for, e.g. ZeroDivisionError

  • ftor – A function used to resolve this exception. May also produce exceptions.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
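
A hedged sketch combining map and resolve to handle division by zero; note that the resolver has the same signature as the preceding map UDF:

c = Context()
out = c.parallelize([(10, 2), (4, 0), (9, 3)]) \
       .map(lambda a, b: a / b) \
       .resolve(ZeroDivisionError, lambda a, b: 0.0) \
       .collect()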

selectColumns(columns)[source]

selects a subset of columns as defined through columns, which is a list or a single column

Parameters

columns – list of strings or integers. A string references a column name, whereas an integer refers to an index. Indices may be negative according to Python rules. The order in the list determines the output order.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
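
For example, projecting two columns by name (a small sketch; integer indices such as [1, 0] could be used instead):

c = Context()
ds = c.parallelize([(1, 'a', True)], columns=['id', 'label', 'flag'])
ds.selectColumns(['label', 'id']).show()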

show(nrows=None)[source]

action that generates a physical plan, processes data and prints results as nicely formatted ASCII table to stdout.

Parameters

nrows (int) – number of rows to collect. If None all rows will be collected

take(nrows=5)[source]

action that generates a physical plan, processes the data and then collects the top results as a list of tuples.

Parameters

nrows (int) – number of rows to collect. By default 5.

Returns

A list of tuples

Return type

(list)
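
The three actions side by side (a small sketch):

c = Context()
ds = c.parallelize(list(range(10)))
ds.show(nrows=3)           # print the first 3 rows as an ASCII table
first5 = ds.take()         # list with the first 5 rows
everything = ds.collect()  # list with all rows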

tocsv(path, part_size=0, num_rows=9223372036854775807, num_parts=0, part_name_generator=None, null_value=None, header=True)[source]

save dataset to one or more csv files. Triggers execution of the pipeline.

Parameters
  • path – path where to save files to

  • part_size – optional size in bytes for each part not to exceed

  • num_rows – limit number of output rows

  • num_parts – number of parts to split output into. The last part will be the smallest.

  • part_name_generator – optional name generator function for the output parts; receives an integer for the part number. This is intended as a convenience helper function. Should not raise any exceptions.

  • null_value – string to represent null values. None equals empty string. Must provide explicit quoting for this argument.

  • header – bool indicating whether to write a header, or a list of strings specifying the header to write explicitly. The number of names provided must match the column count.
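
A hedged sketch of writing results out as CSV; the output path is illustrative:

c = Context()
ds = c.parallelize([(1, 'a'), (2, 'b')], columns=['id', 'label'])
ds.tocsv('file:///tmp/output', num_parts=1, header=True)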

property types

output schema as list of type objects of the dataset. If the dataset has an error, None is returned.

Returns

detected types (general case) of dataset. Typed according to typing module.

unique()[source]

removes duplicates from the Dataset (out-of-order). Equivalent to a DISTINCT clause in a SQL statement.

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
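
For example, deduplicating a small list (the order of the result is not guaranteed):

c = Context()
distinct = c.parallelize([1, 1, 2, 3, 3]).unique().collect()  # e.g. [1, 2, 3] in some order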

withColumn(column, ftor)[source]

appends a new column to the dataset by calling ftor over existing tuples

Parameters
  • column – name for the new column/variable. If column exists, its values will be replaced

  • ftor – function to call

Returns

A Tuplex Dataset object that allows further ETL operations

Return type

tuplex.dataset.DataSet
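
For instance, deriving a new column from existing ones; dict-style access assumes the dataset has named columns as described under parallelize:

c = Context()
ds = c.parallelize([(2, 3), (5, 7)], columns=['a', 'b'])
ds = ds.withColumn('sum', lambda x: x['a'] + x['b'])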