Tuplex Python API¶
Warning
The Tuplex project is still under development and the API is thus subject to frequent change.
Modules¶
Typical usage of the Python client involves importing the tuplex module first, e.g.
from tuplex import *
Depending on whether this code is executed from the Python interpreter in interactive mode or by passing a file to the interpreter, an interactive FastETL shell providing a REPL (read-evaluate-print loop) may be started.
Context¶
-
class tuplex.context.Context(conf=None, name='', **kwargs)[source]¶
-
__init__
(conf=None, name='', **kwargs)[source]¶ creates a new Context object, the main entry point for all operations with the Tuplex big data framework
- Parameters
conf (str) or (dict) – Can be either the path to a YAML configuration file used to configure this particular Tuplex context, or a dictionary with Tuplex configuration options. For keys and their meaning, see the list of Keyword Arguments below.
name (str) – An optional name for the context object. When given an empty string, Tuplex will choose a random name.
**kwargs – Arbitrary keyword arguments, confer Keyword Arguments section for more information.
- Keyword Arguments
executorMemory (str) or (int) – Specify how much memory each executor should use. If given as int, it is interpreted as a number of bytes. Alternatively, a memory amount can be specified in string syntax, e.g. '1G' for 1GB of memory.
executorCount (int) – Number of executors (threads) to use. Defaults to std::thread::hardware_concurrency().
partitionSize (str) or (int) – executorMemory will be divided into blocks of size partitionSize. This also corresponds roughly 1:1 to the task size and is thus a parameter to tune parallelism.
runTimeMemory (str) or (int) – Besides executorMemory, each executor allocates a memory region used to store temporary objects while processing a single tuple, e.g. for string copy operations, arrays, etc. This key sets that memory via a memory string or as an integer in bytes.
runTimeMemoryBlockSize (str) or (int) – Size of the blocks used to allocate runTimeMemory.
useLLVMOptimizer (str) or (bool) – Specify whether LLVM Optimizers should be applied to generated LLVM IR or not.
autoUpcast (str) or (bool) – When transferring data to Python, e.g. [1, 3.0, 4.0], the inferred type will be float. When this parameter is set to True, 1 will be automatically cast to float and no exception raised. When the parameter is False, a tuple with data 1 will raise a ValueError.
allowUndefinedBehavior (str) or (bool) – When set to true, certain errors won't be raised, e.g. division by zero will be ignored. This allows for better speed.
scratchDir (str) – Tuplex allows processing of larger-than-memory datasets. If the main memory budget is exceeded, executors will cache files at scratchDir.
logDir (str) – Tuplex produces a log file log.txt by default. Use logDir to specify where to store it.
historyDir (str) – Tuplex stores the database and logs within this directory when the webui is enabled.
normalcaseThreshold (float) – threshold used to detect the normal case
webui (bool) – whether to enable the WebUI interface. Defaults to true.
webui.url (str) – URL where to connect to for history server. Default: localhost
webui.port (str) – port to use when connecting to history server. Default: 6543
webui.mongodb.url (str) – URL where to connect to MongoDB storage. If empty string, Tuplex will start and exit a local mongodb instance.
webui.mongodb.port (int) – port for MongoDB instance
webui.mongodb.path (str) – local path where to store files for MongoDB instance to be started.
webui.exceptionDisplayLimit (int) – maximum number of exceptions to display in the UI. Must be at least 1.
csv.maxDetectionRows (int) – maximum number of rows to determine types for CSV files.
csv.maxDetectionMemory (str) or (int) – maximum number of bytes to use when performing type detection, separator inference, etc. over CSV files.
csv.separators (list) – list of single-character strings that are viable separators when autodetecting, e.g. [',', ';', '\t'].
csv.quotechar (str) – single character used as the quote char according to the RFC-4180 standard, e.g. '"'
csv.comments (list) – list of single-character strings which indicate the start of a comment line, e.g. ['#', '~']
csv.generateParser (str) or (bool) – whether to use the C++ parser or an LLVM code-generated parser
csv.selectionPushdown (str) or (bool) – when enabled, the physical planner will generate a parser that only serializes data that is required within the pipeline.
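The options above can be combined into a configuration dictionary passed to the Context constructor. A minimal sketch; the chosen values are illustrative, and the try/except keeps the snippet runnable without a Tuplex installation:

```python
# Sketch: configuring a Context via a dictionary of the documented options.
# Values are illustrative, not recommendations.
conf = {
    'executorMemory': '2G',   # per-executor memory budget
    'executorCount': 4,       # number of worker threads
    'partitionSize': '32MB',  # task granularity
    'webui': False,           # disable the history server UI
}

try:
    from tuplex import Context
    c = Context(conf=conf)    # equivalently: Context(**conf)
except ImportError:
    c = None                  # Tuplex is not installed in this environment
```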
-
cp
(pattern, target_uri)[source]¶ copies all files matching the pattern to a target URI. If more than one file is found, a folder is created containing all the files relative to the longest shared path prefix.
- Parameters
pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem, i.e. file://.
target_uri – a URI, i.e. a path prefixed with s3:// or file://. If no prefix is used, defaults to file://.
Returns: None
-
csv
(pattern, columns=None, header=None, delimiter=None, quotechar='"', null_values=[''], type_hints={})[source]¶ reads CSV (comma separated values) files. This function may be provided with parameters that help determine the delimiter, whether a header is present, or what kind of quote char is used. Overall, CSV parsing is done according to the RFC-4180 standard (cf. https://tools.ietf.org/html/rfc4180)
- Parameters
pattern (str) – a file glob pattern, e.g. /data/file.csv or /data/*.csv or /*/*csv
columns (list) – optional list of columns, will be used as header for the CSV file. If header is True, the first line will be automatically checked against the column names. If header is None, then it will be inferred whether a header is present and a check against the columns performed.
header (bool) – optional argument, if set to None Tuplex will automatically infer whether a header is present or not.
delimiter (str) – optional argument, if set Tuplex will use this as delimiter. If set to None, Tuplex will automatically infer the delimiter.
quotechar (str) – defines quoting according to RFC-4180.
null_values (list) – list of strings to be identified as null value, i.e. they will be parsed as None
type_hints (dict) – dictionary of hints for column types. Columns can be index either using integers or strings.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
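A hedged sketch of reading a CSV file with null handling and a type hint. The file name and columns are made up for illustration; the fallback branch mirrors the documented parsing behavior in plain Python when Tuplex is not installed:

```python
import csv
import os
import tempfile

# Write a tiny CSV file so the sketch is self-contained; 'rides.csv' and
# its columns are invented for illustration.
path = os.path.join(tempfile.mkdtemp(), 'rides.csv')
with open(path, 'w', newline='') as f:
    f.write('id,fare\n1,12.5\n2,\n')

try:
    from tuplex import Context
    # null_values=[''] turns empty fields into None; type_hints pins
    # column 1 ('fare') to float. Hints may be keyed by index or name.
    rows = Context().csv(path, header=True, null_values=[''],
                         type_hints={1: float}).collect()
except ImportError:
    # Plain-Python fallback with the same semantics.
    with open(path, newline='') as f:
        rows = [(int(r['id']), float(r['fare']) if r['fare'] else None)
                for r in csv.DictReader(f)]
```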
-
ls
(pattern)[source]¶ returns a list of strings of all files found matching the pattern. The same pattern can be supplied to read inputs.
- Parameters
pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem, i.e. file://.
Returns: list of strings
-
options
(nested=False)[source]¶ retrieves all framework parameters as a dictionary
- Parameters
nested (bool) – When set to true, this will return a nested dictionary, which may be helpful to provide a better overview.
- Returns
dictionary with configuration keys and values for this context
-
optionsToYAML
(file_path='config.yaml')[source]¶ saves options as a YAML file to a (local) file path
- Parameters
file_path (str) – local file path where to store the file
-
parallelize
(value_list, columns=None, schema=None)[source]¶ passes data to the Tuplex framework. Must be a list of primitive objects (e.g. of type bool, int, float, str) or a list of (nested) tuples of these types.
- Parameters
value_list (list) – a list of objects to pass to the Tuplex backend.
columns (list) – a list of strings, or None, to pass to the Tuplex backend in order to name the columns. This then allows dict access within UDFs.
schema – a schema defined as a tuple of typing types. If None, the most likely schema will be inferred.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
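A sketch of passing named columns to parallelize; column names then permit dict-style access inside UDFs. The data is illustrative, and the fallback shows the same semantics in plain Python:

```python
# Sketch: naming columns when passing data to the backend.
rows = [(1, 'apple', 0.5), (2, 'pear', 0.8)]
columns = ['id', 'name', 'price']

try:
    from tuplex import Context
    names = (Context()
             .parallelize(rows, columns=columns)
             .map(lambda r: r['name'])   # dict access via column name
             .collect())
except ImportError:
    # Same semantics: each row viewed as a dict keyed by column name.
    names = [dict(zip(columns, r))['name'] for r in rows]
```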
-
rm
(pattern)[source]¶ removes all files matching the pattern
- Parameters
pattern – a UNIX wildcard pattern with a prefix like s3:// or file://. If no prefix is specified, defaults to the local filesystem, i.e. file://.
Returns: None
-
text
(pattern, null_values=None)[source]¶ reads text files.
- Parameters
pattern (str) – a file glob pattern, e.g. /data/file.csv or /data/*.csv or /*/*csv
null_values (List[str]) – a list of strings to interpret as None. When an empty list or None, empty lines will be read as the empty string ''
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
DataSet¶
-
class tuplex.dataset.DataSet[source]¶
-
aggregate
(combine, aggregate, initial_value)[source]¶ - Parameters
combine – a UDF to combine two aggregates (results of the aggregate function or the initial_value)
aggregate – a UDF which produces a result
initial_value – a neutral initial value.
- Returns
Dataset
-
aggregateByKey
(combine, aggregate, initial_value, key_columns)[source]¶ - Parameters
combine – a UDF to combine two aggregates (results of the aggregate function or the initial_value)
aggregate – a UDF which produces a result
initial_value – a neutral initial value.
key_columns – the columns to group the aggregate by
- Returns
Dataset
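A sketch of aggregateByKey semantics under the signatures documented above: per-key sums over an 'amount' column. The column names and data are illustrative, and the plain-Python fallback computes the same grouping:

```python
# aggregate folds one row into a running aggregate; combine merges two
# partial aggregates; 0 is the neutral initial value.
rows = [('BOS', 10), ('NYC', 5), ('BOS', 7)]

try:
    from tuplex import Context
    res = (Context()
           .parallelize(rows, columns=['city', 'amount'])
           .aggregateByKey(lambda a, b: a + b,                    # combine
                           lambda agg, row: agg + row['amount'],  # aggregate
                           0,                                     # initial value
                           ['city'])                              # key columns
           .collect())
except ImportError:
    # Equivalent plain-Python grouping.
    sums = {}
    for city, amount in rows:
        sums[city] = sums.get(city, 0) + amount
    res = sorted(sums.items())
```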
-
cache
(store_specialized=True)[source]¶ materializes rows in main memory for reuse across several pipelines. Can also be used to benchmark certain pipeline costs
- Parameters
store_specialized (bool) – whether to store the normal case and general case separately, or merge everything into one normal case. This affects optimizations for operators called on a cached dataset.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
collect
()[source]¶ action that generates a physical plan, processes the data and collects the result as a list of tuples.
- Returns
A list of tuples, or values if the dataset has only one column.
- Return type
(list)
-
property
columns
¶ retrieve names of columns if assigned
-
property
exception_counts
¶ dictionary mapping exception class names to integer counts. Returns None if an error occurred in the dataset. Note that Python has an exception hierarchy, e.g. an IndexError is a LookupError. The counts returned here correspond to whatever type is actually raised.
-
filter
(ftor)[source]¶ performs a filter operation using the provided UDF over the dataset and returns a dataset for further processing.
- Parameters
ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or an identifier of a function, that returns a boolean. Tuples for which the functor returns True will be kept, the others discarded.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
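A sketch of filter: only rows for which the predicate returns True are kept. The data is illustrative, and the fallback reproduces the semantics in plain Python:

```python
# Keep only even numbers; the predicate returns a boolean per row.
nums = [1, 2, 3, 4, 5]

def keep_even(x):
    return x % 2 == 0

try:
    from tuplex import Context
    evens = Context().parallelize(nums).filter(keep_even).collect()
except ImportError:
    evens = [x for x in nums if keep_even(x)]  # same semantics in plain Python
```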
-
ignore
(eclass)[source]¶ ignores exceptions of type eclass caused by the previous operator
- Parameters
eclass – exception type/class to ignore
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
join
(dsRight, leftKeyColumn, rightKeyColumn, prefixes=None, suffixes=None)[source]¶ (inner) join with another dataset
- Parameters
dsRight – other dataset
leftKeyColumn – column name of the column to use as key in the caller
rightKeyColumn – column name of the column to use as key in the dsRight dataset
prefixes – tuple or list of 2 strings. One element may be None.
suffixes – tuple or list of 2 strings. One element may be None.
Returns: Dataset
-
leftJoin
(dsRight, leftKeyColumn, rightKeyColumn, prefixes=None, suffixes=None)[source]¶ left (outer) join with another dataset
- Parameters
dsRight – other dataset
leftKeyColumn – column name of the column to use as key in the caller
rightKeyColumn – column name of the column to use as key in the dsRight dataset
prefixes – tuple or list of 2 strings. One element may be None.
suffixes – tuple or list of 2 strings. One element may be None.
Returns: Dataset
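A sketch contrasting join (inner) with leftJoin (left outer). The column names and the resulting column order are illustrative assumptions; the fallback shows the intended semantics in plain Python:

```python
# Inner join drops unmatched keys; left join keeps them with None.
left = [(1, 'apple'), (2, 'pear'), (3, 'kiwi')]
right = [(1, 0.5), (2, 0.8)]

try:
    from tuplex import Context
    c = Context()
    l = c.parallelize(left, columns=['id', 'name'])
    r = c.parallelize(right, columns=['id', 'price'])
    inner = l.join(r, 'id', 'id').collect()      # id 3 has no match: dropped
    outer = l.leftJoin(r, 'id', 'id').collect()  # id 3 kept, price is None
except ImportError:
    prices = dict(right)
    inner = [(i, n, prices[i]) for i, n in left if i in prices]
    outer = [(i, n, prices.get(i)) for i, n in left]
```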
-
map
(ftor)[source]¶ performs a map operation using the provided UDF over the dataset and returns a dataset for further processing.
- Parameters
ftor (lambda) or (function) – a lambda function, e.g. lambda x: x, or an identifier of a function. Currently there are two supported syntactical options for functions: a function may either take a single parameter, which is then interpreted as a tuple of the underlying data, or a list of parameters, e.g. lambda a, b: a + b would sum the two columns. If there is no match, Tuplex will point out the mismatch whenever an action is called.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
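The two UDF signatures for map can be sketched as follows; the data is illustrative, and the fallback reproduces the semantics in plain Python:

```python
# A single-parameter UDF receives the whole tuple; a multi-parameter UDF
# receives one column per parameter.
pairs = [(1, 2), (3, 4)]

try:
    from tuplex import Context
    c = Context()
    sums = c.parallelize(pairs).map(lambda a, b: a + b).collect()  # per column
    firsts = c.parallelize(pairs).map(lambda t: t[0]).collect()    # whole tuple
except ImportError:
    sums = [a + b for a, b in pairs]
    firsts = [t[0] for t in pairs]
```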
-
mapColumn
(column, ftor)[source]¶ maps a single column directly. The UDF takes the value of the specified column as its argument, and the result overwrites that column. If you need access to multiple columns, use withColumn instead. If the column name already exists, it will be overwritten.
- Parameters
column (str) – name of the column to map
ftor – function to call
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
renameColumn
(oldColumnName, newColumnName)[source]¶ rename a column in the dataset
- Parameters
oldColumnName (str) – old column name. Must exist.
newColumnName (str) – new column name
- Returns
Dataset
-
resolve
(eclass, ftor)[source]¶ adds a resolver operator to the pipeline. The signature of ftor needs to be identical to that of the preceding operator.
- Parameters
eclass – which exception to apply the resolution for, e.g. ZeroDivisionError
ftor – a function used to resolve this exception. May itself produce exceptions.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
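A sketch of resolve: a ZeroDivisionError raised inside map is handled by a resolver whose UDF has the same signature as the failing operator. The data is illustrative, and the fallback emulates the behavior in plain Python:

```python
# Rows that fail in map with ZeroDivisionError are re-processed by the
# resolver UDF instead of being dropped as exceptions.
rows = [(4, 2), (1, 0), (9, 3)]

try:
    from tuplex import Context
    res = (Context()
           .parallelize(rows)
           .map(lambda a, b: a / b)
           .resolve(ZeroDivisionError, lambda a, b: 0.0)
           .collect())
except ImportError:
    res = []
    for a, b in rows:
        try:
            res.append(a / b)
        except ZeroDivisionError:
            res.append(0.0)
```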
-
selectColumns
(columns)[source]¶ selects a subset of columns as defined through columns, which is a list or a single column
- Parameters
columns – list of strings or integers. A string references a column name, whereas an integer refers to an index. Indices may be negative according to Python rules. The order in the list determines the output order.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
show
(nrows=None)[source]¶ action that generates a physical plan, processes the data and prints the result as a nicely formatted ASCII table to stdout.
- Parameters
nrows (int) – number of rows to collect. If None, all rows will be collected.
-
take
(nrows=5)[source]¶ action that generates a physical plan, processes the data and collects the top nrows results as a list of tuples.
-
tocsv
(path, part_size=0, num_rows=9223372036854775807, num_parts=0, part_name_generator=None, null_value=None, header=True)[source]¶ saves the dataset to one or more CSV files. Triggers execution of the pipeline.
- Parameters
path – path where to save the files
part_size – optional size in bytes which each part should not exceed
num_rows – limit on the number of output rows
num_parts – number of parts to split the output into. The last part will be the smallest.
part_name_generator – optional name generator function for the output parts; receives an integer (the output part number) as its parameter. This is intended as a convenience helper and should not raise any exceptions.
null_value – string to represent null values. None equals the empty string. Explicit quoting must be provided for this argument.
header – bool indicating whether to write a header, or a list of strings to specify explicitly a header to write. The number of names provided must match the column count.
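A hypothetical sketch of writing output parts with custom names; the output directory and the generator function are illustrative assumptions, and the try/except keeps the snippet runnable without a Tuplex installation:

```python
# part_name receives the integer part number and returns a file name.
def part_name(n):
    return 'part_{:04d}.csv'.format(n)  # hypothetical naming scheme

try:
    from tuplex import Context
    ds = Context().parallelize([(1,), (2,), (3,), (4,)])
    # 'output/' is an illustrative target directory.
    ds.tocsv('output/', num_parts=2, part_name_generator=part_name)
except ImportError:
    pass  # Tuplex is not installed in this environment
```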
-
property
types
¶ output schema as a list of type objects of the dataset. If the dataset has an error, None is returned.
- Returns
detected types (general case) of the dataset, typed according to the typing module.
-
unique
()[source]¶ removes duplicates from the dataset (out-of-order). Equivalent to a DISTINCT clause in a SQL statement.
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
-
withColumn
(column, ftor)[source]¶ appends a new column to the dataset by calling ftor over the existing tuples
- Parameters
column – name for the new column/variable. If the column exists, its values will be replaced.
ftor – function to call
- Returns
A Tuplex Dataset object that allows further ETL operations
- Return type
tuplex.dataset.DataSet
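A sketch contrasting mapColumn (rewrites one column in place) with withColumn (appends a column computed from the whole row). Data and column names are illustrative; the fallback reproduces the semantics in plain Python:

```python
# mapColumn transforms one column's value; withColumn sees the whole row.
rows = [(2, 3), (4, 5)]
cols = ['a', 'b']

try:
    from tuplex import Context
    ds = Context().parallelize(rows, columns=cols)
    doubled = ds.mapColumn('a', lambda v: v * 2).collect()
    with_sum = ds.withColumn('sum', lambda r: r['a'] + r['b']).collect()
except ImportError:
    doubled = [(a * 2, b) for a, b in rows]
    with_sum = [(a, b, a + b) for a, b in rows]
```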