Module zillion.datasource

AdHocDataTable

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.AdHocDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

Create an adhoc table from raw data that can be added to a DataSource

Parameters:

  • name - (str) The name of the table
  • data - (iterable, DataFrame, or str) The data to create the adhoc table from
  • table_type - (str) Specify the TableType
  • columns - (dict, optional) Column configuration for the table
  • primary_key - (list of str, optional) A list of fields that make up the primary key of the table
  • parent - (str, optional) A reference to a parent table in the same datasource
  • if_exists - (str, optional) Control behavior when datatables already exist in the database
  • drop_dupes - (bool, optional) Drop duplicate primary key rows when loading the table
  • convert_types - (dict, optional) A mapping of column names to types to convert to when loading the table. The types must be strings representing valid sqlalchemy types. Ex: {"col1": "date", "col2": "integer"}
  • fillna_value - (str or int, optional) Fill null values in primary key columns with this value before writing to a SQL database.
  • schema - (str, optional) The schema in which the table resides
  • kwargs - Keyword arguments passed to pandas.DataFrame.from_records if a DataFrame is created from iterable data
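
For illustration, here is a minimal sketch of building an AdHocDataTable from an iterable of records. The table name, data, and field values are hypothetical.

    from zillion.datasource import AdHocDataTable

    # Hypothetical records; any iterable accepted by
    # pandas.DataFrame.from_records would also work
    rows = [
        dict(partner="acme", sales=100),
        dict(partner="zeta", sales=250),
    ]

    dt = AdHocDataTable(
        "partner_sales",        # name
        rows,                   # data
        "metric",               # table_type
        primary_key=["partner"],
        if_exists="replace",    # overwrite an existing datatable
    )
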
get_dataframe(self)

Get the DataFrame representation of the data

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas
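
A brief usage sketch, reusing the dt table from the construction example above and assuming a local SQLite database as the target. The method and chunksize values shown are simply the documented defaults.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///adhoc.db")  # hypothetical target database

    if not dt.table_exists(engine):
        # method and chunksize are passed through to pandas' to_sql
        dt.to_sql(engine, method="multi", chunksize=1000)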

CSVDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.CSVDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a CSV file using pandas.read_csv
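
A short sketch, assuming a hypothetical local CSV file:

    from zillion.datasource import CSVDataTable

    dt = CSVDataTable(
        "sales",                  # name
        "/path/to/sales.csv",     # data: the CSV file path
        "metric",                 # table_type
        primary_key=["id"],
    )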

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

DataSource

Bases: zillion.field.FieldManagerMixin, tlbx.logging_utils.PrintMixin

class zillion.datasource.DataSource(name, metadata=None, config=None, nlp=False)

A component of a warehouse that houses one or more related tables

Parameters:

  • name - (str) The name of the datasource
  • metadata - (SQLAlchemy metadata, optional) A SQLAlchemy metadata object that may have zillion configuration information defined in the table and column info.zillion attribute
  • config - (dict, str, or buffer, optional) A dict adhering to the DataSourceConfigSchema or a file location to load the config from
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields
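
As a rough sketch, a DataSource might be created from a config dict like the following. The connect string and table options here are illustrative assumptions about the DataSourceConfigSchema format, not a definitive spec.

    from zillion.datasource import DataSource

    config = {
        # A connection string; a DataSourceConnectSchema dict is also
        # supported (see metadata_from_connect below)
        "connect": "sqlite:///warehouse.db",
        # The table config keys below are assumptions for illustration
        "tables": {
            "main.sales": {
                "type": "metric",
                "primary_key": ["id"],
            }
        },
    }

    ds = DataSource("my_ds", config=config)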
add_dimension(self, dimension, force=False)

Add a reference to a dimension to this FieldManager

add_metric(self, metric, force=False)

Add a reference to a metric to this FieldManager

apply_config(self, config, reflect=False, nlp=False)

Apply a datasource config to this datasource's metadata. This will also ensure zillion info is present on the metadata, populate global fields, and rebuild the datasource graph.

Parameters:

  • config - (dict) The datasource config to apply
  • reflect - (bool, optional) If true, use SQLAlchemy to reflect the database. Table-level reflection will also occur if any tables are created from data URLs
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields
directly_has_dimension(self, name)

Check if this FieldManager directly stores this dimension

directly_has_field(self, name)

Check if this FieldManager directly stores this field

directly_has_metric(self, name)

Check if this FieldManager directly stores this metric

find_descendent_tables(self, table)

Find graph descendants of the table

find_neighbor_tables(self, table)

Find tables that can be joined to or are parents of the given table

Parameters:

  • table - (SQLAlchemy Table) The table to find neighbors for

Returns:

(list) - A list of NeighborTables

find_possible_table_sets(self, ds_tables_with_field, field, grain, dimension_grain)

Find table sets that meet the grain

Parameters:

  • ds_tables_with_field - (list of tables) A list of datasource tables that have the target field
  • field - (str) The target field we are trying to cover
  • grain - (iterable) The grain the table set must support
  • dimension_grain - The subset of the grain consisting of requested dimensions

Returns:

(list) - A list of TableSets

from_data_file(file, primary_key, ds_name=None, schema='main', table_name=None, table_type='metric', nlp=False, **kwargs)

Create a DataSource from a data file path/url that represents a single table

Parameters:

  • file - (str) A file path/url pointing to a data file. If remote, the file will be downloaded to the adhoc data directory.
  • primary_key - (list of str) A list of fields representing the primary key of the table
  • ds_name - (str, optional) The name to give the datasource
  • schema - (str, optional) The schema to create the table in
  • table_name - (str, optional) The name to give the table. If blank, it will be derived from the file name.
  • table_type - (str, optional) Specifies the TableType
  • kwargs - (dict, optional) Additional keyword arguments to pass to the AdHocDataTable
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the file data and config
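
A usage sketch with a hypothetical remote CSV (downloaded to the adhoc data directory, per the description above):

    from zillion.datasource import DataSource

    ds = DataSource.from_data_file(
        "https://example.com/sales.csv",  # hypothetical data file URL
        ["partner"],                      # primary_key
        ds_name="sales_ds",
        table_type="metric",
    )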

from_datatables(name, datatables, config=None, nlp=False)

Create a DataSource from a list of datatables

Parameters:

  • name - (str) The name to give the datasource
  • datatables - (list of AdHocDataTables) A list of AdHocDataTables to use to create the DataSource
  • config - (dict, optional) A DataSourceConfigSchema dict config
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the datatables and config
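
A usage sketch combining this with a CSVDataTable; the file path and primary key are hypothetical.

    from zillion.datasource import CSVDataTable, DataSource

    dt = CSVDataTable(
        "sales", "/path/to/sales.csv", "metric", primary_key=["id"]
    )
    ds = DataSource.from_datatables("my_ds", [dt])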

from_db_file(file, name=None, config=None, if_exists='fail', replace_after='1 days', nlp=False)

Create a DataSource from a sqlite db file path or url

Parameters:

  • file - (str) A file path/url pointing to a SQLite database
  • name - (str, optional) The name to give the datasource. If blank, it will be derived from the file name.
  • config - (dict, optional) A DataSourceConfigSchema dict config. Note that the connect param of this config will be overwritten if present!
  • if_exists - (str, optional) Control behavior when the database already exists
  • replace_after - (str, optional) Replace the data file after this interval if if_exists is "replace_after". See url_connect docs for more information.
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the db and config
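
A usage sketch with a hypothetical SQLite file URL, refreshed daily via if_exists="replace_after":

    from zillion.datasource import DataSource

    ds = DataSource.from_db_file(
        "https://example.com/example.db",  # hypothetical db file URL
        name="example_ds",
        if_exists="replace_after",
        replace_after="1 days",
    )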

get_child_field_managers(self)

Get a list of child FieldManagers

get_columns_with_field(self, field_name)

Get a list of column objects that support a field

get_dialect_name(self)

Get the name of the SQLAlchemy metadata dialect

get_dim_tables_with_dim(self, dim_name)

Get a list of dimension table objects with the given dimension

get_dimension(self, obj, adhoc_fms=None)

Get a reference to a dimension on this FieldManager

get_dimension_configs(self, adhoc_fms=None)

Get a dict of all dimension configs supported by this FieldManager

get_dimension_names(self, adhoc_fms=None)

Get a set of dimension names supported by this FieldManager

get_dimensions(self, adhoc_fms=None)

Get a dict of all dimensions supported by this FieldManager

get_direct_dimension_configs(self)

Get a dict of dimension configs directly supported by this FieldManager

get_direct_dimensions(self)

Get dimensions directly stored on this FieldManager

get_direct_fields(self)

Get a dict of all fields directly supported by this FieldManager

get_direct_metric_configs(self)

Get a dict of metric configs directly supported by this FieldManager

get_direct_metrics(self)

Get metrics directly stored on this FieldManager

get_field(self, obj, adhoc_fms=None)

Get a reference to a field on this FieldManager

get_field_instances(self, field, adhoc_fms=None)

Get a dict of FieldManagers (including child and adhoc FMs) that support a field

get_field_managers(self, adhoc_fms=None)

Get a list of all child FieldManagers including adhoc

get_field_names(self, adhoc_fms=None)

Get a set of field names supported by this FieldManager

get_fields(self, adhoc_fms=None)

Get a dict of all fields supported by this FieldManager

get_metric(self, obj, adhoc_fms=None)

Get a reference to a metric on this FieldManager. If the object passed is a dict it is expected to define an AdHocMetric.

get_metric_configs(self, adhoc_fms=None)

Get a dict of all metric configs supported by this FieldManager

get_metric_names(self, adhoc_fms=None)

Get a set of metric names supported by this FieldManager

get_metric_tables_with_metric(self, metric_name)

Get a list of metric table objects with the given metric

get_metrics(self, adhoc_fms=None)

Get a dict of all metrics supported by this FieldManager

get_params(self)

Get a simple dict representation of the datasource params. This is currently not sufficient to completely rebuild the datasource.

get_possible_joins(self, table, grain)

Given a table (usually a metric table), try to find one or more joins to each dimension of the grain. Some of these joins may satisfy other parts of the grain as well, which leaves room for consolidation, but the result may also be independent, non-overlapping joins that together meet the grain.

Parameters:

  • table - (SQLAlchemy Table) Table to analyze for joins to grain
  • grain - (iterable) An iterable of dimension names that the given table must join to

Returns:

(dict) - A mapping of dimension -> dimension joins

get_table(self, fullname, check_active=True)

Get the table object from the datasource's metadata

Parameters:

  • fullname - (str) The full name of the table
  • check_active - (bool, optional) If true, check that the table is active

Returns:

(Table) - The SQLAlchemy table object from the metadata

get_tables_with_field(self, field_name, table_type=None)

Get a list of Tables that have a field

Parameters:

  • field_name - (str) The name of the field to check for
  • table_type - (str, optional) Check only this TableType

Returns:

(list) - A list of Table objects

has_dimension(self, name, adhoc_fms=None)

Check whether a dimension is contained in this FieldManager

has_field(self, name, adhoc_fms=None)

Check whether a field is contained in this FieldManager

has_metric(self, name, adhoc_fms=None)

Check whether a metric is contained in this FieldManager

has_table(self, table, check_active=True)

Check whether the table is in this datasource's metadata

Parameters:

  • table - (SQLAlchemy Table) A SQLAlchemy table
  • check_active - (bool, optional) If true, check that the table is active

Returns:

(bool) - True if the table's fullname is in the metadata.tables map

print_dimensions(self, indent=None)

Print all dimensions in this FieldManager

print_info(self)

Print the structure of the datasource

print_metrics(self, indent=None)

Print all metrics in this FieldManager

ExcelDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.ExcelDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an Excel file using pandas.read_excel

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

GoogleSheetsDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.GoogleSheetsDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a Google Sheet. Parsed as a CSVDataTable.

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

HTMLDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.HTMLDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an html table using pandas.read_html. By default it expects a table in the same format as produced by: df.reset_index().to_html("table.html", index=False)

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

JSONDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.JSONDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a JSON file using pandas.read_json

get_dataframe(self, orient='table')

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

Join

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.Join(join_parts, field_map)

Represents a join (potentially multi-part) that will be part of a query

Parameters:

  • join_parts - (list of JoinParts) A list of JoinParts that will make up a single Join
  • field_map - (dict) The requested fields this join is meant to satisfy
add_field(self, field)

Denote that this field is covered in this join

add_fields(self, fields)

Add multiple fields as covered in this join

add_join_part_tables(self, join_part)

Add tables from join parts to the table list

combine(join1, join2)

Create a new Join object that combines the parts and fields of the given joins

get_covered_fields(self)

Generate a list of all possible fields this can cover

join_fields_for_table(self, table_name)

Get a list of join fields for a particular table in the join

join_parts_for_table(self, table_name)

Get a list of JoinParts that reference a particular table

JoinPart

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.JoinPart(datasource, table_names, join_fields)

A part of a join that defines a join between two particular tables

NeighborTable

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.NeighborTable(table, join_fields)

Represents a neighboring node in the datasource graph

SQLiteDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.SQLiteDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an existing sqlite database on the local filesystem

Note: the "data" param to AdHocDataTable is ignored. This is simply a workaround to get an AdHocDataTable reference for an existing SQLite DB without having to recreate anything from data.

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, **kwargs)

TableSet

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.TableSet(datasource, ds_table, join, grain, target_fields)

A set of tables in a datasource that can meet a grain and provide target fields.

Parameters:

  • datasource - (DataSource) The DataSource containing all tables
  • ds_table - (Table) A table containing a desired metric or dimension
  • join - (Join) A join to related tables that satisfies the grain and provides the target fields
  • grain - (list of str) A list of dimensions that must be supported by the join
  • target_fields - (list of str) A list of fields being targeted
get_covered_fields(self)

Get a list of all covered fields in this table set

get_covered_metrics(self, wh)

Get a list of metrics covered by this table set

Parameters:

  • wh - (Warehouse) The warehouse to use as a reference for metric fields

Returns:

(list of str) - A list of metric names covered in this TableSet

connect_url_to_metadata

zillion.datasource.connect_url_to_metadata(url, ds_name=None)

Create a bound SQLAlchemy MetaData object from a database URL. The ds_name param is used to determine datasource config context for variable substitution.
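
A brief sketch, assuming a hypothetical local SQLite URL:

    from zillion.datasource import connect_url_to_metadata

    metadata = connect_url_to_metadata(
        "sqlite:///warehouse.db", ds_name="my_ds"
    )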

data_url_to_metadata

zillion.datasource.data_url_to_metadata(data_url, ds_name, if_exists='fail', replace_after='1 days')

Create a bound SQLAlchemy MetaData object from a data URL. The ds_name param is used to determine datasource config context for variable substitution.

datatable_from_config

zillion.datasource.datatable_from_config(name, config, schema=None, **kwargs)

Factory to create an AdHocDataTable from a given config. The type of the AdHocDataTable created will be inferred from the config["url"] param.

Parameters:

  • name - (str) The name of the table
  • config - (dict) The configuration of the table
  • schema - (str, optional) The schema in which the table resides
  • kwargs - Passed to init of the particular AdHocDataTable class

Returns:

(AdHocDataTable) - Return the created AdHocDataTable (subclass)
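
A sketch of a possible config; only the "url" key is documented above, and the remaining keys are illustrative assumptions.

    from zillion.datasource import datatable_from_config

    config = {
        # The table type (CSV here) is inferred from this URL
        "url": "https://example.com/sales.csv",
        # Assumed keys for illustration:
        "type": "metric",
        "primary_key": ["id"],
    }

    dt = datatable_from_config("sales", config)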

entity_name_from_file

zillion.datasource.entity_name_from_file(filename)

get_adhoc_datasource_filename

zillion.datasource.get_adhoc_datasource_filename(ds_name)

Get the filename where the adhoc datasource will be located

get_adhoc_datasource_url

zillion.datasource.get_adhoc_datasource_url(ds_name)

Get a connection URL for the datasource

get_ds_config_context

zillion.datasource.get_ds_config_context(name)

Helper to get datasource context from the zillion config

join_from_path

zillion.datasource.join_from_path(ds, path, field_map=None)

Given a path in the datasource graph, get the corresponding Join

Parameters:

  • ds - (DataSource) The datasource for the join
  • path - (list of str) A list of tables that form a join path
  • field_map - (dict, optional) Passed through to Join init

Returns:

(Join) - A Join between all tables in the path

metadata_from_connect

zillion.datasource.metadata_from_connect(connect, ds_name)

Create a bound SQLAlchemy MetaData object from a "connect" param. The connect value may be a connection string or a DataSourceConnectSchema dict. See the DataSourceConnectSchema docs for more details on that format.

parse_replace_after

zillion.datasource.parse_replace_after(replace_after)

Parse a case-insensitive interval string of the format "number interval". The number may be a float, and the interval options are: seconds, minutes, hours, days, weeks.
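
For example, the following strings follow the documented format; the exact return value is not documented here.

    from zillion.datasource import parse_replace_after

    parse_replace_after("30 seconds")
    parse_replace_after("1.5 days")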

populate_url_context

zillion.datasource.populate_url_context(url, ds_name)

Helper to do variable replacement in URLs

reflect_metadata

zillion.datasource.reflect_metadata(metadata, reflect_only=None)

Reflect the metadata object from the connection. If reflect_only is passed, reflect only the tables specified in that list

url_connect

zillion.datasource.url_connect(ds_name, connect_url=None, data_url=None, if_exists='fail', replace_after='1 days')

A URL-based datasource connector. This is meant to be used as the "func" value of a DataSourceConnectSchema. Only one of connect_url or data_url may be specified.

Parameters:

  • ds_name - (str) The name of the datasource to get a connection for
  • connect_url - (str, optional) If a connect_url is passed, it will create a bound MetaData object from that connection string.
  • data_url - (str, optional) If a data_url is passed, it will first download that data (or make sure it is already downloaded) and then create a connection to that data file, which is assumed to be a SQLite database. The name of the database file will be based on the name of the datasource passed in.
  • if_exists - (str, optional) If a data_url is in use, this will control handling of existing data under the same filename. If "fail", an exception will be raised if the file already exists. If "ignore", it will skip downloading the file if it exists. If "replace", it will create or replace the file. If "replace_after" it will check the age of the file and replace it after the age interval provided in the replace_after param.
  • replace_after - (str, optional) If if_exists is "replace_after", use this value to determine the age threshold. The format is "number interval", where the number may be an int or float and the interval options are: seconds, minutes, hours, days, weeks. Note that this only occurs when url_connect is called, which is typically on datasource init; it does not replace itself periodically while the datasource is instantiated.
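
A sketch of how url_connect might appear as the "func" of a connect config. The "func"/"params" dict shape is an assumption about DataSourceConnectSchema, not a confirmed spec.

    # Hypothetical DataSourceConnectSchema-style dict
    connect = {
        "func": "zillion.datasource.url_connect",
        "params": {
            "data_url": "https://example.com/example.db",
            "if_exists": "replace_after",
            "replace_after": "2 hours",
        },
    }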